Saturday, December 05, 2009
The Blade Itself
Persistent Dictionaries (a.k.a. disk backed key-value store)
[for those who absolutely have to know why I needed it to be on disk, it was because my program was a web service that will get loaded in memory to satisfy a request, then dies, and I didn’t want to load the whole set of elements [stop word list for those that love details and yes, what I really really needed was a set data structure not a dictionary] every time. But really, there are many different scenarios where you would need such a thing. Imagine a really huge dictionary, … I mean REALLY huge. Imagine a dictionary that you want to share between different processes.]
I knew vaguely that the python community had several solution, and they do. They mainly have an interface that can do this backed up with various database options, usually, simple, small and embedded DBMSs like dbm or Berkley DB, and the DBMS is responsible for indexing and fast retrieval.
But I needed that in C#, or at least .net.
And I thought that I can do the same, I can use some embedded database (e.g. sqllite, BDB, or Microsoft’s SQL Server compact) and wrap it in a dictionary interface:
Dict. Add(key, value);
Dict.Get(key);
But I still thought that employing a DBMS is too much, like killing a bee with a bazooka, and introduces external dependencies.
So I looked some more …
I searched for things like: “persistent hashtable” , “persistent map” , “persistent data structures” , “file indexing” , etc.
… and I found this: http://www.developer.com/java/other/article.php/600531
A java code sample that uses the filesystem itself as the indexing structure:
So Dict.Add(key, Value) will actually create a file (in a specified folder) with the name=key and the value written inside.
And Dict.Get(key) will simply ask the filesystem to open the file(key) and deserialize its contents.
Quite devious!!! I loved the idea. But it was a bit shabby, and it assumed that the filesystem is an efficient one that handles folders with thousands of files efficiently and guarantee fast lookup, which is probably the case with some filesystems (reiserFS perhaps? EXT3? ZFS?) but not with NTFS, at least I wasn’t confident enough.
What I needed was a disk-backed balanced tree! And surprisingly, when I searched for that I found exactly what I wanted:
http://bplusdotnet.sourceforge.net/
A B-Plus tree implementation that support exactly the interface that I want, well, actually it supports the more elegant interface:
Dict[key] = value;
Value = Dict[key];
Turns out that the “disk-backed” is the right term to describe what I wanted, not “persistent”, for “persistent data structures” has a completely different meaning that has to do with mutability, as opposed to “data persistence” which means disk-backed
Other notable libraries found in the way:
C5 collection library from the University of Copenhagen
http://www.itu.dk/research/c5/
Which has a persistent red-black tree implementation that almost fooled me.
Wintellect’s PowerCollections for .Net
http://powercollections.codeplex.com/
I am a tomb raider ...
Friday, December 04, 2009
العائد
لم يكن يدري على ما أظن أني لم أكن أقصد السلطان محمد الظاهر، فلم يكن خبر وفاته إنتشر بعد حيث ابحرنا. كان هذا قبل أيام قلائل من وصولنا إلى ميناء الزرقاء في عاصمة السلطنة الألفية. ودعت البحارة على عجل وحزمت حقائبي وكتبي وأخذت أول مركبة متجهة ناحية القصر السلطاني. لم أنزل بالقصر ذاته وانما تركت حاجياتي ببيت لي على مقربةٍ من القصر وسرت راجلاً دون أنا أغير ملابسي أو اغسل عني أثار السفر حتى وقفت أمام بوابةٍ صغيرة بالجانب الخلفي من القصر، لم يلبث حارس البوابة أن رأني حتى فتح فاه أثر الدهشة. أمهلته إلى أن انقضت دهشته ثم فتح لي الباب مسرعاً كأنه أفاق من حلم مررت به مسرعاً ورباط على كتفه ولكنه لم ينطق بكلمة واحدة.
مررت من عدة ممرات شجرية حتى وصلت إلى الباب الخارجي لمطبخ القصر. دخلت ومررت بالمطبخ المزدحم بالناس والروائح دون أن يعيرني أحداً أي انتباه. قلما ينتبه الناس إلى دخيل إلا إذا كان يمشي كدخيل، أما أنا فقد كنت أمشي في المطبخ كأحد عماله كما كنت أمشي في الحديقة كى أحد أهم الضيوف. أظن بعضهم كان ليتعرف وجهي إن حاول.
وصلت إلى القاعة المركزية بالمستوى الأرضي ووجدتها خالية على غير العادة. سلكت أحد الدهاليز ومنه إلى دهليز فرعي أخر وطرقت خمس طرقات في توال سريع. فتح لي شيخ قصير إختبأ نصف وجهه خلف لحية بيضاء. وقفت مبتسماً كالابله لا أعرف ماذا أقول. "هأنذا قد جئت" قلت أخيراً بعد فترة صمت لا بأس بها. أخذ الشيخ شهيقا وزفره ثم تحركت يده في لمح البصر لتلطمني على وجهي لطمة لم اتوقعها و لم اتوقع قوتها.
"سحقاً لك، أين كنت كل هذه السنين؟ فيم غيابك وفيم رجوعك؟"
Suffix Tree Clustering
Web Document Clustering: A Feasibility Demonstration , by Zamir and Etzioni. PDF here
The paper presents an attractive technique for clustering documents using Suffix Trees, the merits of which can be summarized as:
- The algorithm is incremental, meaning that documents can be added at any time and clusters can be re-evaluated accordingly
- Does this in linear complexity O(n)
- Deals with documents as strings, not as Bags of Words, which allows for accounting for context
- The algorithm actually produces sequences of words that represent each cluster (with no extra cost)
- Can produce clusters that are not mutually exclusive (fuzzy clusters)
- and finally the authors claim it can work efficiently on search engine snippets
Suffix Trees are data structures that are very suitable for indexing text documents (for search purposes), and I think it very devious to exploit the same structure for clustering.
Two Technical Lectures
I try to watch a technical lecture or two every weekend, this week I actually stumbled upon two good lectures.
http://videolectures.net/is04_
About succinct data structure, … how to index stuff without using much memory, … in other words, … compressing trees.
http://videolectures.net/
Spatial data structures for fast and exact nearest neighbor queries, in high dimensional spaces. I was originally interested in special data structures men ayam el physics engine wel collision detection in games, but they can also be used to speed-up some data clustering algorithms.
Dangerous Beans
“I thought, … there would be an island”, said Dangerous Beans (a rat) in a crestfallen and resigned tone, and I started to cry, weep and howl openly. I was driving to the CMIC Maadi office. Listening to an audio-book, written by Terry Pratchett. In less than ten minutes the book (The Amazing Maurice and His Educated Rodents) made me laugh again, an effect only Sir. Charles Spencer Chaplin achieved before. And in my opinion, one of the marks of true craft and art.
“I thought there would be an island” reverberated in my heart as the stood for e very forlorn dream, every dying hope, and every fallen idealist. I wept, not only for Dangerous Beans, I wept for the whole world (and obviously myself before all).
Steven Spielberg usually gets such an effect with his climax scenes at or near the end of the movie, (e.g. Schindler’s List, E.T., and Empire of the Sun) but there is as well another way to touch the audience, another form of high art, the kind that makes you on the brink of tears from the beginning to the end of the work, without actually having to have a climax, or needing to make you cry. Examples of this latter are “The Way we Were”, “Slumdog Millionair” and the book “Doublestar” for Robert. A. Heinlein. In “Slumdog Millionair” I literally had tears in my eyes for the whole movie. And in “The Way we Were” I felt a state of elation/heartache the whole length of the 2 hours+ movie. The state was not bound to a certain scene, but to the theme of the movie. Robert Heinlein is a totally different story, .. better saved for later.