Saturday, December 05, 2009

Persistent Dictionaries (a.k.a. disk backed key-value store)

I recently faced the situation where I needed to employ an on-disk dictionary that supports O(n) or O(log n) lookup complexity. The need for a dictionary (aka map, associated array, etc.) is clear enough and very common, I can hardly remember any piece of software I wrote that was completely free of one, but the requirement of this data structure to be on disk is the less frequent one. So I set out to look for solutions.

[for those who absolutely have to know why I needed it to be on disk, it was because my program was a web service that will get loaded in memory to satisfy a request, then dies, and I didn’t want to load the whole set of elements [stop word list for those that love details and yes, what I really really needed was a set data structure not a dictionary] every time. But really, there are many different scenarios where you would need such a thing. Imagine a really huge dictionary, … I mean REALLY huge. Imagine a dictionary that you want to share between different processes.]

I knew vaguely that the python community had several solution, and they do. They mainly have an interface that can do this backed up with various database options, usually, simple, small and embedded DBMSs like dbm or Berkley DB, and the DBMS is responsible for indexing and fast retrieval.

But I needed that in C#, or at least .net.

And I thought that I can do the same, I can use some embedded database (e.g. sqllite, BDB, or Microsoft’s SQL Server compact) and wrap it in a dictionary interface:

Dict. Add(key, value);

But I still thought that employing a DBMS is too much, like killing a bee with a bazooka, and introduces external dependencies.

So I looked some more …

I searched for things like: “persistent hashtable” , “persistent map” , “persistent data structures” , “file indexing” , etc.

… and I found this:

A java code sample that uses the filesystem itself as the indexing structure:

So Dict.Add(key, Value) will actually create a file (in a specified folder) with the name=key and the value written inside.
And Dict.Get(key) will simply ask the filesystem to open the file(key) and deserialize its contents.

Quite devious!!! I loved the idea. But it was a bit shabby, and it assumed that the filesystem is an efficient one that handles folders with thousands of files efficiently and guarantee fast lookup, which is probably the case with some filesystems (reiserFS perhaps? EXT3? ZFS?) but not with NTFS, at least I wasn’t confident enough.

What I needed was a disk-backed balanced tree! And surprisingly, when I searched for that I found exactly what I wanted:
A B-Plus tree implementation that support exactly the interface that I want, well, actually it supports the more elegant interface:

Dict[key] = value;
Value = Dict[key];

Turns out that the “disk-backed” is the right term to describe what I wanted, not “persistent”, for “persistent data structures” has a completely different meaning that has to do with mutability, as opposed to “data persistence” which means disk-backed 

Other notable libraries found in the way:

C5 collection library from the University of Copenhagen
Which has a persistent red-black tree implementation that almost fooled me.

Wintellect’s PowerCollections for .Net


Anonymous said...

Nice brief and this mail helped me alot in my college assignement. Say thank you you seeking your information.

Anonymous said...

If you are using .NET on Windows then the ESENT PersistentDictionary class does what you want -- it implements IDictionary so you can use it in most places that you use a .NET dictionary.

Anonymous said...

Nice fill someone in on and this fill someone in on helped me alot in my college assignement. Say thank you you seeking your information.

Anonymous said...

Well I agree but I dream the list inform should have more info then it has.

Anonymous said...

Easily I assent to but I contemplate the collection should prepare more info then it has.

Anonymous said...

Approvingly your article helped me truly much in my college assignment. Hats incorrect to you enter, intention look progressive in the direction of more interrelated articles promptly as its sole of my choice subject-matter to read.