I recently faced the situation where I needed to employ an on-disk dictionary that supports O(n) or O(log n) lookup complexity. The need for a dictionary (aka map, associated array, etc.) is clear enough and very common, I can hardly remember any piece of software I wrote that was completely free of one, but the requirement of this data structure to be on disk is the less frequent one. So I set out to look for solutions.
[for those who absolutely have to know why I needed it to be on disk, it was because my program was a web service that will get loaded in memory to satisfy a request, then dies, and I didn’t want to load the whole set of elements [stop word list for those that love details and yes, what I really really needed was a set data structure not a dictionary] every time. But really, there are many different scenarios where you would need such a thing. Imagine a really huge dictionary, … I mean REALLY huge. Imagine a dictionary that you want to share between different processes.]
I knew vaguely that the python community had several solution, and they do. They mainly have an interface that can do this backed up with various database options, usually, simple, small and embedded DBMSs like dbm or Berkley DB, and the DBMS is responsible for indexing and fast retrieval.
But I needed that in C#, or at least .net.
And I thought that I can do the same, I can use some embedded database (e.g. sqllite, BDB, or Microsoft’s SQL Server compact) and wrap it in a dictionary interface:
Dict. Add(key, value);
But I still thought that employing a DBMS is too much, like killing a bee with a bazooka, and introduces external dependencies.
So I looked some more …
I searched for things like: “persistent hashtable” , “persistent map” , “persistent data structures” , “file indexing” , etc.
… and I found this: http://www.developer.com/java/other/article.php/600531
A java code sample that uses the filesystem itself as the indexing structure:
So Dict.Add(key, Value) will actually create a file (in a specified folder) with the name=key and the value written inside.
And Dict.Get(key) will simply ask the filesystem to open the file(key) and deserialize its contents.
Quite devious!!! I loved the idea. But it was a bit shabby, and it assumed that the filesystem is an efficient one that handles folders with thousands of files efficiently and guarantee fast lookup, which is probably the case with some filesystems (reiserFS perhaps? EXT3? ZFS?) but not with NTFS, at least I wasn’t confident enough.
What I needed was a disk-backed balanced tree! And surprisingly, when I searched for that I found exactly what I wanted:
A B-Plus tree implementation that support exactly the interface that I want, well, actually it supports the more elegant interface:
Dict[key] = value;
Value = Dict[key];
Turns out that the “disk-backed” is the right term to describe what I wanted, not “persistent”, for “persistent data structures” has a completely different meaning that has to do with mutability, as opposed to “data persistence” which means disk-backed
Other notable libraries found in the way:
C5 collection library from the University of Copenhagen
Which has a persistent red-black tree implementation that almost fooled me.
Wintellect’s PowerCollections for .Net