
Diskhash – Disk-based, persistent hash tables - tekacs
https://metarabbit.wordpress.com/2017/06/30/ann-diskhash-disk-based-persistent-hash-tables/
======
bane
One of the nice things about using mmap'd data structures is that you don't
have to slurp the entire thing into memory to work on it. The OS's virtual
memory system will only load a few pages at a time for whatever's needed. This
means that in practical applications you're likely to be within an order of
magnitude of RAM speed if you're using an SSD.

Secondly, the other nice bit is that you can use structures larger than the
physical RAM on the system. It's kind of like getting a system with enormous
amounts of slow-ish RAM.

Thirdly, you can build a whole bunch of these for different purposes and just
open the ones you need, and not bother with the rest, again like having even
more RAM.

Finally, you can multi-process trivially on these, as the OS sorts out the mess,
and suddenly it's like having a slightly slow, multi-million-dollar, multi-TB
shared-memory supercomputer from 5-8 years ago at your disposal. Works great
on Beowulf clusters if your disk storage is shared on fast links and can
handle the IOPS.
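
The demand-paging point above can be sketched in a few lines of Python (filename and sizes are made up for illustration): the OS only faults in the pages actually touched, so reading or writing one record never loads the whole file.

```python
import mmap
import struct

# Hypothetical illustration: a file of fixed-size records accessed via
# mmap, so only the pages actually touched are ever read into memory.
RECORD = struct.Struct("<q")      # one 8-byte little-endian integer per record
N = 1_000_000

with open("records.bin", "wb") as f:
    f.truncate(N * RECORD.size)   # sparse file: nothing written yet

with open("records.bin", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)                    # map the whole file lazily
    RECORD.pack_into(mm, 42 * RECORD.size, 12345)    # touches exactly one page
    (value,) = RECORD.unpack_from(mm, 42 * RECORD.size)
    mm.close()

print(value)
```

Multiple processes mapping the same file see the same pages, which is what makes the "shared memory supercomputer" trick work.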

The number of applications you can tackle with a 5-8 year old supercomputer is
vast. I've seen 20 TB classifiers built on more or less commodity hardware
(<$100k for a cluster of cheap hardware and one big disk) that were previously
thought to require a $5 million supercomputer just because of the large
shared memory. You can probably build an equivalent machine for <$10k these
days off of Newegg and some decent NAS boxes.

~~~
wfunction
> One of the nice things about using mmap'd data structures is that you don't
> have to slurp the entire thing into memory to work on it.

Right, but one of the not-so-nice things is that you can't do the I/O
asynchronously, so you can end up with poor performance depending on the
access pattern. (I guess you _can_, with another thread running in the
background touching pages, but it's more of a pain and you're not really
guaranteed the data will stay in memory until you use it.) [Edit: Actually I
guess if you touch by _writing_ to those pages rather than just reading from
them, then they'll have to stay in memory... though do note that I'm assuming
no swap here.]
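
For what it's worth, on Linux you can get much of the "background thread touching pages" effect without a thread, by hinting the kernel with `madvise(MADV_WILLNEED)`. A minimal sketch (Python 3.8+; the `hasattr` guard is there because the constant only exists on platforms that support it):

```python
import mmap

# Hypothetical file just for the demo.
with open("data.bin", "wb") as f:
    f.truncate(1 << 20)

with open("data.bin", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    # Ask the kernel to start reading these pages in asynchronously.
    # This is a hint, not a guarantee that they stay resident.
    if hasattr(mmap, "MADV_WILLNEED"):
        mm.madvise(mmap.MADV_WILLNEED)
    data = mm[:4096]   # likely already resident by the time we read it
    mm.close()
```

It only addresses the readahead half of the complaint; it does nothing to pin pages once they are in.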

~~~
luispedrocoelho
At some cost to portability, IIRC, on Linux you can ask the operating system
to keep pages in memory. Other systems will probably have similar
functionality.
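
Presumably this refers to `mlock(2)`, which pins pages so the kernel cannot evict them. A rough ctypes sketch (Linux-flavored assumption; the call is subject to the `RLIMIT_MEMLOCK` limit, so only one page is locked here):

```python
import ctypes
import ctypes.util
import mmap

# Illustrative only: pin one anonymous page with mlock(2) via ctypes.
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

mm = mmap.mmap(-1, 4096)                       # one anonymous writable page
addr = ctypes.addressof(ctypes.c_char.from_buffer(mm))

rc = libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(4096))   # 0 on success
libc.munlock(ctypes.c_void_p(addr), ctypes.c_size_t(4096))      # release the pin
```

In C you would call `mlock()`/`mlockall()` directly; POSIX systems generally offer the same calls, which is the portability caveat mentioned above.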

~~~
otterley
They can remain paged in, but unlike on FreeBSD, you can't control when dirty
pages are flushed to the backing filesystem on Linux. Specifically, the
MAP_NOSYNC option doesn't exist on Linux (see
[https://www.freebsd.org/cgi/man.cgi?sektion=2&query=mmap](https://www.freebsd.org/cgi/man.cgi?sektion=2&query=mmap)
for a description).
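
Right: on Linux you can only force writeback explicitly with `msync(2)`, not suppress the kernel's own periodic flushing the way FreeBSD's `MAP_NOSYNC` does. A small sketch of the explicit-flush side (filename hypothetical):

```python
import mmap

with open("table.bin", "wb") as f:
    f.truncate(4096)

with open("table.bin", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    mm[:5] = b"hello"   # dirties the page
    mm.flush()          # msync(2): push the dirty page to the file now
    mm.close()

with open("table.bin", "rb") as f:
    readback = f.read(5)
```

So you can make durability happen sooner, but not later, than the kernel would choose.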

------
otterley
How does it compare to mdbm and lmdb?

mdbm: [https://github.com/yahoo/mdbm](https://github.com/yahoo/mdbm)

lmdb: [https://symas.com/lightning-memory-mapped-database/](https://symas.com/lightning-memory-mapped-database/)

~~~
fanf2
It sounds like djb's cdb to me
[http://cr.yp.to/cdb.html](http://cr.yp.to/cdb.html)

~~~
otterley
cdb isn't memory mapped, and you can't update individual keys.

------
gravypod
The project's readme on PyPI reads _"The wrappers follow similar APIs with
variations to accomodate the language specificity."_

Beautiful. There have been so many times in my career that I've had to use
poorly cobbled-together wrappers for other languages that I now really
appreciate it when someone takes the time to make their API "fit" the
language.

------
SwellJoe
While this is pretty neat, it reminds me a _lot_ of Perl tie, which is kinda
considered an anti-pattern in modern Perl (though there are places where it
still makes sense and is still used). Any data structure in Perl can be
connected to a wide variety of backing stores using tie:
[http://perldoc.perl.org/perltie.html](http://perldoc.perl.org/perltie.html)

It was introduced in Perl 5.0, so it's been around for about a quarter
century, and it's generic in the sense that there are tons of backends that
use the tie interface. BerkeleyDB and gdbm and flat files and probably any DBD
SQL database are among the backing store options. But, again, a lot of people
think it should be avoided in the general case, even though it _seems_ like a
really nice simple solution to a common problem (I've got these
hashes/arrays/whatever, and I want to save them on disk so they're persistent
and shareable).
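
(For comparison, Python ships an analogous convenience in the standard library: `shelve` gives you a dict-like object transparently backed by a dbm file, much like tying a Perl hash to a backing store. A quick sketch, with a made-up filename:)

```python
import shelve

# A dict-like object persisted to disk; values are pickled transparently.
with shelve.open("cache_demo") as db:
    db["answer"] = 42
    db["nested"] = {"a": [1, 2, 3]}

# Reopen later (or from another process, carefully) and the data is there.
with shelve.open("cache_demo") as db:
    answer = db["answer"]

print(answer)  # → 42
```
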

This implementation can't delete keys (yet?), which seems to limit its utility
for general-purpose tasks. I also wonder about reliability. When I first
started reading, I assumed it'd be used for small things, but he's working with
a 32GB file shared across multiple processes. That's a lot of opportunity for
data corruption.

~~~
mycelium
You insinuate without providing the reasoning behind it. I say this because
diskhash and tie look like extremely useful things when used carefully. This
sort of thing solves exactly a problem I have right now.

> Perl tie, which is kinda considered an anti-pattern in modern Perl ... a lot
> of people think it should be avoided in the general case, even though it
> seems like a really nice simple solution to a common problem

Why? Any links to blog posts?

> When I first started reading I assumed it'd be used for small things, but
> he's working with a 32GB file shared across multiple processes. That's a lot
> of opportunity for data corruption.

I've used something similar and data corruption because of size is low on my
list of concerns. If your storage layer fails in this manner silently, you
have much bigger problems.

~~~
SwellJoe
It's not an opinion I hold strongly, it's just something I've noticed. I've
seen more than one talk where someone apologized for implementing something
with 'tie'.

Here's one blog post about it (still opinion):

[http://www.skrenta.com/2007/05/tie_considered_harmful.html](http://www.skrenta.com/2007/05/tie_considered_harmful.html)

And, this article from the Modern Perl site itself has mixed opinions on the
subject: it comes down on the side of not using it in the general case, while
acknowledging that knowing how and when to use it can be really valuable (tie
is the last subject covered on the page):

[http://modernperlbooks.com/books/modern_perl_2014/11-what-to-avoid-in-perl.html](http://modernperlbooks.com/books/modern_perl_2014/11-what-to-avoid-in-perl.html)

I'm not trying to say there's no place for it. Tie in Perl has been used for
really cool stuff.

~~~
mycelium
Appreciate the response and reasoning!

------
raarts
Interesting. I used a disk-based hash table very similar to this one in DOS
point-of-sale applications in 1989. The goal was to have consistent lookup
times when scanning barcode labels. Of course, my average lookup times were
much slower, due to slow seek times.

------
rkv
Nice and simple. I tend to use a wrapper around BSD tree.h and SQLite for
persistence. Longer insert/update times but quick access and you get the SQL
language and tools for free.

~~~
fnl
Quick access? Tree lookup is O(log n) on an rb-tree; hash tables are O(1). And
this Diskhash thing claims to scale to billions of entries. So I'm not sure
the two options are comparable.
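
The gap is easy to make concrete: a balanced-tree lookup pays roughly log2(n) comparisons per probe, while a hash lookup is a constant number of probes. A small illustrative script counting the comparisons of a binary search (a stand-in for the per-lookup cost of a balanced tree):

```python
n = 1_000_000
keys = list(range(n))
index = {k: k for k in keys}      # hash table: expected O(1) probes

def binary_search_steps(sorted_keys, target):
    """Binary search that counts comparisons -- the O(log n) cost a
    balanced tree pays on every lookup."""
    lo, hi, steps = 0, len(sorted_keys), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return steps

steps = binary_search_steps(keys, 999_999)   # ~20 comparisons for n = 10**6
value = index[999_999]                       # one hash lookup
```

On disk the difference matters more: each tree level can be a separate page fault, while a hash probe touches one or a few buckets.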

~~~
fleitz
Claims is the operative word. Features generally aren't free. If mmap'ing
files was some sort of performance godsend don't you think most DB
manufacturers would just rip out all their custom caching code and replace it
with mmap?

~~~
fnl
Absolutely correct! I guess a comparison with various levels of "rows" and
concurrent access might be needed before judging any way...

------
rurban
I usually use gperf created tables or cdb tables for that purpose.

~~~
fnl
Neither of those seem to easily scale to billions of keys, though.

~~~
rurban
For gperf I had to add >int support by myself.
[https://github.com/rurban/gperf/commits/long_keys](https://github.com/rurban/gperf/commits/long_keys)

cdb is limited to 2GB, yes.

~~~
fnl
Thanks for the work on and link to gperf! Looks interesting, to say the least.

------
aboodman
This is a similar idea to what nbs uses internally:

[https://github.com/attic-labs/noms/tree/master/go/nbs](https://github.com/attic-labs/noms/tree/master/go/nbs)

... but the idea is developed to support fast mutations as well as other
related datastructures (sets, lists, etc).

------
dward
What happens if the process can't shut down cleanly and writes are in flight?
Is there a way to check for hash table corruption? Is there a plan to support
deletion/compaction/defrag? Why would I use this over something like LevelDB?

~~~
jitl
I think the use cases are different here. This library is set up for write-
once, then read-only access later. LevelDB is all about mutability. I think
the use case here is constant-time lookups on fixed-size data sets. LevelDB is
a tree-based map, whereas this is a hashmap, which may give better performance
for the author's purposes.

From the second paragraph:

 _My usage is mostly to build the hashtable once and then reuse it many times.
Several design choices reflect this bias and so does performance. Building the
hash table can take a while. A big (roughly 1 billion entries) table took
almost 1 hour to build. This compares to about 10 minutes for building a
Python hashtable of the same size._
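
The general build-once/read-many technique can be sketched in a few dozen lines. This is NOT diskhash's actual file format, just a toy with the same shape: fixed-size slots in an mmap'd file, linear probing, no deletion.

```python
import mmap
import struct
import zlib

# Toy on-disk hash table: 16-byte keys, 8-byte int values, fixed slots.
SLOT = struct.Struct("<16sq")
NSLOTS = 1024
EMPTY = b"\x00" * 16

def slot_of(key: bytes) -> int:
    return zlib.crc32(key) % NSLOTS

def build(path, items):
    """Write all entries once; no resizing, no deletes, no full-table guard."""
    with open(path, "wb") as f:
        f.truncate(NSLOTS * SLOT.size)
    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)
        for key, value in items:
            key = key.ljust(16, b"\x00")
            i = slot_of(key)
            while mm[i * SLOT.size : i * SLOT.size + 16] != EMPTY:
                i = (i + 1) % NSLOTS          # linear probing on collision
            SLOT.pack_into(mm, i * SLOT.size, key, value)
        mm.close()

def lookup(path, key):
    """Constant-time expected lookup straight off the mapped file."""
    key = key.ljust(16, b"\x00")
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        i = slot_of(key)
        while True:
            k, v = SLOT.unpack_from(mm, i * SLOT.size)
            if k == key:
                return v
            if k == EMPTY:
                return None
            i = (i + 1) % NSLOTS

build("toy.dht", [(b"foo", 1), (b"bar", 2)])
found = lookup("toy.dht", b"foo")
missing = lookup("toy.dht", b"baz")
```

Fixed-size slots are what make the O(1) offset arithmetic possible, and are also why the real library is biased toward build-once workloads: growing or deleting means rewriting.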

------
GhotiFish
This is such a clean concept.

If you have a tiny application, doing a tiny thing, do you need redis? Do you
need sqlite? Do you need to write some custom-text-file-ma-bob? Nah. I feel
like simple little tools like this are in short supply.

~~~
luispedrocoelho
OP here.

Use redis if you can. Use something like diskhash if you cannot afford the
extra overhead of sqlite/redis.

In my case, the alternative we were using before was an in-memory hash table
built at startup from a text-based representation of the data. It was pretty
good and worked very well up to millions of entries, but at the current scale
we work with, it took at least 10 minutes at startup to build up the thing,
and it used ~200GB.

------
ensiferum
Without going into the code itself, I'm always wary of these hobby projects
that then get used at work by the same author. Who now really owns the IP?

~~~
luispedrocoelho
Look at the source code license, it's MIT licensed.

We are a publicly funded research organization, so even our "work code" is
open source.

------
nkristoffersen
Ok, I know this doesn't help the conversation, but I kept reading "Dickhash -
Dick-based, persisted hash tables".

I was so confused.

------
ape4
sqlite gives you a bit more.

~~~
luispedrocoelho
Yes, it does give you a lot more and is great.

Unfortunately, in my initial tests, it just was not fast enough.

