

SharedHashFile - s1m0n
https://github.com/simonhf/sharedhashfile
Share Hash Tables Stored In Memory Mapped Files Between Arbitrary Processes & Threads
======
checker659
If this is your project, good job dude. Reminds me of LMDB (symas.com/mdb/).

~~~
hyc_symas
Aside from the fact that both use mmap, there's little resemblance. LMDB does
zero-copy reads and writes, this does not. LMDB is a persistent data store,
this is not. B+trees are inherently very memory efficient, hash tables are
not.

This is limited to the size of RAM, LMDB is not - LMDB is only limited to the
size of the processor address space. This uses locks for reads, LMDB does not
- LMDB reads will scale to an arbitrary number of CPUs, perfectly linearly,
this will not.

Hash tables are great for small data sets, horrible for larger data.

~~~
s1m0n
Here's a link [1] which tests LMDB on a Rackspace server with 16 vCPUs. To
make the test fairer to SharedHashFile, the LMDB data file is stored in
/dev/shm so that the disk does not get in the way of the test. The test first
inserts 70 million keys (I tried but failed to insert 100 million keys; how to
do that?) using 16 processes (one for each CPU), then it reads the 70 million
keys again using the 16 processes, then it updates 2% of the keys while
reading the other 98% of the keys again, again using the 16 processes.

Read performance without any writing does seem excellent at 6.x million reads
per second across the 16 processes. However, insert speed is very slow at only
0.1 million inserts per second. But then LMDB does not claim to be fast at
writing. Unfortunately the mixed 2% update, 98% read workload brings the read
performance down from 6.x million ops per second to only 0.7 million ops per
second. So LMDB seems like an excellent solution if one hardly ever wants to
insert.

I would also be very happy if anybody can find ways to optimize the test since
I could not find a tutorial on programming with the LMDB API. For example, is
it necessary to always use a txn when putting and getting? I also couldn't
figure out how to insert 100 million keys: in order to make the map size big
enough, mdb_env_set_mapsize() always complained when given super large
values. Is this a limitation of LMDB, or how else can the map size be
increased so that 100 million keys (or more) can be mapped? And as a side
question: inserting the keys is so slow; is there a faster way to initially
insert all the keys in order to speed up the performance test?

I was also surprised at how big the LMDB data.mdb file gets with 70 million
keys & values. The keys & values are 4 bytes + 4 bytes, so 8 bytes each.
However, the data.mdb file ended up as 1.8GB, which works out to about 27.6
bytes per key/value pair... which does not seem that good compared to a hash
table, no?

[1]
https://gist.github.com/simonhf/9677776

~~~
hyc_symas
re: your env size problem, sounds like you've built a 32-bit binary, so you
can't use more than 2GB.

Also, please read the LMDB docs. Note that it is a single-writer design, so
spreading writes across 16 processes will be quite poor.

------
iwasphone
Ambitious undertaking, possibly transformational.

