
Writing a Very Fast Hash Table with Tiny Memory Footprints - ingve
http://www.idryman.org/blog/2017/05/03/writing-a-damn-fast-hash-table-with-tiny-memory-footprints/
======
pubby
Not to disgrace the author (it's a good article), but I've spent a lot of time
studying hashtable performance recently and I groan whenever I see benchmarks
against libraries like std::unordered_map or libcuckoo. No shit your hash
table performs better; it's like claiming you're good at running because you
can beat a crippled child in a race.

For a test to be meaningful it needs to include comparisons to linear probing
(regular, robin-hood, and SSE variants) and to other low-cost hash tables like
coalesced hashing or separate chaining with a memory pool.

Speaking of robin hood hashing, pretty much all of the existing articles about
it are missing some crucial facts about it. Don't take what blog authors say
for granted.

~~~
johnfn
> pretty much all of the existing articles about it are missing some crucial
> facts about it.

Would you mind expanding on those facts?

~~~
pubby
1) The linearly-probed robin hood variant is just a sorted array. That's it.
It's just a sorted array. I haven't seen an article explain it so succinctly
though.

2) The linearly-probed robin hood variant is only faster than regular linear
probing for searches that fail. Successful searches have the same average
probe count. Too many articles bullshit about how robin hood reduces variance
(which is irrelevant 99% of the time) and then go on to ignore the fact that
they cannot beat the performance of linear probing when it comes to successful
searches. One guy I saw spent hours trying to optimize his hash table for the
K-nucleotide benchmark without realizing that a trivial 5-minute
implementation of linear probing would beat it because searches in that
benchmark never fail.

3) There's a simple 40-year-old bitwise trick by Knuth which lets you combine
the probe distance and the hash into a single value. This saves up to 4 bytes
per table entry, but I haven't seen anyone use it.
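I'm not sure which exact formulation pubby has in mind, but one common way to avoid a separate per-entry distance field is to recompute the probe distance from the hash already stored in the slot. A minimal sketch, assuming a power-of-two table; the helper name `probe_distance` is made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// For a linearly-probed table, the probe distance of an entry need not be
// stored at all: it can be recomputed from the hash kept in the slot and the
// slot's index. With a power-of-two capacity, the home bucket is
// hash & (capacity - 1), and the distance is the wrap-around offset.
inline std::size_t probe_distance(std::uint64_t stored_hash,
                                  std::size_t slot,
                                  std::size_t capacity) {
    std::size_t home = stored_hash & (capacity - 1);   // home bucket
    return (slot + capacity - home) & (capacity - 1);  // wrap-around distance
}
```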

There's also a few articles that get deletion straight-up wrong, but meh. That
doesn't bother me too much.

Also, for double hashing I'm pretty sure a SSE/AVX variant is optimal. I
haven't tried implementing one though.

~~~
dryman
1) I don't see why linearly-probed robin hood is a sorted array. Can you
explain more?

Robin hood hashing doesn't limit which probing scheme you use. I ended up with
quadratic probing, which gives me both good cache locality and good probe
distributions. The probing schemes I tried were omitted from this post because
they would add too much noise, but I can give you a quick summary here:

1\. Linear probing: the probe distribution has a high median and high
variance. Performance is not that great either because of the high probe
counts.

2\. Quadratic probing: the probe distribution is not the best, but the median
sticks to 2 probes. Since the first two probes are very close together in
quadratic probing, its overall performance wins.

3\. When probing, hash the key with the probe number added to the seed. This
gives a very good hash distribution, but hashing on every probe is very slow.
Also, you cannot do deletion with this scheme.

4\. rotate(key, probe). This is somewhat like rehashing, but way faster. The
probe distribution is also very good, but items end up too far away, so we
lose cache locality.

5\. Combinations of the schemes listed above. Still, quadratic probing gives
me the best performance.
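For reference, one standard quadratic scheme uses triangular-number increments; with a power-of-two table size the sequence is guaranteed to visit every slot, and the first few probes stay close to the home bucket. A sketch (the function name is illustrative, not from the post):

```cpp
#include <cassert>
#include <cstddef>

// Quadratic probing with triangular increments 0, 1, 3, 6, 10, ...
// For a power-of-two table size, home + i*(i+1)/2 (mod 2^k) visits every
// slot exactly once, so the probe sequence never cycles early, and the
// first two probes are adjacent, which helps cache locality.
inline std::size_t quad_probe(std::size_t home, std::size_t i,
                              std::size_t mask) {
    return (home + (i * (i + 1)) / 2) & mask;
}
```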

I also tried using the gcc/clang vector extensions to speed up probing, but it
actually slowed things down by 30%! I guess I have to hand-tune the SSE
intrinsics and measure carefully with IACA to get optimal performance.

Deletion itself is quite complicated and deserves its own post.

~~~
valarauca1
It's a sorted array because you're sorting by 2 keys (A, B). A is the hash,
normally a u32 or u64. B is the probe count, also normally the same size
because system ints are easy.

Strictly speaking the Robin Hood "sort" isn't purely lexicographically
ordered. A larger A value may be placed before a smaller A that has a much
larger B.

But this relation is nonetheless just a weird way to build a cmp function.
One that arguably doesn't work in a strict sense, but one that _nearly_
works.

~~~
dastbe
So if A is the hashed-to index into your table, then B is a function of A.
When you are doing your linear probe and you see a value along with its probe
count, it either

1\. Has a higher probe count than yours. This means its hash is farther away
than yours => less than yours.

2\. Has a lower probe count than yours. This means its hash is closer than
yours => greater than yours.

3\. Is equal in probe count to yours. That means it's in the same bucket =>
equal to yours.
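The three cases above can be written down as a tiny comparator. A hypothetical sketch; the enum and function names are made up for illustration:

```cpp
#include <cassert>

// Compare the occupant of a slot to the key being probed for, using only
// the probe counts, per the three cases above.
enum class Order { Less, Equal, Greater };

inline Order compare_occupant(int occupant_probe, int my_probe) {
    if (occupant_probe > my_probe)
        return Order::Less;     // case 1: its home bucket is farther back
    if (occupant_probe < my_probe)
        return Order::Greater;  // case 2: its home bucket is closer
    return Order::Equal;        // case 3: same home bucket
}
```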

------
dgacmu
One nit, as one of the authors of libcuckoo: of the tested tables, libcuckoo
is the only one that's a concurrent hash table. I'm glad that, without
modification, our single-threaded performance still does well, but this isn't
the primary use case of libcuckoo.

If you want to compare it against a thread-unsafe table, it's a bit more
honest to use the locked_table class, which takes ownership of all the locks
in the table in advance. It's a fair bit faster.

It doesn't change the memory use - I've created a stripped-down version in the
past that elides all the locks, which is better for memory efficiency - but
that's not really our goal.

For more on using the locked_table interface, see:
[https://github.com/efficient/libcuckoo/blob/master/tests/uni...](https://github.com/efficient/libcuckoo/blob/master/tests/unit-tests/test_locked_table.cc)

(Also, for hashing strings, please make sure you've got libcuckoo configured
to use cityhash for string hashing, as you did for "opic", to make the
comparison equivalent.)

In fairness to the OP, however, it's worth noting that this is a pretty new
addition to libcuckoo (February), and we haven't documented or pushed it hard.

~~~
dryman
Thanks for pointing out the thread-unsafe alternative. I'll update the
benchmark with it and double-check that I configured it with cityhash. (Will
update in a day or two.)

Speaking of cityhash, have you tried farmhash as well? I'm not sure what I did
wrong, but farmhash's performance wasn't as good as cityhash's in my hash table.
Did you experience the same problem?

~~~
dgacmu
Thanks! Appreciate the responsiveness. :)

We haven't - I had some other issues with Farm. I tried for a while to see if
I could get FarmHash included as the standard hash for TensorFlow instead of
the murmur-like homebrewed one it uses, but the cross-platform build issues
became too complicated to justify it. We ended up sticking with the murmur-
like one for "fast but unsafe", and providing SIP via the highwayhash package
as the "strong hash".

------
lorenzhs
If you're interested in _truly_ space-efficient hash tables—as in "at every
point in time, at least 97% of the space of the table is used for
elements"—check out this paper by a colleague of mine:
[https://arxiv.org/pdf/1705.00997.pdf](https://arxiv.org/pdf/1705.00997.pdf)
(the relevant plots are on pages 10 and 11). He's going to publish the code
soon-ish as well.

~~~
dryman
In my experiments, using linear probing in robin hood doesn't give good
results either. I hope the author has tried robin hood with quadratic probing.
But it's a good reference. I didn't dive deep into the cuckoo route; this
paper gives me more confidence to try it out.

~~~
dgacmu
If you're curious about blending some of the "don't increase by a power of
two" memory efficiency gains, I implemented this in a cuckoo hash table for
TensorFlow, also using Lemire's trick:
[https://github.com/tensorflow/tensorflow/blob/master/tensorf...](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/util/presized_cuckoo_map.h)

I never back-ported it to libcuckoo, because it adds a fair bit of complexity
(and requires an extra bit), but if there's demand for non-2^k tables, I'd
love to know more, either externally or internally. (dga@, for various values
of cs.cmu.edu and google)

~~~
dryman
Lemire's fast range works perfectly with cuckoo, as it doesn't require probing
(?). The fast mod and scale trick was invented to address this issue.

It's actually pretty funny. I didn't know about Lemire's trick until I started
to write this article, but my first try at fixed-point arithmetic turned out
to be exactly the same as his work. Then I figured out that the lower bits get
dropped by fast range (I named it scaling). Finally I came up with this fast
mod and scale idea.
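For context, Lemire's fast range reduces a hash into [0, n) with a widening multiply and a shift instead of an integer modulo; the result depends mostly on the high bits of the hash, which is the "lower bits get dropped" effect. A sketch of the 32-bit variant:

```cpp
#include <cassert>
#include <cstdint>

// Lemire's fast range: map a 32-bit hash into [0, n) for arbitrary n using
// a 64-bit multiply and a shift instead of a (much slower) modulo. The
// output is determined mostly by the high bits of the hash.
inline std::uint32_t fast_range(std::uint32_t hash, std::uint32_t n) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(hash) * n) >> 32);
}
```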

I don't know about other people's needs for non-power-of-2 tables. My target
is to build large-scale hash tables and other data structures, where memory
and data compactness are quite critical.

------
skolos
I often see discussions about hash maps on HN. However, they mostly focus on
the performance (memory and speed) of one huge hash map with millions of
records. In my numerical calculations I often need lots of small hash maps
(say, up to 100 elements). Currently I use unordered_map from the C++ standard
library. Does anyone know what the recommended hash map implementation is for
my scenario?

~~~
nom
For very small number of elements, an array or a list can be more efficient
than hash maps. My rule of thumb is to avoid hash maps if you know you'll
never have more than 100 entries, and then try to find the bottleneck later
on. Using hash maps for everything isn't always the best solution.
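As an illustration of the array alternative: a flat vector with a linear scan, which at these sizes often beats std::unordered_map because everything stays contiguous in cache. A made-up sketch, not a tuned implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// A flat "map" for small element counts: keys and values in one contiguous
// vector, looked up by linear scan. No hashing, no pointer chasing; for a
// few dozen entries this is often faster than a node-based hash map.
struct SmallMap {
    std::vector<std::pair<std::uint64_t, std::uint64_t>> items;

    void insert(std::uint64_t k, std::uint64_t v) {
        for (auto& kv : items)
            if (kv.first == k) { kv.second = v; return; }  // overwrite
        items.emplace_back(k, v);
    }

    const std::uint64_t* find(std::uint64_t k) const {
        for (const auto& kv : items)
            if (kv.first == k) return &kv.second;
        return nullptr;  // not found
    }
};
```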

~~~
oconnor0
Using a sequence instead of a map is fine if you know it will _never_ be more
than some small number of entries, but so often we are wrong about how many
entries will be in a table that the N-squared use of a sequence as a map kills
performance unexpectedly.

~~~
adrianN
But that's what profiling is for.

~~~
usefulcat
Profiling won't tell you what the chances are that the collection will someday
need to contain more than N elements.

~~~
adrianN
You don't replace every map in your code with a distributed Key-Value store,
in case it might need to store terabytes in the future. Most of the time you
have a pretty good idea whether a map has to store a hundred, or a hundred
thousand elements.

------
glandium
I only skimmed over, but about memory overhead:

 _> The input is 8 M key-value pairs; size of each key is 6 bytes and size of
each value is 8 bytes. The lower bound memory usage is (6+8)⋅2^23= 117MB_

In many cases, hash table implementations won't (can't) assume fixed sizes for
keys and values, and use pointers instead. On 64-bit architectures, this can
mean an incompressible overhead of 8 * 2 = 16 bytes per item (one pointer each
for the key and the value).

In fact, a quick look at the OP's benchmark code shows he's using std::string
as keys and uint64 as values. With e.g. std::unordered_map, while pointers
won't be used for the values, there will obviously be pointers for the keys.

It's actually worse than pointers, since the std::string memory layout is a
size_t and a char* pointer. And it looks like his hash table essentially uses
the equivalent of char[6] as keys. So a fairer memory overhead comparison
would use char[6] as keys instead of std::string...

~~~
jsnell
Small string optimization should kick in here, so you're probably not looking
at a size_t + a pointer + an object with the 6 bytes of payload data. Instead
it'd be a single object of about 32 bytes (depending on the implementation).

But yes, this means that the author's own hash table implementation is getting
a roughly 4-5x density benefit from using a different string representation.
That totally invalidates the memory usage testing, and probably also has a
major effect on the speed tests.

~~~
glandium
Thank you for making me look at the implementation of std::string in GCC's
libstdc++... The "small string optimization" indeed makes every std::string at
least 1 size_t + 1 pointer + 16 bytes large (for a total of 32 bytes on
64-bit platforms).

Strings shorter than 16 bytes are stored in the 16-byte inline buffer, which
means strings longer than that actually take 32 + their length + 1 bytes in
memory on 64-bit platforms.
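The sizes are easy to verify locally; the exact number depends on the standard library (32 bytes for 64-bit libstdc++, 24 for libc++):

```cpp
#include <cstddef>
#include <string>

// The per-string footprint under discussion: 64-bit libstdc++ uses a char*
// data pointer + a size_t length + a 16-byte union (inline buffer or heap
// capacity) = 32 bytes; libc++ packs the same into 24 bytes with 22 chars
// stored inline.
inline std::size_t string_footprint() { return sizeof(std::string); }
```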

I guess things are not packed further so that the pointer can be used
directly, instead of requiring some computation on each access.

~~~
jsnell
> I guess things are not packed further to allow to just directly use the
> pointer instead of having to do some computation for each access.

There are C++ standard libraries that pack things further. I think clang has a
24 byte std::string, which can store strings of up to 22 bytes inline.

------
faitswulff
This note about his other project (that the robin hood hash table depends on)
is quite curious:

> This robin hood hashing is implemented using my project Object Persistence
> In C (OPIC). OPIC is a new general serialization framework I just released.
> Any in-memory object created with OPIC can be serialized without knowing how
> it was structured. Deserializing objects from OPIC only requires one mmap
> syscall. That is to say, this robin hood implementation can work not only in
> a living process; the data it stores can be used as a key-value store after
> the process exits.

> Right now, the throughput of OPIC robin hood hash map on small keys (6bytes)
> is 9M (1048576/0.115454). This is way better than most NoSQL key-value
> stores. The difference might come from write ahead logs or some other IO?
> _I’m not sure why the performance gain is so huge. My next stop is to
> benchmark against other embedded key-value store like rocksdb, leveldb and
> so forth._

~~~
dryman
Thanks for bringing this up. I spent way, way more time on OPIC (almost a
year) than on the robin hood hash map (2 weeks to complete the POC). Glad to
see people noticing this project :)

The other comment pointed out MDBM, which I didn't know about. From their
performance numbers, I think this may show why OPIC robin hood is close to
optimal. [https://yahooeng.tumblr.com/post/104861108931/mdbm-high-spee...](https://yahooeng.tumblr.com/post/104861108931/mdbm-high-speed-database)

MDBM gives users raw access to the mmapped data and pointers, and its
benchmarks show it 10x faster than rocksdb and leveldb. The design of OPIC has
even less overhead (which may not be a good thing) than MDBM, and it also
works on an mmapped file (or anonymous swap). There's no lock, transaction, or
WAL in OPIC. OPIC just gives you the raw performance a hash table can offer.

------
luckydude
Look up MDBM. I spent a lot of time with a logic analyzer watching the cache
misses go across the bus. I'd be pretty surprised if someone has done better.

I can find the code and repost it.

~~~
dryman
Thanks for the reference. I'm collecting a list of embedded key-value stores
to benchmark against. I'll try out this one first!

~~~
corysama
Fun note: Because of this line in the article

> Because MDBM gives you raw pointers into the DB’s data, you have to be very
> careful about making sure you don’t have array over-runs, invalid pointer
> access, or the like.

Someone wrote a Rust binding: [http://erickt.github.io/blog/2014/12/13/rust-and-mdbm/](http://erickt.github.io/blog/2014/12/13/rust-and-mdbm/)

~~~
dryman
Wow, that's a very interesting post! Maybe I should try to add a Rust binding
to my project as well...

------
wolfgarbe
How does it compare to the "Fastest Hashtable" by Malte Skarupke?
([https://probablydance.com/2017/02/26/i-wrote-the-fastest-has...](https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/))

