
Robin Hood Hashing should be the default hash table implementation (2013) - doomrobo
http://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/
======
Bognar
The post mentions using tombstones for deletion, but I was under the
impression that the variance benefits of Robin Hood hashing weren't realized
unless you performed a backwards shift on the remaining elements instead of
using a tombstone.

See these posts for more information:

[http://codecapsule.com/2013/11/11/robin-hood-hashing/](http://codecapsule.com/2013/11/11/robin-hood-hashing/)

[http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/](http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/)

Unfortunately, there isn't much good information on the overall CPU performance
impact of using a backward shift vs. a tombstone.
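
For readers unfamiliar with the difference, here is a minimal sketch of backward-shift deletion, assuming a flat table of `Option` entries and, purely for brevity, keys that act as their own hashes (all names here are hypothetical):

```rust
// Sketch of backward-shift deletion: after clearing the deleted slot,
// pull each following entry back one slot until we hit an empty slot
// or an entry that is already in its home slot (probe distance 0).
fn backward_shift_delete(table: &mut Vec<Option<(u64, String)>>, mut pos: usize) {
    let mask = table.len() - 1; // assumes power-of-two capacity
    table[pos] = None;
    loop {
        let next = (pos + 1) & mask;
        match table[next].take() {
            Some(entry) if probe_distance(entry.0, next, mask) > 0 => {
                table[pos] = Some(entry); // shift back one slot
                pos = next;
            }
            other => {
                table[next] = other; // restore and stop
                return;
            }
        }
    }
}

// How far `slot` is from the key's home slot, with wraparound.
fn probe_distance(key: u64, slot: usize, mask: usize) -> usize {
    let home = (key as usize) & mask; // identity "hash" for brevity
    (slot + mask + 1 - home) & mask
}
```

No tombstones remain afterwards, which is what preserves the low probe-length variance the original paper promises.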

~~~
mcguire
I don't know the actual performance impact, but using backwards shifts and
removing tombstones leaves the code much simpler.

[https://github.com/tmmcguire/rust-toys/blob/master/pony/anagrams/robinhood.pony](https://github.com/tmmcguire/rust-toys/blob/master/pony/anagrams/robinhood.pony)

------
doomrobo
A variant of this is the default for Rust's HashMap.

[http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/](http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/)

~~~
_nalply
See [http://cglab.ca/~abeinges/blah/robinhood-part-1/](http://cglab.ca/~abeinges/blah/robinhood-part-1/) for more about the implementation in Rust.

~~~
Gankro
This post only touches on the theory, not the implementation. I never got
around to writing about the implementation as there was some pending churn at
the time.

~~~
_nalply
That's a pity. Please continue if you can; I'm hugely interested.

------
stcredzero
Neat idea! It uses the same generalized strategy as a splay tree. Not exactly
the same, of course, but there's an analogy here: the algorithm achieves fast
amortized (and average) time by continually shortening the "search paths."
(In the case of the RH hash, it's not actually a path but a sequence of
slots.)

~~~
dlbucci
Found the CMU grad! I think this is very different from a splay tree, though.
Open-address tables take advantage of the cache by removing linked lists from
the data structure, which is why I think they tend to have good real-world
performance.

Splay trees, on the other hand, can only really be implemented using pointers
and a linked structure (splaying a tree that's been implemented with an array
sounds absolutely miserable). They don't have good cache usage, which is why
they generally perform poorly outside of big-O notation.

~~~
jabl
Also, splay trees turn every read into several memory writes. Maybe not a good
idea if the data structure is read by multiple threads (cacheline ownership
bounces back and forth all the time, rather than each core having its own
cached copy).

~~~
stcredzero
Binary Trees are a bad idea from the POV of cache use, and from your analysis,
Splay Trees are much worse. However, this has nothing to do with the
generalized meta-strategy of continually reducing the amount of work it takes
to fetch items. It has to do with why Binary Trees are bad from the POV of
cache use.

------
nikic
One consideration when choosing your hash table implementation is how
susceptible it is to the quality of the hash function used. Hash tables using
chaining are pretty robust against low-quality hash functions; use the same
functions with a Robin Hood table and you will land in a catastrophic case
very quickly.

This is the issue I ran into when investigating the use of RH hashing in PHP:
the average case had good performance, but there were pathological cases
exhibiting extremely high clustering in a mostly empty table. The cause was
the low-quality hash function (for integer keys, the identity function).
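
As a toy illustration of that failure mode (not PHP's actual code): an identity hash plus a power-of-two table means keys that share their low bits all collide in one slot, even when the table is nearly empty:

```rust
fn main() {
    let capacity = 1024; // mostly empty table: we insert only 64 keys
    let keys: Vec<u64> = (0..64).map(|i| i * 1024).collect(); // stride == capacity
    // With an identity hash, the slot is just `key % capacity`...
    let slots: Vec<usize> = keys.iter().map(|&k| k as usize % capacity).collect();
    // ...so every key lands in slot 0, and linear probing builds one
    // 64-element cluster instead of spreading the entries around.
    assert!(slots.iter().all(|&s| s == 0));
    println!("all {} keys collide in slot {}", keys.len(), slots[0]);
}
```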

------
jzebedee
I was comparing hash performance for a hobby project when I first discovered
Robin Hood hashing. I ended up making a demo implementation of the backwards-
shifting variant in C# [1] and found that it wasn't the best performer for the
load case I needed. I still think it's an incredibly clever algorithm and I
would love to find a project where I could use it again.

[1] [https://github.com/jzebedee/rhbackshiftdict](https://github.com/jzebedee/rhbackshiftdict)

~~~
Bartweiss
I'm curious - what was your deletion strategy?

The Code Capsule links in the comments here suggest that tombstone deletion
actually breaks the low-variance promise of the original paper, and that only
backward-shift deletion produces the theoretical performance.

------
jabl
Another approach, slightly fancier than the usual textbook ones, is cuckoo
hashing. Has anyone done a comparison of cuckoo vs. Robin Hood?

~~~
pkhuong
Lookups in cuckoo tables hit two uncorrelated ("random") locations in the hash
table. That's not a problem for hardware implementations that can just read
from two different SRAMs. On commodity hardware with virtual memory, it
depends. If the table fits in your TLBs, you're probably OK… but at such small
sizes, the complexity of Cuckoo hashing and its sensitivity to bad hash
functions might not be ideal.

If, however, both accesses are expected to incur TLB misses, you're very
likely screwed: handling the misses takes a lot of resources and the vast
majority of (commodity) CPUs can only handle one TLB miss at a time. After
hashing, the time for lookups is now dominated by two TLB misses; compare to
one TLB miss with linear probing. You can also play the odds and only do the
second lookup conditionally; that's still ~1.5 random accesses on average
(versus 1).

Cuckoo hashing can get you 95-99% occupancy versus 80-90% with simpler
schemes. Unless you have specialised size or cost requirements, it seems to me
the extra 10-20% space usage is a better choice than double the random
accesses.
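
To make the access pattern concrete, here's a hypothetical 2-ary cuckoo lookup (a sketch, not anyone's production code). Both candidate slots come from independent halves of one hash, and on a large table each probe can land in an unrelated page:

```rust
// Each key has exactly two candidate slots, one per half of the table,
// derived from two "independent" hash values. Both probes can touch
// unrelated pages, which is the TLB concern above.
fn cuckoo_lookup(table: &[Option<(u64, u64)>], key: u64) -> Option<u64> {
    let half = table.len() / 2; // assumes each half is a power of two
    let (h1, h2) = split_hash(key);
    let candidates = [h1 % half, half + h2 % half];
    for &slot in &candidates {
        if let Some((k, v)) = table[slot] {
            if k == key {
                return Some(v); // found in one of the two slots
            }
        }
    }
    None
}

// Split one 64-bit hash into two halves; a real implementation would
// want a stronger mixer, this is just for illustration.
fn split_hash(key: u64) -> (usize, usize) {
    let h = key.wrapping_mul(0x9E37_79B9_7F4A_7C15); // Fibonacci multiplier
    ((h as u32) as usize, (h >> 32) as usize)
}
```

Compare with linear probing, where the second (and third, ...) probe is almost always in the same cache line or page as the first.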

~~~
nkurz
I lost you in the last paragraph. Are you maybe missing a word, or saying
cuckoo where you mean Robin Hood?

Separately, aren't TLB misses pretty fast compared to hitting RAM? I'd think
they would only dominate if the TLB itself were too large for cache, which I
don't think is common. And if you're using so much memory that this is
happening, moving to GB hugepages would solve it.

The "sensitivity to a bad hash function" seems like an odd weakness. For a
given quality of hash, wouldn't cuckoo tend to be less susceptible than any
single hash approach?

I ask because I'm currently bullish on d-ary cuckoo hashes, and think they'd
be a good fit for the fast gather on Skylake.

~~~
pkhuong
My last paragraph says you're probably better off with a solution that doesn't
achieve the same density but avoids the uncorrelated accesses. In other words,
stick to some form of local probing and take the hit in density.

Why would TLB misses be small compared to RAM latency? A TLB miss must be
serviced by reading more memory, _and_ misses aren't handled in parallel,
unlike random reads to RAM. Sure, you could use very large (1G) pages, but
that's a pretty specialised setup that's not available on every platform and
tends to require a reboot to enable/tweak. Not something we want to rely on in
general.

Cuckoo is particularly sensitive to bad hash functions: if a few elements
always hash to the same pair of values, you're screwed. That's particularly
problematic with the usual interfaces, which don't let the hash table pass a
seed to the hash function and expect a machine word: we have to map values to
hashes, then remix or split those hashes (in theory, that's defensible as long
as the intermediate hash is strong and its codomain is at least (hash set
size)^2), but there's nothing to remix away if too many values map to the same
intermediate hash. If the hash table instead calls the hash function with two
different seeds, that means double the time spent hashing, and that overhead
can cover for a lot of linear probing.

Simple deterministic probing techniques just take a hit from a bigger cluster
than expected; a performance degradation, but the table still _works_.
Theoretical analyses will lead you to similar conclusions if you look at the
k-independence needed for each hashing technique.

Finally, I don't think gathers are faster than independent memory accesses
unless everything is in the same cache line. You don't need SIMD instructions
to expose memory level parallelism, you just need independent dependency
graphs in your serial computation (until you're bottlenecked on an execution
resource that's not duplicated and rarely pipelined, like the TLB miss logic).

~~~
nkurz
_My last paragraph says you're probably better off with a solution that
doesn't achieve the same density but avoids the uncorrelated accesses. In
other words, stick to some form of local probing and take the hit in density._

Thanks, I was misreading.

_Why would TLB misses be small compared to RAM latency?_

Because on recent CPUs (post-P5 for Intel), the page walks that service a TLB
miss use the standard data caching mechanisms, so for a frequently used hash
table that reads only a couple of cachelines per lookup, the page tables
usually remain in cache:
[http://electronics.stackexchange.com/a/67985](http://electronics.stackexchange.com/a/67985).

So while the TLB miss requires a lookup, this lookup frequently doesn't
require hitting RAM. My recollection is that this means a TLB miss usually
costs only the relevant cache miss plus ~10 cycles. But this does require
certain assumptions about the access pattern, and I've been meaning to retest
this on recent hardware to be sure.

_misses aren't handled in parallel_

Based on your earlier phrasing you probably realize this, but in case others
don't: since Broadwell, Intel CPUs do handle two page walks in parallel:
[http://www.anandtech.com/show/8355/intel-broadwell-architecture-preview/2](http://www.anandtech.com/show/8355/intel-broadwell-architecture-preview/2).

_Cuckoo is particularly sensitive to bad hash functions: if a few elements
always hash to the same pair of values, you're screwed._

Yes, although if you can choose a good hash function this should be rare. And
there are variations of cuckoo hashing that are much less susceptible to it:
the first increases the number of hashes (d-ary), and the second adds multiple
"bins" per bucket, as described by 'cmurphycode' in another comment. You can
also add a failsafe by keeping a "stash" of last resort:
[https://www.eecs.harvard.edu/~michaelm/postscripts/esa2008full.pdf](https://www.eecs.harvard.edu/~michaelm/postscripts/esa2008full.pdf)

_If the hash table instead calls the hash function with two different seeds,
that means double the time spent hashing, and that overhead can cover for a
lot of linear probing._

If you can choose your own hash function, the hashing cost should be minimal
even for a "perfect" hash. And a SIMD approach usually means you can create 2,
4, or 8 hashes with different seeds in exactly the same time it takes to
create a single hash:
[http://xoroshiro.di.unimi.it/xoroshiro128plus.c](http://xoroshiro.di.unimi.it/xoroshiro128plus.c)

_Finally, I don't think gathers are faster than independent memory accesses
unless everything is in the same cache line._

They weren't any faster until Skylake, but they are significantly faster now:
[https://github.com/lemire/dictionary](https://github.com/lemire/dictionary)

------
jlebar
The overhead of storing hashed values inside the hashtable is not to be
neglected.

If my key type has a fast hash function, maybe I don't need to store the
hashed value. But if e.g. I have a hashset of char*s, I definitely don't want
to rehash. Storing those hashes alongside the pointers effectively halves my
load factor, from a cache-friendliness point of view. There goes the whole
load-factor advantage.
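
A back-of-envelope version of that point, assuming a 64-bit target (the struct names here are made up):

```rust
use std::mem::size_of;

struct EntryWithHash {
    hash: u64,      // cached hash, avoids rehashing the char* key
    key: *const u8, // the pointer key itself
}

struct EntryBare {
    key: *const u8,
}

fn main() {
    // 16 bytes vs. 8: a 64-byte cache line now holds 4 entries instead
    // of 8, which is the "halved load factor" from the cache's view.
    assert_eq!(size_of::<EntryWithHash>(), 16);
    assert_eq!(size_of::<EntryBare>(), 8);
}
```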

Maybe Robin Hood hashing still wins by letting me use linear probing, which is
more cache-friendly than (say) the quadratic probing you'd normally have to
do. But this is getting sketchier...

In addition, it's not clear to me that the cost of the swaps is entirely
negligible in the common case when the elements of the hashtable are more than
one or two words wide. Certainly not if you have to run a constructor to do
the move instead of just memcpy'ing the elements.

I like the idea, but it is not (to me) obviously a knockout win.

~~~
pkhuong
If you don't need to store the hash value because your hash function is cheap,
then don't store it. Nothing about the (linear) Robin Hood collision-handling
strategy forces an explicit representation of the hash value. However, it is
true that C++ makes some data structures slower than they should be.
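
Concretely, the Robin Hood invariant only ever compares displacements, and a displacement can be recomputed on demand from the key's hash and the slot index; a sketch:

```rust
// No stored hash required: an entry's displacement from its home slot
// can be recomputed whenever insertion or lookup needs it, which is
// cheap as long as hashing the key is cheap.
fn displacement(hash: u64, slot: usize, capacity: usize) -> usize {
    let home = hash as usize % capacity;
    (slot + capacity - home) % capacity
}
```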

------
spike021
My data structures and algorithms course was unfortunately taught by a pretty
awful professor.

But out of the entire course, he did have us implement a hash table using
Robin Hood hashing for a homework assignment. It's the one part of the course
I enjoyed despite my issues with it, and the one topic that, a couple of years
down the road, I still like to think about and discuss with coworkers, because
it's genuinely interesting.

------
raymondh
How is this different from "Brent's variation of Knuth's Algorithm D"?

* [https://books.google.com/books?id=e3wLBAAAQBAJ&pg=PA532#v=on...](https://books.google.com/books?id=e3wLBAAAQBAJ&pg=PA532#v=onepage&q&f=false)

* [http://maths-people.anu.edu.au/~brent/pd/rpb013.pdf](http://maths-people.anu.edu.au/~brent/pd/rpb013.pdf)

~~~
jaswilder
The author's thesis has some comparisons to Brent's method; see pp. 51-53 of
[https://cs.uwaterloo.ca/research/tr/1986/CS-86-14.pdf](https://cs.uwaterloo.ca/research/tr/1986/CS-86-14.pdf)
for the conclusions.

------
qwertyuiop924
RHH is neat and useful, but I don't know about making it your go-to hash
implementation: if you're writing the code yourself and don't have to worry
about the cost of the occasional cache miss, the simplicity of list or B-tree
chaining is a pretty big win.

However, if you're implementing a general-purpose hashtable, or are in an
environment where you really can't afford that cache miss, RHH or something
similar seems like the right way to go.

------
praxulus
> Just modify the standard search algorithm to ignore empty slots (rather than
> terminate the search) and keep going until you’ve probed longer than the
> known maximum probe length for the whole table.

Why does it need to keep searching after it hits an empty slot? Can't it
terminate the search if it hits the max probe length _or_ an empty slot?

Edit: Ah, that's part of the "even better version" shown later.
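
For reference, a sketch of that kind of early-terminating lookup (hypothetical code, not the article's exact version, with a stand-in integer mixer for the hash). It stops at an empty slot, or as soon as the resident entry sits closer to its home slot than our current probe count, since Robin Hood insertion could never have pushed a matching key past that point:

```rust
fn lookup(table: &[Option<(u64, u64)>], key: u64) -> Option<u64> {
    let capacity = table.len();
    let home = hash(key) as usize % capacity;
    for dist in 0..capacity {
        let slot = (home + dist) % capacity;
        match table[slot] {
            None => return None, // empty slot: the key is absent
            Some((k, v)) if k == key => return Some(v),
            Some((k, _)) => {
                // The resident entry is "richer" (closer to home) than
                // our current probe: the key can't be further along.
                let resident_home = hash(k) as usize % capacity;
                if (slot + capacity - resident_home) % capacity < dist {
                    return None;
                }
            }
        }
    }
    None
}

// Stand-in integer hash; any decent mixer works here.
fn hash(key: u64) -> u64 {
    key.wrapping_mul(0x9E37_79B9_7F4A_7C15)
}
```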

------
xedarius
He also wrote a follow-up article addressing performance issues around probe
lengths increasing when erasing elements.

[http://www.sebastiansylvan.com/post/more-on-robin-hood-hashing-2/](http://www.sebastiansylvan.com/post/more-on-robin-hood-hashing-2/)

------
diziet
There is also a great resource with more up-to-date material on hashing and
benchmarks here:
[https://github.com/rurban/smhasher/](https://github.com/rurban/smhasher/)

~~~
erichocean
That compares hash _functions_, not hash _tables_ (aka dictionaries, maps,
key-value stores, etc.).

Hash functions convert sparse keys to dense keys (with potential for
collisions). Hash tables look up and store values by key using their hash
(usually in-memory only, although this is not a requirement _per se_).

~~~
Gankro
It's also a _terrible_ comparison of hash functions for the purpose of
hashmaps, because it tries to reduce the performance characteristics to a
single number. The performance characteristics of a hash function vary with
the size of the input; FarmHash is super complicated because it's basically a
huge stable of hash functions selected based on input size.

For instance, looking at these numbers you might conclude XXHash is strictly
faster than FNV, but in reality FNV performs much better for the more common
case of small inputs (breaking even around 16-32 bytes).
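
For reference, FNV-1a is tiny, which is why it can win on short keys; this is the standard 64-bit variant:

```rust
// 64-bit FNV-1a: one xor and one multiply per byte, with no setup cost,
// so it beats heavier hashes on the short keys mentioned above.
fn fnv1a_64(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xCBF2_9CE4_8422_2325; // FNV offset basis
    for &byte in data {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100_0000_01B3); // FNV prime
    }
    hash
}
```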

~~~
diziet
I disagree; this simple test suite isn't meant to be a definitive guide to
which hash function is best.

I'd venture that your criticism is similar to saying the Hutter Prize [1] is a
terrible idea because someone who needs to compress a dataset will choose PAQ
and it will run really slowly. Yes, there are trade-offs and optimizations for
different use cases, but to say it's a terrible comparison is unfair.

1) [http://prize.hutter1.net/](http://prize.hutter1.net/)

