
I Wrote My Fastest Hashtable - ingve
https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/
======
ot
The author seems to ignore that GCC's implementation of std::hash is the
_identity_ on integral types: [https://github.com/gcc-
mirror/gcc/blob/cf8c140a8b8131b3ab4f5...](https://github.com/gcc-
mirror/gcc/blob/cf8c140a8b8131b3ab4f500346dc4c004c899163/libstdc%2B%2B-v3/include/bits/functional_hash.h#L113)

This is an incredibly poor design choice (at Facebook we'd say "clowntown") as
it makes it unusable with any sophisticated hashtable data structure that
assumes reasonable randomness. That's what causes the quadratic growth in the
sequential example with dense_hash_map.

The solution is to use better hash functions, not prime hash table sizes.
Prime modulus will only make the problems harder to notice, but if you have a
poor hash function, you're giving up every complexity guarantee. Note that
even if you reduce modulo a prime > n, you'll still have the identity over the
first n numbers.

Personally I just use folly::Hash
([https://github.com/facebook/folly/blob/da08e63ce3bac9196b18d...](https://github.com/facebook/folly/blob/da08e63ce3bac9196b18db82e64db8ecbb44d0ed/folly/Hash.h#L352))
as it uses sane defaults for hash functions, and contrary to std::hash it's a
polymorphic functor so you can just drop it as template argument without
having to specify the key type: dense_hash_map<int, int, folly::Hash>.

The other reason dense_hash_map uses power-of-2 sizes (other than to avoid the
modulo) is that it is sufficient to guarantee that the quadratic probing
schedule is transitive, that is, it can reach all the slots of the table.
Quadratic probing is used because it's very effective for unsuccessful
lookups. You can find a table of expected probe counts at different load
factors here:
[https://github.com/sparsehash/sparsehash/blob/4cb924025b8c62...](https://github.com/sparsehash/sparsehash/blob/4cb924025b8c622d1a1e11f4c1e9db15410c75fb/src/sparsehash/internal/densehashtable.h#L77)

Hash tables are a fine craft. While it's interesting to gain these experiences
first-hand, this is an amateurish post and it should be taken with a grain of
salt.

~~~
dzdt
I think "amateurish" is overly harsh here. He sets out a very professional
plan of measuring and testing. It is a nice post.

Yes, probably true that requiring a non-broken hash function is strictly
better than his preferred prime modulus workaround. Okay, so? If you notice,
the whole post keeps comparing a power-of-two version of his hashtable as
well. Other than the one graph showing pathological behaviour of dense_hash
with in-order inputs, everything in the post applies assuming a reasonable
hash function.

~~~
ot
I didn't mean "amateurish" in a judgmental way, it's a matter of fact that the
author is a novice when it comes to hash tables. He himself admits that he
doesn't understand some of the details ("Why do prime numbers help? I can’t
quite explain the math behind that"), so he gets some of the conclusions
wrong.

> Other than the one graph showing pathological behaviour of dense_hash with
> in-order inputs

That's a pathological behavior of std::hash, not of dense_hash.

> everything in the post applies assuming a reasonable hash function.

No, see this paragraph:

    
    
        All of these problems are solvable if you’re careful about choosing a hash
        function that’s appropriate for your inputs. But that’s not a good user
        experience: You now always have to be vigilant when using
        hashtables. Sometimes that’s OK, but sometimes you just want to not have to
        think too much about this. You just want something that works and doesn’t
        randomly get slow. That’s why I decided to make my hashtable use prime
        number sizes by default and to only give an option for using powers of two.
    

The belief is that prime number sizes will defend you against a poor hash
function. That's very dangerous advice.

------
_benedict
This is great, and I don't want to take away from it, but the idea that
limiting the number of probes before a rehash is a new contribution is
mistaken.

I've certainly done this in hash tables I've implemented at times, and I've
seen this in library hash tables too.

One public well-known example that springs to mind is Cliff Click's
NonBlockingHashMap in Java.

~~~
3pt14159
Why is it mistaken?

~~~
tossaway1
I feel like you misread his comment if you're asking that. His next sentence
explains why it's mistaken...

~~~
3pt14159
I missed the word _new_ , thanks. Deleted the comment.

Edit: Apparently I can no longer delete the comment. Stupid HN.

~~~
dpark
You cannot delete a comment once someone replies to it.

~~~
dkersten
Nor should you. It breaks the conversation for anyone reading it afterwards
and makes the replies essentially off topic.

------
jlrubin
Author may be interested in [http://lemire.me/blog/2016/06/27/a-fast-
alternative-to-the-m...](http://lemire.me/blog/2016/06/27/a-fast-alternative-
to-the-modulo-reduction/) as a way of avoiding modulo altogether.

~~~
79d697i6fdif
The author seems mistaken about a couple things relating to hash tables. He
doesn't seem aware that, by far, the main reason for using power-of-two hash
table sizes is so you can use a simple bitshift to eliminate modulo entirely.

He is also going to run into modulo bias since using the remainder to
calculate a slot position is biased toward certain positions for every table
size that isn't a power of two. (see [https://ericlippert.com/2013/12/16/how-
much-bias-is-introduc...](https://ericlippert.com/2013/12/16/how-much-bias-is-
introduced-by-the-remainder-technique/) for cool graphs) Prime number table
sizes do nothing to fix these issues. The power-of-two size with a bitshift is
not just faster, it gets rid of the modulo bias.

Fastrange is much faster than modulo but if his goal is to build the fastest
hashtable it's stupid to use a table size of anything except a power of two.

Source: I also wrote a hashtable :)

~~~
anon1385
>The author seems mistaken about a couple things relating to hash tables. He
doesn't seem aware that, by far, the main reason for using power-of-two hash
table sizes is so you can use a simple bitshift to eliminate modulo entirely.

From the article:

>This is the main reason why really fast hash tables usually use powers of two
for the size of the array. Then all you have to do is mask off the upper bits,
which you can do in one cycle.

~~~
hinkley
Yeah I'm gonna need an explanation too. Even with hashes hardened against DOS
attacks, the lower bits tend to vary with the input far faster than the upper
bits.

For instance, several popular hash algorithms (of the form 31 * hash + value)
will have the same upper bits for any two strings of the same length, up to
about 12 characters, when the seed value finally gets pushed out. Unless
you're still using 32-bit hashes for some reason, and then it's most strings
under 6 characters long.

Multiply by prime and then add or xor the next value guarantees that the
bottom bits are less biased and so fits with any hashtable function that uses
modular math, even if the modulus is taken by bit masking. Getting the upper
bits to mix would take a different operator or larger operands.

I know I've encountered at least one hash function that uses a much larger
prime (16 or more bits) as its multiplier, so it's not unheard of, but it's
certainly not universal.

------
jlrubin
I'd be interested to see a comparison of cuckoo hashing v.s. robin hood on
these benchmarks.

I chose to use Cuckoo in my implementation of a high performance fixed-size
cache for bitcoin signature validation (see
[https://github.com/bitcoin/bitcoin/blob/master/src/cuckoocac...](https://github.com/bitcoin/bitcoin/blob/master/src/cuckoocache.h)).
Cuckoo is great in my experience if you want to use memory fully and can
afford > 3 high quality hashes per entry. I also like cuckoo because you're
able to use it pretty well for lockfree/high concurrency use cases (can make
concurrent erases and reads lock free for example), which made it a good fit
for the bitcoin workload. It seems that this hashtable's approach isn't quite
amenable to concurrent operations, but perhaps there are some tricks that
could work for that.

~~~
rurban
Cuckoo uses 2x memory, with constant cache thrashing. It can be made fast with
concurrent maps, but there Cliff Click's or Preshing's Leapfrog approaches are
better.

~~~
jlrubin
What makes you say that cuckoo uses 2x memory?

~~~
rurban
The most commonly used variant uses 2 arrays, indexed by the two hash
functions. Or one array twice the size.

While the current trend is to reduce the size of this array by using size-
compressed indirect indices, e.g. the new ruby/python hashes.

------
PeterisP
I wouldn't ever have thought that replacing the generic integer modulo
operation with a switch statement and a modulo operation with a particular
hardcoded denominator could possibly cause a performance improvement.

Especially for primes. I can imagine that various power-of-two related values
could have some bit-twiddling ways to make them much faster, but assuming that
the compiler will be able to improve x mod 711 to something better than using
the CPU modulo operation seems weird... but it somehow works.

~~~
Sharlin
Yep, there's pretty much a trick for every reasonably small divisor. For
instance GCC 4.9 translates division by 711 as follows (dividend in edi):

    
    
            mov     eax, edi
            mov     edx, -1202107583
            imul    edx 
            mov     eax, edi
            sar     eax, 31
            add     edx, edi
            sar     edx, 9
            sub     edx, eax
            mov     eax, edi
            imul    edx, edx, 711
            sub     eax, edx
    

This is, uh, pretty deep magic. Apparently this sequence of ops, including
those two imuls, is still faster than a single idiv.

~~~
tehwalrus
Is this the trick where you multiply by the two's-complement inverse instead
of dividing? That was the coolest thing I learned when I spent a few days
coding assembly.

I can't see the shift that you normally need after pulling the result out of
the overflow register, though. My asm is a little rusty.

~~~
tomsmeding
It doesn't need the shift because it retrieves the result of the first imul
from edx, which contains the upper word of the result. The lower word is
placed in eax, which this code completely ignores.

~~~
tehwalrus
Yes but even in the upper register you still need to lshift by one or two bits
depending on the constant IIRC. Five had me shifting by (1 or 2), and seven by
(2 or 1) I seem to recall. I can't remember what property of the number
determines the size of the shift though.

------
gpderetta
Hum, the nice thing with Robin Hood hashing is that you get close to 90% load
factor without pathological performance degradation. If you have a fixed probe
count, wouldn't you be better off just grouping your buckets into fixed-size
sub-buckets and doing (trivially parallelizable) linear searches for lookups
and to find empty insert positions?

On a related note, I have in general concerns about latency spikes due to
rehashing. Is there a good, general, incremental rehashing scheme?

~~~
_benedict
Well, if you're willing to incur a slight latency penalty on each access,
incremental rehashing is terrifically easy. Just keep both the old and new
backing arrays around, and consult a pointer to determine how far in the first
backing array has been invalidated.

If you follow a Shalev/Shavit hash-ordered-lists scheme the rehash doesn't
even need to move anything, and can be done with a straight memcpy, or in a
separate thread. The link below is an implementation that was intended for
Cassandra (so is simplified and append-only, but also a concurrent map - which
could be relaxed). It has the nice property that it will "fix" the backing
array on-demand, so straight-up copying the existing contents of the backing
array into the new region of the backing array, once doubled, will set things
up for a gradual transition.

Similarly, it's trivial for the backing array to instead grow incrementally,
and for an extra level of indirection to prevent ever having to reallocate
older regions (since their contents never change on rehash).

[https://github.com/belliottsmith/cassandra/blob/47d970ead690...](https://github.com/belliottsmith/cassandra/blob/47d970ead69035f5cd294ebae536dd3d5e985666/src/java/org/apache/cassandra/concurrent/NonBlockingHashOrderedMap.java)

~~~
gpderetta
Keeping the old array around is the obvious solution, but the problem is that
you might need to rehash the second array as well. Do you need to keep an
arbitrary number of arrays around, or are there hashing schemes that guarantee
that you'll be done with rehashing by the time you need to rehash the second
buffer? I'll read up on hash-ordered-lists.

~~~
_benedict
Well, depending on your rehash trigger (i.e. if it is dictated by the load
factor, not the number of reprobes), it can be guaranteed that the new array
does not need to be rehashed before the in-progress migration, if you migrate
at least one bucket per write operation. If you were to (sensibly) migrate
many (say, >= 16) with each write, the likelihood of this happening is very
low with any reasonable hash function, since the work of rehashing will
typically be done many times faster than the skew can retrigger your rehash.
This at least makes the incidence of a latency-inducing event of waiting for
the prior rehash to complete very low.

If you rehash based on something like "number of reprobes" then it's probably
impossible to absolutely guarantee that a rehash cannot be triggered before a
prior one begins, but the linked approach would permit growing the backing
array discontiguously (assuming an extra level of indirection to the backing
array). But this would obviously incur some extra overheads, and is not
entirely dissimilar to hash-tries, which also avoid any costly global rehash.

I think the class-level comment in the code I linked is perhaps the best
introduction to split-ordered lists I know of (not hash-ordered; they're
sorted by the reverse bitstring of the hash), since they're not widely
discussed and not immediately intuitive.

------
nialv7
Growing the hashtable based on the number of unsuccessful probes is
essentially the same as growing based on the load factor (α), because the
expected number of probes can be calculated from it: 1 + 1/(1 − α).

~~~
_benedict
Well, you're assuming a perfectly uniform distribution of your values. Skew in
data is common, since cryptographic hash functions are uncommon (they're
expensive), and anyway we generally store collections vanishingly smaller than
infinity, in which case even a perfect distribution will produce clusters with
varying frequency; each growth policy will degrade differently under these
situations.

At the extreme, if every item produces the same hash code, growing on reprobes
will result in exponential insert complexity (since every insert will trigger
a grow). Growing on load factor will just maintain O(n) insert/lookup
complexity (or the complexity of the hash bucket, if not open-addressing).

With less extreme skew, growing on load factor will simply use less memory
than growing on reprobes, while reprobes will have a slightly lower median
(and possibly mean) insert/lookup costs.

------
amelius
What objects were actually stored in the hash table for these tests?

Are they fixed size, non-heap-allocated?

And would the performance improvement still be relevant if this is not the
case, e.g. for variable size strings?

~~~
ZoFreX
Some of the hash tables were heap-allocating, others were in-place. They
tested a variety of sizes.

------
titzer
Integer modulus is one of the slowest CPU operations, especially on modern
CPUs (since most other operations have been heavily optimized, and there are
many execution units available for them). IDIV on Nehalem can literally take
up to 100 cycles
([http://www.agner.org/optimize/instruction_tables.pdf](http://www.agner.org/optimize/instruction_tables.pdf)),
and it varies _a lot_.

Power of 2 hashtables are so much faster because indexing uses a mask. The
problem is that, as others have mentioned, powers of 2 suffer a lot with bad
hash functions, and people tend to write pretty bad ones, so it's generally a
good idea to salt or rehash hashcodes before dropping the upper bits with a
mask.

In fact it looks like the author's results bear that out, as
flat_hashmap_powers_of_two is almost always the fastest.

------
Dowwie
Raymond Hettinger gave an interesting talk about Python's dict (hash table)
and how Python's core development team evolved it over time. He's giving the
polished version of the talk at PyCon this May, but if you can't wait for it,
here's a rough version:
[https://www.youtube.com/watch?v=p33CVV29OG8](https://www.youtube.com/watch?v=p33CVV29OG8)

------
silok
Have you compared it to this implementation?
[http://www.ilikebigbits.com/blog/2016/8/28/designing-a-
fast-...](http://www.ilikebigbits.com/blog/2016/8/28/designing-a-fast-hash-
table)

------
charisma_stupid
This is good work. I wish it was extended to work under multiple threads so it
could be used in databases.

------
markcerqueira
A lesson in "Tone is everything" perhaps? You make excellent points and taught
me a bunch (thanks!), but just because the author admits he doesn't understand
hash tables inside and out does not make calling his work "amateurish" or
"clowntown" okay.

~~~
ot
If you read carefully you'll see that "clowntown" referred to GCC's
implementation of std::hash, not the post. I think it's natural to have high
standards for GCC.

I don't think my tone was too aggressive, except maybe for the word
"amateurish"? I explained above what I meant; if you replace that word with a
better one, does it change the tone of the post?

~~~
astine
"if you replace that word with a better one does it change the tone of the
post?"

Yes, it does.

'Amateurish' is often used as an insult to imply that someone needn't have
tried at all. If you replaced it with 'novice,' I think people would have
understood your meaning without taking offense.

~~~
ot
I see. I didn't know "amateurish" had such negative connotations.
Unfortunately HN doesn't let me edit the post anymore.

~~~
q3r3qr3q
Your tone was fine. Don't pander to these people, please.

~~~
mamp
For a post on HN the tone was ok, but not great. If one of my team spoke like
that in a meeting I would commend them on their technical knowledge, but
remind them that their effectiveness and influence depend more on tone than
technical skill, because people.

------
Ono-Sendai
Well I think my hashtable is probably faster.

It's also faster than dense_hash_map.

The author claiming their hash table is the fastest ever is kind of
ridiculous.

~~~
elmigranto
It's the fastest of all the _author's_ implementations, which is stated in
the first paragraph.

~~~
Ono-Sendai
Yes, but the title makes a claim that it's the fastest.

~~~
elmigranto
And it is, though not among the set of hashtable implementations you assumed.

It is quite similar to how you wrote "fastest" instead of "fastest among all
existing publicly available implementations". Nothing wrong with that, since
it is obvious from context, very much the same way it is obvious from the
article what the author means.

~~~
attractivechaos
In the text, it is fine to say "fastest" because there is a context, but in
the title, "fastest" means above everything else in the world because the
context has not been established yet to readers. If you read news with
"fastest" in the title (e.g. fastest CPU or fastest compiler), it really means
the fastest in the world at the time of publication.

The author could just say "very fast" or "faster than common hash table
libraries" in the title. The current title, IMHO, undermines the quality of
the post (which is actually good).

~~~
cat199
But _I'm_ the context! Geez.

~~~
attractivechaos
The title of this post was changed to "my fastest" after I wrote my previous
comment. The original post still says "the fastest". There was no context.

