
Why Hashbrown Does a Double Lookup - mbrubeck
https://gankro.github.io/blah/hashbrown-insert/
======
cbhl
If memory overhead (load factor) isn't a big issue, readers may also find
cuckoo hashing rather interesting. (Theoretically, it has a worst-case
constant lookup time. Learned about it in one of my university algorithms
classes, but have yet to see an implementation in practice.)

[https://en.wikipedia.org/wiki/Cuckoo_hashing](https://en.wikipedia.org/wiki/Cuckoo_hashing)
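
A minimal sketch of the lookup side, in Rust (hypothetical structure; real
implementations derive two independent hash functions and handle insertion by
evicting and relocating entries):

    // Every key lives in one of exactly two buckets, so a miss is
    // decided after at most two probes: worst-case constant lookup.
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    struct CuckooTable<K, V> {
        slots_a: Vec<Option<(K, V)>>,
        slots_b: Vec<Option<(K, V)>>,
    }

    impl<K: Eq + Hash, V> CuckooTable<K, V> {
        fn bucket(key: &K, seed: u64, len: usize) -> usize {
            let mut h = DefaultHasher::new();
            seed.hash(&mut h);
            key.hash(&mut h);
            (h.finish() as usize) % len
        }

        fn get(&self, key: &K) -> Option<&V> {
            // Probe 1: the bucket chosen by hash function A.
            let i = Self::bucket(key, 0xA, self.slots_a.len());
            if let Some((k, v)) = &self.slots_a[i] {
                if k == key { return Some(v); }
            }
            // Probe 2: the bucket chosen by hash function B.
            let j = Self::bucket(key, 0xB, self.slots_b.len());
            if let Some((k, v)) = &self.slots_b[j] {
                if k == key { return Some(v); }
            }
            None
        }
    }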

~~~
Diggsey
This seems to assume that all information needed to determine if two keys are
equal is incorporated into the resulting hash value.

It's premised on the idea that you can dynamically change the hash function
such that _eventually_ you will find one with fewer than some fixed number of
items sharing the same hash value.

There are two problems: 1) It completely undermines the "worst-case constant
lookup", because there is no upper bound on the number of times you might have
to change the hash function and rebuild the table before you find one with
sufficiently few collisions.

2) It is a much stronger requirement on the keys than other hash tables have.
With other hash tables I can simply omit "difficult to hash" data from my hash
function, on the basis that _enough_ other data is hashed to avoid a
significant performance penalty. With this implementation, the entire hash
table will simply fail.

This additional failure mode is also a security concern - if an untrusted user
can affect what data is added to the table, it could cause a severe DoS where
the table will either enter an infinite loop trying to find a non-conflicting
hash function, or have an unexpected failure mode (most people don't expect
their hash tables to randomly fail). Even if it is somehow designed such that
a hash function can always be found, the attacker could spend time up-front
finding values that take a long time for that hash function to be reached,
causing numerous re-hashes of the table.

~~~
laughinghan
What are you talking about?

> It's premised on the idea that you can dynamically change the hash function
> such that eventually you will find one with fewer than some fixed number of
> items sharing the same hash value.

Why would you think there are any fixed numbers? The OP is about a stdlib hash
table (for Rust). Comparable hash table implementations all resize the table
when necessary to decrease the load factor. Performance with respect to load
factor still matters, of course (less resizing is better), but there are no
fixed numbers of anything.

> It completely undermines the "worst-case constant lookup", because there is
> no upper bound on the number of times you might have to change the hash
> function and rebuild the table before you find one with sufficiently few
> collisions.

Why would you need to change the hash function or rebuild anything during
lookup? It's lookup. Nobody is changing anything.

> It is a much stronger requirement on the keys than other hash tables have.

What hash table doesn't require keys to be hashable?

------
kibwen
The author is also in the thread on /r/rust if you have any questions for
them:
[https://www.reddit.com/r/rust/comments/b38cwz/why_hashbrown_...](https://www.reddit.com/r/rust/comments/b38cwz/why_hashbrown_rusts_new_hashmap_implementation/)

~~~
mabbo
I went to university with him. Really nice guy. Every time we talk, I say
something like "any day now dude, I'm going to really get into Rust" and then
I never do because I'm awful.

~~~
shriek
I've been saying the same since last year. I'm finally picking it up by just
trying to use it for regular everyday scripting work. It's been slow, but at
least I'm starting now.

~~~
tudelo
Slow indeed. Too slow...

------
prudhvis
This is pretty amazing. I really like the fact that the fastest hash table
implementation is going to be part of Rust's stdlib.

~~~
haggy
Where does it say that this is the fastest hash table implementation?

~~~
rurban
It's accepted knowledge that Google's SwissTable is currently the fastest
hashtable around (the C++ implementation). But there are three variants, and
this looks like the slowest of the three, though the best for dynamic
languages.

~~~
asveikau
It should depend a lot on what you put into it and the access pattern, no? You
can talk about averages, about complexity, about benchmarks, but I feel like
"fastest hashtable around" is a very odd expression, as if one wished the world
were a lot less complex than it is and there were exactly one best.

~~~
rurban
As I said, there are the two best variants, plus the worse third one. That one
was not implemented by Google, only described, but is now apparently
implemented in Rust. I haven't checked closely, because compared to C it's
hard to tell from Rust what exactly is going on. I'd need to see the SIMD
assembly too. You cannot really trust much of what's said about Rust, as
there's too much hype and too many lies, but it has undeniably gotten better
recently.

------
shittyadmin
So, what's the motivation to use "open addressing" vs. chaining, which I
thought was the more common approach to solving this?

I assume there must be a substantial performance gain for this to be used, as
it seems significantly more complicated. Is there any information on how much
better it is?

~~~
blattimwind
All* high performance hashtables use open addressing, because chaining tends
to mean (multiple) indirection to addresses outside the table.

* not sure if that's _literally_ true, but I've never seen anyone do chaining in performance-sensitive applications, and all the papers on fast hash tables use some way of open addressing.
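
As a rough illustration of the layout difference (hypothetical types, not any
real map):

    // Chaining: each bucket heads a linked list; every hop chases a
    // pointer to an unrelated heap address, i.e. a potential cache miss.
    struct ChainedBucket<K, V> {
        head: Option<Box<ChainNode<K, V>>>,
    }
    struct ChainNode<K, V> {
        key: K,
        value: V,
        next: Option<Box<ChainNode<K, V>>>,
    }

    // Open addressing: collisions stay inside the table itself, so a
    // probe sequence walks contiguous, prefetcher-friendly memory.
    enum OpenSlot<K, V> {
        Empty,
        Tombstone,
        Full(K, V),
    }
    struct OpenTable<K, V> {
        slots: Vec<OpenSlot<K, V>>,
    }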

~~~
shittyadmin
Interesting, both the Qt and Java ones seem to use chaining, but I guess
they're not designed with these sorts of extremely demanding applications in
mind.

~~~
amalcon
One thing to note is that this performance tradeoff is actually the reverse of
what it was ~20 years ago, because CPU speeds have improved dramatically more
than RAM speeds over that period.

Open addressing does do more comparisons than chaining. It used to save time
to traverse the linked list rather than to spend CPU cycles on those
additional comparisons. Now, because CPU cycles are relatively cheaper, the
opposite is true.

The reason the Qt and Java hashtables use chaining is simply that the code in
question was initially written back when the tradeoff ran the other way.

~~~
barrkel
If hash codes are stored inline alongside keys, hash code comparisons can be
made without an indirection, while key comparisons (often strings, seldom
anything without an indirection) usually need one. Hash codes are very cheap
to compare.

These indirections have always been costly on any machine with virtual memory;
TLB misses aren't free. I'd put any estimate of when the tradeoff ran the
other way at more than 25 years ago.
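
A sketch of the inline-hash-code point (hypothetical layout):

    // Keeping the full hash next to the entry means most non-matches are
    // rejected by a plain integer compare, with no indirection; only on
    // a hash match do we pay for the key comparison, which for a String
    // dereferences heap data (and may miss the TLB).
    struct Slot {
        cached_hash: u64, // inline: compared without any indirection
        key: String,      // comparison chases a heap pointer
        value: u32,
    }

    fn probe_matches(slot: &Slot, query_hash: u64, query_key: &str) -> bool {
        slot.cached_hash == query_hash && slot.key == query_key
    }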

------
ChrisSD
Answer: SIMD.

What looks like "two loops" is really "do two simple SIMD operations".
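
As a sketch of what one such "loop" compiles down to (x86_64 with SSE2; names
here are illustrative, not hashbrown's internals):

    // Compare 16 control bytes against a target byte in one go; each set
    // bit of the returned mask marks a candidate bucket, which the caller
    // can walk with trailing_zeros().
    #[cfg(target_arch = "x86_64")]
    fn match_byte(group: &[u8; 16], target: u8) -> u16 {
        use std::arch::x86_64::{
            _mm_cmpeq_epi8, _mm_loadu_si128, _mm_movemask_epi8, _mm_set1_epi8,
        };
        unsafe {
            let ctrl = _mm_loadu_si128(group.as_ptr() as *const _);
            let needle = _mm_set1_epi8(target as i8);
            _mm_movemask_epi8(_mm_cmpeq_epi8(ctrl, needle)) as u16
        }
    }

The "second loop" is then the same match on the same 16-byte group with a
different target byte (e.g. the EMPTY marker).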

~~~
scotty79
I think the answer is rather:

Tombstones in place of deleted elements mean that finding out that a key is
not in the table requires probing to a different place than finding the right
spot to finally put it. Hence two lookups. SIMD just makes keeping the two
lookups separate economical.
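
A scalar sketch of why the two probes stop at different places (simplified:
no wrap-around, hash-fragment checks omitted; not hashbrown's actual code):

    #[derive(Clone, Copy, PartialEq)]
    enum Ctrl { Empty, Tombstone, Full }

    // Lookup must probe *past* tombstones: only an Empty slot proves the
    // key was never displaced further along the probe sequence.
    fn miss_position(ctrl: &[Ctrl], start: usize) -> Option<usize> {
        (start..ctrl.len()).find(|&i| ctrl[i] == Ctrl::Empty)
    }

    // Insertion can reuse the *first* non-full slot, which may be a
    // tombstone sitting well before the Empty slot where the miss was
    // decided; hence the second, separate probe.
    fn insert_position(ctrl: &[Ctrl], start: usize) -> Option<usize> {
        (start..ctrl.len()).find(|&i| ctrl[i] != Ctrl::Full)
    }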

------
btym
_> And so our "two loops" are usually just "do two simple SIMD operations".
Not quite so offensive when you say it like that!_

No, it's not quite so offensive, but this doesn't explain why it's the _best_
option. Is there no equally-fast way to write the first-tombstone
implementation with SIMD instructions? The answer seems to be in the sketch of
the implementation, which I'm having trouble understanding.

EDIT: I'm watching the original SwissTable talk now... would it really have
been worse to use 2 bits for empty/tombstone/full/sentinel, and 6 bits for
hash prefix?

EDIT 2: More implementation info. Tombstones are actually rare, because if any
element in your 16-wide chunk is empty, you don't have to create a tombstone.
In the very best case (a perfectly distributed hash function), your hashmap
has to be ~94% full (15 of every 16 buckets occupied) before a chunk can even
lack an empty slot. Because tombstones are so rare, it's better to save the
single bit for extra confidence in the hash prefix.

So, here is my understanding of the implementation and its rationale:

* Every bucket in the backing array has a corresponding byte in a metadata array

* 1 bit of this byte stores whether the bucket is empty, the other 7 bits are for a hash prefix (see the encoding sketch after this list)

* SIMD instructions search 16 bytes at a time, checking: this bucket is not empty, this bucket's key matches the first 7 bits of my key

* Since 16 buckets are checked for emptiness at the same time, you can avoid creating a tombstone for a bucket if any of the other 15 buckets are empty (just set it to empty, i.e. set the first bit to 0)

* This means that tombstones are _very_ unlikely- you'll probably rehash before you get to the load factor where you start seeing tombstones

* Since tombstones are so unlikely, it's more valuable to add an extra bit to the hash prefix than it is to quickly find tombstones
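
Here's a sketch of that byte encoding (constants chosen to match the scheme
above; abseil's and hashbrown's actual values differ in detail):

    // Special (non-full) values have the high bit set.
    const EMPTY: u8     = 0b1111_1111;
    const TOMBSTONE: u8 = 0b1000_0000;

    // A full bucket stores 0b0hhh_hhhh: high bit clear plus a 7-bit
    // fragment of the hash, so a SIMD byte-compare against the query's
    // fragment finds candidate buckets directly.
    fn is_full(ctrl: u8) -> bool {
        ctrl & 0x80 == 0
    }

    fn h2(hash: u64) -> u8 {
        (hash >> 57) as u8 // the top 7 bits of the hash
    }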

My question remains: why can't the first search return the offset of the first
empty bucket? In this loop, why is there not an else branch that saves the
offset?
[https://github.com/abseil/abseil-cpp/blob/256be563447a315f2a...](https://github.com/abseil/abseil-cpp/blob/256be563447a315f2a7993ec669460ba475fa86a/absl/container/internal/raw_hash_set.h#L1652)

~~~
btym
Ok, I got it. They're exactly the same. Either way you'd need to do a second
search, because you're trying to differentiate between 3 states: "probably a
match", "empty", or "deleted". A much better way than stealing a bit from the
hash prefix is using a special value that represents "empty or deleted", and
that's exactly what SwissTable does: [https://github.com/abseil/abseil-cpp/blob/256be563447a315f2a...](https://github.com/abseil/abseil-cpp/blob/256be563447a315f2a7993ec669460ba475fa86a/absl/container/internal/raw_hash_set.h#L253-L278)
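
With an encoding like the sketch further up, that trick is a single comparison
(illustrative constants, not abseil's exact ones):

    // Both special values have the high bit set, so "empty or deleted"
    // (reusable for insertion) is one test, while plain EMPTY remains
    // distinguishable for terminating the lookup probe.
    fn is_empty_or_deleted(ctrl: u8) -> bool {
        ctrl & 0x80 != 0 // true for EMPTY (0xFF) and TOMBSTONE (0x80)
    }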

------
zamalek
All good points in the article.

> it was doing something that was so offensive to people who care about
> collection performance

Hmm. It also helps here to go back to academia. Big O notation doesn't express
coefficients/constants; it only captures the dominant growth term.[1] The
Wikipedia page has a good explanation as to why.

Opinion: coefficients/constants are, however, useful if you're running over a
network or some other latency-bound operation.

[1]:
[https://en.wikipedia.org/wiki/Big_O_notation#Properties](https://en.wikipedia.org/wiki/Big_O_notation#Properties)

~~~
dbaupp
The whole point of the article is that constant factors work in Hashbrown's
favour.

------
the8472
What's the memory efficiency compared to the previous implementation? AIUI
robin hood hashing with backwards shift deletion could achieve rather high
load ratios, and thus keep memory footprint small. What I read about tombstone
based open addressing suggests that it requires a lower load factor and thus
more memory.

------
norswap
Another alternative: Robin Hood Hashing

[http://norswap.com/robin-hood-hashing-jvm/](http://norswap.com/robin-hood-hashing-jvm/)

(Might actually be the same technique, it's not quite clear!)

~~~
erichdongubler
Uh...this is what Rust uses currently. ;)

------
jeanmichelx
So how does this degrade for bigger hash tables? Surely the two-loops
implementation is less efficient, since your cache is trashed by the time you
do the second lookup.

~~~
mbrubeck
This varies not with the size of the whole hash table, but with the distance
from any given index to the first empty bucket. By keeping the load factor
constant, you can grow the hash table as much as you want without degrading
the expected performance.

~~~
jeanmichelx
Good point, thanks!

------
linsomniac
TL;DR: You first need to probe past the hash "tombstones" to check that the
key isn't already present, and then find the actual insertion point, which may
be far away.

Raymond Hettinger (HN commenter and all around Smart Dude (tm)) has a good
talk from PyCon 2017 about the evolution of Python dictionaries that goes into
tombstones and so much more:
[http://www.youtube.com/watch?v=npw4s1QTmPg](http://www.youtube.com/watch?v=npw4s1QTmPg)

------
wyldfire
> For instance, when compiled with sse2 support it can search 16 buckets at
> once.

Does Neon get similar order of magnitude benefit?

------
aswanson
I wonder if there is any research into hardware architectures optimized for
Rust.

~~~
devit
The things Rust wants more than other languages are support for fast array
bounds checks and fast integer overflow checks.

Unfortunately none of the current popular architectures seem to have either.

~~~
hyperman1
x86 seems reasonable to me:

Fast array bounds checking, with the lower bound handled by unsigned integer
math, is:

    cmp index,bound
    ja crash_handler

Overflow handling is mostly:

    jo crash_handler

That's 1 or 2 instructions and the jump has perfect prediction. I'd assume the
cost is negligible compared with jump mispredicts and cache misses.
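
For reference, a sketch of the Rust source that gives rise to those checks
(illustrative function, not from any particular codebase):

    fn demo(xs: &[u32], i: usize, a: i32, b: i32) -> u32 {
        // Indexing emits a bounds check: one unsigned compare-and-branch,
        // since `i >= xs.len()` also covers would-be negative indexes.
        let elem = xs[i]; // panics if i is out of bounds

        // Debug builds check `a + b` for overflow (the `jo` above);
        // `checked_add` requests the check explicitly in any build.
        let sum = a.checked_add(b).expect("overflow");

        elem.wrapping_add(sum as u32)
    }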

I read somewhere that the main problem is LLVM not being quite capable of
optimizing these decently: it has to handle all these extra jumps, which
creates lots of extra basic blocks.

~~~
simcop2387
I seem to recall that, at least for jo, there are bad pipelining implications
because of the dependency on the flag register; no idea about the bounds check
(there might be a better way).

~~~
temac
I don't see why JO would be any different from other Jcc, especially since JL
is SF ≠ OF. Maybe you were thinking of the problem with INC/DEC and CF, but
even then that's a problem of INC/DEC, not of Jcc.

~~~
nkurz
It's not usually a big difference, but on modern Intel there is some
difference in performance between the different CMP/Jcc options. The more
common ones will "fuse" with the CMP and be executed as a single µop, but the
rarer ones (like JO and JS) do not fuse with CMP, and thus can add a cycle of
delay (and have the overhead associated with executing another µop). The
optimization is called "macro-op fusion". Details here
[https://en.wikichip.org/wiki/macro-operation_fusion](https://en.wikichip.org/wiki/macro-operation_fusion)
and here
[https://www.agner.org/optimize/microarchitecture.pdf](https://www.agner.org/optimize/microarchitecture.pdf)
(pages 108 and 125).

~~~
khuey
Fixing that wouldn't seem to require any _architectural_ changes in x86,
though; it's just that Intel hasn't cared enough about JO and friends to
optimize them this way.

