
Designing a fast hash table - vasili111
http://www.ilikebigbits.com/blog/2016/8/28/designing-a-fast-hash-table
======
markpapadakis
I recommend the Robin Hood open addressing scheme. It's particularly trivial to
implement, and because it relies on linear probing, it's memory-cache friendly.

See more here: [http://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/](http://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/)
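
For reference, here is a minimal sketch of what a Robin Hood insert can look
like, assuming an open-addressed, power-of-two-sized table that stores a probe
distance per slot (names and layout are illustrative, not taken from either
post):

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Hypothetical slot layout: each occupied slot remembers how far it sits
    // from its ideal bucket ("probe distance"); dist == -1 marks an empty slot.
    struct Slot {
        std::uint64_t key   = 0;
        std::uint32_t value = 0;
        std::int32_t  dist  = -1;
    };

    // Insert, assuming the table size is a power of two and there is always a
    // free slot (growth and duplicate-key handling omitted for brevity).
    void robin_hood_insert(std::vector<Slot>& table, std::uint64_t hash,
                           std::uint64_t key, std::uint32_t value) {
        const std::size_t mask = table.size() - 1;
        std::size_t i = hash & mask;
        Slot incoming{key, value, 0};
        while (true) {
            Slot& slot = table[i];
            if (slot.dist < 0) {              // empty slot: take it
                slot = incoming;
                return;
            }
            if (slot.dist < incoming.dist) {  // current occupant is "richer":
                std::swap(slot, incoming);    // steal its slot and keep probing
            }                                 // with the displaced element
            i = (i + 1) & mask;               // linear probing
            ++incoming.dist;
        }
    }

The swap is what keeps probe lengths short and even: no element ends up sitting
much farther from its home bucket than its neighbours.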

Jeff Preshing has written many interesting blog posts about the subject. For
example, see his post on Leapfrog Probing: [http://preshing.com/20160314/leapfrog-probing/](http://preshing.com/20160314/leapfrog-probing/)

------
panic
Yeah, most (all?) implementations of unordered_set are really slow. My
understanding is that the iterator invalidation behavior specified by the
standard forces implementations to use chaining, which means allocating
constantly on insertions and chasing pointers on every lookup.
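
Roughly the shape this pushes implementations into (a sketch, not any
particular standard library's actual internals):

    #include <cstddef>
    #include <vector>

    // Every insert heap-allocates a Node, and every lookup walks a linked
    // list, so cache misses dominate.
    struct Node {
        std::size_t hash;
        int         key;
        int         value;
        Node*       next;   // the pointer chased on every collision
    };

    struct ChainedTable {
        std::vector<Node*> buckets;  // one list head per bucket
    };

Open-addressing schemes avoid both the per-insert allocation and the pointer
chase by storing the elements directly in a flat array.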

~~~
amelius
Yes, I guess the designers of the standard library anticipated that programmers
could easily make mistakes with the lifetime of iterators, and this was their
best way to safeguard against it. Rust enforces the constraints on iterator
usage, so this is a clear example of where Rust would be a superior language.

~~~
gpderetta
My guess is that the hash_map of the original STL had those iterator
guarantees. While hash_map wasn't standardized, it was de facto available in
many standard libraries. When the committee standardized unordered_map, they
tried to make it as much of a drop-in replacement for hash_map as they could,
and subtly different invalidation rules would have been extremely hard to
catch.

~~~
Ono-Sendai
I've heard it was due to std::unordered_map trying to be a drop-in replacement
for std::map.

------
munificent
> The advantage of using power-of-two is that we don't need to look for prime
> numbers, and we don't need to do an expensive modulo operation (we just need
> a dirt-cheap bit mask).

I'd heard that point about using a mask instead of modulo to fit the hash into
the table range before, but never paid much attention to it. I figured that
with all of the other work going on in a hash table, the difference between the
two arithmetic operations would be negligible.

Right now, I'm working on a book on programming language interpreters. That
includes implementing a hash table from scratch. I used "%" because I felt it
made the behavior easier for readers to understand.

After I got it all up and running, I discovered that some microbenchmarks were
spending something like 10% of their total execution time on that "%". I
changed it to use a mask and "&", and it made a dramatic performance
difference.

So I think I'll be explaining both approaches in the book. The former for
clarity and the latter as an example of an optimization.
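
For concreteness, the two ways of fitting a hash into the table range look
something like this (the mask version only works when the capacity is a power
of two; the function names are made up):

    #include <cstddef>
    #include <cstdint>

    // Arbitrary (e.g. prime) capacity: general, but pays for an integer division.
    std::size_t bucket_mod(std::uint64_t hash, std::size_t capacity) {
        return static_cast<std::size_t>(hash % capacity);
    }

    // Power-of-two capacity: a single AND, because capacity - 1 has all of the
    // low bits set (e.g. capacity 8 -> mask 0b111).
    std::size_t bucket_mask(std::uint64_t hash, std::size_t capacity) {
        return static_cast<std::size_t>(hash & (capacity - 1));
    }

The integer division behind "%" is what shows up in profiles; the "&" compiles
down to a single cheap instruction.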

------
lorenzhs
The post is very light on details about the measurements, and only compares to
libc++'s std::unordered_{map,set} (no version number provided). There is no
mention of space usage.

It's well known that std::unordered_map is slow (see panic's comment). Some
more worthy opponents would be Google's dense_hash_map [1] or sparsepp [2].

[1]
[https://github.com/sparsehash/sparsehash](https://github.com/sparsehash/sparsehash)

[2]
[https://github.com/greg7mdp/sparsepp](https://github.com/greg7mdp/sparsepp)

------
faragon
Some questions:

- Which CPU was used? Which compiler optimization flags? (2 million inserts
per second seems slow for a std::unordered_map using 64-bit data.)

- How do HashSet read operations perform vs. std::unordered_map?

- How does external fragmentation handling in HashSet compare to
std::unordered_map? (e.g. writing and deleting random keys, continuously)

~~~
faragon
Amendment: I meant internal fragmentation, not external fragmentation
(allocated hash buckets left empty after deleting elements).

------
mamcx
A bit off-topic, but I have wondered lately: if a new language were started
today, what would be the very best (overall) foundations for it?

For example, I read here
([https://cr.yp.to/critbit.html](https://cr.yp.to/critbit.html)) that crit-bit
tries could be a better base structure for a language (i.e., like dict and list
are for Python).

So, if we were starting today, what hash would we use? What would be a better
default structure (my understanding is that hash maps are/were the way to go)?

~~~
chubot
I'm generally a big fan of DJB, but I see a few problems with this scheme.

First, it seems like the prefix-free requirement means that the strings can't
contain internal NUL bytes. In Python you can store 'a\0' as well as 'a\0b' in
a hash table -- how do you handle this with crit-bit trees?

Second, in Python, unlike JavaScript, arbitrary types can be hashable. Tuples
are hashable. How do you store ('a', 0) and ('b', 1) as keys in a crit-bit
tree?

~~~
eutectic
Just store a bit (for a set) at each internal node, denoting whether a string
ends there. For a map you also need a value pointer.

Hashing keys is also easy: add a pointer to a linked list of key-value pairs
to each leaf node. To save space you could even use a flag bit and then a
pointer to a single key-value pair, or a linked list in case of collision.

Note that you can store tuples directly as the concatenation of their fields;
no need to use hashing.
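
A rough sketch of the node layout the first paragraph implies (field names
invented for illustration):

    #include <cstdint>

    // Crit-bit internal node carrying the extra information described above.
    struct CritbitNode {
        std::uint32_t bit_index;      // position of the critical bit
        CritbitNode*  child[2];       // subtree for bit == 0 and bit == 1
        bool          key_ends_here;  // set membership: some key ends exactly here
        void*         value;          // value pointer, only meaningful for a map
    };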

~~~
socmag
What are you talking about? How do tuples preclude the requirement for a way to
hash?

That's a very naive way to map, using the delimiter of the tuple parts. You
might have interior elements that are 10k long.

~~~
eutectic
And strings can also be long.

------
Ono-Sendai
There's another design decision not considered in this article: using a
separate bucket state table, or using sentinel values inside the buckets to
represent 'empty', etc. I've measured sentinel values to be faster.
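
To make the two options concrete, a sketch (the reserved sentinel values here
are arbitrary):

    #include <cstdint>
    #include <vector>

    // Option A: a separate state array alongside the keys; every probe
    // touches both arrays.
    enum class SlotState : std::uint8_t { Empty, Occupied, Tombstone };

    struct TableWithStates {
        std::vector<SlotState>     states;
        std::vector<std::uint64_t> keys;
    };

    // Option B: sentinel key values stored in the buckets themselves; a probe
    // touches a single array, but two key values must be reserved and can
    // never be inserted by the user.
    struct TableWithSentinels {
        static constexpr std::uint64_t kEmpty     = ~std::uint64_t{0};
        static constexpr std::uint64_t kTombstone = ~std::uint64_t{0} - 1;
        std::vector<std::uint64_t> keys;
    };

The sentinel version touches a single array per probe instead of two, which is
presumably where the measured difference comes from.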

------
socmag
It is an interesting idea.

There just seems to be a lot wrong with the presentation of the benchmark
comparisons...

Specifically, where is a full suite of comparisons against other libraries for
time and space, insert and retrieval? Hell, I make those graphs for other
people's code, let alone my own.

Usually when you come up with a new algorithm, you're so over the moon that you
show how it compares all around. I'm not seeing that here, and that is a red
flag.

I'll definitely give it a whirl though, as it sounds promising, although I'm
personally more interested in order-preserving hashes.

------
bogomipz
The author states:

"They allow you to insert, erase and look up things in amortized O(1)
operations (O(√N) time)." Are they staying that the average time is (O(√N) and
that is amortized to ~ O(1)?

I assume (O(√N) is the same as .5 multiplied by N? Is that correct? I don't
think I've seen a square root in Big O notation before.

~~~
knappa
No, O(√N) is O(N^0.5).

~~~
bogomipz
Sorry, yes of course, that's what I meant. But O(1) is the amortized cost, with
the average being O(N^0.5).

~~~
njaremko
If you read the linked blog post about cache speed
([http://www.ilikebigbits.com/blog/2014/4/21/the-myth-of-ram-part-i](http://www.ilikebigbits.com/blog/2014/4/21/the-myth-of-ram-part-i)),
it'll make a lot more sense.

~~~
bogomipz
Oh thanks, also a good read. The author in this post, however, states:

"That I use Big O to analyze time and not operations is important."

This strikes me as an odd statement, though, as Big O is generally meant to
describe run time or space, not operations. What am I missing?

~~~
njaremko
Big O is really just saying that a function f(n) belongs to O(g(n)) if 0 <=
f(n) <= c*g(n) for some constant c > 0 and all sufficiently large n.

When he says he's using big O for time and not operations, he's saying that n
is time (as opposed to operations).

~~~
bogomipz
Sure, I just thought it was an odd statement to be explicit about, as it's the
usual case to have it express time and not operations.

~~~
lorenzhs
Not quite. You're usually measuring time _in a specific model_. Most often,
that's the RAM model, where memory access and arithmetic operations take
constant time (defined as one time step). The author is using a different
model, but doesn't formally define it and instead tries to make it fit wall-
clock time. That requires far more elaborate modelling, ends up being
extremely complicated, and usually doesn't yield all that much additional
insight. That's why computer scientists usually use the RAM model - it
captures the most important aspects of computers, but omits others as they're
hard to model accurately. The memory hierarchy is one of these.

Some models where such things are studied in more detail include the
_external memory model_ (which has a two-level hierarchy - cache and main
memory, or main memory and disk) and _cache-oblivious models_.

Most of the time, you don't need that extra information, though, and the RAM
model suffices.

~~~
bogomipz
Thanks for clearing that up. I'm catching up on the other discussion now : )

------
socmag
The only hash table worth competing with is Cedar. It could be made even
better, but the standard distribution is pretty damn awesome as it is, IMHO,
and I've tested a lot of them.

