
Myths about Hash Tables - zengr
http://hughewilliams.com/2012/10/01/five-myths-about-hash-tables/
======
antirez
Inside the Redis distribution you can find "dict.c", a very easy to understand
hash table implementation that uses incremental rehashing: when it needs to
switch to a bigger or smaller table, the migration happens incrementally,
driven in three ways:

1) At every operation you perform there is a rehashing step.

2) You have a function that you can call to perform a rehashing step.

3) There is a higher level function where you can say "rehash for N
milliseconds" in case you have some idle time to spend for a good cause inside
your program.

Another uncommon feature of dict.c is that it supports the "pick a random
element" operation with guaranteed O(1) behaviour.

EDIT: Now that I look at the source, it's clear that while the code is easy to
follow, a design section should be present in the dict.c top comment. I'll
make sure to add one, but long story short:

* We use both chaining and incremental rehashing. This means there is no need to rehash ASAP when the hash table reaches the max % of elements allowed. At the same time, thanks to incremental rehashing, there is no need to block.

* We support safe and unsafe iterators: safe iterators block the rehashing while they are active, so you have the feeling you are accessing the hash table in a read-only way. Unsafe iterators are ok for looping even while an incremental rehashing is in progress, but it is not ok to touch the hash table while iterating with an unsafe iterator (otherwise the iterator no longer guarantees to return all the elements, or to avoid duplicate elements).

* The hash table is rehashed both ways: if it gets too full, or if it gets too empty. This guarantees that the table's fill percentage always stays between a min and a max, which in turn guarantees that our random element function is O(1) in the average case.

* When we rehash, a second table is created, and entries are moved incrementally from one to the other. While there are two tables, all lookup / delete operations are run against both tables, since an element can be in either the old or the new one. Insertions, instead, happen only in the new table.

Well, these are the main ideas I think.
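
To make the two-table scheme concrete, here is a minimal sketch in C of a
lookup during incremental rehashing. It follows the ideas above, but the types
and the toy hash function are mine, not the actual dict.c code:

    #include <stddef.h>
    #include <string.h>

    typedef struct entry {
        const char *key;
        void *val;
        struct entry *next;            /* separate chaining */
    } entry;

    typedef struct dict {
        entry **buckets[2];            /* [0] = old table, [1] = new table */
        unsigned long size[2];         /* bucket counts, powers of two */
        long rehashidx;                /* -1 = not rehashing, else next old bucket */
    } dict;

    static unsigned long hash(const char *s) {   /* toy stand-in hash */
        unsigned long h = 5381;
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h;
    }

    /* One incremental step (way #1 above): drain a single old bucket.
     * Only called while a rehash is in progress (rehashidx >= 0). */
    static void rehash_step(dict *d) {
        entry *e = d->buckets[0][d->rehashidx];
        d->buckets[0][d->rehashidx] = NULL;
        while (e) {
            entry *next = e->next;
            unsigned long i = hash(e->key) & (d->size[1] - 1);
            e->next = d->buckets[1][i];
            d->buckets[1][i] = e;
            e = next;
        }
        if ((unsigned long)++d->rehashidx >= d->size[0]) {
            d->buckets[0] = d->buckets[1];   /* promote the new table; real */
            d->size[0] = d->size[1];         /* code would free the old arrays */
            d->buckets[1] = NULL;
            d->size[1] = 0;
            d->rehashidx = -1;
        }
    }

    /* While rehashing, lookups consult BOTH tables, since an element may
     * still live in the old one; inserts would go only to the new table. */
    void *dict_find(dict *d, const char *key) {
        if (d->rehashidx != -1) rehash_step(d);
        int ntables = (d->rehashidx == -1) ? 1 : 2;
        unsigned long h = hash(key);
        for (int t = 0; t < ntables; t++) {
            entry *e = d->buckets[t][h & (d->size[t] - 1)];
            for (; e; e = e->next)
                if (strcmp(e->key, key) == 0) return e->val;
        }
        return NULL;
    }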

~~~
aaronblohowiak
<https://github.com/antirez/redis/blob/unstable/src/dict.c>

~~~
antirez
Thanks, perhaps the design is even more evident here ->
<https://github.com/antirez/redis/blob/unstable/src/dict.h>

------
tptacek
A good article, but two quick thoughts:

1\. With a good hash function, it's hard to accidentally stumble on the worst-
case O(n) lookup cost. But against an adversary, it is usually very easy to
have that problem. This has been a well-known security issue for over a
decade, and over the last 4 years we're starting to see more exploitation of
it.

2\. The "little trick" at the end of the document that bumps matching items to
the head of their bucket chain will usually provide only a marginal speedup
(if many of your lookups hit chains longer than 1, your table is too small or
your hash function is broken), and that optimization comes with a cost: lookup
now modifies the underlying hash table, which is sometimes very undesirable.
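
For reference, here is roughly what that move-to-front trick looks like in C
(with hypothetical types); the write through *bucket is exactly the mutation
being objected to:

    #include <stddef.h>
    #include <string.h>

    typedef struct entry {
        const char *key;
        void *val;
        struct entry *next;
    } entry;

    void *find_move_to_front(entry **bucket, const char *key) {
        entry *prev = NULL;
        for (entry *e = *bucket; e != NULL; prev = e, e = e->next) {
            if (strcmp(e->key, key) == 0) {
                if (prev) {                 /* unlink, re-link at the head */
                    prev->next = e->next;
                    e->next = *bucket;
                    *bucket = e;            /* lookup has written to the table */
                }
                return e->val;
            }
        }
        return NULL;
    }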

~~~
kibwen
Do there exist good hash functions that are resistant or immune to the attacks
you mention? For example, does Python 3.3's new hash randomization provide
adequate protection?

~~~
pjscott
This is what SipHash was made for: it's a cryptographic MAC that's competitive
in speed with other hash functions meant for hash table use. The paper
describes its design, and the weaknesses of competitors like Python's
randomized hashing:

<https://www.131002.net/siphash/siphash.pdf>

Some code for it here:

<https://www.131002.net/siphash/>

~~~
haberman
Interesting, though at 140 cycles to hash 16 bytes, it's much slower than
state-of-the-art hash functions, which can hash 16 bytes in just over 16
cycles. If you seed your non-cryptographic hash function with a
cryptographically random number, you can get the best of both worlds: fast
hashing and collision resistance against an adversary.
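
A minimal sketch of that scheme, assuming a startup-time seed drawn from
/dev/urandom and folded into FNV-1a as a stand-in for whatever fast
non-cryptographic hash you use (note the replies below, which show that for
some hash families collisions exist regardless of the seed):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static uint64_t table_seed;

    /* Draw a per-process seed from a cryptographically random source. */
    void init_table_seed(void) {
        FILE *f = fopen("/dev/urandom", "rb");
        if (!f || fread(&table_seed, sizeof table_seed, 1, f) != 1) {
            fprintf(stderr, "no entropy source\n");
            exit(1);
        }
        fclose(f);
    }

    /* FNV-1a with the random seed folded into the offset basis; any fast
     * seedable non-cryptographic hash could take its place. */
    uint64_t seeded_hash(const void *data, size_t len) {
        const unsigned char *p = data;
        uint64_t h = 14695981039346656037ULL ^ table_seed;
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 1099511628211ULL;
        }
        return h;
    }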

~~~
pbsd
Not really. The SipHash page contains proof-of-concept code that breaks common
good non-cryptographic randomized hashes, i.e., it generates collisions
_regardless_ of the chosen seed.

Regarding performance, SipHash is essentially as fast as Python's randomized
hash, and considerably faster for long inputs (>16 bytes), cf.
<http://bench.cr.yp.to/results-auth.html#amd64-sandy> .

~~~
haberman
That code appears to rely on having direct access to the hash function's
output; this seems very unlikely to be leaked to an attacker. More interesting
would be code that recovers the secret with only access to ordered iteration
on the table, which is more likely to leak.

I don't know anything about Python's hash function, but all of those
benchmarks are still an order of magnitude slower than CityHash (for example),
which peaks at 0.16 cycles/byte.

I would be interested to see SipHash put through SMHasher, which is the most
comprehensive hash function test suite I know of:
<http://code.google.com/p/smhasher/>

~~~
dchest
Seed recovery (which requires access to the hash function) applies to the
Python hash. For CityHash and MurmurHash (2 and 3) there are seed-independent
collisions.

Of course it passes SMHasher with a perfect score -- it's a _cryptographic_
PRF.

~~~
haberman
> For CityHash, MurmurHash (2 and 3) there are seed-independent collisions.

Are you referring to the "Advanced hash flooding" section of the paper, or am
I missing this in another place? That analysis appears to rely on the hash
function being highly linear/periodic (ie. you can always add "l" to get back
the same hash value). I don't think that City, Spooky, or Murmur are nearly
this linear. For example, City is based on CRC32 (ie. polynomial division).

Don't get me wrong, I'm impressed that SipHash is so fast for a cryptographic
hash. But modern non-cryptographic hash functions are so fast (especially when
they're CPU-accelerated, like City is on Intel thanks to the CRC32
instruction) that I don't think SipHash can claim to be the obvious choice for
all hash tables.

I am interested to know though if the random seed + City/Spooky/Murmur
approach really does have practical hash flooding attacks.

~~~
pbsd
It's pretty quick to generate a lot of collisions, and the set of inputs you
get will collide independently of the seed. Look (CityHash64):

    $ time ./citycollisions 1000000 > /dev/null

    real    0m0.904s
    user    0m0.439s
    sys     0m0.008s

    $ ./citycollisions 3
    128-bit key 741b848b0065aec04bb25b26ca2c3e9
    CityHash64( 8e69324ad2a005ff2148534148202020, 16 ) = d3290c3d600b8add
    CityHash64( 7ae31886221136ba2148534148202021, 16 ) = d3290c3d600b8add
    CityHash64( a9a6b9c6888e94ff2148534148202022, 16 ) = d3290c3d600b8add

    $ ./citycollisions 3
    128-bit key 5e2a23015c89da98172fbcb77ad98468
    CityHash64( 8e69324ad2a005ff2148534148202020, 16 ) = 59d0b041b16594b
    CityHash64( 7ae31886221136ba2148534148202021, 16 ) = 59d0b041b16594b
    CityHash64( a9a6b9c6888e94ff2148534148202022, 16 ) = 59d0b041b16594b

~~~
haberman
Wow, collisions against the best non-cryptographic hash functions that are
independent of seed... color me impressed! This does then seem to strongly
suggest that a cryptographic hash function is the only way to defend against
hash flooding attacks. Props to the SipHash authors, this is some ground-
breaking work.

------
kabdib
I love hash tables. They're my favorite data structure, maybe.

A story:

Back in the 70s it appeared that every major state university had to do their
own version of PL/1. At the University of Maryland it was called PL/UM. I
don't know why, maybe it was a testosterone thing ("IBM took a zillion man-
years to write PL/1, but look what WE did with three starving grad students, a
half-soused half-professor and a keypunch operator!") (Yeah, another version
of PL/1, and that doesn't necessarily make the world a better place. Anyway
--)

PLUM was notorious for being slow (I don't know about bugs, but they were
probably there, too), and it was pretty common for students using PLUM to have
to grovel for more seconds of CPU time.

Five or six years into PLUM's use as a student language, someone discovered
that all of the symbols were being hashed into the same bucket.

PLUM got amazingly fast (well, faster, at least) after that. But by then
anyone with a clue had managed to scam an account on one of the bsd Vaxen and
write their stuff in C.

Don't take hash tables on faith. Measure, measure, measure.

------
recursive
The first two myths aren't really myths.

In "The worst case is terrible", the author goes on to explain that the worst
case is unlikely. That may be true, but it's still O(n).

In "Hash tables become full, and bad things happen", if I'm reading it
correctly, the author shows how you can make each hash table bucket a linked
list so that each bucket never becomes full. But what happens when you have
1000 items in each bucket? If you keep adding elements and you want to keep
reads fast, sooner or later you're going to have to add more buckets, at which
time you need to re-hash every element in the table.

I'm a fan of hashtables, but there's some rather suspicious handwaving in this
article.

~~~
marshray
_goes on to explain that the worst case is unlikely. That may be true, but
it's still O(n)._

It's not O(n), it's more like O(1 + max(0, n - m)) where n is the number of
elements in the table and m is the number of buckets. The table can grow up to
about n == m and it's still approximately O(1).

 _But what happens when you have 1000 items in each bucket?_

Don't do that. Rather, if you do that, then you don't get to complain that the
performance of your data structure is that of the fallback strategy instead of
a properly sized hashtable.

You should have begun executing your growth strategy back at about 0.5 items
per bucket.

 _If you keep adding elements and you want to keep reads fast, sooner or later
you're going to have to add more buckets,_

Sooner or later you're going to have to put more RAM in the machine too. But
in the meantime, your trusty old hash-based data structure will give you
reliable service.

 _at which time you need to re-hash every element in the table._

If latency is a concern this re-hashing can be done incrementally.

If the hashing of elements itself is expensive, you can store the resulting
hash value for each element.

Note that there are three zero bits at the bottom of each 64-bit-aligned
pointer value. You could use these to store three additional bits of the hash
value, allowing you to grow your table by a factor of 8 without rehashing.
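
A sketch of that pointer-tagging trick, assuming entries come from an
allocator that gives at least 8-byte alignment (as malloc does on typical
64-bit platforms):

    #include <stdint.h>

    typedef struct entry entry;       /* whatever your table stores */

    /* Stash hash bits 0..2 in the low bits of an 8-byte-aligned pointer. */
    static uintptr_t tag_ptr(entry *e, uint64_t h) {
        return (uintptr_t)e | (h & 0x7);
    }

    static entry *untag_ptr(uintptr_t p) {
        return (entry *)(p & ~(uintptr_t)0x7);   /* mask the tag back off */
    }

    static unsigned tag_bits(uintptr_t p) {
        return (unsigned)(p & 0x7);   /* the three extra hash bits */
    }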

~~~
tptacek
_It's not O(n), it's more like O(1 + max(0, n - m)) where n is the number of
elements in the table and m is the number of buckets. The table can grow up to
about n == m and it's still approximately O(1)._

Isn't that just the long way of saying "O(n)"? It's a _worst-case bound_.

~~~
marshray
I'm guilty of sometimes using big-O notation when discussing amortized average
costs as well.

It seems nonsensical to discuss worst-case behavior of hash-based data
structures since we can (and should) make it arbitrarily unlikely.

In other words, if O(n) is the limit as n goes to infinity, then the
probability of actually encountering worst-case behavior on a hashtable
operation is 0 (assuming a non-broken hash function). Near-worst-case behavior
becomes similarly improbable.

This is different from many algorithms, like, say, classic quicksort, which
have worst-case behavior on values that naturally appear in practice.

~~~
rm999
It depends on the application of the hash table, but malicious agents could
'attack' code by purposefully creating collisions if the author isn't careful.
An algorithm expected to run in O(1) that instead runs in O(n) could be
catastrophic.

~~~
marshray
But the solution to that problem (using a cryptographic hash function
randomized with a secret) doesn't change the complexity of the algorithm.

~~~
rm999
I'm just pointing out that the worst case time complexity is an issue that
can't always be cast aside. Aren't cryptographic hash functions generally
slower?

------
bunderbunder
Just to throw it out there, here's a 6th myth to consider:

> Hash tables are fast

When you're dealing with lots of keys, they're usually the best option out
there. But if you're dealing with a small number of keys, a simple array can
often be faster. A hash table lookup involves a constant amount of overhead
resulting from calculating the key, and possibly also added indirection
depending on whether the implementation uses open addressing. If the amount of
time that would be spent on a simple linear or binary search of the array is
less than (or even not much more than) the time that would be spent on that
overhead, then the array will be faster.

Of course, the point where the hash table overtakes the array depends on a lot
of factors. As usual the only sure-fire way to know which option is better in
a particular situation is to measure.
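
As an illustration, the small-N alternative is just a flat array and a linear
scan; it skips the hashing overhead and stays in one cache-friendly block of
memory. Whether it actually wins is workload-dependent, so measure:

    #include <stddef.h>
    #include <string.h>

    struct kv { const char *key; int val; };

    /* Linear scan over contiguous memory; fine when n is small. */
    int *small_map_find(struct kv *map, size_t n, const char *key) {
        for (size_t i = 0; i < n; i++)
            if (strcmp(map[i].key, key) == 0)
                return &map[i].val;
        return NULL;
    }

    /* Usage:
     *   struct kv fruits[] = { {"apple", 1}, {"pear", 2}, {"plum", 3} };
     *   int *v = small_map_find(fruits, 3, "pear");   // -> &fruits[1].val
     */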

~~~
fusiongyro
I recently read a story by one of Wirth's students, reminiscing about working
on the Oberon compiler. They had several important principles that drove
development; one was something to the effect that anything you add must pay
for itself or improve the speed of compilation -- the benchmark being the
whole Oberon system. The students had implemented a very fancy data structure,
I think a balanced tree, to use for register allocation. Wirth found out and
ordered them to remove it and replace it with a linked list. They groaned, but
complied -- and the compiler was both smaller and faster as a result, despite
worsening the worst case.

~~~
dchest
"I still vividly remember the day that Wirth decided to replace the elegant
data structure used in the compiler’s symbol table handler by a mundane linear
list. In the original compiler, the objects in the symbol table had been
sorted in a tree data structure (in identifier lexical order) for fast access,
with a separate linear list representing their declaration order. One day
Wirth decided that there really weren’t enough objects in a typical scope to
make the sorted tree cost-effective. All of us Ph.D. students were horrified:
it had taken time to implement the sorted tree, the solution was elegant, and
it worked well – so why would one want to throw it away and replace it by
something simpler, and even worse, something as prosaic as a linear list? But
of course, Wirth was right, and the simplified compiler was both smaller and
faster than its predecessor."

<http://www.ics.uci.edu/~franz/Site/pubs-pdf/BC03.pdf>

~~~
fusiongyro
Yes, that's it! Thank you! I apparently misremembered the purpose, but at
least I got the sorted tree part right. :)

------
CoolGuySteve
Hash tables are fantastic, but:

Depending on the hash and hash table implementation, large hash tables where
elements are spread across several pages in memory can pathologically thrash
your caching. You can get similar behaviour to randomly accessing elements in
a large array. A cache miss from L2 to RAM is extremely slow and can be about
a million times slower if you have to page fault all the way into swap.

This is a good example of a case where the standard model of computer that's
used in CS proofs starts to fall over because it ignores hierarchical memory
and cache locality.

Hand-wavily, accessing every element once in a large naive hash table will
give you O(n) page faults on average, whereas accessing every element
sequentially in a vector will give you O(n/d) page faults, where d is the
number of elements that fit in a page. While asymptotically similar, in real
life that constant d can be gigantic.

That being said, I'm not sure how the C++ unordered_map handles this; it's
just something I experienced when running cachegrind on someone's hand-rolled
implementation in the past. (I 'fixed' it by making the templated class store
pointers, and then locality was greatly improved because of malloc's default
behaviour.)

~~~
marshray
_A cache miss from L2 to RAM is extremely slow_

Yes, but a properly-sized hash table will result in an average of 1 to 2 cache
misses per lookup regardless of how many elements. This is far better than
what you get with other structures, except perhaps in specialized cases that
happen to follow tree traversal.

 _and can be about a million times slower if you have to page fault all the
way into swap._

Disk based data requires specialized algorithms.

 _accessing every element once in a large naive hash table will give you O(n)
page faults on average whereas accessing every element sequentially in a
vector will give you O(n/d) page faults where d is your page size. While
asymptotically similar, in real life that constant d can be gigantic._

Sure, a vector is better when you only ever access elements in a single
sequential order and don't add or remove elements anywhere except the end.

 _That being said, I'm not sure how the C++ unordered_map handles this_

C++ defines a method of iterating over all the elements of an unordered_map. I
think a typical implementation will just treat the hash memory as a vector and
return values from the occupied buckets.

------
chojeen
> Engineers irrationally avoid hash tables...

Really? That seems about as likely as the statement 'carpenters irrationally
avoid using screws'.

~~~
aidenn0
Indeed, I find academia is averse to hashtables (because of the poor worst-
case performance) but engineers think they are the best thing since sliced
bread.

~~~
marshray
This "worst-case" issue seems silly to me.

If all your elements magically hash to the same bucket somehow and you end up
having to make a list or a tree out of them, you're still no worse off than if
you used a list or a tree in the first place.

~~~
aidenn0
Most hash-table implementations either re-hash or use an O(n) lookup
structure. And replace "magically" with "a malicious attacker" and you see it
does come up more often.

------
brey
Misses the point about B-trees.

They're not just 'good for inexact matches' - they're optimised for things
like database indexes, to reduce the number of seeks over a block-based device
(eg a filesystem) required for an index lookup.

~~~
tptacek
External tables are an entirely orthogonal issue to this discussion. He's
comparing hash tables to in-memory balanced binary trees.

~~~
brey
The third myth in the article compares B-trees to hash tables:

"Trees are better In general, trees aren’t faster for searching for exact
matches. They’re slower, and here’s peer-reviewed evidence that compares
B-trees, splay trees, and hashing for a variety of string types. So, why use a
tree at all? Trees are good for inexact matches — say finding all names that
begin with “B” — and that’s a task that hash tables can’t do."

~~~
alayne
It looks like the paper doesn't refer to B-trees. It refers to binary search
trees.

~~~
brey
and so the article confusing B-trees with binary search trees doesn't exactly
increase confidence here ...

~~~
tptacek
What "confidence" do you need for a straightforward comparison of hash tables
and binary trees? These aren't Apple rumors; this is CS 101.

~~~
brey
my point is merely that as written, the article is confusing.

to paraphrase: "hashtables are faster. here's some research showing b-trees
are slower than hashtables. why would you ever use a tree? prefix matching,
that's why."

which misses the point that the primary reason for using a B-tree as a general
purpose database index is not about prefixes, it's about block devices (prefix
matching is an important reason, but it's not the main one)

if the purpose of the article was to ignore implementation and spinning disks
and just look at in-memory performance, then don't mention B-trees. they're an
implementation-specific optimisation, and it just confuses.

------
jchonphoenix
I'm not really sure this article explains any myths about hashtables at all.

Aren't all of these statements obvious if you actually understand what a
hashtable is?

------
fusiongyro
The reason functional programmers prefer trees is because it's much easier to
make them persistent (in the functional sense).

------
afc
Seems somewhat vague; I think it could be a lot more accurate. I'm a bit
surprised to see it here.

It claims that it's a myth that "trees are better" than hash tables and just
describes trees as better for "finding inexact matches", as in the example
"finding all names that begin with B". That's too vague. It's misleading
because:

1\. Trees can't just answer any random vague questions, such as "Find all
names that have a P anywhere". That doesn't really describe the advantages of
trees. I think a more accurate formulation would be "iterating over a range of
values" or "finding values based on some ordering" (eg. find the smallest
element bigger than X). Or for walking the entire contents in some predefined
order. Sure, this is not an article about trees, but it doesn't take more than
two sentences to describe that in a much better manner.

2\. For any of the former query patterns, trees _are_ actually much better
than hash tables. That's definitely not a myth, which this article would sorta
have you believe.

> Chained hash tables are a very good idea. Problem solved.

Wrong. If you don't resize your hash table, the complexity of accesses _will_
degrade linearly as the number of elements exceeds the table's size. If your
hash uses, say, 100 buckets, each with a linked list, and stores N > 100
values, an access has complexity O(N) (sure, O(N/100) in the best case,
assuming a perfect spread of hash keys, but that's still O(N)).
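
To make that concrete, here is a minimal sketch (hypothetical types) of the
resize step a chained table needs: grow once the load factor passes a
threshold, or lookups degrade toward O(N). This is the simple stop-the-world
version; the dict.c discussion above covers the incremental variant:

    #include <stdlib.h>

    typedef struct entry {
        unsigned long h;        /* cached hash: resizing needn't rehash keys */
        struct entry *next;
    } entry;

    typedef struct table {
        entry **buckets;
        size_t nbuckets;        /* a power of two */
        size_t nitems;
    } table;

    /* Call on every insert; doubles the table at load factor 1.0. */
    static void maybe_grow(table *t) {
        if (t->nitems < t->nbuckets) return;
        size_t newsize = t->nbuckets * 2;
        entry **nb = calloc(newsize, sizeof *nb);
        if (!nb) return;        /* keep limping along with the old table */
        for (size_t i = 0; i < t->nbuckets; i++) {
            for (entry *e = t->buckets[i]; e != NULL; ) {
                entry *next = e->next;
                size_t idx = e->h & (newsize - 1);
                e->next = nb[idx];
                nb[idx] = e;
                e = next;
            }
        }
        free(t->buckets);
        t->buckets = nb;
        t->nbuckets = newsize;
    }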

Finally, "The worst case is terrible" myth part seems somewhat naive. I think
it should at least mention the idea of hash table collision DOS attacks. "This
doesn't happen in practice" unless there's some attacker out to get you by
triggering that. There's plenty of examples of that happening in practice, of
systems vulnerable to that. See example presentation here:
[http://www.youtube.com/watch?v=R2Cq3CLI6H8&lr=1](http://www.youtube.com/watch?v=R2Cq3CLI6H8&lr=1)
Sure, that's not a reason to switch away from hash tables, but it's definitely
something to consider when using them. I get a bad flavor from this
simplification of "yeah, this doesn't happen in practice, at least as long as
you know what your keys are".

------
16s
C++11 has hash-backed containers now if you prefer them. std::unordered_set
and std::unordered_map are counterparts of std::set and std::map that use hash
tables rather than red-black trees.

~~~
malkia
Isn't that implementation specific?

~~~
marshray
The standard requires amortized O(1) insert and lookup operations, so it's in
practice going to be a hashtable.

~~~
aidenn0
Any Trie varient will have that as well though.

~~~
marshray
Really? Looks to me like the lookup function is invoking recursion here:
<https://en.wikipedia.org/wiki/Trie#Algorithms>

I would think the amortized cost would be O(log_b n), with b being something
like the Shannon entropy of each symbol.

~~~
aidenn0
Well, in that case the cost of a hash function is also not O(1), since the
minimum length of n distinct keys is log(n).

Really both a trie and a hash-function boil down to O(m) where m is the length
of the key.
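
A sketch of why both land on O(m): a trie lookup touches each byte of the key
once, just as a hash function must read each byte once. The 256-way nodes are
for simplicity; real tries compress:

    #include <stddef.h>

    typedef struct trie {
        struct trie *child[256];   /* one branch per possible key byte */
        void *val;                 /* non-NULL if a key ends at this node */
    } trie;

    /* O(m) in the key length m, independent of how many keys are stored. */
    void *trie_find(const trie *t, const unsigned char *key, size_t m) {
        for (size_t i = 0; i < m && t != NULL; i++)
            t = t->child[key[i]];
        return t ? t->val : NULL;
    }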

------
ggchappell
There are some good ideas here, but I must take issue with a number of the
statements the author makes.

To begin with, it is not a "myth" that hash tables have poor worst-case
performance. It is true that this matters little for most applications, and
not at all for some, but the worst case is still there. So I would avoid using
hash tables, e.g., in safety-critical systems: medical, military, etc. (Or, as
I tell my students: don't program a pacemaker in Python.)

And we should note that the excuse that datasets leading to poor performance
are very rare went out 20 years ago. When the person providing your dataset is
a teenager who's trying to break your system just for the fun of it, rarity
does not mean quite what you might think.

Second, hash tables do indeed have performance problems when they fill up.
True, there are ways around this, and any decent implementation of a mutable
hash table will include a strategy for handling these situations.

And that suggests a myth the author of this article might believe: that any
ol' programmer can & should write their own hash table. I don't agree. Don't
write a hash table for use in production code unless you're sure you know what
you're doing. And even if you do, someone else's polished & tested hash table
is likely to be better than what you're going to come up with.

Lastly, balanced search trees do have better worst-case behavior. That mostly
doesn't matter, but sometimes it does. (You have my permission to use a Red-
Black Tree in your pacemaker code.)

------
zippie
I don't believe most engineers brush off hash tables because of the worst case
search. Or even storage costs or implementation complexity as the author
suggests.

My reason for shrugging off hash tables for _most_ requirements: the
cost/complexity of inserts and deletes. Hash tables are amazing when you do
offline index generation (à la Google's "non-realtime" search).

However, if you ever need frequent updates, deletes, or inserts, you need to
grow the hash table, and the cost of managing the hash table eats into any
search gains. For some use cases that's fine - perhaps you only optimize for
reads. Hash tables are awesome for read-heavy loads. Hash tables are sub-
optimal for mixed read/write workloads.

On that note, I believe the red-black tree is far superior to hash tables for
mixed workloads. RB trees tend to be more predictable, and when your
application needs the inserts/deletes (ie, for real-time updates) you don't
have to worry about having to find a new data structure.

On a related note, XFS had to switch their hash over to an rbtree to fix a
performance issue related to hashes falling over in costs of insert/deletes:

<http://oss.sgi.com/archives/xfs/2010-09/msg00203.html>

Yes, there are myths about hash tables, but the article doesn't touch on the
real reasons why engineers avoid them.

------
dspeyer
So we have two ways of dealing with overfullness. We can use linked-list
buckets, whose lookups degrade toward O(n), or we can grow the table and then
sometimes have to rehash the _entire table_ as part of an add.

------
norswap
I'm jumping on the occasion to ask for a review of my own hash table
implementation:

<https://github.com/norswap/nucleotides/blob/master/src/map.c> (be sure to
also peruse the documentation and the tests!)

I'm open to any remarks and criticisms: performance, code style, etc. Since
the implementation is meant to be simple and educational I may not act on all
of them, though.

------
artsrc
Hash Tables make iteration order hard to predict.

Hash functions rarely are meaningful outside of the context of the collection.

The tree interface leads to a simpler world.

------
vii
Hashtables are great until they grow to a significant fraction of available
memory. Then there is no way to avoid the cost of rehashing or giving up
constant-time access - and if your program isn't using most of the memory,
then you can run it on a smaller computer!

------
ishbits
As a developer I tend to use RB trees in the prototype phase then migrate to a
hash table if the performance difference matters.

------
philluminati
<http://global.sitesafety.trendmicro.com/>

------
cmccabe
I really disagree with this article.

"The worst case is terrible" is not a myth, it's a reality-- and it's spawned
several major security vulnerabilities recently. CVE-2012-1150 for example.

"Hash tables become full, and bad things happen." Yes. Yes they absolutely do.
It depends on your implementation exactly what will happen, but in every case,
you are in for some pain. If you choose the "linked list per element" route,
congratulations-- your hash table will gradually approximate the lookup
behavior of an unsorted linked list as the number of elements gets larger. If
you choose rehashing, you'll have to deal with long pauses in your program
during a rehash. Hello, latency spikes! Incremental rehashing could improve
this, but only at the cost of degrading memory locality even further.

"Trees are better." For most applications, trees really ARE better. Trees are
simple to use and they use less memory most of the time. You really should not
be using hash tables unless you have a really good idea of how many elements
you are going to have, AND you believe that they will not be accessed in any
particular pattern.

"Hash functions are slow." I don't even know what this statement is supposed
to mean. But I do know that there are a lot of bad hash functions out there,
and it's easy to choose a bad one as a novice programmer. The right hash
function will depend on your data-- something I note that this article fails
to mention.

"Hash tables use too much memory." A wise programmer once told me, "fancy data
structures are only good when N is large-- and N is usually small." That
definitely applies here. Another way to look at this is that you need to keep
the hash table half empty to keep the good properties that made you choose a
hash table in the first place: O(1) lookup, O(1) insert, O(1) delete, etc. So
you will just have to pay the memory price.

Hash tables can be very useful for certain applications. But in general,
they're an optimization that you should make only if you know what you're
doing. You should be very clear on why you're using them, how many elements
you expect to have, and what your strategy is if you overflow. It's not for
beginners. And if you are optimizing, you should also be aware that something
like a radix tree or a B-Tree may lead to much better performance, due to its
much better memory locality.

Just about the only really intriguing use for hash tables on a modern CPU is
in lockless data structures, but if you're considering writing one of those--
you already know all the stuff I just wrote.

