Hacker News new | comments | show | ask | jobs | submit login
Myths about Hash Tables (hughewilliams.com)
101 points by zengr 1364 days ago | past | web | 100 comments

A good article, but two quick thoughts:

1. With a good hash function, it's hard to accidentally stumble on the worst-case O(n) lookup cost. But against an adversary, it is usually very easy to have that problem. This has been a well-known security issue for over a decade, and over the last 4 years we're starting to see more exploitation of it.

2. The "little trick" at the end of the document that bumps matching items to the head of their bucket chain will usually provide only a marginal speedup (if many of your lookups hit chains longer than 1, your table is too small or your hash function is broken), and that optimization comes with a cost: lookup now modifies the underlying hash table, which is sometimes very undesirable.

> and over the last 4 years we're starting to see more exploitation of it.

Interesting; this came up in the Lua community and the Lua authors (who I have a ton of respect for) agreed to fix the problem, but thought the threat was overstated. Their argument was that many other forms of DoS attack are lower hanging fruit.


As to the first point, I remember this being described well in an older Phrack article [1] written by Solar Designer with regards to port scan detection:

In scanlogd, I'm using a hash table to lookup source addresses. This works very well for the typical case as long as the hash table is large enough (since the number of addresses we keep is limited anyway). The average lookup time is better than that of a binary search. However, an attacker can choose her addresses (most likely spoofed) to cause hash collisions, effectively replacing the hash table lookup with a linear search. Depending on how many entries we keep, this might make scanlogd not be able to pick new packets up in time.

[1] http://www.phrack.org/issues.html?issue=53&id=13#article

Do there exist good hash functions that are resistant or immune to the attacks you mention? For example, does Python 3.3's new hash randomization provide adequate protection?

This is what SipHash was made for: it's a cryptographic MAC that's competitive in speed with other hash functions meant for hash table use. The paper describes its design, and the weaknesses with competitors like Python's randomized hashing:


Some code for it here:


Interesting, though at 140 cycles to hash 16 bytes, it's much slower than state-of-the-art hash functions, which can hash 16 bytes in just over 16 cycles. If you seed your non-cryptographic hash function with a cryptographically random number, you can get the best of both worlds: fast hashing and collision resistance against an adversary.

Not really. The SipHash page contains proof-of-concept code that breaks common good non-cryptographic randomized hashes, i.e., it generates collisions regardless of the chosen seed.

Regarding performance, SipHash is essentially as fast as Python's randomized hash, and quite faster for long inputs (>16 bytes), cf. http://bench.cr.yp.to/results-auth.html#amd64-sandy .

That code appears to rely on having direct access to the hash function; this seems very unlikely to be leaked to an attacker. More interesting would be code that gets the secret with only access to ordered iteration on the table, which is more likely to leak.

I don't know anything about Python's hash function, all of those benchmarks are still an order of magnitude slower than CityHash (for example), which peaks at 0.16 cycles/byte.

I would be interested to see SipHash put through SMHasher, which is the most comprehensive hash function test suite I know of: http://code.google.com/p/smhasher/

Seed recovery (which requires access to hash function) is for Python hash. For CityHash, MurmurHash (2 and 3) there are seed-independent collisions.

Of course, it passes SMHasher with 10 score -- it's a cryptographic PRF.

> For CityHash, MurmurHash (2 and 3) there are seed-independent collisions.

Are you referring to the "Advanced hash flooding" section of the paper, or am I missing this in another place? That analysis appears to rely on the hash function being highly linear/periodic (ie. you can always add "l" to get back the same hash value). I don't think that City, Spooky, or Murmur are nearly this linear. For example, City is based on CRC32 (ie. polynomial division).

Don't get me wrong, I'm impressed that SipHash is so fast for a cryptographic hash. But modern non-cryptographic hash functions are so fast (especially when they're CPU-accelerated, like City is on Intel thanks to the CRC32 instruction) that I don't think SipHash can claim to be the obvious choice for all hash tables.

I am interested to know though if the random seed + City/Spooky/Murmur approach really does have practical hash flooding attacks.

It's pretty quick to generate a lot of collisions, and the set of inputs you get will collide independently of the seed. Look (CityHash64):

$ time ./citycollisions 1000000 > /dev/null

real 0m0.904s

user 0m0.439s

sys 0m0.008s

$ ./citycollisions 3

128-bit key 741b848b0065aec04bb25b26ca2c3e9

CityHash64( 8e69324ad2a005ff2148534148202020, 16 ) = d3290c3d600b8add

CityHash64( 7ae31886221136ba2148534148202021, 16 ) = d3290c3d600b8add

CityHash64( a9a6b9c6888e94ff2148534148202022, 16 ) = d3290c3d600b8add

$ ./citycollisions 3

128-bit key 5e2a23015c89da98172fbcb77ad98468

CityHash64( 8e69324ad2a005ff2148534148202020, 16 ) = 59d0b041b16594b

CityHash64( 7ae31886221136ba2148534148202021, 16 ) = 59d0b041b16594b

CityHash64( a9a6b9c6888e94ff2148534148202022, 16 ) = 59d0b041b16594b

Wow, collisions against the best non-cryptographic hash functions that are independent of seed... color me impressed! This does then seem to strongly suggest that a cryptographic hash function is the only way to defend against hash flooding attacks. Props to the SipHash authors, this is some ground-breaking work.

See Downloads section here https://www.131002.net/siphash/

Edit: wait, I'm mistaken -- these are not independent of seed.

I think they are, actually! The code just uses "seed" in a slightly confusing way -- it's seeding its own algorithm with "seed" but using "key" as the seed for the hash function.

This is really impressive work.

Aha, thanks for clearing that up.

Isn't Python open source? Along with a large quantity of server-side software? I'd assume everyone already knows my hash functions, I didn't make any new ones.

He means access to [seeded] hash function output.

The strategy you propose gives immunity to the more straightforward attacks on hash tables, but in practice tends to leak information about your secret key through differences in the time taken by hash table operations. If you check out pages 14--16 of the SipHash paper, it goes into a bit more detail about this. You may also find Appendix B entertaining.

Can you name some of these speedy state-of-the-art hash functions?

MurmurHash3 (http://code.google.com/p/smhasher/wiki/MurmurHash3), CityHash (http://code.google.com/p/cityhash/), SpookyHash (http://burtleburtle.net/bob/hash/spooky.html).

But you said 16 cycles.

"For example, hashing 16 bytes takes 141 Bulldozer cycles with SipHash-2-4, against 82 and 126 for CityHash and SpookyHash, and 600 for MD5."

https://www.131002.net/siphash/siphash.pdf (p.12)

According to CityHash's benchmarks (http://code.google.com/p/cityhash/source/browse/trunk/README), they can hash 8 bytes in 6ns and 64 bytes in 9ns on a 2.67GHz machine (so 16-24 cycles).

Inside the Redis distribution you can find "dict.c" that is a very easy to understand hash table implementation that uses incremental rehashing, that is, when it needs to switch to a bigger or smaller table this happens incrementally in three ways:

1) At every operation you perform there is a rehashing step.

2) You have a function that you can call to perform a rehashing step.

3) There is an higher level function where you can say "rehash for N milliseconds" in case you have some idle time to spend for a good cause inside your program.

Another uncommon thing of dict.c is that it supports the "pick a random element" operation with guaranteed O(1) behaviour.

EDIT: Now that I look at the source, clearly while the code is easy to follow a design section should be present in the dict.c top-comment. I'll make sure to add this, but long story short:

* We use both chaining and incremental rehashing. This means that there is no need to rehash ASAP when the hash table reached the max % of elements allowed. At the same time using the incremental rehashing there is no need to block.

* We support safe and unsafe iterators: safe iterators block the rehashing when they are active, so you have the feeling you are accessing the hash table in a read-only way. Unsafe iterators are ok for looping even while an incremental rehashing is in progress, but is not ok to touch the hash table while iterating with an unsafe iterator (otherwise the iterator no longer guarantees to return all the elements, or to avoid duplicated elements).

* The hash table is rehashed both ways: if it's too full, or if it's too empty. This way there is the guarantee that a table has always a percentage between a min and max. This guarantees that our random element function is in the average case O(1).

* When we rehash, a second table is created, and stuff are moved incrementally from one to the other. When we have two tables, all the lookup / delete operations are run against both the table as the element can be in the old or in the new table. Instead insertions only happen in the new table.

Well these are the main ideas I think.


Thanks, perhaps the design is even more evident here -> https://github.com/antirez/redis/blob/unstable/src/dict.h

Just a bit of clarification, when you say "pick a random element", do you mean choosing an arbitrary element, or do you actually mean choosing one at random such that each element has a 1/N probability of being chosen?

I'd yield to someone who understands the code, but a quick glance at dictGetRandomKey() shows it is randomized, and I believe is very close to 1/N. If there are a large number of elements in a single bucket this may not be true. At an extreme, if there is one element in a bucket and N-1 in another, there is a 50% chance teh first will get picked, and 50/(N-1)% the others will.

Exactly as you said, it's a decent approximation and with large tables it works well enough, but if there are clusters (long chains) in a bucket those elements have a smaller percentage of chances to be picked.

However there is a trick to improve this that I don't use, that is, instead of searching for a non empty bucket, and select a random element from the chain, it is possible to do this, choosing a small M (like M=3):

* Find M non-empty buckets.

* Pick the bucket among the M buckets, with a chance proportional to the chain length at every bucket.

* Finally pick a random element from the bucket.

The bigger M, the more accurate the algorithm becomes.

For instance in the pathological case you shown where a bucket as one element and another all the rest, this would find the two buckets and then pick the 1 element bucket with a much smaller probability that would adjust the difference in chain length.

Thanks for clarifying. This is a third possibility that hadn't occurred to me, so I'm glad I asked and learned something!

I love hash tables. They're my favorite data structure, maybe.

A story:

Back in the 70s it appeared that every major state university had to do their own version of PL/1. At the University of Maryland it was called PL/UM. I don't know why, maybe it was a testosterone thing ("IBM took a zillion man-years to write PL/1, but look what WE did with three starving grad students, a half-soused half-professor and a keypunch operator!") (Yeah, another version of PL/1, and that doesn't necessarily make the world a better place. Anyway --)

PLUM was notorious for being slow (I don't know about bugs, but they were probably there, too), and it was pretty common for students using PLUM to have to grovel for more seconds of CPU time.

Five or six years into PLUM's use as a student language, someone discovered that all of the symbols were being hashed into the same bucket.

PLUM got amazingly fast (well, faster, at least) after that. But by then anyone with a clue had managed to scam an account on one of the bsd Vaxen and write their stuff in C.

Don't take hash tables on faith. Measure, measure, measure.

The first two myths aren't really myths.

In "The worst case is terrible", the author goes on to explain that the worst case is unlikely. That may be true, but it's still O(n).

In "Hash tables become full, and bad things happen", if I'm reading it correctly, the author shows how you can make each hash table bucket a linked list so that each bucket never becomes full. But what happens when you have 1000 items in each bucket? If you keep adding elements and you want to keep reads fast, sooner or later you're going to have to add more buckets, at which time you need to re-hash every element in the table.

I'm a fan of hashtables, but there's some rather suspicious handwaving in this article.

goes on to explain that the worst case is unlikely. That may be true, but it's still O(n).

It's not O(n), it's more like O(1 + max(0, n - m)) where n is the number of elements in the table and m is the number of buckets. The table can grow up to about n == m and it's still approximately O(1).

But what happens when you have 1000 items in each bucket?

Don't do that. Rather if you do that, then you don't get to complain that the performance of your data structure is that of the fallback strategy instead of a properly sized hashtable.

You should have begin executing your growth strategy back at about 0.5 items in each bucket.

If you keep adding elements and you want to keep reads fast, sooner or later you're going to have to add more buckets,

Sooner or later you're going to have to put more RAM in the machine too. But in the meantime, your trusty old hash-based data structure will give you reliable service.

at which time you need to re-hash every element in the table.

If latency is a concern this re-hashing can be done incrementally.

If the hashing of elements itself is expensive, you can store the resulting hash value for each element.

Note that there are three zero bits at the bottom of each 64-bit-aligned pointer value. You could use these to store three additional bits of the hash value, allowing you to grow your table by a factor of 8 without rehashing.

It's not O(n), it's more like O(1 + max(0, n - m)) where n is the number of elements in the table and m is the number of buckets. The table can grow up to about n == m and it's still approximately O(1).

Isn't that just the long way of saying "O(n)"? It's a worst-case bound.

I'm guilty of sometimes using big-O notation when discussing amortized average costs as well.

It seems nonsensical to discuss worst-case behavior of hash-based data structures since we can (and should) make it arbitrarily unlikely.

In other words, if O(n) is the limit as n goes to infinity then the probability of actually encountering worst-case beahvior on a hashtable operation is 0 (assuming a non-broken hash function). Near-worst case behavior becomes similarly improbable.

This is different than many algorithms like, say, classic quicksort which have worst-case behavior on values that naturally appear in practice.

It depends on the application of the hash table, but malicious agents could 'attack' code by purposefully creating collisions if the author isn't careful. An algorithm expected to run in O(1) that runs at O(n) could be catastrophic.

But the solution to that problem (using a cryptographic hash function randomized with a secret) doesn't change the complexity of the algorithm.

I'm just pointing out that the worst case time complexity is an issue that can't always be cast aside. Aren't cryptographic hash functions generally slower?

The fundamental problem, especially wrt to 1 and 2, is that you're starting to deal more with the design and implementation of the hash table than anything else. There are two things that need to be considered for a good hash table implementation:

1. Collisions should be expected, and handled correctly.

2. The hashing algorithm should minimize collisions as much as possible.

Yes, the worst case running time is linear. However, assuming a properly defined hashing function, it's practically a linear running time over the expected number of collisions per bucket, which should be a tiny fraction of the total expected data size.

The problem that the second gives us is that proper hash algorithms are tailored to the specific problem at hand. MD5 is overkill when your entire pool of data is 1000 names. And knowing your data set makes it so you can avoid having a relatively large percentage of your data in one bucket.

Minor correction:

The worst case runtime is not O(n), it is defined by how you store your nodes in a hash bucket. If you choose to use a linked list, then searching a linked list is O(n). Also, obviously, there is no free lunch, so using a hash table in each bucket does not magically produce O(1) lookups.

Pick a data structure that has O(log n) runtime for lookups and use it when the hash table breaks down - that is, use it when searching in a bucket.

On the other hand, if your hash table has enough nodes in a bucket that you even notice the difference between O(very small n) and O(log very small n), you really should migrate to a larger hash table.

I;'m the author. It's not hand waving, it's an attempt to write a simple, approachable article -- I'm not trying to be Knuth.

If you want details, read the two articles that are linked into the text -- in those I delve into plenty of details, and include real-world experiments.

I thought it was a great article.

If things are being hashed evenly, and you have a 100 bucket table, putting 200, 300, etc. elements into the table isn't a terribly big deal, because it's just not slow to traverse a few nodes of a linked list. Of course, there's no reason not to make the table much bigger--it's just an array of pointers, make it 1,000,000 entries and you're still within a few megabytes.

Now, if someone wants to be a jerk, and figures out how to easy collide on your hash function, then you could have problems. He could stuff 1 million elements into the same hash table element, which will suck for linked-list traversal. So get a good hash algorithm :)

It's not just a "good" hash algorithm, it has to be one that cannot be made to behave in a way predictable to an attacker.

Picking a new secure random seed value at startup time and prepending it to every hashed value is generally what you need to do.

I am not going to buy a lottery ticket because the super unlikely best case is really good. For the same reason, I am not going to avoid a hash table because the super unlikely worst case is kind of bad.

You never have 1000 items in each bucket. You resize well before then.

I think at the point of having 1000 items in each bucket, a hashtable was never a good idea. I read it more as like, "Don't worry if you have collisions cos it'll be OK" as opposed to "Collisions have no impact on performance".

When you have 1000 items per bucket, you realize that you needed a bigger table. Depending on the implementation you're using, it may be able to expand the table without going "ok, hold everything, I have to rehash all this existing content". For instance, you can create a second hash table, start putting new additions in there. When you get an access, first check the new table, then the original one. If it's in the original table, return the value and move the k/v pair to the new table. Lots of papers have surely been written on this topic.

I ran experiments back in the day that varied the size of the hash table, and tried extremes where there were extremely long chains (high collisions). Take a look at pages 274 and 275, you'll be perhaps surprised how fast hash tables are when there are collisions -- especially if you use the move-to-front heuristic. (I agree with the other comments that the article is primarily concerned with the read use case.) http://www.mathcs.emory.edu/~whalen/Hash/Hash_Articles/In-me...

You could make each hash bucket a tree instead?

I'm the author of the article. I've tried that -- the downside is that the data structures (nodes in particular) take more space, and have more costs to manage / reorganize. Using a linked list with move-to-front-on-access works better in all experiments I've tried.

Just to throw it out there, here's a 6th myth to consider:

> Hash tables are fast

When you're dealing with lots of keys, they're usually the best option out there. But if you're dealing with a small number of keys, a simple array can often be faster. A hash table lookup involves a constant amount of overhead resulting from calculating the key, and possibly also added indirection depending on whether the implementation uses open addressing. If the amount of time that would be spent on a simple linear or binary search of the array is less than (or even not much more than) the time that would be spent on that overhead, then the array will be faster.

Of course, the point where the hash table overtakes the array depends on a lot of factors. As usual the only sure-fire way to know which option is better in a particular situation is to measure.

I recently read a story by one of Wirth's students, reminiscing about working on the Oberon compiler. They had several important principles for developing the Oberon compiler that drove development, one was something to the effect that anything you add must pay for itself or improve the speed of compilation--the benchmark being the whole Oberon system. The students had implemented a very fancy data structure, I think a balanced tree, to use with register allocation. Wirth found out and ordered them to remove it and replace it with a linked list. They groaned, but complied--and the compiler was both smaller and faster as a result, despite worsening the worst case.

"I still vividly remember the day that Wirth decided to replace the elegant data structure used in the compiler’s symbol table handler by a mundane linear list. In the original compiler, the objects in the symbol table had been sorted in a tree data structure (in identifier lexical order) for fast access, with a separate linear list representing their declaration order. One day Wirth decided that there really weren’t enough objects in a typical scope to make the sorted tree cost-effective. All of us Ph.D. students were horrified: it had taken time to implement the sorted tree, the solution was elegant, and it worked well – so why would one want to throw it away and replace it by something simpler, and even worse, something as prosaic as a linear list? But of course, Wirth was right, and the simplified compiler was both smaller and faster than its predecessor."


Yes, that's it! Thank you! I apparently misremembered the purpose, but at least I got the sorted tree part right. :)

Perhaps this was an example where the total number of elements was well bounded. After all, CPUs aren't growing more general-purpose registers very often, or the size of the scope considered in the scheduling may be limited by other constraints.

Yes, absolutely. Along similar lines, Hindley-Milner type inferencing is also super-exponential in the worst case, but people don't write code bad enough to trigger the worst case.

Hash tables are fantastic, but:

Depending on the hash and hash table implementation, large hash tables where elements are spread across several pages in memory can pathologically thrash your caching. You can get similar behaviour to randomly accessing elements in a large array. A cache miss from L2 to RAM is extremely slow and can be about a million times slower if you have to page fault all the way into swap.

This is a good example of a case where the standard model of computer that's used in CS proofs starts to fall over because it ignores hierarchical memory and cache locality.

Hand wavily, accessing every element once in a large naive hash table will give you O(n) page faults on average whereas accessing every element sequentially in a vector will give you O(n/d) page faults where d is your page size. While asymptotically similar, in real life that constant d can be gigantic.

That being said, I'm not sure how the C++ unordered_map handles this, it's just something I experienced when running cachegrind on someone's hand rolled implementation in the past. (I 'fixed' it by making the templated class store pointers, and then locality was greatly improved because malloc's default behaviour.)

* A cache miss from L2 to RAM is extremely slow*

Yes, but a properly-sized hash table will result in an average of 1 to 2 cache misses per lookup regardless of how many elements. This is far better than what you get with other structures, except perhaps in specialized cases that happen to follow tree traversal.

and can be about a million times slower if you have to page fault all the way into swap.

Disk based data requires specialized algorithms.

accessing every element once in a large naive hash table will give you O(n) page faults on average whereas accessing every element sequentially in a vector will give you O(n/d) page faults where d is your page size. While asymptotically similar, in real life that constant d can be gigantic.

Sure, a vector is better when you only ever access elements in a single sequential order and don't add or remove elements anywhere except the end.

That being said, I'm not sure how the C++ unordered_map handles this

C++ defines a method of iterating over all the elements of an unordered_map. I think a typical implementation will just treat the hash memory as a vector and return values from the occupied buckets.

Cache locality is easily approached in closed hashing by dividing the complete slot array into aligned page-sized chunks. Then part of the hash value points to the chunk and the rest of the hash value is used to do the secondary probe linearly within that chunk so all reads take place in the L1 cache. For example, if each slot contains the hash and a pointer to the object (for final key comparison), you can fit 512 entries onto a single 4K page in a 32-bit system.

(Of course, the chunks must eventually be processed linearly if they're all full and the key isn't found till the last chunk, but this is again a degenerate case which generally isn't a problem as long as the hash table is resized often enough as it grows.)

The reason this often isn't a problem in practice is that hash tables are typically used in cases where you're going to be accessing elements randomly anyways. If you're accessing elements randomly, I don't know of a data structure that will do a better job of keeping things in cache.

> Engineers irrationally avoid hash tables...

Really? That seems about as likely as the statement 'carpenters irrationally avoid using screws'.

Language affects matters here, I think. I haven't met a Python programmer who doesn't love dictionaries, but I've seen C programmers (and to an extent, C++ programmers where 'std::map' is a red black tree) that avoid them because they have to roll their own or find and trust (and learn how to use) someone else's. Java programmers are more mixed. When the language forces you if you want to have nested hash tables to write "Map<String, Map<String, Map<String, Map<String, String>>> > meta_data = new HashMap<String, Map<String, Map<String, Map<String, String>>> >();" instead of "meta_data = {}", it's a pretty strong incentive to not use these things.

To be fair to the C++ programmers, there are several legitimate reasons to avoid std::map<> - for instance, if your project doesn't use exceptions.

So you don't use std::string or dynamic_cast<>() either?

It's typical in game development to disable both exceptions and RTTI entirely. dynamic_cast is out on either of those counts. std::almost_everything is out due to exceptions.

EA released a GPL'd "EASTL" library a few years ago for those and other reasons, you can read all about it here: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n227...

I think we've had this conversation before :-)

I've posted that link once or twice before (once or twice literally, not sarcastically), it wouldn't surprise me if I was replying to you in one of those instances :)

Here, I found it, we both posted in this thread, but not to each other: http://news.ycombinator.com/item?id=3560158

Perhaps I saw that link on Reddit?

Indeed, I find academia is averse to hashtables (because of the poor worst-case performance) but engineers think they are the best thing since sliced bread.

This "worst-case" issue seems silly to me.

If all your elements magically hash to the same bucket somehow and you end up having to make a list or a tree out of them, you're still no worse off than if you used a list or a tree in the first place.

Most hash-table implementations either re-hash or use an O(n) lookup structure. And replace "magically" with "a malicious attacker" and you see it does come up more often.

Though in general, you are right that academics tend to be over-obsessed with worst-case performance; see e.g. heap-sort.

Agreed! And taken to the logical extreme often times they are used as an excuse for lazy thinking. One example is stream processing where you're processing lots of messages per second and typically have some sort of originating request ID. Keeping a map of request ID's -> handlers gets very expensive if your goal is low latency (because you perform the map lookup on every operation).

I agree, it's a straw man. Also, who ever implements their own data structures anymore? These days people use native-language maps freely, which may be implemented as trees or hash tables.

Misses the point about B-trees.

They're not just 'good for inexact matches' - they're optimised for things like database indexes, to reduce the number of seeks over a block-based device (eg a filesystem) required for a index lookup.

External tables are an entirely orthogonal issue to this discussion. He's comparing hash tables to in-memory balanced binary trees.

third myth in the article compares B-trees to hash tables:

"Trees are better In general, trees aren’t faster for searching for exact matches. They’re slower, and here’s peer-reviewed evidence that compares B-trees, splay trees, and hashing for a variety of string types. So, why use a tree at all? Trees are good for inexact matches — say finding all names that begin with “B” — and that’s a task that hash tables can’t do."

It looks like the paper doesn't refer to B-trees. It refers to binary search trees.

and so the article confusing B-trees with binary search trees doesn't exactly increase confidence here ...

There are two articles that are linked to in the text, one that describes comparisons of binary tress, self-adjusting tress, and hash tables. There's a second article that compares B-trees, hash tables, and a few other structures -- that's this one: http://ww2.cs.mu.oz.au/~jz/fulltext/acmtois02.pdf

What "confidence" do you need for a straightforward comparison of hash tables and binary trees? These aren't Apple rumors; this is CS 101.

my point is merely that as written, the article is confusing.

to paraphrase: "hashtables are faster. here's some research showing b-trees are slower than hashtables. why would you ever use a tree? prefix matching, that's why."

which misses the point that the primary reason for using a B-tree as a general purpose database index is not about prefixes, it's about block devices (prefixes is an important reason, but it's not the main one)

if the purpose of the article was to ignore implementation and spinning disks and just look at in-memory performance, then don't mention B-trees. they're an implementation-specific optimisation, and it just confuses.

I'm not really sure this article explains any myths about hashtables at all.

Aren't all of these statements obvious is you actually understand what a hashtable is?

The reason functional programmers prefer trees is because it's much easier to make them persistent (in the functional sense).

Seems somewhat vague, I think it could be a lot more accurate. I'm a bit surprised to see it here.

It claims that it's a myth that "trees are better" than hash tables and just describes trees as better for "finding inexact matches", as in the example "finding all names that begin with B". That's too vague. It's misleading because:

1. Trees can't just answer any random vague questions, such as "Find all names that have a P anywhere". That doesn't really describe the advantages of trees. I think a more accurate formulation would be "iterating over a range of values" or "finding values based on some ordering" (eg. find the smallest element bigger than X). Or for walking the entire contents in some predefined order. Sure, this is not an article about trees, but it doesn't take more than two sentences to describe that in a much better manner.

2. For any of the former query patterns, trees are actually much better than hash tables. That's definitely not a myth, which this article would sorta have you believe.

> Chained hash tables are a very good idea. Problem solved.

Wrong. If you don't resize your hash table, the complexity of accesses will degrade linearly as it exceeds its size. If your hash uses, say, 100 buckets each with a linked lists and stores N > 100 values, an access has complexity O(N) (sure, O(N/100) in the best case (assuming perfect spread of hash keys), but that's really O(N)).

Finally, "The worst case is terrible" myth part seems somewhat naive. I think it should at least mention the idea of hash table collision DOS attacks. "This doesn't happen in practice" unless there's some attacker out to get you by triggering that. There's plenty of examples of that happening in practice, of systems vulnerable to that. See example presentation here: http://www.youtube.com/watch?v=R2Cq3CLI6H8&lr=1 Sure, that's not a reason to switch away from hash tables, but it's definitely something to consider when using them. I get a bad flavor from this simplification of "yeah, this doesn't happen in practice, at least as long as you know what your keys are".

C++11 has hash backed containers now if you prefer them. std::unordered_set and std::unordered_map are identical to std::set and std::map except they use hash tables rather than red black trees.

I've seen some huge (10-20x) performance gains simply by switching some of my sets and maps to unordered.

Why did they wait so ridiculously long to add this to the standard?!

Isn't that implementation specific?

The standard requires amortized O(1) insert and lookup operations, so it's in practice going to be a hashtable.

Any Trie varient will have that as well though.

Really? Looks to me like the lookup function is invoking recursion here: https://en.wikipedia.org/wiki/Trie#Algorithms

I would think the amortized cost would be O(log_b n), with b being something like the Shannon entropy of each symbol.

Well in that case the cost of a hash-function is also not O(1) since the minimum length of a n distinct keys is log(n).

Really both a trie and a hash-function boil down to O(m) where m is the length of the key.

There are some good ideas here, but I must take issue with a number of the statements the author makes.

To begin with, it is not a "myth" that hash tables have poor worst-case performance. It is true that this matters little for most applications, and not at all for some, but the worst case is still there. So I would avoid using hash tables, e.g., in safety-critical systems: medical, military, etc. (Or, as I tell my students: don't program a pacemaker in Python.)

And we should note that the excuse that datasets leading to poor performance are very rare, went out 20 years ago. When the person providing your dataset is a teenager who's trying to break your system just for the fun of it, rarity does not mean quite what you might think.

Second, hash tables do indeed have performance problems when they fill up. True, there are ways around this, and any decent implementation of a mutable hash table will include a strategy for handling these situations.

And that suggests a myth the author of this article might believe: that any ol' programmer can & should write their own hash table. I don't agree. Don't write a hash table for use in production code unless you're sure you know what you're doing. And even if you do, someone else's polished & tested hash table is likely to be better than what you're going to come up with.

Lastly, balanced search trees do have better worst-case behavior. That mostly doesn't matter, but sometimes it does. (You have my permission to use a Red-Black Tree in your pacemaker code.)

I don't believe most engineers brush off hash tables because of the worst case search. Or even storage costs or implementation complexity as the author suggests.

My reason for shrugging off hash tables for most requirements: cost/complexity of inserts and deletes. Hash tables are amazing when you do offline index generation (circa Google "non-realtime" search).

However, if you ever need to have frequent updates, deletes, or inserts, you need to grow the hash table and in doing so managing the hash table usurps any search gains. For some use cases that's fine - perhaps you only optimize for reads. Hash tables are awesome for read heavy loads. Hash tables are sub-optimal for mixed read/write work loads.

On that note, I believe the Red-Black btree is far superior to hash tables for mixed work loads. RBtree's tend to be more predictable and when your application needs the insert/delete (ie, for real time updates) you don't have to worry about having to find a new data structure.

On a related note, XFS had to switch their hash over to an rbtree to fix a performance issue related to hashes falling over in costs of insert/deletes:


Yes there are myths about hash tables but article doesn't touch on the real truths of why engineers avoid hash tables.

So we have two ways of dealing with overfullness. We can use linked-list buckets, which are expected O(n) or we grow and sometimes have to rehash the entire table as part of an add.

I'm jumping on the occasion to ask a review of my own hash table implementation:

https://github.com/norswap/nucleotides/blob/master/src/map.c (think also to peruse the documentation and the tests!)

I'm open to any remarks and criticisms: performance, code style, etc. Since the implementation is meant to be simple and educational I may not act on all of them, tough.

Hash Tables make iteration order hard to predict.

Hash functions rarely are meaningful outside of the context of the collection.

The tree interface leads to a simpler world.

Hashtables are great until they grow to a significant fraction of available memory. Then there is no way to avoid the cost of rehashing or giving up constant time access - and if your program isn't using most of memory then you can run it on a smaller computer!

As a developer I tend to use RB trees in the prototype phase then migrate to a hash table if the performance difference matters.


I really disagree with this article.

"The worst case is terrible" is not a myth, it's a reality-- and it's spawned several major security vulnerabilities recently. CVE-2012-1150 for example.

"Hash tables become full, and bad things happen." Yes. Yes they absolutely do. It depends on your implementation exactly what will happen, but in every case, you are in for some pain. If you choose the "linked list per element" route, congratulations-- your hash table will gradually approximate the lookup behavior of an unsorted linked list as the number of elements gets larger. If you choose rehashing, you'll have to deal with long pauses in your program during a rehash. Hello, latency spikes! Incremental rehashing could improve this, but only at the cost of degrading memory locality even further.

"Trees are better." For most applications, trees really ARE better. Trees are simple to use and they use less memory most of the time. You really should not be using hash tables unless you have a really good idea of how many elements you are going to have, AND you believe that they will not be accessed in any particular pattern.

"Hash functions are slow." I don't even know what this statement is supposed to mean. But I do know that there are a lot of bad hash functions out there, and it's easy to choose a bad one as a novice programmer. The right hash function will depend on your data-- something I note that this article fails to mention.

"Hash tables use too much memory." A wise programmer once told me, "fancy data structures are only good when N is large-- and N is usually small." That definitely applies here. Another way to look at this is that you need to keep the hash table half empty to keep the good properties that made you choose a hash table in the first place: O(1) lookup, O(1) insert, O(1) delete, etc. So you will just have to pay the memory price.

Hash tables can be very useful for certain applications. But in general, they're an optimization that you should make only if you know what you're doing. You should be very clear on why you're using them, how many elements you expect to have, and what your strategy is if you overflow. It's not for beginners. And if you are optimizing, you should also be aware that something like a radix tree or a B-Tree may lead to much better performance, due to its much better memory locality.

Just about the only really intriguing use for hash tables on a modern CPU is in lockless data structures, but if you're considering writing one of those-- you already know all the stuff I just wrote.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact