1. With a good hash function, it's hard to accidentally stumble on the worst-case O(n) lookup cost. But against an adversary, it is usually very easy to have that problem. This has been a well-known security issue for over a decade, and over the last 4 years we're starting to see more exploitation of it.
2. The "little trick" at the end of the document that bumps matching items to the head of their bucket chain will usually provide only a marginal speedup (if many of your lookups hit chains longer than 1, your table is too small or your hash function is broken), and that optimization comes with a cost: lookup now modifies the underlying hash table, which is sometimes very undesirable.
Interesting; this came up in the Lua community and the Lua authors (who I have a ton of respect for) agreed to fix the problem, but thought the threat was overstated. Their argument was that many other forms of DoS attack are lower hanging fruit.
In scanlogd, I'm using a hash table to lookup source addresses. This works very well for the typical case as long as the hash table is large enough (since the number of addresses we keep is limited anyway). The average lookup time is better than that of a binary search. However, an attacker can choose her addresses (most likely spoofed) to cause hash collisions, effectively replacing the hash table lookup with a linear search. Depending on how many entries we keep, this might make scanlogd not be able to pick new packets up in time.
Some code for it here:
Regarding performance, SipHash is essentially as fast as Python's randomized hash, and quite faster for long inputs (>16 bytes), cf. http://bench.cr.yp.to/results-auth.html#amd64-sandy .
I don't know anything about Python's hash function, all of those benchmarks are still an order of magnitude slower than CityHash (for example), which peaks at 0.16 cycles/byte.
I would be interested to see SipHash put through SMHasher, which is the most comprehensive hash function test suite I know of: http://code.google.com/p/smhasher/
Of course, it passes SMHasher with 10 score -- it's a cryptographic PRF.
Are you referring to the "Advanced hash flooding" section of the paper, or am I missing this in another place? That analysis appears to rely on the hash function being highly linear/periodic (ie. you can always add "l" to get back the same hash value). I don't think that City, Spooky, or Murmur are nearly this linear. For example, City is based on CRC32 (ie. polynomial division).
Don't get me wrong, I'm impressed that SipHash is so fast for a cryptographic hash. But modern non-cryptographic hash functions are so fast (especially when they're CPU-accelerated, like City is on Intel thanks to the CRC32 instruction) that I don't think SipHash can claim to be the obvious choice for all hash tables.
I am interested to know though if the random seed + City/Spooky/Murmur approach really does have practical hash flooding attacks.
$ time ./citycollisions 1000000 > /dev/null
$ ./citycollisions 3
128-bit key 741b848b0065aec04bb25b26ca2c3e9
CityHash64( 8e69324ad2a005ff2148534148202020, 16 ) = d3290c3d600b8add
CityHash64( 7ae31886221136ba2148534148202021, 16 ) = d3290c3d600b8add
CityHash64( a9a6b9c6888e94ff2148534148202022, 16 ) = d3290c3d600b8add
128-bit key 5e2a23015c89da98172fbcb77ad98468
CityHash64( 8e69324ad2a005ff2148534148202020, 16 ) = 59d0b041b16594b
CityHash64( 7ae31886221136ba2148534148202021, 16 ) = 59d0b041b16594b
CityHash64( a9a6b9c6888e94ff2148534148202022, 16 ) = 59d0b041b16594b
Edit: wait, I'm mistaken -- these are not independent of seed.
This is really impressive work.
"For example, hashing 16 bytes takes 141
Bulldozer cycles with SipHash-2-4, against 82 and 126 for CityHash and SpookyHash, and 600 for MD5."
1) At every operation you perform there is a rehashing step.
2) You have a function that you can call to perform a rehashing step.
3) There is an higher level function where you can say "rehash for N milliseconds" in case you have some idle time to spend for a good cause inside your program.
Another uncommon thing of dict.c is that it supports the "pick a random element" operation with guaranteed O(1) behaviour.
EDIT: Now that I look at the source, clearly while the code is easy to follow a design section should be present in the dict.c top-comment. I'll make sure to add this, but long story short:
* We use both chaining and incremental rehashing. This means that there is no need to rehash ASAP when the hash table reached the max % of elements allowed. At the same time using the incremental rehashing there is no need to block.
* We support safe and unsafe iterators: safe iterators block the rehashing when they are active, so you have the feeling you are accessing the hash table in a read-only way. Unsafe iterators are ok for looping even while an incremental rehashing is in progress, but is not ok to touch the hash table while iterating with an unsafe iterator (otherwise the iterator no longer guarantees to return all the elements, or to avoid duplicated elements).
* The hash table is rehashed both ways: if it's too full, or if it's too empty. This way there is the guarantee that a table has always a percentage between a min and max. This guarantees that our random element function is in the average case O(1).
* When we rehash, a second table is created, and stuff are moved incrementally from one to the other. When we have two tables, all the lookup / delete operations are run against both the table as the element can be in the old or in the new table. Instead insertions only happen in the new table.
Well these are the main ideas I think.
However there is a trick to improve this that I don't use, that is, instead of searching for a non empty bucket, and select a random element from the chain, it is possible to do this, choosing a small M (like M=3):
* Find M non-empty buckets.
* Pick the bucket among the M buckets, with a chance proportional to the chain length at every bucket.
* Finally pick a random element from the bucket.
The bigger M, the more accurate the algorithm becomes.
For instance in the pathological case you shown where a bucket as one element and another all the rest, this would find the two buckets and then pick the 1 element bucket with a much smaller probability that would adjust the difference in chain length.
Back in the 70s it appeared that every major state university had to do their own version of PL/1. At the University of Maryland it was called PL/UM. I don't know why, maybe it was a testosterone thing ("IBM took a zillion man-years to write PL/1, but look what WE did with three starving grad students, a half-soused half-professor and a keypunch operator!") (Yeah, another version of PL/1, and that doesn't necessarily make the world a better place. Anyway --)
PLUM was notorious for being slow (I don't know about bugs, but they were probably there, too), and it was pretty common for students using PLUM to have to grovel for more seconds of CPU time.
Five or six years into PLUM's use as a student language, someone discovered that all of the symbols were being hashed into the same bucket.
PLUM got amazingly fast (well, faster, at least) after that. But by then anyone with a clue had managed to scam an account on one of the bsd Vaxen and write their stuff in C.
Don't take hash tables on faith. Measure, measure, measure.
In "The worst case is terrible", the author goes on to explain that the worst case is unlikely. That may be true, but it's still O(n).
In "Hash tables become full, and bad things happen", if I'm reading it correctly, the author shows how you can make each hash table bucket a linked list so that each bucket never becomes full. But what happens when you have 1000 items in each bucket? If you keep adding elements and you want to keep reads fast, sooner or later you're going to have to add more buckets, at which time you need to re-hash every element in the table.
I'm a fan of hashtables, but there's some rather suspicious handwaving in this article.
It's not O(n), it's more like O(1 + max(0, n - m)) where n is the number of elements in the table and m is the number of buckets. The table can grow up to about n == m and it's still approximately O(1).
But what happens when you have 1000 items in each bucket?
Don't do that. Rather if you do that, then you don't get to complain that the performance of your data structure is that of the fallback strategy instead of a properly sized hashtable.
You should have begin executing your growth strategy back at about 0.5 items in each bucket.
If you keep adding elements and you want to keep reads fast, sooner or later you're going to have to add more buckets,
Sooner or later you're going to have to put more RAM in the machine too. But in the meantime, your trusty old hash-based data structure will give you reliable service.
at which time you need to re-hash every element in the table.
If latency is a concern this re-hashing can be done incrementally.
If the hashing of elements itself is expensive, you can store the resulting hash value for each element.
Note that there are three zero bits at the bottom of each 64-bit-aligned pointer value. You could use these to store three additional bits of the hash value, allowing you to grow your table by a factor of 8 without rehashing.
Isn't that just the long way of saying "O(n)"? It's a worst-case bound.
It seems nonsensical to discuss worst-case behavior of hash-based data structures since we can (and should) make it arbitrarily unlikely.
In other words, if O(n) is the limit as n goes to infinity then the probability of actually encountering worst-case beahvior on a hashtable operation is 0 (assuming a non-broken hash function). Near-worst case behavior becomes similarly improbable.
This is different than many algorithms like, say, classic quicksort which have worst-case behavior on values that naturally appear in practice.
1. Collisions should be expected, and handled correctly.
2. The hashing algorithm should minimize collisions as much as possible.
Yes, the worst case running time is linear. However, assuming a properly defined hashing function, it's practically a linear running time over the expected number of collisions per bucket, which should be a tiny fraction of the total expected data size.
The problem that the second gives us is that proper hash algorithms are tailored to the specific problem at hand. MD5 is overkill when your entire pool of data is 1000 names. And knowing your data set makes it so you can avoid having a relatively large percentage of your data in one bucket.
The worst case runtime is not O(n), it is defined by how you store your nodes in a hash bucket. If you choose to use a linked list, then searching a linked list is O(n). Also, obviously, there is no free lunch, so using a hash table in each bucket does not magically produce O(1) lookups.
Pick a data structure that has O(log n) runtime for lookups and use it when the hash table breaks down - that is, use it when searching in a bucket.
On the other hand, if your hash table has enough nodes in a bucket that you even notice the difference between O(very small n) and O(log very small n), you really should migrate to a larger hash table.
If you want details, read the two articles that are linked into the text -- in those I delve into plenty of details, and include real-world experiments.
Now, if someone wants to be a jerk, and figures out how to easy collide on your hash function, then you could have problems. He could stuff 1 million elements into the same hash table element, which will suck for linked-list traversal. So get a good hash algorithm :)
Picking a new secure random seed value at startup time and prepending it to every hashed value is generally what you need to do.
You never have 1000 items in each bucket. You resize well before then.
> Hash tables are fast
When you're dealing with lots of keys, they're usually the best option out there. But if you're dealing with a small number of keys, a simple array can often be faster. A hash table lookup involves a constant amount of overhead resulting from calculating the key, and possibly also added indirection depending on whether the implementation uses open addressing. If the amount of time that would be spent on a simple linear or binary search of the array is less than (or even not much more than) the time that would be spent on that overhead, then the array will be faster.
Of course, the point where the hash table overtakes the array depends on a lot of factors. As usual the only sure-fire way to know which option is better in a particular situation is to measure.
Depending on the hash and hash table implementation, large hash tables where elements are spread across several pages in memory can pathologically thrash your caching. You can get similar behaviour to randomly accessing elements in a large array. A cache miss from L2 to RAM is extremely slow and can be about a million times slower if you have to page fault all the way into swap.
This is a good example of a case where the standard model of computer that's used in CS proofs starts to fall over because it ignores hierarchical memory and cache locality.
Hand wavily, accessing every element once in a large naive hash table will give you O(n) page faults on average whereas accessing every element sequentially in a vector will give you O(n/d) page faults where d is your page size. While asymptotically similar, in real life that constant d can be gigantic.
That being said, I'm not sure how the C++ unordered_map handles this, it's just something I experienced when running cachegrind on someone's hand rolled implementation in the past. (I 'fixed' it by making the templated class store pointers, and then locality was greatly improved because malloc's default behaviour.)
Yes, but a properly-sized hash table will result in an average of 1 to 2 cache misses per lookup regardless of how many elements. This is far better than what you get with other structures, except perhaps in specialized cases that happen to follow tree traversal.
and can be about a million times slower if you have to page fault all the way into swap.
Disk based data requires specialized algorithms.
accessing every element once in a large naive hash table will give you O(n) page faults on average whereas accessing every element sequentially in a vector will give you O(n/d) page faults where d is your page size. While asymptotically similar, in real life that constant d can be gigantic.
Sure, a vector is better when you only ever access elements in a single sequential order and don't add or remove elements anywhere except the end.
That being said, I'm not sure how the C++ unordered_map handles this
C++ defines a method of iterating over all the elements of an unordered_map. I think a typical implementation will just treat the hash memory as a vector and return values from the occupied buckets.
(Of course, the chunks must eventually be processed linearly if they're all full and the key isn't found till the last chunk, but this is again a degenerate case which generally isn't a problem as long as the hash table is resized often enough as it grows.)
Really? That seems about as likely as the statement 'carpenters irrationally avoid using screws'.
EA released a GPL'd "EASTL" library a few years ago for those and other reasons, you can read all about it here: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n227...
Here, I found it, we both posted in this thread, but not to each other: http://news.ycombinator.com/item?id=3560158
If all your elements magically hash to the same bucket somehow and you end up having to make a list or a tree out of them, you're still no worse off than if you used a list or a tree in the first place.
They're not just 'good for inexact matches' - they're optimised for things like database indexes, to reduce the number of seeks over a block-based device (eg a filesystem) required for a index lookup.
"Trees are better
In general, trees aren’t faster for searching for exact matches. They’re slower, and here’s peer-reviewed evidence that compares B-trees, splay trees, and hashing for a variety of string types. So, why use a tree at all? Trees are good for inexact matches — say finding all names that begin with “B” — and that’s a task that hash tables can’t do."
to paraphrase: "hashtables are faster. here's some research showing b-trees are slower than hashtables. why would you ever use a tree? prefix matching, that's why."
which misses the point that the primary reason for using a B-tree as a general purpose database index is not about prefixes, it's about block devices (prefixes is an important reason, but it's not the main one)
if the purpose of the article was to ignore implementation and spinning disks and just look at in-memory performance, then don't mention B-trees. they're an implementation-specific optimisation, and it just confuses.
Aren't all of these statements obvious is you actually understand what a hashtable is?
It claims that it's a myth that "trees are better" than hash tables and just describes trees as better for "finding inexact matches", as in the example "finding all names that begin with B". That's too vague. It's misleading because:
1. Trees can't just answer any random vague questions, such as "Find all names that have a P anywhere". That doesn't really describe the advantages of trees. I think a more accurate formulation would be "iterating over a range of values" or "finding values based on some ordering" (eg. find the smallest element bigger than X). Or for walking the entire contents in some predefined order. Sure, this is not an article about trees, but it doesn't take more than two sentences to describe that in a much better manner.
2. For any of the former query patterns, trees are actually much better than hash tables. That's definitely not a myth, which this article would sorta have you believe.
> Chained hash tables are a very good idea. Problem solved.
Wrong. If you don't resize your hash table, the complexity of accesses will degrade linearly as it exceeds its size. If your hash uses, say, 100 buckets each with a linked lists and stores N > 100 values, an access has complexity O(N) (sure, O(N/100) in the best case (assuming perfect spread of hash keys), but that's really O(N)).
Finally, "The worst case is terrible" myth part seems somewhat naive. I think it should at least mention the idea of hash table collision DOS attacks. "This doesn't happen in practice" unless there's some attacker out to get you by triggering that. There's plenty of examples of that happening in practice, of systems vulnerable to that. See example presentation here:
http://www.youtube.com/watch?v=R2Cq3CLI6H8&lr=1 Sure, that's not a reason to switch away from hash tables, but it's definitely something to consider when using them. I get a bad flavor from this simplification of "yeah, this doesn't happen in practice, at least as long as you know what your keys are".
Why did they wait so ridiculously long to add this to the standard?!
I would think the amortized cost would be O(log_b n), with b being something like the Shannon entropy of each symbol.
Really both a trie and a hash-function boil down to O(m) where m is the length of the key.
To begin with, it is not a "myth" that hash tables have poor worst-case performance. It is true that this matters little for most applications, and not at all for some, but the worst case is still there. So I would avoid using hash tables, e.g., in safety-critical systems: medical, military, etc. (Or, as I tell my students: don't program a pacemaker in Python.)
And we should note that the excuse that datasets leading to poor performance are very rare, went out 20 years ago. When the person providing your dataset is a teenager who's trying to break your system just for the fun of it, rarity does not mean quite what you might think.
Second, hash tables do indeed have performance problems when they fill up. True, there are ways around this, and any decent implementation of a mutable hash table will include a strategy for handling these situations.
And that suggests a myth the author of this article might believe: that any ol' programmer can & should write their own hash table. I don't agree. Don't write a hash table for use in production code unless you're sure you know what you're doing. And even if you do, someone else's polished & tested hash table is likely to be better than what you're going to come up with.
Lastly, balanced search trees do have better worst-case behavior. That mostly doesn't matter, but sometimes it does. (You have my permission to use a Red-Black Tree in your pacemaker code.)
My reason for shrugging off hash tables for most requirements: cost/complexity of inserts and deletes. Hash tables are amazing when you do offline index generation (circa Google "non-realtime" search).
However, if you ever need to have frequent updates, deletes, or inserts, you need to grow the hash table and in doing so managing the hash table usurps any search gains. For some use cases that's fine - perhaps you only optimize for reads. Hash tables are awesome for read heavy loads. Hash tables are sub-optimal for mixed read/write work loads.
On that note, I believe the Red-Black btree is far superior to hash tables for mixed work loads. RBtree's tend to be more predictable and when your application needs the insert/delete (ie, for real time updates) you don't have to worry about having to find a new data structure.
On a related note, XFS had to switch their hash over to an rbtree to fix a performance issue related to hashes falling over in costs of insert/deletes:
Yes there are myths about hash tables but article doesn't touch on the real truths of why engineers avoid hash tables.
(think also to peruse the documentation and the tests!)
I'm open to any remarks and criticisms: performance, code style, etc. Since the implementation is meant to be simple and educational I may not act on all of them, tough.
Hash functions rarely are meaningful outside of the context of the collection.
The tree interface leads to a simpler world.
"The worst case is terrible" is not a myth, it's a reality-- and it's spawned several major security vulnerabilities recently. CVE-2012-1150 for example.
"Hash tables become full, and bad things happen." Yes. Yes they absolutely do. It depends on your implementation exactly what will happen, but in every case, you are in for some pain. If you choose the "linked list per element" route, congratulations-- your hash table will gradually approximate the lookup behavior of an unsorted linked list as the number of elements gets larger. If you choose rehashing, you'll have to deal with long pauses in your program during a rehash. Hello, latency spikes! Incremental rehashing could improve this, but only at the cost of degrading memory locality even further.
"Trees are better." For most applications, trees really ARE better. Trees are simple to use and they use less memory most of the time. You really should not be using hash tables unless you have a really good idea of how many elements you are going to have, AND you believe that they will not be accessed in any particular pattern.
"Hash functions are slow." I don't even know what this statement is supposed to mean. But I do know that there are a lot of bad hash functions out there, and it's easy to choose a bad one as a novice programmer. The right hash function will depend on your data-- something I note that this article fails to mention.
"Hash tables use too much memory." A wise programmer once told me, "fancy data structures are only good when N is large-- and N is usually small." That definitely applies here. Another way to look at this is that you need to keep the hash table half empty to keep the good properties that made you choose a hash table in the first place: O(1) lookup, O(1) insert, O(1) delete, etc. So you will just have to pay the memory price.
Hash tables can be very useful for certain applications. But in general, they're an optimization that you should make only if you know what you're doing. You should be very clear on why you're using them, how many elements you expect to have, and what your strategy is if you overflow. It's not for beginners. And if you are optimizing, you should also be aware that something like a radix tree or a B-Tree may lead to much better performance, due to its much better memory locality.
Just about the only really intriguing use for hash tables on a modern CPU is in lockless data structures, but if you're considering writing one of those-- you already know all the stuff I just wrote.