
The author seems unaware that GCC's implementation of std::hash is the identity on integral types: https://github.com/gcc-mirror/gcc/blob/cf8c140a8b8131b3ab4f5...
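
For example, on libstdc++ this tiny program (a sketch, assuming a recent GCC) prints the inputs back unchanged:

    #include <functional>
    #include <iostream>

    int main() {
        // On libstdc++, std::hash of an integral type returns the
        // value itself, so this prints "42 1000000".
        std::cout << std::hash<int>{}(42) << ' '
                  << std::hash<int>{}(1000000) << '\n';
    }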

This is an incredibly poor design choice (at Facebook we'd say "clowntown") as it makes std::hash unusable with any sophisticated hashtable data structure that assumes reasonable randomness. That's what causes the quadratic growth in the sequential example with dense_hash_map.

The solution is to use better hash functions, not prime hash table sizes. A prime modulus will only make the problems harder to notice; if you have a poor hash function, you're giving up every complexity guarantee. Note that even if you reduce modulo a prime > n, you'll still have the identity over the first n numbers.
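
A minimal illustration of that last point, assuming an identity hash and a prime table size:

    #include <cstddef>
    #include <iostream>

    int main() {
        const std::size_t prime = 1000003;  // a prime table size
        // With an identity hash, every key smaller than the prime maps
        // to slot == key, so the prime modulus adds no randomness at
        // all for those keys.
        for (std::size_t key = 0; key < 5; ++key)
            std::cout << key % prime << ' ';  // prints: 0 1 2 3 4
        std::cout << '\n';
    }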

Personally I just use folly::Hash (https://github.com/facebook/folly/blob/da08e63ce3bac9196b18d...) as it uses sane defaults for hash functions, and, unlike std::hash, it's a polymorphic functor, so you can drop it in as a template argument without having to specify the key type: dense_hash_map<int, int, folly::Hash>.
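
For instance (a sketch, assuming both folly and sparsehash are installed):

    #include <folly/hash/Hash.h>
    #include <google/dense_hash_map>

    int main() {
        // folly::Hash is polymorphic: the same functor type works for
        // any hashable key, so there is no key type to spell out.
        google::dense_hash_map<int, int, folly::Hash> counts;
        counts.set_empty_key(-1);  // dense_hash_map reserves one key value
        counts[42] = 1;
    }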

The other reason dense_hash_map uses power-of-2 sizes (other than to avoid the modulo) is that a power-of-2 size is sufficient to guarantee that the quadratic probing schedule is complete, that is, that it can reach every slot of the table. Quadratic probing is used because it's very effective for unsuccessful lookups. You can find a table of expected probe counts at different load factors here: https://github.com/sparsehash/sparsehash/blob/4cb924025b8c62...
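
A sketch of why the power-of-2 size is sufficient, using triangular increments (+1, +2, +3, ...) as sparsehash does:

    #include <cstddef>
    #include <iostream>

    int main() {
        // With a power-of-two table size, probing with triangular
        // increments visits every slot exactly once in the first
        // `size` probes.
        const std::size_t size = 8;        // power of two
        std::size_t index = 5;             // some home slot
        for (std::size_t step = 1; step <= size; ++step) {
            std::cout << index << ' ';     // prints 8 distinct slots
            index = (index + step) & (size - 1);
        }
        std::cout << '\n';                 // 5 6 0 3 7 4 2 1
    }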

Hash tables are a fine craft. While it's interesting to gain this experience first-hand, this is an amateurish post and it should be taken with a grain of salt.




I think "amateurish" is overly harsh here. He sets out a very professionial plan of measuring and testing. It is a nice post.

Yes, it's probably true that requiring a non-broken hash function is strictly better than his preferred prime-modulus workaround. Okay, so? If you look closely, the whole post also compares a power-of-two version of his hashtable. Other than the one graph showing the pathological behaviour of dense_hash with in-order inputs, everything in the post applies assuming a reasonable hash function.


I didn't mean "amateurish" in a judgmental way, it's a matter of fact that the author is a novice when it comes to hash tables. He himself admits that he doesn't understand some of the details ("Why do prime numbers help? I can’t quite explain the math behind that"), so he gets some of the conclusions wrong.

> Other than the one graph showing pathological behaviour of dense_hash with in-order inputs

That's a pathological behavior of std::hash, not of dense_hash.

> everything in the post applies assuming a reasonable hash function.

No, see this paragraph:

    All of these problems are solvable if you’re careful about choosing a hash
    function that’s appropriate for your inputs. But that’s not a good user
    experience: You now always have to be vigilant when using
    hashtables. Sometimes that’s OK, but sometimes you just want to not have to
    think too much about this. You just want something that works and doesn’t
    randomly get slow. That’s why I decided to make my hashtable use prime
    number sizes by default and to only give an option for using powers of two.

The belief is that prime number sizes will defend you against a poor hash function. That's very dangerous advice.


> This is an incredibly poor design choice (at Facebook we'd say "clowntown") as it makes it unusable with any sophisticated hashtable data structure that assumes reasonable randomness. That's what causes the quadratic growth in the sequential example with dense_hash_map.

Nonsense. GCC developers have absolutely no way of knowing or guessing how the distribution of all integers maps into the domain of positive integers. And any guess (without knowledge of the specific application domain) would have as many degenerate cases as the identity.

Maybe Facebook heuristically found some better general-purpose hashes for their use case. But "clowntown" is a denigration.


I read the section on sequential numbers as just illustrating a worst-case scenario that could happen with either a naive hasher or malicious input.

As far as I can see all the graphs but one use random numbers, so hashing won't add anything given good randomness.

I've heard good arguments why linear probing should perform as well as quadratic probing when using Robin Hood bucket stealing, e.g. http://codecapsule.com/2013/11/11/robin-hood-hashing/#commen...
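
For reference, a minimal sketch of Robin Hood insertion with linear probing (ignoring resizing, deletion, and duplicate keys, and assuming an identity hash and a power-of-two table):

    #include <cstddef>
    #include <utility>
    #include <vector>

    struct Table {
        struct Slot { std::size_t key = 0; bool used = false; };
        std::vector<Slot> slots{16};  // power-of-two size

        std::size_t home(std::size_t key) const {
            return key & (slots.size() - 1);
        }
        void insert(std::size_t key) {
            std::size_t idx = home(key), dist = 0;
            while (slots[idx].used) {
                // Displacement of the current occupant from its home slot.
                std::size_t occupant =
                    (idx - home(slots[idx].key)) & (slots.size() - 1);
                if (occupant < dist) {     // steal from the "richer" element
                    std::swap(key, slots[idx].key);
                    dist = occupant;
                }
                idx = (idx + 1) & (slots.size() - 1);
                ++dist;
            }
            slots[idx] = {key, true};
        }
    };

This keeps probe lengths tightly clustered, which is the intuition behind the claim.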

> Note that even if you reduce modulo a prime > n, you'll still have the identity over the first n numbers.

Which can be great! E.g. Python uses the identity function for number hashes, because if you're inserting mostly sequential numbers into a dict this gives you better cache performance.
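
A rough illustration of the locality argument, assuming an open-addressing table with a power-of-two size:

    #include <cstddef>
    #include <iostream>

    int main() {
        // With an identity hash, sequential keys land in consecutive
        // slots, so looking them up touches adjacent cache lines
        // instead of jumping around the table.
        const std::size_t size = 1024;  // power-of-two table
        for (std::size_t key = 100; key < 105; ++key)
            std::cout << (key & (size - 1)) << ' ';  // 100 101 102 103 104
        std::cout << '\n';
    }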


> Python uses the identity function for number hashes, because if you're inserting mostly sequential numbers into a dict this gives you better cache performance.

That's a good idea when you control both the hash function and the hashtable implementation, as in the Python interpreter, so you can make assumptions on both sides. My point was that when using composable templates, as in the STL, the hash function should not make any assumptions about the hash table, and vice versa.


> Quadratic probing is used because it's very effective for unsuccessful lookups

Is it really that good?

If you end up having to probe a lot with linear probing, it's likely you have a bad hash function, which is the same argument you make against prime moduli (that they are a stopgap more than a solution).

Second, don't you end up having to jump all over the place, which means cache misses, which are way more expensive than a few probes on consecutive slots?
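
To make the concern concrete, here are the offsets from the home slot after each probe (a small sketch, using the triangular schedule from above for the quadratic case):

    #include <cstddef>
    #include <iostream>

    int main() {
        // Linear probing stays within a cache line or two; triangular
        // quadratic probing leaves the home cache line after only a
        // few probes.
        std::size_t linear = 0, quadratic = 0;
        for (std::size_t i = 1; i <= 6; ++i) {
            linear += 1;     // offsets: 1 2 3 4 5 6
            quadratic += i;  // offsets: 1 3 6 10 15 21
            std::cout << linear << '/' << quadratic << ' ';
        }
        std::cout << '\n';
    }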


I found the lecture notes that made this claim on the Internet Archive: https://web.archive.org/web/20031127062932/http://www.august... Unfortunately, they don't cite any sources.


The documentation for std::hash in GCC clearly states the integral-type identity property, gives a reason for it, and suggests it's not the best choice in some circumstances.

That means it's not amateurish: it was an intentional decision, likely by somebody who implements hash functions for a living, who gave thought to what the appropriate default should be. It is a trivial perfect hash, of course.


ot didn't say that GCC's std::hash is amateurish; (s)he said that the article is: "...this is an amateurish post and it should be taken with a grain of salt..."

(And clarified in another reply that it was meant only in respect to hash-tables, not the authors programming skill in general)

(s)he did call it an incredibly poor design choice, however, which it may or may not be, regardless of whether the GCC developers are amateurish or not.


Yes, I misread the post and the replies. You can sed my reply s/amateurish/poor design choice/ and the point stands.


Ouch, this is much more thorough than my attempt to debunk the post.


Hey, this really belongs on one of your older comments about starting a blog, but that comment was too old to reply to.

Can you add me to the list of people to contact when you get it off the ground? Cheers.

prostorm.blue@gmail.com



