This is an incredibly poor design choice (at Facebook we'd say "clowntown"), as it makes std::hash unusable with any sophisticated hashtable data structure that assumes reasonable randomness. That's what causes the quadratic growth in the sequential example with dense_hash_map.
The solution is to use better hash functions, not prime hash table sizes. Prime modulus will only make the problems harder to notice, but if you have a poor hash function, you're giving up every complexity guarantee. Note that even if you reduce modulo a prime > n, you'll still have the identity over the first n numbers.
Personally I just use folly::Hash (https://github.com/facebook/folly/blob/da08e63ce3bac9196b18d...) as it uses sane defaults for hash functions, and unlike std::hash it's a polymorphic functor, so you can drop it in as a template argument without having to specify the key type: dense_hash_map<int, int, folly::Hash>.
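For context, the behaviour being complained about is easy to check directly. A minimal sketch, assuming GCC's libstdc++ or LLVM's libc++ (both hash integral types with the identity function); `hash_int` is just an illustrative wrapper name, not a real API:

```cpp
#include <cstddef>
#include <functional>

// On libstdc++ and libc++, std::hash for integral types is the identity:
// sequential keys hash to sequential values. Masked into a power-of-2
// table size, they land in consecutive buckets, which is what triggers
// the pathological in-order case with dense_hash_map.
inline std::size_t hash_int(int k) {
    return std::hash<int>{}(k);  // returns k itself on these implementations
}
```

Compiled with GCC or Clang, `hash_int(k)` simply returns `k` for any non-negative `k`, which you can confirm by printing a few values. (MSVC's standard library uses FNV-1a instead, so this is implementation-specific, not guaranteed by the standard.)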
The other reason dense_hash_map uses power-of-2 sizes (other than to avoid the modulo) is that it is sufficient to guarantee that the quadratic probing schedule is transitive, that is, it can reach all the slots of the table. Quadratic probing is used because it's very effective for unsuccessful lookups. You can find a table of expected probe counts at different load factors here:
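To illustrate the guarantee, here is a sketch of triangular-number quadratic probing (step sizes 1, 2, 3, ..., i.e. offsets i*(i+1)/2), the kind of schedule dense_hash_map uses; `probe_sequence` is an illustrative name, not dense_hash_map's actual code:

```cpp
#include <cstddef>
#include <vector>

// Triangular-number quadratic probing: probe offsets from the home slot
// are 0, 1, 3, 6, 10, ... (i.e. i*(i+1)/2). When table_size is a power
// of two, this sequence is a permutation of the slots, so an
// unsuccessful lookup is guaranteed to reach every slot before cycling.
std::vector<std::size_t> probe_sequence(std::size_t home, std::size_t table_size) {
    std::vector<std::size_t> slots;
    std::size_t pos = home & (table_size - 1);  // mask instead of modulo
    for (std::size_t i = 1; slots.size() < table_size; ++i) {
        slots.push_back(pos);
        pos = (pos + i) & (table_size - 1);  // step grows by 1 each probe
    }
    return slots;
}
```

For a 16-slot table starting at slot 5, the sequence visits all 16 slots exactly once; with a non-power-of-2 size (say 12) the same schedule revisits slots before covering the table.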
Hash tables are a fine craft. While it's interesting to gain these experiences first-hand, this is an amateurish post and it should be taken with a grain of salt.
Yes, probably true that requiring a non-broken hash function is strictly better than his preferred prime modulus workaround. Okay, so? If you notice, the whole post keeps comparing a power-of-two version of his hashtable as well. Other than the one graph showing pathological behaviour of dense_hash with in-order inputs, everything in the post applies assuming a reasonable hash function.
> Other than the one graph showing pathological behaviour of dense_hash with in-order inputs
That's a pathological behavior of std::hash, not of dense_hash.
> everything in the post applies assuming a reasonable hash function.
No, see this paragraph:
> All of these problems are solvable if you’re careful about choosing a hash function that’s appropriate for your inputs. But that’s not a good user experience: You now always have to be vigilant when using hashtables. Sometimes that’s OK, but sometimes you just want to not have to think too much about this. You just want something that works and doesn’t randomly get slow. That’s why I decided to make my hashtable use prime number sizes by default and to only give an option for using powers of two.
Nonsense. GCC developers have absolutely no way of knowing or guessing how the distribution of all integers maps into the domain of positive integers. And any guess (without knowledge of the specific application domain) would have as many degenerate cases as the identity.
Maybe heuristically Facebook found some better general purpose hashes for their use case. But "clowntown" is a denigration.
As far as I can see all the graphs but one use random numbers, so hashing won't add anything given good randomness.
I've heard good arguments why linear probing should perform as well as quadratic probing when using Robin Hood bucket stealing, e.g. http://codecapsule.com/2013/11/11/robin-hood-hashing/#commen...
> Note that even if you reduce modulo a prime > n, you'll still have the identity over the first n numbers.
Which can be great! E.g. Python uses the identity function for number hashes, because if you're inserting mostly sequential numbers into a dict this gives you better cache performance.
That's a good idea when you control both the hash function and the hashtable implementation, as in the Python interpreter, so you can make assumptions on both sides.
My point was that when using composable templates, as in the STL, the hash function should not make any assumptions about the hash table, and vice versa.
Is it really that good?
If you end up having to probe a lot with linear probing, it is likely you have a bad hash function, which is the same argument you make against prime moduli (that they are a stopgap more than a solution).
Second, don't you end up having to jump all over the place, which means cache misses, which are way more expensive than a few probes on consecutive slots?
That means it's not amateurish: it was an intentional decision, likely by somebody who implements hash functions for a living, who gave thought to what the appropriate default should be. It is a trivial perfect hash, of course.
(And clarified in another reply that it was meant only in respect to hash-tables, not the authors programming skill in general)
(s)he did call it an incredibly poor design choice, however. Which it may or may not be, regardless of whether GCC developers are amateurish or not.