Making std::hash the identity function on integers is an incredibly poor design choice (at Facebook we'd say "clowntown"), as it makes it unusable with any sophisticated hashtable data structure that assumes reasonable randomness. That's what causes the quadratic growth in the sequential example with dense_hash_map.
The solution is to use better hash functions, not prime hash table sizes. A prime modulus only makes the problem harder to notice: with a poor hash function you're giving up every complexity guarantee anyway. Note that even if you reduce modulo a prime > n, you'll still have the identity over the first n numbers.
Personally I just use folly::Hash (https://github.com/facebook/folly/blob/da08e63ce3bac9196b18d...) as it uses sane defaults for its hash functions, and unlike std::hash it's a polymorphic functor, so you can drop it in as a template argument without having to specify the key type: dense_hash_map<int, int, folly::Hash>.
The other reason dense_hash_map uses power-of-2 sizes (besides avoiding the modulo) is that a power-of-two table is sufficient to guarantee that the quadratic probing schedule is exhaustive, that is, that it can reach all the slots of the table. Quadratic probing is used because it's very effective for unsuccessful lookups. You can find a table of expected probe counts at different load factors here:
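A minimal sketch of that exhaustiveness property (my code, not from the post): with a power-of-two table and triangular-number probe offsets 1, 3, 6, 10, ..., the probe sequence visits every slot exactly once, so probing always terminates at an empty slot if one exists:

    #include <cassert>
    #include <cstddef>
    #include <vector>

    int main() {
        const std::size_t size = 1u << 10;   // must be a power of two
        const std::size_t mask = size - 1;
        std::vector<bool> visited(size, false);
        std::size_t slot = 0;                // any starting hash works
        for (std::size_t i = 0; i < size; ++i) {
            visited[slot] = true;
            slot = (slot + i + 1) & mask;    // step 1, 2, 3, ... = triangular offsets
        }
        for (bool v : visited) assert(v);    // every slot was reached
    }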
Hash tables are a fine craft. While it's interesting to gain these experiences first-hand, this is an amateurish post and it should be taken with a grain of salt.
Yes, it's probably true that requiring a non-broken hash function is strictly better than his preferred prime-modulus workaround. Okay, so? If you look, the post also benchmarks a power-of-two version of his hashtable throughout. Other than the one graph showing pathological behaviour of dense_hash with in-order inputs, everything in the post applies assuming a reasonable hash function.
> Other than the one graph showing pathological behaviour of dense_hash with in-order inputs
That's a pathological behavior of std::hash, not of dense_hash.
> everything in the post applies assuming a reasonable hash function.
No, see this paragraph:
> All of these problems are solvable if you're careful about choosing a hash function that's appropriate for your inputs. But that's not a good user experience: You now always have to be vigilant when using hashtables. Sometimes that's OK, but sometimes you just want to not have to think too much about this. You just want something that works and doesn't randomly get slow. That's why I decided to make my hashtable use prime number sizes by default and to only give an option for using powers of two.
Nonsense. GCC developers have absolutely no way of knowing or guessing how the distribution of all integers maps into the domain of positive integers. And any guess (without knowledge of the specific application domain) would have as many degenerate cases as the identity.
Maybe heuristically Facebook found some better general purpose hashes for their use case. But "clowntown" is a denigration.
As far as I can see all the graphs but one use random numbers, so hashing won't add anything given good randomness.
I've heard good arguments why linear probing should perform as well as quadratic probing when using Robin Hood bucket stealing, e.g. http://codecapsule.com/2013/11/11/robin-hood-hashing/#commen...
> Note that even if you reduce modulo a prime > n, you'll still have the identity over the first n numbers.
Which can be great! E.g. Python uses the identity function for number hashes, because if you're inserting mostly sequential numbers into a dict this gives you better cache performance.
That's a good idea when you control both the hash function and the hashtable implementation, as in the Python interpreter, so you can make assumptions on both sides.
My point was that when using composable templates, as in the STL, the hash function should not make any assumptions about the hash table, and vice versa.
Is it really that good?
If you end up having to probe a lot with linear probing, you likely have a bad hash function, which is the same argument you make against prime moduli (that they're a stopgap more than a solution).
Second, with quadratic probing don't you end up having to jump all over the place, which means cache misses, which are way more expensive than a few probes on consecutive slots?
That means it's not amateurish: it was an intentional decision, likely by somebody who implements hash functions for a living, who gave thought to what the appropriate default should be. It is a trivial perfect hash, of course.
(And clarified in another reply that it was meant only in respect to hash tables, not the author's programming skill in general.)
(s)he did call it an incredibly poor design choice, however. Which it may or may not be, regardless of whether GCC developers are amateurish or not.
Can you add me to the list of people to contact when you get it off the ground? Cheers.
I've certainly done this in hash tables I've implemented at times, and I've seen this in library hash tables too.
One public well-known example that springs to mind is Cliff Click's NonBlockingHashMap in Java.
Edit: Apparently I can no longer delete the comment. Stupid HN.
He is also going to run into modulo bias, since using the remainder to calculate a slot position is biased toward certain positions for every table size that isn't a power of two (see https://ericlippert.com/2013/12/16/how-much-bias-is-introduc... for cool graphs). Prime table sizes do nothing to fix this issue. The power-of-two size with a bitmask is not just faster, it also gets rid of the modulo bias.
Fastrange is much faster than modulo but if his goal is to build the fastest hashtable it's stupid to use a table size of anything except a power of two.
Source: I also wrote a hashtable :)
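To make the bias mentioned above concrete, here's a toy example (hypothetical sizes, my code): map an 8-bit "hash" (0..255) into 10 slots by remainder. Since 256 = 25 * 10 + 6, slots 0-5 each receive 26 preimages while slots 6-9 receive only 25:

    #include <cstdio>
    #include <vector>

    int main() {
        const int slots = 10;
        std::vector<int> count(slots, 0);
        for (int h = 0; h < 256; ++h)
            ++count[h % slots];              // remainder-based slot selection
        for (int s = 0; s < slots; ++s)
            std::printf("slot %d: %d\n", s, count[s]);  // 26, 26, ..., 25, 25
    }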
[...] using a power of two to size the table so that a hash can be mapped to a slot just by looking at the lower bits.
I think the text is clear that the reason the hash table supports power of two size tables is to avoid the modulo operation.
From the article:
>This is the main reason why really fast hash tables usually use powers of two for the size of the array. Then all you have to do is mask off the upper bits, which you can do in one cycle.
For instance, several popular hash algorithms (of the form 31 * hash + value) will have the same upper bits for any two strings of the same length, up to about 12 characters, which is when the seed value finally gets pushed out. Unless you're still using 32-bit hashes for some reason, in which case it's most strings under 6 characters long.
Multiplying by a prime and then adding or xoring the next value guarantees that the bottom bits are less biased, so it fits any hashtable that reduces the hash with modular math, even if the modulus is taken by bit masking. Getting the upper bits to mix would take a different operator or larger operands.
I know I've encountered at least one hash function that uses a much larger prime (16 or more bits) as its multiplier, so it's not unheard of, but it's certainly not universal.
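A quick demonstration of the upper-bit problem (my code: a 64-bit variant of the Java-style polynomial hash, with seed 0 for simplicity). Since each character adds only about 5 bits (log2 31) of magnitude, the hashes of short same-length strings never reach the upper bits at all:

    #include <cstdint>
    #include <cstdio>
    #include <string>

    uint64_t poly31(const std::string& s) {
        uint64_t h = 0;
        for (unsigned char c : s) h = 31 * h + c;   // classic 31 * hash + value
        return h;
    }

    int main() {
        // Two different 8-character strings: 31^8 is only about 2^40,
        // so both hashes leave the top 16 bits zero, i.e. identical.
        std::printf("%016llx\n", (unsigned long long)poly31("aardvark"));
        std::printf("%016llx\n", (unsigned long long)poly31("zucchini"));
    }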
But he doesn't know about ctz (counting trailing zeros) for fast load/fill-rate detection. With 50% it's trivial (just a shift), but for 70% or 90% you'd need to use __builtin_ctz.
If you happened to use 2^n ± 1 you'd have very little bias according to that map, but you wouldn't strictly be using a power of 2.
Unfortunately Java's original hashtable started at 11 and increased using f(m) = 2 * f(m - 1) + 1, giving you 23, 47, 95, 191, etc. That lands pretty close to the bad parts of the bias curve shown in that link.
Makes me wonder, if Java hashtables had been given a default size of 15 or 7, if this ever would have been fixed...
(Speaking from experience: I used his trick to accelerate one of the TensorFlow hash tables a while ago -- https://github.com/tensorflow/tensorflow/commit/a47a30018502... )
I chose to use Cuckoo in my implementation of a high performance fixed-size cache for bitcoin signature validation (see https://github.com/bitcoin/bitcoin/blob/master/src/cuckoocac...). Cuckoo is great in my experience if you want to use memory fully and can afford > 3 high quality hashes per entry. I also like cuckoo because you're able to use it pretty well for lockfree/high concurrency use cases (can make concurrent erases and reads lock free for example), which made it a good fit for the bitcoin workload. It seems that this hashtable's approach isn't quite amenable to concurrent operations, but perhaps there are some tricks that could work for that.
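For readers unfamiliar with the scheme, here's a heavily simplified sketch of the core cuckoo insert (my code, not the bitcoin cache's: two tables, and illustrative multiplicative mixers standing in for the high-quality hashes a real implementation needs):

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <utility>
    #include <vector>

    struct CuckooSet {
        std::vector<std::optional<uint64_t>> t1, t2;
        explicit CuckooSet(std::size_t n) : t1(n), t2(n) {}

        std::size_t h1(uint64_t k) const { return (k * 0x9e3779b97f4a7c15ull) % t1.size(); }
        std::size_t h2(uint64_t k) const { return (k * 0xc2b2ae3d27d4eb4full) % t2.size(); }

        bool contains(uint64_t k) const {
            return t1[h1(k)] == k || t2[h2(k)] == k;
        }

        // Each key has exactly two possible homes; inserting evicts any
        // occupant, which then tries its other home, and so on. Returns
        // false if the eviction chain gets too long (a real table would
        // rehash; a fixed-size cache can just drop an entry).
        bool insert(uint64_t k) {
            for (int i = 0; i < 64; ++i) {
                auto& s1 = t1[h1(k)];
                if (!s1) { s1 = k; return true; }
                std::swap(k, *s1);
                auto& s2 = t2[h2(k)];
                if (!s2) { s2 = k; return true; }
                std::swap(k, *s2);
            }
            return false;
        }
    };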
What's used these days should be called "Robin Hood linear probing" or something else. It's just a variant of linear probing where an entry occupying a "better" slot than the one being inserted gets pushed back.
It might still be an interesting comparison but I'm under the impression that Robin Hood is a very good solution for a general purpose hash table.
The current trend, though, is to reduce the size of this array by using size-compressed indirect indices, as in the new Ruby/Python hashes.
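A minimal sketch of that push-back rule (my code; identity "hash" and no resize logic, just the insert path): each entry remembers its distance from its home slot, and a probing insert steals the slot of any entry that is closer to home than itself:

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <utility>
    #include <vector>

    struct RobinHood {
        struct Entry { uint64_t key; uint32_t dist; };   // distance from home slot
        std::vector<std::optional<Entry>> slots;
        explicit RobinHood(std::size_t n) : slots(n) {}  // n must be a power of two

        void insert(uint64_t key) {
            Entry e{key, 0};
            std::size_t i = key & (slots.size() - 1);    // home slot
            while (slots[i]) {
                if (slots[i]->dist < e.dist)             // occupant is "richer":
                    std::swap(e, *slots[i]);             // rob it, carry it onward
                i = (i + 1) & (slots.size() - 1);
                ++e.dist;
            }
            slots[i] = e;
        }
    };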
Especially for primes. I can imagine that various power-of-two-related values could have bit-twiddling shortcuts, but it seems weird to assume the compiler can turn x mod 711 into something better than the CPU's modulo instruction... yet it works: the compiler replaces the division with a multiply by a precomputed "magic" reciprocal constant plus a few shifts.
    mov  eax, edi            ; eax = x
    mov  edx, -1202107583    ; magic reciprocal for dividing by 711
    imul edx                 ; edx:eax = x * magic; high half in edx
    mov  eax, edi
    sar  eax, 31             ; eax = sign of x
    add  edx, edi            ; high half += x (the magic constant is negative)
    sar  edx, 9              ; edx ~= x / 711, before sign correction
    sub  edx, eax            ; edx = x / 711
    mov  eax, edi
    imul edx, edx, 711
    sub  eax, edx            ; eax = x - (x / 711) * 711 = x mod 711
This is something where function pointers / overloading would likely have even better performance than direct switching - when the hash table size changes, switch out the function from modulo_by_511 to modulo_by_711.
Anyway, Lemire's fastrange solves that problem better.
There's no need for a slow modulo.
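For reference, fastrange is a one-liner (a sketch from memory; note it keys off the *high* bits of the hash, so it still assumes a well-mixed hash function):

    #include <cstdint>

    // Maps a 32-bit hash into [0, n) with a multiply and a shift, no division.
    inline uint32_t fastrange32(uint32_t hash, uint32_t n) {
        return static_cast<uint32_t>((static_cast<uint64_t>(hash) * n) >> 32);
    }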
I can't see the shift that you normally need after pulling the result out of the overflow register, though. My asm is a little rusty.
I recommend the book as well, it's got a lot of really awesome tricks!
Division is hard to fully pipeline; IIRC Skylake can start a new division every 6 clock cycles, but the latency for 32 bit integer division is 19 cycles (which is pretty amazing BTW).
However, 64-bit division is terrible at ~100 cycles. It's true that 32-bit division is much faster, but now that targets are 64-bit, 64-bit divisions are very common. It's something to keep in mind when doing fast math: when 32 bits are enough and there are unavoidable divisions, use 32-bit integers.
But the trick in the link shared by jlrubin is more relevant for hash tables.
In the case of a switch like this I would try something horrible like a template for each prime, so that the code is inlined.
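Something like this (a sketch of the idea; the prime list is illustrative): each instantiation divides by a compile-time constant, so the compiler emits the multiply-and-shift sequence instead of a div, and the table just swaps which entry it dispatches to on resize:

    #include <cstddef>

    template <std::size_t Prime>
    std::size_t mod_prime(std::size_t h) { return h % Prime; }  // compiles to multiply+shift

    using ModFn = std::size_t (*)(std::size_t);

    // One entry per table size; bump an index on resize and call
    // kModTable[index](hash) to pick a slot.
    constexpr ModFn kModTable[] = {
        mod_prime<5>,  mod_prime<11>,  mod_prime<23>,  mod_prime<47>,
        mod_prime<97>, mod_prime<197>, mod_prime<397>, mod_prime<797>,
    };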
On a related note, I have in general concerns about latency spikes due to rehashing. Is there a good, general, incremental rehashing scheme?
If you follow a Shalev/Shavit hash-ordered-lists scheme the rehash doesn't even need to move anything, and can be done with a straight memcpy, or in a separate thread. The link below is an implementation that was intended for Cassandra (so is simplified and append-only, but also a concurrent map - which could be relaxed). It has the nice property that it will "fix" the backing array on-demand, so straight-up copying the existing contents of the backing array into the new region of the backing array, once doubled, will set things up for a gradual transition.
Similarly, it's trivial for the backing array to instead grow incrementally, and for an extra level of indirection to prevent ever having to reallocate older regions (since their contents never change on rehash).
Paper link: http://people.csail.mit.edu/shanir/publications/Split-Ordere...
If you rehash based on something like "number of reprobes" then it's probably impossible to absolutely guarantee that a rehash cannot be triggered before a prior one finishes, but the linked approach would permit growing the backing array discontiguously (assuming an extra level of indirection to the backing array). But this would obviously incur some extra overhead, and is not entirely dissimilar to hash tries, which also avoid any costly global rehash.
I think the class-level comment in the code I linked is perhaps the best introduction to split-ordered lists I know of (not hash-ordered; they're sorted by the reverse bitstring of the hash), since they're not widely discussed and not immediately intuitive.
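The key trick is tiny (my sketch): order items by the bit-reversed hash, so that when the bucket array doubles from 2^k to 2^(k+1), each bucket's items stay contiguous in the list and simply split in two; only new shortcut pointers get added:

    #include <cstdint>

    inline uint32_t reverse_bits(uint32_t x) {
        x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
        x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
        x = ((x & 0x0f0f0f0fu) << 4) | ((x >> 4) & 0x0f0f0f0fu);
        x = ((x & 0x00ff00ffu) << 8) | ((x >> 8) & 0x00ff00ffu);
        return (x << 16) | (x >> 16);
    }

    // Items in the list are sorted by this key; bucket b's items are
    // exactly those whose hash ends in b's bits, and they sit together.
    inline uint32_t split_order_key(uint32_t hash) { return reverse_bits(hash); }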
The idea is not super clever, so I'm sure many people have found & implemented it in papers and other languages, but I didn't find them. I'd be grateful if someone points me to prior work.
But the expected O(1) lookup of hash tables is hard to beat.
On the other hand: the performance characteristics of trees are much easier to predict.
And in many cases an HT lookup isn't distinguishable from a tree lookup, at least if it's a dedicated application data structure, like some index.
I feel like HTs are very good for smaller structures (excellent example: dictionary/hashtable-based interpreters), but for larger data structures (e.g.: indices) ... not so much.
One option is to move a few elements over from the old hash table on each operation.
A second option is to use linear hashing (different from linear probing). Linear hashing lets you increase the size of your hash table by one bucket at a time. The general idea is that once your hash table is full, you allocate a new bucket (generally from a preallocated array) and move half of the elements from an old bucket to the new bucket. I know Postgres used to use linear hashing, but I'm not sure about the current state. I heard they started using Robin Hood hashing, but I haven't looked at whether that's related to the code that uses linear hashing.
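A minimal sketch of the first option (my code, with std::unordered_map as stand-in storage): after a resize, keep the old table around and migrate a small fixed number of entries on every subsequent operation, so no single insert pays the whole rehash:

    #include <cstdint>
    #include <unordered_map>

    struct IncrementalMap {
        std::unordered_map<uint64_t, uint64_t> old_table, new_table;

        void migrate_some(int budget = 4) {              // a few entries per op
            auto it = old_table.begin();
            while (budget-- > 0 && it != old_table.end()) {
                new_table.insert(*it);
                it = old_table.erase(it);
            }
        }

        void insert(uint64_t k, uint64_t v) {
            migrate_some();
            new_table[k] = v;
        }

        uint64_t* find(uint64_t k) {
            migrate_some();
            if (auto it = new_table.find(k); it != new_table.end()) return &it->second;
            if (auto it = old_table.find(k); it != old_table.end()) return &it->second;
            return nullptr;
        }
    };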
Because the expected number of probes can be calculated from it: with linear probing, for example, a successful lookup takes about (1 + 1/(1 − α))/2 probes on average.
At the extreme, if every item produces the same hash code, growing on reprobes will result in exponential insert complexity (since every insert will trigger a grow). Growing on load factor will just maintain O(n) insert/lookup complexity (or the complexity of the hash bucket, if not open-addressing).
With less extreme skew, growing on load factor will simply use less memory than growing on reprobes, while reprobes will have a slightly lower median (and possibly mean) insert/lookup costs.
Are they fixed size, non-heap-allocated?
And would the performance improvement still be relevant if this is not the case, e.g. for variable size strings?
Power of 2 hashtables are so much faster because indexing uses a mask. The problem is that, as others have mentioned, powers of 2 suffer a lot with bad hash functions, and people tend to write pretty bad ones, so it's generally a good idea to salt or rehash hashcodes before dropping the upper bits with a mask.
In fact it looks like the author's results bear that out, as flat_hashmap_powers_of_two is almost always the fastest.
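One common fixup (a sketch; the constant is 2^64 divided by the golden ratio, i.e. Fibonacci hashing): multiply by a large odd constant and fold the high bits down before masking, so weak low bits still spread across the slot index:

    #include <cstdint>

    inline uint64_t mix_then_mask(uint64_t h, uint64_t mask) {  // mask = table_size - 1
        h *= 0x9e3779b97f4a7c15ull;  // large odd multiplier mixes into the high bits
        h ^= h >> 32;                // fold the well-mixed high bits down
        return h & mask;
    }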
I don't think my tone was too aggressive, except maybe for the word "amateurish"? I explained above what I meant; if you replace that word with a better one, does it change the tone of the post?
Yes, it does.
'Amateurish' is often used as an insult to imply that someone needn't have tried at all. If you replaced it with 'novice,' I think people would have understood your meaning without taking offense.
Also see answers to this Quora question:
> It's a way to bring much-needed levity to the process of solving hard problems.
> It's a playful, tongue-in-cheek, self-deprecating term to describe a situation in a company whose motto was "move fast and break things."
We detached this subthread from https://news.ycombinator.com/item?id=13745383 and marked it off-topic.
It's also faster than dense_hash_map.
The author claiming their hash table is the fastest ever is kind of ridiculous.
"And by that I mean that I have the fastest lookups of any hashtable I could find, while my inserts and erases are also really fast. (but not the fastest)"
Yes, it's the fastest that the author wrote. But that's not the sole claim: it's also that it has the fastest lookups of any hashtable the author could find.
The complaint being that he did not do a very thorough job of finding alternatives, and thus the title of "I wrote the fastest hashtable" is not reasonable.
aka "I'm the smartest/fastest/strongest person of the people I could find"
It's quite similar to how you wrote "fastest" instead of "fastest among all existing publicly available implementations". Nothing wrong with that, since it's obvious from context, much the same way the author's meaning is obvious from the article.
The author could just say "very fast" or "faster than common hash table libraries" in the title. The current title, IMHO, undermines the quality of the post (which is actually good).
I mean, I'd still call it clickbait, but it's hardly bullshit clickbait.