The issue with memory overhead (load factor) is actually something that Cuckoo hashing solves compared to linear probing. Bucketized Cuckoo hashing supports much higher load factors than linear probing. You can have your cake and eat it too.
I have implemented hash tables with open addressing and tombstones in the past (linear probing, triangular probing etc.) and I much prefer Cuckoo hashing now.
When Cuckoo hashing is combined with a bloom filter in the first position to reduce cache misses to the second position, it's almost perfect in every way: fast (drastically reduced cache misses), constant lookup time in the worst case (not so for linear probing), higher load factors (80% or more for Cuckoo hashing vs at most 50% for linear probing - this also affects amortized resize latency), not to mention elegant simplicity.
In case anyone is interested, here are some design details and decisions for an optimized bucketized Cuckoo hash table with bloom filters implemented for Node.js:
AFAIK, cuckoo works best when load factor is less than 50%. Wikipedia seems to also agree with me.
> Each bucket contains up to 8 elements to support a hash table load factor of 80% or higher, unlike linear or chained hash tables which support a load factor (elements / capacity) of at most 50% and which waste the other 50% of memory.
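To make the bucketized idea concrete, here's a toy bucketized cuckoo set (my own sketch, not the author's Node.js code: u64 keys, 4 slots per bucket instead of 8, and no bloom filter):

use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

const SLOTS: usize = 4; // toy value; the design quoted above uses 8 slots per bucket

struct CuckooSet {
    buckets: Vec<[Option<u64>; SLOTS]>,
    hashers: [RandomState; 2], // two independently keyed hash functions
}

impl CuckooSet {
    fn new(num_buckets: usize) -> Self {
        CuckooSet {
            buckets: vec![[None; SLOTS]; num_buckets],
            hashers: [RandomState::new(), RandomState::new()],
        }
    }

    // Candidate bucket `which` (0 or 1) for this key.
    fn index(&self, which: usize, key: u64) -> usize {
        let mut h = self.hashers[which].build_hasher();
        key.hash(&mut h);
        (h.finish() as usize) % self.buckets.len()
    }

    fn contains(&self, key: u64) -> bool {
        // Worst case is exactly two fixed-size buckets, which is the
        // constant worst-case lookup claimed upthread.
        (0..2).any(|w| self.buckets[self.index(w, key)].contains(&Some(key)))
    }

    fn insert(&mut self, mut key: u64) -> bool {
        if self.contains(key) {
            return true;
        }
        for kick in 0..64usize { // bounded displacement chain; a real table would grow on failure
            // Prefer a free slot in either candidate bucket.
            for w in 0..2 {
                let b = self.index(w, key);
                if let Some(slot) = self.buckets[b].iter_mut().find(|s| s.is_none()) {
                    *slot = Some(key);
                    return true;
                }
            }
            // Both candidate buckets full: evict a resident and try to re-home
            // it instead (the "cuckoo" step). Victim choice rotates so the
            // chain doesn't immediately undo itself.
            let b = self.index(kick % 2, key);
            std::mem::swap(&mut key, self.buckets[b][kick % SLOTS].as_mut().unwrap());
        }
        false // give up; the caller would resize/rehash
    }
}

fn main() {
    let mut set = CuckooSet::new(16); // 16 buckets * 4 slots = 64 slots
    for k in 0..40u64 {               // roughly 62% load
        assert!(set.insert(k));
    }
    assert!(set.contains(7) && !set.contains(1000));
}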
There's the related cuckoo filter as well: https://github.com/efficient/cuckoofilter
Thus, I would argue that the problem reduces to the case where you have a reasonable hash function.
This is not a purely theoretical concern, I should note. There are practical DoS attacks that can be carried out by deliberately crafting colliding keys. The current state of the art seems to be to make it impractical for attackers to accomplish this by incorporating secret data into the hash function, rather than by ensuring the theoretical worst case is acceptable. Which is probably the right approach.
It's premised on the idea that you can dynamically change the hash function such that eventually you will find one with fewer than some fixed number of items sharing the same hash value.
There are two problems:
1) It completely undermines the "worst-case constant lookup", because there is no upper bound on the number of times you might have to change the hash function and rebuild the table before you find one with sufficiently few collisions.
2) It is a much stronger requirement on the keys than other hash tables have. With other hash tables I can simply omit "difficult to hash" data from my hash function, on the basis that enough other data is hashed to avoid a significant performance penalty. With this implementation, the entire hash table will simply fail.
This additional failure mode is also a security concern - if an untrusted user can affect what data is added to the table, it could cause a severe DoS where the table will either enter an infinite loop trying to find a non-conflicting hash function, or have an unexpected failure mode (most people don't expect their hash tables to randomly fail). Even if it is somehow designed such that a hash function can always be found, the attacker could spend time up-front finding values that take a long time for that hash function to be reached, causing numerous re-hashes of the table.
> It's premised on the idea that you can dynamically change the hash function such that eventually you will find one with fewer than some fixed number of items sharing the same hash value.
Why would you think there are any fixed numbers? The OP is about a stdlib hash table (for Rust). Comparable hash table implementations all resize the table when necessary to decrease the load factor. Performance with respect to load factor still matters, of course (less resizing is better), but there are no fixed numbers of anything.
> It completely undermines the "worst-case constant lookup", because there is no upper bound on the number of times you might have to change the hash function and rebuild the table before you find one with sufficiently few collisions.
Why would you need to change the hash function or rebuild anything during lookup? It's lookup. Nobody is changing anything.
> It is a much stronger requirement on the keys than other hash tables have.
What hash table doesn't require keys to be hashable?
1. Tabulation hashing provides guarantees against chosen key attacks.
2. In addition, bucketized cuckoo hashing provides guarantees on the probability of resize (low compared to linear probing), and multiple resizes (very low compared to linear probing).
3. Further, most hash table implementations that I know of have layers of defense, including limiting the number of sequential resizes for a given key.
Putting everything together, if anything, I would be more concerned with the failure mode for linear probing (scanning the whole table before resizing) than for Cuckoo hashing (scanning two fixed-size buckets before resizing).
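For reference, simple tabulation hashing is just per-byte random-table lookups XORed together. A minimal sketch (tables filled from a toy PRNG here, rather than a real random source):

fn tabulation_hash(tables: &[[u64; 256]; 8], key: u64) -> u64 {
    // XOR together one random table entry per byte of the key. The random
    // table contents act as the secret that makes crafting collisions hard.
    let mut h = 0u64;
    for i in 0..8 {
        let byte = ((key >> (8 * i)) & 0xFF) as usize;
        h ^= tables[i][byte];
    }
    h
}

fn main() {
    // Fill the tables with pseudo-random bits (a real table would draw these
    // from a proper random source at creation time).
    let mut tables = [[0u64; 256]; 8];
    let mut state = 0x9E37_79B9_7F4A_7C15u64;
    for table in tables.iter_mut() {
        for entry in table.iter_mut() {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            *entry = state;
        }
    }
    println!("{:016x}", tabulation_hash(&tables, 42));
}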
Isn’t that a general problem that is addressed by randomizing the input data before hashing?
I assume there must be a substantial performance gain for this to be used, since it seems significantly more complicated. Any information on how much better it is?
I'm also curious if or how that implementation would translate to other languages. What he did might not be expressible in the Rust ownership model.
Update: So I just read the article (I didn't have the chance earlier) and I see it explains tombstones, which seem like a pretty clever solution I wasn't thinking about. What I had been confused about, though, was the other more-obvious attempt at a solution, which is backshifting. Where I think I had gotten stuck is what you do with the spot that opens up after the backshift... it might have been part of another probe chain (or multiple, for that matter) that had jumped over it before, so how do you find one of these chains to move back an item into this slot (which you would have to do)?
With open addressing, deletions do not decrease the load factor (at least not immediately, in general), because a deleted item becomes a tombstone. So for open addressing, load factor is (num_items + num_tombstones) / table_size.
The upshot is that in a scenario where items are being repeatedly added and removed, over time the load factor will gradually increase until eventually the table has to be recreated. This is true even if the average number of items remains relatively constant over time, unlike with a chaining implementation.
In this scenario, the average insertion time might be lower with open addressing, but the worst case will be much worse than for chaining.
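A toy model with made-up numbers, just to show the creep (it pessimistically assumes inserts land in empty slots rather than reusing tombstone slots):

fn main() {
    // Made-up numbers: a table with 1024 slots holding a steady 500 live items,
    // with keys constantly being deleted and different keys inserted.
    let capacity = 1024.0;
    let live = 500.0;
    let mut tombstones = 0.0;
    for cycle in 1.. {
        tombstones += 1.0; // each delete of a churned key leaves a tombstone behind
        let effective_load = (live + tombstones) / capacity;
        if effective_load > 0.875 {
            // The live item count never changed, yet the table still has to be rebuilt.
            println!("rebuild needed after {} churn cycles", cycle);
            break;
        }
    }
}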
# any time a tombstone immediately precedes an empty, it can be marked empty
[ $ _ ] -> [ _ _ ]
# any time you look up a key, it can be swapped with a tombstone immediately preceding it
[ $ A ] -> [ A $ ]
# (moving the tombstone closer to an empty that will destroy it)
# if you don't have iterators, you can also jump over other keys
[ $ B A ] -> [ A B $ ]
[ $ B C A ] -> [ A B C $ ] # etc
# (this will cause an iterator at A to yield B again, or at B to skip A)
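Here's the "jump over other keys" variant in code, on a toy linear-probing table (my own sketch; the layout is invented for illustration):

// Toy linear-probing table; "$" above is Tombstone, "_" is Empty.
#[derive(Clone, Copy)]
enum Slot {
    Empty,
    Tombstone,
    Full(u64, u64), // (key, value)
}

// Lookup that drags a found entry back over the earliest tombstone in its
// probe chain, pushing the tombstone toward an empty slot that will destroy it.
fn get(table: &mut Vec<Slot>, key: u64) -> Option<u64> {
    let mask = table.len() - 1;        // table length assumed to be a power of two
    let mut i = (key as usize) & mask; // stand-in "hash": just the key itself
    let mut first_tombstone = None;
    loop {
        match table[i] {
            Slot::Empty => return None,
            Slot::Tombstone => {
                if first_tombstone.is_none() {
                    first_tombstone = Some(i);
                }
            }
            Slot::Full(k, v) if k == key => {
                if let Some(t) = first_tombstone {
                    table[t] = Slot::Full(k, v); // key moves closer to its home slot
                    table[i] = Slot::Tombstone;  // tombstone drifts toward an empty
                }
                return Some(v);
            }
            Slot::Full(..) => {} // someone else's key; keep probing
        }
        i = (i + 1) & mask;
    }
}

fn main() {
    // [ _ $ B A _ ... ] with A belonging at index 1: lookup(A) should
    // produce [ _ A B $ _ ... ], jumping the tombstone over B.
    let mut t = vec![Slot::Empty; 8];
    t[1] = Slot::Tombstone;
    t[2] = Slot::Full(2, 20); // B, in its home slot
    t[3] = Slot::Full(1, 10); // A, displaced past B by earlier collisions
    assert_eq!(get(&mut t, 1), Some(10));
    assert!(matches!(t[1], Slot::Full(1, 10)) && matches!(t[3], Slot::Tombstone));
}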
As soon as your reads can modify the internal state of the data structure, it might modify the state in a way which trips up another read; so you can no longer have many threads reading the data structure at once.
That said, it's probably still better to avoid this unless it's absolutely necessary to sometimes modify the underlying structure; I recently had to do this for an LRU cache.
TBH, the right answer always depends on the user's use case (amortization and housekeeping really help with purely functional data structures) and on benchmark data.
However the tombstone idea fits better with minimizing mutations and improving the potential for concurrency if the design only cares that inserts/deletes are atomic (not that they're propagated in parallel instantly).
For the 'move back' idea to be that safe I'd still want to use a tombstone value, but it would need to also include a pointer to the moved location. The list of tombstones would need to be submitted (as in, message queue) to a kind of garbage collection thread that sat on the messages for a pre-determined validity time and then did a scrub of the key/value to make sure the compaction still made sense. A shorter interval might also allow for that deferred compaction to be replaced by a new entry instead.
I don't like any of that as a generic solution though, as the trade-offs with each bit of extra effort and code complexity seem likely to be premature optimization and sources of potential bugs when not considered in a specific problem domain.
Of course, having the hashes available also speeds up recreation, should the tombstone approach be used.
Basically, keep the full hashes around :)
(And of course you could use parallel arrays if you're super concerned about cache lines, though the tradeoffs would very much depend on whether hash misses or hits are the expected mode.)
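Roughly like this (the names are mine, not from any particular implementation):

struct Entry<K, V> {
    hash: u64, // cached full hash of `key`
    key: K,
    value: V,
}

// Rebuild into a bigger table without calling the hash function again:
// the cached hash is enough to re-bucket every surviving entry.
fn grow<K, V>(old: Vec<Option<Entry<K, V>>>, new_len: usize) -> Vec<Option<Entry<K, V>>> {
    let mut new: Vec<Option<Entry<K, V>>> = (0..new_len).map(|_| None).collect();
    for entry in old.into_iter().flatten() {
        let mut i = (entry.hash as usize) % new_len; // reuse the cached hash
        while new[i].is_some() {
            i = (i + 1) % new_len; // simple linear probe for a free slot
        }
        new[i] = Some(entry);
    }
    new
}

The parallel-array variant would just split the hash field out into its own Vec<u64> sitting next to separate key/value storage.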
* not sure if that's literally true, but I've never seen anyone do chaining in performance-sensitive applications, and all the papers on fast hash tables use some form of open addressing.
However, Rust is allergic to intrusive data structures. Which is a little ironic, as single ownership ostensibly would otherwise make intrusive data structures more practical: you're already effectively precluded from inserting an object into multiple hash tables, so there's no potential for contention over the object's internal node fields.
I suppose linear probing also permits prefetching of successive buckets, though ideally collisions should be the exception so such prefetching might just waste bandwidth.
In C I prefer red-black trees over hash tables because memory management is easier, you get smooth performance characteristics (no resize hiccups), and you're immune from computational complexity attacks without sweating over hashing algorithms. But I use intrusive node fields for my trees, and also tend to keep keys (even string keys) internal to the object as well, which substantially closes the cache performance gap.
I can't speak directly to the performance of non-tombstone based open addressing implementations, but it definitely seems like the need to move things around after deletions would be significantly more work than the fastest chaining implementation I've come up with.
folly's F14 maps use this strategy, and handle erase + insert workloads fine with no rehashing needed. https://github.com/facebook/folly/blob/master/folly/containe...
Open addressing does do more comparisons than chaining. It used to save time to traverse the linked list rather than to spend CPU cycles on those additional comparisons. Now, because CPU cycles are relatively cheaper, the opposite is true.
The reason the Qt and Java hashtables use chaining is simply that the code in question was initially written back when the tradeoff ran the other way.
These indirections have always been costly on any machine with virtual memory. TLB misses aren't free. I'd put any estimate of when the tradeoff last ran the other way at more than 25 years ago.
Furthermore, as the article explains, they can use SIMD to search several buckets at once, something that wouldn't really be possible if they didn't exist in contiguous memory.
What looks like "two loops" is really "do two simple SIMD operations".
Tombstones in place of deleted elements mean that finding out that key is not in the table requires going to a different place than finding the right place to finally put it. Hence two lookups. SIMD just makes keeping two lookups separate economical.
No, it's not quite so offensive, but this doesn't explain why it's the best option. Is there no equally-fast way to write the first-tombstone implementation with SIMD instructions? The answer seems to be in the sketch of the implementation, which I'm having trouble understanding.
EDIT: I'm watching the original SwissTable talk now... would it really have been worse to use 2 bits for empty/tombstone/full/sentinel, and 6 bits for hash prefix?
EDIT 2: More implementation info. Tombstones are actually rare, because if any element in your 16-wide chunk is empty, you don't have to create a tombstone. In the very best case (a perfectly distributed hash function), your hashmap has to be ~94% full before it's even possible to fail this. Because tombstones are so rare, it's better to save the single bit for extra confidence in the hash prefix.
So, here is my understanding of the implementation and its rationale:
* Every bucket in the backing array has a corresponding byte in a metadata array
* 1 bit of this byte stores whether the bucket is empty, the other 7 bits are for a hash prefix
* SIMD instructions search 16 bytes at a time, checking: this bucket is not empty, this bucket's key matches the first 7 bits of my key
* Since 16 buckets are checked for emptiness at the same time, you can avoid creating a tombstone for a bucket if any of the other 15 buckets are empty (just set it to empty, i.e. set the first bit to 0)
* This means that tombstones are very unlikely; you'll probably rehash before you get to the load factor where you start seeing tombstones
* Since tombstones are so unlikely, it's more valuable to add an extra bit to the hash prefix than it is to quickly find tombstones
My question remains: why can't the first search return the offset of the first empty bucket? In this loop, why is there not an else that saves the offset?: https://github.com/abseil/abseil-cpp/blob/256be563447a315f2a...
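For what it's worth, here is my rough scalar (non-SIMD) mental model of the probe described above, in case I've misunderstood something; the constants are made up for the sketch, not Abseil's:

// One control byte per bucket: high bit set means "no full entry here"
// (empty or tombstone); otherwise the byte holds 7 bits of the hash.
const EMPTY: u8 = 0x80;
const TOMBSTONE: u8 = 0x81;

fn find(ctrl: &[u8], start: usize, h2: u8) -> Option<usize> {
    // h2 is the low 7 bits of the key's hash, so it can never equal
    // EMPTY or TOMBSTONE (both have the high bit set).
    let mask = ctrl.len() - 1; // table size assumed to be a power of two
    let mut i = start & mask;
    loop {
        if ctrl[i] == h2 {
            return Some(i); // probable match; the caller still compares the real key
        }
        if ctrl[i] == EMPTY {
            return None; // an empty slot ends the probe; a tombstone does not
        }
        i = (i + 1) & mask;
    }
}

fn main() {
    let ctrl = [0x12, TOMBSTONE, 0x34, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY];
    assert_eq!(find(&ctrl, 0, 0x34), Some(2)); // probes over the tombstone at index 1
    assert_eq!(find(&ctrl, 0, 0x55), None);    // stops at the empty slot at index 3
}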
> it was doing something that was so offensive to people who care about collection performance
Hmm. It also helps here to go back to academia. Big O notation doesn't usually express coefficients/constants; it usually only deals with exponents. The Wikipedia page has a good explanation as to why.
Opinion: coefficients/constants are, however, useful if you're running over a network or some other latency-bound operation.
(Might actually be the same technique, it's not quite clear!)
Raymond Hettinger (HN commenter and all around Smart Dude (tm)) has a good talk from PyCon 2017 about the evolution of Python dictionaries that goes into tombstones and so much more: http://www.youtube.com/watch?v=npw4s1QTmPg
Does Neon get a similar order-of-magnitude benefit?
Unfortunately none of the current popular architectures seem to have either.
Fast array bounds checking handles the lower bound with unsigned integer math: cast the (possibly signed) index to the unsigned type and do a single comparison against the length, so a negative index wraps around and fails the same check as an index that is too large.
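Something like this (a sketch, assuming a signed index coming in):

// One unsigned comparison covers both bounds: a negative index, cast to the
// unsigned type, becomes a huge value and fails the same `< len` test as an
// index that is merely too large.
fn checked_get(slice: &[u32], i: isize) -> Option<u32> {
    if (i as usize) < slice.len() {
        Some(slice[i as usize])
    } else {
        None
    }
}

fn main() {
    let a = [10u32, 20, 30];
    assert_eq!(checked_get(&a, 1), Some(20));
    assert_eq!(checked_get(&a, -1), None); // -1 wraps to usize::MAX and fails the check
    assert_eq!(checked_get(&a, 3), None);
}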
I read somewhere the main problem is LLVM not being quite capable of optimizing these decently: it has to handle all these extra jumps, which cause lots of extra basic blocks.
I'm becoming convinced that we really need to just go back to the Alpha 21164 architecture and stamp out multiple copies with really fast interconnect.
So, why don't compilers do this already? Nothing stops them. The answer is that they lose a ton of performance.
There is a lot of housekeeping in order to keep track of speculative execution and it happens on every single mathematical operation in your regime. And you pay that continually for an operation that happens "very rarely" as you point out.
Overflow/underflow probably isn't so bad, but array bounds checks are very bad. You have at least one bound that is variable and so causes an extra "compare to X", where X is a non-zero constant and so needs to be loaded. So your tight loops all look like "increment I; compare I to 0; compare I to X; do real work". That's two extra instructions that the speculative execution system has to keep track of on every loop iteration, and it eats a lot of the speculative execution resources very quickly.
An in-order chip like the 21164 doesn't execute the loops any faster, but it doesn't have to pay hardware chip area for speculative execution. Given that everything is moving to high core counts, it would be better to let a stuck core give way to another quickly than to try to speculate deeper, especially as SSDs continue to increase in speed, and especially because, in a battery-operated world, you would quite often rather shut down while waiting.
In addition, if you trap on these, it will break an enormous amount of software in existence. Look at how many people went absolutely ape over gcc suddenly doing aggressive optimizations on undefined behavior.
(And, by the way, every microprocessor used to compute underflow on arithmetic instructions (8080, Z80, 6800, etc.), so you need to ask why that went away.)
It is in-order designs that take a slightly larger relative penalty: in some sense "all instructions matter" there, so bounds checks cost about as much as the surrounding operations that do real work.
You make a lot of valid observations about the cost of actually implementing a modern, deep and wide out-of-order architecture: it is much harder and uses more silicon than some in-order designs. But once you have it (and largely that's what we have today as general-purpose CPUs), the type of branches that are involved in overflow and bounds checks are not costly.
Compilers are generally smart enough to hoist these comparisons out of the loop, at least in a static, AOT-compiled language like Rust.
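For example (just an illustration of the general shape, not a claim about any specific codegen): taking the slice once lets a single up-front check stand in for the per-element checks inside the loop.

fn sum_first_n(data: &[u64], n: usize) -> u64 {
    let head = &data[..n]; // the single range check (panics if n is out of range)
    let mut total = 0;
    for &x in head {
        total += x; // no indexing here, so nothing left to check per iteration
    }
    total
}

fn main() {
    let v: Vec<u64> = (1..=10).collect();
    assert_eq!(sum_first_n(&v, 5), 15);
}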
You're right though that in-order chips are a lot more power-efficient, and the in-order approach is the one that's taken in most RISC-V implementations (which seems to be a highly comparable architecture to the Alpha you mention).
It would require a minor compatibility break, significant checking work and changes to unsafe code to make it work, but it should be doable.
Array writes do need precise checking though, since otherwise you can write to arbitrary memory.
Rust's implicit overflow checks are only in debug mode I think. Overflow is defined to wrap in release mode.
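Concretely, with default profiles (the explicit methods behave the same in both):

fn main() {
    let x: u8 = 255;
    // `x + 1` panics in a default debug build and wraps to 0 in a default
    // release build (the `overflow-checks` setting is what actually decides).
    assert_eq!(x.wrapping_add(1), 0);
    assert_eq!(x.checked_add(1), None);
    assert_eq!(x.overflowing_add(1), (0, true));
}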