Hacker News new | past | comments | ask | show | jobs | submit login
Open-sourcing F14 for memory-efficient hash tables (fb.com)
225 points by ot on Apr 26, 2019 | hide | past | favorite | 45 comments

I've recently started using Folly's F14 hash map implementation (specifically, F14ValueMap--my keys and values are relatively compact) and it's the fastest unordered associative container I've yet found. This is for an application that repeatedly adds to and removes from the map. In particular, it's faster than a custom solution using open addressing that I had written that is pretty much optimal for this particular application (for an open addressing implementation, that is).

I do wish it could be used header-only out of the box, but still a great library.

Have you compared to ska::flat_hash_map or khash? I’ve found the latter to be ideal if my keys/values were POD and the former otherwise. I’m not yet convinced that it’s actually better based on their website. std::unordered_map is infamously slow.

Comparing to ska::flat_hash_map or khash wouldn't be fair, because the F14 hash tables use SIMD intrinsics to lookup 14 buckets at once (and thus the name) whilst the former two don't have that advantage. A fairer comparison would be against Swiss tables[1] from Google's abseil library, which uses the same trick. Also, std::unordered_map is slow because its interface requires a chaining based collision resolution strategy.

[1]: https://abseil.io/blog/20180927-swisstables

You may want to try my version of the Abseil hash tables (I updated the code to make them header-only) at https://github.com/greg7mdp/parallel-hashmap.

I know it's late, but I finally got around to trying ska::flat_hash_map. For my application (lots of insertions and deletions) it's not quite as fast as F14, but the difference is fairly small. It's also much easier to incorporate than folly, since the entire thing is in a single header file.

I did look at klib about a year ago. Honestly I don't recall whether I looked at khash in particular; I know I was also considering some of the other klib stuff for other uses. I did use Google's dense_hash_map for a time.

For my application, the problem with implementations that use more traditional tombstone algorithms is that removing items from the table doesn't decrease the load, so that at some point the entire table has to be recreated. For my application, worst case performance is far more important than average performance.

dense_hash_map definitely behaves as described above, and (from a quick look at the code) I believe khash does as well.

I haven't looked at ska::flat_hash_map but thanks for the tip.

Non-SIMD algorithms work better in some cases, for example if you look up the same key over and over (so the control flow is predictable) or when all of the data is in the L1. The SIMD approach tends to be more robust inside a larger program where there's less chance for branch prediction and caching to help you.

If I am right, if you delete all elements in F14, there will be no tombstone. However, if you delete some, there may still be some tombstones left. You may still need rehashing due to those remaining tombstones. In that sense, F14 is not eliminating the worst case; it makes the worst case happens less often. Then the question is how "less often". This probably depends on the access pattern.

What is your access pattern? Are insertions and deletions frequently interleaved? Or do you insert a batch and then delete another batch? It will be interesting to have a micro-benchmark to measure how the F14 strategy works in practice.

> I believe khash does as well.

Yes, khash uses the traditional tombstone solution. It rehashes in the same space when there are too many tombstones.

It would be nice to have a deeper understanding of how/why this works, but in practice it seems fine. Under a continuous insert and erase workload the number of chunks with an overflow fluctuates a bit but stays low enough to keep probe lengths short, so F14 never needs to rehash to clean them up. Batched adds and then removes might let you bias a bit toward having longer probes for the remaining keys, but it never gets too bad. A truly adversarial workload could make things pretty bad, but they could also just insert lots of keys that all have the same hash code.

See the steady_state_stats test at https://github.com/facebook/folly/blob/master/folly/containe...

> What is your access pattern? Are insertions and deletions frequently interleaved?

It's basically random, but generally they are very interleaved and most of the time the number of insertions and deletions are approximately equal over some period of time (~seconds). This is for keeping track of orders while processing a depth of book feed, in case you're familiar with that sort of thing.

F14 is using some kind of reference counting scheme for tombstones, described in the article. I don't fully understand it, but it definitely seems different than the standard tombstone-based algorithm, which I do understand. In any case, when using something like dense_hash_map, I see huge spikes in processing time (due to rehashing) that I don't see when using F14. I also resize the hash tables up front to avoid having to do so later on, but of course for my use case that doesn't help with implementations that use a traditional tombstone algorithm unless I use ridiculously large initial sizes.

Does it beat judy arrays?

Performance is very similar to my header-only version of the Abseil's hash tables available at https://github.com/greg7mdp/parallel-hashmap.

> Those options are slow, risky (clients can change the key), incompatible (maybe we can try again after proxy iterators work), and ugly, respectively. After careful deliberation, we chose ugly, using const_cast to make the key mutable just prior to its destruction during rehash.

You can't win them all, I guess…

Also, TIL that invalidating iterators does not actually invalidate references to keys and values:

> The standard guarantees reference stability: References and pointers to the keys and values in the hash table must remain valid until the corresponding key is removed.

https://github.com/sparsehash/sparsehash was for a decade+ hands down the most memory efficient map implementation, originally written by Craig Silverstein of Google. I am not sure how the more recent SIMD stuff impacts it.

sparsehash is still more memory efficient, but it is quite slow in comparison. Also, in practice it doesn't reach the ultra-low space overheads claimed by the documentation. It allocates memory blocks of many different sizes, so the dominant space overhead becomes internal fragmentation in the allocator. For sparsehash using JEMalloc's default allocation classes (spaced about a factor of 1.2 apart) the memory used relative to useful data varies from about 1.3 for small values to 1.1 for large values.

This seems like an implementation similar to the one Google engineers have given talks about (w/vectorization, etc) so it's cool to have more options to use a modern performant table like that.

Since the hashes uses open addressing wouldn't they be susceptible to hash poisoning attacks? Java plugged that hole in HashMap in Java 8 by reallocating lists over a certain length as balanced trees. But that strategy isn't possible with open addressing. https://www.nagarro.com/en/blog/post/24/performance-improvem...

F14 supports stateful hashers, so in situations that need to be resilient to hash poisoning we can use a hasher designed for that purpose.

I recall that that was what was suggested for Java's HashMap too, but they found it to be insufficient. By careful analysis, an attacker could still figure out how to cause hash collisions.

I haven't seen those discussions, but is it possible that they were trying to rehash the results of Object.hashCode rather than pass all of the bytes of the original value through a new hash algorithm?

If you haven't seen it before you will appreciate https://accidentallyquadratic.tumblr.com/post/153545455987/r...

Afaik, they augmented the hash function with a random seed, which didn't work out. See: http://emboss.github.io/blog/2012/12/14/breaking-murmur-hash... It appears that SipHash would be enough (for now) to prevent hash poisoning. Most open addressing hash implementations doesn't yet use SipHash though.

They use a byte for collision counting, and a byte is enough to reliably detect and punish DOS attackers on 32bit tables (max 4G keys). On 64bit tables I didn't do the math yet.

There was a discussion thread on here recently about Google's Hash Table implementation. I wonder how the approaches and design considerations compare to each other.

I'm one of the authors. Seeing the Google presentation at CPPCon convinced us that the potential wins were worth the effort to productionize, but the similarity in the designs is a case of convergent evolution. Since then Google and Facebook have collaborated on a microbenchmark (https://github.com/google/hashtable-benchmarks/), which shows that the algorithms have slightly different tradeoffs but neither dominates the other.

They are very, very similar. The hashes are two level: one is fairly traditional. You take the first k bits from the hash and look that up in a table. The second layer is 128 bit blocks. (one SSE/NEON register) This block contains 14 8 bit blocks, where the top bit is a tombstone marker, and the bottom 7 bits are the k+1 though k+7 bit from the hash. (or the top 7 bits? not sure. it's 7 bits from the hash that aren't used in the top layer) It uses SSE instructions to identify which of the 14 blocks to do a full key compare with.

Both of them have two variants, one which guarantees that references to keys and values will remain valid even if iterators are invalidated, and another which is faster but iterator invalidation invalidates references to keys or values.

IMHO it looks like the Folly team saw the Abseil presentation last year and said, "That's a good idea, we should do that."

abseil presentation: https://www.youtube.com/watch?v=ncHmEUmJZf4 It's super interesting.

This thread has some info: https://groups.google.com/forum/m/#!topic/hashtable-benchmar....

Edit: Or at least I think that's relevant to this announcement. I don't know enough about it to say for sure.

Hasn't this been open-sourced since at least last March?


You're right, it's been available for a while now. This is the first time we've made an announcement about it.

The blog post is dated April 25, 2019, so I suspect this is a higher level overview on how it works/benchmark results/etc.

It's funny how they bring up same birthday probability without referring to the "birthday problem" even if it supports their case for chunking even better.

For a hash table implementation, it's even more relevant that out of 180 people in a room, chances are > 99.999% that two people share a birthday (meaning collisions), whereas the chance that 15 people in that group are born in the same fortnight is still very small.

I realized that I misunderstood this comment (I assumed same fortnight meant "same fortnight number in a year" - when it should be same 14 day span) but I thought it was worth calculating this example anyway. The oldest person alive was born in 1903. Since then 3035 fortnights have passed. Assuming an even distribution of birth fortnights, there'd be a 3.4058% chance of two people having the same birth fortnight (1 - (P(3035,15)) / (3035^15), where P is for Permutations).

It's still fairly likely, but far from the ~1 in 10^23 odds of having NO shared birthdays between 180 people.

how comes that F14BasicMap (the base class of F14FastMap F14ValueMap and all the others) is extending std::unordered_map ? https://github.com/facebook/folly/blob/master/folly/containe...

they seem to be calling the table_ member for every method in the book. Still weird by they are deriving from std::unordered_map in the first place.

F14Table<Policy> table_;

the implementation class F14Table - https://github.com/facebook/folly/blob/master/folly/containe...

That's a fallback implementation for platforms without SIMD. The real code is lower down.

Curious how this compares to brownhash, which is on its way to stable Rust on July 4th.

Someone should port this to C and add it to the next version of Lua, given that Lua's main datatype is pretty much a hashtable with some extra magic.

I'm curious what Rust's implementation looks like in comparison now...

https://github.com/rust-lang/hashbrown is a port of Google's SwissTable algorithm (abseil flat_hash_map). It also uses SIMD filtering and has the potential to be quite good.

Which has been recently upstreamed to the standard library, replacing the old hash map implementation. (This sped up rustc a noticeable amount, incidentally.)

Is `hashbrown::HashMap` now completely ported/identical with `std::collections::HashMap`? I just compared their performance (on yesterday's beta channel) on a hashmap with over 200k keys and didn't notice any relevant speedup.

Yes, as of a couple of days ago in nightly (pre-1.36): https://github.com/rust-lang/rust/pull/58623

Can this easily be ported to Rust?

Rust's standard hash table has recently been replaced with Hashbrown, which is a port of Google's SwissTable, which is a close relative of this hash table. So it's basically already done. :)

Fb probably open sourced the work after the engineers realized that hashbrown is superior to it. Just a hunch. Need to see benchmarks..

F14 has been open source for over a year already...

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact