Questioning the Criteria for Evaluating Non-Cryptographic Hash Functions

aappleby · 2025-02-15T06:12:30 1739599950

Hi, MurmurHash and SMHasher author here.

The author of the linked article seems to be missing a _bafflingly_ _large_ amount of context from the last decade of people writing and testing hash functions.

Hell, Murmur2 is deprecated - they should've done their test with Murmur3. And none of the other hashes invented since then have been mentioned.

Was this article actually written years and years ago and just republished with a new date?

ot · 2025-02-15T12:39:18 1739623158

The fact that they categorize FNV-1a as "Good all-rounder, decent speed and collision resistance" was an immediate red flag for me.

It does look like an article written more than 10 years ago.

tialaramex · 2025-02-15T13:50:08 1739627408

Certainly when I saw FNV-1 and FNV-1a called out separately I assumed we're going to see a point made about an early period in hashing and then in a paragraph or two we'll get "real" options.

orlp · 2025-02-15T14:32:26 1739629946

I'm very confused by the article. It's missing about two decades of modern non-cryptographic general hash development, it completely fails to mention the field of universal hashing which created formal criteria for collision probability and produced provably correct constructs to achieve those criteria...

As a self-plug for a related topic, I think going forward we should also take into account HashDoS for non-cryptographic hash functions. All of the hashes mentioned in the article are vulnerable to this, see https://orlp.net/blog/breaking-hash-functions/ for some techniques.

wmf · 2025-02-15T03:30:36 1739590236

Seems weird not to mention the SMhasher test suite and the dozens of fast and high-quality hash functions that were created in recent years.

AlotOfReading · 2025-02-15T06:25:23 1739600723

The ubiquity of decent hashes also means that they're commonly used for situations where they're not wholly appropriate. I've frequently encountered non-cryptographic hashes being used for fixed sized integer maps or iterated.

Even non-cryptographic functions can benefit from the same kinds of threat modeling and carefully attention to selection criteria common in crypto. The risk for failing is that you end up with something as catastrophically flawed as boost's hash_combine [0].

[0] https://www.boost.org/doc/libs/1_70_0/doc/html/hash/referenc...

arthur2e5 · 2025-02-15T06:58:47 1739602727

on one hand i can see how it's going to break stuff (and I see they've changed it! []). on the other hand... mind talking a bit more about how it did break things?

[]: https://github.com/boostorg/container_hash/commit/40ec854466...

AlotOfReading · 2025-02-15T16:48:11 1739638091

Let's call the old combine function C. How long can you combine things C(C(C(C(a, b), c), d), e)... until you get a collision? Experimentally, it averaged around 2^14 iterations and I had examples with 7. The issue was that I found this code deep in the bowels of a system controlling heavy machinery such that a collision potentially meant life altering consequences for someone.

I wasn't aware boost had changed it, but the solution I settled on was essentially identical to what boost came to here. There's some further issues with that system in that the mixing functions aren't ideal for all bitlengths T. The way they're using it here is safe, but it annoyed me enough to devise a class of mixing functions with provably max length periods for all sizes T (at the cost of some mixing quality). Haven't published that crate yet because I'm not sure who else would want it.

tialaramex · 2025-02-15T02:31:00 1739586660

Quite early this says well, it would be easy to see what happens for the chain based hash tables and so, then it just doesn't talk about any other type.

This probably made the analysis easier, but if you don't use these archaic hash tables it is unclear how valuable these results are. Maybe I should go measure :/