GxHash is a fast and robust non-cryptographic hashing algorithm (github.com/ogxd)
87 points by vlovich123 14 days ago | 61 comments



Just to be 100% fair, they may want to retest against xxHash3 for that "fastest" claim as it's ~2.5x faster than xxHash that they benched against.


Their graph is just poorly labelled. It's against XXH3. I even tried switching their benchmark harness from twox_hash to xxhash-rust, which has SSE support; their benchmark harness doesn't show a major difference for xxhash-rust, although maybe I somehow forgot to set target-cpu=native for the benchmark and it wasn't testing the right thing.


According to my test, xxhash-rust does have a significant edge over twox_hash. However, it is still slower than GxHash.

  gxhash
    | 4 > 8300.74
    | 8 > 16684.68
    | 16 > 33121.69
    | 32 > 38146.97
    | 64 > 57805.72
    | 128 > 49712.76
    | 256 > 69277.29
    | 512 > 88315.11
    | 1024 > 97135.29
    | 2048 > 98200.60
    | 4096 > 104709.01
    | 8192 > 108988.86
    | 16384 > 110630.70
    | 32768 > 111406.96

  xxhash (twox_hash)
    | 4 > 1291.20
    | 8 > 2579.19
    | 16 > 6029.70
    | 32 > 10281.18
    | 64 > 15950.47
    | 128 > 22031.84
    | 256 > 17929.12
    | 512 > 20728.87
    | 1024 > 21683.24
    | 2048 > 21679.94
    | 4096 > 21732.89
    | 8192 > 21748.57
    | 16384 > 21534.84
    | 32768 > 21703.27

  xxhash (xxhash-rust)
    | 4 > 1968.26
    | 8 > 3946.71
    | 16 > 8030.94
    | 32 > 13099.80
    | 64 > 19073.49
    | 128 > 23469.46
    | 256 > 22909.73
    | 512 > 33117.15
    | 1024 > 42467.53
    | 2048 > 54638.44
    | 4096 > 63193.78
    | 8192 > 68677.38
    | 16384 > 71658.92
    | 32768 > 71421.18


I would like to see the collision distribution on /usr/share/dict/words. I have found that 'fast hash algorithms' can have very unequal hash distributions on short strings.


Using https://gist.github.com/rjp/3bbf33911aaf9ed7a07be417adb4b8f1 on Arch Linux `words` 2.1-8 (which defaults to the `usa` word list) gives collisions for the 32 bit hash (but none for the 64 bit or the 128 bit variants.)

    > ./target/debug/gxwords | cut -d' ' -f1 | sort | uniq -c | awk '$1>1'
          2 28273056
          2 8e264f8
          2 f7bce777
Which then gives us

    > ./target/debug/gxwords  | egrep '28273056|8e264f8|f7bce777' | sort
    28273056 counterclaiming
    28273056 oncogene's
    8e264f8 ingrates
    8e264f8 toadstool's
    f7bce777 Carib's
    f7bce777 stemming


I thought it interesting that all three collision pairs have an 's ending, but looking into this dictionary, no less than 25% of the words end with it. Weird dict.

Wouldn’t that be true for any 32-bit hash? The birthday paradox suggests you only need about 2^16 input values to generate a collision, right?
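Roughly, yes. A back-of-the-envelope sketch of the expected number of colliding pairs for an ideal hash, assuming the ~123k-word list mentioned elsewhere in the thread (the formula E ≈ n(n-1)/(2·2^bits) is standard; the numbers are illustrative, not a property of any specific algorithm):

```rust
// Expected number of colliding pairs among n uniformly random hashes
// drawn from a space of 2^bits values: E ≈ n(n-1) / (2 * 2^bits).
fn expected_collisions(n: f64, bits: i32) -> f64 {
    n * (n - 1.0) / (2.0 * 2f64.powi(bits))
}

fn main() {
    let n = 123_000.0; // approximate word-list size
    // ~1.8 expected colliding pairs at 32 bits, so seeing 2-3 is normal
    println!("32-bit: {:.2}", expected_collisions(n, 32));
    // effectively zero at 64 bits, matching the absence of collisions
    println!("64-bit: {:e}", expected_collisions(n, 64));
}
```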


xxhash32 does give a couple of collisions on the same file, yep.

    734c8981 abloom
    734c8981 sating
    ef336a7d contravene
    ef336a7d seducers


Strange that all three collisions include "'s"


Considering that 43% of the words (in my dictionary at least) include "s" it's not so strange:

    $ cat /usr/share/dict/words | wc -l
      235976
    $ cat /usr/share/dict/words | grep s |  wc -l
      101882


GP meant that all collisions contained 's - a possessive apostrophe (if I remember my grammar correctly?). oncogene's, toadstool's, Carib's.

The /usr/share/dict/words on my macOS machine doesn't seem to have any 's words in it.


counterclaiming doesn’t have any s, and stemming doesn’t have any possessive. I think drawing conclusions from a 32-bit hash against a 123k word list is fruitless. You’d expect at least one collision simply due to the birthday paradox.


    oncogene's <> counterclaiming
    ingrates <> toadstool's
    Carib's <> stemming
All 3 pairs have one word ending in "'s", which I think is what the GGGP (?) was pointing out.


Yes. But I didn't realize that 43% of the words in that dictionary contain "'s".

In that case, the probability that each of the 3 collision pairs contains at least one word with "'s" is about 31%. That seems reasonable to me.


but 43% of the words contain s, not 's


> counterclaiming Doesn’t have any s and stemming doesn’t have any possessive.

But "oncogene's" (collides with counterclaiming) and "Carib's" (collides with stemming) do.

Also I did point out that the 64 and 128 bit variants have no collisions.


Yeah, but it's an uninteresting test. As others point out, 32-bit XXH3 has the same issue. It wouldn't surprise me if a 32-bit AES-based hash did too. It has nothing to do with the quality of the hash algorithm and more to do with the fact that you're not building a perfect hash with such a small hash size.


> Yeah, but it's an uninteresting test.

On its own, yes, fair, but it was combined with the 64 and 128 bit variants which aren't uninteresting.

> As others point out, 32-bit XXH3 has the same issue.

Yes, me.

> It has nothing to do with the quality of the hash algorithm

The original post asked "I would like to see the collision distribution on /usr/share/dict/words." and that is what they got a demonstration of - some collisions on 32 bit, none on 64 or 128 bit. What else could be done to satisfy their original request in a way that would be interesting to you?


Point to the fact that it passes SMHasher, in addition to the fact that the 64-bit and 128-bit variants don’t have collisions, and ignore 32-bit altogether. I don’t know why libraries even bother producing 32-bit variants at this point, given how easy a collision is.

The article emulates that result by synthesizing words. The problem with using /usr/share/dict/words is that there aren't enough words to get a really good statistical estimate of collisions at more than very small scale.

Also the core primitive is AES which should calm most such worries.


Not to mention you should be using the 64-bit variant for collision resistance and the 128-bit variant for integrity.


Deleted


AES is more than just sboxes. The multiple rounds (including XORs) are integral to collision resistance.


I've done some security testing against this hash function. Does anyone want to follow up and see if they can extend the attacks I describe[1]?

[1] https://github.com/ogxd/gxhash/issues/25


I may well be missing something, but isn't the point of the title:

"GxHash is a fast and robust non-cryptographic hashing algorithm"

Exactly that I should use it when I want a fast hash, but not when I want a robust cryptographic hash? In which case I can ignore your attack? (Not to discount your work, I'm just trying to understand the scope here)

So I can use it in my hash-map, or whatever O(1) lookup, but I shouldn't use it in my rewrite of SSL?


Even non-cryptographic hashes must provide some sort of security against attacks. Let's say we use a weak hash function in an open-addressing hash map: if I can calculate a set of inputs that will all collide in the hash map, then it will turn the O(1) lookup into an O(n) lookup.

If a small change to the seed mixing (literally moving a line from the end to the beginning of the function) increases the security with no penalty to performance, then we might as well make the change.
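The degenerate case can be sketched with a deliberately pathological hasher: every key "collides", so lookups degrade to a linear scan of a single bucket chain. This is a hypothetical toy for illustration, unrelated to gxhash itself:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// A pathological hasher: discards all key material and returns a
// constant, so every key lands in the same bucket. This is the
// extreme form of what crafted collision sets do to a hash map.
#[derive(Default)]
struct ConstantHasher;

impl Hasher for ConstantHasher {
    fn write(&mut self, _bytes: &[u8]) {} // ignore the key entirely
    fn finish(&self) -> u64 { 0 }         // every key "collides"
}

// Build a map where all n keys collide (hypothetical helper name).
fn build_flooded_map(n: u32) -> HashMap<u32, u32, BuildHasherDefault<ConstantHasher>> {
    let mut map = HashMap::default();
    for i in 0..n {
        map.insert(i, i * 2); // every insert probes the same chain
    }
    map
}

fn main() {
    let map = build_flooded_map(1_000);
    // the map is still correct, but each lookup is now O(n)
    assert_eq!(map.get(&500), Some(&1000));
}
```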


> So I can use it in my hash-map, or whatever O(1) lookup...

BTW, in the past (20 years ago?) there have been attacks on non-cryptographic hashes used for hash maps: denial of service by sending crafted requests to HTTP servers. The parameters would be picked maliciously so that they would all end up in the same buckets. That attack worked on both Java and PHP servers. IIRC it was solved by adding random seeding (and I noticed that GxHash mentions it's seedable: not saying it'd counter every attack, but it's already something).


Very good point, and one I had heard of, but wasn't considering here.


> there have been attacks on non-cryptographic hashes used for hash maps

There have been places where a dev chose to use a non-cryptographic hash when they should not have, because they hadn't thought of the threat model (out of ignorance, or it just didn't occur to them).

But in most cases, nothing happens. They gained performance, or ease of development, etc. So it's a worthy trade off most of the time (whether they did it knowingly or not).


I'm a bit confused why the author thinks the key recovery complexity would be on the order of full AES. It's a reduced 4-round AES, and even 5 round key recoveries are quite practical with modern hardware.


Is the output stable across machines in the same architecture (i.e. with various simd features enabled/disabled, compiled in 32-bit vs 64-bit modes, etc) and across architectures (x86_64 to aarch64)?


From the page itself:

> All generated hashes for a given version of GxHash are stable, meaning that for a given input the output hash will be the same across all supported platforms.


Thanks, I missed that.


It would be interesting to have AES itself (with the same number of rounds) in the performance comparison, given that GxHash appears to be built on top of AES.


AES is only used for the final mixing from their paper. AHash is the pure AES hash using AES-NI.


Am I missing something? The paper indicates that AES-NI is used for both compression and mixing.

If you look at the source code, it's using it in both parts too. Interestingly, there's a discrepancy between the paper and the code. The paper says the seed is passed to the first round of the 3-round finalizer/mixer, but the code has a separate 0th round that gets the seed instead.



how does it fare against murmurhash?

edit: I did the work myself as I should have in the first place, although I'm tired.

  % cargo bench --bench throughput
   Compiling gxhash v3.1.1 (/Users/daniel/Developer/gxhash)
   Compiling fastmurmur3 v0.2.0
    Finished `bench` profile [optimized] target(s) in 1.03s
     Running benches/throughput/main.rs (target/release/deps/throughput-c9580307c88ca541)
  gxhash
    | 4 > 3980.55
    | 8 > 7961.11
    | 16 > 15922.22
    | 32 > 17986.93
    | 64 > 25083.88
    | 128 > 24548.44
    | 256 > 32485.82
    | 512 > 40386.43
    | 1024 > 45980.96
    | 2048 > 49396.93
    | 4096 > 51314.47
    | 8192 > 52324.80
    | 16384 > 51536.52
    | 32768 > 52342.91
 
  murmur3
    | 4 > 547.33
    | 8 > 910.51
    | 16 > 1752.21
    | 32 > 2676.02
    | 64 > 3552.02
    | 128 > 4045.35
    | 256 > 3975.75
    | 512 > 3748.31
    | 1024 > 3560.22
    | 2048 > 3439.41
    | 4096 > 3411.45
    | 8192 > 3389.27
    | 16384 > 3376.18
    | 32768 > 3379.46


What do the numbers on the right mean? Time? Throughput?


It says "throughput" in the command line :)


The AES instruction thing has already been done in MeowHash.

Not sure how it compares.

https://github.com/cmuratori/meow_hash


If this hash was used as checksum, how would it compare to CRC?


If you’re asking about the quality of the checksum in terms of catching corruption: CRC32 is bad, just like any 32-bit hash would be. It’s too easy to generate collisions - other algorithms might perform better, but it’s all going to be a marginal improvement - you want 64-bit or 128-bit depending on the expected size of the data you’re hashing. CRC also exhibits pretty bad issues that cause it to fail SMHasher (even CRC64) - bias, collisions, etc. So gxhash and XXH3 should both perform significantly better in terms of more reliably detecting changes to the underlying data.

In terms of speed, XXH3 is roughly the speed of CRC32 but can output 64 or 128-bit hashes (CRC64 by comparison ~3x slower). This algorithm is ~10x faster than XXH3 for small data sizes and ~2x faster for larger ones so it would be 2x faster than CRC32 for a much higher quality hash and ~6x faster than CRC64 for a much higher quality hash.

TLDR: Don’t use CRC, use at least 64-bit or 128-bit checksum, and use a good modern hash algorithm for checksumming like XXH3. I know XXH3 very specifically targets the checksum use-case. I don’t know about gxhash’s suitability for that, but I suspect that it’s equally as fine - it’s really hard to measure the quality of a 64-bit and 128-bit hash unless they’re so bad they fail obvious tests.


Wanted to highlight it since it looks like it's a significant Pareto frontier jump from existing techniques.


Honestly, the paper is pretty confusing. It's a weird mix of explaining things people reading a hash paper should already know (e.g. the avalanche effect) and not explaining the important bits. The mixing stage looks a lot like a Davies-Meyer hash, but it's hard to be sure without reading the code. If so, it will be susceptible to fixed-point attacks (among others). It doesn't particularly matter because this isn't presented as secure, but it's worth noting.

The paper also doesn't exhaustively verify numbers when domain=codomain, which is a super common scenario and a weird omission. 32 bits is small enough to check everything.
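The exhaustive check being suggested is cheap at small widths. A toy version, shrunk from 32 bits to 8 so it runs instantly, counts distinct outputs of a candidate mixing step to test whether it is a bijection on its domain; the mixer below is a hypothetical stand-in, not gxhash's:

```rust
// Hypothetical toy mixer: odd multiply, rotate, xor - each step is
// invertible on u8, so the composition should be a bijection.
fn mix(x: u8) -> u8 {
    x.wrapping_mul(167).rotate_left(3) ^ 0x5a
}

// Count distinct outputs over the full 8-bit domain. A bijection on
// u8 hits all 256 values; anything less means collisions.
fn distinct_outputs() -> usize {
    let mut seen = [false; 256];
    for x in 0u16..256 {
        seen[mix(x as u8) as usize] = true;
    }
    seen.iter().filter(|&&s| s).count()
}

fn main() {
    assert_eq!(distinct_outputs(), 256); // all outputs hit: bijective
}
```

The same loop at 32 bits (with a bitset instead of an array) is a few seconds of work on modern hardware, which is why the omission is odd.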


Hi! I'm the author of this document. It indeed covers some concepts that will sound trivial to experienced people like you. However, I am not a cryptography researcher (only a developer passionate about performance and optimization), and neither is anyone around me. So I wrote this also for people like them.

Still, I tried my best to get into the details, but I learned a lot while developing this hash function, and I still have plenty to learn, especially regarding security. That is one of the reasons I made it open source: to get feedback from more seasoned peers and improve it :).


I have to say, despite any nits I have about it the overall construction is pretty clever and I like a lot of your design choices. It's a cool algorithm.

The hash is internally a 128-bit hash and can output 32 or 64 bits as well. Can you speak more to how fixed-point attacks are relevant? The only reference I found online was in a cryptographic context, which wouldn't be relevant here. Are other hashes like SipHash-1-3 (Rust stdlib hashmap) or XXH3 susceptible to it?


A fixed point attack would allow creating collisions. Whether that's an issue or not depends on your use case, but the answer is going to be yes surprisingly often in my experience.


Can you realistically create fixed-point collisions of a randomly seeded hash function? E.g. to defeat the DOS-resistance of a hash table? I think the papers on fixed-point attacks of Davies-Meyer hashes require selecting the IV or knowing the IV in order to carry out the attack.


Not sure where the IV would come in, but both the plaintext and the initial key seem to be taken directly from the input as the first two blocks in any "compression group". Let's say you want intermediate key X after the first two steps; the first 3 blocks should be something like m_2 = D(X, E(m_1, m_0)), because we're missing the xor step from Davies-Meyer.

Let me know if I'm wrong though, strictly an amateur here.


I took a closer look, and it looks like the compression doesn't depend on the seed at all. This makes it look like any collision is a collision on all seeds.

    pub(crate) unsafe fn gxhash(input: &[u8], seed: State) -> State {
        finalize(aes_encrypt(compress_all(input), seed))
    }
For a hash table, you would normally use just the low bits to select a bucket, so this still protects you from collisions on just a subset of the bits, but if you can produce a large number of collisions on the exact 128-bit output, then it's not DOS-resistant.
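The structural point can be sketched with toy stand-ins (not gxhash's real compress_all/finalize): when the seed is mixed in only after compression, any collision in the compressor survives under every seed.

```rust
// Toy, deliberately collision-prone compressor: an order-insensitive
// byte sum, standing in for a seed-independent compress_all.
fn compress(input: &[u8]) -> u64 {
    input.iter().fold(0u64, |acc, &b| acc.wrapping_add(b as u64))
}

// Seed applied only after compression, mirroring the structure of the
// quoted snippet (finalize(aes_encrypt(compress_all(input), seed))).
fn toy_hash(input: &[u8], seed: u64) -> u64 {
    compress(input) ^ seed
}

fn main() {
    // "ab" and "ba" collide in the compressor...
    assert_eq!(compress(b"ab"), compress(b"ba"));
    // ...so they collide under *every* seed, not just one
    for seed in [0u64, 1, 0xdead_beef] {
        assert_eq!(toy_hash(b"ab", seed), toy_hash(b"ba", seed));
    }
}
```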


I don’t know about this at all, but ChatGPT suggests that this attack is thwarted by a secure compression function and block function. As you point out elsewhere, it uses AES for compression but not sure if that’s enough. Do you know?


ChatGPT would be wrong in this case. It applies to (and was originally identified with) hash functions built from cryptographically secure hash functions. Note that this isn't as strong as AES. It's just built with some of the same machinery as AES, it doesn't carry over its guarantees.

Still an interesting construction I'll play around with some more when I have time.


Haven't checked it in smhasher yet. Just saw this, claiming to be the very best. Well, maybe a PR will come soon. The Rust framework for hashes is prepared; there just haven't been any good Rust hashes yet.


I know nothing about this algorithm, but I went and checked that already; it does have a PR for smhasher: https://github.com/rurban/smhasher/pull/279

The note in there that the output of the algorithm is different for different SIMD widths seems to indicate that this isn't so much an algorithm as a family of related algorithms.


That's a bit weird since it says on the front page that:

> All generated hashes for a given version of GxHash are stable, meaning that for a given input the output hash will be the same across all supported platforms.

The GitHub repo is more recent, so maybe it wasn't stable at the time they tried to get it into smhasher last year?


Looks like the original failed smhasher, since the PR modified the compress function


The article quotes results from smhasher runs.

It passes.


How do we know it is a robust one?


Did you read the article?



