Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: Discohash – Fast Hash (github.com)
74 points by fantastisch 2 days ago | hide | past | web | favorite | 32 comments





> The standard digest is 64-bits, but you can modify it to take 128-bits if you want a cryptographically secure hash.

There's not even an attempt at a preliminary cryptanalysis anywhere as far as I can tell. I'd advise staying far away from this for cryptography, especially if it considers a 128-bit digest size reasonable for anything but a message authentication code.


This is a hash designed for non-cryptographic use, like in hash tables or bloom filters. You can tell by their small output size, which greatly reduces the cost of a brute-force collision search. It's also in the linked readme.

Hash function families with a similar target usecase include: cityhash, falkhash, farmhash, FNV, meowhash, metrohash, murmur, t1ha, wyhash, xxh.

The SMHasher suite tests hash functions for speed, distribution, bias, and collisions. This function ranks well in those tests.


> This is a hash designed for non-cryptographic use

The Readme specifically says "you can modify it to yield 128-bits or more if you want a cryptographically secure hash."

Which is a problematic statement, because it is not designed for cryptography even if you extended the output to 256 bit. It's not the output length that makes it cryptographicaly secure. (Rather it's the difference between "you won't find collisions by accident" and "you won't find collisions even if you try really hard using very sophisticated math", but there are more requirements.)

This is similar to the issues with a different hash by the same Github user posted two weeks ago:

https://news.ycombinator.com/item?id=23103521

Making fast non-cryptographic hash functions is a fun challenge and I appreciate the projects, but please, please do not make any claims about cryptographic properties!


All of those mentioned have Hash flooding issues due to invertability or lack of diffusion.

I think SipHash is the better choice for non-cryptographic use cases (e.g. hash tables) https://131002.net/siphash/


SipHash was specifically designed to be resistant to hash flooding attacks.

FWIW, the SMHasher test suite takes the view [1] that defense against hash flooding attacks is a concern for the hash table's collision resolution method, which is a fair point. Nonetheless, SipHash was subsequently adopted by several programming languages' standard libraries for use in hash tables. SipHash is also notable for its clear and concise specification, including security claims, preliminary cryptanalysis, and a discussion on hash flooding [2].

[1] https://github.com/rurban/smhasher#security [2] https://eprint.iacr.org/2012/351.pdf


SipHash is considered acceptable even for cryptographic use cases (MACs)

I am using Bob Jenkin's one_at_a_time hash for that purpose.

It is really simple and works with unaligned data.

But it is not doing well in benchmarks. I wonder if I should use another one


Stay away. The author can't tell the difference between a cryptographically secure hash and a regular one, claims that simply going from 64 to 128 bits will make magically make this cryptographically secure, and then benchmarks it against actually cryptographically secure hash functions (blake3), which are almost necessarily slower than generic hash functions.

Look at meowhash and wyhash instead for the latest and greatest in that field.


Not bad, I like that the author stuck with plain C instead of sse/aes.

Looks like two interleaved 128-bit hashes, don't see any cross-half mixing. Code style needs a bit of cleanup, the sindex stuff obscures the algorithm a bit.

-Austin, SMHasher/Murmur author


Was it the inconsistent use of curly braces that made you comment on the code style?

Predeclared aliased pointers, STATEM/HSTATEM constants, len/Len, that kind of stuff.

Thank you, Austin!

BTW, not sure exactly what you mean by cross-half mixing, but I'm assuming it means mixing both halves of the 128 bit state with each other. There is that, on 3 lines 91[0], 100 and 108.

The 128 bit mix, stirs two adjacent blocks of 64 bits. Those 3 lines can mix blocks 1 & 2.

Subsequent mixes on either half then propagate that cross mix. So it does use the full 256 bit state.

Is that what you meant?

[0]: https://github.com/cris691/discohash/blob/master/src/discoha...


Looks like a solid hash function indeed, with a very simple round.

Q: Is there any reason you don't use the same mixing function for the residue of the string (the final tail that is not a full 8 bytes block)? You could copy the bytes you have into a 8 byte array that is pre-padded with some pattern, and call the same mixing function again. Is this in order to save the distribution / avalanche properties in case the string is very short? In general would be nice to read some design note, if P and Q were obtained experimentally by checking for distribution or alike, if a different rotation length changes significantly the distribution properties and so forth.


Great question, and thank you, Salvatore!

Some reasons for that are: different hashes for s and s\0 (without explicitly using length) to make it harder to create collisions, related to as you say better distribution/avalanche for short strings, also design-wise I like the asymmetry that strings not perfectly divisible by 8 will be treated a little differently. I just think that makes it better overall.


Thanks for the replies!

No probs. I think your idea of unifying the two loops is a good one.

A new version that uses the idea and overcomes the challenges would be great, and would simplify the code even more.

If you had a way to make a start on it, I welcome your PR.


wyhash is very impressive for those interested. Not sure why it doesn't get a lot of attention.

https://github.com/wangyi-fudan/wyhash


It's odd to have the documentation in a .docx file in the repo. The code itself is not very readable. The core of it is something like (simplified):

    function mix(uint64 a, uint64 b) {
        a ^= secret
        b ^= seed
        hi, lo = mul128(a, b)
        seed = hi ^ lo
    }
It's elegantly simple, but depends critically on 'secret' not appearing in the data.

An interesting read is Chris Wellons (aka null program) prospecting for hash functions: https://nullprogram.com/blog/2018/07/31/

HN discussion: https://news.ycombinator.com/item?id=17659672


Tested at ~ 5GB/s @ 3Gz, (Google Cloud Platform, N1 CPU)

What is the memory bandwidth on that instance? I ask because I'm not seeing that listed [1] but it would be a useful point of comparison. Maybe run Doctor Bandwidth's STREAM benchmark [2].

DiscoHash is included in SMHASHER [3] but its benchmark results aren't.

[1] https://cloud.google.com/compute/docs/machine-types

[2] https://www.cs.virginia.edu/stream/

[3] https://github.com/rurban/smhasher


> DiscoHash is included in SMHASHER [3] but its benchmark results aren't.

They are included, but understandably you missed them because it's also called BEBB4185, stated in README. Find BEBB4185 line in SMHASHER Readme.

Good question on the memory bandwidth. From memory it was a multicore system, so I think that has a higher memory bandwidth than a single core. Thanks for the STREAM thing!


It is fast hash sure, but it still only sits in the middle of the pack in smhasher results, which is not very impressive. The code is maybe bit simpler than some of its competition, but to my non-expert eyes its still not trivial, and the compiled code still seems as big as some of its competitors. Although I'm bit curious why there is no "size" result in the smhasher table for this hash?

http://rurban.github.io/smhasher/doc/table.html

The ecrypt result is completely irrelevant when this can not be in any way be considered cryptographically secure at this point.



Simple hash, 128-bit mixing function is just:

  mix(const int A)
    {
      const int B = A+1;
      
      ds[A] *= P;
      ds[A] = rot(ds[A], 23);
      ds[A] *= Q;
      
      ds[B] ^= ds[A];

      ds[B] *= P;
      ds[B] = rot(ds[B], 23);
      ds[B] *= Q;
    }
with P and Q prime.

Please modify this: "The standard digest is 64-bits, but you can modify it to take 128-bits if you want a cryptographically secure hash."

Just because it's 128-bit doesn't make it cryptographically secure.


Can you break 128-bits?

How should I say it?


Hi Cris! First of all, I love to see new hash designs, it is great!

The standard disclaimer is “please do not use this for cryptographic purposes,” placed at the top of the README.

(I like to add a hint afterwards, like “If you need security, please use BLAKE3.”)

Second, if you do want to make a version in the same family with cryptographic properties, a few things are expected:

• Careful list of cryptographic claims (is it a PRF? a PRP? a compression function? is it collision resistant? with what probability of success?…)

• A published paper with preliminary cryptanalysis. What is the average number of evaluations of the hash function for key recovery? How much probabilistic information of the state bits can be gained from the output? How much output leads to a state recovery? What is the worst statistical bias of the output from single-bit input changes?

• Multiple rounds. The production hash should use at least one more round than is shown to be cryptanalytically safe; ideally twice.

• I worry that multiplication, in particular 128-bit multiplication, is subject to timing attacks. It is uncommon to see it used in cryptographic hashes.

Disclaimer: I am not a cryptographer by trade, so this advice is insufficient.


Should I use this as an everyday hash function (for non-security purposes)?

I am always interested when these get attention, but I don't know enough about the implications to switch over from just using SHA.


Depends on your use-case and the tradeoffs you're willing to make. SHA1 is slower than the the top finishers of the SMHasher suite [1], even if it's hardware-accelerated. Meanwhile, SHA256 and later are currently considered to be suitable for cryptographic use, so if you're not sensitive to the size of the hash output, you may get additional nice properties that you didn't intentionally design for. But they're even slower.

If you're shipping a black-box component that needs to use hashing internally, and the hash outputs don't leak out of the black-box, consider switching to a faster non-cryptographic hash to gain performance. Consider the implications of switching -- dependencies, trust, performance profile, hash output size, documentation, customer expectation -- and you will have to discard or otherwise invalidate the meaning behind of your past hash outputs.

If you're occasionally applying a hash function and obtain a digest that gets put into long-lived files or records, e.g. you're checksumming your own files for sanity and then verify them later against these records, then you may value availability and stability more than you value performance. If so, don't switch.

If you are fine with the current performance profile, the cost and complexity of switching (or any nontrivial change) may outweigh the benefits of leaving everything as-is.

[1] https://github.com/rurban/smhasher#summary


Are you running into performance issues with SHA? If not, I wouldn't change

Is this a replacement for SimpleCrypto on NPM?

No because I think that library is for encryption. This is for hashing.



Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: