
Show HN: Discohash – Fast Hash - fantastisch
https://github.com/cris691/discohash.git
======
beefhash
> _The standard digest is 64-bits, but you can modify it to take 128-bits if
> you want a cryptographically secure hash._

There's not even an attempt at a preliminary cryptanalysis anywhere as far as
I can tell. I'd advise staying far away from this for cryptography, especially
if it considers a 128-bit digest size reasonable for anything but a message
authentication code.

------
niftich
This is a hash designed for non-cryptographic use, like in hash tables or
bloom filters. You can tell by their small output size, which greatly reduces
the cost of a brute-force collision search. It's also in the linked readme.

Hash function families with a similar target usecase include: cityhash,
falkhash, farmhash, FNV, meowhash, metrohash, murmur, t1ha, wyhash, xxh.

The SMHasher suite tests hash functions for speed, distribution, bias, and
collisions. This function ranks well in those tests.

~~~
remcob
> This is a hash designed for non-cryptographic use

The Readme specifically says "you can modify it to yield 128-bits or more if
you want a cryptographically secure hash."

Which is a problematic statement, because it is not designed for cryptography
even if you extended the output to 256 bit. It's not the output length that
makes it cryptographicaly secure. (Rather it's the difference between "you
won't find collisions by accident" and "you won't find collisions even if you
try really hard using very sophisticated math", but there are more
requirements.)

This is similar to the issues with a different hash by the same Github user
posted two weeks ago:

[https://news.ycombinator.com/item?id=23103521](https://news.ycombinator.com/item?id=23103521)

Making fast non-cryptographic hash functions is a fun challenge and I
appreciate the projects, but please, please do not make any claims about
cryptographic properties!

~~~
fantastisch
you need others to not make crypto claims, but you're so sure making your
crypto claim that it's not cryptographic.

it's always easy to front as an expert spectator and believe others work is no
good. why don't you back it up with some effort?

I believe it is a crypto hash. so..a fun challenge?

how about this for a fun challenge, why don't you post a cryptanalysis and
I'll link it from the README.

Thanks for the beamsplitter link!

maybe it is designed as a crypto hash

------
ComputerGuru
Stay away. The author can't tell the difference between a cryptographically
secure hash and a regular one, claims that simply going from 64 to 128 bits
will make magically make this cryptographically secure, and then benchmarks it
against actually cryptographically secure hash functions (blake3), which are
almost necessarily slower than generic hash functions.

Look at meowhash and wyhash instead for the latest and greatest in that field.

~~~
fantastisch
Can you break this? If no, so your claim is basically hot air, is that right?

------
aappleby
Not bad, I like that the author stuck with plain C instead of sse/aes.

Looks like two interleaved 128-bit hashes, don't see any cross-half mixing.
Code style needs a bit of cleanup, the sindex stuff obscures the algorithm a
bit.

-Austin, SMHasher/Murmur author

~~~
osrec
Was it the inconsistent use of curly braces that made you comment on the code
style?

~~~
aappleby
Predeclared aliased pointers, STATEM/HSTATEM constants, len/Len, that kind of
stuff.

~~~
fantastisch
Thank you, Austin!

BTW, not sure _exactly_ what you mean by cross-half mixing, but I'm assuming
it means mixing both halves of the 128 bit state with each other. There _is_
that, on 3 lines 91[0], 100 and 108.

The 128 bit mix, stirs two adjacent blocks of 64 bits. Those 3 lines can mix
blocks 1 & 2.

Subsequent mixes on either half then propagate that cross mix. So it does use
the full 256 bit state.

Is that what you meant?

[0]:
[https://github.com/cris691/discohash/blob/master/src/discoha...](https://github.com/cris691/discohash/blob/master/src/discohash.cpp#L91)

~~~
antirez
Looks like a solid hash function indeed, with a very simple round.

Q: Is there any reason you don't use the same mixing function for the residue
of the string (the final tail that is not a full 8 bytes block)? You could
copy the bytes you have into a 8 byte array that is pre-padded with some
pattern, and call the same mixing function again. Is this in order to save the
distribution / avalanche properties in case the string is very short? In
general would be nice to read some design note, if P and Q were obtained
experimentally by checking for distribution or alike, if a different rotation
length changes significantly the distribution properties and so forth.

~~~
fantastisch
Great question, and thank you, Salvatore!

Some reasons for that are: different hashes for s and s\0 (without explicitly
using length) to make it harder to create collisions, related to as you say
better distribution/avalanche for short strings, also design-wise I like the
asymmetry that strings not perfectly divisible by 8 will be treated a little
differently. I just think that makes it better overall.

~~~
antirez
Thanks for the replies!

~~~
fantastisch
No probs. I think your idea of unifying the two loops is a good one.

A new version that uses the idea and overcomes the challenges would be great,
and would simplify the code even more.

If you had a way to make a start on it, I welcome your PR.

------
halayli
wyhash is very impressive for those interested. Not sure why it doesn't get a
lot of attention.

[https://github.com/wangyi-fudan/wyhash](https://github.com/wangyi-
fudan/wyhash)

~~~
remcob
It's odd to have the documentation in a .docx file in the repo. The code
itself is not very readable. The core of it is something like (simplified):

    
    
        function mix(uint64 a, uint64 b) {
            a ^= secret
            b ^= seed
            hi, lo = mul128(a, b)
            seed = hi ^ lo
        }
    

It's elegantly simple, but depends critically on 'secret' not appearing in the
data.

------
andreareina
An interesting read is Chris Wellons (aka null program) prospecting for hash
functions:
[https://nullprogram.com/blog/2018/07/31/](https://nullprogram.com/blog/2018/07/31/)

HN discussion:
[https://news.ycombinator.com/item?id=17659672](https://news.ycombinator.com/item?id=17659672)

------
CalChris
_Tested at ~ 5GB /s @ 3Gz, (Google Cloud Platform, N1 CPU)_

What is the memory bandwidth on that instance? I ask because I'm not seeing
that listed [1] but it would be a useful point of comparison. Maybe run Doctor
Bandwidth's STREAM benchmark [2].

DiscoHash is included in SMHASHER [3] but its benchmark results aren't.

[1] [https://cloud.google.com/compute/docs/machine-
types](https://cloud.google.com/compute/docs/machine-types)

[2] [https://www.cs.virginia.edu/stream/](https://www.cs.virginia.edu/stream/)

[3] [https://github.com/rurban/smhasher](https://github.com/rurban/smhasher)

~~~
fantastisch
> DiscoHash is included in SMHASHER [3] but its benchmark results aren't.

They _are_ included, but understandably you missed them because it's also
called BEBB4185, stated in README. Find BEBB4185 line in SMHASHER Readme.

Good question on the memory bandwidth. From memory it was a multicore system,
so I think that has a higher memory bandwidth than a single core. Thanks for
the STREAM thing!

------
zokier
It is fast hash sure, but it still only sits in the middle of the pack in
smhasher results, which is not very impressive. The code is maybe bit simpler
than some of its competition, but to my non-expert eyes its still not trivial,
and the compiled code still seems as big as some of its competitors. Although
I'm bit curious why there is no "size" result in the smhasher table for this
hash?

[http://rurban.github.io/smhasher/doc/table.html](http://rurban.github.io/smhasher/doc/table.html)

The ecrypt result is completely irrelevant when this can not be in any way be
considered cryptographically secure at this point.

~~~
fantastisch
> middle of the pack in smhasher

for all hashes. lot of those top ones are not crypto hashes.

> The ecrypt result is completely irrelevant when this can not be in any way
> be considered cryptographically secure

Break it first, then you can say that. otherwise, what would you know?

not sure if it's because English is not your first language or just had a
grumpy day or something else, this came across veeery negative. Be positive in
your comments on work.

if you want to be critical doing so with positivity gives it more credibility
because it shows you're able to see both sides. leading people to assume
you're more likely making a balanced rather than a biased assessment, whether
or not that's true.

------
bufferoverflow
xxHash64 has demonstrated 14 GB/s a while ago

[https://aras-p.info/img/blog/2016-08/hash2-pc.png](https://aras-p.info/img/blog/2016-08/hash2-pc.png)

And FarmHash64 with SSE4.2 did almost 18 GB/s:

[https://aras-p.info/img/blog/2016-08/hash2-farmhashoptions.p...](https://aras-p.info/img/blog/2016-08/hash2-farmhashoptions.png)

Full article: [https://aras-p.info/blog/2016/08/09/More-Hash-Function-
Tests...](https://aras-p.info/blog/2016/08/09/More-Hash-Function-Tests/)

------
azhenley
Should I use this as an everyday hash function (for non-security purposes)?

I am always interested when these get attention, but I don't know enough about
the implications to switch over from just using SHA.

~~~
niftich
Depends on your use-case and the tradeoffs you're willing to make. SHA1 is
slower than the the top finishers of the SMHasher suite [1], even if it's
hardware-accelerated. Meanwhile, SHA256 and later are currently considered to
be suitable for cryptographic use, so if you're not sensitive to the size of
the hash output, you may get additional nice properties that you didn't
intentionally design for. But they're even slower.

If you're shipping a black-box component that needs to use hashing internally,
and the hash outputs don't leak out of the black-box, consider switching to a
faster non-cryptographic hash to gain performance. Consider the implications
of switching -- dependencies, trust, performance profile, hash output size,
documentation, customer expectation -- and you will have to discard or
otherwise invalidate the meaning behind of your past hash outputs.

If you're occasionally applying a hash function and obtain a digest that gets
put into long-lived files or records, e.g. you're checksumming your own files
for sanity and then verify them later against these records, then you may
value availability and stability more than you value performance. If so, don't
switch.

If you are fine with the current performance profile, the cost and complexity
of switching (or any nontrivial change) may outweigh the benefits of leaving
everything as-is.

[1]
[https://github.com/rurban/smhasher#summary](https://github.com/rurban/smhasher#summary)

------
LAMike
Is this a replacement for SimpleCrypto on NPM?

~~~
fantastisch
No because I think that library is for encryption. This is for hashing.

------
fantastisch
Simple hash, 128-bit mixing function is just:

    
    
      mix(const int A)
        {
          const int B = A+1;
          
          ds[A] *= P;
          ds[A] = rot(ds[A], 23);
          ds[A] *= Q;
          
          ds[B] ^= ds[A];
    
          ds[B] *= P;
          ds[B] = rot(ds[B], 23);
          ds[B] *= Q;
        }
    

with P and Q prime.

~~~
gliptic
Please modify this: "The standard digest is 64-bits, but you can modify it to
take 128-bits if you want a cryptographically secure hash."

Just because it's 128-bit doesn't make it cryptographically secure.

~~~
fantastisch
Can you break 128-bits?

How should I say it?

~~~
espadrine
Hi Cris! First of all, I love to see new hash designs, it is great!

The standard disclaimer is “please do not use this for cryptographic
purposes,” placed at the top of the README.

(I like to add a hint afterwards, like “If you need security, please use
BLAKE3.”)

Second, if you do want to make a version in the same family with cryptographic
properties, a few things are expected:

• Careful list of cryptographic claims (is it a PRF? a PRP? a compression
function? is it collision resistant? with what probability of success?…)

• A published paper with preliminary cryptanalysis. What is the average number
of evaluations of the hash function for key recovery? How much probabilistic
information of the state bits can be gained from the output? How much output
leads to a state recovery? What is the worst statistical bias of the output
from single-bit input changes?

• Multiple rounds. The production hash should use at least one more round than
is shown to be cryptanalytically safe; ideally twice.

• I worry that multiplication, in particular 128-bit multiplication, is
subject to timing attacks. It is uncommon to see it used in cryptographic
hashes.

Disclaimer: I am not a cryptographer by trade, so this advice is insufficient.

