
Fast strongly universal 64-bit hashing everywhere - ingve
https://lemire.me/blog/2018/08/15/fast-strongly-universal-64-bit-hashing-everywhere/
======
harshreality
It's "fast" and "strong"? No measurements are given other than one informal
speed comparison against murmur64.

What is the goal, and where's the comparison, relative to xxhash, siphash,
highwayhash?

[https://en.wikipedia.org/wiki/List_of_hash_functions#Non-cry...](https://en.wikipedia.org/wiki/List_of_hash_functions#Non-cryptographic_hash_functions)

I'm not sure what the point of the blog post's universal hash functions is,
within the domain of practical software engineering. Who's the audience,
programmers or number theorists? Who's going to implement this, and how are
they going to choose (or write a function to randomly choose) a, b, and c to
give at least adequate properties for their task?

~~~
CodesInChaos
It's not "strong", it's "strongly universal", which is a well-defined
mathematical property. Universal hash functions are the basis of the MAC part
of popular authenticated encryption algorithms, like GCM and ChaCha20-Poly1305
(though for this purpose 128-bit tags are preferred).

Though I don't see what the point of this particular implementation is, with
its hardcoded keys and fixed 64-bit inputs instead of a random secret key and
variable length inputs.

~~~
stochastic_monk
Because it’s hashing an item of fixed size. This is specialized for 32 or
64-bit integers. clhash supports arbitrary length strings. There is a
performance cost for that generality.

For example, in my experiments, clhash is 1/3 as fast as murmurhash or Wang’s
64-bit hash for 64-bit integers.

------
crankylinuxuser
Uh, wait a second...

64-bit hashing means only 2^64 possible outputs. Understanding the birthday
paradox ([https://en.wikipedia.org/wiki/Birthday_attack](https://en.wikipedia.org/wiki/Birthday_attack))
means that one need only process about 2^32 hashes to find a collision.
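The birthday bound is easy to see at a smaller scale: truncate hashes to n bits and a collision shows up after roughly 2^(n/2) inputs. A quick sketch — the 32-bit truncation and the use of java.util.Random as a stand-in for a well-mixed hash are my assumptions, not from the thread:

```java
import java.util.HashSet;
import java.util.Random;

public class BirthdayDemo {
    // Count how many random 64-bit inputs we can draw before their
    // 32-bit truncations collide. Birthday bound: expect roughly
    // sqrt(2^32) ~ 2^16, i.e. on the order of tens of thousands,
    // not 2^32.
    static long inputsUntilCollision(long seed) {
        Random rng = new Random(seed);
        HashSet<Integer> seen = new HashSet<>();
        long count = 0;
        while (true) {
            // Keep only the top 32 bits of a 64-bit value,
            // standing in for a truncated hash output.
            int h = (int) (rng.nextLong() >>> 32);
            count++;
            if (!seen.add(h)) return count; // add() is false on a repeat
        }
    }

    public static void main(String[] args) {
        System.out.println("collision after " + inputsUntilCollision(42L) + " inputs");
    }
}
```

The same square-root scaling is what turns 2^64 outputs into a ~2^32 collision search.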

Yuck. NO.

~~~
MichaelBurge
Not to mention a "fast hash" is inherently insecure, so the author must be
mistaken: he should be trying to add arithmetic operations rather than remove
them.

~~~
olliej
Fast hashes are not inherently insecure - in fact you want hashing to be as
fast as possible. You may be confusing the general case of hashing data with
the specific case of hashing passwords.

The goal of the former is to provide a hash such that the likelihood of any
two arbitrary objects hashing to the same value is about 1 in 2^n for an n-bit
hash (I think?). For cryptographic hashes there are additional goals, like the
hash output being uniform across all inputs, and that changing any single bit
in the input should on average change half the bits in the output, etc. That's
because the use case is validation of a single input - e.g. verifying your
download isn't corrupt or whatever.

For passwords the goal is different - you're generally dealing with small
objects, and so the cost of an "expensive" hash measured in milliseconds is
acceptable - the expectation being that people generally don't get them wrong.
But the attack scenario here is an attacker with a password hash wanting to
brute force a large number of passwords to find the one that hashes to it.
Given that "large" is potentially very large, it doesn't take a huge amount of
time per hash to make the attack non-viable.

This particular version is hashing of the former type, without cryptographic
concerns. Its goal - to my reading - is to be a hash function that achieves
something close to a single bit change flipping half the bits, while also
keeping the probability that any two distinct values hash to the same output
provably low.
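The "single bit change flips half the output bits" criterion can be measured directly: flip each input bit in turn and count how many output bits change, averaged over random inputs. A sketch using MurmurHash3's fmix64 finalizer as the mixer under test — the choice of mixer and the sample size are my own; any of the hashes in this thread could be plugged in the same way:

```java
import java.util.Random;

public class AvalancheDemo {
    // MurmurHash3's 64-bit finalizer, designed so that flipping any
    // single input bit flips about half of the output bits.
    static long fmix64(long x) {
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33;
        return x;
    }

    // Average number of output bits flipped when one input bit is
    // flipped, over `trials` random inputs and all 64 bit positions.
    // A perfect avalanche would average 32.
    static double avgFlipped(int trials, long seed) {
        Random rng = new Random(seed);
        long totalFlipped = 0;
        for (int t = 0; t < trials; t++) {
            long x = rng.nextLong();
            long hx = fmix64(x);
            for (int bit = 0; bit < 64; bit++) {
                totalFlipped += Long.bitCount(hx ^ fmix64(x ^ (1L << bit)));
            }
        }
        return (double) totalFlipped / ((long) trials * 64L);
    }

    public static void main(String[] args) {
        System.out.println("average output bits flipped: " + avgFlipped(1000, 1L));
    }
}
```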

------
natch
From the code:

    long a,b,c; // randomly assigned 64-bit values

As I first understood the way this blog post was written, the author was
relying on the JRE to populate these uninitialized variables with random
values.

Edit: that was an incorrect understanding. It's not shown in the post, but in
his real code he actually does assign values to these (thanks for the
clarifying replies).

More of the code for context:

    // in Java, long is 64-bit, int 32-bit
    long a,b,c; // randomly assigned 64-bit values

    int hash32(long x) {
      int low = (int)x;
      int high = (int)(x >>> 32);
      return (int)((a * low + b * high + c) >>> 32);
    }
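For reference, a self-contained version in which a, b and c actually do get assigned — the class wrapper and the use of java.util.Random are additions for illustration; only the body of hash32 comes from the post:

```java
import java.util.Random;

public class StronglyUniversal64 {
    // Randomly chosen keys; drawing a fresh triple selects a new
    // member of the hash family.
    final long a, b, c;

    StronglyUniversal64(Random rng) {
        a = rng.nextLong();
        b = rng.nextLong();
        c = rng.nextLong();
    }

    // Fixed keys, handy for testing.
    StronglyUniversal64(long a, long b, long c) {
        this.a = a; this.b = b; this.c = c;
    }

    // The post's multiply-shift hash: 64-bit input, 32-bit output.
    int hash32(long x) {
        int low = (int) x;
        int high = (int) (x >>> 32);
        return (int) ((a * low + b * high + c) >>> 32);
    }

    public static void main(String[] args) {
        StronglyUniversal64 h = new StronglyUniversal64(new Random());
        System.out.println(h.hash32(123456789L));
    }
}
```

Note that `low` and `high` are sign-extended to 64 bits before the multiplications, which is what the original code relies on as well.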

~~~
gjm11
What he means is that to construct a particular instance of this hash function
you pick random values a,b,c and do what he describes. You can make multiple
different hash functions by picking different random (a,b,c) triples.

Now, there are some specific choices of a,b,c that yield bad results. If you
pick a=b=0 then your hash function ignores your data. If a or b is a multiple
of a large power of 2 then it ignores many bits of your data. If you take e.g.
(a,b,c) and (3a,3b,c') then you get two closely-related hash functions.

The specific property he's interested in here is that of being "strongly
universal" which means that when you pick your hash function at random from
whatever family of hash functions you've got, "on average" it behaves well.
(Specifically, it means that if you fix two possible _inputs_ and vary your
_hash function_ at random then collisions between those two inputs are as rare
as they should be.) His family is strongly universal despite the issues I
describe above because having _enough_ low-order bits be zero to cause trouble
is "rare enough".
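This can be checked empirically: fix two distinct inputs, draw many random (a,b,c) triples, and count how often the pair collides. To make collisions observable in a small sample, the sketch below keeps only the top 8 bits of the hash — truncating a pairwise-independent 32-bit output leaves pairwise-independent 8-bit outputs, so the collision rate should be about 1/256. The truncation and the sample size are my choices, not the post's:

```java
import java.util.Random;

public class StrongUniversalityCheck {
    // The post's hash, truncated to its top 8 bits so that collisions
    // between one fixed pair of inputs show up at an observable rate.
    static int hash8(long a, long b, long c, long x) {
        int low = (int) x;
        int high = (int) (x >>> 32);
        return (int) ((a * low + b * high + c) >>> 56);
    }

    // Fraction of random (a,b,c) triples under which two fixed,
    // distinct inputs collide. Strong universality predicts about
    // 1/256, i.e. roughly 0.0039.
    static double collisionRate(long x, long y, int trials, long seed) {
        Random rng = new Random(seed);
        int collisions = 0;
        for (int t = 0; t < trials; t++) {
            long a = rng.nextLong(), b = rng.nextLong(), c = rng.nextLong();
            if (hash8(a, b, c, x) == hash8(a, b, c, y)) collisions++;
        }
        return (double) collisions / trials;
    }

    public static void main(String[] args) {
        System.out.println(collisionRate(42L, 43L, 200000, 7L));
    }
}
```

The point is that the probability is over the random choice of (a,b,c), not over the inputs — exactly the "vary your hash function" framing above.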

