
MetroHash: Faster, Better Hash Functions - jandrewrogers
http://www.jandrewrogers.com/2015/05/27/metrohash/
======
malkia
Is there a quality benchmark for hashes?

In general, how does one test the overall quality of a hash function - e.g.
that it produces fewer collisions overall?

Is it done with some corpora, like words in a dictionary, sentences from
books, randomly generated strings, etc.?

I'm just interested in how one goes about showing that X has better quality
than Y while still being faster.

For example, how would a language runtime choose its hash function for
built-in types?

~~~
jandrewrogers
The best single quality benchmark is the SMHasher test suite. For most
purposes it is very thorough at finding defects in non-cryptographic hash
function designs. A strong design has good statistical properties regardless
of the types of key sets and SMHasher tests many kinds of key sets, including
several that are designed to be pathological.

There are two caveats to be aware of when using SMHasher.

First, it does not thoroughly exercise the internals of some hash function
constructions that you sometimes see in larger hashes. You can modify the
source to add this coverage but it is not there by default.

Second, the speed test implementation is broken. It is good enough to give
you a rough idea of what is going on for bulk hashing relative to other
algorithms, but the absolute performance measurements deviate from proper
testing by a significant amount (up to 20-25% anecdotally). Really fast
small-key algorithms can also be significantly off. Basically, performance is
measured using a method that Intel's performance measurement documentation
strongly advises against. This can give inconsistent results even across
CPUs that are based on the same microarchitecture.
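
For reference, a serialized timing pattern along the lines Intel's
benchmarking documentation recommends looks roughly like this (a minimal
sketch for GCC/Clang on x86_64; the hash call is a placeholder, not
SMHasher's actual harness):

    #include <cstdint>
    #include <cpuid.h>      // __get_cpuid (GCC/Clang)
    #include <x86intrin.h>  // __rdtsc, __rdtscp

    // CPUID serializes, so nothing earlier is still in flight when the
    // counter is read; RDTSCP waits for preceding instructions to finish.
    static inline uint64_t cycles_begin() {
        unsigned a, b, c, d;
        __get_cpuid(0, &a, &b, &c, &d);
        return __rdtsc();
    }

    static inline uint64_t cycles_end() {
        unsigned aux, a, b, c, d;
        uint64_t t = __rdtscp(&aux);
        __get_cpuid(0, &a, &b, &c, &d);  // keep later code below rdtscp
        return t;
    }

    // Usage: take the minimum over many runs to filter out interrupts:
    //   uint64_t t0 = cycles_begin();
    //   hash(key, len, seed, out);     // hypothetical hash call
    //   uint64_t dt = cycles_end() - t0;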

~~~
aappleby
Yeah, some hashes (City) change behavior for small keys and I'm not bothering
to exercise every possible key size thoroughly.

The perf test was pretty quick and dirty - if you have better ways of
benchmarking things I'd gladly accept a patch. :)

~~~
gpakosz
Well, at least what's awesome about SMHasher is that it's easy to integrate
hash functions into it and compare them, even if the speed test is supposedly
flawed.

------
bodyfour
A few recommendations from the peanut gallery:

1. The files have .cpp extensions, but they seem like plain C other than
static_cast/reinterpret_cast? For integrating with other software, just using
C would be better. Use 'extern "C"' in the header file, though.

2. It's reasonable to optimize for Intel CPUs, but the current
implementations of read_u64() and friends would crash if used on CPUs that
don't allow unaligned reads. It would be nice if that was #ifdef'ed so that
it was portable (see the sketch after this list).

3. Similarly, it seems that the results of the hash functions are endian-
dependent. Optimizing for little-endian makes sense, but it would be nice if
they produced consistent answers on big-endian CPUs. This is often a
requirement for uses like on-disk Bloom filters. Probably all that is
required is doing endian swaps in read_u64()/etc. on big-endian.
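
A minimal sketch of points 2 and 3 combined (assuming a read_u64() helper as
described; the memcpy idiom compiles to a single load on x86, and GCC/Clang's
predefined byte-order macros gate the swap):

    #include <cstdint>
    #include <cstring>

    // Portable little-endian read: memcpy avoids faults on strict-alignment
    // CPUs, and big-endian targets byte-swap so the hash output is identical
    // on all platforms.
    static inline uint64_t read_u64(const void* ptr) {
        uint64_t v;
        std::memcpy(&v, ptr, sizeof(v));
    #if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
        v = __builtin_bswap64(v);
    #endif
        return v;
    }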

------
rurban
I've now updated my smhasher results at
[https://github.com/rurban/smhasher](https://github.com/rurban/smhasher),
including metrohash and xxHash but not FarmHash yet.

Of the good-enough hash functions, the metrohash variants are by far the
fastest, so they should be the new default on 64-bit machines. On x86_64 and
ARMv8 the crc variants are the best.

For 32-bit we are still stuck with Murmur.

~~~
Cyan4973
In all the tests I know of, xxh64 is about 2x faster than xxh32 on x86_64
systems. Surprisingly, that doesn't match the readme table.

~~~
rurban
Interesting, thanks. Sounds legit. I'll recheck again when I'm at home. Maybe
the benchmarks got distorted.

        doc/xxHash32:Small key speed test -   10-byte keys -    35.02 cycles/hash
        doc/xxHash64:Small key speed test -   10-byte keys -    42.33 cycles/hash

------
aappleby
Operations of the form "v[0] ^= rotate_right((v[0] * k2) + v[1], 17) * k1;"
aren't reversible, so your intermediate states are leaking a few bits of
entropy. Probably not enough to matter since the internal state is 256 bits -
if it passes SMHasher, I'd use it. :)
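
To see why, here is a toy 8-bit analogue of that step with made-up constants
(not MetroHash's actual parameters). A reversible step would reach all 256
outputs; a step of this shape typically reaches noticeably fewer, which is
exactly the entropy leak:

    #include <cstdint>
    #include <cstdio>
    #include <bitset>

    static uint8_t rotr8(uint8_t x, int r) {
        return (uint8_t)((x >> r) | (x << (8 - r)));
    }

    int main() {
        const uint8_t k1 = 0xb5, k2 = 0x67;  // arbitrary odd constants
        const uint8_t v1 = 0x3c;             // other state word, held fixed
        std::bitset<256> seen;
        for (unsigned v0 = 0; v0 < 256; ++v0) {
            uint8_t t = (uint8_t)(v0 * k2);
            t = (uint8_t)(t + v1);
            uint8_t out = (uint8_t)(v0 ^ (uint8_t)(rotr8(t, 3) * k1));
            seen[out] = true;  // record which outputs are reachable
        }
        // Fewer than 256 distinct outputs means distinct states merged.
        std::printf("distinct outputs: %zu of 256\n", seen.count());
        return 0;
    }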

------
jedisct1
How does it compare to xxHash?
[https://github.com/Cyan4973/xxHash](https://github.com/Cyan4973/xxHash)

Comparing MetroHash to CityHash is not really fair. Why would anyone use
CityHash in 2015?

~~~
jandrewrogers
With respect to xxHash64, the closest (and slowest) algorithm is Metro64. I
just ran it on my laptop using SMHasher for comparison purposes.

- Metro64 is about 10-15% faster for bulk hashing

- Metro64 is about 2x faster for small keys

- Metro64 has better statistical quality

These days, most people should be using 128-bit hash functions. Not only are
they more collision resistant, they also run faster on modern
microarchitectures.

Measuring CityHash at 32 bits is not a realistic measure of its performance.
Its structure is optimized for 64-128 bit hashes, which it actually handles
quite well for some types of keys.

~~~
rurban
Most of that is true, but:

128-bit hash functions make no sense for hash tables, unless you have 1 TB
tables in external memory. And for security, 128 bits is peanuts; you start
at 256 bits there.

First, you need 4x more space when you also store the hash for faster
collision checks; 32-bit is normal. Cache size mostly matters more than CPU
speed.

Second, collision resistance on normal-sized hash tables uses only the first
7-15 bits of the hash, so calculating the other 113 bits doesn't make much
sense. 32 bits is usually enough; only huge tables need more, and there
64 bits is enough.
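
As a toy illustration of the arithmetic (an assumed layout, not any
particular hash table implementation): with 2^14 buckets only 14 bits of the
hash select a slot, and a cached 32-bit copy already rejects nearly all
non-matching keys before the full comparison, so the upper bits of a 128-bit
hash would never be consumed:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <string>
    #include <vector>

    struct Entry {
        uint32_t hash32;   // truncated hash cached for cheap collision checks
        std::string key;
    };

    int main() {
        const std::size_t capacity = 1u << 14;   // 16384 buckets
        std::vector<std::vector<Entry>> table(capacity);
        std::hash<std::string> hash64;           // stand-in 64-bit hash

        std::string key = "example";
        uint64_t h = hash64(key);
        std::size_t slot = h & (capacity - 1);   // consumes only 14 bits
        table[slot].push_back({(uint32_t)h, key});

        // Lookup: the 32-bit compare filters almost every collision.
        for (const Entry& e : table[slot])
            if (e.hash32 == (uint32_t)h && e.key == key)
                std::printf("found %s\n", e.key.c_str());
        return 0;
    }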

------
Cyan4973
The MetroHash source code is impressively close to xxHash's, to the point
that it seems to be a tweaked version of it.

Yet the author never says a single word about it in his source.

------
arthursilva
Possibly a stupid question... Why are there _1 and _2 variants for each hash
size?

~~~
jandrewrogers
Mostly because I could. There are use cases for statistically independent
hashes, such as Bloom filters (see the sketch below), though these can
sometimes be satisfied by other means. Having two independent functions for
each size gives people flexibility and also serves as a (trivial) proof of
capability.

It would be a small thing to generate a dozen statistically independent
versions of each hash algorithm, because the generation process is so
straightforward. This by itself is unique among popular hash functions. One
possible use case for randomly generating these functions is security through
obscurity: having every hash table implementation use a different function
makes them more difficult to attack.
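
A minimal sketch of the Bloom filter use case, using the Kirsch-Mitzenmacher
double-hashing construction to derive all probe positions from two
independent base hashes. The h1/h2 stand-ins below only play the role of a
_1/_2 pair; they are placeholders, not the MetroHash API:

    #include <cstddef>
    #include <cstdint>
    #include <bitset>
    #include <functional>
    #include <string>

    // Double-hashing Bloom filter: probe i lands at h1(x) + i*h2(x), so
    // two independent hashes suffice for any number of probes.
    struct BloomFilter {
        static constexpr std::size_t kBits = 1 << 20;  // filter size in bits
        static constexpr int kProbes = 7;              // probes per key
        std::bitset<kBits> bits;

        // Placeholder hashes; a real user would plug in a _1/_2 pair here.
        static uint64_t h1(const std::string& s) {
            return std::hash<std::string>{}(s);
        }
        static uint64_t h2(const std::string& s) {
            return std::hash<std::string>{}(s + "\x01");  // crude second hash
        }

        void add(const std::string& s) {
            uint64_t a = h1(s), b = h2(s);
            for (int i = 0; i < kProbes; ++i)
                bits[(a + i * b) % kBits] = true;
        }
        bool maybe_contains(const std::string& s) const {
            uint64_t a = h1(s), b = h2(s);
            for (int i = 0; i < kProbes; ++i)
                if (!bits[(a + i * b) % kBits]) return false;
            return true;  // "probably present": false positives possible
        }
    };

    int main() {
        BloomFilter bf;  // ~128 KB of bits on the stack
        bf.add("metrohash");
        return bf.maybe_contains("metrohash") ? 0 : 1;
    }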

~~~
rurban
Well, we don't call that "security by obscurity", we call that "universal
hashing", which is provably secure.

The rules for creating the constants would be nice to have, in order to
actually create good universal hashing. Two variants are not enough yet, but
the quality of _1 and _2 is already much better than that of the typical
universal hash function variant.
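
For reference, the classic multiply-shift family is a minimal example of the
idea (a sketch, not MetroHash's generation procedure): each random odd
multiplier selects one member of an approximately 2-universal family.

    #include <cstdint>
    #include <random>

    // Multiply-shift universal hashing: h_a(x) = (a * x mod 2^64) >> (64 - m)
    // maps 64-bit keys to m-bit values; drawing 'a' at random picks the
    // family member.
    struct MultiplyShift {
        uint64_t a;   // random odd multiplier, the member's "key"
        unsigned m;   // number of output bits (1..63)

        static MultiplyShift random_member(unsigned m) {
            std::mt19937_64 gen(std::random_device{}());
            return {gen() | 1, m};        // multiplier must be odd
        }
        uint64_t operator()(uint64_t x) const {
            return (a * x) >> (64 - m);   // top m bits of the product
        }
    };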

------
TheLoneWolfling
I would be interested to see what the x64 versions were like.

~~~
rurban
These versions only work on 64-bit machines; they are optimized for x64 and
arm64. The crc variants only work on those.

