Hacker News
MetroHash: Faster, Better Hash Functions (jandrewrogers.com)
15 points by jandrewrogers on May 27, 2015 | 22 comments

Is there a quality benchmark for hashes?

In general, how does one evaluate the overall quality of a hash function, e.g. fewer hash collisions overall?

Is it done with corpora, like words in a dictionary, sentences from books, randomly generated strings, etc.?

I'm just interested in how one goes about showing that X has better quality than Y while still being faster.

For example, how would a language runtime choose its hash function for built-in types?

The best single quality benchmark is the SMHasher test suite. For most purposes it is very thorough at finding defects in non-cryptographic hash function designs. A strong design has good statistical properties regardless of the types of key sets and SMHasher tests many kinds of key sets, including several that are designed to be pathological.

There are two caveats to be aware of when using SMHasher.

First, it does not thoroughly exercise the internals of some hash function constructions that you sometimes see in larger hashes. You can modify the source to add this coverage but it is not there by default.

Second, the speed test implementation is broken. It is good enough to give you a rough idea of what is going on for bulk hashing relative to other algorithms, but the absolute performance measurements deviate from proper testing by a significant amount (up to 20-25%, anecdotally). Really fast small-key algorithms can also be significantly off. Basically, performance is measured using a method that Intel's performance measurement documentation strongly advises against. This can give inconsistent results even across CPUs based on the same microarchitecture.

Yeah, some hashes (City) change behavior for small keys and I'm not bothering to exercise every possible key size thoroughly.

The perf test was pretty quick and dirty - if you have better ways of benchmarking things I'd gladly accept a patch. :)

Well, at least what's great about SMHasher is that it's easy to integrate any hash function into it and compare them all, even if the speed test is flawed.

A few recommendations from the peanut gallery:

1. The files have .cpp extensions, but they seem like plain C other than static_cast/reinterpret_cast? For integrating with other software, just using C would be better. Use 'extern "C"' in the header file, though.

2. It's reasonable to optimize for Intel CPUs, but the current implementations of read_u64() and friends would crash on CPUs that don't allow unaligned reads. It would be nice if that were #ifdef'ed so that it was portable.

3. Similarly, it seems that the results of the hash functions are endian-dependent. Optimizing for little-endian makes sense, but it would be nice if they produced consistent answers on big-endian CPUs. This is often a requirement for uses like on-disk Bloom filters. Probably all that is required is doing endian swaps in read_u64() etc. on big-endian.
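To make points 2 and 3 concrete, a portable read could look something like the sketch below (`read_u64_le` is a hypothetical name, not from the MetroHash sources; the byte-order macros are the GCC/Clang ones):

```cpp
#include <cstdint>
#include <cstring>

// Portable little-endian 64-bit read. memcpy avoids unaligned-access
// faults on strict-alignment CPUs, and mainstream compilers optimize
// it to a single load on x86/ARM. The conditional byte swap makes the
// result identical on big-endian hosts.
static inline uint64_t read_u64_le(const void *p) {
    uint64_t v;
    std::memcpy(&v, p, sizeof(v));   // no alignment requirement
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
    v = __builtin_bswap64(v);        // normalize to little-endian
#endif
    return v;
}
```

On little-endian targets this compiles down to the same single mov/ldr the current cast-based code produces, so the portability costs nothing there.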

I've now updated my smhasher results at https://github.com/rurban/smhasher, including metrohash and xxHash but not FarmHash yet.

Of the good-enough hash functions, the metrohash variants are by far the fastest, so they should be the new default on 64-bit machines. On x86_64 and ARMv8 the crc variants are the best.

For 32-bit we are still stuck with Murmur.

In all tests I know of, xxh64 is about 2x faster than xxh32 on x86_64 systems. Surprisingly, that doesn't match the README table.

Interesting, thanks. Sounds legit. I'll recheck when I'm at home. Maybe the benchmarks got distorted.

    doc/xxHash32:Small key speed test -   10-byte keys -    35.02 cycles/hash
    doc/xxHash64:Small key speed test -   10-byte keys -    42.33 cycles/hash

Operations of the form "v[0] ^= rotate_right((v[0] * k2) + v[1], 17) * k1;" aren't reversible, so your intermediate states are leaking a few bits of entropy. Probably not enough to matter since the internal state is 256 bits - if it passes SMHasher, I'd use it. :)
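For intuition, here's a scaled-down 8-bit model of that kind of step (the constants and the rotation amount are arbitrary stand-ins, not MetroHash's actual values) that counts how many distinct outputs the map v0 -> mix(v0, v1) produces; a reversible step would always produce 256:

```cpp
#include <cstdint>

// 8-bit rotate right
static uint8_t rotr8(uint8_t x, int r) {
    return (uint8_t)((x >> r) | (x << (8 - r)));
}

// Toy 8-bit model of the step v[0] ^= rotr((v[0]*k2)+v[1], 17) * k1.
// k1, k2, and the rotation are arbitrary stand-ins, not MetroHash's.
static uint8_t mix(uint8_t v0, uint8_t v1) {
    const uint8_t k1 = 0xB5, k2 = 0x6B;
    return (uint8_t)(v0 ^ (uint8_t)(rotr8((uint8_t)(v0 * k2 + v1), 3) * k1));
}

// For each fixed v1, count the distinct outputs over all 256 inputs
// and return the worst case. A bijective (reversible) step would
// return exactly 256.
static int min_distinct_outputs() {
    int worst = 256;
    for (int v1 = 0; v1 < 256; ++v1) {
        bool seen[256] = {false};
        int distinct = 0;
        for (int v0 = 0; v0 < 256; ++v0) {
            uint8_t out = mix((uint8_t)v0, (uint8_t)v1);
            if (!seen[out]) { seen[out] = true; ++distinct; }
        }
        if (distinct < worst) worst = distinct;
    }
    return worst;   // < 256 means some state pairs collide
}
```

If the worst case comes out below 256, the step is provably lossy in this model, which is the entropy leak described above; as noted, the 256-bit internal state leaves plenty of slack.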

How does it compare to xxHash? https://github.com/Cyan4973/xxHash

Comparing MetroHash to CityHash is not really fair. Why would anyone use CityHash in 2015?

With respect to xxHash64, the closest (and slowest) algorithm is Metro64. I just ran it on my laptop using SMHasher for comparison purposes.

- Metro64 is about 10-15% faster for bulk hashing

- Metro64 is about 2x faster for small keys

- Metro64 has better statistical quality

These days, most people should be using 128-bit hash functions. Not only are they more collision resistant, they also run faster on modern microarchitectures.

Measuring CityHash at 32-bits is not a realistic measurement of its performance. Its structure is optimized for 64-128 bit hashes, which it actually does quite well for some types of keys.

Most is true, but:

128-bit hash functions make no sense for hash tables unless you have 1TB tables in external memory. And for security, 128 bits is peanuts; there you start at 256.

First, you need 4x more space when you also store the hash for faster collision checks. 32-bit is normal. Cache size mostly matters more than CPU.

Second, collision resistance in normal-sized hash tables uses only the first 7-15 bits, so calculating the other 113 bits doesn't make much sense. 32 bits is usually enough, unless you need huge tables, where 64 bits is enough.
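To make that concrete: a power-of-two table derives its bucket from the low bits only, and often caches a small hash per entry for cheap collision checks. A generic sketch, not any particular implementation:

```cpp
#include <cstdint>
#include <cstddef>

// Typical per-entry layout: the full key plus a cached 32-bit hash,
// so a probe can reject mismatches by comparing hashes before doing
// the (possibly expensive) key comparison. Toy sketch only.
struct Slot {
    uint32_t hash;   // cached hash for cheap mismatch rejection
    uint32_t key;    // toy: keys are just uint32_t here
};

// With a power-of-two capacity, the bucket index is a mask over the
// low log2(capacity) bits of the hash; all higher bits are discarded.
static inline size_t bucket_index(uint64_t h, size_t capacity) {
    return (size_t)(h & (capacity - 1));   // capacity must be 2^n
}
```

For a 4096-slot table only 12 bits of the 64-bit (let alone 128-bit) hash ever select a bucket; two hashes differing only in high bits land in the same bucket.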

My smhasher fork has both: https://github.com/rurban/smhasher/

Comparable results for xxHash and the new metrohash will be uploaded soon. These things take time.

But I'm more interested in comparing it to HW CRC32-C, the fastest and best hash so far, 2x faster than metro.

For the record, CRC32-C doesn't appear in your summary table.

I do it all the time. CityHash has been around long enough that it is broadly included in most hash suites on most platforms, but is new enough that it doesn't suck as bad as its predecessors.

It looks like 'FarmHash' is the successor from the CityHash authors. Does xxHash have 128/256-bit variants?

The MetroHash source code is impressively close to xxHash's, to the point that it seems to be a tweaked version of it.

Yet the author never mentions it anywhere in the source.

Possibly stupid question... why are there _1 and _2 variants for each hash size?

Mostly because I could. There are use cases for statistically independent hashes, such as Bloom filters, though these can sometimes be satisfied by other means. Having two independent functions for each gives people flexibility and also serves as a (trivial) proof of capability.

It would be a small thing to generate a dozen statistically independent versions of each hash algorithm, because the generation process is so straightforward. That by itself is unique among popular hash functions. One possible use case for randomly generating these functions is security through obscurity: having every hash table implementation use a different function makes them harder to attack.
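For reference, the "other means" for Bloom filters is usually double hashing (Kirsch & Mitzenmacher): derive all k indexes from two independent base hashes, e.g. metrohash64_1 and metrohash64_2, or the two halves of a single 128-bit hash. A sketch (function names are ours):

```cpp
#include <cstdint>
#include <cstddef>

// Derive k Bloom filter bit positions from two base hashes via
//   g_i(x) = h1(x) + i * h2(x)  (mod nbits)
// Kirsch & Mitzenmacher showed this preserves the filter's asymptotic
// false-positive rate while needing only two real hash computations.
static void bloom_indexes(uint64_t h1, uint64_t h2,
                          size_t k, size_t nbits, size_t *out) {
    for (size_t i = 0; i < k; ++i)
        out[i] = (size_t)((h1 + (uint64_t)i * h2) % nbits);
}
```

So two independent functions per size, as provided, is already enough to drive a Bloom filter with any number of probes.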

Well, we don't call that "security by obscurity"; we call that "universal hashing", which is provably secure.

It would be nice to have the rules for creating the constants, to actually build good universal hashing. Two variants are not enough yet, but the quality of _1 and _2 is already much better than that of the typical universal hash function variant.
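As an illustration of the general idea, multiply-shift is the textbook universal family: pick a random odd multiplier per table and keep the top bits. This is a sketch of the technique itself, not of how metrohash's constants are generated:

```cpp
#include <cstdint>
#include <random>

// Multiply-shift hashing: h_a(x) = (a * x) >> (64 - m) for an m-bit
// output, with a drawn at random (and forced odd) when the table is
// created. Each table instance gets its own function from the family,
// so an attacker cannot precompute colliding keys.
struct SeededHash {
    uint64_t a;   // random odd multiplier, fixed per table
    unsigned m;   // output bits, 1..63

    explicit SeededHash(unsigned out_bits, uint64_t seed)
        : a(std::mt19937_64(seed)() | 1ULL), m(out_bits) {}

    uint64_t operator()(uint64_t x) const {
        return (a * x) >> (64 - m);   // keep the top m bits
    }
};
```

The seed would come from a CSPRNG at process start; the point is only that the function is chosen at random from a family with proven collision bounds, not hardcoded.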

I would be interested to see what the x64 versions were like.

These versions only work on 64-bit machines, optimized for x64 and arm64. The crc variant works only on those.
