In general, how does one test the overall uniqueness quality of a hash function, i.e. that it produces fewer collisions overall?
Is it done with corpora such as words from a dictionary, sentences from books, randomly generated strings, etc.?
I'm just interested in how one would show that X has better quality than Y and is still faster.
For example, how would a language runtime choose its hash function for built-in types?
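One common approach is a bucket-distribution test: hash a corpus into N buckets and compare the observed spread against a uniform distribution with a chi-squared statistic. A minimal C++ sketch of that idea (the corpus path and the std::hash stand-in are assumptions; swap in the hash under test):

    #include <cstdint>
    #include <fstream>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        const size_t kBuckets = 1 << 16;
        std::vector<uint64_t> counts(kBuckets, 0);
        std::ifstream words("/usr/share/dict/words");  // assumed corpus path
        if (!words) { std::cerr << "no corpus\n"; return 1; }
        std::string w;
        size_t n = 0;
        while (words >> w) {
            // std::hash is a placeholder for the hash under test
            counts[std::hash<std::string>{}(w) % kBuckets]++;
            ++n;
        }
        // Chi-squared against a uniform distribution; for a good hash this
        // should land near the degrees of freedom (kBuckets - 1).
        double expected = double(n) / kBuckets, chi2 = 0.0;
        for (uint64_t c : counts) {
            double d = c - expected;
            chi2 += d * d / expected;
        }
        std::cout << "n=" << n << " chi2=" << chi2
                  << " (df=" << kBuckets - 1 << ")\n";
    }

SMHasher automates a battery of such tests (avalanche, bit independence, keysets like sparse and cyclic keys) alongside speed measurements.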
There are two caveats to be aware of when using SMHasher.
First, it does not thoroughly exercise the internals of some hash function constructions that you sometimes see in larger hashes. You can modify the source to add this coverage, but it is not there by default.
Second, the speed test implementation is broken. It is good enough to give you a rough idea of bulk hashing performance relative to other algorithms, but the absolute measurements deviate from proper testing by a significant amount (up to 20-25% anecdotally), and timings for really fast small-key algorithms can be significantly off. Basically, performance is measured using a method that Intel's performance measurement documentation strongly advises against. This can give inconsistent results even across CPUs based on the same microarchitecture.
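For reference, the pattern Intel's benchmarking guidance recommends boils down to serializing around the timestamp reads; a rough sketch of that, assuming x86-64 with GCC/Clang intrinsics:

    #include <cstdint>
    #include <x86intrin.h>  // __rdtsc, __rdtscp

    // Serialize with CPUID before the start read so earlier instructions
    // cannot drift into the measured region.
    static inline uint64_t cycles_begin() {
        unsigned a, b, c, d;
        __asm__ __volatile__("cpuid"
                             : "=a"(a), "=b"(b), "=c"(c), "=d"(d) : "a"(0));
        return __rdtsc();
    }

    // RDTSCP waits for prior instructions to retire before reading the TSC;
    // the trailing CPUID keeps later instructions out of the measured region.
    static inline uint64_t cycles_end() {
        unsigned aux;
        uint64_t t = __rdtscp(&aux);
        unsigned a, b, c, d;
        __asm__ __volatile__("cpuid"
                             : "=a"(a), "=b"(b), "=c"(c), "=d"(d) : "a"(0));
        return t;
    }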
The perf test was pretty quick and dirty - if you have better ways of benchmarking things I'd gladly accept a patch. :)
1. The files have .cpp extensions, but they seem like plain C other than static_cast/reinterpret_cast? For integrating with other software, plain C would be better. At minimum, wrap the declarations in 'extern "C"' in the header file.
2. It's reasonable to optimize for Intel CPUs, but the current implementations of read_u64() and friends will crash on CPUs that don't allow unaligned reads. It would be nice if that were #ifdef'ed so the code stays portable.
3. Similarly, the results of the hash functions are endian-dependent. Optimizing for little-endian makes sense, but it would be nice if they produced consistent answers on big-endian CPUs; this is often a requirement for uses like on-disk Bloom filters. Probably all that's needed is an endian swap in read_u64() and friends on big-endian (see the sketch after this list).
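A minimal sketch of a portable read_u64() along those lines (assuming GCC/Clang's byte-order macros and byte-swap builtin; compilers typically lower the memcpy to a single load):

    #include <cstdint>
    #include <cstring>

    static inline uint64_t read_u64(const void* p) {
        uint64_t v;
        std::memcpy(&v, p, sizeof(v));  // safe on strict-alignment CPUs
    #if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
        v = __builtin_bswap64(v);       // normalize to little-endian semantics
    #endif
        return v;
    }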
Of the good-enough hash functions, the metrohash variants are by far the fastest, so they should be the new default on 64-bit machines.
On x86_64 and ARMv8 the CRC variants are the best.
On 32-bit we are still stuck with Murmur.
doc/xxHash32:Small key speed test - 10-byte keys - 35.02 cycles/hash
doc/xxHash64:Small key speed test - 10-byte keys - 42.33 cycles/hash
Comparing MetroHash to CityHash is not really fair. Why would anyone use CityHash in 2015?
- Metro64 is about 10-15% faster for bulk hashing
- Metro64 is about 2x faster for small keys
- Metro64 has better statistical quality
These days, most people should be using 128-bit hash functions. Not only are they more collision resistant, they also run faster on modern microarchitectures.
Measuring CityHash at 32 bits is not a realistic measure of its performance. Its structure is optimized for 64-128 bit hashes, which it actually does quite well for some types of keys.
128-bit hash functions for hash tables make no sense unless you have 1TB tables in external memory.
And for security 128 bits is peanuts; there you start at 256.
First, you need 4x more space when you also store the hash for faster collision checks; 32-bit is the norm, and cache size matters more than CPU speed in most cases.
Second, collision resistance on normal-sized hash tables only uses the first 7-15 bits of the hash, so computing the other 113 bits makes little sense. 32 bits is usually enough unless you need huge tables, where 64 bits is enough.
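To make that concrete: a power-of-two table derives its bucket index from only the low bits of the hash, so the rest of a wide hash never influences placement.

    #include <cstddef>
    #include <cstdint>

    // With 2^k buckets, masking keeps only the low k bits; e.g. for
    // k = 15, the remaining 113 bits of a 128-bit hash are never used.
    static inline size_t bucket_index(uint64_t hash, size_t num_buckets) {
        return hash & (num_buckets - 1);  // num_buckets must be a power of two
    }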
Comparable results for xxHash and the new metrohash will be uploaded soon. These things take time.
But I'm more interested in comparing it to hardware CRC32-C, the fastest and best hash so far, 2x faster than metro.
Yet the author never says a single word about it in the source.
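For anyone curious, hardware CRC32-C is typically driven through the SSE4.2 intrinsics; a rough sketch of such a hash (assuming x86-64 built with -msse4.2, tail handling kept minimal):

    #include <cstdint>
    #include <cstring>
    #include <nmmintrin.h>  // SSE4.2 CRC32-C intrinsics

    static uint32_t crc32c_hash(const void* data, size_t len, uint32_t seed) {
        const uint8_t* p = static_cast<const uint8_t*>(data);
        uint64_t crc = seed;
        while (len >= 8) {                      // 8 bytes per instruction
            uint64_t chunk;
            std::memcpy(&chunk, p, 8);
            crc = _mm_crc32_u64(crc, chunk);
            p += 8; len -= 8;
        }
        while (len--)                           // byte-at-a-time tail
            crc = _mm_crc32_u8(static_cast<uint32_t>(crc), *p++);
        return static_cast<uint32_t>(crc);
    }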
It would be a small thing to generate a dozen statistically independent versions of each hash algorithm, since the generation process is so straightforward; that by itself is unique among popular hash functions. One possible use case for randomly generating these functions is security through obscurity: have every hash table implementation use a different function to make them more difficult to attack (see the sketch below).
The rules for creating the constants would be nice to have, to actually enable good universal hashing. Two variants are not enough yet, but the quality of _1 and _2 is already much better than the typical universal hash function variant.
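A sketch of the per-table randomization idea: draw a random seed per process or per table and pass it to a seedable variant (whether metrohash's seed parameter gives true statistical independence between seeds is exactly the open question above; the usage shown is illustrative):

    #include <cstdint>
    #include <random>

    // One seed per process (or per table) makes hash values unpredictable
    // to an attacker who only knows the algorithm family.
    static uint64_t random_hash_seed() {
        std::random_device rd;                   // OS entropy, assumed available
        return (uint64_t(rd()) << 32) | rd();
    }
    // e.g. metrohash64_1(key, len, static_cast<uint32_t>(random_hash_seed()), out);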