
XXH3 – a new speed-optimized hash algorithm - corysama
https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html
======
pcwalton
It's good to see a focus on small input sizes. Small keys are extremely
important for hash functions and are often overlooked. Whenever I've looked at
the distribution of key sizes in real-world Rust code using hash maps, it's
always turned out to be dominated by shockingly small keys. 8-byte hash inputs
are very common in real-world code, for instance.

Most hash functions and benchmarks are not tuned for small inputs. This is why
SipHash is disfavored in Rust these days: while its performance is excellent
on large keys, it suffers quite a bit on small ones.

~~~
adrianratnapala
Yes, although even here, small is not all that small:

> The main focus is about short keys of random lengths, with a distribution of
> length roughly in the 20-30 bytes area, featuring occasional outliers, both
> tiny and large.

I have been playing around trying to write interpreters and compilers as a
hobby, and though I haven't taken stats, I suspect that in source code,
identifiers half that length are more than twice as common.

It makes me wonder if compilers are better off using old-school byte-by-byte
hashes rather than new ones optimised for parallelism. (Then again, the hash
function probably makes no real difference to speed in a real-world compiler
that does optimisation.)

~~~
zokier
> Yes, although even here, small is not all that small:

> > The main focus is about short keys of random lengths, with a distribution
> of length roughly in the 20-30 bytes area, featuring occasional outliers,
> both tiny and large.

Based on the graphs, XXH3 still seems to beat everything at key sizes > 6
(presumably bytes?)

------
injinj
I added xxh3 to my fork of smhasher:
[https://github.com/injinj/smhasher](https://github.com/injinj/smhasher)

I find it to be the fastest hash in my stable of hashes.

Graphs for 64 / 128:

[https://github.com/injinj/smhasher/blob/master/plot64.svg](https://github.com/injinj/smhasher/blob/master/plot64.svg)

[https://github.com/injinj/smhasher/blob/master/plot128.svg](https://github.com/injinj/smhasher/blob/master/plot128.svg)

Latency graphs for 64 / 128:

[https://github.com/injinj/smhasher/blob/master/plot64lat.svg](https://github.com/injinj/smhasher/blob/master/plot64lat.svg)

[https://github.com/injinj/smhasher/blob/master/plot128lat.sv...](https://github.com/injinj/smhasher/blob/master/plot128lat.svg)

The latency test uses the hash result after each call; the original smhasher
algorithm does not use the result.

~~~
BeeOnRope
For each size shown on the x-axis, is every hashing call using exactly that
size?

It would be interesting to also have a test where an average size was chosen,
but where the actual sizes varied randomly around that average, because that's
a common case in real-world code, and the results can be very different since
branch prediction will no longer be perfect, as it is with the fixed-size
tests.

Also, latency usually means feeding the output of the function back into the
input of the next call. Throughput is what you get when you don't do that, but
you still need to _consume_ the result so it's not optimized away. You can use
something like DoNotOptimize here:

[https://github.com/google/benchmark/blob/master/include/benc...](https://github.com/google/benchmark/blob/master/include/benchmark/benchmark.h#L309)
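
In plain C, roughly the same effect can be approximated with an empty asm
statement that consumes the value (a sketch; `hash`, `buf`, `len`, `seed` and
`repeat` are placeholders, not real smhasher names):

    
    
        uint64_t h;
        for (int i = 0; i < repeat; i++) {
            h = hash(buf, len, seed);            /* hypothetical hash call */
            __asm__ __volatile__("" : : "r"(h)); /* result must be materialized,
                                                    but no feedback into the
                                                    next call */
        }
    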

~~~
injinj
Yes, it is the same size for every call. I agree, a more complex test like the
one in the original article is better (the Throughput versus Latency section).

The graphs from the "latency" test are doing this:

    
    
      for (;;) {
        h = hash( val, seed );
        seed += h;
      }
    

The original smhasher did this:

    
    
      for (;;) {
        hash( val, seed );
      }

~~~
Cyan4973
Actually, making the next hash dependent on the previous hash through the
`seed` doesn't work for benchmarking latency.

Several hashes only use the seed at the very end of the calculation. Among
them: CityHash, FarmHash, MeowHash, probably a few more. This makes it
possible for them to start calculating the next hash before the end of the
previous one, so it's no longer "latency" for them, and the test conditions
become uneven.

In reality, in a latency scenario, the hash function is waiting for its input.
So it's the input which must depend on the previous hash result. This way, all
algorithms must wait for the previous result, with no dependency left on how a
specific algorithm handles the seed.

~~~
injinj
Good point.

I created a graph with these attributes:

    
    
      0: total += hash               (XXH364)
      1: seed += hash                (XXH364Seed)
      2: seed += hash, val += hash   (XXH364SH)
    

[https://gist.github.com/injinj/138543ccc6a23ceb1fcdc05f46288...](https://gist.github.com/injinj/138543ccc6a23ceb1fcdc05f4628858e)

There are weird bumps at some key sizes with City and XXH when the key is
updated with the previous hash. My guess is there is some underlying cache
line latency added when the key is written to.

smhasher should really have a test for good seeding. As I understand it, bad
City seeding was the reason people started using SipHash.

~~~
Cyan4973
I'm afraid I don't follow. Is there any source code that could be read?

The way to force the hash algorithm to actually wait for its input is to use
the previous hash to determine the start position of the next hash's input.
The jump doesn't have to be large. You can get good results with just a few
hundred bytes of variance.
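
A minimal sketch of that idea (hypothetical `hash()`, `buffer`, `len`, `seed`
and `repeat`, not actual smhasher code):

    
    
        uint64_t h = 0;
        for (int i = 0; i < repeat; i++) {
            /* the start position depends on the previous hash, so the next
               call cannot begin before the previous result is known */
            const uint8_t* p = buffer + (h & 255);
            h = hash(p, len, seed);
        }
    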

~~~
injinj
This is notable, because this is what rurban/smhasher does in the small key
test, which many are using nowadays.

In SpeedTest.cpp, there is a function called timehash():

    
    
      uint64_t *aligned_result = (uint64_t *) aligned_alloc( 64, 256 / 8 );
      uint64_t *aligned_buf = (uint64_t *) aligned_alloc( 64, 256 );
      memcpy( aligned_buf, key, len );
    
          begin = rdtsc();
          for (int i = 0; i < repeat; i++) {
            hash(aligned_buf,len,seed,aligned_result);
            seed += aligned_result[0];            /* seed depends on prior hash */
            aligned_buf[0] += aligned_result[0];  /* key bytes depend on prior hash */
            aligned_buf[1] += aligned_result[1];
          }
          end = rdtsc();
          end = rdtsc();
    

Take away the `aligned_buf[] += aligned_result[]` lines, and the additional
latency goes away.

When I get a chance, I'll try your method of randomizing input.

------
dragontamer
[https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxh...](https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxhash.cpp)

The core of the function is as follows:

    
    
        static U64 XXH64_round(U64 acc, U64 input)
         {
          acc += input * PRIME64_2;
          acc = XXH_rotl64(acc, 31);
          acc *= PRIME64_1;
          return acc;
         }
    

Multiplication, rotation, and a second multiplication. These are 64-bit
multiplications, which are well known to mix bits well, but CPUs don't have
many 64-bit multipliers. Four rounds execute in parallel (see
XXH64_update_endian), which can hide the long latency (~5 cycles) associated
with a 64-bit multiply.

My own experiments show that multiplication is a great function for hashing.
Multiplication, however, only spreads the bits in "one direction" (towards the
most-significant bit), and spreads bits towards the LSB very poorly.

Rotations, as well as a second multiplication, are helpful at spreading the
bits around better.
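
A toy demonstration of that one-directional spread (standalone C, not xxHash
code; the constant happens to be XXH64's PRIME64_1, the input is arbitrary):

    
    
        #include <stdint.h>
        
        int main(void)
        {
            uint64_t m = 0x9E3779B185EBCA87ULL;  /* PRIME64_1 */
            uint64_t x = 0x0123456789ABCDEFULL;  /* arbitrary input */
            /* flipping the lowest input bit changes many product bits... */
            uint64_t d_lo = (x * m) ^ ((x ^ 1ULL) * m);
            /* ...but flipping the highest input bit can only change bit 63 */
            uint64_t d_hi = (x * m) ^ ((x ^ (1ULL << 63)) * m);
            return (d_hi == (1ULL << 63) && d_lo != 0) ? 0 : 1;
        }
    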

----------

The 64-bit version is somewhat interesting to me, but the 32-bit version
probably has more potential for optimization. 32-bit vectorized multiplication
exists in AVX2, so the 32-bit version probably can be vectorized. I'd expect
the 32-bit version (if vectorized) would be faster, but the 64-bit version
probably mixes the bits around better.

Overall it looks like a decent design. It uses ILP with 4x state variables
(state->v1, state->v2, etc.) to allow CPUs to process these multiplications in
parallel. Multiply + rotation + multiplication is good from my personal
experiments, but the constants need to be chosen carefully.
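
For illustration, the 4-accumulator pattern looks roughly like this (the seed
setup matches XXH64's initialization; the loop is condensed, so treat it as a
sketch rather than the actual xxHash source):

    
    
        U64 v1 = seed + PRIME64_1 + PRIME64_2;
        U64 v2 = seed + PRIME64_2;
        U64 v3 = seed;
        U64 v4 = seed - PRIME64_1;
        const U64* p = (const U64*)input;
        for (size_t i = 0; i + 4 <= nwords; i += 4) {
            /* no data dependency between the four rounds, so the ~5-cycle
               multiplies overlap in the pipeline */
            v1 = XXH64_round(v1, p[i + 0]);
            v2 = XXH64_round(v2, p[i + 1]);
            v3 = XXH64_round(v3, p[i + 2]);
            v4 = XXH64_round(v4, p[i + 3]);
        }
    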

Prime numbers are probably not necessary: they just need to be odd numbers to
ensure invertibility (Ex: I've achieved decent mixing by multiplying with
0xAAAAAAAA5, which isn't prime; it's divisible by 5). I'm curious how the
prime constants were selected. There could be some potential for "better
numbers" for mixing, depending on how PRIME64_1 and PRIME64_2 were chosen.

Overall it checks all of my boxes with regard to good design. Nice! I'm just
curious about some details.

----------

I'll note that in my experiments, multiplication seems to "mix" better with
XOR than with ADD. I haven't done any tests with this hash yet, but I'm
curious what effect changing the "+=" into "^=" would have on the statistical
strength of the hash function.
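
For concreteness, the (untested) variant would be:

    
    
        static U64 XXH64_round_xor(U64 acc, U64 input)
        {
            acc ^= input * PRIME64_2;   /* XOR instead of ADD */
            acc  = XXH_rotl64(acc, 31);
            acc *= PRIME64_1;
            return acc;
        }
    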

~~~
terrelln
I believe you are looking at XXH64 and XXH32, which are the old versions. The
new version, XXH3, is located in
[https://github.com/Cyan4973/xxHash/blob/dev/xxh3.h](https://github.com/Cyan4973/xxHash/blob/dev/xxh3.h)
which exposes the public prototypes `XXH3_64bits()` and `XXH3_128bits()` (and
variants with seeds).

~~~
dragontamer
Thanks. This code is IMO far easier to read, despite the intrinsics. It just
requires some familiarity with shuffle. Anyone who wants to follow along
should use:
[https://software.intel.com/sites/landingpage/IntrinsicsGuide...](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_shuffle_epi32&expand=5113)

    
    
            assert((((size_t)acc) & 31) == 0);  /* parenthesized: & binds looser than == */
            {   ALIGN(32) __m256i* const xacc  =       (__m256i *) acc;
                const     __m256i* const xdata = (const __m256i *) data;
                const     __m256i* const xkey  = (const __m256i *) key;
        
                size_t i;
                for (i=0; i < STRIPE_LEN/sizeof(__m256i); i++) {
                    __m256i const d   = _mm256_loadu_si256 (xdata+i);
                    __m256i const k   = _mm256_loadu_si256 (xkey+i);
                    __m256i const dk  = _mm256_add_epi32 (d,k);                                  /* uint32 dk[8]  = {d0+k0, d1+k1, d2+k2, d3+k3, ...} */
                    __m256i const res = _mm256_mul_epu32 (dk, _mm256_shuffle_epi32 (dk, 0x31));  /* uint64 res[4] = {dk0*dk1, dk2*dk3, ...} */
                    __m256i const add = _mm256_add_epi64(d, xacc[i]);
                    xacc[i]  = _mm256_add_epi64(res, add);
                }
            }
    

This is followed up by "ScrambleAcc":

    
    
            assert((((size_t)acc) & 31) == 0);  /* parenthesized: & binds looser than == */
            {   ALIGN(32) __m256i* const xacc = (__m256i*) acc;
                const     __m256i* const xkey  = (const __m256i *) key;
        
                size_t i;
                for (i=0; i < STRIPE_LEN/sizeof(__m256i); i++) {
                    __m256i data = xacc[i];
                    __m256i const shifted = _mm256_srli_epi64(data, 47);
                    data = _mm256_xor_si256(data, shifted);
        
                    {   __m256i const k   = _mm256_loadu_si256 (xkey+i);
                        __m256i const dk  = _mm256_mul_epu32 (data,k);          /* U64 dk[4]  = {d0*k0, d2*k2, d4*k4, d6*k6} */
        
                        __m256i const d2  = _mm256_shuffle_epi32 (data,0x31);
                        __m256i const k2  = _mm256_shuffle_epi32 (k,0x31);
                        __m256i const dk2 = _mm256_mul_epu32 (d2,k2);           /* U64 dk2[4] = {d1*k1, d3*k3, d5*k5, d7*k7} */
        
                        xacc[i]  = _mm256_xor_si256(dk, dk2);
                }   }
            }
    

I don't believe an AVX2 rotation exists (correct me if I'm wrong though). So
the author has opted for _mm256_srli_epi64 (..., 47), a 47-bit right shift,
followed by an XOR. This is how the author gets those "highly mixed" high bits
back down to the lower bits.
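
For reference, a 64-bit rotate can be emulated in AVX2 with two shifts and an
OR; a sketch for the rotate-by-31 used in XXH64_round above:

    
    
        static inline __m256i mm256_rotl64_31(__m256i v)
        {
            /* AVX2 has no 64-bit rotate instruction; combine both halves */
            return _mm256_or_si256(_mm256_slli_epi64(v, 31),
                                   _mm256_srli_epi64(v, 33));
        }
    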

Vectorization has broadened to 512-bits of state (4x128) so that the
vectorized steps can be ILP'd.

The keys are:

    
    
        ALIGN(64) static const U32 kKey[KEYSET_DEFAULT_SIZE] = {
            0xb8fe6c39,0x23a44bbe,0x7c01812c,0xf721ad1c,
            0xded46de9,0x839097db,0x7240a4a4,0xb7b3671f,
            0xcb79e64e,0xccc0e578,0x825ad07d,0xccff7221,
            0xb8084674,0xf743248e,0xe03590e6,0x813a264c,
            0x3c2852bb,0x91c300cb,0x88d0658b,0x1b532ea3,
            0x71644897,0xa20df94e,0x3819ef46,0xa9deacd8,
            0xa8fa763f,0xe39c343f,0xf9dcbbc7,0xc70b4f1d,
            0x8a51e04b,0xcdb45931,0xc89f7ec9,0xd9787364,
            0xeac5ac83,0x34d3ebc3,0xc581a0ff,0xfa1363eb,
            0x170ddd51,0xb7f0da49,0xd3165526,0x29d4689e,
            0x2b16be58,0x7d47a1fc,0x8ff8b8d1,0x7ad031ce,
            0x45cb3a8f,0x95160428,0xafd7fbca,0xbb4b407e,
        };
    

Hmmm... this key is added to the data as it gets mixed in.

---------

It definitely seems like cleaner code, actually. Again, I'm not seeing any
obvious red flags in the code, so it looks pretty good to me.

The only thing is that the 32-bit to 64-bit multiply step seems odd to me. It
probably isn't losing any information, but I'm kind of mind-farting and can't
see whether that step is potentially losing entropy... I'll probably have to
sleep on it.

Specifically this step:

    
    
        __m256i const res = _mm256_mul_epu32 (dk, _mm256_shuffle_epi32 (dk, 0x31));  /* uint64 res[4] = {dk0*dk1, dk2*dk3, ...} */
    

Those numbers aren't necessarily odd. It's a 32-bit multiply that takes two
32-bit numbers and outputs a 64-bit number (vectorized). It doesn't seem like
a "strong" hash, but maybe I need to think about it more. If this is a good
hash (and with all the tests done on it... it probably is a good hash), then
I'd be surprised.

It reminds me of an RNG that Knuth discussed in his book, where you multiply
two numbers and then take the "middle" of the result
([https://en.wikipedia.org/wiki/Middle-square_method](https://en.wikipedia.org/wiki/Middle-square_method)).
So I feel like there's potential for weakness here. I can't see it one way or
the other yet; it's just something I'm trying to think about...

~~~
Cyan4973
According to the UMAC paper, the 32x32=>64 multiplication only contains 32
bits of entropy, even though it uses a 64-bit space. That's understandable:
most of the entropy will be in the middle of the register.

That's enough for XXH3. Since it maintains a 512-bit internal state, that
means it transports 256 bits of entropy.

~~~
dragontamer
But there's 64-bits of input (32-bit A x 32-bit B). So if you have 64-bits of
input, but only result in 32-bits of output entropy, then you've lost
information.

I realize you have to compress data down in a hash function somehow, but
ideally you want to minimize the loss of entropy / information from the input
bits. The ideal mixing function would have 512 bits of internal state, with
512 bits of entropy starting off... and ending with 512 bits of entropy once
all the mixing is done.

If your factoid is correct, then we're starting with 512 bits of entropy, but
ending with only 256 bits of entropy after the multiply.

> That's enough for XXH3. Since it maintains a 512-bit internal state, that
> means it transports 256 bits of entropy.

Why not optimize the function and aim for only 256 bits of internal state
(with 256 bits of entropy)?

See: you can cut down on state and possibly improve performance. Maybe not on
Intel Skylake, but probably on ARM Neon / AMD Zen (which have 128-bit SIMD
registers internally). Hmmmm... 512-bit (aka 2x AVX2 registers) is probably
needed to get good ILP on Intel processors. So there probably wouldn't be much
improvement for Intel.

-----------

Again though: I'm not sure if it's losing information yet. It's just something
I'm thinking about... (32-bit XOR would be 32-bit + 32-bit input, with only 32
bits of entropy output. But unlike multiply, you only have 32 bits of state.)

32-bit multiplication with a constant, keeping only the bottom 32 bits, keeps
all 32 bits of entropy without being forced to expand to 64 bits. I definitely
like the bit-mixing properties of multiply; it's just difficult to find
configurations of multiplication that save every bit of entropy.

~~~
Cyan4973
It's not exactly "losing information". We are not trying to regenerate the
original data, just to make sure that all source bits can fairly influence the
result. It's more a question of bit contribution.

In the resulting 64-bit register, bit 0 can only be influenced by bit 0 of
each 32-bit input. So it's only representative of these 2 bits. Same at the
other end: bit 63 depends mostly on bit 31 of each 32-bit input, and also on
carries from previous bits (making things more complex).

In the middle, many more bits participate, so that's where they are "well
mixed".

This question of bit distribution would become critical if the 64-bit
accumulator were used "as is" to produce a hash, but fortunately it's not.
Accumulators get mixed before that stage. When the mixing function is done
correctly, every bit gets redistributed to the point of saturation. After that
point, it does not matter that one accumulator bit initially had fewer
contributions than another: all source bits contribute fully. This can easily
be verified with SMHasher's Avalanche test.

Finally, XXH3 does not follow UMAC too closely, and adds an operation which
ensures that all bits are necessarily present at least once in the
accumulator. This compensates for the risk of multiplying by zero, which _is_
dangerous for checksumming, as it would nullify a contribution.

~~~
dragontamer
> It's not exactly "losing information". We are not trying to regenerate the
> original data, just to make sure that all source bits can fairly influence
> the result. It's more a question of bit contribution.

I agree. But entropy is a great concept that helps us calculate whether
"input bits" can affect "output bits".

Let's look at XXH3_scrambleAcc:

    
    
        __m256i const k   = _mm256_loadu_si256 (xkey+i);
        __m256i const dk  = _mm256_mul_epu32 (data,k);
        __m256i const d2  = _mm256_shuffle_epi32 (data,0x31);
        __m256i const k2  = _mm256_shuffle_epi32 (k,0x31);
        __m256i const dk2 = _mm256_mul_epu32 (d2,k2);
        xacc[i] = _mm256_xor_si256(dk, dk2);
    

Assume you start with 512-bits of input entropy (that is: assume that all
512-bits of state are uniformly random). Will you have 512-bits of uniformly
random output after the above operations? Or to put it in your words: are you
100% sure that the above steps allow every input bit to affect every output
bit?

It appears not. My counter-example (not that I've debugged your code... but
based on my reading) is as follows: key[2] is 0x7c01812c, and key[3] is
0xf721ad1c. This means that key[2] AND key[3] are even numbers.

Which means dk.int64[1] bit 0 will always be zero. And dk2.int64[1] bit 0 will
ALSO always be zero (on iteration i=0).

I bet you, at least... if I did my math correctly... that
xacc[0].int64[1].bit0 will ALWAYS be zero, no matter what state you start with
(bit#64 of xacc[0] == 0)

This is because xacc[0].int64[1].bit0 = dk.int64[1].bit0 XOR dk2.int64[1].bit0

And both of those component parts seem to always be zero. Which means you're
at LEAST wasting the 64th bit of state each time you perform XXH3_scrambleAcc.
It certainly seems like a weakness of the XXH3_scrambleAcc function to me.

That's a lot of operations, just to throw away half your bits. There's
probably an optimization you can do here to get better mixing.
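
A quick scalar sanity check of the even-key claim (standalone C; the data
values are arbitrary stand-ins for the two 32-bit lanes):

    
    
        #include <assert.h>
        #include <stdint.h>
        
        int main(void)
        {
            uint32_t k2 = 0x7c01812c, k3 = 0xf721ad1c;  /* both even */
            for (uint32_t d = 0; d < 100000000u; d += 97) {
                uint64_t dk  = (uint64_t)d * k2;             /* even * anything is even */
                uint64_t dk2 = (uint64_t)(d ^ 0x9e3779b9u) * k3;
                assert(((dk ^ dk2) & 1) == 0);               /* bit 0 is always zero */
            }
            return 0;
        }
    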

~~~
Cyan4973
Well, if you believe a better scrambling operation is possible, you are
certainly welcome to suggest one. Considering feedback on the algorithm is one
of the objectives of the test phase. If the issue is with the default keys,
it's also possible to change them, though one will also have to consider what
happens with custom keys and whether it implies ensuring some conditions
(custom keys are a long-term objective of XXH3). In the end, XXH3 only
generates 64- and 128-bit hashes, and maybe 256 in the future, so compression
necessarily happens somewhere. I believe the issue would be more pressing if
the intention was to generate a 512-bit hash, but that has never been an
objective.

~~~
dragontamer
> Well, if you believe a better scrambling operation is possible, you are
> certainly welcome to suggest one.

Oh, criticism of the algorithm is the easy part, especially when I'm being
vague and can't think of solid proof of my suppositions. Coming up with a
better scrambling operation is the hard part. :-)

As I stated earlier, there are no obvious "red flags" in your algorithm. It's
just this obscure part of the function that makes me think optimization
potential exists.

Since you only have 256 bits of entropy in your 512-bit accumulator, my goal,
if I were to go through your code, would be to cut the state down to 256 bits
while keeping 256 bits of entropy.

1. All multiplication keys should be odd (bottom bit set to 1). This ensures
that all multiplication steps that only keep the bottom 32 bits will keep all
of their entropy.

2. Precisely throw away the top 32 bits of your multiplications, and only keep
the bottom 32 bits. This will be slower, but it ensures full entropy when
combined with odd numbers for the key.

3. Now that we have full entropy on the multiplication steps, cut the 512-bit
accumulator down to 256 bits, which should improve performance on AMD Zen /
ARM NEON. It may help Intel Skylake, but Skylake is so wide that it's hard to
get ILP.

--------------------

Without testing the code, here's an idea.

    
    
        __m256i const k   = _mm256_loadu_si256 (xkey+i);
        __m256i const dk  = _mm256_mul_epu32 (data,k);
        __m256i const d2  = _mm256_shuffle_epi32 (data,0x31);
        __m256i const k2  = _mm256_shuffle_epi32 (k,0x31); // <---- This line can be precomputed by the way...
        __m256i const dk2 = _mm256_mul_epu32 (d2,k2);
        // Above this line is the same code as before
        dk2= _mm256_shuffle_epi32 (dk2,0xc4);
        xacc[i] = _mm256_blend_epi32(dk, dk2, 0xaa);
    

As long as the keys are all odd (bottom bit set to 1), the above should mix
the bits without losing any entropy. So you can now have a 256-bit state.

----------

The accumulate portion would be simply:

    
    
        __m256i const add = _mm256_xor_si256(d, xacc);  /* AVX2's XOR is _mm256_xor_si256;
                                                           _mm256_xor_epi64 needs AVX-512 */
        xacc = _mm256_xor_si256(res, add);
    

XOR (or add) captures all the entropy from the input. In my experiments, XOR
works better with multiply (I don't know why; it's just something I've
noticed). But otherwise, it's conceptually similar to your original "add"
code.

----------

I mean, I'm just shooting from the hip here. I don't know how well the above
code mixes things around. But those are kind of the steps that I would take to
move forward.

----------

Although, if you want me to be 100% honest, anything I'd do "for real" would
revolve around _mm_aesenc_si128. That'd be

    
    
        xacc[0] = _mm_aesenc_si128(xacc[0], data[0]);
        xacc[1] = _mm_aesdec_si128(xacc[1], data[1]);
        
        // Finalization:
        __m128i finalizationA = _mm_aesenc_si128(xacc[0], const1);
        finalizationA = _mm_aesenc_si128(finalizationA , const2);
        finalizationA = _mm_aesenc_si128(finalizationA , const3);
        finalizationA = _mm_aesenc_si128(finalizationA , const4);
       
        __m128i finalizationB = _mm_aesdec_si128(xacc[1], const1);
        finalizationB = _mm_aesdec_si128(finalizationB , const2);
        finalizationB = _mm_aesdec_si128(finalizationB , const3);
        finalizationB = _mm_aesdec_si128(finalizationB , const4);
    
        __m128i finalHash = _mm_xor_si128(finalizationA, finalizationB);
    
        // Return bottom 32, 64, or 128-bits of finalHash
     
        // 4 rounds of finalization should be enough. 
        // Testing required to find the minimum for an avalanche condition. It will be more than 2, but probably less than 4...
    

And... the end. That's it. It only uses the 128-bit functionality of the
execution units, but still gets ILP off of two parallel instances. So 256 bits
of state. It's probably a fast function, but I can't be bothered to test it
right now. I'm curious whether the simpler functionality above would outspeed
the 256-bit execution units on Skylake, but I don't have a Skylake box to test
on (just AMD Zen).

With AVX512, _mm512_aesenc_epi128 makes it possible to do this over 512 bits
(4x 128-bit parallel instances of AES encryption). But I don't have a
Skylake-X machine to test this idea on.

See here for my experiment on AES functionality:
[https://github.com/dragontamer/AESRand](https://github.com/dragontamer/AESRand)

~~~
Cyan4973
Thanks for the suggestions @dragontamer. We are genuine when saying the
algorithm is open to suggestions, and it can still change to improve its
properties. Let's review yours:

> my goal, if I were to go through your code, would be to cut the state down
> to 256 bits while keeping 256 bits of entropy.

The point is, it's a lot more difficult to keep the registers "fully loaded"
at maximum entropy _at every step_. By going double size, and accepting that
this entropy is leniently dispersed into the larger register, we make our
lives a lot easier, which translates into appreciable speed gains. To be
detailed below.

> All multiplication keys should be odd (bottom bit set to 1). This ensures
> that all multiplication steps that only keep the bottom 32 bits will keep
> all of their entropy.

One core idea of UMAC, which XXH3 is based on, is that the keys can be any
random numbers (they are supposed to be secret). Forcing them to be odd
reduces the available space by one bit. Not a very big deal, but still. This
could also be ensured by adding an `OR 1` operation when loading the key.

> Precisely throw away the top 32 bits of your multiplications, and only keep
> the bottom 32 bits. This will be slower, but it ensures full entropy when
> combined with odd numbers for the key.

OK, so that's where the notion of "bit contribution" becomes useful. By making
a 32x32=>32 multiplication, and ensuring the multiplier is odd, you have
correctly mixed the lowest bit. But the other bits do not contribute to lower
bits. At the other extreme, the highest bit only contributes to one (the
highest) bit in the resulting product. It's clearly not well mixed.

This can be compensated for with a rotation, or a right shift + add operation,
followed by another spreading (another multiplication), and another right
shift + add. But all this adds quite a few operations, right in the critical
loop, so it's no longer the same speed.
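
For reference, that sequence is essentially the shape of XXH64's final
avalanche (quoted from memory, so treat it as illustrative):

    
    
        static U64 XXH64_avalanche(U64 h)
        {
            h ^= h >> 33;    /* fold high bits back down */
            h *= PRIME64_2;  /* spread upward again      */
            h ^= h >> 29;
            h *= PRIME64_3;
            h ^= h >> 32;    /* final fold               */
            return h;
        }
    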

> This line can be precomputed by the way...

That's a great point! It transforms the shuffle into a load; it's not
completely free but is probably faster. On the other hand, it requires memory
to store the swapped table of keys, which can be troublesome if the size of
the table of keys can be customized. I'll look into it, thanks for the
suggestion!

> XOR (or add) captures all the entropy from the input. In my experiments, XOR
> works better with multiply

There are 2 parts here:

- It's not possible to XOR `d` with `res` (which is what happens transitively
in your proposal). The whole point of adding d is that it avoids cancelling a
contributor, which can happen if there is a multiplication by zero. With XOR,
the same impact can still happen, but it requires a multiplication by 1
instead: `(1*d)^d = 0`. Same problem if `d` is subtracted. But with an add
operation, cancelling `d`'s contribution requires a multiplication by `-1`.
And that is precisely impossible when doing a 32x32=>64 multiplication.
Unfortunately, when doing a 32x32=>32 multiplication, it becomes actually
possible: -1 == 0xFFFFFFFF. So the addition doesn't save the situation, and it
becomes necessary to change the formula.

- XORing `res` with `acc` seems more tractable. I tried it in the early days
of XXH3, but unfortunately it proved worse in terms of hash quality. At this
stage, I'm not too sure why. Maybe later changes indirectly solved an
underlying issue, so it might be worth trying again to see if it proves any
better.

> to be 100% honest, anything I'd do "for real" would revolve around
> _mm_aesenc_si128

I fully agree. Leveraging dedicated hardware capabilities is most likely
efficient, and AES does a great job of mixing bits. This is more difficult to
emulate with simpler instructions. There are only 2 minor issues to be aware
of:

- It relies on the presence of a hardware AES module. While not an issue when
the target platform basically guarantees its presence, it's a non-trivial
problem when targeting broader portability. Platforms without AES, or without
access to it (yes, even on Intel, some "systems" can't access the AES
instructions; think `WASM` or kernel space for example), will pay a hefty
performance price using a software fallback. It's not necessarily a killer
issue, just something to be aware of. xxHash targets very broad portability.
This is a "handicap", but one with its own benefits.

- AES instructions have excellent throughput, but their latency is
non-negligible. This is especially true on pre-Skylake CPUs. Latency is hard
to measure, so it's frequently forgotten in benchmarks. In my own tests (on a
latest-generation Intel CPU, so very favorable to AES), using AES instructions
lands in the 80M/s region when measuring latency, which is not bad. Compare
with XXH3, which lands in the 110-150M/s region.

Don't read me wrong, using AES is likely a good choice. The only reason XXH3
doesn't use it is that it targets very broad portability, including targets
without a hardware AES module.

~~~
dragontamer
> OK, so that's where the notion of "bit contribution" becomes useful. By
> making a 32x32=>32 multiplication, and ensuring the multiplier is odd, you
> have correctly mixed the lowest bit. But the other bits do not contribute to
> lower bits. At the other extreme, the highest bit only contributes to one
> (the highest) bit in the resulting product. It's clearly not well mixed.

True, but this is also true for the top 16 bits and bottom 16 bits of a 64-bit
multiply. It's a problem innate to multiplication: the "middle" bit (bit#31)
will be mixed best, while the "edge" bits (bit#0 or bit#63) will be mixed
poorly.

> This can be compensated for with a rotation, or a right shift + add
> operation, followed by another spreading (another multiplication), and
> another right shift + add. But all this adds quite a few operations, right
> in the critical loop, so it's no longer the same speed.

There are quite a few other ways to compensate which remain efficient. vpshufb
has 1-cycle latency and once-per-clock throughput. If this were ARM, an NVidia
GPU, or an AMD GPU, bit-reversal would work (RBIT on ARM).

But since this is Intel, the fastest way to spread bits around is to do
shuffle(state, [0,1,2,3]) (my shorthand for _mm256_shuffle_epi32 (k, 0x1b)).

In general, you just need to map the "strongly mixed" bits (bit#31) to the
locations which will potentially affect the most bits (bit#0 in the next
multiplication). Bit-reversal is best, but almost any vpshufb should do the
trick.

> The whole point of adding d is that it avoids cancelling a contributor,
> which can happen if there is a multiplication by zero

"Cancelling a contributor" is bad, but I strongly disagree with your
characterization of it!

"Cancellation of a contributor" only happens when two inputs map to the same
output. By the pigeonhole principle, something MUST map to zero. Otherwise,
the "zero" output is wasted.

You've identified the singular inputs that result in zero in various ways in
32-bit -> 32-bit multiplication. That's a good thing!! There is ONLY ONE input
that maps to zero in my proposed multiplication.

Again: it's not so important that "nothing maps to zero". It's far more
important that "exactly one thing maps to zero".

----------

In your algorithm: if d[0] and d[1] are zero, you get a zero as a result
(d[0] * k[0] + d[1] * k[1] == dk[0] == 0 when d[0] == d[1] == 0).

But... (d[0] * k[0] + d[1] * k[1] == 0) defines a whole slew of "potential
mappings to zero". It's non-trivial to analyze this, especially with overflow
(the "zero" on the right-hand side is 64-bit unsigned with overflow).

In a 16-bit case, 0xffff * 0xfffc (d[0] == 0xffff, k[0] == 0xfffc) will be
"inverted" by d[1] == 163838, k[1] == 2: the two products sum to exactly 2^32,
which overflows to zero.

So in this case we have exactly the collision the pigeonhole argument warns
about: d[0] == d[1] == 0 maps to zero, AND d[0] == 0xffff, k[0] == 0xfffc,
d[1] == 163838, k[1] == 2 maps to zero as well.

That's how you "lose bits": multiple inputs potentially mapping to the same
output (in this case zero, but in general... any output value).

With regards to multiplication by an odd number, someone already proved its
reversible: [https://lemire.me/blog/2017/09/18/computing-the-inverse-
of-o...](https://lemire.me/blog/2017/09/18/computing-the-inverse-of-odd-
integers/)

Which means there is exactly one input that maps to exactly one output for all
odd-multipliers.
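
That inverse can be computed with a short Newton iteration, where each step
doubles the number of correct low bits (a sketch; for odd `a`, `a*a == 1 mod
8`, so `x = a` starts correct to 3 bits):

    
    
        static uint32_t inverse32(uint32_t a)  /* a must be odd */
        {
            uint32_t x = a;     /* correct to 3 bits  */
            x *= 2 - a * x;     /* correct to 6 bits  */
            x *= 2 - a * x;     /* correct to 12 bits */
            x *= 2 - a * x;     /* correct to 24 bits */
            x *= 2 - a * x;     /* 48 > 32 bits: a * x == 1 (mod 2^32) */
            return x;
        }
    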

> Don't read me wrong, using AES is likely a good choice. The only reason XXH3
> doesn't use it is that it targets very broad portability, including targets
> without a hardware AES module.

Hmmmm... in any case, a 32-bit multiply is certainly more portable. I think
the throughput / latency issue can easily be solved by increasing the state:
5x128 bits of state to unroll the inner loop should be sufficient (5 cycles of
latency at once-per-cycle throughput, so a 640-bit state should be best).

In any case, you have a point about portability. Note, though, that AES
instructions are implemented on ARM, POWER9, and Intel machines now. So it's
surprisingly portable, even if systems as a whole (e.g. WASM or the Java
Virtual Machine) don't necessarily support it yet.

------
adrianratnapala
So it took me a while to find the actual algorithm. As far as I can tell it is
here:

[https://github.com/Cyan4973/xxHash/blob/dev/xxh3.h](https://github.com/Cyan4973/xxHash/blob/dev/xxh3.h)

Is there a description of the algorithm? The C source is not only highly
optimised, it is also heavily #ifdef'ed to take care of various scenarios. So
the file is not optimised for readability.

~~~
terrelln
I believe that this is just the initial release for testing, and the author
plans to clean up the header file to separate out the interface from the
implementation, and add documentation.

------
terrelln
I've tested XXH3 using xxhash's built-in benchmark tool with clang-7.0.1 and
gcc-8.2.1 on an Intel i9-9900K. The processor was otherwise idle, and was
running at 5 GHz. The command I tested is `xxhsum -b5i10`.

    
    
        SSE2: CFLAGS=-O3
        AVX2: CFLAGS="-O3 -mavx2"
        ARCH: CFLAGS="-O3 -march=native"
    
        Compiler  Mode  Speed
        gcc-8     SSE2  32.8 GB/s
        clang-7   SSE2  36.5 GB/s 
        gcc-8     AVX2  44.1 GB/s
        clang-7   AVX2  68.3 GB/s
        gcc-8     ARCH  60.9 GB/s
        clang-7   ARCH  69.7 GB/s
    

gcc nearly catches up with clang when compiled with -march=native, but with
only -mavx2 it isn't performing as well.

~~~
felixhandte
That is pretty dang fast!

------
joemanaco
Here is a comparison with meowhash:
[https://twitter.com/mmozeiko/status/1106714583761215488](https://twitter.com/mmozeiko/status/1106714583761215488)

Meow is faster than xxh3 for all sizes, small or large.

~~~
injinj
Meow is a freak of nature; it has a constant overhead for all of the small
sizes.

I added Meow to my test graphs:
[https://github.com/injinj/smhasher](https://github.com/injinj/smhasher)

I still find xxh3 to be slightly faster for small hashes on my machine
(i9-7960X), except for 128-bit throughput at sizes > 32 bytes.

~~~
injinj
Hmm, Meow fails the LongNeighbors test. More information on this test here:
[https://github.com/hmakholm/smhasher/blob/master/src/LongNei...](https://github.com/hmakholm/smhasher/blob/master/src/LongNeighborTest.md)

    
    
      [[[ Keyset 'TestLongNeighbors' Tests ]]]
    
      Looking for 2-bit collisions in the last 2400 bits.
      Looking for 4-bit collisions in the last 2048 bits.
      Looking for 6-bit collisions in the last 160 bits.
      Trying bases of length 10 to 300, 5 of each length.
      ......[16]...............[32]............... [48]...............[64].
    
      Among 285 tested bases, we expected 4.25749e-06 bads. 
      Actually there were 5.
      The most striking collision has surprise score 3.52004e+06 and length 46 bytes:
      000: BD 8D 12 8C FB EE 75 2E 17 80 4E 26 8C E7 75 94 DA EF 12 8C 95 CA 91 0D
      018: 08 BD 8D 5F F2 78 FF 64 88 27 88 3F 8E 8F CE FC E2 8A 99 CE 2B E3
                      ^D2^                 ^7F^              ^18^     ^0B^
      The hashes are length 8 bytes:
      000: 35 C1 E6 5B 6B 66 C7 28
      *********FAIL*********

~~~
injinj
I can make Meow pass the LongNeighbors test by doubling the number of AES
rounds on each pass.

The original Meow algo:

    
    
        while(Len >= 64)
        {
            S0 = Meow128_AESDEC_Mem(S0, Source);
            S1 = Meow128_AESDEC_Mem(S1, Source + 16);
            S2 = Meow128_AESDEC_Mem(S2, Source + 32);
            S3 = Meow128_AESDEC_Mem(S3, Source + 48);
    
            Len -= 64;
            Source += 64;
        }
    

The modified algo:

    
    
        while(Len >= 64)
        {
            S0 = Meow128_AESDEC_Mem(S0, Source);
            S0 = Meow128_AESDEC_Mem(S0, Source);
            S1 = Meow128_AESDEC_Mem(S1, Source + 16);
            S1 = Meow128_AESDEC_Mem(S1, Source + 16);
            S2 = Meow128_AESDEC_Mem(S2, Source + 32);
            S2 = Meow128_AESDEC_Mem(S2, Source + 32);
            S3 = Meow128_AESDEC_Mem(S3, Source + 48);
            S3 = Meow128_AESDEC_Mem(S3, Source + 48);
    
            Len -= 64;
            Source += 64;
        }
    

(also changed the partial rounds, but this is not shown here)

This doesn't change its performance that much, maybe 1 or 2 extra CPU cycles.
My i9 CPU must be able to run these extra rounds in parallel... I do have
hyperthreading disabled, if that matters.

Update:

Ah, it does cause the bulk speed test to drop to 50% of the original (48 MiB
vs 24 MiB). The small key test only adds 1/2/3/4 extra AES instructions, since
those paths manipulate 16 bytes at a time. It looks like this only takes a
cycle.

------
rurban
I've already looked at it last week, and it looked very good indeed (minus
some debugging code Yann ripped out today). It should still be the best hash
function for large data digests; the previous version was already the best for
this use case. What's interesting is the new behavior for smaller keys (this
was the weak point in xxhash), and the extended tests.

------
nerpderp83
I would be interested in the AVX2 transition penalty under real-world
workloads.

------
kevin_thibedeau
Seems a bit funny that CRC32 is so slow.

~~~
corysama
I expect that's because it was designed very well for the days of serial,
in-order, static execution. But CPUs today are all about pipelining and
parallelism. Herb Sutter's "Free Lunch" has been over for longer than most
people here have been writing software.
[http://www.gotw.ca/publications/concurrency-ddj.htm](http://www.gotw.ca/publications/concurrency-ddj.htm)

~~~
mark-r
Sutter's solution was to write more concurrent code, but that's not what helps
in this case. Rather the CPU vendors have been throwing transistors at doing
more math operations in parallel. All you need to do is make sure you're
taking advantage of the vector instructions, whether that's done by the
compiler or with intrinsic calls.

Traditionally the fastest CRC code has used lookup tables; I wonder if that's
creating cache pressure these days? Or maybe that approach was abandoned while
I wasn't paying attention.

~~~
dragontamer
> Rather the CPU vendors have been throwing transistors at doing more math
> operations in parallel. All you need to do is make sure you're taking
> advantage of the vector instructions, whether that's done by the compiler or
> with intrinsic calls.

Instruction-level parallelism is incredibly important. I'd say that any
optimizing programmer needs to fully understand ILP, and how it interacts with
pipelines and dependency cutting (and register renaming).

Modern CPUs are extremely well parallelized with ILP. Any good, modern hash
function will take advantage of this feature of modern CPUs.

Case in point: it seems like xxhash is SCALAR 32-bit / 64-bit code. No
vectorization as far as I can tell; it's purely using ILP to get its speed.

Intel's scalar instruction set has a 64-bit multiplier (the vector units only
have 32-bit multipliers). I've theorized to myself that this 64-bit multiplier
could lead to better mixing than the vectorized instructions, and it seems
like xxhash goes for that.

The 32-bit version of xxhash can likely be vectorized and optimized even
further.

~~~
corysama
[https://moinakg.wordpress.com/2013/01/19/vectorizing-xxhash-for-fun-and-profit/](https://moinakg.wordpress.com/2013/01/19/vectorizing-xxhash-for-fun-and-profit/)

------
aidenn0
Slightly OT, but I recently had the need for a rolling hash and wasn't able to
find anything better than Fletcher. Is there anything of similar speed but
with better characteristics?

------
HocusLocus
Glad to see that part of the human race is still on the track to practical
innovation (collision avoidance, small footprint, performance) as opposed to
the perpetual Moore's Law paranoia arms race (crypto-secure, more bits, more
rounds, constant utilization regardless of input to deter side channel
attack).

If you're the slightest bit security conscious you realize that there is NO
level below which security is not a factor. If that is the deciding factor
these compromises (or as they would tell it, lack of compromise) will decrease
overall UTILITY of technology and also increase its FRAGILITY.

The move to aggressively force HTTP to HTTPS even for Joe's Who Cares Web Site
for no specifically stated reason is a prime example. If rounds and loops were
temperature, it is as if we're shoving the Internet into thermal overload
beyond convection cooling, without performing any additional useful work.

Meltdown mitigation is like economic inflation. A general performance decrease
for a worst case threat scenario.

What price, paranoia? Is there an upper limit?

------
cryptonector
Those images are unreadable. Can we get higher res. images, or larger fonts?
Thanks.

~~~
terrelln
The images are high res, but you have to download them to view the larger
image.

------
mitchtbaum
It's cool to see these hashing algorithms coming out, and the excitement about
building new libraries and adding them to existing ones, but we also need
people gathering them into an organized and useful project space.

------
switch33
[https://github.com/switch33?tab=repositories](https://github.com/switch33?tab=repositories)
I think mine might match up. It's written in lorem ipsum; you can compile it
with UTF-8.

Can someone benchmark it for me? Because SHA with higher values might be
faster.

------
jfk13
Interesting to see this, when just a few days ago we had another "new fast
hash" item on the front page:
[https://news.ycombinator.com/item?id=19357895](https://news.ycombinator.com/item?id=19357895).

Anyone have a TL;DR regarding how these two compare?

~~~
rurban
Working on it tomorrow; this just got in today. I expect xxh3 to beat
everybody on long keys, and because it uses optimized intrinsics it should be
close to the optimized t1ha variants. wyhash is purely portable code.

~~~
rurban
Results are now at smhasher. I had to fix all the MSVC issues first.

wyhash is still the recommended hash function: portable and lowest
cycles/hash.

However, xxh3 is spectacularly fast, for small and especially large key sizes.
This is now the recommended hash for file or db digests.

String hash table stats are not yet done; there, FNV1A is still the
recommendation, combined with safe collision handling.

