
https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxh...

The core of the function is as follows:

    static U64 XXH64_round(U64 acc, U64 input)
    {
        acc += input * PRIME64_2;
        acc  = XXH_rotl64(acc, 31);
        acc *= PRIME64_1;
        return acc;
    }
Multiplication, rotation, and a 2nd multiplication. These are 64-bit multiplications, which are well known to mix bits well, but CPUs don't have many 64-bit multipliers. Four rounds execute in parallel (see XXH64_update_endian), which can hide the long latency (~5 cycles) associated with a 64-bit multiply.
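
To make the ILP point concrete, here is a minimal scalar sketch of how four independent accumulators hide the multiply latency (the names v[0..3] and the 32-byte stripe layout are my own simplification, not the upstream code; the two constants are the published XXH64 primes):

    #include <stdint.h>
    #include <string.h>

    #define PRIME64_1 0x9E3779B185EBCA87ULL
    #define PRIME64_2 0xC2B2AE3D27D4EB4FULL

    static uint64_t rotl64(uint64_t x, int r) { return (x << r) | (x >> (64 - r)); }

    static uint64_t round64(uint64_t acc, uint64_t input)
    {
        acc += input * PRIME64_2;
        acc  = rotl64(acc, 31);
        acc *= PRIME64_1;
        return acc;
    }

    /* One 32-byte stripe feeds four independent accumulators. Because no
       round depends on another lane's result, the CPU can overlap the
       ~5-cycle multiply latencies of all four rounds. */
    static void consume_stripe(uint64_t v[4], const uint8_t *stripe)
    {
        for (int i = 0; i < 4; i++) {
            uint64_t lane;
            memcpy(&lane, stripe + 8 * i, sizeof lane);  /* native-endian read, simplified */
            v[i] = round64(v[i], lane);
        }
    }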

My own experiments show that multiplication is a great function for hashing. Multiplication, however, only spreads the bits in "one direction" (towards the most-significant bit), and spreads bits towards the LSB very poorly.

Rotations, as well as a 2nd multiplication, are helpful at spreading the bits around better.
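
A quick standalone way to see the "one direction" behaviour (my own toy check, not xxHash code): flip a low input bit and a high input bit, and compare which output bits of the product can change.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const uint64_t k = 0xC2B2AE3D27D4EB4FULL;   /* any odd constant */
        const uint64_t x = 0x0123456789ABCDEFULL;

        /* Flipping bit 0 of x can disturb bits all the way up to the MSB of x*k ... */
        uint64_t diff_low  = (x * k) ^ ((x ^ 1ULL) * k);
        /* ... but flipping bit 63 of x can only ever change bit 63 of x*k. */
        uint64_t diff_high = (x * k) ^ ((x ^ (1ULL << 63)) * k);

        printf("flip bit  0: diff = %016llx\n", (unsigned long long)diff_low);
        printf("flip bit 63: diff = %016llx\n", (unsigned long long)diff_high);  /* 8000000000000000 */
        return 0;
    }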

----------

The 64-bit version is somewhat interesting to me, but the 32-bit version probably has more potential for optimization. 32-bit vectorized multiplication exists in AVX2, so the 32-bit version probably can be vectorized. I'd expect the 32-bit version (if vectorized) would be faster, but the 64-bit version probably mixes the bits around better.

Overall looks like a decent design. Uses ILP with 4x state variables (state->v1, state->v2, etc.) to allow CPUs to process these multiplications in parallel. Multiply + rotation + multiplication is good from my personal experiments, but the constants need to be chosen carefully.

Prime numbers are probably not necessary: they just need to be odd numbers to ensure invertibility (Ex: I've achieved decent mixing by multiplying with 0xAAAAAAAA5, which probably isn't a prime number). I'm curious how the prime-constants were selected. There could be some potential for "better numbers" for mixing, depending on how PRIME64_1 and PRIME64_2 were chosen.
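
On the "odd numbers ensure invertibility" point, this is easy to check: every odd 64-bit multiplier has a modular inverse mod 2^64, computable with a few Newton iterations (a standalone sketch; the multiplier below is just an arbitrary odd constant):

    #include <stdint.h>
    #include <stdio.h>

    /* Newton's iteration doubles the number of correct low bits per step,
       so 5 steps cover 64 bits. Only valid for odd a. */
    static uint64_t inverse64(uint64_t a)
    {
        uint64_t x = a;                /* a is its own inverse mod 8 (3 correct bits) */
        for (int i = 0; i < 5; i++)
            x *= 2 - a * x;            /* all arithmetic wraps mod 2^64 */
        return x;
    }

    int main(void)
    {
        const uint64_t a   = 0x9E3779B185EBCA87ULL;   /* arbitrary odd multiplier */
        const uint64_t inv = inverse64(a);
        const uint64_t v   = 0x0123456789ABCDEFULL;
        printf("%016llx\n", (unsigned long long)(v * a * inv));  /* prints v back */
        return 0;
    }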

Overall checks all of my boxes with regards to good design. Nice! I'm just curious about some details.

----------

I'll note that in my experiments, multiplication seems to "mix" better with XOR, rather than with ADD. I haven't done any tests with this hash yet, but I'm curious what effect changing the "+=" into "^=" would have on the statistical strength of the hash function.
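
If anyone wants to try that experiment, the change is a one-character swap in the round above (a sketch only, reusing the U64 / XXH_rotl64 / PRIME64 helpers from the quoted snippet; whether it actually helps would need SMHasher-style testing):

    /* Hypothetical XOR variant of XXH64_round, for experimentation only. */
    static U64 XXH64_round_xor(U64 acc, U64 input)
    {
        acc ^= input * PRIME64_2;   /* '+=' swapped for '^=' */
        acc  = XXH_rotl64(acc, 31);
        acc *= PRIME64_1;
        return acc;
    }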




I believe you are looking at XXH64 and XXH32, which are the old versions. The new version XXH3 is located in https://github.com/Cyan4973/xxHash/blob/dev/xxh3.h which exposes the public prototypes `XXH3_64bits()` and `XXH3_128bits()` (and variants with seeds).


Thanks. This code is IMO far easier to read, despite using intrinsics. It just requires some familiarity with shuffle. Anyone who wants to follow along should use: https://software.intel.com/sites/landingpage/IntrinsicsGuide...

        assert((((size_t)acc) & 31) == 0);
        {   ALIGN(32) __m256i* const xacc  =       (__m256i *) acc;
            const     __m256i* const xdata = (const __m256i *) data;
            const     __m256i* const xkey  = (const __m256i *) key;
    
            size_t i;
            for (i=0; i < STRIPE_LEN/sizeof(__m256i); i++) {
                __m256i const d   = _mm256_loadu_si256 (xdata+i);
                __m256i const k   = _mm256_loadu_si256 (xkey+i);
                __m256i const dk  = _mm256_add_epi32 (d,k);                                  /* uint32 dk[8]  = {d0+k0, d1+k1, d2+k2, d3+k3, ...} */
                __m256i const res = _mm256_mul_epu32 (dk, _mm256_shuffle_epi32 (dk, 0x31));  /* uint64 res[4] = {dk0*dk1, dk2*dk3, ...} */
                __m256i const add = _mm256_add_epi64(d, xacc[i]);
                xacc[i]  = _mm256_add_epi64(res, add);
            }
        }
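For readers who don't speak intrinsics: per 64-bit lane, my reading of that loop boils down to the following scalar sketch (not the upstream code; the 32-bit adds wrap and the 32x32 multiply widens to 64 bits):

    #include <stdint.h>

    /* Per 64-bit lane the AVX2 loop above computes:
         acc += data64 + (uint64_t)(d_lo + k_lo) * (d_hi + k_hi)
       where d_lo/d_hi (k_lo/k_hi) are the low/high 32-bit halves of the
       data (key) word, and the two additions wrap at 32 bits. */
    static uint64_t accumulate_lane(uint64_t acc, uint64_t data64, uint64_t key64)
    {
        uint32_t d_lo = (uint32_t)data64, d_hi = (uint32_t)(data64 >> 32);
        uint32_t k_lo = (uint32_t)key64,  k_hi = (uint32_t)(key64  >> 32);
        uint64_t product = (uint64_t)(d_lo + k_lo) * (uint32_t)(d_hi + k_hi);
        return acc + data64 + product;
    }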
This is followed up by "ScrambleAcc":

        assert((((size_t)acc) & 31) == 0);
        {   ALIGN(32) __m256i* const xacc = (__m256i*) acc;
            const     __m256i* const xkey  = (const __m256i *) key;
    
            size_t i;
            for (i=0; i < STRIPE_LEN/sizeof(__m256i); i++) {
                __m256i data = xacc[i];
                __m256i const shifted = _mm256_srli_epi64(data, 47);
                data = _mm256_xor_si256(data, shifted);
    
                {   __m256i const k   = _mm256_loadu_si256 (xkey+i);
                __m256i const dk  = _mm256_mul_epu32 (data,k);          /* U64 dk[4]  = {d0*k0, d2*k2, d4*k4, d6*k6} */
    
                    __m256i const d2  = _mm256_shuffle_epi32 (data,0x31);
                    __m256i const k2  = _mm256_shuffle_epi32 (k,0x31);
                __m256i const dk2 = _mm256_mul_epu32 (d2,k2);           /* U64 dk2[4] = {d1*k1, d3*k3, d5*k5, d7*k7} */
    
                    xacc[i]  = _mm256_xor_si256(dk, dk2);
            }   }
        }
I don't believe AVX2 rotation exists (correct me if I'm wrong though). So the author has opted for _mm256_srli_epi64 (... 47), which is a 47-bit right shift, followed up with XOR. This is how the author gets those "highly mixed" high-bits back down to the lower-bits.
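
Written out per 64-bit lane, my reading of ScrambleAcc is the following scalar sketch (again not the upstream code, just the lane-wise arithmetic):

    #include <stdint.h>

    static uint64_t scramble_lane(uint64_t acc, uint64_t key64)
    {
        acc ^= acc >> 47;                                     /* fold the well-mixed high bits down */
        uint64_t lo = (uint64_t)(uint32_t)acc         * (uint32_t)key64;          /* low 32-bit halves  */
        uint64_t hi = (uint64_t)(uint32_t)(acc >> 32) * (uint32_t)(key64 >> 32);  /* high 32-bit halves */
        return lo ^ hi;
    }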

Vectorization has broadened to 512-bits of state (4x128) so that the vectorized steps can be ILP'd.

The keys are:

    ALIGN(64) static const U32 kKey[KEYSET_DEFAULT_SIZE] = {
        0xb8fe6c39,0x23a44bbe,0x7c01812c,0xf721ad1c,
        0xded46de9,0x839097db,0x7240a4a4,0xb7b3671f,
        0xcb79e64e,0xccc0e578,0x825ad07d,0xccff7221,
        0xb8084674,0xf743248e,0xe03590e6,0x813a264c,
        0x3c2852bb,0x91c300cb,0x88d0658b,0x1b532ea3,
        0x71644897,0xa20df94e,0x3819ef46,0xa9deacd8,
        0xa8fa763f,0xe39c343f,0xf9dcbbc7,0xc70b4f1d,
        0x8a51e04b,0xcdb45931,0xc89f7ec9,0xd9787364,
        0xeac5ac83,0x34d3ebc3,0xc581a0ff,0xfa1363eb,
        0x170ddd51,0xb7f0da49,0xd3165526,0x29d4689e,
        0x2b16be58,0x7d47a1fc,0x8ff8b8d1,0x7ad031ce,
        0x45cb3a8f,0x95160428,0xafd7fbca,0xbb4b407e,
    };
Hmmm... this key is added with the data as it gets mixed in.

---------

Definitely seems like cleaner code actually. Again, I'm not seeing any obvious red-flags in the code, so it looks pretty good to me.

The only thing is that the 32-bit to 64-bit multiply step seems odd to me. It probably isn't losing any information, but I'm kind of mind-farting and can't see whether or not that step is losing entropy... I'll probably have to sleep on it.

Specifically this step:

    __m256i const res = _mm256_mul_epu32 (dk, _mm256_shuffle_epi32 (dk, 0x31));  /* uint64 res[4] = {dk0*dk1, dk2*dk3, ...} */
Those numbers aren't necessarily odd. It's a 32-bit multiply that takes two 32-bit numbers, and outputs a 64-bit number (vectorized). It doesn't seem like a "strong" hash, but maybe I need to think about it more. If this is a good hash (and with all the tests done on it... it probably is a good hash), then I'd be surprised.

It reminds me of an RNG that Knuth discussed in his book: where you multiply two numbers and then take the "middle" results of the multiplication (https://en.wikipedia.org/wiki/Middle-square_method). So I feel like there's potential for weakness here. I can't see it one way or the other yet, it's just something I'm trying to think about...


According to the UMAC paper, the 32x32=>64 multiplication only contains 32 bits of entropy, even though it uses 64-bit space. That's understandable: most of the entropy will be in the middle of the register.

That's enough for XXH3. Since it maintains a 512-bit internal state, that means it transports 256 bits of entropy.


But there are 64 bits of input (32-bit A x 32-bit B). So if you have 64 bits of input but only 32 bits of output entropy, then you've lost information.

I realize you have to compress data down in a Hash function somehow, but ideally you want to minimize the loss of entropy / information from the input bits. The ideal mixing function would have 512-bits of internal state, with 512-bits of entropy starting off... and ending with 512-bits of entropy once all the mixing were done.

If your factoid is correct, then we're starting with 512-bits of entropy, but only 256-bits of entropy after the multiply.

> That's enough for XXH3. Since it maintains a 512-bit internal state, that means it transports 256-bit of entropy.

Why not optimize the function, and aim for only 256-bits of internal state (with 256-bits of entropy) ??

See, you can cut down on state and possibly improve performance. Maybe not on Intel Skylake, but probably on ARM Neon / AMD Zen (which have 128-bit SIMD registers internally). Hmmmm... 512-bit (aka 2x AVX2 registers) is probably needed to get good ILP on Intel processors. So there probably wouldn't be much improvement for Intel.

-----------

Again though: I'm not sure if it's losing information yet. It's just something I'm thinking about... (32-bit XOR would be 32-bit + 32-bit input, with only 32-bits of entropy output. But unlike multiply, you only have 32-bits of state)

32-bit multiplication with a constant, keeping only the bottom 32-bits, keeps all 32-bits of entropy without being forced to expand to 64-bits. I definitely like the bit-mixing properties of multiply, it's just difficult to find configurations of multiplication that save every bit of entropy.


It's not exactly "losing information". We are not trying to regenerate original data, just make sure that all source bits can fairly influence the result. It's more a question of bit contribution.

In the resulting 64-bit register, bit 0 can only receive a contribution from bit 0 of each 32-bit input. So it's only representative of those 2 bits. Same at the other end: bit 63 depends mostly on bit 31 of each 32-bit input, and also on carries propagated from lower bits (making things more complex).

In the middle, many more bits participate, so that's where they are "well mixed".

This question of bit distribution becomes critical if the 64-bit accumulator was used "as is" to produce a hash, but fortunately it's not. Accumulators will get mixed before that stage. When the mixing function is done correctly, every bit will get redistributed to the point of saturation. After that point, it does not matter that initially one accumulator's bit had fewer contributions than another: all source bits contribute fully. This can be easily verified with SMHasher's Avalanche test.
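
For anyone unfamiliar with that test, the idea is roughly this (a toy sketch, not SMHasher itself): flip each input bit in turn and count how often each output bit flips; every counter should land near 50% of the trials.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Toy avalanche check for an arbitrary 64->64 bit mixing function. */
    void avalanche(uint64_t (*mix)(uint64_t), int trials)
    {
        int flips[64][64] = {{0}};
        for (int t = 0; t < trials; t++) {
            uint64_t x = ((uint64_t)rand() << 32) ^ (uint64_t)rand();  /* crude input, fine for a sketch */
            uint64_t h = mix(x);
            for (int in = 0; in < 64; in++) {
                uint64_t d = h ^ mix(x ^ (1ULL << in));
                for (int out = 0; out < 64; out++)
                    flips[in][out] += (int)((d >> out) & 1);
            }
        }
        /* Flag any input/output bit pair far from the ideal 50% flip rate. */
        for (int in = 0; in < 64; in++)
            for (int out = 0; out < 64; out++)
                if (flips[in][out] < trials * 2 / 5 || flips[in][out] > trials * 3 / 5)
                    printf("weak: input bit %2d -> output bit %2d (%d/%d)\n",
                           in, out, flips[in][out], trials);
    }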

Finally, XXH3 does not follow UMAC too closely, and adds an operation which ensures that all bits are necessarily present at least once in the accumulator. This compensates for the risk of multiplying by zero, which _is_ dangerous for checksumming, as it would nullify a contribution.


> It's not exactly "losing information". We are not trying to regenerate original data, just make sure that all source bits can fairly influence the result. It's more a question of bit contribution.

I agree. But entropy is a great concept that allows us to reason about whether or not "input bits" can affect "output bits".

Lets look at XXH3_scrambleAcc:

    __m256i const k   = _mm256_loadu_si256 (xkey+i);
    __m256i const dk  = _mm256_mul_epu32 (data,k);
    __m256i const d2  = _mm256_shuffle_epi32 (data,0x31);
    __m256i const k2  = _mm256_shuffle_epi32 (k,0x31);
    __m256i const dk2 = _mm256_mul_epu32 (d2,k2);
    xacc[i] = _mm256_xor_si256(dk, dk2);
Assume you start with 512-bits of input entropy (that is: assume that all 512-bits of state are uniformly random). Will you have 512-bits of uniformly random output after the above operations? Or to put it in your words: are you 100% sure that the above steps allow every input bit to affect every output bit?

It appears not. My counter-example (not that I've debugged your code... but based on my reading) is as follows: key[2] is 0x7c01812c, and key[3] is 0xf721ad1c. This means that key[2] AND key[3] are even numbers.

Which means dk.int64[1] bit0 will always be zero. And dk2.int64[1] bit0 will ALSO always be zero (on iteration 0, i=0)

I bet you, at least... if I did my math correctly... that xacc[0].int64[1].bit0 will ALWAYS be zero, no matter what state you start with (bit#64 of xacc[0] == 0)

This is because xacc[0].int64[1].bit0 = dk.int64[1].bit0 XOR dk2.int64[1].bit0

And both of those component parts seem to always be zero. Which means you're at LEAST wasting the 64th bit of state each time you perform XXH3_scrambleAcc. It certainly seems like a weakness of the XXH3_scrambleAcc function to me.

That's a lot of operations you do, to literally throw away half your bits. There's probably an optimization you can do here to get better mixing.
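
A quick standalone check of the parity argument above (a sketch; it only models the two 32x32=>64 products that feed that lane, not the full ScrambleAcc):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const uint32_t key2 = 0x7c01812c, key3 = 0xf721ad1c;   /* both even */
        for (int t = 0; t < 1000000; t++) {
            uint32_t d2 = (uint32_t)rand() ^ ((uint32_t)rand() << 16);  /* stand-ins for the lane's data words */
            uint32_t d3 = (uint32_t)rand() ^ ((uint32_t)rand() << 16);
            uint64_t dk  = (uint64_t)d2 * key2;   /* second 64-bit lane of dk  */
            uint64_t dk2 = (uint64_t)d3 * key3;   /* second 64-bit lane of dk2 */
            if ((dk ^ dk2) & 1) { printf("bit 0 was set\n"); return 1; }
        }
        printf("bit 0 of that lane was never set\n");   /* even * anything stays even */
        return 0;
    }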


Well, if you believe a better scrambling operation is possible, you are certainly welcome to suggest one. Considering feedback on the algorithm is one of the objectives of the test phase. If the issue is about the default keys, it's also possible to change them, though one will also have to consider what happens with custom keys and whether it implies ensuring some conditions (custom keys are a long-term objective of XXH3). In the end, XXH3 only generates 64 and 128 bit hashes, and maybe 256 in the future, so compression necessarily happens somewhere. I believe the issue would be more pressing if the intention was to generate a 512-bit hash, but that has never been an objective.


HighwayHash uses PSHUFB to spread entropy from twin 32x32=64 muls across 128-bit vectors [or vector halves, on AVX2]. See ZipperMerge in https://github.com/google/highwayhash/blob/master/highwayhas....


> Well, if you believe a better scrambling operation is possible, you are certainly welcomed to suggest one.

Oh, criticism of the algorithm is the easy part, especially when I'm being vague and can't think of solid proof of my suppositions. Coming up with a better scrambling operation is the hard part. :-)

As I stated earlier, there are no obvious "red flags" in your algorithm. It's just this obscure part of the function that makes me think that optimization potential exists.

Since you only have 256-bits of entropy in your 512-bit accumulator, my goal if I were to go through your code is to cut the state down to 256-bits, while keeping 256-bits of entropy.

1. All multiplication keys should be odd (bottom bit set to 1). This ensures that all multiplication steps that only keep the 32-bottom bits will keep all of their entropy.

2. Precisely throw away the top 32-bits of your multiplications, and only keep the bottom 32-bits. This will be slower, but it ensures full entropy when combined with odd numbers for the key.

3. Now that we have full entropy on the multiplication steps, cut down the 512-bit accumulator to 256-bits, which should improve performance on AMD Zen / ARM NEON. It may help Intel Skylake, but Skylake is so wide that it's hard to generate enough ILP to fill it.

--------------------

Without testing the code, here's an idea.

    __m256i const k   = _mm256_loadu_si256 (xkey+i);
    __m256i const dk  = _mm256_mul_epu32 (data,k);
    __m256i const d2  = _mm256_shuffle_epi32 (data,0x31);
    __m256i const k2  = _mm256_shuffle_epi32 (k,0x31); // <---- This line can be precomputed by the way...
    __m256i const dk2 = _mm256_mul_epu32 (d2,k2);
    // Above this line is the same code as before
    __m256i const dk2s = _mm256_shuffle_epi32 (dk2,0xc4);   // dk2 was declared const above, so use a new name
    xacc[i] = _mm256_blend_epi32(dk, dk2s, 0xaa);
As long as the keys were all odd (bottom bit set to 1), the above should mix the bits without losing any entropy. So you can now have a 256-bit state.

----------

The accumulate portion would be simply:

    __m256i const add = _mm256_xor_si256(d, xacc);   /* _mm256_xor_si256 is the AVX2 xor; _mm256_xor_epi64 needs AVX-512VL */
    xacc = _mm256_xor_si256(res, add);
XOR (or add) captures all the entropy from the input. In my experiments, XOR works better with multiply (I don't know why, but it's just something I've noticed). But otherwise, it's conceptually similar to your original "add" code.

----------

I mean, I'm just shooting from the hip here. I don't know how well the above code mixes things around. But those are kind of the steps that I would take to move forward.

----------

Although, if you want me to be 100% honest, anything I'd do "for real" would revolve around _mm_aesenc_si128. That'd be

    xacc[0] = _mm_aesenc_si128(xacc[0], data[0]);
    xacc[1] = _mm_aesdec_si128(xacc[1], data[1]);
    
    // Finalization:
    __m128i finalizationA = _mm_aesenc_si128(xacc[0], const1);
    finalizationA = _mm_aesenc_si128(finalizationA , const2);
    finalizationA = _mm_aesenc_si128(finalizationA , const3);
    finalizationA = _mm_aesenc_si128(finalizationA , const4);
   
    __m128i finalizationB = _mm_aesdec_si128(xacc[1], const1);
    finalizationB = _mm_aesdec_si128(finalizationB , const2);
    finalizationB = _mm_aesdec_si128(finalizationB , const3);
    finalizationB = _mm_aesdec_si128(finalizationB , const4);

    __m128i finalHash = _mm_xor_si128(finalizationA, finalizationB);   // _mm_xor_si128 is the SSE2 xor

    // Return bottom 32, 64, or 128-bits of finalHash
 
    // 4 rounds of finalization should be enough. 
    // Testing required to find the minimum for an avalanche condition. It will be more than 2, but probably less than 4...
And... the end. That's it. It only uses the 128-bit functionality of the execution units, but still gets ILP off of two parallel instances. So 256-bits of state. It's probably a fast function, but I can't be bothered to test it right now. I'm curious if the simpler functionality above would outspeed the 256-bit execution units on Skylake, but I don't have a Skylake box to test on (just AMD Zen).

With AVX512, _mm512_aesenc_epi128 is possible to do this over 512-bits (4x 128-bit parallel instances of AES encryption). But I don't have a Skylake-X machine to test this idea with.

See here for my experiment on AES functionality: https://github.com/dragontamer/AESRand


Thanks for the suggestions @dragontamer. We are genuine when saying the algorithm is open to suggestions, and can still change to improve its properties. Let's review yours:

> my goal if I were to go through your code is to cut the state down to 256-bits, while keeping 256-bits of entropy.

The point is, it's a lot more difficult to keep the registers "fully loaded" at maximum entropy _at every step_. By going double size, and accepting that this entropy is leniently dispersed into the larger register, we make our lives a lot easier, which translates into appreciable speed gains, to be detailed below.

> All multiplication keys should be odd (bottom bit set to 1). This ensures that all multiplication steps that only keep the 32-bottom bits will keep all of their entropy.

One core idea of UMAC, which XXH3 is based upon, is that the keys could be any random number (they are supposed to be secret). Forcing them to be odd reduces available space by one bit. Not a very big deal, but still. This could also be ensured by adding an `OR 1` operation on loading the key.

> Precisely throw away the top 32-bits of your multiplications, and only keep the bottom 32-bits. This will be slower, but it ensures full entropy when combined with odd numbers for the key.

OK, so that's where the notion of "bit contribution" becomes useful. By making a 32x32=>32 multiplication, and ensuring the multiplier is odd, you have correctly mixed the lowest bit. But the other bits do not contribute to lower bits. At the other extreme, the highest bit only contributes to one bit (the highest) in the resulting product. It's clearly not well mixed.

This can be compensated, with a rotation, or a right shift + add operation, followed by another spreading (another multiplication), and another right shift + add. But all this adds quite a few operations, right in the critical loop, so this is no longer the same speed.

> This line can be precomputed by the way...

That's a great point! It transforms the shuffle into a load; it's not completely free but is probably faster. On the other hand, it requires memory to store the swapped table of keys, which can be troublesome if the size of the table of keys can be customized. I'll look into it, thanks for the suggestion!

> XOR (or add) captures all the entropy from the input. In my experiments, XOR works better with multiply

There are 2 parts here :

- It's not possible to XOR `d` with `res` (which is what happens transitively in your proposal). The whole point of adding `d` is that it avoids cancelling a contributor, which can happen if there is a multiplication by zero. With XOR, the same impact can still happen, but it requires a multiplication by 1 instead: `(1*d)^d = 0`. Same problem if `d` is subtracted. But with an add operation, cancelling `d`'s contribution requires a multiplication by `-1`, and that is precisely impossible when doing a 32x32=>64 multiplication. Unfortunately, when doing a 32x32=>32 multiplication, it actually becomes possible: -1 == 0xFFFFFFFF (a tiny standalone check of this case is sketched right after this list). So the addition doesn't save the situation, and it becomes necessary to change the formula.

- Xoring `res` with `acc` seems more tractable. I tried it in the early days of XXH3, but unfortunately it proved worse in terms of hash quality. At this stage, I'm not too sure why. Maybe later changes indirectly solved an underlying issue, so it might be worth trying again to see if it proves any better.
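
A tiny standalone check of the `-1` case from the first point above: with a truncating 32x32=>32 multiply, the key value 0xFFFFFFFF cancels the added `d` for every possible input.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const uint32_t k = 0xFFFFFFFFu;            /* acts as -1 mod 2^32 */
        for (uint64_t d = 0; d < 1000000; d++) {
            uint32_t r = (uint32_t)d * k + (uint32_t)d;   /* d*(-1) + d wraps to 0 */
            if (r != 0) { printf("non-zero at d=%llu\n", (unsigned long long)d); return 1; }
        }
        printf("d*0xFFFFFFFF + d == 0 for every tested d\n");
        return 0;
    }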

> to be 100% honest, anything I'd do "for real" would revolve around _mm_aesenc_si128

I fully agree. Leveraging dedicated hardware capabilities is most likely efficient, and AES does a great job of mixing bits. This is more difficult to emulate with simpler instructions. There are only 2 minor issues to be aware of:

- It relies on the presence of a hardware AES module. While not an issue when the target platform basically guarantees its presence, it's a non-trivial problem when targeting broader portability. Platforms without AES, or without access to it (yes, even on Intel, some "systems" can't access the AES instructions, think `WASM` or kernel space for example), will pay a hefty performance price while using a software fallback. It's not necessarily a killer issue, just something to be aware of. xxHash tries to target very broad portability. This is a "handicap", but with its own benefits.

- AES instructions have excellent throughput, but latency is non-negligible. This is especially true on pre-Skylake CPUs. Latency is hard to measure, so it's frequently forgotten in benchmarks. In my own tests (on a latest-generation Intel CPU, so very favorable to AES), using AES instructions ends up in the 80M/s region when measuring latency, which is not bad. Compare with XXH3, which ends up in the 110-150M/s region.

Don't read me wrong, using AES is likely a good choice. The only reason XXH3 doesn't use it is that it targets very broad portability, including targets without a hardware AES module.


> OK, so that's where the notion of "bit contribution" becomes useful. By making a 32x32=>32 multiplication, and ensuring the multiplier is odd, you have mixed correctly the lowest bit. But the other ones do not contribute to lower bits. At the other extreme, the highest bit only contribute to one (the highest) bit in the resulting product. It's clearly not well mixed.

True, but this is also true for the top 16-bits and bottom 16-bits of a 64-bit multiply. It's a problem innate to multiplication: the "middle" bit (bit#31) will be best, while the "edge" bits (bit#0 or bit#63) will be awful.

> This can be compensated, with a rotation, or a right shift + add operation, followed by another spreading (another multiplication), and another right shift + add. But all this adds quite a few operations, right in the critical loop, so this is no longer the same speed.

There are quite a few other ways to compensate, which will remain efficient. vpshufb has 1-cycle latency and once-per-clock throughput. If this were ARM, an NVidia GPU, or an AMD GPU, bit-reversal would work (RBIT on ARM).

But since this is Intel, the fastest way to spread bits around is to do shuffle(state, [0,1,2,3]) (which is my special shorthand for _mm256_shuffle_epi32 (k,0x1b)).

In general, you just need to map the "strongly mixed" bits (bit#31) to the locations which will potentially affect the most bits (bit#0 in the next multiplication). Bit-reversal is best, but almost any vpshufb should do the trick.
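
As a concrete (hypothetical) sketch of that idea with AVX2: one word-reversing shuffle between two 32x32=>64 multiply rounds puts the well-mixed middle words of each product into the even positions the next multiply reads from. Constants here are placeholders, not XXH3's.

    #include <immintrin.h>

    /* Hypothetical two-round mixer: multiply, reverse the four 32-bit words
       inside each 128-bit lane (immediate 0x1b), multiply again. */
    static __m256i mix2(__m256i x, __m256i k1, __m256i k2)
    {
        __m256i p = _mm256_mul_epu32(x, k1);          /* 32x32 => 64-bit products       */
        __m256i r = _mm256_shuffle_epi32(p, 0x1b);    /* word reversal per 128-bit lane */
        return _mm256_mul_epu32(r, k2);
    }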

> The whole point of adding d is that it avoids cancelling a contributor, which can happen if there is a multiplication by zero

"Cancelling a contributor" is bad, but I strongly disagree with your characterization of it!

"Cancellation of a contributor" only happens when two inputs map to the same output. By the pigeonhole principle, something MUST map to zero. Otherwise, the "zero" output is wasted.

You've identified the singular inputs that result in zero in various ways on 32-bit -> 32-bit multiplication. That's a good thing!! There is ONLY ONE input that maps to zero in my proposed multiplication.

Again: It's not so important that "nothing maps to zero". It's far more important that "exactly one thing maps to zero".

----------

In your algorithm: if d[0] and d[1] are zero, you get a zero as a result. (d[0] * k[0] + d[1] * k[1] == dk[0] == 0 when d[0] == d[1] == 0).

But... (d[0] * k[0] + d[1] * k[1] == 0) defines a whole slew of "potential mappings to zero". It's non-trivial to analyze this, especially with overflow (the "zero" on the right hand side is 64-bit unsigned with overflow).

In a 16-bit case, 0xffff * 0xfffc (d[0] == 0xffff, k[0] == 0xfffc) will be "inverted" by d[1] == 163838, k[1] == 2.

So in this case, we have a violation of the pigeonhole principle: d[0] == d[1] == 0 will map to zero, AND d[0] == 0xffff, k[0] == 0xfffc, d[1] == 163838, k[1] == 2 maps to zero as well.

That's how you "lose bits", with multiple inputs potentially mapping to the same output (in this case: zero, but in general... any input).

With regards to multiplication by an odd number, someone already proved it's reversible: https://lemire.me/blog/2017/09/18/computing-the-inverse-of-o...

Which means there is exactly one input that maps to exactly one output for all odd-multipliers.

> Don't read me wrong, using AES is likely a good choice. The only reason XXH3 doesn't use it is that it targets very broad portability, including targets without a hardware AES module.

Hmmmm... in any case, a 32-bit multiply is certainly more portable. I think the throughput / latency issue can be easily solved by increasing the state. 5x128 bits of state to unroll the inner-loop should be sufficient (5-cycles of latency, once-per-cycle of throughput == 640-bit state should be best)

In any case, you have a point about portability. But note that the AES instruction is implemented on ARM, POWER9, and Intel machines now. So it's surprisingly portable, even if systems as a whole (ex: WASM or maybe the Java Virtual Machine) don't necessarily support it yet.


Following your suggestion, I went ahead and modified the scrambling formula. The crux of the problem is that the secret is allowed to be absolutely anything (it's user provided). As a consequence, it's a bad idea to depend on it for the multiplier. Now, the secret key's only function is to be a mask, while the multiplier is under implementation control, guaranteeing a good prime.


> Why not optimize the function, and aim for only 256-bits of internal state (with 256-bits of entropy) ??

It's a lot more difficult to ensure that accumulators contain full-width entropy at all times. The trade-off in this algorithm is that we intentionally use bigger accumulators and don't even try to maintain the entropy at full level all the time. All it needs is to ensure that the level of entropy is _at least_ 32 bits per accumulator, which is much easier. This leads to better speed.


If I am not mistaken 0xAAAAAAAA5 is 45812984485 = 5 * 28297 * 323801. At least that is what

    factor $(echo -e "ibase=16\nAAAAAAAA5" | bc -l)
claims.


Oh man, a typo. Lol. I meant 0xAAAAAAA5... sorry about that.

Lemme muse on the concept: the key point about multiplying by 0xAAAAAAA5 is that its binary representation is 1010,1010,1010,...,0101.

A multiplication, at its core, is equivalent to a LOT of bitshift-lefts and adds. Multiplying by binary "1000" is the same as a bitshift left by 3. Multiplying by "1010" is a bitshift left by 3 plus a bitshift left by 1.

This is why it's important to multiply by an odd number: all those bitshifts lose the top-most bits. You only keep the original top-most bit if the bottom-most bit of the multiplier is set.

So multiply by 0xAAAAAAA5 is the same as "bitshift left 31 + bitshift left 29 + bitshift left 27 ... + bitshift left 2 + bitshift left 0", all at once.
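
That equivalence is easy to check in a few lines (my own standalone sketch):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const uint32_t x = 0xDEADBEEFu;   /* arbitrary test value */
        const uint32_t k = 0xAAAAAAA5u;
        uint32_t sum = 0;
        for (int b = 0; b < 32; b++)
            if ((k >> b) & 1)
                sum += x << b;            /* one shifted copy of x per set bit of k */
        printf("%s\n", sum == x * k ? "equal" : "different");   /* prints "equal" */
        return 0;
    }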

Addition itself is a great "mixing" instruction: it's innately an XOR, with the potential for carries. The carries make the XORs far more complicated to analyze. The 31st bit, after all of those bitshifts, is not only the XOR of a bunch of bits... but also the "propagated carry bit".

The topmost bit of any multiplication operation is very complex, while the bottom most bit is always the original bit (bit0 is only affected by bit0 from the input). That's why a rotation + 2nd multiply is needed.


0xAAAAAAA5 makes a lot more sense, as it is a 32-bit integer. Btw it is 3^6 * 5^2 * 157109, so not a prime, just as you expected. Then again you don't care about that, only about the bit pattern.

Any reason why you use 0xAAAAAAA5 instead of 0xAAAAAAAA or 0xA5A5A5A5?


> 0xAAAAAAA5 instead of 0xAAAAAAAA or 0xA5A5A5A5?

0xAAAAAAAA isn't odd. Every time you multiply with 0xAAAAAAAA, you lose the top-most bit of your state (alternatively: the bottom-most bit will ALWAYS be zero, so you lose a bit). If you start with 32 bits of entropy, multiplying with 0xAAAAAAAA leaves you with 31 bits of entropy.
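
A two-line illustration of that lost bit (a standalone sketch): with the even multiplier, two inputs that differ only in their top bit collide, so the map can't be inverted.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const uint32_t k = 0xAAAAAAAAu;      /* even multiplier */
        const uint32_t a = 0x00000001u;
        const uint32_t b = a ^ 0x80000000u;  /* differs only in the top bit */
        /* 0x80000000 * k wraps to 0 mod 2^32, so both products are identical. */
        printf("%08x %08x\n", a * k, b * k);
        return 0;
    }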

> 0xA5A5A5A5

That's where statistical tests need to come into the picture, to differentiate. I've written GPGPU programs to brute-force search for constants based on a few metrics.

In these brute-force searches, even a simple number like 0xAAAAAAA5 works out decently. It's suboptimal in my experience, but it's surprisingly "not bad". My only point is that you shouldn't limit your search to prime numbers. Any odd number works pretty decently as a multiplication constant.


Addition and multiplication also leak more information via side channels than bit shift and xor.


I'd be surprised if that's true.

I'm pretty sure addition is always 1 clock tick of latency / 3 instructions per cycle, while multiplication is 5 clock ticks of latency / 1 instruction per cycle, IIRC.

So if the timing of those instructions is constant, it seems difficult to side-channel / timing attack them.


There's also power: 0xffffffff + 0x1 results in 32 bit flips, 0x0 + 0x1 results in 1; they don't consume the same amount of energy.



