> Very cool! Independent of the cool use of `aesenc` and `aesdec`, the features for skipping ahead in the random stream and forking a separate stream are awesome.
Ah yeah, those features... I forgot about them until you mentioned them, lol.
I was thinking about a 4x version (512 bits per iteration) with enc(enc), enc(dec), dec(enc), and dec(dec) as the four 128-bit results: going from 256 bits per iteration to 512 bits per iteration with only 3 more instructions. I don't think I ever tested that...
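Something like this, as a hypothetical untested sketch with Rust's x86 intrinsics (not the actual AESRAND code; the key constant here is just a placeholder):

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "aes")]
    unsafe fn aesrand_4x(state: &mut core::arch::x86_64::__m128i) -> [core::arch::x86_64::__m128i; 4] {
        use core::arch::x86_64::*;
        let key = _mm_set_epi64x(0x9E3779B97F4A7C15u64 as i64, 0xD1B54A32D192ED03u64 as i64);
        *state = _mm_add_epi64(*state, key);   // advance the counter-like state
        let e = _mm_aesenc_si128(*state, key); // one forward AES round
        let d = _mm_aesdec_si128(*state, key); // one inverse AES round
        // Four 128-bit outputs = 512 bits per iteration.
        [
            _mm_aesenc_si128(e, key), // enc(enc)
            _mm_aesenc_si128(d, key), // enc(dec)
            _mm_aesdec_si128(e, key), // dec(enc)
            _mm_aesdec_si128(d, key), // dec(dec)
        ]
    }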
But honestly, the thing that really made me stop playing with AESRAND was discovering multiply-bitreverse-multiply random number generators (still unpublished... just sitting in a directory in my home computer).
Bit-reverse is single-cycle on GPUs (NVidia and AMD), and perfectly fixes the "multiplication only randomizes the top bits" problem.
Bit-reverse is unimplemented on x86 for some reason, but bswap64() is good enough. Since bswap64() and 64-bit multiply are both really fast on x86-64, a multiply-bswap64-multiply generator is probably fastest for typical x86 code (since there are penalties for moving between x86 64-bit registers and AVX 256-bit registers).
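A minimal Rust sketch of the idea (the constants below are just arbitrary odd numbers for illustration, not the ones I actually tested):

    // multiply-bswap64-multiply mixer: both multipliers must be odd.
    fn mix(x: u64) -> u64 {
        let x = x.wrapping_mul(0x9E3779B97F4A7C15); // randomizes the high bits
        let x = x.swap_bytes();                     // bswap64: moves that entropy back down
        x.wrapping_mul(0xD1B54A32D192ED03)          // second multiply finishes the mixing
    }

Fed with a simple counter (mix(0), mix(1), mix(2), ...) this becomes a counter-based generator.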
---------
The key is that multiplying by an odd number (bottom-bit == 1) results in a fully invertible (aka: no information loss) operation.
So multiply-bitreverse-multiply is a bijection over the 64-bit integer space: every 64-bit integer has a single, UNIQUE multiply-bitreverse-multiply image. (multiply-bitreverse-multiply(0) == 0 is the one edge case where things don't really work out. An XOR or ADD instruction might fix that problem...)
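To see why there's no information loss: an odd multiplier has a multiplicative inverse mod 2^64 (a few Newton iterations compute it), and bswap/bit-reverse are their own inverses, so the whole mix can be undone. A sketch, reusing the same illustrative constants as above:

    // Inverse of an odd 64-bit multiplier mod 2^64 via Newton's iteration:
    // each step doubles the number of correct low bits (3 -> 6 -> 12 -> ... -> 96).
    fn mul_inverse(a: u64) -> u64 {
        assert!(a & 1 == 1, "only odd multipliers are invertible mod 2^64");
        let mut x = a; // any odd a is its own inverse mod 8, so this seeds 3 correct bits
        for _ in 0..5 {
            x = x.wrapping_mul(2u64.wrapping_sub(a.wrapping_mul(x)));
        }
        x
    }

    // Undo the mix() sketch above: unmix(mix(v)) == v for every 64-bit v.
    fn unmix(y: u64) -> u64 {
        let y = y.wrapping_mul(mul_inverse(0xD1B54A32D192ED03));
        let y = y.swap_bytes(); // bswap is its own inverse
        y.wrapping_mul(mul_inverse(0x9E3779B97F4A7C15))
    }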
---------
> This is a great idea. I wonder if we could speed up LuaJIT even more by SIMD accelerating the GC's mark and/or sweep phases...
Mark and Sweep looks hard to SIMD-accelerate, in my opinion. At least, harder than a bump allocator. I'm not entirely sure what a SIMD-accelerated traversal of the heap is even supposed to look like (aka: simd-malloc() looks pretty hard).
If all allocs are prefix-sum'd across the SIMD units (ex: malloc({1, 4, 5, 1, 2, 3, 20, 10}) returns memory + {0, 1, 5, 10, 11, 13, 16, 36}) for a bump-allocator-like strategy... it's clear to me that such a mark/sweep allocator would have fragmentation issues. But I guess it would work...
Semispace collectors innately fix the fragmentation problem, so prefix-sum(size + header) allocation there is just simple and obvious.
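A rough Rust sketch of that prefix-sum bump allocation, with a scalar scan standing in for the SIMD prefix sum (the names and layout here are made up for illustration):

    use std::sync::atomic::{AtomicUsize, Ordering};

    // Each lane requests a size; an exclusive prefix sum gives per-lane offsets,
    // and one atomic add bumps the shared pointer for the whole batch.
    fn bump_alloc_batch(bump: &AtomicUsize, sizes: &[usize; 8]) -> [usize; 8] {
        let mut offsets = [0usize; 8];
        let mut total = 0;
        for (i, &size) in sizes.iter().enumerate() {
            offsets[i] = total; // exclusive prefix sum
            total += size;
        }
        let base = bump.fetch_add(total, Ordering::Relaxed); // single bump for all 8 lanes
        for offset in offsets.iter_mut() {
            *offset += base;
        }
        offsets // e.g. sizes {1,4,5,1,2,3,20,10} -> base + {0,1,5,10,11,13,16,36}
    }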
--------
On the "free" side of Mark/sweep... I think the Mark-and-sweep itself can be implemented in GPU-SIMD thanks to easy gather/scatter on GPUs.
However, because CPU gather/scatter is either missing (AVX2 has gather but no scatter) or slow (AVX-512 doesn't seem to implement a very efficient vgather or vscatter), I'm not sure SIMD would be a big advantage for a CPU-based Mark/Sweep.
------------
Yup yup. Semispace GC or bust, IMO anyway for the SIMD-world. Maybe mark-compact (since mark-compact would also fix the fragmentation issue).
The mark-phase is just breadth-first search, which seems like a doable SIMD pattern with the right data structure (breadth-first is easier to parallelize than depth-first).
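For what it's worth, here's a scalar Rust sketch of a frontier-based mark phase (the object layout is invented for illustration); the point is that every object in a frontier can be processed as one batch, which is the part a SIMD/GPU version would chew through with gather/scatter:

    struct Heap {
        marked: Vec<bool>,
        children: Vec<Vec<usize>>, // outgoing references per object
    }

    // Breadth-first mark: process one frontier (level) at a time.
    fn mark(heap: &mut Heap, roots: &[usize]) {
        let mut frontier: Vec<usize> = roots.to_vec();
        while !frontier.is_empty() {
            let mut next_frontier = Vec::new();
            for &obj in &frontier {
                if !heap.marked[obj] {
                    heap.marked[obj] = true;
                    next_frontier.extend_from_slice(&heap.children[obj]);
                }
            }
            frontier = next_frontier; // each level is an independent, batchable unit of work
        }
    }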
> Bit-reverse is unimplemented on x86 for some reason, but bswap64() is good enough.
You totally nerd-sniped me! I implemented a basic "reverse 128-bit SIMD register" routine with `packed_simd` in Rust. The idea is to process 4 bits at a time:
    let lo_nibbles = input & u8x16::splat(0x0F);
    let hi_nibbles = input >> 4;
Then, we can use `pshufb` to implement a lookup table for reversing each vector of nibbles.
    let lut = u8x16::new(0b0000, 0b1000, 0b0100, 0b1100, 0b0010, 0b1010, 0b0110, 0b1110, 0b0001, 0b1001, 0b0101, 0b1101, 0b0011, 0b1011, 0b0111, 0b1111);
    let lo_reversed = lut.shuffle1_dyn(lo_nibbles);
    let hi_reversed = lut.shuffle1_dyn(hi_nibbles);
Now that each nibble is reversed, we can flip the lo and hi nibbles within a byte when reassembling our u8x16.
    let bytes_reversed = (lo_reversed << 4) | hi_reversed;
Then, we can shuffle the bytes to get the final order. We could use a different byte permutation to handle reversing the f64s within an f64x2, too.
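For example (untested sketch), a full 128-bit reversal could reuse the same `shuffle1_dyn` trick with a constant index vector:

    let byte_rev = u8x16::new(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
    let fully_reversed = bytes_reversed.shuffle1_dyn(byte_rev);
    // Reversing each 64-bit lane independently (e.g. the two f64s) would instead
    // use indices 7..=0 followed by 15..=8.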
Looking at the disassembly, if we assumed our LUT and shuffle vectors are already in registers, the core shuffle should be pretty fast. (I haven't actually benchmarked this or run it through llvm-mca, though :-p)
The goal is to find the values of k1 and k2 that result in an evaluate(seed, k1, k2) score close to 16 bits (aka: 50% of bits change, the definition of the "avalanche condition"). There's probably some statistical test I could have done that'd be better, but GPUs have single-cycle popcount and single-cycle XOR.
I forget exactly which search methods I used, but note that a Vega64 GPU easily reaches 10 trillion multiplies per second, so you can exhaustively search a 32-bit space in about a millisecond, and a 40-bit space in just a couple of seconds.
You can therefore search the values of k1 and k2 ~8 bits at a time every few seconds. From there, plug in your favorite search algorithm (genetic algorithms? Gradient descent? Random search? Simulated annealing?).
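Roughly, an evaluate() of that shape looks like the following sketch (a simplified 32-bit version of the mixer, not the actual GPU kernel; the odd-forcing of the constants is just for illustration):

    // Hypothetical avalanche score: flip each input bit, XOR the two outputs,
    // popcount the difference, and average. Ideal is ~16 of 32 bits flipped.
    fn mix32(x: u32, k1: u32, k2: u32) -> u32 {
        x.wrapping_mul(k1 | 1).reverse_bits().wrapping_mul(k2 | 1) // force odd multipliers
    }

    fn evaluate(seed: u32, k1: u32, k2: u32) -> f64 {
        let base = mix32(seed, k1, k2);
        let mut flipped = 0u32;
        for bit in 0..32 {
            flipped += (mix32(seed ^ (1 << bit), k1, k2) ^ base).count_ones();
        }
        flipped as f64 / 32.0
    }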
--------
After that, I'd of course run it through PractRand or BigCrush (and other tests). In all honesty: random constants (with the bottom bit set to 1) pulled from /dev/urandom are already really good.
---------
Exhaustively searching the 64-bit space seems unreasonable, however. I was researching FNV hashes (another multiplication-based hash), trying to understand how they chose their constants.