> Very cool! Independent of the cool use of `aesenc` and `aesdec`, the features for skipping ahead in the random stream and forking a separate stream are awesome.
Ah yeah, those features... I forgot about them until you mentioned them, lol.
I was thinking about a 4x version (512 bits per iteration) with enc(enc), enc(dec), dec(enc), and dec(dec) as the four 128-bit results: going from 256 bits per iteration to 512 bits per iteration with only 3 more instructions. I don't think I ever tested that...
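Something like this, as a hypothetical untested sketch with Rust's x86 intrinsics (not the actual AESRAND code; the key constant here is just a placeholder):

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "aes")]
    unsafe fn aesrand_4x(state: &mut core::arch::x86_64::__m128i) -> [core::arch::x86_64::__m128i; 4] {
        use core::arch::x86_64::*;
        let key = _mm_set_epi64x(0x9E3779B97F4A7C15u64 as i64, 0xD1B54A32D192ED03u64 as i64);
        *state = _mm_add_epi64(*state, key);   // advance the counter-like state
        let e = _mm_aesenc_si128(*state, key); // one forward AES round
        let d = _mm_aesdec_si128(*state, key); // one inverse AES round
        // Four 128-bit outputs = 512 bits per iteration.
        [
            _mm_aesenc_si128(e, key), // enc(enc)
            _mm_aesenc_si128(d, key), // enc(dec)
            _mm_aesdec_si128(e, key), // dec(enc)
            _mm_aesdec_si128(d, key), // dec(dec)
        ]
    }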
But honestly, the thing that really made me stop playing with AESRAND was discovering multiply-bitreverse-multiply random number generators (still unpublished... just sitting in a directory in my home computer).
Bit-reverse is single-cycle on GPUs (NVidia and AMD), and perfectly fixes the "multiplication only randomizes the top bits" problem.
Bit-reverse is unimplemented on x86 for some reason, but bswap64() is good enough. Since bswap64() and 64-bit multiply are both really fast on x86-64, a multiply-bswap64-multiply generator is probably fastest for typical x86 code (since there are penalties for moving between x86 64-bit registers and AVX 256-bit registers).
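A minimal Rust sketch of the idea (the constants below are just arbitrary odd numbers for illustration, not the ones I actually tested):

    // multiply-bswap64-multiply mixer: both multipliers must be odd.
    fn mix(x: u64) -> u64 {
        let x = x.wrapping_mul(0x9E3779B97F4A7C15); // randomizes the high bits
        let x = x.swap_bytes();                     // bswap64: moves that entropy back down
        x.wrapping_mul(0xD1B54A32D192ED03)          // second multiply finishes the mixing
    }

Fed with a simple counter (mix(0), mix(1), mix(2), ...) this becomes a counter-based generator.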
---------
The key is that multiplying by an odd number (bottom-bit == 1) results in a fully invertible (aka: no information loss) operation.
So multiply-bitreverse-multiply is a bijection over the 64-bit integer space: every 64-bit integer has a single, UNIQUE multiply-bitreverse-multiply image. (multiply-bitreverse-multiply(0) == 0 is the one edge case where things don't really work out. An XOR or ADD instruction might fix that problem...)
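To see why there's no information loss: an odd multiplier has a multiplicative inverse mod 2^64 (a few Newton iterations compute it), and bswap/bit-reverse are their own inverses, so the whole mix can be undone. A sketch, reusing the same illustrative constants as above:

    // Inverse of an odd 64-bit multiplier mod 2^64 via Newton's iteration:
    // each step doubles the number of correct low bits (3 -> 6 -> 12 -> ... -> 96).
    fn mul_inverse(a: u64) -> u64 {
        assert!(a & 1 == 1, "only odd multipliers are invertible mod 2^64");
        let mut x = a; // any odd a is its own inverse mod 8, so this seeds 3 correct bits
        for _ in 0..5 {
            x = x.wrapping_mul(2u64.wrapping_sub(a.wrapping_mul(x)));
        }
        x
    }

    // Undo the mix() sketch above: unmix(mix(v)) == v for every 64-bit v.
    fn unmix(y: u64) -> u64 {
        let y = y.wrapping_mul(mul_inverse(0xD1B54A32D192ED03));
        let y = y.swap_bytes(); // bswap is its own inverse
        y.wrapping_mul(mul_inverse(0x9E3779B97F4A7C15))
    }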
---------
> This is a great idea. I wonder if we could speed up LuaJIT even more by SIMD accelerating the GC's mark and/or sweep phases...
Mark and Sweep looks hard to SIMD-accelerate, in my opinion. At least, harder than a bump allocator. I'm not entirely sure what a SIMD-accelerated traversal of the heap is even supposed to look like (aka: simd-malloc() looks pretty hard).
If all allocs are prefix-sum'd across the SIMD units (ex: malloc({1, 4, 5, 1, 2, 3, 20, 10}) returns memory + {0, 1, 5, 10, 11, 13, 16, 36}) for a bump-allocator-like strategy... it's clear to me that such a mark/sweep allocator would have fragmentation issues. But I guess it would work...
Semispace collectors innately fix the fragmentation problem, so prefix-sum(size + header) allocation there is just simple and obvious.
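A rough Rust sketch of that prefix-sum bump allocation, with a scalar scan standing in for the SIMD prefix sum (the names and layout here are made up for illustration):

    use std::sync::atomic::{AtomicUsize, Ordering};

    // Each lane requests a size; an exclusive prefix sum gives per-lane offsets,
    // and one atomic add bumps the shared pointer for the whole batch.
    fn bump_alloc_batch(bump: &AtomicUsize, sizes: &[usize; 8]) -> [usize; 8] {
        let mut offsets = [0usize; 8];
        let mut total = 0;
        for (i, &size) in sizes.iter().enumerate() {
            offsets[i] = total; // exclusive prefix sum
            total += size;
        }
        let base = bump.fetch_add(total, Ordering::Relaxed); // single bump for all 8 lanes
        for offset in offsets.iter_mut() {
            *offset += base;
        }
        offsets // e.g. sizes {1,4,5,1,2,3,20,10} -> base + {0,1,5,10,11,13,16,36}
    }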
--------
On the "free" side of Mark/sweep... I think the Mark-and-sweep itself can be implemented in GPU-SIMD thanks to easy gather/scatter on GPUs.
However, because CPU gather/scatter is either missing (AVX2 has gather but no scatter) or slow (AVX-512 doesn't seem to implement a very efficient vgather or vscatter), I'm not sure SIMD would be a big advantage for a CPU-based Mark/Sweep.
------------
Yup yup. Semispace GC or bust, IMO anyway for the SIMD-world. Maybe mark-compact (since mark-compact would also fix the fragmentation issue).
The mark-phase is just breadth-first search, which seems like a doable SIMD pattern with the right data structure (breadth-first is easier to parallelize than depth-first).
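For what it's worth, here's a scalar Rust sketch of a frontier-based mark phase (the object layout is invented for illustration); the point is that every object in a frontier can be processed as one batch, which is the part a SIMD/GPU version would chew through with gather/scatter:

    struct Heap {
        marked: Vec<bool>,
        children: Vec<Vec<usize>>, // outgoing references per object
    }

    // Breadth-first mark: process one frontier (level) at a time.
    fn mark(heap: &mut Heap, roots: &[usize]) {
        let mut frontier: Vec<usize> = roots.to_vec();
        while !frontier.is_empty() {
            let mut next_frontier = Vec::new();
            for &obj in &frontier {
                if !heap.marked[obj] {
                    heap.marked[obj] = true;
                    next_frontier.extend_from_slice(&heap.children[obj]);
                }
            }
            frontier = next_frontier; // each level is an independent, batchable unit of work
        }
    }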
> Bit-reverse is unimplemented on x86 for some reason, but bswap64() is good enough.
You totally nerd-sniped me! I implemented a basic "reverse 128-bit SIMD register" routine with `packed_simd` in Rust. The idea is to process 4 bits at a time:
    let lo_nibbles = input & u8x16::splat(0x0F);
    let hi_nibbles = input >> 4;
Then, we can use `pshufb` to implement a lookup table for reversing each vector of nibbles.
    let lut = u8x16::new(0b0000, 0b1000, 0b0100, 0b1100, 0b0010, 0b1010, 0b0110, 0b1110, 0b0001, 0b1001, 0b0101, 0b1101, 0b0011, 0b1011, 0b0111, 0b1111);
    let lo_reversed = lut.shuffle1_dyn(lo_nibbles);
    let hi_reversed = lut.shuffle1_dyn(hi_nibbles);
Now that each nibble is reversed, we can flip the lo and hi nibbles within a byte when reassembling our u8x16.
    let bytes_reversed = (lo_reversed << 4) | hi_reversed;
Then, we can shuffle the bytes to get the final order. We could use a different byte permutation to handle reversing the f64s within an f64x2, too.
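For example (untested sketch), a full 128-bit reversal could reuse the same `shuffle1_dyn` trick with a constant index vector:

    let byte_rev = u8x16::new(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
    let fully_reversed = bytes_reversed.shuffle1_dyn(byte_rev);
    // Reversing each 64-bit lane independently (e.g. the two f64s) would instead
    // use indices 7..=0 followed by 15..=8.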
Looking at the disassembly, if we assumed our LUT and shuffle vectors are already in registers, the core shuffle should be pretty fast. (I haven't actually benchmarked this or run it through llvm-mca, though :-p)
The goal is to find the values of k1 and k2 that result in an evaluate(seed, k1, k2) score close to 16 bits (aka: 50% of bits change, the definition of the "avalanche condition"). There's probably some statistical test I could have done that'd be better, but GPUs have single-cycle popcount and single-cycle XOR.
I forget exactly which search methods I used, but note that a Vega64 GPU easily reaches 10 trillion multiplies per second, so you can exhaustively search a 32-bit space in about a millisecond, and a 40-bit space in just a couple of seconds.
You can therefore search the values of k1 and k2 ~8 bits at a time every few seconds. From there, plug in your favorite search algorithm (genetic algorithms? Gradient descent? Random search? Simulated annealing?).
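Roughly, an evaluate() of that shape looks like the following sketch (a simplified 32-bit version of the mixer, not the actual GPU kernel; the odd-forcing of the constants is just for illustration):

    // Hypothetical avalanche score: flip each input bit, XOR the two outputs,
    // popcount the difference, and average. Ideal is ~16 of 32 bits flipped.
    fn mix32(x: u32, k1: u32, k2: u32) -> u32 {
        x.wrapping_mul(k1 | 1).reverse_bits().wrapping_mul(k2 | 1) // force odd multipliers
    }

    fn evaluate(seed: u32, k1: u32, k2: u32) -> f64 {
        let base = mix32(seed, k1, k2);
        let mut flipped = 0u32;
        for bit in 0..32 {
            flipped += (mix32(seed ^ (1 << bit), k1, k2) ^ base).count_ones();
        }
        flipped as f64 / 32.0
    }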
--------
After that, I'd of course run it through PractRand or BigCrush (and other tests). In all honesty: random constants (with the bottom bit set to 1) pulled from /dev/urandom are already really good.
---------
Exhaustively searching the 64-bit space seems unreasonable, however. I was researching FNV hashes (another multiplication-based hash), trying to understand how they chose their constants.