> I suspect you would bottleneck on the PCIe bus before you saw improvement.

Heh, you're right. It's not even close. PCIe 3.0 x16 is slightly less than 16GB/s. And that's a one-way transfer; the second transfer back effectively halves the speed.

At 4GHz, this Base64 encoding / decoding scheme is doing 20GB/s of encoding (round numbers for simplicity). So literally, it is slower to transfer the data to the GPU than to use those AVX2 instructions.

Heck, it's slower on the one-way trip to the GPU, before it even comes back (and before the GPU even touches the data!)
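
To put rough numbers on it, here's a back-of-the-envelope sketch using only the round figures above (16GB/s for the bus, 20GB/s for the AVX2 code). The function names are made up for illustration. It assumes the two transfers don't overlap, and it ignores the 4/3 size change from Base64 itself, so treat it as an estimate rather than a benchmark.

```python
# Round numbers from the comment above; real PCIe throughput is a bit lower.
PCIE_X16_GB_PER_S = 16.0      # one-way bus bandwidth (upper bound)
AVX2_BASE64_GB_PER_S = 20.0   # claimed CPU encode rate at 4GHz


def seconds_for(gigabytes, gb_per_s):
    """Time to push `gigabytes` through a pipe running at `gb_per_s`."""
    return gigabytes / gb_per_s


def compare(gigabytes=1.0):
    """Return (cpu_time, one_way_bus_time, round_trip_bus_time) in seconds."""
    cpu = seconds_for(gigabytes, AVX2_BASE64_GB_PER_S)
    one_way = seconds_for(gigabytes, PCIE_X16_GB_PER_S)
    # Assumption: the return trip moves the same amount of data and does not
    # overlap with the outbound transfer.
    round_trip = 2 * one_way
    return cpu, one_way, round_trip


if __name__ == "__main__":
    cpu, one_way, round_trip = compare(1.0)
    print(f"CPU (AVX2):       {cpu * 1000:.1f} ms per GB")
    print(f"PCIe one way:     {one_way * 1000:.1f} ms per GB")
    print(f"PCIe round trip:  {round_trip * 1000:.1f} ms per GB")
```

With those inputs, the CPU finishes a gigabyte before the same gigabyte has even arrived on the GPU side, and that's with GPU compute time counted as zero.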

