In short, they spend roughly 1/5th of a cycle per byte of base64 by using AVX2 instructions to encode / decode it. So one cycle for every 5 bytes.
This includes robust error checking (allegedly).
Because SIMD-based execution was so successful in this example, I do wonder whether a GPGPU implementation would be worthwhile. If some very big data (MBs' worth) were Base64-encoded / decoded, would it be worth spending a costly PCIe transaction to transfer the data to the GPU and back? An email attachment, for example, is Base64-encoded.
I'd expect most web programs to handle only a few kilobytes (at most) of base64 data, however, and a few hundred bytes for cookies and the like. So the typical web use of base64 doesn't seem "big" enough to warrant a GPU, which makes the AVX2 approach ideal.
If your data is already in GPU memory for some reason then yeah, it's going to blast through the data at an insane rate, but getting it there in the first place is the problem.
Heh, you're right. It's not even close. PCIe 3.0 x16 is slightly less than 16GB/s, and that's a one-way transfer; the second transfer back effectively halves the speed.
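Where "slightly less than 16GB/s" comes from, as a back-of-the-envelope sketch. This assumes PCIe 3.0 (8 GT/s per lane, 128b/130b line encoding) and ignores packet/protocol overhead, so real-world throughput will be somewhat lower than this figure:

```python
def pcie3_bandwidth_gb_per_s(lanes: int = 16) -> float:
    """Approximate one-direction PCIe 3.0 payload bandwidth in GB/s.

    Assumption: raw 8 GT/s per lane with 128b/130b encoding; TLP headers
    and other protocol overhead are ignored, so this is an upper bound.
    """
    transfers_per_s = 8e9              # PCIe 3.0 signals at 8 GT/s per lane
    encoding_efficiency = 128 / 130    # 128b/130b line-encoding overhead
    bits_per_s = transfers_per_s * encoding_efficiency * lanes
    return bits_per_s / 8 / 1e9        # bits -> bytes -> GB


one_way = pcie3_bandwidth_gb_per_s(16)
# Per the comment above: if the trip back is not overlapped, the effective
# rate for "send, then receive" is half the one-way figure.
round_trip = one_way / 2
print(f"one-way x16: {one_way:.2f} GB/s, round trip: {round_trip:.2f} GB/s")
```

That works out to roughly 15.75 GB/s one way, or about 7.9 GB/s if the send and receive are serialized.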
At 4GHz, this Base64 encoding / decoding scheme does about 20GB/s (5 bytes per cycle × 4 billion cycles per second; round numbers for simplicity). So it is literally slower to transfer the data to the GPU than to just use those AVX2 instructions.
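The same comparison as a quick calculation, a sketch using the round numbers from this thread (0.2 cycles/byte at an assumed 4GHz clock, and ~16GB/s as a generous PCIe x16 one-way figure). The function name is just for illustration:

```python
def base64_throughput_gb_per_s(cycles_per_byte: float, clock_ghz: float) -> float:
    """CPU base64 throughput: (cycles per second) / (cycles per byte)."""
    return clock_ghz / cycles_per_byte


avx2 = base64_throughput_gb_per_s(cycles_per_byte=0.2, clock_ghz=4.0)  # ~20 GB/s
pcie_one_way = 16.0                  # generous round-up of PCIe x16, one direction
pcie_round_trip = pcie_one_way / 2   # serialized there-and-back, as argued above

print(f"AVX2 base64:     {avx2:.1f} GB/s")
print(f"PCIe one-way:    {pcie_one_way:.1f} GB/s")
print(f"PCIe round trip: {pcie_round_trip:.1f} GB/s")
```

Even before the GPU does any work, the bus alone is slower than the CPU doing the whole job.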
Heck, it's slower on the one-way trip to the GPU alone, before the data even comes back (and before the GPU has even touched it!)