
Base64 encoding and decoding at almost the speed of a memory copy - beefhash
https://arxiv.org/abs/1910.05109
======
cperciva
The code:
[https://github.com/WojciechMula/base64-avx512](https://github.com/WojciechMula/base64-avx512)

------
emirp
Great use of AVX-512 VBMI!

Note that not many Intel processors support the two specific CPU instructions
used (vpermb and vpmultishiftqb).

Cannon Lake Core i3 (not many of these around due to delays with Intel's 10nm
fabrication)

Ice Lake Core i3, i5 and i7: Released in September 2019

~~~
darkwater
Why then do they say "We use the SIMD (Single Instruction Multiple Data)
instruction set AVX-512 available on commodity processors"?

~~~
brazzy
Because those _are_ commodity processors nonetheless. The distinction relevant
for a research paper is between mainstream processors you can just buy vs.
custom made ones, not about market penetration.

~~~
dickeytk
I still think "commodity" is not the correct word even in that context. Mass-
market, or commercially available might be better. How can it be a commodity
when there is only one manufacturer?

~~~
penagwin
Just going off wikipedia (also other sources support the definition):

> In economics, a commodity is an economic good or service that has full or
> substantial fungibility: that is, the market treats instances of the good as
> equivalent or nearly so with no regard to who produced them.

Basically it's a common good that is bought and sold, with no modifications
required for this use case. This commodity is currently low in supply, but
it's still a commodity.

~~~
bloomer
It's missing the fungibility: a commodity is produced by multiple suppliers
and can be treated as equivalent by the end buyer. Large grade A eggs are a
commodity because it doesn't matter which farm produced them. A specific set
of processors from Intel is not a commodity, since there are no fungible
equivalents. "Widely available" would probably be a better description in
this case.

~~~
dickeytk
I would argue processors are never commodities anymore (maybe back in the 486
days it was closer). You can't swap an Intel for an AMD processor on the same
motherboard today. They don't have the same performance characteristics from
one to another: you could have two "3.5GHz 4-core processors" with wildly
different real-world performance. IMO this is the opposite of "fungible".

I think hard drives, memory, power supplies are much closer to being fungible.

------
londons_explore
If base64 decoding speed makes a difference in your application, you should be
considering _why_ you are transmitting data in base64. It's a neat hack from
the 1970s that is non-human-readable, wastes 30% of the network bandwidth,
kills compression, wastes RAM, is typically hard to random-access, and is
generally a bad idea all round.

~~~
_pmf_
How does it kill compression?

~~~
londons_explore
Try it...

    
    
        wget http://mattmahoney.net/dc/enwik8.zip -O - | gunzip | head -c 1000000 | gzip -9 | wc -c
        355350
        wget http://mattmahoney.net/dc/enwik8.zip -O - | gunzip | head -c 1000000 | base64 | gzip -9 | wc -c
        528781
    

base64 is 48.8% larger after compression on English text, whereas it would
only be 33.3% larger without compression. The reason is that compression
finds and eliminates repeating patterns, but base64 can make the same input
data look totally different depending on which of the 4 input alignments it
has.
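You can see the alignment effect directly with the standard library (an
illustrative Python snippet, not from the paper):

```python
import base64

data = b"hello world hello world"
# Same payload, shifted by one byte: every 3-byte input group now straddles
# a different boundary, so the two encodings share no common substrings,
# and gzip can no longer find the repeated "hello world".
aligned = base64.b64encode(data)
shifted = base64.b64encode(b"X" + data)
print(aligned)
print(shifted)
```

Since gzip works by deduplicating repeated byte sequences, scrambling every
repeat into one of four unrelated spellings is what destroys the ratio.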

~~~
xxs
Oddly enough, LZ/deflate is about as ancient as base64. 16-bit deflate is
quite poor and slow, yet predominant.

------
nathell
Wojciech Muła really groks SIMD. It's well worth exploring his other work in
that area:
[http://0x80.pl/articles/index.html](http://0x80.pl/articles/index.html)

------
usr1106
Why is that an achievement? Isn't it the case that during a memory copy the
CPU is basically idling, because memory is so much slower than the CPU?
Isn't it hundreds of CPU cycles per memory access? So instead of having it
wait, it could also do computations (just standard instructions). Or is
base64 computationally so heavy that it cannot fit into the "gaps"? I
certainly have not tried, just thinking from the high-level view that CPUs
are incredibly fast and main memory is incredibly slow in comparison. And of
course assuming that the data does not fit into any cache.

~~~
vardump
You're mostly right. It's often not that hard to completely saturate the
memory bus.

However, base64 has weird mappings that take some processing to undo in SIMD:
you can't use ordinary lookup tables, and you need to regroup bits from 8-bit
to 6-bit width. That takes a lot of cycles without specialized
bit-manipulation instructions.
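For reference, the regrouping in question looks like this in scalar code (an
illustrative sketch, not the paper's implementation — the AVX-512 version
does the same thing 48 bytes at a time):

```python
import string

# Standard base64 alphabet; LUT maps a character to its 6-bit value.
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
LUT = {c: i for i, c in enumerate(B64)}

def decode_quad(quad):
    """Regroup four 6-bit values into three 8-bit bytes."""
    a, b, c, d = (LUT[ch] for ch in quad)
    n = (a << 18) | (b << 12) | (c << 6) | d   # pack 24 bits
    return bytes([(n >> 16) & 0xFF, (n >> 8) & 0xFF, n & 0xFF])

print(decode_quad("TWFu"))  # the classic example: b'Man'
```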

Also, the data you'll need to process is often already hot in the CPU caches;
disk/network I/O can be DMA'd directly into L3 cache.

So I think this decoder is a useful contribution, but also somewhat of a
no-brainer once you have those AVX-512 instructions available.

~~~
londons_explore
Why can't you use lookup tables?

~~~
fanf2
Part of the point of this paper is that the lookup tables for base64 fit into
a 512-bit vector register.
~~~
vardump
Notably, this technique only works in _this_ special case (or for other small
LUTs). Generic (larger) lookup tables can't be vectorized yet.

------
rurban
Not yet usable as-is. Normally, AVX-512 instructions are subject to
downclocking. Here they tested a very recent CPU that is no longer subject to
downclocking. They didn't add code to test for the CPU revision at which
downclocking stopped, and they didn't benchmark on those older CPUs. We can
only guess that the older AVX2 code is faster there, but not by how much.

~~~
wmu
Speaking of benchmarking against older CPUs: the paper falls into the "short
communication" category; we didn't want to discuss all possible hardware
configurations.

------
m0zg
Many people don't realize this, but today's memory is dramatically
underpowered for today's CPUs. Consider the following: you can,
theoretically, get ~80GB/sec of memory bandwidth from a modern Intel CPU
(assuming you use all channels, there are no interrupts, etc.). That's 20
billion floats or int32s per second. The same CPU can do ~2.3 fp32 TFLOPS.
That's 115 ops for every fp32 loaded. And it's getting worse as more and
more cores are put on the same memory bus.
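The arithmetic behind that ratio, spelled out:

```python
mem_bw = 80e9            # bytes/s, theoretical multi-channel DRAM bandwidth
peak = 2.3e12            # fp32 FLOPS quoted above
floats_per_s = mem_bw / 4      # an fp32 is 4 bytes
ratio = peak / floats_per_s
print(ratio)             # 115.0 ops available per float streamed from RAM
```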

~~~
Const-me
Your numbers look strange. Did you quote them from a $10,000 Xeon Platinum
that very few people use?

In my current desktop PC, it’s 700 GFLOPS of theoretical compute versus
50GB/sec of theoretical RAM bandwidth.

~~~
m0zg
For the i9-7980XE, Intel quotes 1.3 TFLOPS of fp64, so it ought to do about
twice that at fp32. Yours at $1500+tax from Newegg:
[https://www.newegg.com/core-i9-x-series-extreme-edition-
inte...](https://www.newegg.com/core-i9-x-series-extreme-edition-intel-
core-i9-7980xe/p/N82E16819117836). Quad-channel RAM, too.

For a sanity check, consider that the chip has 18 cores and runs at ~4.2GHz.
To achieve the theoretical throughput, each core has to do ~30
single-precision flops per cycle. Which sounds about right: HEDT Intel chips
can (theoretically) do 32 of those per cycle.
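That sanity check as numbers:

```python
cores, clock, fp32_per_cycle = 18, 4.2e9, 32   # i9-7980XE, per the post
peak = cores * clock * fp32_per_cycle
print(peak / 1e12)                  # ~2.42 theoretical TFLOPS
print(2.3e12 / (cores * clock))     # ~30.4 fp32 ops/cycle/core needed
```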

This, of course, ignores whatever throttling Intel would need to apply to
maintain TDP, so it's a purely theoretical back-of-the-envelope estimate.

------
taspeotis
The authors have been doing good work in this area for a while:
[https://arxiv.org/abs/1704.00605](https://arxiv.org/abs/1704.00605)

------
powturbo
Turbobase64: a portable scalar implementation can beat SIMD and saturate the
fastest networks and fastest SSDs.

It is faster than SSE on Intel/AMD and NEON SIMD on ARM; see the benchmark at
Turbo Base64:
[https://github.com/powturbo/TurboBase64](https://github.com/powturbo/TurboBase64)

~~~
tomsmeding
In the readme for Turbobase64, they state:

> Only the non portable AVX2 based libraries are faster

The paper in the OP is about 2 times as fast as their previous AVX2 code. I
think they are superior to Turbobase64 here (though with more stringent
hardware requirements).

~~~
powturbo
I agree, and according to the Steam hardware survey only 1% are using AVX2.
AVX-512 is practically unused. And don't forget the billions of ARM CPUs.

~~~
sambe
Their numbers are surely wrong? Firstly, AVX2 jumped from 1% to 5% in a month.
Secondly, it looks to have been on every Intel processor this decade.

~~~
eMSF
Even 5% sounds sketchy indeed. However, AVX2 (or Haswell New Instructions)
hasn't quite been on every Intel processor this decade; rather, only since
the 4th generation (late 2013), and only in Core-branded processors.

------
mensetmanusman
iOS 13 allows the Shortcuts app to encode/decode information as Base64 as part
of the scripting language.

People are using this to do clever things like encoding a watermark image to
overlay on a picture, because the Shortcuts app does not allow file
attachments as part of the scripting language.

~~~
ZeikJT
So this could've been done in the past with a manually written decoder, but
now it's built in? That's great.

------
vardump
Now show me a memory-speed SIMD uint8_t histogram, and I'll be happy.
Computing a histogram is slow, and it doesn't really vectorize. Yet it's
required by many important algorithms, like data compression.
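For context, the usual scalar trick (used by fast histogram code such as
TurboHist, sketched here in Python purely as an illustration) is to split
the counts across several tables so that runs of identical bytes don't
serialize on a single counter:

```python
def hist_u8(data, ways=4):
    # Several count tables break the read-modify-write dependency chain on
    # repeated bytes; the tables are summed at the end.
    tables = [[0] * 256 for _ in range(ways)]
    for i, b in enumerate(data):
        tables[i % ways][b] += 1
    return [sum(t[v] for t in tables) for v in range(256)]

h = hist_u8(b"aaabbc")
print(h[ord("a")], h[ord("b")], h[ord("c")])  # 3 2 1
```

Even with this trick, the per-byte table update is what keeps scalar
histograms well above the 0.2 cycles/byte wished for above.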

~~~
powturbo
Not memory speed but extremely fast:
[https://github.com/powturbo/TurboHist](https://github.com/powturbo/TurboHist)

~~~
vardump
Yeah, it's nice; about the best you can achieve. I'd be happy if it were
about 8 times faster: less than 0.2 cycles per byte would be good.

Unfortunately, that's just not achievable with current x86 instruction set.

~~~
powturbo
There is a new AVX-512 instruction, vpconflictd (the _mm512_conflict_epi32
intrinsic), that is supposed to solve this, but it can't make histogram
construction faster than the scalar functions.

------
robocat
Unfortunately, using AVX-512 instructions only gets a speedup in _very_
specific situations, and for many real-world use cases it actually
underperforms due to oddities of frequency scaling and switching delays.
Profiling AVX-512 code in isolation for more than a few milliseconds is one
place you see phantom gains, so take care not to be deceived.

See

[https://blog.cloudflare.com/on-the-dangers-of-intels-
frequen...](https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-
scaling/)

[https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-
us...](https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-
new-instructions/)

[https://news.ycombinator.com/item?id=21029417](https://news.ycombinator.com/item?id=21029417)

Edit: I'm not saying this isn't a true benefit here, just that claims of
speed when using AVX-512 need to be treated with fair scepticism for actual
use cases.

~~~
cperciva
_[https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-
us...](https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-
new-instructions/) _

Considering that the author of that blog is one of the authors of this paper,
I think he's very aware of the benchmarking issues.

~~~
rurban
He is aware, but sidestepped these issues. So this code is only recommended
on the newest Cannon Lake processors, but we really want to know which
method is best for which CPU. What about AMD Rome, e.g.?

~~~
justin66
Since AVX-512 does not exist on AMD Rome, that question answers itself.

------
vagab0nd
Question: with this kind of optimization, how does the program run different
functions on different CPUs, like the ones that don't support AVX-512? Some
kind of runtime dispatch based on CPUID?
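The usual pattern is exactly that: probe CPU features once at startup and
bind a function pointer. A hedged sketch (the /proc/cpuinfo parsing is
Linux-specific and purely illustrative; real libraries query CPUID directly
or use compiler builtins such as GCC's __builtin_cpu_supports, and the
"AVX-512 kernel" here is a stand-in, not the paper's code):

```python
import base64

def cpu_flags():
    # Linux-only illustration; returns an empty set elsewhere.
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

def decode_scalar(s):
    return base64.b64decode(s)

def decode_avx512vbmi(s):       # stand-in for the vectorized kernel
    return base64.b64decode(s)

# One-time selection; every later call goes through `decode`.
decode = decode_avx512vbmi if "avx512vbmi" in cpu_flags() else decode_scalar
print(decode(b"TWFu"))  # b'Man'
```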

------
pabs3
I wonder if the authors plan to get this merged into commonly used
implementations of base64 so that folks can benefit from their research.

~~~
wmu
For instance, our SSE/AVX2 algorithms have been included in a great, mature
library written and maintained by Alfred Klomp:
[https://github.com/aklomp/base64](https://github.com/aklomp/base64) (the
library also includes vectorized code for ARM CPUs).

~~~
pabs3
Is that library widely used, for example is it used by Firefox or Chromium?

------
jason0597
Now this is the programming I like: low-level assembly optimisation of long
operations, none of this web development rubbish.

~~~
panic
This is relevant to webdev, though -- binary data in JSON is often encoded as
a Base64 string.

~~~
wingi
Why use 6-bit data if JSON can transport 8-bit (minus 1 for quoting the ")?

~~~
cryptonector
Byte values that are ASCII control characters need to be escaped, and byte
values that are not valid UTF-8 can't be represented in JSON strings.
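Concretely, with Python's json module as the illustration:

```python
import base64
import json

raw = bytes([0x00, 0xFF, 0x80])    # a control byte plus invalid UTF-8
try:
    raw.decode("utf-8")            # can't even form a JSON string from it
except UnicodeDecodeError:
    print("not representable as a JSON string")

# So binary payloads ride along as base64 text instead:
doc = json.dumps({"blob": base64.b64encode(raw).decode("ascii")})
print(doc)
```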

------
zuck9
Also worth checking out:
[https://github.com/superhuman/fast64](https://github.com/superhuman/fast64)

That is what Superhuman uses for decoding base64 in browser.

~~~
vardump
I don't see how that's relevant; it looks like a pretty standard approach to
base64 decoding. You can find thousands of similar examples.

~~~
lifthrasiir
Not only is it just a standard approach, it even misses a relatively common
optimization for base64 decoding: instead of computing `(lut[a] << 6) |
lut[b]` etc., one can precompute `lut6[x] = lut[x] << 6` and compute `lut6[a]
| lut[b]` to avoid shifting. This optimization is famously used by Nick
Galbreath's MODP_B64 decoder, which is used by Chromium [1] and turns out to
be the most performant non-SIMD decoder according to Lemire et al. [2]

[1]
[https://github.com/chromium/chromium/tree/master/third_party...](https://github.com/chromium/chromium/tree/master/third_party/modp_b64)

[2]
[https://github.com/lemire/fastbase64](https://github.com/lemire/fastbase64)
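The trick in code, generalizing the lut6 idea to all four character
positions (an illustrative Python sketch; MODP_B64's actual C tables differ
in layout):

```python
import string

B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
val = {c: i for i, c in enumerate(B64)}

# Precompute one pre-shifted table per position, so the inner loop is pure
# table lookups and ORs, with no shifts at decode time.
lut18 = {c: v << 18 for c, v in val.items()}
lut12 = {c: v << 12 for c, v in val.items()}
lut6 = {c: v << 6 for c, v in val.items()}

def decode_quad(q):
    n = lut18[q[0]] | lut12[q[1]] | lut6[q[2]] | val[q[3]]
    return bytes([(n >> 16) & 0xFF, (n >> 8) & 0xFF, n & 0xFF])

print(decode_quad("TWFu"))  # b'Man'
```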

~~~
powturbo
You can do better, without SIMD:
[https://github.com/powturbo/TurboBase64](https://github.com/powturbo/TurboBase64)

The simple base64 scalar version is also faster than the Chromium
implementation.

