
Fast base64 encoding and decoding - joeyespo
https://lemire.me/blog/2018/01/17/ridiculously-fast-base64-encoding-and-decoding/
======
pmarcelll
Last fall I tried to figure out why Rust is slower than D in a base64
benchmark
([https://github.com/kostya/benchmarks](https://github.com/kostya/benchmarks)).
It turned out that decoding in Rust is really fast but encoding is a bit slow.
I looked at the assembly output of both programs; I could more or less follow
the steps in the Rust version, but the D version was only about 5 SIMD
instructions (without a single bit shift). I also checked the source code: the
Rust version is fairly long, because part of the hot loop is manually
unrolled, but the D version looked like the simple, naive implementation in C.
It might be the result of a newer LLVM version (LDC was already using LLVM 5
while rustc is still on LLVM 4), or rustc might still have issues with
optimizer settings (both binaries were compiled for Intel Haswell).
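
For reference, the naive hot loop is tiny; a minimal sketch in C (my sketch,
whole 3-byte groups only, with padding handled elsewhere):

    #include <stddef.h>
    #include <stdint.h>

    static const char tbl[64] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    /* 3 input bytes -> 4 output characters per iteration. */
    static void b64_encode_blocks(const uint8_t *src, char *dst, size_t n3) {
        for (size_t i = 0; i < n3; i++, src += 3, dst += 4) {
            uint32_t v = (uint32_t)src[0] << 16 | (uint32_t)src[1] << 8 | src[2];
            dst[0] = tbl[(v >> 18) & 63];
            dst[1] = tbl[(v >> 12) & 63];
            dst[2] = tbl[(v >> 6) & 63];
            dst[3] = tbl[v & 63];
        }
    }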

~~~
jnordwick
Do you have LLVM IR output? There are definitely places where rust's hinting
isn't optimal, and that might not be too difficult to fix.

~~~
pmarcelll
I didn't look at the LLVM IR (since I'm not really familiar with it), but I
might investigate it a bit further in the near future.

~~~
jnordwick
I would encourage it. LLVM IR is fairly easy to understand. One of my first
forays into rustc was looking at the IR for iterators to see why one version
was slow (it turned out that the bounds on the index variable couldn't be
proven tight enough to eliminate the bounds check, iirc).

Try writing the same simple code as D and compare the two IR representations.
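
Both toolchains can dump the IR directly; assuming reasonably recent rustc
and LDC, something like this (file names are placeholders):

    rustc -O --emit=llvm-ir -C target-cpu=haswell b64.rs
    ldc2 -O3 -output-ll -mcpu=haswell b64.d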

And no matter how much I slag on Rust for being difficult, the community is
really good about helping and teaching.

------
floatboth
I made Go bindings for that library
[https://github.com/myfreeweb/go-base64-simd](https://github.com/myfreeweb/go-base64-simd) …

…to use in a mail indexer project…

…only to discover that half of my mail was not being decoded, because that
decoder does not skip whitespace.
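
The workaround is an extra compaction pass before calling the strict decoder;
a sketch of the idea in C (a hypothetical helper, not part of the bindings or
of the upstream library):

    #include <stddef.h>
    #include <stdint.h>

    /* Drop MIME whitespace in place so a strict decoder can run on the
       compacted buffer; returns the new length. */
    static size_t despace(uint8_t *buf, size_t len) {
        size_t w = 0;
        for (size_t r = 0; r < len; r++) {
            uint8_t c = buf[r];
            if (c != ' ' && c != '\t' && c != '\r' && c != '\n')
                buf[w++] = c;
        }
        return w;
    }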

~~~
jorangreef
Did you find any bugs in the library when implementing your binding?

Daniel Lemire has done brilliant work. His
[https://github.com/lemire/despacer](https://github.com/lemire/despacer) looks
great.

I wonder, though, whether their Base64 SIMD algorithms could be tweaked to do
the decode in a single pass.

Like you, for something email-related, I have written a native Base64 addon
for Node in C, optimized for decoding MIME-wrapped Base64 in a single pass
([https://github.com/ronomon/base64](https://github.com/ronomon/base64)). It's
not yet using SIMD instructions but it's 1.5-2x as fast as Node's routines
with top speeds of 1398.18 MB/s decoding and 1048.58 MB/s encoding on a single
core (Intel Xeon E5-1620 v3 @ 3.50GHz). Here is an explanation of the
difference:
[https://github.com/nodejs/node/issues/12114](https://github.com/nodejs/node/issues/12114)

The Ronomon Base64 addon also ships with a fuzz test
([https://github.com/ronomon/base64/blob/master/test.js](https://github.com/ronomon/base64/blob/master/test.js))
which you are welcome to try out on your Go binding. It picked up an
interesting bug in Node's decode routine where some white space was actually
decoded as data
([https://github.com/nodejs/node/issues/11987](https://github.com/nodejs/node/issues/11987)).

------
dchest
_Base64 is commonly used in cryptography to exchange keys._

Note that if you're encoding or decoding secret data, such as keys, you most
likely want a _constant-time_ encoder (example:
[https://boringssl.googlesource.com/boringssl/+/master/crypto...](https://boringssl.googlesource.com/boringssl/+/master/crypto/base64/base64.c)),
not the fastest one.
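
The heart of such an encoder is a branchless, table-free map from each 6-bit
group to its output character; a minimal sketch in C along the lines of the
BoringSSL/libsodium routines:

    #include <stdint.h>

    /* All-ones mask if the comparison holds, zero otherwise
       (valid for operands below 2^31). */
    static uint32_t lt(uint32_t a, uint32_t b) { return 0u - ((a - b) >> 31); }
    static uint32_t ge(uint32_t a, uint32_t b) { return ~lt(a, b); }
    static uint32_t eq(uint32_t a, uint32_t b) { return 0u - (((a ^ b) - 1) >> 31); }

    /* Data-independent map of a 6-bit value to its base64 character. */
    static char b64_char(uint32_t x) {
        return (char)((lt(x, 26)             & (x + 'A')) |
                      (ge(x, 26) & lt(x, 52) & (x + 'a' - 26)) |
                      (ge(x, 52) & lt(x, 62) & (x + '0' - 52)) |
                      (eq(x, 62) & '+') |
                      (eq(x, 63) & '/'));
    }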

~~~
sd8dgf8ds8g8dsg
You seem to be confusing encoding (base64, for example) with encryption (AES,
for example).

While a side-channel attack studying decryption speed/power use can reveal
information, the encoding of said encrypted data cannot, as far as I have
ever heard. (It will still be encrypted after base64 decoding has taken
place.)

Please correct me if I'm wrong.

EDIT: I see you updated your original message with different wording, please
ignore this rant.

~~~
CiPHPerCoder
Side-channel attacks against base64 encoding have yet to be proven, but
constant-time implementations are available _just in case_.

My contribution is here:
[https://github.com/paragonie/constant_time_encoding](https://github.com/paragonie/constant_time_encoding)

~~~
nickpsecurity
I'll add to that a high-assurance implementation from Galois, which is
probably not constant time. Their blog and GitHub have quite a few useful
tools.

[https://galois.com/blog/2013/09/high-assurance-base64/](https://galois.com/blog/2013/09/high-assurance-base64/)

Also, anyone wanting a constant-time implementation might just run a verified
implementation through something like FaCT or Jasmin:

[https://cseweb.ucsd.edu/~dstefan/pubs/cauligi:2017:fact.pdf](https://cseweb.ucsd.edu/~dstefan/pubs/cauligi:2017:fact.pdf)

[https://acmccs.github.io/papers/p1807-almeidaA.pdf](https://acmccs.github.io/papers/p1807-almeidaA.pdf)

------
dragontamer
[https://arxiv.org/abs/1704.00605](https://arxiv.org/abs/1704.00605)

In short, they spend roughly (EDIT: Herp-derp) 1/5th of a cycle per byte
through the use of AVX2 instructions to encode/decode base64. So one cycle
every 5 bytes.

This includes robust error checking (allegedly).

\---------

Because SIMD-based execution was so successful in this example, I do wonder
if a GPGPU implementation would be worthwhile. If some very large chunk of
data (MBs' worth) were Base64-encoded/decoded, would it be worth spending a
costly PCIe transaction to transfer the data to the GPU and back? An email
attachment, for example, is Base64 encoded.

I'd expect most web uses to involve only a few kilobytes (at most) of base64
data, however: a few hundred bytes for cookies and the like. So the typical
web case of base64 doesn't seem "big" enough to warrant a GPU, and the AVX2
methodology would be ideal.

~~~
jandrese
I suspect you would bottleneck on the PCIe bus before you saw improvement.
GPUs usually aren't great at problems where you have a huge stream of data to
work with and not a huge amount of processing needed per byte.

If your data is already in GPU memory for some reason then yeah, it's going to
blast through the data at an insane rate, but getting it there in the first
place is the problem.

~~~
dragontamer
> I suspect you would bottleneck on the PCIe bus before you saw improvement.

Heh, you're right. It's not even close. PCIe x16 is slightly less than 16GB/s.
And that's a 1-way transfer, the 2nd transfer back effectively halves the
speed.

At 4GHz, this Base64 encoding / decoding scheme is doing 20GB/s of encoding
(round numbers for simplicity). So literally, it is slower to transfer the
data to the GPU than to use those AVX2 instructions.

Heck, it's slower on the one-way trip to the GPU, before it even comes back
(and before the GPU even touches the data!)

------
skerit
It's a shame nobody is really using yEnc instead of base64. yEnc has an
overhead of only about 2%, compared to 33% for base64, and would be perfect
for use on webpages.

~~~
NL807
yEnc has its fair share of problems with reliable encoding and decoding for
some character sets (UTF-8, for example). Normally this is solved with MIME
types, but yEnc is not officially registered, nor does it have an official
RFC standard. base64, on the other hand, "just works".

------
cperciva
Note that this is considerably faster than the amount of memory bandwidth
available per core.

~~~
dragontamer
No it's not, at least if we're talking L2 cache or faster.

Intel processors support two loads + one write per core. That can be 2x256-bit
reads + 1x256-bit write. Each clock, every clock. At least to L1 and L2 (the
only levels of cache which have that level of bandwidth).

AMD Ryzen supports 2x128-bit reads OR 1x128-bit write (1 read + 1 write is
fine).

\----------

The kicker is that you must use AVX2 registers to achieve this bandwidth. If
you're loading / storing 64-bit registers like RAX, you can't get there.

A raw "copy" between locations (assuming they're entirely in L2 memory or
faster) can be done by Skylake at 32b per cycle. The speed of this Base64
encoder however is "just" at 5bytes per cycle.
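
To illustrate (a rough sketch, not tuned code): getting anywhere near those
numbers means moving full 256-bit registers per instruction, e.g.:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* 32 bytes per load/store; two loads + one store per cycle is the
       theoretical per-core ceiling on Skylake. */
    static void copy_avx2(uint8_t *dst, const uint8_t *src, size_t n) {
        for (size_t i = 0; i + 32 <= n; i += 32) {
            __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
            _mm256_storeu_si256((__m256i *)(dst + i), v);
        }
    }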

I'm curious how their code would perform on AMD's Ryzen platform, because of
Ryzen's lower bandwidth. Ryzen's AVX is also split into 2x128-bit ports,
instead of Intel's 2x256-bit ports. I'd expect Ryzen to do worse, but maybe
there are other bottlenecks in the code?

~~~
BeeOnRope
> Intel processors support two loads + one write per core. That can be
> 2x256-bit reads + 1x256-bit write. Each clock, every clock. At least to L1
> and L2...

You can approach 2 reads + 1 write in L1, but you won't get close for data in
L2, especially when writes are involved. Writes dirty lines in the L1, which
later need to be evicted, and those evictions eat up both L1 read-port and L2
write-port bandwidth.

You'll find that with pure writes to a data set that fits in L2 but not L1,
you get only 1 cache line every 3 cycles, so nowhere close to 1 per cycle. If
you add reads, things will slow down a bit more.

Even Intel documents their L2 bandwidth at only 32 bytes/cycle max, and 25
bytes per cycle sustained (on Broadwell - slightly better for Skylake
client), and that's "one way", i.e., read-only loads. When you have writes
and have to share with evictions, it gets worse. So again, nowhere close to
the 96 bytes/cycle you'd get if you could do two 32-byte reads and one
32-byte store per cycle.

Some details are discussed here (although it's for Haswell, it mostly applies
to everything up to but not including Skylake-X):

[https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/532346](https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/532346)

~~~
dragontamer
I appreciate your thoughts on this.

I agree with your argument. But... I feel like I should defend my thoughts a
bit. My statements were based on this Intel article:
[https://software.intel.com/en-us/articles/why-efficient-use-of-the-memory-subsystem-is-critical-to-performance](https://software.intel.com/en-us/articles/why-efficient-use-of-the-memory-subsystem-is-critical-to-performance)

It suggests L1 and L2 cache bandwidth are the same. But when I look at the
sources you point out, your statements seem correct. I'll trust your
benchmark-based sources over a single graph (probably derived from
theoretical analysis). Thanks for the info.

~~~
BeeOnRope
To be fair, unless I missed it, that article is mostly just saying "The L1 and
L2 are much faster than L3 and beyond". That's correct - but it doesn't imply
that the L2 is somehow as fast as the L1 (if that were true, why have the
distinction between L1 and L2 at all?).

Once you go to the L3, you suffer a large jump in latency (something like 12
cycles for L2 to ~40 cycles for L3 - and on some multi-socket configurations
this gets worse) and you are now sharing the cache with all the other cores on
the chip, so the advice to "keep it in L1 and L2" makes a lot of sense.

Intel used to claim their L2 had 64 bytes per cycle bandwidth - and indeed
there is probably a bus between the L1 and L2 capable of transferring 64 bytes
per cycle in some direction: but you could never achieve that since unlike the
core to L1 interface, which has a "simple" performance model, the L1<->L2
interface is complicated by the facts that (a) it must accept evictions from
L1, (b) the L1 itself doesn't have an unlimited number of ports to
simultaneously accept incoming 64-byte lines from L2, and (c) everything is
cache-line based.

The upshot is that even ignoring (c) you never get more than 32 bytes per
cycle from L2, and usually less. Intel recently changed all the mentions of
"64 bytes per cycle for L2" to "32 bytes max, ~25 bytes sustained" in their
optimization manual in recognition of this.

Once you consider (c) you might get even way less bandwidth from L2: in L1, on
modern Intel, your requests can be scattered around more or less randomly and
you'll get the expected performance (there is some small penalty for splitting
a cache line: it counts "double" against your read/write limits), but with L2
you will get worse performance with scattered reads and writes since every
access is essentially amplified to a full cache line.

It turns out that L2 writes are especially bad:

[https://stackoverflow.com/q/47851120/149138](https://stackoverflow.com/q/47851120/149138)

------
nwmcsween
The benchmarks show the 'generic' (aka SWAR) version, with or without OpenMP,
is the best for almost all purposes. The only thing I would change is the
function signature, to size_t base64_enc(const char *src, char *dst, size_t
len), so that it returns the written size.

------
faragon
Another fast implementation (plain C99, without SIMD), supporting buffer
aliasing (i.e. allowing in-place encoding/decoding; see the senc_b64() and
sdec_b64() functions):
[https://github.com/faragon/libsrt/blob/master/src/saux/senc....](https://github.com/faragon/libsrt/blob/master/src/saux/senc.c)
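
In-place decoding works because each 4-character input group shrinks to 3
output bytes, so the write cursor never overtakes the read cursor. A rough
sketch of the idea in C (not libsrt's actual API; validation and padding
handling elided):

    #include <stddef.h>
    #include <stdint.h>

    static int b64_val(uint8_t c) {
        if (c >= 'A' && c <= 'Z') return c - 'A';
        if (c >= 'a' && c <= 'z') return c - 'a' + 26;
        if (c >= '0' && c <= '9') return c - '0' + 52;
        if (c == '+') return 62;
        if (c == '/') return 63;
        return -1;  /* '=' padding and anything else */
    }

    /* Decode base64 in place; returns the decoded length. */
    static size_t b64_decode_inplace(uint8_t *buf, size_t len) {
        size_t r = 0, w = 0;
        while (r + 4 <= len) {
            int a = b64_val(buf[r]), b = b64_val(buf[r + 1]);
            int c = b64_val(buf[r + 2]), d = b64_val(buf[r + 3]);
            r += 4;
            buf[w++] = (uint8_t)(a << 2 | b >> 4);
            if (c >= 0) buf[w++] = (uint8_t)(b << 4 | c >> 2);
            if (d >= 0) buf[w++] = (uint8_t)(c << 6 | d);
        }
        return w;
    }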

------
powturbo
In my benchmarks, only the AVX2 version is faster than the portable scalar
"turbobase64":
[https://github.com/powturbo/TurboBase64](https://github.com/powturbo/TurboBase64)

SIMD does not always mean faster.

------
tzahola
I hate base64. If you have to embed binary data within human-readable text,
then you’re either

A: doing something horribly wrong

B: or forced to use a monkey-patched legacy standard

Otherwise just use hexadecimal digits. At least it will be _somewhat_
readable.

I know, I know... Base64 encodes 3 bytes as 4 characters, while hex encodes 2
bytes as 4. Big fucking deal. It will almost surely be gzipped down the line,
and if you’re storing amounts of data where this would count, then see point
“A” above.

I hope one day we will laugh at the mere idea of base64 encoding the way we
now do at EBCDIC or Windows-12XX codepages.

~~~
_wmd
Base64 has been around forever, and hex has significantly lower information
density: base64 output is consistently more than 30% smaller than hex.

As for doing something horribly wrong, I dunno, ensuring literally billions of
moms and pops can exchange baby pictures by e-mail, for the first time in
human history, and doing so since the time when we were using 33.6kbit modems,
that seems like a resounding success.†

† Yes, if e-mail had not paid such close attention to compatibility, and just
ripped up every old standard on every chance it got (i.e. pretty much like any
technology invented since the commoditization of the web), interoperability
would never have happened. Legacy crap is fundamentally a part of that
success, and Base64 is fundamentally a part of making those hacks practical
for the technology we had at the time††

†† To head off an obvious counterargument: yes, there are contemporary
examples of why this strategy is still important (it's, so far, the only one
that has ever worked in the decentralized environment of the Internet), and is
very likely to come in useful in future, time and time again

~~~
hackcasual
Hex has lower information density; however, it is more compressible,
particularly if your data is byte-aligned. For example, the string "aaaaaa"
is base64-encoded as "YWFhYWFh" but in hex is "616161616161". On a project
I'm on, we switched from base64 to hex for some binary data embedded in JSON
and saw a ~20% size reduction, since it's always compressed by the web
server.

~~~
_wmd
Today I learned!

    
    
        >>> import random,zlib,codecs
        >>> sample = random.getrandbits(1024).to_bytes(1024, 'big') * 1024
        >>> len(zlib.compress(codecs.encode(sample, 'base64')))
        22083
        >>> len(zlib.compress(codecs.encode(sample, 'hex')))
        4326
        >>>
    

edit: (25 minutes of digging around the Internet, and I find this):

    
    
        >>> sample = os.urandom(1024) * 1024
    
        >>> len(sample.encode('base64'))
        1416501
        >>> len(sample.encode('hex'))
        2097152
        >>> len(sample.encode('ascii85'))
        1310720
    
        >>> len(zlib.compress(sample.encode('base64')))
        82169
        >>> len(zlib.compress(sample.encode('hex')))
        12537
        >>> len(zlib.compress(sample.encode('ascii85')))
        8212
    

Had to use Python 2.x to make use of the 'hackercodecs' package that
implements ascii85, but looks like it's the best of both worlds, assuming its
character set suits whatever medium you are transferring over, and decoding it
on the other end doesn't require some slow code.

final edit: I'm guessing it was an accidental trick of the repetitive input
data lining up well. On real data hex still wins out (which should have been
obvious in hindsight):

    
    
        >>> sample = ''.join(sorted(open('/usr/share/dict/words').readlines()))
    
        >>> len(zlib.compress(sample.encode('ascii85')))
        1320809
        >>> len(zlib.compress(sample.encode('base64')))
        1116678
        >>> len(zlib.compress(sample.encode('hex')))
        880651

~~~
andreareina
Huh, my results are drastically different:

    
    
        $ dd if=/dev/urandom bs=512 count=2048 | base64 | gzip | wc
        ...
        4088   31928 1059085
    
        $ dd if=/dev/urandom bs=512 count=2048 | xxd -p | gzip | wc
        ...
        5019   33798 1231268
    

1 megabyte of random data consistently results in ~1 megabyte of compressed
base64 text, ~1.2 megabytes of compressed hex.

~~~
Twirrim
Repeating the exercise using a photograph
([https://imgur.com/B4tqkrZ](https://imgur.com/B4tqkrZ)):

    
    
        $ pv lock-your-screen.png | base64 | gzip | wc
         2.1MiB 0:00:00 [  13MiB/s] [================================>] 100%
           8641   48904 2264151
    
    
        $ pv lock-your-screen.png | xxd -p | gzip | wc
         2.1MiB 0:00:00 [4.41MiB/s] [================================>] 100%
          10109   49956 2573293
    
    

See the same against the jpg version (I've no idea why I have kept both a jpg
and a png of the same image around, especially given jpg is a much
better-suited format):

    
    
        $ pv lock-your-screen.jpg | base64 | gzip | wc
         377KiB 0:00:00 [19.2MiB/s] [================================>] 100%
           1420    8373  392796
    
    
        $ pv lock-your-screen.jpg | xxd -p | gzip | wc
         377KiB 0:00:00 [9.36MiB/s] [================================>] 100%
           1487    8935  441077
    

In both cases base64 is both faster and compresses smaller.

~~~
BeeOnRope
This test isn't very informative because both .png and .jpg are already
compressed formats, with "better than gzip" strength, so gzip/deflate isn't
going to be able to compress the underlying data.

You only see some compression because gzip is just backing out some of the
redundancy added by the hex or base64 encoding, and the way the Huffman
coding works favors base64 slightly.

Try with uncompressed data and you'll get a different result.

Your speed comparison seems disingenuous: you are benchmarking "xxd", a
generalized hex-dump tool, against base64, a dedicated base64 tool. I
wouldn't expect their speeds to have any interesting relationship with the
best possible speed of a tuned algorithm.

There is little doubt that base-16 encoding is going to be very fast, and
trivially vectorizable (in a much simpler way than base-64).
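
For instance, a sketch of the classic nibble trick (SSSE3, 16 input bytes per
iteration): split each byte into its two nibbles, then map both halves
through one pshufb lookup each:

    #include <immintrin.h>
    #include <stdint.h>

    /* Hex-encode 16 bytes into 32 characters. */
    static void hex_encode16(const uint8_t *src, char *dst) {
        const __m128i lut = _mm_setr_epi8('0', '1', '2', '3', '4', '5', '6', '7',
                                          '8', '9', 'a', 'b', 'c', 'd', 'e', 'f');
        const __m128i m  = _mm_set1_epi8(0x0F);
        __m128i v  = _mm_loadu_si128((const __m128i *)src);
        __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), m);
        __m128i lo = _mm_and_si128(v, m);
        /* Interleave so each byte's high nibble precedes its low nibble. */
        _mm_storeu_si128((__m128i *)dst,
                         _mm_shuffle_epi8(lut, _mm_unpacklo_epi8(hi, lo)));
        _mm_storeu_si128((__m128i *)(dst + 16),
                         _mm_shuffle_epi8(lut, _mm_unpackhi_epi8(hi, lo)));
    }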

~~~
dnet
> both .png and .jpg are already compressed formats, with "better than gzip"
> strength

FWIW, PNG and gzip both use the DEFLATE algorithm, so I wouldn't call PNG's
compression "better than gzip".

Source:
[https://en.wikipedia.org/wiki/DEFLATE](https://en.wikipedia.org/wiki/DEFLATE)

> This has led to its widespread use, for example in gzip compressed files,
> PNG image files and the ZIP file format for which Katz originally designed
> it.

~~~
BeeOnRope
Like any domain-specific algorithms, PNG uses deflate as a _final step_ after
using image-specific filters which take advantage of typical 2D features. So
in general png will do much better than gzip on image data, but it will
generally always do _at least_ as well (perhaps I should have said that
originally). In particular, the worst-case .png compression (e.g., if you pass
it random data, or text or something) is to use the "no op" filter followed by
the usual deflate compression, which will end up right around plain gzip.

Now _at least as good_ is enough for my point: by compressing a .png file with
gzip you aren't going to see additional compression in general. When
compressing a base64- or hex-encoded .png file, the additional compression
you see is largely only a result of removing the redundancy of the encoding,
not any compression of the underlying image.

~~~
BeeOnRope
Ooops, that should read "Like _many_ domain-specific algorithms" not "Like
_any_ ..."

------
top_post
[https://github.com/monobeard/libasmb64](https://github.com/monobeard/libasmb64)

~~~
Twirrim
I'm not sure what your point is here?

I mean, we could all go and dig up links to base64 encoders and decoders, but
the article you're commenting on is specifically about how fast they've been
able to get using vector instructions on modern processors.

~~~
faragon
They provide an optimized non-SIMD generic implementation. The version I
posted is similar speed-wise, and it supports in-place base64
encoding/decoding (i.e. it allows aliasing, using one buffer for both input
and output).

