
What is the space overhead of Base64 encoding? - ingve
https://lemire.me/blog/2019/01/30/what-is-the-space-overhead-of-base64-encoding/
======
Jaruzel
I think the angle he's going for here is: if you have inline base64 blobs
in your HTML/CSS code, and that's then served by an HTTP(S) server using gzip
compression, _are_ you wasting lots of space/bandwidth? And the answer is
roughly +2.5% to +5% overhead, which isn't much considering the larger
advantage of a single round trip to the server, as all the assets are embedded
in the HTML document.

Of course there's also the overhead of decoding all those base64 blobs in the
browser, but I'm sure that's a topic for another blog post in the future :)
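
For anyone who hasn't seen the kind of inlining being discussed, a minimal sketch (logo.png and index.html are just hypothetical names, and -w0 assumes GNU coreutils base64):

    $ logo=$(base64 -w0 logo.png)
    $ printf '<img src="data:image/png;base64,%s">\n' "$logo" >> index.html

The resulting HTML is plain text, so the server's gzip compression applies to the embedded blob along with everything else.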

~~~
_petronius
For the first load of the page, sure. But if you're embedding assets in the
HTML and CSS, then you can't cache them separately, which means having to
fetch everything on _every_ request where anything has changed, in addition to
the decoding etc. on every request.

~~~
masklinn
That can also be an advantage: if the connection has high throughput but high
latency (e.g. cellular networks, still), you'd much rather fetch everything in
a single request. There's a slight inefficiency in the inability to
independently cache assets, but even that's not necessarily a big issue: you'd
mostly inline small assets.

~~~
skohan
It still feels like a hack to work around a problem that should be solved at
the protocol level. There's no technical reason that multiple assets couldn't
be streamed over a single TCP connection. If it's a common use-case that a
number of assets need to be loaded at once to display a web page, then this
should be supported by HTTP.

~~~
jenscow
Great idea. Let's call it HTTP/2.

[https://en.wikipedia.org/wiki/HTTP/2](https://en.wikipedia.org/wiki/HTTP/2)
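
For what it's worth, a quick way to check whether a given server will actually multiplex assets like that (assuming a curl built with HTTP/2 support; it prints 2 when h2 was negotiated):

    $ curl -sI --http2 -o /dev/null -w '%{http_version}\n' https://example.com
    2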

------
NelsonMinar
I wonder if gzip has improved? 15 years ago when I tested this, it was more
efficient to gzip base16 encoded data than base64 encoded data. (At least, the
English dictionary.) I assume that was because the 3:4 encoding broke up
patterns in the source text and messed with the compressor.

But I just tested this again and it's not true anymore.

    
    
      $ wc -c american-english  
      971578 american-english
      
      $ gzip < american-english | wc -c  
      259977
    
      $ base64 < american-english | gzip | wc -c  
      429263
    
      $ hexdump -e '"%x"' < american-english | gzip | wc -c  
      478411

~~~
kbaker
Actually, it is still true. hexdump has some confusing format options... the
format you were using reads the input as little-endian words before printing
the hex representation, which really messed with gzip. Try this:

    
    
        $ hexdump -e '"%x"' < american-english | gzip | wc -c
        463871
        $ hexdump -v -e '/1 "%02x"' < american-english | gzip | wc -c
        302515
        $ base64 < american-english | gzip | wc -c
        415737
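
A tiny input makes the difference visible (assuming a little-endian machine, where the default 4-byte %x unit reverses the byte order within each word):

    $ printf 'abcd' | hexdump -e '"%x"'; echo
    64636261
    $ printf 'abcd' | hexdump -v -e '/1 "%02x"'; echo
    61626364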

~~~
eadmund
> converting it to little-endian first

Ack, is there _anything_ little-endian doesn't ruin?

/me misses 68000 assembler …

------
tln
If you just compress the base64, the overhead is ~5%. But if you embed it
within an HTML file, which has a different set of character frequencies, the
overhead increases.

    
    
        $ cat index.html | gzip -9 | wc -c
        14116
        $ cat index.html bing.base64 | gzip -9 | wc -c
        15994
        $ expr 15994 - 14116
        1878
        $ cat bing.base64 | gzip -9 | wc -c
        1432
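
Working through those numbers: the embedded copy costs 15994 - 14116 = 1878 compressed bytes, versus 1432 bytes when the base64 is compressed on its own, i.e. roughly 31% more:

    $ echo 'scale=2; (15994 - 14116) / 1432' | bc
    1.31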

~~~
tinus_hn
If you’re going to do an experiment like that you will probably want the file
to be a whole lot bigger for more accurate results.

------
kstenerud
I wrote some revised radix-based text encoding specifications to make them
interoperate better with modern text processor standards (SGML, string
literals, filenames, URIs, etc). I also included a representative test for how
they fare on uncompressed vs pre-compressed data when compressed with gzip:

[https://github.com/kstenerud/safe-encoding/blob/master/README.md#compression](https://github.com/kstenerud/safe-encoding/blob/master/README.md#compression)

radix-64 of course fares the best, but radix-85 isn't much different, and is
10% smaller uncompressed.

------
kilovoltaire
The article compares raw vs base64+gzip, but I'd be interested to see gzip vs
base64+gzip

~~~
leetbulb
Did some further investigation:

I included some text files, hex encoding, and other compression as well :)

raw file list:
[https://hastebin.com/vazonowuvo.txt](https://hastebin.com/vazonowuvo.txt)

tsv (c/p to spreadsheet):
[https://hastebin.com/ewohafucem.tsv](https://hastebin.com/ewohafucem.tsv)

---

(interesting) using bzip2, compression is better when the following files are
encoded first with base64 or hex: bing.png googlelogo.png peppers_color.jpg

Useless takeaways:

- prefer base64 over hex when encoding already compressed images before
further compression

- prefer hex over base64 when encoding plain text / low entropy data before
further compression
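
For anyone wanting to reproduce that sort of comparison, the commands are roughly the following (file names as in the list above; xxd -p produces the hex encoding):

    $ bzip2 -c peppers_color.jpg | wc -c
    $ base64 -w0 peppers_color.jpg | bzip2 | wc -c
    $ xxd -p peppers_color.jpg | tr -d '\n' | bzip2 | wc -c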

~~~
spurgu
Heh, so gzipped PNGs are generally larger than non-gzipped ones. The samples
are probably heavily compressed, though.

~~~
zrm
Not just that, PNG and gzip use the same compression algorithm:

[https://en.wikipedia.org/wiki/DEFLATE](https://en.wikipedia.org/wiki/DEFLATE)

------
ggm
We used to argue about the cost of enforcing a line wrap under 80 columns back
in the email days, without really discussing how eliding the \r\n sequences
embedded in PEM- or Base64-encoded data would have made this a non-issue.
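
The line-wrap overhead really is tiny; with GNU coreutils (default wrap at 76 columns, -w0 to disable):

    $ head -c 3000 /dev/urandom | base64 | wc -c
    4053
    $ head -c 3000 /dev/urandom | base64 -w0 | wc -c
    4000

That's about 1.3% for the newlines, and it would roughly double with \r\n line endings.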

=Endmarker

------
sshine
That reminds me of when someone asked if strings that are base64-encoded twice
are regular. They are.

[https://stackoverflow.com/questions/49650847/determine-if-string-is-base64-encoded-twice](https://stackoverflow.com/questions/49650847/determine-if-string-is-base64-encoded-twice)
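
For concreteness: single-round, padded, standard-alphabet base64 is matched by an ordinary regular expression, which is one way to see the language is regular. A sketch:

    $ b64='^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?$'
    $ echo -n 'hello world' | base64 | grep -cE "$b64"
    1
    $ echo -n 'hello world' | base64 | base64 | grep -cE "$b64"
    1
    $ echo 'not base64!' | grep -cE "$b64"
    0

The linked question is about tightening that into a recognizer that only accepts strings which decode to valid base64 again, which is still regular, just fiddlier.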

~~~
tedunangst
What makes base64 encoded random data not regular?

~~~
zamadatix
Nothing; the top answer in the link covers that question trivially. The
question of whether twice-encoded strings are regular is simply the more
interesting one, as it is not as trivial.

------
aboutruby
From archive.org:
[https://web.archive.org/web/20190131000036/https://lemire.me/blog/2019/01/30/what-is-the-space-overhead-of-base64-encoding/](https://web.archive.org/web/20190131000036/https://lemire.me/blog/2019/01/30/what-is-the-space-overhead-of-base64-encoding/)

------
bdhess
> Privacy-wise, base64 encoding can have benefits since it hides the content
> you access in larger encrypted bundles.

Uh, no.

~~~
zamadatix
Based on your selected quote and short comment I think you're reading that
differently than the author intended. Note the start of that paragraph (which
you left out):

> In some instances, base64 encoding might even improve performance, because
> it avoids the need for distinct server requests.

I.e. they are arguing that inlining all resources and grabbing them in a
single request has a smaller fingerprint. This is probably less true with
HTTP/2 or QUIC.

~~~
maxk42
Yes, but it's not encrypted.

Furthermore, the article begins with the premise of using text-only data
transfer protocols such as MIME, then goes on to talk about base64-encoding
_then_ gzipping data. If he wanted to keep talking about text-only data
transfer, he should've talked about gzipping _then_ base64-encoding the data;
if he was talking about reducing the size of the data, he should've talked
about gzipping _instead of_ base64-encoding it. Instead he seems to be talking
about something which isn't compatible with MIME, so the article doesn't
really seem to have a clear direction. If the point was to save a single
round-trip to the server then congratulations -- you've increased your page
size by 2.5% and added the overhead of compression to every request, a much
higher toll than the cost of a second request for the kind of non-trivial
assets gzip would be effective on, since compression has its own overhead.

~~~
zamadatix
> Yes, but it's not encrypted.

Would your opinion change if you reread the content with the understanding
HTTPS is assumed when talking about web privacy in 2019? The author made no
claim whatsoever base64 was encrypting the data, just bundling it into one
generic request.

> Furthermore the article begins with the premise of using text-only data
> transfer protocols such as MIME, then goes on to talk about base64-encoding
> then gzipping data...

Would your opinion change if you reread the content with the assumption the
author means to set "Content-Encoding: gzip" in the server config rather than
literally gzipping the content?

I think both of these are fair assumptions for the article's target audience
to make, but your disagreements with the article are only true without them.

~~~
maxk42
> Would your opinion change if you reread the content with the understanding
> HTTPS is assumed when talking about web privacy in 2019?

Well he began by talking about email, so no. If we want to talk about HTTPS
and 2019, then let's serve the whole shebang with HTTP/2 and not worry about
reducing the number of requests, which seems to be the only advantage offered.

> Would your opinion change if you reread the content with the assumption

That's the assumption I was forced to make, and the crux of my argument.
Content-Encoding: gzip works in most servers by compressing the content on the
fly -- not by precompressing. Hence my comment about adding the overhead of
compression to the request.
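
For what it's worth, most servers can also be pointed at precompressed files instead of compressing per request; a minimal sketch (assuming GNU gzip and, say, nginx with its gzip_static module enabled so the .gz file is served as-is):

    $ gzip -k -9 index.html    # leaves index.html.gz next to the original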

------
ggm
Another gem is the practice of seeking the minimal encoded instance which is
still correctly identified as "legal" by file magic. People do this for a.out,
JPG, PNG, &c.

------
leetbulb
The site is down, but the answer is 33% (four output bytes for every three
input bytes).

It could be slightly less if you want to remove the padding and calculate what
the padding _should be_ on the decoding side: base64string.length % 4.
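
Quick sanity check of both points (assuming GNU coreutils, where -w0 disables line wrapping):

    $ head -c 300000 /dev/urandom | base64 -w0 | wc -c
    400000
    $ printf 'a' | base64
    YQ==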

Edit: If no entropy is added, it's still ~33%. C'mon :P

~~~
BeeOnRope
When it's up, you'll see the article goes a bit beyond a literal reading of
the title: the topic is the overhead of base64 under the assumption that the
output ends up compressed with something like gzip.

~~~
leetbulb
Figured as much. Interested to see the results. My assumption is that it
doesn't change much. However, I suppose that depends on the compression
algorithm.

~~~
BeeOnRope
In principle a good byte-wise entropy coder should recover nearly 100% of the
base-64 "inflation" since no entropy is added. In practice gzip doesn't get
all of it since it is an imperfect entropy coder for several reasons.
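
The back-of-the-envelope behind "no entropy is added": each base64 symbol is one of 64 values, i.e. log2(64) = 6 bits, so four symbols carry 24 bits, exactly the three bytes they encode:

    $ python3 -c 'import math; print(4 * math.log2(64) / 8)'
    3.0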

~~~
hermitdev
You are correct, but gzip is generally a happy medium between encode/decode
cost and encoded size when transmitting over a fast network. For a 10 Mbps or
greater link, if you use something like bzip2 or 7z, you'll likely spend more
time encoding than transmitting a payload.

This is from my personal experience utilizing gzip and bzip2 via Boost
Iostreams to compress network payloads over a 1Gbps link. End to end latency
was far superior with gzip than bzip2, despite bzip2 having a smaller
transmission size.
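
If anyone wants to reproduce that trade-off, something along these lines works (payload.bin standing in for your data; compare wall-clock time against output size):

    $ time gzip -6 -c payload.bin > /dev/null
    $ time bzip2 -9 -c payload.bin > /dev/null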

~~~
BeeOnRope
gzip is only a happy medium if your other candidates are 7z (LZMA) or bzip2,
which are both stronger but slower compressors. bzip2 is essentially obsolete
(off the Pareto frontier), and LZMA is good but slow, so it will only be best
if your transmission speed is low.

Near the space/time tradeoff point that gzip lives, however, it is thoroughly
outclassed by more modern compressors such as zstd or brotli with the
appropriate settings.
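
A rough way to compare, with levels picked to sit near gzip's speed range (payload.bin is a placeholder; exact numbers will depend heavily on the data):

    $ gzip -6 -c payload.bin | wc -c
    $ zstd -3 -c payload.bin | wc -c
    $ brotli -q 5 -c payload.bin | wc -c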

------
speedplane
This article compares base64 with gzipped base64 and posits that there isn't
too much difference in size. That's not terribly insightful. The only reason
base64 exists is because of a lack of standardized binary distribution
formats, especially over the internet. There is literally no substantive
difference between data and base64-encoded data. I'm actually quite surprised
that the gzipped data is 2% larger; I'd expect a smaller margin.

