
Base-122 – A space efficient alternative to base-64 - kevinAlbs
http://blog.kevinalbs.com/base122
======
zackmorris
[http://blog.kevinalbs.com/base122#a_minor_note_on_the_last_c...](http://blog.kevinalbs.com/base122#a_minor_note_on_the_last_character)

Almost a perfect standard, but the prepended one-byte header is a mistake
IMHO. It makes it impossible to encode when the input size is unknown. Better
to encode whether the last chunk is one byte or two at the end of the stream.

Please whoever is involved with this, revise the standard to not have a header
and call this existing spec a beta. Otherwise, good work.

Edit: I have opened an issue for this:
[https://github.com/kevinAlbs/Base122/issues/3#issue-19188159...](https://github.com/kevinAlbs/Base122/issues/3#issue-191881594)

~~~
dmbarbour
It would be easy to do. That 3-bit word just needs to start at 1 (since it
encodes fewer than 8 options). Then the fixed 1-bit can instead encode whether
another 7-bit segment follows.

------
userbinator
The use of codepoints below 32 (space, start of what's usually considered
"printable") makes me a bit hesitant. A lot of systems won't preserve those
characters. Base85 is a more efficient alternative to base64, and doesn't use
that lower range:

[https://en.wikipedia.org/wiki/Ascii85](https://en.wikipedia.org/wiki/Ascii85)

~~~
jws
base 85 also has the interesting property that 4 original bytes fit in 5
encoded bytes. Depending on your processor's memory model and the cost of
multiplies compared to shifts, this can make it the best performer.

This was true on Vax 8200 hardware back in the day. In the same software, with
Huffman decoding of JPEGs it was also fastest to create a finite state machine
with an 8 bit symbol size. I suspect that is no longer true since it would
kill your L1 cache and be well into your L2 cache on modern x86 machines. It
is probably better to take the instruction count hit and process as bits or
nibbles.
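The 4-bytes-in-5 property is easy to check with Python's stdlib b85 variant (a minimal sketch; the sample word is arbitrary):

```python
import base64

# Base85 fits every 4-byte word into 5 ASCII characters because
# 85**5 (about 4.4 billion) just exceeds the 2**32 possible values.
assert 85 ** 5 > 2 ** 32

word = b"\xde\xad\xbe\xef"
encoded = base64.b85encode(word)
assert len(encoded) == 5
```

Since 4 bytes is exactly one 32-bit word, the encoder can work register-at-a-time with divmods rather than bit-shuffling across byte boundaries as base64 does.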

~~~
feadog
_base 85 also has the interesting property that 4 original bytes fit in 5
encoded bytes._

Still useful for Javascript, as the bit shift operators work on 32 bit
"registers".

------
poizan42
A lot of people have experimented with a lot of different ways of encoding
binary data as printable text. Wikipedia has a list of different encoding
schemes[0].

The most efficient one is yEnc[1]. Still, the simplest ones, such as base64 or
good old hex, may actually work better once compression comes into the picture.

[0]: [https://en.wikipedia.org/wiki/Binary-to-text_encoding](https://en.wikipedia.org/wiki/Binary-to-text_encoding)

[1]: [https://en.wikipedia.org/wiki/YEnc](https://en.wikipedia.org/wiki/YEnc)

~~~
NelsonMinar
It's crucial to evaluate encoding space usage in the context of compression.
For instance gzip(base16(data)) is often smaller than gzip(base64(data)) for
practical data. Even though base64 is more efficient than base16, it breaks up
data across byte boundaries which then makes gzip significantly less
efficient.
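A quick harness for measuring this yourself (the corpus below is a synthetic stand-in; real pages will give different numbers, and the effect depends heavily on where repeated runs fall relative to base64's 3-byte groups):

```python
import base64
import binascii
import zlib

# Stand-in "practical data": repetitive markup, as on a typical web page.
data = b'<li class="item">hello world</li>\n' * 200

raw_gz = len(zlib.compress(data))
hex_gz = len(zlib.compress(binascii.hexlify(data)))
b64_gz = len(zlib.compress(base64.b64encode(data)))

print("gzip(raw):   ", raw_gz)
print("gzip(base16):", hex_gz)
print("gzip(base64):", b64_gz)
```

Base16 maps each byte to exactly two characters, so repeated byte runs stay aligned and gzip can still find them; base64 shifts data across byte boundaries, which can break up those matches.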

~~~
bunderbunder
When would you gzip encoded data instead of encoding gzipped data? Doesn't
gzip after encoding defeat the whole idea of encoding the data in a format
that won't get mangled by systems that expect to be handling text?

~~~
dleslie
When serving gzip-compressed pages to browsers that support it.

~~~
takeda
He meant that you typically use base64 when the medium you use (e.g. email)
doesn't support binary data. When you compress base64 encoded data you get
back binary output. If binary output is ok to transfer, then why would you use
base64 in the first place? Why not just compress the raw data?

~~~
detaro
If you're embedding encoded data in another file format which forces
restrictions on it. The encoding in the article is very explicitly optimized
to be embedded in HTML attributes, which have a limited character range. The
full HTML document is later compressed for transport, over a protocol that a)
is aware of the compression and b) can transport binary data.

------
throwbsidbdk
I appreciate the author building a more efficient alternative to base64, but I
had to laugh for a moment at the suggested use case.

Using an alternative to base64 for encoding data in web pages and node is a
horrible idea. If I came across a code base using that I would scream. A lot.
None of my tools work with it, and now my browser has to run a bunch of js for
something that's normally native and very fast. It's a 1-2kb savings per page
that's going to make somebody jump off a bridge one day.

More terrifying is that the JavaScript community is so fascinated by shiny
objects that thousands of people are going to use this. I'm not sure if it's
more funny or terrifying.

Anyway, this encoding is still useful if you need to pass data between legacy
systems. I've used HTML escapes and ASCII85 many times to get around annoying
old stuff that doesn't use Unicode.

~~~
RobIII
> None of my tools work with it, and now my browser has to run a bunch of js
> for something that's normally native and very fast.

It's not that I don't agree with the general idea of your post, but I disagree
with this part. If nobody ever tries something new we'd still be in the stone
age. I think base122 is a bit silly but _if_ it had its uses and adoption
became widespread then browsers and tooling etc. would follow, just as they
now support base64.

------
daenney
Does anyone else find it odd that both

 _Base-122 was created with the web in mind._

And

 _As §3 shows, base-122 is not recommended to be used on gzip compressed
pages, which is the majority of served web pages._

occur just a few lines from each other? I get that there are more use cases,
like email and such, but if you're going to create something for the web and it
can't be used on the majority of web pages, that seems like a fairly large
oversight/caveat.

~~~
gayprogrammer
I agree. Including:

> Base-122 encoded strings contain characters which did not seem to play well
> with copy-pasting.

A very important part of web development is being able to manipulate text
documents. It seems that using UTF-8 in more places can reveal cracks in
implementations for browsers/DE's/editors/terminals/etc.

------
CiPHPerCoder
My problem with base-122 is simply that it's not an even power of 2.

It's very easy to write a cache-timing-safe version of base{16,32,64} encoding
for use in encoding/decoding cryptographic keys in configuration files. To
wit:
[https://github.com/paragonie/constant_time_encoding](https://github.com/paragonie/constant_time_encoding)

Base-122? Not sure if it's even possible.
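The branch-free arithmetic behind such encoders is easy to show for base16. This is only a sketch of the trick (Python integers aren't actually constant-time; the real versions live in lower-level languages, as in the linked library):

```python
def ct_hex_nibble(x: int) -> str:
    """Map a nibble 0..15 to a hex digit without a table lookup or branch.

    (9 - x) >> 8 is -1 (all ones) exactly when x > 9, which selects the
    extra offset 39 == ord('a') - ord('0') - 10 for the letters a..f.
    """
    return chr(x + 48 + (((9 - x) >> 8) & 39))

assert "".join(ct_hex_nibble(n) for n in range(16)) == "0123456789abcdef"
```

The point is that the character chosen never influences which memory is read or which branch is taken, so cache and branch-predictor timing leak nothing about the key material. It's unclear how to get the same property for a radix like 122 that doesn't divide evenly into bits.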

~~~
kobeya
I'm pretty sure that Bitcoin has dealt with this issue for its base58
encoding. It might be worth checking if their algorithm is generalizable to
other radix sizes.

~~~
deckar01
Check out digit-array if you are interested in the generalized algorithm. The
source code is commented with formal math notation for the operations.

[https://github.com/deckar01/digit-array/blob/master/README.md](https://github.com/deckar01/digit-array/blob/master/README.md)
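For reference, the core of any generalized radix conversion is just repeated divmod on a big integer. A hedged sketch (not the digit-array implementation itself, and it drops leading zero bytes, which Bitcoin's base58 handles with a special case):

```python
def to_digits(data: bytes, base: int) -> list[int]:
    """Convert a big-endian byte string to digits in an arbitrary base."""
    n = int.from_bytes(data, "big")
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(r)
    # Note: leading zero bytes vanish here; base58check prepends them back.
    return digits[::-1] or [0]

# 0x0100 = 256 = 4 * 58 + 24, so the base-58 digits are [4, 24].
assert to_digits(b"\x01\x00", 58) == [4, 24]
```

The downside versus power-of-two bases is that every output digit depends on the whole input, so the conversion is O(n^2) in the naive form and can't be streamed chunk by chunk.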

~~~
CiPHPerCoder
This doesn't appear to address the problem. At all.

[https://paragonie.com/blog/2016/06/constant-time-encoding-boring-cryptography-rfc-4648-and-you](https://paragonie.com/blog/2016/06/constant-time-encoding-boring-cryptography-rfc-4648-and-you)

~~~
deckar01
I misread your original comment. Nothing I posted is related to constant time
encoding. I apologize for the confusion.

~~~
CiPHPerCoder
It's OK. I just want to be clear that my objection to base-122 is the same as
my objection to base-N where N != 2^m for some integer m.

------
xg15
It's an interesting technical exercise, but I don't think this is the right
approach for optimizing HTML load times.

If the goal is to reduce latency for small images, wouldn't it make more
sense to extend data URIs so the same base64 string can be referenced in
multiple places?

Actually, as HTTP2 can effectively return multiple resources in the response
to one request, do we still need embedded images for latency reduction at all?

~~~
paulddraper
Compression takes care of exactly this.

~~~
xg15
Unless compression cannot be used:
[https://news.ycombinator.com/item?id=13049898](https://news.ycombinator.com/item?id=13049898)

Also, there are still cases where compression cannot be applied, e.g. if a
script naively queries innerHTML. (This wouldn't affect loading time, but it
could inflate the page's RAM usage unnecessarily.)

~~~
paulddraper
Sure, but I'd consider those edge cases to be situations where the treatment
is worse than the disease.

~~~
pixl97
HTTPS is an edge condition? Over 50 percent of the web is now transferred via
HTTPS.

~~~
paulddraper
It's not HTTPS. It's HTTPS cross-site requests with secret cookies.

------
zihotki
tl;dr: no revolution, base64+gzip is still far better than base122+gzip. But
the article is worth reading.

~~~
sikosmurf
Aren't there security concerns using gzip and HTTPS? Could this be a space
efficient alternative when gzip is disabled?

~~~
krschultz
I would be interested to learn more about those concerns. My "default" setup
these days is HTTPS & gzip everything, but I can't say I've read any white
papers on the security implications of that.

~~~
zrm
If the attacker knows or controls any part of the data then the compressed
size leaks information about the unknown data because the compressed size will
be smaller if the known data shares bytes with the unknown data.
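The leak is easy to demonstrate with zlib (the names below are illustrative, not a real protocol; this is the mechanism behind CRIME/BREACH-style attacks):

```python
import zlib

# A response that concatenates a secret with attacker-controlled input.
secret = b"session_token=S3CR3T_VALUE;"

def response_size(attacker_input: bytes) -> int:
    """Size of the compressed response containing secret + attacker data."""
    return len(zlib.compress(secret + attacker_input))

# A guess matching the secret is encoded as a cheap back-reference, so the
# response is smaller; the size alone confirms the guess.
right = response_size(b"session_token=S3CR3T_VALUE;")
wrong = response_size(b"session_token=WRONG_VALUE!!;")
assert right < wrong
```

By repeating this one prefix-byte at a time, an attacker who can inject content and observe ciphertext lengths can recover the secret without breaking the encryption.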

~~~
Drdrdrq
Thanks for the explanation! Attack:
[https://en.m.wikipedia.org/wiki/BREACH_(security_exploit)](https://en.m.wikipedia.org/wiki/BREACH_\(security_exploit\))

------
wildfire
Something that isn't really discussed is that the need for base64 arose so
that binary data could transit 7-bit email systems.

Base122 does not have that property - as far as I could tell.

Whilst we are all, basically, living in an 8-bit world, I suspect it will be
some time before people feel comfortable assuming that an 8-bit transport is
viable over email.

[edit: spelling ]

------
faragon
Base85 sits between base64 and base122, while using only printable characters.
Any statistics on compressed base85?

~~~
Dylan16807
I don't have an easy way to try a test corpus but I fed this comment page into
it.

      Raw size:     37409
      gzip(raw):    6170
      gzip(base16): 7482
      gzip(base64): 10675
      gzip(base85): 10549

So it works better than base64, and it has the advantage of working 4 bytes at
a time rather than 3.

Also note that when you're feeding in binary data the gzip sizes for raw and
base-xx data get a lot closer together.
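For anyone who wants to rerun these numbers on their own corpus, a minimal stdlib-only harness (the input below is a placeholder; substitute your own bytes):

```python
import base64
import zlib

# Placeholder corpus: binary-ish data. Swap in a real file's contents.
data = bytes(range(256)) * 48  # 12288 bytes, divisible by 3 and 4

for name, enc in [("raw",    data),
                  ("base16", data.hex().encode()),
                  ("base64", base64.b64encode(data)),
                  ("base85", base64.b85encode(data))]:
    print(f"gzip({name}): {len(zlib.compress(enc))}")
```

Since the input length divides evenly into base64's 3-byte and base85's 4-byte groups, no padding is involved and the encoded sizes are exactly 4/3 and 5/4 of the original.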

------
lalaithion
Note that if you're not using UTF-8 to encode your characters, you can beat
Base64, Base85, and Base122 with
[https://github.com/ferno/base65536](https://github.com/ferno/base65536)

------
jwatte
Berkeley Unix (or was that SunOS?) used to come with btoa, which used base 85,
packing 4 bytes into 5 uucp/mail/Usenet-safe characters. For some reason, base
64 became the standard, and I don't think there's space for anything else
these days. (Most things do just fine with binary now, anyway!)

------
Rafert
Interesting, but please add a license :)

------
diafygi
Why not embrace the Unicode world we live in now and just offset bytes to a
Unicode region that has 256 non-space characters?

[https://github.com/diafygi/Offset248](https://github.com/diafygi/Offset248)

Pros: simple, no ratio or ending calculation, easy copy/paste, same character
length as original bytes

Cons: has to be percent encoded in urls, some fonts might not render all
characters

~~~
timdierks
In UTF-8, these characters are all encoded as two bytes. So this encoding
doubles the byte count; it's no more efficient (in bytes) than hex encoding.
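The doubling is easy to verify (the code-point offset below is illustrative; Offset248 picks its own 256-character region):

```python
# Any code point in U+0080..U+07FF occupies two bytes in UTF-8, so
# offsetting each input byte into such a region doubles the size on the wire.
data = bytes(range(256))
encoded = "".join(chr(0x100 + b) for b in data)  # illustrative offset
assert len(encoded.encode("utf-8")) == 2 * len(data)
```

So the scheme only "keeps the same length" when you count characters; in transmitted bytes it is exactly as expensive as hex.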

------
randomslob
Base64 is widespread and works fine, while nobody groks base122.

Base122 requires UTF-8, and while that's pretty common, it's not universal, so
base64 can't ever go away in favor of base122.

Compressed base64 is more efficient than base122 (with or without
compression).

Conclusion: Big nope.

------
web007
Are single-quotes allowed in the spec?

There is an explicit call out for "double quote == bad", but single quotes are
also valid attribute delimiters in HTML.

~~~
majewsky
If your attribute value is enclosed in double quotes, then a single quote in
the payload is fine. (Otherwise, a lot of inline JS in onclick etc. attributes
would break. JS allows both types of quotes on string literals for exactly
this reason.)

Still, single quotes are somewhat asking for trouble.

------
Hydraulix989
I always wondered why people didn't do this.

------
amelius
I'm looking for an efficient encoding that is guaranteed to not contain
certain substrings, such as improper language.

~~~
jzwinck
You can probably get most of what you want by simply not using vowels. A few
non-English ones will still get through like knnbccb and there's also shlt and
so on....

------
logicallee
Meanwhile the Unicode consortium has been hard at work since 1991 to make it
possible to encode up to 2.8 MB -- more than enough for most images, short
videos, or many PDF files -- in a single character.

~~~
qwename
Are you referring to UTF-8? If so, this is misleading as you can encode up to
2^21 + 2^16 + 2^11 + 2^7 = 2,164,864 code points, which is not the same as
encoding bytes in a single character.

~~~
logicallee
I was making a joke about how ridiculously large Unicode is. Obviously it is
not 2 million+ bytes per character! (But the fact that you didn't consider my
joke obvious speaks volumes).

