
Show HN: Base24 binary-to-text encoding for humans - kuon
https://www.kuon.ch/post/2020-02-27-base24/
======
wp381640
Microsoft product keys were base-24 with the following alphabet:

> B C D F G H J K M P Q R T V W X Y 2 3 4 6 7 8 9

they were 115 bits encoded in 25 characters

see also human-oriented base32 encoding:

[https://philzimmermann.com/docs/human-oriented-base-32-encoding.txt](https://philzimmermann.com/docs/human-oriented-base-32-encoding.txt)

which includes this nice trick:

> We have permuted the alphabet to make the more commonly occuring characters
> also be those that we think are easier to read, write, speak, and remember.

edit: to add, an interesting human-readable and memorable base52 alphabet that
I've never found a use for is to use playing cards

~~~
hajimemash
Brings me back to the trusty old FCKGW-RHQQ2-YXRKT-8TG6W-2B7Q8

~~~
raxxorrax
I wonder who owns that license. Some people allegedly used this as their
password...

~~~
genera1
It was a special volume license key that didn't require online or phone
activation, so presumably it came from an OEM that got a release version a
couple of weeks early

------
directionless
I'm fond of using a base100 made up of 2-letter syllables. It results in a
vaguely pronounceable string.

For syllables, I use: syllables: %w[ ba be bi bo bu ca ce ci co cu da de di do
du fa fe fi fo fu ga ge gi go gu ha he hi ho hu ja je ji jo ju ka ke ki ko ku
la le li lo lu ma me mi mo mu na ne ni no nu pa pe pi po pu ra re ri ro ru sa
se si so su ta te ti to tu va ve vi vo vu wa we wi wo wu xa xe xi xo xu ya ye
yi yo yu za ze zi zo zu ],

I can dump an implementation somewhere if people are really curious
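
For the curious, here is a rough Python sketch of how such a syllable encoding could work. It is not the commenter's implementation: the function names `encode` and `decode` are made up, and it assumes the integer is simply written in base 100 with the most significant syllable first. The syllable table is rebuilt from the 20 consonants and 5 vowels that appear in the list above.

```python
# Hypothetical sketch, not the original poster's code.
CONSONANTS = "bcdfghjklmnprstvwxyz"  # the 20 consonants used in the list above
VOWELS = "aeiou"
# 100 syllables in the same order as the list: ba, be, bi, ... zo, zu
SYLLABLES = [c + v for c in CONSONANTS for v in VOWELS]


def encode(n: int) -> str:
    """Write a non-negative integer in base 100, one syllable per digit."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    while True:
        n, d = divmod(n, 100)
        digits.append(SYLLABLES[d])
        if n == 0:
            break
    return "".join(reversed(digits))


def decode(text: str) -> int:
    """Inverse of encode: read two letters at a time as base-100 digits."""
    if len(text) % 2:
        raise ValueError("expected an even number of letters")
    n = 0
    for i in range(0, len(text), 2):
        n = n * 100 + SYLLABLES.index(text[i:i + 2])
    return n


print(encode(1234567))  # begomasi
```

Each syllable carries log2(100), roughly 6.64 bits, for two typed letters.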

~~~
yongjik
Be careful, wiser(??) people have tried the path, and their tale is told in...

[https://thedailywtf.com/articles/The-Automated-Curse-Generator](https://thedailywtf.com/articles/The-Automated-Curse-Generator)

~~~
klingonopera
Vuluva, fuca, fucu and in my second language, German, also fiki, pipi, kaka,
and countless more.

Not OP, but me, personally, I don't care about accidental obscenity. It is
accidental, after all.

Then again, I live in Germany, and we don't censor swear words on TV either,
so this is likely a cultural thing.

------
excitedleigh
Another interesting solution to this problem is that used by plus codes [1]:

> The characters that are used in Open Location Codes were chosen by computing
> all possible 20 character combinations from 0-9A-Z and scoring them on how
> well they spell 10,000 words from over 30 languages. This was to avoid, as
> far as possible, Open Location Codes being generated that included
> recognisable words. The selected 20 character set is made up of
> "23456789CFGHJMPQRVWX". [2]

[1]: [https://plus.codes](https://plus.codes)
[2]: [https://github.com/google/open-location-code/blob/master/docs/olc_definition.adoc#open-location-code](https://github.com/google/open-location-code/blob/master/docs/olc_definition.adoc#open-location-code)

~~~
cortesoft
Does not help with the ambiguous character problem at all, though

~~~
IAmEveryone
No I,1,0, or O in their alphabet, though. So they did probably consider the
problem at some point. Or got lucky.

------
tlhunter
> The final alphabet I came up with is ZAC2B3EF4NH5TKL7P8RS9WXY. As I required
> 24 characters, I kept G and 6 which are the least ambiguous in the list.

I've read this a dozen times. Isn't OP saying that their character list
includes G and 6, which are _not_ present in that list?

Update: It appears to be a typo in the article. Here's the real alphabet (N
replaced by G and L replaced by 6): ZAC2B3EF4GH5TK67P8RS9WXY

[https://github.com/kuon/java-base24/blob/0c25905414f1598a0ed245eb8d727d9d0b1427ed/src/main/kotlin/ch/kuon/commons/Library.kt#L5](https://github.com/kuon/java-base24/blob/0c25905414f1598a0ed245eb8d727d9d0b1427ed/src/main/kotlin/ch/kuon/commons/Library.kt#L5)

~~~
anonsivalley652
S 5 6 G

P R

2 Z

8 B

look similar, depending on the font

It would be better to include some lower case characters which have more
visual variability than trying to obsess over an arbitrary, inflexible
stylistic "design."

------
Groxx
Though clearly there are some advantages with removing ambiguous chars... I
feel like it's more of a UI / UX thing-to-polish than a _problem_. Lack of
polish creates the problem, the ambiguous chars themselves are not inherently
an issue.

If it's ambiguous, you could accept either and transform it to the correct
value (implicitly, or as entered, or whenever makes sense; your users don't
ever have to know). Or if you can't do that / the differences matter, do
something like 1password does with chars and letters: show them differently
[https://www.dropbox.com/s/a29g2uiggqujzjl/screen%20shot%2020...](https://www.dropbox.com/s/a29g2uiggqujzjl/screen%20shot%202020-02-26%20at%207.12.41%20pm.png?dl=0)

~~~
oefrha
> do something like 1password does with chars and letters: show them
> differently

That’s missing the point. _You_ can show them differently, but the point of
keys / recovery codes is that they’ll be stored somewhere and later re-
entered. Users could store them in any program (including writing them down or
printing them out), you can’t control how they are displayed over there. Then
when they need to use them, there’s a chance the ambiguous characters can’t be
easily discerned.

~~~
Groxx
Since you can't control the display there, but you can control how it's
interpreted, you make it a non-issue by mapping them to the same thing in
whatever is consuming the input.

Or just try all combinations, unless they entered o0o0o0o0o0o0o0o0o0o0o0
you're probably only going to have to try a small handful.
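
A minimal sketch of the "map them to the same thing" idea, using Crockford's Base32 conventions as the example (its alphabet omits I, L, O and U, and its decoder reads O as 0 and I or L as 1). The function name `normalize` is made up for illustration.

```python
# Crockford Base32 symbols: digits plus letters, without I, L, O, U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
# Look-alikes that a decoder can silently fold onto the intended symbol.
ALIASES = str.maketrans({"O": "0", "I": "1", "L": "1"})


def normalize(code: str) -> str:
    """Canonicalize user input before decoding it."""
    cleaned = code.upper().replace("-", "").replace(" ", "")
    cleaned = cleaned.translate(ALIASES)
    bad = set(cleaned) - set(CROCKFORD)
    if bad:
        raise ValueError(f"invalid characters: {sorted(bad)}")
    return cleaned


print(normalize("o1l-Io"))  # 01110
```

This only works when the alphabet contains at most one member of each look-alike group, which is the point made above: the consumer can resolve the ambiguity as long as both readings cannot be valid at once.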

------
GeertB
I really like the super high efficiency at the important multiple-of-4-byte
increments. Using 7 base-24 characters to encode 32 bits is 99.7% efficient.
However, I'd recommend using 7 base-24 digits followed by a blank as standard
output format. This would allow for efficient 8 character <=> 32 bit
conversions. Also, I think padding output to a multiple of 7 characters would
be good, for similar reasons that it's good for base-64. Now you can
concatenate encoded streams like you could byte streams, and recover on
decode. As multiples of 32 bits are so common, padding would be used little in
practice. On input, it would be fine to accept unpadded base-24 sequences, but
valid base-24 output should always pad to a multiple of 7 chars (excluding the
blanks that should be just for readability and not significant otherwise).

However, I strongly dislike the arbitrary mapping between character values and
base-24 digits. There is a strong reason for using the order
2345679ABCEFGHKRSTWXYZ, which is that now encoded values compare the same as
the original binary values. I did appreciate the 0x00000000 == ZZZZZZZ
equivalence, but consistent ordering is just way more important IMO. Also
2222222 looks a lot like ZZZZZZZ. Just saying.
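
To make the two points above concrete, here is a small Python sketch. It assumes a 32-bit value is written as 7 base-24 digits with the most significant digit first, and that a character's digit value is its position in the alphabet string; that is an assumption for illustration, not a claim about the article's exact spec. `encode32` is a made-up name, and the alphabet is the corrected one quoted elsewhere in this thread.

```python
import math

ALPHABET = "ZAC2B3EF4GH5TK67P8RS9WXY"        # corrected alphabet from this thread
SORTED_ALPHABET = "".join(sorted(ALPHABET))  # same symbols in ASCII order


def encode32(value: int, alphabet: str) -> str:
    """Encode one unsigned 32-bit integer as 7 base-24 digits."""
    if not 0 <= value < 2**32:
        raise ValueError("value must fit in 32 bits")
    out = []
    for _ in range(7):
        value, digit = divmod(value, 24)
        out.append(alphabet[digit])
    return "".join(reversed(out))  # most significant digit first


# 7 digits are enough because 24**7 >= 2**32.
efficiency = 32 / (7 * math.log2(24))
print(round(efficiency, 3))                # 0.997

print(encode32(0, ALPHABET))               # ZZZZZZZ
print(encode32(0, SORTED_ALPHABET))        # 2222222

# With the sorted alphabet, string order matches numeric order.
values = [0, 1, 23, 24, 2**31, 2**32 - 1]
encoded = [encode32(v, SORTED_ALPHABET) for v in values]
print(encoded == sorted(encoded))          # True
```

With the unsorted alphabet the same check fails, since 1 encodes to `ZZZZZZA`, which sorts before `ZZZZZZZ`.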

~~~
kuon
I thought about the comparison bit, and I wanted to go against it.

Ordered, your snippet looks like the alphabet with a few missing letters, and
isn't searchable on Google or anything. I really wanted the alphabet to stand
out.

I don't think it is important that it can be sorted; it is intended for
randomly generated keys which, in my experience, you won't be sorting.

------
tjchear
I think proquints [0] are pretty good at encoding for humans as well. For
example, when used to encode IP addresses, they result in pronounceable
identifiers like this:

    
    
      127.0.0.1       lusab-babad
      63.84.220.193   gutih-tugad
      63.118.7.35     gutuk-bisog
    

[0] [https://arxiv.org/html/0901.4016](https://arxiv.org/html/0901.4016)
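
A short Python sketch of the scheme described in [0], as I understand it: each 16-bit word becomes consonant-vowel-consonant-vowel-consonant, with 4 bits per consonant and 2 bits per vowel. The function names are made up; the tables are the ones from the proquint proposal.

```python
import ipaddress

CONSONANTS = "bdfghjklmnprstvz"  # 16 symbols, 4 bits each
VOWELS = "aiou"                  # 4 symbols, 2 bits each


def quint(word: int) -> str:
    """Encode one 16-bit word as five letters (4+2+4+2+4 bits)."""
    return (
        CONSONANTS[(word >> 12) & 0xF]
        + VOWELS[(word >> 10) & 0x3]
        + CONSONANTS[(word >> 6) & 0xF]
        + VOWELS[(word >> 4) & 0x3]
        + CONSONANTS[word & 0xF]
    )


def ip_to_proquint(addr: str) -> str:
    """Encode an IPv4 address as two dash-separated quints."""
    n = int(ipaddress.IPv4Address(addr))
    return quint(n >> 16) + "-" + quint(n & 0xFFFF)


print(ip_to_proquint("127.0.0.1"))      # lusab-babad
print(ip_to_proquint("63.84.220.193"))  # gutih-tugad
```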

------
kortex
Nice! I'd like to implement this in a key-recovery tool I have been working
on, Passcrux [1]. I actually started fleshing out a base24 encoding of my own,
but the padding/bit shuffling proved to be somewhat cumbersome, and I shifted
focus to abc16, which is like hex, but purely alphabetic.

[1] [https://github.com/xkortex/passcrux](https://github.com/xkortex/passcrux)

~~~
davidcollantes
Related: how do I get Passcrux to compile? I couldn't find instructions in the
repository, and would love to try it out. Thanks!

------
kozak
If you consider the scenario of dictating over the phone, letters can be
confusing not just because of their written shape. For many non-English
speakers, for example, E can be confused with I, and V can be confused with W,
unless both sides use the same way of pronouncing them. Look how Microsoft's
base24 alphabet (from the other comment) has neither E nor I.

~~~
beojan
That's what the NATO phonetic alphabet is for.

~~~
kozak
In this case the phonetic alphabet will do the job only for English-speaking
countries (or at least those that are very well accustomed to Latin
characters).

~~~
beojan
The point of the NATO alphabet is everyone in NATO uses it, even if it isn't
their native alphabet, as in the case of Greece.

------
thelazydogsback
> decimal: 49894920630459842177293598641814316632

This 128-bit value can also be represented in, let's say, base-50K, by using
nine words chosen from a 50,000-word dictionary (each word carries about 15.6
bits). If you also make "this", "This" and "THIS" separate, then you can get
away with a 17K-word dictionary.
Depending on the language, if you use roots and then vary morphology based on
number, tense, etc., then the number of root words (and the choice you have in
making them simple) can be reduced. Such "pass phrases" can be easier to
remember, transcribe, etc. (Also you _will_ get random, humorous, offensive,
etc., phrases...)

~~~
strags
I recently needed to encode a 32-bit value into something easy for QA folks to
remember and report. I opted for 3 words out of an 11-bit (2048 entry)
dictionary of commonly used words.

How to build the dictionary? Well, in order to determine the most commonly
used English words, I downloaded a bunch of free texts from Project Gutenberg,
and did some simple filtering - nothing less than 5 letters, no duplication of
singular + plural, etc...

A valuable lesson that I learned during this process is that when your corpus
includes older English texts, you should always give your final list a visual
once-over and apply some judicious manual filtering. I'm looking at you, "The
Adventures of Tom Sawyer". (And, to a lesser extent, Moby Dick).
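
A rough sketch of the 3-words-from-2048 idea in Python. The word list here is a placeholder (`word0000` to `word2047`) standing in for a real, manually filtered list, and the function names are made up. Three 11-bit indices give 33 bits, so a 32-bit value fits with one bit to spare.

```python
# Placeholder dictionary: a real one would hold 2048 filtered common words.
WORDS = [f"word{i:04d}" for i in range(2048)]
INDEX = {w: i for i, w in enumerate(WORDS)}


def encode_words(value: int) -> str:
    """Split a 32-bit value into three 11-bit indices, highest bits first."""
    if not 0 <= value < 2**32:
        raise ValueError("value must fit in 32 bits")
    indices = [(value >> shift) & 0x7FF for shift in (22, 11, 0)]
    return " ".join(WORDS[i] for i in indices)


def decode_words(phrase: str) -> int:
    """Inverse of encode_words."""
    value = 0
    for word in phrase.split():
        value = (value << 11) | INDEX[word]
    return value


print(encode_words(0xDEADBEEF))
print(hex(decode_words(encode_words(0xDEADBEEF))))  # 0xdeadbeef
```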

~~~
Dylan16807
In most cases if you need a short list it's better to use something like the
diceware or EFF lists than to make your own from scratch.

~~~
cyphar
Or use the BIP39 lists since they also have 2048 entries (11 bits per word).
If you just use BIP39 you also get a checksum. RFC 1751[1] is the
"standardised" option but IMHO the wordlist they use is far too easy to
misread (though this is because the words are all at most 4 characters).

[1]:
[https://tools.ietf.org/html/rfc1751](https://tools.ietf.org/html/rfc1751)

------
earthboundkid
Crockford Base-32 is already perfect.
[https://www.crockford.com/base32.html](https://www.crockford.com/base32.html)

------
SlowRobotAhead
Not bad, I mean, I’m not lining up to implement it in C tomorrow, but if it
gets an RFC I could definitely see using it.

I have an application where I’m using a 32-bit serial in case someone has
to read it to sales staff over the phone. I would have liked to use 64 bits and
encode some more details into the serial. This would satisfy that.

I like the idea of removing ambiguous chars. I have a Base64 system that
prints in a font where I and l look the same (infuriating).

------
souenzzo
Looks like this [2017] [https://github.com/tonsky/compact-uuids](https://github.com/tonsky/compact-uuids)

------
rini17
What about Base1024 emoji encoding? There are already more than 1024 of them ;)

128 bits can then be represented by just 13 characters, or even fewer with
modifiers.

~~~
WorldMaker
It's certainly a fun idea. If the goal is human readable, humans are
surprisingly bad at differentiating emoji just by looking at them, especially
all of the subtly different variation face ones. Describing them over
something like a phone call could lead to all sorts of transcription mistakes.
Not to mention that there's a variety of different emoji input
systems/keyboards, and the amount of user skill in finding/picking emoji for
text entry is hugely variable.

------
Aardwolf
> The data length must be multiple of 32 bits. There is no padding mechanism
> in the encoder.

Such padding mechanism should not be necessary, and the padding from standard
base64 is also not necessary. If you remove the ==='s you can still
unambiguously decode it (despite the error some tools will give). URL-safe
base64 (RFC 4648 §5) does not require padding and can represent any data
length.
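
As a quick illustration in Python: the padding can be recomputed from the length alone, so a decoder can accept unpadded input by putting it back before calling the standard library. `b64decode_unpadded` is a made-up helper name.

```python
import base64


def b64decode_unpadded(text: str) -> bytes:
    """Decode URL-safe base64 whether or not the trailing '=' are present."""
    padding = "=" * (-len(text) % 4)  # 0, 1 or 2 characters for valid input
    return base64.urlsafe_b64decode(text + padding)


print(b64decode_unpadded("Zg"))    # b'f'
print(b64decode_unpadded("Zm8"))   # b'fo'
print(b64decode_unpadded("Zm9v"))  # b'foo'
```

A length of 1 modulo 4 can never be valid base64, so that case still raises an error.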

------
mildmelon
I've implemented this encoding format in Python,
[https://github.com/mildmelon/pybase24](https://github.com/mildmelon/pybase24)

------
bograt
In a similar vein, this is an encoding I designed specifically for 256 bit
keys; my design includes checksumming and some consideration to consistent
verbalization:

[https://github.com/tomgibara/keycode](https://github.com/tomgibara/keycode)

------
garganzol
An important note to the author: please put test vectors right to the spec!

The classics like "", "f", "fo", "foo", ..., "foobar" would suffice. If the
encoding specifically works on numbers, put test vectors for those too.

------
aabbcc1241
The author mentioned that it's confusing even when only one of a pair of
similar characters is used (as in base 10), but the program can automatically
resolve such typos (e.g. treat O as zero).

This doesn't require the user to be technical.

------
layoutIfNeeded
Base24 is already the name of a pretty old payment processing standard:
[https://en.m.wikipedia.org/wiki/BASE24](https://en.m.wikipedia.org/wiki/BASE24)

------
Robin_Message
Is the ambiguity of 1Ii a problem though? Can't your input routine just
normalise those to, say, I? It is slightly confusing to enter though.

Also, any string like this should have at least a check digit and ideally some
ECC digits.

------
m0netize
Similar idea to base 58, introduced with Bitcoin.

[https://en.wikipedia.org/wiki/Base58](https://en.wikipedia.org/wiki/Base58)

------
nullc
mod 24 isn't a field, so it's not easy to add good error protection to base24
using a regular cyclic code.

You can add a single check digit with good performance using the Damm
algorithm:
[https://en.wikipedia.org/wiki/Damm_algorithm](https://en.wikipedia.org/wiki/Damm_algorithm)
one of the external links on that article has a suitable quasigroup matrix for
Z_24.
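
To spell out the first point with a few lines of Python: in the integers mod 24 only the residues coprime to 24 have multiplicative inverses, and the rest are zero divisors, so the structure is a ring but not a field. (No field with 24 elements exists at all, since 24 is not a prime power.) The helper names are made up for illustration.

```python
from math import gcd


def units(m: int) -> list[int]:
    """Residues with a multiplicative inverse mod m (those coprime to m)."""
    return [a for a in range(1, m) if gcd(a, m) == 1]


def zero_divisors(m: int) -> list[int]:
    """Nonzero residues a with a*b == 0 (mod m) for some nonzero b."""
    return [a for a in range(1, m) if gcd(a, m) != 1]


print(units(24))               # [1, 5, 7, 11, 13, 17, 19, 23]
print(len(zero_divisors(24)))  # 15
print((4 * 6) % 24)            # 0, although neither factor is 0
print(len(units(23)))          # 22: every nonzero residue mod a prime is a unit
```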

------
jhvkjhk
I wonder whether, in the future, we will use these baseN methods to
represent numbers in everyday life. With the data explosion, I think we
will have more occasion to describe really big numbers than today, so will we
eventually abandon decimal and speak and write in this base24
method or some upcoming base128 or base1024?

I know there are prefixes like Mega or Giga that describe how big a number is
in decimal, but the baseN methods can do better (represent bigger numbers)
where N > 10. So will we shift to these methods?

------
kuon
I published the article in a bit of a hurry before going to bed and managed to
put the wrong alphabet in it.

I am sorry, and I updated it.

