> Why not use an existing character set like RFC3548 or z-base-32? The character set is chosen to minimize ambiguity according to this visual similarity data, and the ordering is chosen to minimize the number of pairs of similar characters (according to the same data) that differ in more than 1 bit. As the checksum is chosen to maximize detection capabilities for low numbers of bit errors, this choice improves its performance under some error models.
> We have permuted the alphabet to make the more commonly occurring characters also be those that we think are easier to read, write, speak, and remember.
Basically they removed vowels (except for y, if it counts as one), as non-vowel letters often include a vowel in their sound, a fact reinforced while teaching my toddler daughter letters, words, and numbers. On top of that, they removed l/1/i (and also o/0), m/n, s/5, and z. Not sure why they removed z. Perhaps because of 2?
I'm not sure this is universal either, sound-wise. I suppose it does count for English. Because 7 ("zeven") and 9 ("negen") in Dutch get confused when spoken, some people say "zeuven" instead of "zeven".
> edit: to add, an interesting human-readable and memorable base52 alphabet that I've never found a use for is to use playing cards
The most famous example is Schneier's Solitaire [1]. It was a common encoding I liked to play with in HS classes even before reading Cryptonomicon. I still think about it sometimes when I read through a Duplicate Bridge story in that long syndicated newspaper column. (One of these days I will actually learn Bridge, maybe.)
> edit: to add, an interesting human-readable and memorable base52 alphabet that I've never found a use for is to use playing cards
I encountered something similar to this a while ago when watching a multiplayer mod of Ocarina of Time[0]. They use a string of inventory item symbols to denote the identity of the server to connect to. “Hook shot, hook shot, master sword, deku nut” is a whole lot easier to remember than a long string of ASCII.
Microsoft's Base24 has a much saner alphabet, as it avoids ambiguities between the 5 and S and 7 and Z symbols, while the alphabet suggested by the OP falls into that pitfall.
It would make sense to ensure that you don't have B and 8 in the same alphabet. Just as you don't want 1 and I, or 0 and O - pick any one of them, but not both.
I'm fond of using a base100, made up of 2-letter syllables. It results in a vaguely pronounceable string.
For syllables, I use:
syllables: %w[
ba be bi bo bu
ca ce ci co cu
da de di do du
fa fe fi fo fu
ga ge gi go gu
ha he hi ho hu
ja je ji jo ju
ka ke ki ko ku
la le li lo lu
ma me mi mo mu
na ne ni no nu
pa pe pi po pu
ra re ri ro ru
sa se si so su
ta te ti to tu
va ve vi vo vu
wa we wi wo wu
xa xe xi xo xu
ya ye yi yo yu
za ze zi zo zu
],
I can dump an implementation somewhere if people are really curious
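In the meantime, here is a minimal sketch of how such an encoder could look (my own guess, reusing the syllable table above; not necessarily the poster's actual code):

    # Minimal sketch of a base100 syllable codec, reusing the table above:
    # 20 consonants x 5 vowels = 100 two-letter syllables, already in sorted order.
    CONSONANTS = "bcdfghjklmnprstvwxyz"
    VOWELS = "aeiou"
    SYLLABLES = [c + v for c in CONSONANTS for v in VOWELS]

    def encode(data: bytes) -> str:
        # Treat the input as one big integer and peel off base-100 digits.
        # (Loses leading zero bytes; good enough for a sketch.)
        n = int.from_bytes(data, "big")
        out = []
        while True:
            n, digit = divmod(n, 100)
            out.append(SYLLABLES[digit])
            if n == 0:
                break
        return "".join(reversed(out))

    print(encode(b"\x12\x34\x56\x78"))  # -> "bocalezoze"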
This is very similar to Dominic O’Brien’s techniques for remembering numbers where he assigns each number in base 10 a letter, then you go through and make characters and actions for each number 0 - 99.
So 0 - O, 1 - A, 2 - B... (with S for 6, and N for 9).
Then 00 -> Double O holding a pistol
OA - Your friend Oliver Anderson doing whatever Oliver Anderson does
Etc.
Then 0100 becomes your friend Oliver Anderson holding a pistol in your imagination, which is easier to remember, and you can make stories with your characters to remember phone numbers etc.
Once I started this comment I realized it may be just a tad bit more involved, but I wonder if you could combine the two and have characters for every number from 0000 - 9999.
> I can dump an implementation somewhere if people are really curious
Interested.
I've put together a few encoding libraries for fun when I get bored. (base16, morse, etc.)
This one looks fun, particularly because it _might_ be possible to serialise it to sound and back, if I put in a little bit of effort, which is something I've done [0] once or twice.
It implements two dialects: an original one compatible with the inspiration, used in some of our earlier products, and the replacement (the syllables above), which is alphabetically sortable.
The postalveolar approximant (r in "red" in General American) is so weird linguistically.
Also, it's always amused me how dog noises are rendered onomatopoeically in GA English as "bark" or "woof", when dogs lack the lips to make a labial plosive, and their tongues can't really form proper postalveolar approximants or velar stops. I think it has to do with how we hear the third formant.
I guess I'd transcribe it like...
/ɚa◌˞'/
It's almost like "rorch" but more glottal, less velar.
What are you talking about? The symbol table has 100 symbols. A phoneme is a symbol, not a letter. Just because a symbol comprises 2 ASCII letters does not make it base10. That's like saying 1 and 0 are symbols in Manchester code.
You could (almost) easily replace every symbol with a single Unicode rune from a syllabary like katakana/hiragana (you'd need to pull from several languages, as Japanese famously lacks a distinction between la-li-lu-le-lo and ra-ri-ru-re-ro (らりるれろ)), but there's no reason why you couldn't encode one rune per symbol.
Another interesting solution to this problem is that used by plus codes [1]:
> The characters that are used in Open Location Codes were chosen by computing all possible 20 character combinations from 0-9A-Z and scoring them on how well they spell 10,000 words from over 30 languages. This was to avoid, as far as possible, Open Location Codes being generated that included recognisable words. The selected 20 character set is made up of "23456789CFGHJMPQRVWX". [2]
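A toy version of that scoring idea (my own sketch, not Google's actual tooling; the real word list is the 10,000-word multilingual one they describe):

    # Score a candidate 20-character alphabet by how many words from a
    # blocklist could be spelled using only those characters (lower is better).
    def score(alphabet: str, words: list[str]) -> int:
        allowed = set(alphabet.upper())
        return sum(1 for w in words if set(w.upper()) <= allowed)

    # The chosen alphabet spells none of these, which is exactly the point:
    print(score("23456789CFGHJMPQRVWX", ["FOG", "WAX", "HACK"]))  # -> 0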
> The final alphabet I came up with is ZAC2B3EF4NH5TKL7P8RS9WXY. As I required 24 characters, I kept G and 6 which are the least ambiguous in the list.
I've read this a dozen times. Isn't OP saying that their character list includes G and 6, which are _not_ present in that list?
Update:
It appears to be a typo in the article. Here's the real alphabet (N replaced by G and L replaced by 6):
ZAC2B3EF4GH5TK67P8RS9WXY
It would be better to include some lower case characters, which have more visual variability, than to obsess over an arbitrary, inflexible stylistic "design."
Though clearly there are some advantages to removing ambiguous chars... I feel like it's more of a UI / UX thing-to-polish than a problem. Lack of polish creates the problem; the ambiguous chars themselves are not inherently an issue.
If it's ambiguous, you could accept either and transform it to the correct value (implicitly, or as entered, or whenever makes sense; your users don't ever have to know). Or if you can't do that / the differences matter, do something like 1Password does with chars and letters: show them differently https://www.dropbox.com/s/a29g2uiggqujzjl/screen%20shot%2020...
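For instance, a rough sketch of that normalization (assuming a hypothetical alphabet that keeps 0 and 1 but drops O, I and L):

    # Fold visually ambiguous characters onto the ones the alphabet actually
    # uses before decoding, so either form is accepted from the user.
    FOLD = str.maketrans("OIL", "011")

    def normalize(code: str) -> str:
        return code.upper().translate(FOLD)

    print(normalize("o1l0-IlI0"))  # -> "0110-1110"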
> do something like 1Password does with chars and letters: show them differently
That’s missing the point. You can show them differently, but the point of keys / recovery codes is that they’ll be stored somewhere and later re-entered. Users could store them in any program (including writing them down or printing them out), and you can’t control how they are displayed there. Then when they need to use them, there’s a chance the ambiguous characters can’t be easily discerned.
Since you can't control the display there, but you can control how it's interpreted, you make it a non-issue by mapping them to the same thing in whatever is consuming the input.
Or just try all combinations; unless they entered o0o0o0o0o0o0o0o0o0o0o0, you're probably only going to have to try a small handful.
With my old shareware product that really did not sell a lot, I got one phone call from a customer who was not able to enter the correct license code. Of course he had mixed up 0 and O. So yes, for some people it solves a problem.
I really like the super high efficiency at the important multiple-of-4-byte increments. Using 7 base-24 characters to encode 32 bits is 99.7% efficient. However, I'd recommend using 7 base-24 digits followed by a blank as standard output format. This would allow for efficient 8 character <=> 32 bit conversions. Also, I think padding output to a multiple of 7 characters would be good, for similar reasons that it's good for base-64. Now you can concatenate encoded streams like you could byte streams, and recover on decode. As multiples of 32 bits are so common, padding would be used little in practice. On input, it would be fine to accept unpadded base-24 sequences, but valid base-24 output should always pad to a multiple of 7 chars (excluding the blanks that should be just for readability and not significant otherwise).
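For reference, the arithmetic behind that 99.7% figure (a quick sanity check, not from the article):

    import math

    # 7 base-24 characters can represent just over 2^32 values,
    # i.e. they carry 7 * log2(24) ≈ 32.09 bits:
    assert 24**7 >= 2**32               # 4586471424 >= 4294967296
    print(32 / (7 * math.log2(24)))     # ≈ 0.997, the ~99.7% efficiency above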
However, I strongly dislike the arbitrary mapping between character values and base-24 digits. There is a strong reason for using the sorted order 23456789ABCEFGHKPRSTWXYZ, which is that encoded values then compare the same as the original binary values. I did appreciate the 0x00000000 == ZZZZZZZ equivalence, but consistent ordering is just way more important IMO. Also, 2222222 looks a lot like ZZZZZZZ. Just saying.
I thought about the comparison bit, and I decided to go against it.
Ordered, your snippet looks like the alphabet with a few missing letters, and isn't searchable on Google or anything. I really wanted the alphabet to stand out.
I don't think it's important that it can be sorted; it's intended for randomly generated keys, which in my experience you won't be sorting.
I think proquints [0] are pretty good at encoding for humans as well. For example, when used to encode IP addresses, they result in pronounceable identifiers.
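For context, a rough sketch of the proquint scheme as I read the spec (each 16-bit word becomes a consonant-vowel-consonant-vowel-consonant group; worth double-checking against the original before relying on it):

    # Sketch of proquint encoding: 4-bit consonants and 2-bit vowels.
    CONS = "bdfghjklmnprstvz"   # 16 consonants
    VOWS = "aiou"               # 4 vowels

    def proquint16(word: int) -> str:
        return (CONS[(word >> 12) & 0xF] + VOWS[(word >> 10) & 0x3] +
                CONS[(word >> 6) & 0xF] + VOWS[(word >> 4) & 0x3] +
                CONS[word & 0xF])

    def proquint_ipv4(ip: str) -> str:
        a, b, c, d = (int(x) for x in ip.split("."))
        return "-".join(proquint16(w) for w in ((a << 8) | b, (c << 8) | d))

    print(proquint_ipv4("127.0.0.1"))  # should give "lusab-babad" if this matches the spec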
Nice! I'd like to implement this in a key-recovery tool I have been working on, Passcrux [1]. I actually started fleshing out a base24 encoding of my own, but the padding/bit shuffling proved to be somewhat cumbersome, and I shifted focus to abc16, which is like hex, but purely alphabetic.
If you consider the scenario of dictating over the phone, letters can be confusing not just because of their written shape. For many non-English speakers, for example, E can be confused with I, and V can be confused with W, unless both sides use the same way of pronouncing them. Look how Microsoft's base24 alphabet (from the other comment) has neither E nor I.
In this case the phonetic alphabet will do the job only for English-speaking countries (or at least those that are very well accustomed to Latin characters).
This 128 bits can also be represented in, let's say, base-50K by using about nine words chosen from a 50,000-word dictionary (each word carries log2(50,000) ≈ 15.6 bits). If you also make "this", "This" and "THIS" separate, then you can get away with a 17K word dictionary. Depending on the language, if you use roots and then vary morphology based on number, tense, etc., then the number of root words (and the choice you have in making them simple) can be reduced. Such "pass phrases" can be easier to remember, transcribe, etc. (Also you will get random, humorous, offensive, etc., phrases...)
I recently needed to encode a 32-bit value into something easy for QA folks to remember and report. I opted for 3 words out of an 11-bit (2048 entry) dictionary of commonly used words.
How to build the dictionary? Well, in order to determine the most commonly used English words, I downloaded a bunch of free texts from Project Gutenberg, and did some simple filtering - nothing less than 5 letters, no duplication of singular + plural, etc...
A valuable lesson that I learned during this process is that when your corpus includes older English texts, you should always give your final list a visual once-over and apply some judicious manual filtering. I'm looking at you, "The Adventures of Tom Sawyer". (And, to a lesser extent, Moby Dick).
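Presumably something along these lines (a sketch of the 11-bits-per-word idea; WORDS is a stand-in for the real filtered list):

    # 3 words x 11 bits = 33 bits, so three common words cover a 32-bit value.
    WORDS = [f"word{i:04d}" for i in range(2048)]  # placeholder vocabulary

    def encode32(n: int) -> str:
        assert 0 <= n < 2**32
        return " ".join(WORDS[(n >> shift) & 0x7FF] for shift in (22, 11, 0))

    def decode32(phrase: str) -> int:
        a, b, c = (WORDS.index(w) for w in phrase.split())
        return (a << 22) | (b << 11) | c

    assert decode32(encode32(0xDEADBEEF)) == 0xDEADBEEF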
Or use the BIP39 lists, since they are also 2048 words (11 bits per word). If you just use BIP39 you also get a checksum. RFC 1751[1] is the "standardised" option, but IMHO the wordlist it uses is far too easy to misread (though this is because the words are all at most 4 characters).
Not bad, I mean, I’m not lining up to implement it in C tomorrow, but if it gets an RFC I could definitely see using it.
I have an application where I’m using a 32-bit serial, for the event that someone has to read it to sales staff over the phone. I would have liked to use 64-bit and encode some more details into the serial. This would satisfy that.
I like the idea of removing ambiguous chars. I have a Base64 system whose output gets printed in a font where I and l look the same (infuriating).
It's certainly a fun idea. If the goal is human readable, humans are surprisingly bad at differentiating emoji just by looking at them, especially all of the subtly different variation face ones. Describing them over something like a phone call could lead to all sorts of transcription mistakes. Not to mention that there's a variety of different emoji input systems/keyboards and the amount of user skill in finding/picking emoji for text entry are hugely variable.
> The data length must be multiple of 32 bits. There is no padding mechanism in the encoder.
Such a padding mechanism should not be necessary, and the padding from standard base64 is also not necessary. If you remove the ==='s you can still unambiguously decode it (despite the error some tools will give). URL-safe base64 (RFC 4648 §5) does not require padding and can represent any data length.
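For example, with Python's standard library you can drop the padding on encode and restore it mechanically on decode (sketch):

    import base64

    def b64url_encode(data: bytes) -> str:
        # Strip the '=' padding; the length mod 4 is enough to restore it later.
        return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

    def b64url_decode(s: str) -> bytes:
        return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

    assert b64url_decode(b64url_encode(b"any length works")) == b"any length works"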
In a similar vein, this is an encoding I designed specifically for 256 bit keys; my design includes checksumming and some consideration to consistent verbalization:
The author mentioned that it's confusing even when only one of a pair of similar characters is used (as in base10), but in that case the program can indeed automatically resolve the typo (e.g. treat O as zero).
mod 24 isn't a field, so it's not easy to add good error protection to base24 using a regular cyclic code.
You can add a single check digit with good performance using the Damm algorithm (https://en.wikipedia.org/wiki/Damm_algorithm); one of the external links on that article has a suitable quasigroup matrix for Z_24.
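For reference, the Damm construction itself is tiny once you have such a quasigroup table (a sketch; the order-24 table would have to come from the resource linked on that page):

    # Damm check digit over an arbitrary totally antisymmetric quasigroup.
    # table[i][j] is the quasigroup product; the zero diagonal makes
    # verification reduce to "the chain ends at 0".
    def damm_check(digits: list[int], table: list[list[int]]) -> int:
        interim = 0
        for d in digits:
            interim = table[interim][d]
        return interim  # append this value as the check digit

    def damm_verify(digits_with_check: list[int], table: list[list[int]]) -> bool:
        return damm_check(digits_with_check, table) == 0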
I'm wondering whether, in the future, we will use these baseN methods to represent numbers in everyday life. With the data explosion, I think we will have more occasion to describe really big numbers than we do today, so maybe we will eventually abandon decimal and speak and write in this base24 method or some upcoming base128 or base1024?
I know there are prefixes like mega or giga that can describe how big a number is in decimal, but a baseN method where N > 10 can do better (represent bigger numbers in fewer characters). So will we shift to these methods?
> B C D F G H J K M P Q R T V W X Y 2 3 4 6 7 8 9
they were 115 bits encoded in 25 characters
see also human-oriented base32 encoding:
https://philzimmermann.com/docs/human-oriented-base-32-encod...
which includes this nice trick:
> We have permuted the alphabet to make the more commonly occurring characters also be those that we think are easier to read, write, speak, and remember.
edit: to add, an interesting human-readable and memorable base52 alphabet that I've never found a use for is to use playing cards