> Why not use an existing character set like RFC3548 or z-base-32? The character set is chosen to minimize ambiguity according to this visual similarity data, and the ordering is chosen to minimize the number of pairs of similar characters (according to the same data) that differ in more than 1 bit. As the checksum is chosen to maximize detection capabilities for low numbers of bit errors, this choice improves its performance under some error models.
> We have permuted the alphabet to make the more commonly occurring characters also be those that we think are easier to read, write, speak, and remember.
Basically they removed vowels (except for y, if it counts as one), as non-vowel letters often include a vowel in their sound, a fact reinforced while teaching my toddler daughter letters, words, and numbers. On top of that, they removed l/1/i (and also o/0), m/n, s/5, and z. Not sure why they removed z. Perhaps because of 2?
I'm not sure this is universal either, sound-wise. I suppose it does count for English. Because 7 ("zeven") and 9 ("negen") in Dutch get confused when spoken, some people say "zeuven" instead of "zeven".
> edit: to add, an interesting human-readable and memorable base52 alphabet that I've never found a use for is to use playing cards
The most famous example is Schneier's Solitaire [1]. It was a common encoding I liked to play with in HS classes even before reading Cryptonomicon. I still think about it sometimes when I read through a Duplicate Bridge story in that long syndicated newspaper column. (One of these days I will actually learn Bridge, maybe.)
> edit: to add, an interesting human-readable and memorable base52 alphabet that I've never found a use for is to use playing cards
I encountered something similar to this a while ago when watching a multiplayer mod of Ocarina of Time[0]. They use a string of inventory item symbols to denote the identity of the server to connect to. “Hook shot, hook shot, master sword, deku nut” is a whole lot easier to remember than a long string of ASCII.
Microsoft's Base24 has a much saner alphabet, as it avoids ambiguities between the 5 and S and 7 and Z symbols, while the alphabet suggested by the OP falls into that pitfall.
It would make sense to ensure that you don't have B and 8 in the same alphabet. Just as you don't want 1 and I, or 0 and O - pick any one of them, but not both.
I'm fond of using a base100, made up of 2-letter syllables. It results in a vaguely pronounceable string.
For syllables, I use:
syllables: %w[
ba be bi bo bu
ca ce ci co cu
da de di do du
fa fe fi fo fu
ga ge gi go gu
ha he hi ho hu
ja je ji jo ju
ka ke ki ko ku
la le li lo lu
ma me mi mo mu
na ne ni no nu
pa pe pi po pu
ra re ri ro ru
sa se si so su
ta te ti to tu
va ve vi vo vu
wa we wi wo wu
xa xe xi xo xu
ya ye yi yo yu
za ze zi zo zu
],
I can dump an implementation somewhere if people are really curious
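In the meantime, here is a minimal sketch of how such an encoder could look (my own guess, reusing the syllable table above; not necessarily the poster's actual code):

    # Minimal sketch of a base100 syllable codec, reusing the table above:
    # 20 consonants x 5 vowels = 100 two-letter syllables, already in sorted order.
    CONSONANTS = "bcdfghjklmnprstvwxyz"
    VOWELS = "aeiou"
    SYLLABLES = [c + v for c in CONSONANTS for v in VOWELS]

    def encode(data: bytes) -> str:
        # Treat the input as one big integer and peel off base-100 digits.
        # (Loses leading zero bytes; good enough for a sketch.)
        n = int.from_bytes(data, "big")
        out = []
        while True:
            n, digit = divmod(n, 100)
            out.append(SYLLABLES[digit])
            if n == 0:
                break
        return "".join(reversed(out))

    print(encode(b"\x12\x34\x56\x78"))  # -> "bocalezoze"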
This is very similar to Dominic O’Brien’s techniques for remembering numbers where he assigns each number in base 10 a letter, then you go through and make characters and actions for each number 0 - 99.
So 0 - O, 1 - A, 2 - B... (with S for 6, and N for 9).
Then 00 -> Double O holding a pistol
OA - Your friend Oliver Anderson doing whatever Oliver Anderson does
Etc.
Then 0100 becomes your friend Oliver Anderson holding a pistol in your imagination, which is easier to remember, and you can make stories with your characters to remember phone numbers etc.
Once I started this comment I realized it may be just a tad bit more involved, but I wonder if you could combine the two and have characters for every number from 0000 - 9999.
> I can dump an implementation somewhere if people are really curious
Interested.
I've put together a few encoding libraries for fun when I get bored. (base16, morse, etc.)
This one looks fun, particularly because it _might_ be possible to serialise it to sound and back, if I put in a little bit of effort, which is something I've done [0] once or twice.
It implements two dialects: an original one compatible with the inspiration, used in some of our earlier products, and the replacement (the syllables above), which is alphabetically sortable.
The postalveolar approximant (r in "red" in General American) is so weird linguistically.
Also, it's always amused me how dog noises are rendered onomatopoeically in GA English as "bark" or "woof", when dogs lack the lips to make a labial plosive, and their tongues can't really form proper postalveolar approximants or velar stops. I think it has to do with how we hear the third formant.
I guess I'd transcribe it like...
/ɚa◌˞'/
It's almost like "rorch" but more glottal, less velar.
What are you talking about? The symbol table has 100 symbols. A phoneme is a symbol, not a letter. Just because a symbol comprises 2 ASCII letters does not make it base10. That's like saying 1 and 0 are symbols in Manchester code.
You could (almost) easily replace every symbol with a single Unicode rune from a syllabary like katakana/hiragana (you'd need to pull from several languages, as Japanese famously lacks a distinction between la-li-lu-le-lo and ra-ri-ru-re-ro (らりるれろ)), but there's no reason why you couldn't encode one rune per symbol.
Another interesting solution to this problem is that used by plus codes [1]:
> The characters that are used in Open Location Codes were chosen by computing all possible 20 character combinations from 0-9A-Z and scoring them on how well they spell 10,000 words from over 30 languages. This was to avoid, as far as possible, Open Location Codes being generated that included recognisable words. The selected 20 character set is made up of "23456789CFGHJMPQRVWX". [2]
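A toy version of that scoring idea (my own sketch, not Google's actual tooling; the real word list is the 10,000-word multilingual one they describe):

    # Score a candidate 20-character alphabet by how many words from a
    # blocklist could be spelled using only those characters (lower is better).
    def score(alphabet: str, words: list[str]) -> int:
        allowed = set(alphabet.upper())
        return sum(1 for w in words if set(w.upper()) <= allowed)

    # The chosen alphabet spells none of these, which is exactly the point:
    print(score("23456789CFGHJMPQRVWX", ["FOG", "WAX", "HACK"]))  # -> 0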
> The final alphabet I came up with is ZAC2B3EF4NH5TKL7P8RS9WXY. As I required 24 characters, I kept G and 6 which are the least ambiguous in the list.
I've read this a dozen times. Isn't OP saying that their character list includes G and 6, which are _not_ present in that list?
Update:
It appears to be a typo in the article. Here's the real alphabet (N replaced by G and L replaced by 6):
ZAC2B3EF4GH5TK67P8RS9WXY
It would be better to include some lower case characters, which have more visual variability, than to obsess over an arbitrary, inflexible stylistic "design."
Though clearly there are some advantages to removing ambiguous chars... I feel like it's more of a UI / UX thing-to-polish than a problem. Lack of polish creates the problem; the ambiguous chars themselves are not inherently an issue.
If it's ambiguous, you could accept either and transform it to the correct value (implicitly, or as entered, or whenever makes sense; your users don't ever have to know). Or if you can't do that / the differences matter, do something like 1Password does with chars and letters: show them differently https://www.dropbox.com/s/a29g2uiggqujzjl/screen%20shot%2020...
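For instance, a rough sketch of that normalization (assuming a hypothetical alphabet that keeps 0 and 1 but drops O, I and L):

    # Fold visually ambiguous characters onto the ones the alphabet actually
    # uses before decoding, so either form is accepted from the user.
    FOLD = str.maketrans("OIL", "011")

    def normalize(code: str) -> str:
        return code.upper().translate(FOLD)

    print(normalize("o1l0-IlI0"))  # -> "0110-1110"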
> do something like 1Password does with chars and letters: show them differently
That’s missing the point. You can show them differently, but the point of keys / recovery codes is that they’ll be stored somewhere and later re-entered. Users could store them in any program (including writing them down or printing them out), and you can’t control how they are displayed there. Then when they need to use them, there’s a chance the ambiguous characters can’t be easily discerned.
Since you can't control the display there, but you can control how it's interpreted, you make it a non-issue by mapping them to the same thing in whatever is consuming the input.
Or just try all combinations; unless they entered o0o0o0o0o0o0o0o0o0o0o0, you're probably only going to have to try a small handful.
With my old shareware product that really did not sell a lot, I got one phone call from a customer who was not able to enter the correct license code. Of course he had mixed up 0 and O. So yes, for some people it solves a problem.
I really like the super high efficiency at the important multiple-of-4-byte increments. Using 7 base-24 characters to encode 32 bits is 99.7% efficient. However, I'd recommend using 7 base-24 digits followed by a blank as standard output format. This would allow for efficient 8 character <=> 32 bit conversions. Also, I think padding output to a multiple of 7 characters would be good, for similar reasons that it's good for base-64. Now you can concatenate encoded streams like you could byte streams, and recover on decode. As multiples of 32 bits are so common, padding would be used little in practice. On input, it would be fine to accept unpadded base-24 sequences, but valid base-24 output should always pad to a multiple of 7 chars (excluding the blanks that should be just for readability and not significant otherwise).
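For reference, the arithmetic behind that 99.7% figure (a quick sanity check, not from the article):

    import math

    # 7 base-24 characters can represent just over 2^32 values,
    # i.e. they carry 7 * log2(24) ≈ 32.09 bits:
    assert 24**7 >= 2**32               # 4586471424 >= 4294967296
    print(32 / (7 * math.log2(24)))     # ≈ 0.997, the ~99.7% efficiency above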
However, I strongly dislike the arbitrary mapping between character values and base-24 digits. There is a strong reason for using the sorted order 23456789ABCEFGHKPRSTWXYZ, which is that encoded values then compare the same as the original binary values. I did appreciate the 0x00000000 == ZZZZZZZ equivalence, but consistent ordering is just way more important IMO. Also, 2222222 looks a lot like ZZZZZZZ. Just saying.
I thought about the comparison bit, and I decided to go against it.
Ordered, your snippet looks like the alphabet with a few missing letters, and isn't searchable on Google or anything. I really wanted the alphabet to stand out.
I don't think it's important that it can be sorted; it's intended for randomly generated keys, which in my experience you won't be sorting.
I think proquints [0] are pretty good at encoding for humans as well. For example, when used to encode IP addresses, they result in pronounceable identifiers.
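For context, a rough sketch of the proquint scheme as I read the spec (each 16-bit word becomes a consonant-vowel-consonant-vowel-consonant group; worth double-checking against the original before relying on it):

    # Sketch of proquint encoding: 4-bit consonants and 2-bit vowels.
    CONS = "bdfghjklmnprstvz"   # 16 consonants
    VOWS = "aiou"               # 4 vowels

    def proquint16(word: int) -> str:
        return (CONS[(word >> 12) & 0xF] + VOWS[(word >> 10) & 0x3] +
                CONS[(word >> 6) & 0xF] + VOWS[(word >> 4) & 0x3] +
                CONS[word & 0xF])

    def proquint_ipv4(ip: str) -> str:
        a, b, c, d = (int(x) for x in ip.split("."))
        return "-".join(proquint16(w) for w in ((a << 8) | b, (c << 8) | d))

    print(proquint_ipv4("127.0.0.1"))  # should give "lusab-babad" if this matches the spec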
Nice! I'd like to implement this in a key-recovery tool I have been working on, Passcrux [1]. I actually started fleshing out a base24 encoding of my own, but the padding/bit shuffling proved to be somewhat cumbersome, and I shifted focus to abc16, which is like hex, but purely alphabetic.
If you consider the scenario of dictating over the phone, letters can be confusing not just because of their written shape. For many non-English speakers, for example, E can be confused with I, and V can be confused with W, unless both sides use the same way of pronouncing them. Look how Microsoft's base24 alphabet (from the other comment) has neither E nor I.
In this case the phonetic alphabet will do the job only for English-speaking countries (or at least those that are very well accustomed to Latin characters).
This 128 bits can also be represented in, let's say, base-50K by using about nine words chosen from a 50,000-word dictionary (each word carries log2(50,000) ≈ 15.6 bits). If you also make "this", "This" and "THIS" separate, then you can get away with a 17K word dictionary. Depending on the language, if you use roots and then vary morphology based on number, tense, etc., then the number of root words (and the choice you have in making them simple) can be reduced. Such "pass phrases" can be easier to remember, transcribe, etc. (Also you will get random, humorous, offensive, etc., phrases...)
I recently needed to encode a 32-bit value into something easy for QA folks to remember and report. I opted for 3 words out of an 11-bit (2048 entry) dictionary of commonly used words.
How to build the dictionary? Well, in order to determine the most commonly used English words, I downloaded a bunch of free texts from Project Gutenberg, and did some simple filtering - nothing less than 5 letters, no duplication of singular + plural, etc...
A valuable lesson that I learned during this process is that when your corpus includes older English texts, you should always give your final list a visual once-over and apply some judicious manual filtering. I'm looking at you, "The Adventures of Tom Sawyer". (And, to a lesser extent, Moby Dick).
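Presumably something along these lines (a sketch of the 11-bits-per-word idea; WORDS is a stand-in for the real filtered list):

    # 3 words x 11 bits = 33 bits, so three common words cover a 32-bit value.
    WORDS = [f"word{i:04d}" for i in range(2048)]  # placeholder vocabulary

    def encode32(n: int) -> str:
        assert 0 <= n < 2**32
        return " ".join(WORDS[(n >> shift) & 0x7FF] for shift in (22, 11, 0))

    def decode32(phrase: str) -> int:
        a, b, c = (WORDS.index(w) for w in phrase.split())
        return (a << 22) | (b << 11) | c

    assert decode32(encode32(0xDEADBEEF)) == 0xDEADBEEF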
Or use the BIP39 lists, since they are also 2048 words (11 bits per word). If you just use BIP39 you also get a checksum. RFC 1751[1] is the "standardised" option, but IMHO the wordlist it uses is far too easy to misread (though this is because the words are all at most 4 characters).
Not bad, I mean, I’m not lining up to implement it in C tomorrow, but if it gets an RFC I could definitely see using it.
I have an application where I’m using a 32-bit serial, for the event that someone has to read it to sales staff over the phone. I would have liked to use 64-bit and encode some more details into the serial. This would satisfy that.
I like the idea of removing ambiguous chars. I have a Base64 system whose output gets printed in a font where I and l look the same (infuriating).
It's certainly a fun idea. If the goal is human readable, humans are surprisingly bad at differentiating emoji just by looking at them, especially all of the subtly different variation face ones. Describing them over something like a phone call could lead to all sorts of transcription mistakes. Not to mention that there's a variety of different emoji input systems/keyboards and the amount of user skill in finding/picking emoji for text entry are hugely variable.
> The data length must be multiple of 32 bits. There is no padding mechanism in the encoder.
Such a padding mechanism should not be necessary, and the padding from standard base64 is also not necessary. If you remove the ==='s you can still unambiguously decode it (despite the error some tools will give). URL-safe base64 (RFC 4648 §5) does not require padding and can represent any data length.
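For example, with Python's standard library you can drop the padding on encode and restore it mechanically on decode (sketch):

    import base64

    def b64url_encode(data: bytes) -> str:
        # Strip the '=' padding; the length mod 4 is enough to restore it later.
        return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

    def b64url_decode(s: str) -> bytes:
        return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

    assert b64url_decode(b64url_encode(b"any length works")) == b"any length works"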
In a similar vein, this is an encoding I designed specifically for 256 bit keys; my design includes checksumming and some consideration to consistent verbalization:
The author mentioned that it's confusing even when only one of a pair of similar characters is used (as in base10), but in that case the program can indeed automatically resolve the typo (e.g. treat O as zero).
mod 24 isn't a field, so it's not easy to add good error protection to base24 using a regular cyclic code.
You can add a single check digit with good performance using the Damm algorithm (https://en.wikipedia.org/wiki/Damm_algorithm); one of the external links on that article has a suitable quasigroup matrix for Z_24.
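For reference, the Damm construction itself is tiny once you have such a quasigroup table (a sketch; the order-24 table would have to come from the resource linked on that page):

    # Damm check digit over an arbitrary totally antisymmetric quasigroup.
    # table[i][j] is the quasigroup product; the zero diagonal makes
    # verification reduce to "the chain ends at 0".
    def damm_check(digits: list[int], table: list[list[int]]) -> int:
        interim = 0
        for d in digits:
            interim = table[interim][d]
        return interim  # append this value as the check digit

    def damm_verify(digits_with_check: list[int], table: list[list[int]]) -> bool:
        return damm_check(digits_with_check, table) == 0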
I'm wondering whether, in the future, we will use these baseN methods to represent numbers in everyday life. With the data explosion, I think we will have more occasion to describe really big numbers than we do today, so maybe we will eventually abandon decimal and speak and write in this base24 method or some upcoming base128 or base1024?
I know there are prefixes like mega or giga that can describe how big a number is in decimal, but a baseN method where N > 10 can do better (represent bigger numbers in fewer characters). So will we shift to these methods?
> B C D F G H J K M P Q R T V W X Y 2 3 4 6 7 8 9
they were 115 bits encoded in 25 characters
see also human-oriented base32 encoding:
https://philzimmermann.com/docs/human-oriented-base-32-encod...
which includes this nice trick:
> We have permuted the alphabet to make the more commonly occurring characters also be those that we think are easier to read, write, speak, and remember.
edit: to add, an interesting human-readable and memorable base52 alphabet that I've never found a use for is to use playing cards