
Show HN: Base32H, a human-friendly duotrigesimal number system - yellowapple
https://base32h.github.io
======
coolreader18
Huh, interesting. For stuff like this (human-friendly numbers/ids/codes) I
tend to use google's base20 Open Location Code alphabet[0] that I read an
explanation of a long time ago on their website[1] :P. Whereas this project
aliases some similar letters together, the OLC alphabet avoids similar looking
letters altogether and tries to avoid spelling any words at all. Keep in mind
that "oh, letters are aliases for each other" doesn't necessarily help now
that "passwords are case sensitive" is also drilled into users' heads, and no
distinction between I/l is even more rare than case sensitivity, so users
might unnecessarily remember the difference or make a fuss about it (speaking
from experience working with my grandmother and technology).

Anyway, usually I just copy/paste a piece of code like below to use the OLC
alphabet and randomly generate an id, and you could just copy the literal to
use as a radix alphabet as well.

    
    
        import random

        new_id = "".join(random.choices("23456789CFGHJMPQRVWX", k=5))
        # check against id in dict (userinput and idmapping come from your app)
        if userinput.upper() in idmapping: ...
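
To make the "radix alphabet" remark concrete, here is a minimal sketch that writes an integer in base 20 using the same 20-character literal. The function names `to_olc_radix` and `from_olc_radix` are made up for this illustration; this is not how Open Location Code itself encodes coordinates.

```python
OLC_ALPHABET = "23456789CFGHJMPQRVWX"  # 20 symbols, same literal as in the snippet


def to_olc_radix(n: int) -> str:
    """Write a non-negative integer in base 20 using the OLC symbols."""
    if n < 0:
        raise ValueError("n must be non-negative")
    base = len(OLC_ALPHABET)
    digits = []
    while True:
        n, r = divmod(n, base)
        digits.append(OLC_ALPHABET[r])
        if n == 0:
            break
    return "".join(reversed(digits))


def from_olc_radix(s: str) -> int:
    """Inverse of to_olc_radix; lowercase input is accepted too."""
    value = 0
    for ch in s.upper():
        # str.index raises ValueError for characters outside the alphabet
        value = value * len(OLC_ALPHABET) + OLC_ALPHABET.index(ch)
    return value
```

Note that "2" plays the role of zero here, so small numbers look unusual (20 is written "32").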
    

[0]: [https://github.com/google/open-location-code/blob/master/js/...](https://github.com/google/open-location-code/blob/master/js/src/openlocationcode.js#L100)

[1]: [https://github.com/google/open-location-code/blob/master/doc...](https://github.com/google/open-location-code/blob/master/docs/olc_definition.adoc#introduction), under "Easy to use"

------
sedatk
I don't get the use case. We already have Base32. We already have human-
readable Base32 (Crockford's flavor). And Crockford makes the better tradeoff
by addressing I/1/l ambiguity instead of U/V. We don't need a new hex format
either. I don't get why this exists.

~~~
AQXt
> And Crockford makes the better tradeoff by addressing I/1/l ambiguity
> instead of U/V.

Exactly!

It is much easier to mistake "l" for "1" (lowercase L for number 1) than "5"
for "S" or "U" for "V".

~~~
sedatk
Crockford doesn't have U in its alphabet so it doesn't have U/V problem
either. It just doesn't have aliasing for U because it's used for check
digits.
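
For anyone who hasn't seen the check-digit part: as I read Crockford's published Base32 description, the optional check symbol is the value modulo 37, drawn from the 32 digits plus five extra symbols (`*`, `~`, `$`, `=`, `U`). A rough sketch under that reading; `encode_with_check` is a name invented for this example:

```python
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"   # no I, L, O, U
CHECK_SYMBOLS = CROCKFORD + "*~$=U"              # 37 symbols, check position only


def encode_with_check(n: int) -> str:
    """Encode n in Crockford Base32 and append a mod-37 check symbol."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = ""
    m = n
    while True:
        m, r = divmod(m, 32)
        digits = CROCKFORD[r] + digits
        if m == 0:
            break
    return digits + CHECK_SYMBOLS[n % 37]
```

So U only ever shows up as the trailing check symbol (when the value is 36 mod 37), never as a regular digit.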

------
jarym
I threw 0x2059B7DEDB800C03 (2331096449934167043) in there as a Postgres int8 I
had lying around and I get the message: "The number you're encoding is bigger
than what Javascript can accurately represent, so the below value is probably
incorrect."

However, it is (now) possible to represent this in JavaScript as a BigInt[0]

[0] [https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/BigInt)
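
A quick way to see why the warning fires, sketched in Python rather than JS: a JS Number is an IEEE 754 double, which Python's `float` also is, so converting the value to a float and back shows the precision loss. Only the hex literal comes from the comment above; the variable names are mine.

```python
n = 0x2059B7DEDB800C03            # the int8 value from the comment
MAX_SAFE_INTEGER = 2**53 - 1      # largest integer a double holds exactly

as_double = float(n)              # roughly what a non-BigInt runtime stores

print(n > MAX_SAFE_INTEGER)       # True
print(int(as_double) == n)        # False: the low bits were rounded away
print(n - int(as_double))         # size of the rounding error
```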

~~~
yellowapple
Yeah, I wanted to keep the initial pass at a JS implementation as universal
and dependency-free as possible, which precluded the use of BigInt for this
attempt (a surprising number of JS runtimes out there don't ship with it,
including both any version of Internet Explorer and the ancient version of
Node.js that's the default on my system). PRs are certainly welcome if someone
wants to have a go at hacking it in (on the condition that it's able to fall
back to a non-BigInt approach should it be unavailable).

------
sharpercoder
The V v U u alias should be split into V v and U u. The l should be used as
alias for 1.

I totally see the historic and soundex analogue between V v U u, but it seems
to me that the visual similarity of 1 L l i has precedence.

~~~
yellowapple
I considered that, but wanted to preserve Crockford's partial defense against
accidental profanity generation while still allowing U/u to be decoded (if
Crockford's aliased it I probably wouldn't have bothered to whip up Base32H,
lol).

As for 1/l/I, yeah, that's definitely the main flaw. The workaround (as
described in the FAQ) would be to always emit uppercase and take advantage of
L being pretty visually distinct. A bit ugly for URLs, but for things like
asset tags, product keys, and item codes / SKUs (the main things I had in
mind) that's already the norm.

~~~
sharpercoder
A language is a tool. Is special-casing the U u to prevent a single case of
profanity worth it?

People will come up with many ways to generate profane language. See
guids/uuids for example, or l33tsp34k. A major change to hypothetically
prevent a single case seems unbalanced.

~~~
yellowapple
> A language is a tool. Is special-casing the U u to prevent a single case of
> profanity worth it?

It was important enough that Douglas Crockford entirely removed the letter U
from his system except as one of several choices for a check digit. I opted
for the same tradeoff for Base32H; if anything, keeping U/u as a separate
digit would be the major change. V and U look close enough together (more so
than I and L, especially in real-world conditions where legibility is poor)
that it didn't seem unreasonable for the latter to be an alias of the former.

------
benatkin
It gets rid of a lot of swear words. I think I might prefer this to NanoID for
showing to the user. I really like the choice of uppercase for making it
visually boring. Slack does the same thing for its ids, in that it's a single
capital letter for the type followed by a sequence of decimal digits (which
could be mistaken for an integer).

For the example as well as in the library I suggest using big integers. That
way if you use NanoID or BSON (MongoDB) IDs or even uuids behind the scenes
you could always handle the default size ones.

It seems a lot are missing that it doesn't work as well on lowercase. Ever
notice that the old product key codes on the boxes of shrinkwrapped software
were in uppercase?

------
jechol
I created an Elixir library for this today.
[https://github.com/jechol/base32h](https://github.com/jechol/base32h)

~~~
yellowapple
Nice! Beat me to the punch :) I'll try to get that added to the implementation
list as soon as I've got a sec.

------
yellowapple
Huh, this blew up unexpectedly (posted this the other day and it seemed to fly
under the radar, so I figured that was that; imagine my surprise when I
suddenly see it's got a bunch of attention and an inexplicably-more-recent
timestamp). (EDIT: apparently the powers-that-be put it in the "second chance
pool"; thanks!)

Thanks for all the feedback, y'all! I'll try to keep answering questions as
they come up.

------
mjevans
Why is displaying 32 bits, rather than 16, per rune actually a useful idea?
Why even, when MIME and USENET encodings like yEnc are denser, and hex aligns
so well?

They toss out a 40 bit number as an example, but who even uses that? What even
is it, a truncated hash? A 32 or 64 bit truncation is just as valid, and using
a range of 1, 16, 256, 4096, or 65536 buckets (zero to four letters) seems to
be just as valid.

EDIT (add this paragraph): Use cases for labeling involve 4 and 8 characters
of 32-bit runes, but that's 4 vs 5 and 8 vs 10 hex characters. The hex
characters are practically self-documenting, and as I noted in another part of
the criticism an hour ago, the folding / skipping of some characters could
cause someone to wonder where missing elements of a set are. No array or
lookup table or spec definition is required for hex, which would be better in
this use case than any 32-bit rune system.

Additionally, Hex natively extends decimal in a natural way that doesn't skip
any letters. The skipped letters actively make me worry that in a list of
folders someone might believe there are missing locations.

Our computational systems are currently based on powers of 2, because we use
binary logic, and binary addresses. This is so powerful that the base set for
text standardized on a friendly power of 2 within the useful range, 8, instead
of other popular sizes at the time a couple decades ago.

Octal encoding represents only 3 bits per character, which is inconvenient
enough that hex encoding became popular: it represents 4 bits per display
character, which aligns with power-of-2 boundaries (and expands a single
octet-byte, the de facto minimum unit of addressing in popular architectures,
to exactly twice its normal storage / display footprint).

~~~
yellowapple
> Why is displaying 32 bits, rather than 16, per rune actually a useful
> idea?

To be clear: base-32 systems (including Base32H) use 5 bits per rune, which
means...

> A 32 or 64 bit truncation is just as valid, and using a range of 1, 16, 256,
> 4096, or 65536 buckets (zero to four letters) seems to be just as valid.

...base-32 systems would naturally produce 1, 32, 1024, 32768, or 1048576
buckets for that same range of letter counts. Base-64 systems would produce
even better numbers of buckets within that range of letter counts, but (as
outlined in the "Comparisons" page) it comes at the expense of case-
sensitivity.
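
The bits-per-rune and bucket arithmetic is easy to check. A small sketch (the helper names `bits_per_rune` and `buckets` are just for illustration):

```python
import math


def bits_per_rune(radix: int) -> float:
    """Bits of information carried by one digit in the given radix."""
    return math.log2(radix)


def buckets(radix: int, max_letters: int = 4) -> list[int]:
    """Number of distinct values for 0..max_letters digits."""
    return [radix**k for k in range(max_letters + 1)]


for radix in (16, 32, 64):
    print(radix, bits_per_rune(radix), buckets(radix))
```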

~~~
mjevans
I was just updating my earlier reply after I thought to look at their use
cases page.

I should write a better criticism about that.

Usernames: No, store them as given, just like real names. Also, use UTF-8.
Consult the visual rendering engine for decomposing the value to display
units, or a rendered result, etc.

Asset tags: Just use hex, or numbers. Mortal people looking at either of those
systems understand them quickly enough without further documentation. It's
only 4 vs 5 letters, worth the lack of a binder describing why you're not
missing crates.

Cheat codes: This is actually fine, I don't care, it isn't important. Though
hex or anything else is just as fine.

Cryptography: Less possible artifacts from less letters. 0123456789ABCDEF
typically don't contain any difficult to read or alias characters for dyslexic
or other non-perfect readers. If density matters (again 4 vs 5) no printed
media is going to work out.

Geographic coordinates / addresses: Everyone's using base 10 here, either as
big floats or as sets of numbers that are friendly to read. Those articles
that have average consumers use an app and send 3 words back from it are
making up for the missing safety feature of 911 sending a well formatted set
of SMS messages with the call indicating the handset's current GPS data.

------
keithlfrost
I have only one complaint about this Base32 encoding choice, and it stems from
the fact that I prefer to encode Base32 using lower case letters, instead of
the choice made here to make upper case canonical. When using lower case, the
main source of possible confusion is that it can be difficult to tell l and 1
apart, as in l1l1l1l... and this scheme uses both l (canonically "L") and 1.

~~~
edoceo
Hmm, other base32 systems avoid that by not including I and L (and O) - and
some other refs I've read (ULID comes to mind) say to produce UPPER output but
accept either case input.

And, like this spec, the values are aliases so 0/o are the same, 1/I/l are the
same, etc

[https://github.com/ulid/spec](https://github.com/ulid/spec)

~~~
keithlfrost
Yes. I'm surprised the author would be more concerned about confusion between
U/V (or u/v) than between 1/l ... the former has always seemed relatively far-
fetched to me, whereas depending on the font, the latter can be a real
problem. Again, I attribute the issue to the choice of upper case as
canonical, because L is not easily confused with any other letter or number.

~~~
yellowapple
> Again, I attribute the issue to the choice of upper case as canonical,
> because L is not easily confused with any other letter or number.

That is indeed why I settled for that tradeoff, yep; as long as it's all
VPPERCA5E it'll be distinct enough.

------
qrian
Why are l and I not aliased? They are easily confused in sans-serif fonts.

~~~
tzs
26 letters + 10 digits - 4 letters lost to aliasing (o, i, s, u) = 32 symbols.
Making L an alias would leave them only enough for base 31 unless they
dropped one of the other aliases.

Personally, I'd be OK with that. I think U is much less likely to be confused
for V, at least in anything not handwritten, than lower-case L is likely to be
confused for 1.

------
thenines
I like this, though I agree with others that the minimally-confused U & V,
would be better traded for the oft-confused 1, I & l.

One slight additional issue not so far mentioned is what of the case of
needing to encode one of many now "NSFW numbers", such as the trigger-warning
(!) decimal 739787225?

~~~
yellowapple
Yeah, that's a hard-to-address problem with any number system like this (for
example, most other base-32 systems have this problem, as do Base64 and pretty
much everything else using the full alphanumeric range). Aliasing U and V is
an attempt to do the same for a different case (addressing e.g. decimal
numbers 519571 and 421594).

If I didn't have as strong of a want for a power-of-2 radix I'd have aliased
G/g to 6 (and re-aliased L/l to 1) to further address this. Maybe even aliased
B/b to 8, since sometimes that's a source of readability issues (and it
further mitigates 11594129).

------
dheera
Bitcoin addresses use base58 I believe, which is like base64 but avoids 0, O,
I, l, +, /. It arguably serves the human-friendly requirement well while being
more compact than Base32H. It is, however, not friendly to byte-aligning use
cases.

~~~
sedatk
Base58 isn't suitable for general purpose encoding though as its performance
degrades exponentially based on the length of the input. Base32 doesn't have
that problem.

~~~
Thorrez
Exponential? I would assume it would be more like O(n^3) or less, where n is
the number of bits of the input.

~~~
sedatk
Yes, polynomial would be more accurate, and I think it's O(N^2).
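
To illustrate where the quadratic behaviour comes from: a straightforward Base58 encoder treats the whole input as one big integer and repeatedly divides it by 58, so each of the O(n) steps touches a number that is itself O(n) long. A power-of-2 radix can instead be processed in fixed-size chunks. A minimal sketch, assuming the Bitcoin alphabet and its convention of mapping leading zero bytes to "1":

```python
B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(data: bytes) -> str:
    """Naive Base58: one big-integer division per output character."""
    n = int.from_bytes(data, "big")
    out = []
    while n > 0:
        n, r = divmod(n, 58)      # divides the entire remaining number
        out.append(B58[r])
    # leading zero bytes carry no numeric value, so they are kept as "1"s
    pad = len(data) - len(data.lstrip(b"\x00"))
    return B58[0] * pad + "".join(reversed(out))
```

For short inputs like addresses or hashes this cost is negligible; it only starts to matter when encoding large blobs.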

------
geoah
Interesting approach, thank you for sharing this. I especially like the 5Ss
aliasing.

A test vector file for implementers would be nice (something like what cbor
provides) so all possible edge cases can be checked for.

~~~
yellowapple
Good idea, agreed. There's a partial attempt at that in the JS
implementation's test suite, albeit written into the test code itself; that'd
probably be a decent enough starting point for a non-comprehensive approach.

At some point a full-blown test harness will be useful (i.e. to compare
implementations and make sure they have equivalent behavior, for e.g.
randomized or sequential tests). Haven't gotten that far yet :)

------
wieghant
I just recently understood the deep connection between bytes and hex, hex and
decimal.

Can someone ELI5 what would be the use case for having data represented in
Base32H? I understand it conveys more info in less runes but it feels hard to
keep in my head. Is this just something that takes practice and getting used
to or I'm not supposed to try to use this to read binaries?

~~~
yellowapple
> Can someone ELI5 what would be the use case for having data represented in
> Base32H?

Well, the main thing I wanted to do (that motivated me to come up with
Base32H, instead of using a different base-32 system) was be able to use
English words as numbers (and likewise, turn numbers into English words). This
is hard to do with hex because you (usually) only have 6 letters, but pretty
easy to do with Base32H because _every_ letter can be a digit.

It might seem silly at first glance, but it can be pretty handy for things
like adding prefixes to a number. For example, for asset tagging (the use case
that first motivated Base32H), figuring out whether or not a given asset is a
desktop PC, you could use 8 digits in total: 4 for the prefix "DTPC" (for
DeskTop PC) and 4 for the actual identifier. Since DTPC-0000 in Base32H is
475378221056 in decimal, and DTPC-ZZZZ is 475379269631, you can immediately
know that anything between those numbers (inclusively) is a desktop PC, and
anything outside that range is something else. Same deal with, say laptop PCs
(LTPC) being in the range of LTPC-0000 (715896389632) through LTPC-ZZZZ
(715897438207), or keyboards (KBRD) being in the range of KBRD-0000
(665498681344) through KBRD-ZZZZ (665499729919).
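
The ranges above can be reproduced with a few lines of Python. This is a sketch rather than the reference implementation: it assumes the digit alphabet `0123456789ABCDEFGHJKLMNPQRTVWXYZ` with O, I, S and U decoding as 0, 1, 5 and V, and skipping the hyphen is a convenience of this example only. `decode` and `prefix_range` are illustrative names.

```python
DIGITS = "0123456789ABCDEFGHJKLMNPQRTVWXYZ"           # assumed Base32H digits
ALIASES = {"O": "0", "I": "1", "S": "5", "U": "V"}    # assumed decode aliases


def decode(text: str) -> int:
    """Decode a Base32H-style string; case-insensitive, hyphens skipped."""
    value = 0
    for ch in text.upper():
        if ch == "-":
            continue
        ch = ALIASES.get(ch, ch)
        value = value * 32 + DIGITS.index(ch)   # ValueError on unknown chars
    return value


def prefix_range(prefix: str, id_digits: int = 4) -> range:
    """All numeric values whose encoding starts with the given prefix."""
    low = decode(prefix + "0" * id_digits)
    high = decode(prefix + "Z" * id_digits)
    return range(low, high + 1)


print(decode("DTPC-0000"), decode("DTPC-ZZZZ"))
print(475378221057 in prefix_range("DTPC"))
```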

And yes, you could (and should!) absolutely do this as part of a database
schema, too (for example, by having an "asset_type" column in whatever table's
storing asset tags), but the advantage of this is that anyone and anything
encountering one of these asset IDs "in the wild" can figure out the type
without needing to access the database at all.

------
fsiefken
I miss the comparison to the duodecimal number system (base12). To me that
seems much better than both the decimal and the duotrigesimal number system.
[http://duodecimal.net/archives/duodecimal/duodecimal.html](http://duodecimal.net/archives/duodecimal/duodecimal.html)

~~~
jodrellblank
Time to link the ConLang Critic's video "A better way to count" where he
compares decimal and dozenal (base12) with seximal (base6) and base 6 wins.

[https://www.youtube.com/watch?v=qID2B4MK7Y0](https://www.youtube.com/watch?v=qID2B4MK7Y0)

One of the best videos of number-base related comedy.

------
nikeee
There is also a base24 that has many implementations in various languages (I
did the one for .NET):

[https://www.kuon.ch/post/2020-02-27-base24/](https://www.kuon.ch/post/2020-02-27-base24/)

------
foxylad
I often use base 36 (0-9 a-z), which is the most compact notation (highest
base) that python's int() function can decode.
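
For reference, decoding is built in but encoding is not, so a small helper is needed for the other direction. A minimal sketch (`to_base36` is a name chosen for this example):

```python
import string

ALPHABET36 = string.digits + string.ascii_lowercase   # 0-9 then a-z


def to_base36(n: int) -> str:
    """Encode a non-negative integer so that int(s, 36) decodes it again."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, 36)
        out.append(ALPHABET36[r])
    return "".join(reversed(out))


print(int("zz", 36))        # 1295
print(to_base36(1295))      # zz
```

`int()` rejects any base above 36 with a `ValueError`, which is where the limit comes from.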

------
nick_kline
Use the US-ASCII 0-9 and a-z (caps equivalent). O and zero (0) are the same,
as are numeral 1 and L _and_ I. S matches s and 5.

I like it.

------
inopinatus
Throw in some kind of erasure code and I’m sold.

~~~
yellowapple
Yep, that's the one major missing feature. The good news is there's nothing
stopping an application from implementing it (and indeed, with 8-character /
40-bit chunks being the recommendation, that lends itself well to sticking
8-bit checksums/ECCs/whatever onto 32-bit values).
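
As a rough illustration of that 32 + 8 split, here is a sketch that packs a 32-bit value and an 8-bit check into one 8-digit chunk. The byte-sum check is a placeholder invented for this example (it only detects some errors and corrects none), aliases are left out for brevity, and the digit alphabet is assumed to be `0123456789ABCDEFGHJKLMNPQRTVWXYZ`.

```python
DIGITS = "0123456789ABCDEFGHJKLMNPQRTVWXYZ"   # assumed Base32H digits


def checksum8(value: int) -> int:
    """Hypothetical 8-bit check: sum of the four bytes, modulo 256."""
    return sum(value.to_bytes(4, "big")) % 256


def encode_chunk(value: int) -> str:
    """Pack a 32-bit value plus its 8-bit check into 8 digits (40 bits)."""
    if not 0 <= value < 2**32:
        raise ValueError("value must fit in 32 bits")
    combined = (value << 8) | checksum8(value)
    return "".join(DIGITS[(combined >> shift) & 31]
                   for shift in range(35, -1, -5))


def decode_chunk(text: str) -> int:
    """Unpack a chunk and verify the check byte."""
    combined = 0
    for ch in text.upper():
        combined = combined * 32 + DIGITS.index(ch)
    value, check = combined >> 8, combined & 0xFF
    if checksum8(value) != check:
        raise ValueError("checksum mismatch")
    return value
```

A real application would likely swap the byte sum for a CRC or a proper error-correcting code; the point is only that the chunk size leaves room for one.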

------
dependenttypes
Be warned, the implementation is vulnerable to side-channel attacks. Please
avoid using it to encode secrets. I would suggest using libsodium instead
[https://libsodium.gitbook.io/doc/helpers](https://libsodium.gitbook.io/doc/helpers)
- although it does not offer a base32 variant.

~~~
yellowapple
Anything you'd recommend to mitigate those side-channel attacks? I was going
more for simplicity and portability for the reference implementations, but
should there be a security-focused implementation (e.g. as part of some
library like libsodium) it'd be useful to know the attack surfaces.

~~~
dependenttypes
Do not use the input as indices, do not branch depending on the input, and do
not use division, mod, and even multiplication on the input. Check how
libsodium does it. Here is a safe rot13 implementation in C if it is any help
(note: it assumes a safe islower and isupper implementation).

    
    
        #include <ctype.h>
        
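        /* Branch-free select: -(!!(c)) is all ones when c is nonzero, so the
           mask keeps t; otherwise -(!(c)) is all ones and the mask keeps e. */
        #define IFTHENELSE(c, t, e) ((-(!!(c)) & (t)) | (-(!(c)) & (e)))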
        
        unsigned char
        mod26 (unsigned char x)
        {
          x -= IFTHENELSE(x >= 26 * 4,
            26 * 4,
            0);
          x -= IFTHENELSE(x >= 26 * 2,
            26 * 2,
            0);
          x -= IFTHENELSE(x >= 26 * 2,
            26 * 2,
            0);
          x -= IFTHENELSE(x >= 26,
            26,
            0);
          return x;
        }
        
        unsigned char
        rot13 (unsigned char x)
        {
          return IFTHENELSE(islower(x), mod26(x - 'a' + 13) + 'a',
              IFTHENELSE(isupper(x), mod26(x - 'A' + 13) + 'A',
                  x));
        }
    

implementing base32h should be much easier as you do not need to perform mod
with a non-power of 2.

