
> Why didn't they go the UTF-8 route?

The original Unicode marketing / position statement from 1988[1] may provide a clue:

“In the Unicode system, a simple unambiguous fixed-length character encoding is integrated into a coherent overall architecture for text processing.”

“Unicodes [sic] are the most straightforward multilingual generalization of ASCII codes: - Fixed length of character code (16 bits); [...]”

“Are 16 bits [...] sufficient to encode all characters of all the world’s scripts? [...] Yes.”

“[A] fixed length-encoding is flat-out simple, with all the blessings attendant upon that virtue.”

Etc., etc.

The hypothetical possibility of more than 2^16 characters was introduced in Unicode 2.0 (1996), while such characters didn’t actually appear until Unicode 3.1 (2001). Windows NT shipped in 1993, OpenStep in 1994, Java and JavaScript in 1995. UTF-8 was presented at USENIX in January 1993; a contemporary exposition[2] says that “the 4[!], 5 and 6 byte sequences are only there for political reasons” (presumably referring to the fact that Unicode committed to 2^16 code points while the new, Unicode-compatible draft of ISO 10646 stuck with 2^31).

[1] https://unicode.org/history/unicode88.pdf

[2] https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt




Do you (or anyone else) have any idea why anyone could possibly have thought 16 bits would be enough? Many decisions are bad in hindsight, but surely no hindsight was needed for that.


Nope. But on reflection, I can’t really tell whether it was all that dumb of an idea.

If you look at the text following the “Yes” quote, you’ll find that “all characters” is carefully defined to mean “all characters in current use from commercially non-negligible scripts”. Compared to the current definition of “all characters we have reasonable evidence have ever been used for natural-language interchange”, it doesn’t sound as noble, but would also exclude a number of large-repertoire sets (Tangut and pre-modern Han ideograms, Yi syllables, hieroglyphs, cuneiform). Remove the requirement for 1:1 code point mapping with legacy sets, and you could conceivably throw out precomposed Hangul as well. (Precomposed European scripts too, if you want, but that wouldn’t net you eleven thousand code points.)

At that point the question seems to come down to Han characters: the union of all government-mandated education standards (unified) would come in well below ten thousand characters, but how well does that number correspond to the number of characters people actually need? One potential source of trouble is uncommon characters people really, really want (proper names), but overall, I don’t know; you’d probably need a CJKV expert to tell. To me, neither answer seems completely implausible.

On the other hand, it’s also unclear that a constant-width encoding would really be all that valuable. Most of the time, you are either traversing all code points in sequence or working with larger units such as combining-character sequences or graphemes, so, aside from buffer-truncation issues, constant width does not really help all that much. But that’s an observation that took more than a decade of Unicode implementations to crystallize.
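
To make that concrete, here is a tiny Python illustration (my own sketch, not from the original discussion): once combining sequences are in play, indexing by code point still doesn’t give you user-perceived characters, fixed width or not.

    # 'e' followed by U+0301 COMBINING ACUTE ACCENT renders as a single "é"
    s = "e\u0301"
    print(len(s))   # 2 -- two code points for one user-perceived character
    print(s[0])     # "e" -- indexing by code point splits the grapheme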

It is certainly annoying how large and sparse the lookup tables needed to implement a current version of Unicode are (enough that you need three levels in your radix tree rather than two), but if you aren’t doing locale-specific tailoring it’s still a matter of a few dozen kilobytes at most, hardly a deal breaker these days. Perhaps that’s not too high a cost for not marginalizing users of obscure languages and for keeping digitized historical text representable in the common format.
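
For what it’s worth, the shape of those tables is simple even if the data is bulky. Here is a hedged Python sketch of the usual trie-style lookup (the bit split and the stage tables are invented for illustration, not real Unicode data):

    # Split the 21-bit code point into three indices and chain three array
    # reads; a 16-bit code space could have gotten away with two stages.
    SHIFT1, SHIFT2 = 13, 6   # illustrative split: 8 + 7 + 6 bits

    def lookup(cp, stage1, stage2, stage3):
        i = stage1[cp >> SHIFT1]                   # top bits pick a stage2 block
        j = stage2[i + ((cp >> SHIFT2) & 0x7F)]    # middle bits pick a stage3 block
        return stage3[j + (cp & 0x3F)]             # low bits pick the property value

Blocks of code points with identical properties can share stage2/stage3 entries, which is what keeps those sparse tables down to a few dozen kilobytes.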


Just wanted to say thank you for linking these documents! I find the history of character encoding design so interesting, especially from primary sources such as these. :)


Well then :) You might also like the report “Character Set Issues for Ada 9X”[1] from 1989, the only online source I know that goes into any sort of detail about the old ISO 10646 draft before it was essentially replaced[2,3] with Unicode. (Other references about that beast are welcome!) The only trace of it in current use looks to be the term “plane” for a naturally-aligned set of 2^16 code points, originally from a base-256 sequence of group / plane / row / cell.
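
Purely as an illustration of that addressing scheme (my own toy Python, assuming a 32-bit code value laid out as four base-256 digits):

    def gprc(code):
        # split an old-draft ISO 10646 code value into its four octets
        group = (code >> 24) & 0xFF
        plane = (code >> 16) & 0xFF
        row   = (code >> 8) & 0xFF
        cell  = code & 0xFF
        return group, plane, row, cell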

[1] https://apps.dtic.mil/sti/citations/ADA221614 (scanned PDF) or http://archive.adaic.com/pol-hist/history/9x-history/reports... (PostScript) or http://archive.adaic.com/pol-hist/history/9x-history/reports... (ASCII)

[2] https://www.unicode.org/history/hartmemo.html

[3] https://www.unicode.org/history/hartinterview.html


Wow, this is crazy, thank you so much for sharing! I love reading history like this! :D

Every time I read about the history of character encodings, I feel like I learn about a new encoding standard that attempted to standardize things. Reading about this led me to read more about ASCII as well. I learned it was derived from the 1924 ITA2 standard, which was itself derived from the "Baudot" printing telegraph encoding from 1874! It always amazes me how much history surrounds this topic! [1]

Also, that DTIC site is such a treasure trove of great information! :D

[1] http://www.baudot.net/docs/smith--teletype-codes.pdf



