
> Why didn't they go the UTF-8 route?

The original Unicode marketing / position statement from 1988[1] may provide a clue:

“In the Unicode system, a simple unambiguous fixed-length character encoding is integrated into a coherent overall architecture for text processing.”

“Unicodes [sic] are the most straightforward multilingual generalization of ASCII codes: - Fixed length of character code (16 bits); [...]”

“Are 16 bits [...] sufficient to encode all characters of all the world’s scripts? [...] Yes.”

“[A] fixed length-encoding is flat-out simple, with all the blessings attendant upon that virtue.”

Etc., etc.

The hypothetical possibility of more than 2^16 characters was introduced in Unicode 2.0 (1996), while such characters didn’t actually appear until Unicode 3.1 (2001). Windows NT shipped in 1993, OpenStep in 1994, Java and JavaScript in 1995. UTF-8 was presented at USENIX in January 1993; a contemporary exposition[2] says that “the 4[!], 5 and 6 byte sequences are only there for political reasons” (presumably referring to the fact that Unicode committed to 2^16 code points while the new, Unicode-compatible draft of ISO 10646 stuck with 2^31).

[1] https://unicode.org/history/unicode88.pdf

[2] https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt




Do you (or anyone else) have any idea why anyone could possibly have thought 16 bits would be enough? Many decisions are bad in hindsight, but surely no hindsight was needed for that.


Nope. But on reflection, I can’t really tell whether it was all that dumb of an idea.

If you look at the text following the “Yes” quote, you’ll find that “all characters” is carefully defined to mean “all characters in current use from commercially non-negligible scripts”. Compared to the current definition of “all characters we have reasonable evidence have ever been used for natural-language interchange”, it doesn’t sound as noble, but would also exclude a number of large-repertoire sets (Tangut and pre-modern Han ideograms, Yi syllables, hieroglyphs, cuneiform). Remove the requirement for 1:1 code point mapping with legacy sets, and you could conceivably throw out precomposed Hangul as well. (Precomposed European scripts too, if you want, but that wouldn’t net you eleven thousand code points.)

At that point the question seems to come down to Han characters: the union of all government-mandated education standards (unified) would come in well below ten thousand characters, but how well does that number correspond to the number of characters people actually need? One potential source of trouble is uncommon characters people really, really want (proper names), but overall, I don’t know; you’d probably need a CJKV expert to tell. To me, neither answer seems completely implausible.

On the other hand, it’s also unclear that a constant-width encoding would really be all that valuable. Most of the time, you are either traversing all code points in sequence or working with larger units such as combining-character sequences or graphemes, so, aside from buffer-truncation issues, constant width does not really help all that much. But that’s an observation that took more than a decade of Unicode implementations to crystallize.
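
To make that concrete, here is a tiny Python illustration (my own sketch, not from the original discussion): once combining sequences are in play, indexing by code point still doesn’t give you user-perceived characters, fixed width or not.

    # 'e' followed by U+0301 COMBINING ACUTE ACCENT renders as a single "é"
    s = "e\u0301"
    print(len(s))   # 2 -- two code points for one user-perceived character
    print(s[0])     # "e" -- indexing by code point splits the grapheme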

It is certainly annoying how large and sparse the lookup tables needed to implement a current version of Unicode are (enough that you need three levels in your radix tree rather than two), but if you aren’t doing locale-specific tailoring it’s still a matter of a few dozen kilobytes at most, hardly a deal breaker these days. Perhaps that’s not too high a cost for not marginalizing users of obscure languages and for keeping digitized historical text representable in the common format.
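
For what it’s worth, the shape of those tables is simple even if the data is bulky. Here is a hedged Python sketch of the usual trie-style lookup (the bit split and the stage tables are invented for illustration, not real Unicode data):

    # Split the 21-bit code point into three indices and chain three array
    # reads; a 16-bit code space could have gotten away with two stages.
    SHIFT1, SHIFT2 = 13, 6   # illustrative split: 8 + 7 + 6 bits

    def lookup(cp, stage1, stage2, stage3):
        i = stage1[cp >> SHIFT1]                   # top bits pick a stage2 block
        j = stage2[i + ((cp >> SHIFT2) & 0x7F)]    # middle bits pick a stage3 block
        return stage3[j + (cp & 0x3F)]             # low bits pick the property value

Blocks of code points with identical properties can share stage2/stage3 entries, which is what keeps those sparse tables down to a few dozen kilobytes.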


Just wanted to say thank you for linking these documents! I find the history of character encoding design so interesting, especially from primary sources such as these. :)


Well then :) You might also like the report “Character Set Issues for Ada 9X”[1] from 1989, the only online source I know that goes into any sort of detail about the old ISO 10646 draft before it was essentially replaced[2,3] with Unicode. (Other references about that beast are welcome!) The only trace of it in current use looks to be the term “plane” for a naturally-aligned set of 2^16 code points, originally from a base-256 sequence of group / plane / row / cell.
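
Purely as an illustration of that addressing scheme (my own toy Python, assuming a 32-bit code value laid out as four base-256 digits):

    def gprc(code):
        # split an old-draft ISO 10646 code value into its four octets
        group = (code >> 24) & 0xFF
        plane = (code >> 16) & 0xFF
        row   = (code >> 8) & 0xFF
        cell  = code & 0xFF
        return group, plane, row, cell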

[1] https://apps.dtic.mil/sti/citations/ADA221614 (scanned PDF) or http://archive.adaic.com/pol-hist/history/9x-history/reports... (PostScript) or http://archive.adaic.com/pol-hist/history/9x-history/reports... (ASCII)

[2] https://www.unicode.org/history/hartmemo.html

[3] https://www.unicode.org/history/hartinterview.html


Wow, this is crazy, thank you so much for sharing! I love reading history like this! :D

Every time I read about the history of character encodings, I feel like I learn about a new encoding standard that attempted to standardize things. Reading about this led me to read more about ASCII as well. I learned it was derived from the 1924 ITA2 standard, which was itself derived from the "Baudot" printing telegraph encoding from 1874! It always amazes me how much history surrounds this topic! [1]

Also, that DTIC site is such a treasure trove of great information! :D

[1] http://www.baudot.net/docs/smith--teletype-codes.pdf



