People REALLY like to complain about unicode, but where it's complicated, it's because the _problem space_ is complicated. Which it is. People are actually complaining that handling global text is so complicated: they wish humans had been a lot simpler and more limited in their inventions of alphabets and in how those were used in typesetting and printing and whatnot, and that legacy digital text encodings had happened differently than they did. They're not actually complaining about unicode at all, which had to play the hand of cards it was dealt.
That unicode worked out as nice a solution as it is to storing global text is pretty amazing; there were some really smart and competent people working on it. When you dig into the details, you will be continually amazed how nice it is. And how well-documented.
One real testament to this is how _well adopted_ Unicode is. There is no actual guarantee that just because you make a standard anyone will use it. Nobody forced anyone to move from whatever they did to Unicode (and in fact most internet standards, for example, don't force Unicode and are technically agnostic as to text encoding). That it has become so universal is because it was so well designed: it solved real problems, with a feasible migration path for developers whose cost was justified by its benefits. (When people complain about aspects of UTF-8 required by its multi-faceted compatibility with ASCII, they are missing that this is what led to unicode actually winning.)
The OP, despite the title, doesn't actually serve as a great argument/explanation for how Unicode is awesome. But I'd read some of the Unicode "annex" docs -- they are also great docs!
- UTF-8 would have been specified first
- we'd not have had UCS-2, nor UTF-16
- we'd have more than 21 bits of codespace
- CJK unification would not have been attempted
- we might or might not have pre-composed codepoints
- a few character-specific mistakes would have gone unmade
Everything to do with normalization, graphemes, and all the things that make Unicode complex and painful would still have to be there because they are necessary and not mistakes. Unicode's complexity derives from the complexity of human scripts.
Going back further to create Unicode before there was a crushing need for it would be impossible -- try convincing computer scientists in the 60s to use Unicode... Or IBM in the 30s. For this reason, pre-composed codepoints would still have proven to be very useful, so we'd probably still have them if we started over, and we'd still end up with NFC/NFKC being closed to new additions, which would leave NFD as the better NF just as it is today.
Love an interesting sci-fi scenario. UTF-8 was a really neat technical trick, and a lot of the early UTF-8 technical documentation was already on IBM letterhead. I think if you showed up with the right documents at various points in history IBM would have been ecstatic to have an idea like UTF-8, at least. UTF-8 would have sidestepped a lot of mistakes with code pages and CCSIDs (IBM's attempts at 16-bit characters, encoding both code page and character), and IBM developers likely would have enjoyed that. Also, they might have been delightfully confused about how the memo was coming from inside the house by coworkers not currently on payroll.
Possibly that even extends as far back as the 1930s and formation of the company, because even then IBM aspired to be a truly International company, given the I in its own name.
I'm not sure how much of the rest of Unicode you could have convinced them of, but it's funny imagining explaining say Emoji to IBM suits at various points in history.
OTOH, it wouldn't have been UTF-8 -- it would have been an EBCDIC-8 thing, and probably not good :)
In some small decisions EBCDIC makes as much or more sense than ASCII; the decades of problems have been that ASCII and EBCDIC coexisted from basically the beginning. (IBM could have delayed the System/360 while ASCII was standardized and likely have saved decades of programmer grief.) The reasons that UTF-EBCDIC is so bad (such as that it is always a 16-bit encoding) could likely have been avoided had IBM been aware of UTF-8 ahead of time.
Maybe if IBM had something like UTF-8 as far back as the 1930s, AT&T needing backward compatibility with their teleprinters might not have been as big of a deal and ASCII might have been more IBM dominated. Or since this is a sci-fi scenario, you just impress on IBM that they need to build a telegraph compatible teleprinter model or three in addition to all their focus on punch cards, and maybe they'd get all of it interoperable themselves ahead of ASCII.
Though that starts to raise the question of what happens if you give UTF-8 to early Baudot code developers in the telegraph world. You might have a hard time convincing them they need more than 5 bits, but if you could accomplish that, imagine where telegraphy could have gone. Winking face emoji full stop
I think this points to why the science fiction scenario really is a science fiction scenario -- I think decoding and interpreting UTF-8, using it to control, say, a teleprinter, is probably significantly enough more expensive than ASCII that it would have been a no go, too hard/expensive or entirely implausible to implement in a teleprinter using even 1960s digital technology.
"Sigh, another swiss cheese card jammed the reader from all these emojis."
Just in case, there is an extension to Unicode if we ever run out of code points.
Yet legacy is forever, so I wouldn't expect UTF-16 to die.
Most likely, when we run out of codespace we'll see UC assign a "UTF-16 SUX, JUST UPGRADE TO UTF-8 ALREADY" codepoint that newer codepoints can be mapped to when converting to UTF-16, then lift the 21 bit limit on UTF-8.
But the answer is that 32 is not better than 16, which is not better than 8, in this specific case. The bit count here is about memory efficiency. There are developers who think/thought that UTF-32 would improve random access into strings because it's a fixed-sized encoding of codepoints, but in fact it does not because there are glyphs that require multiple codepoints.
All who pass must abandon the idea of random access into Unicode strings!
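A minimal Python sketch of why a fixed-width encoding of code points still doesn't give random access to what a reader sees as characters. The sample string and variable names are only illustrative.

```python
import unicodedata

# "é" written as a base letter plus a combining acute accent:
# two code points, but one character on screen.
decomposed = "e\u0301"
# The same text in precomposed (NFC) form: a single code point.
composed = unicodedata.normalize("NFC", decomposed)

# UTF-32 spends exactly 4 bytes per code point, so dividing by 4
# counts code points, not user-perceived characters.
utf32_units_decomposed = len(decomposed.encode("utf-32-le")) // 4
utf32_units_composed = len(composed.encode("utf-32-le")) // 4

# Indexing by code point happily splits the letter from its accent.
first_code_point = decomposed[0]

print(utf32_units_decomposed, utf32_units_composed, repr(first_code_point))
```

Both strings render the same, yet they have different lengths in UTF-32, and index 0 of the decomposed one is a bare "e".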
Once you give up on Unicode string random access, you can focus on other reasons to like one UTF over another, and then you quickly realize that UTF-8 is the best by far.
For example there's this misconception that UTF-8 takes significantly (e.g., 50%) more space than UTF-16 when handling CJK, but that's not obviously true -- I've seen statistical analyses showing that UTF-8 is roughly comparable to UTF-16 in this regard.
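As a toy illustration only (this is not the statistical analysis mentioned above, and the sample strings are made up), here is how the byte counts compare in Python for bare CJK text versus the same text wrapped in ASCII markup:

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Eight BMP characters: 3 bytes each in UTF-8, 2 bytes each in UTF-16.
pure_cjk = "日本語のテキスト"
# The same text inside ASCII markup: 1 byte per ASCII character in UTF-8,
# but 2 bytes per ASCII character in UTF-16.
marked_up = '<p lang="ja">' + pure_cjk + "</p>"

pure_utf8, pure_utf16 = encoded_sizes(pure_cjk)
markup_utf8, markup_utf16 = encoded_sizes(marked_up)

print(pure_utf8, pure_utf16)
print(markup_utf8, markup_utf16)
```

Bare CJK text is where the 50% figure comes from; once ASCII markup, whitespace, digits, or punctuation are mixed in, the balance shifts toward UTF-8. How it nets out on real documents depends entirely on the corpus.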
Ultimately UTF-8 is much easier to retrofit into older systems that use ASCII, it's much easier to use in libraries and programs that were designed for ASCII, and doesn't have the fundamental 21-bit codespace limit that UTF-16 has.
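For anyone wondering where the codespace ceiling comes from, it falls out of the surrogate-pair arithmetic. A small Python sketch (the helper names are made up for illustration):

```python
HIGH_START = 0xD800  # first high (lead) surrogate
LOW_START = 0xDC00   # first low (trail) surrogate

def to_surrogates(code_point):
    """Split a code point above the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not representable as a surrogate pair")
    offset = code_point - 0x10000  # fits in 20 bits
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)

def from_surrogates(high, low):
    """Recombine a surrogate pair into a code point."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)

# Each half of the pair carries 10 bits, so the largest pair maps to
# 0x10000 + 0xFFFFF, which is the top of the Unicode codespace.
largest = from_surrogates(0xDBFF, 0xDFFF)

print(hex(largest))
print([hex(unit) for unit in to_surrogates(0x1F600)])
```

Nothing in the byte layout of UTF-8 itself forces that ceiling; the limit exists so that every code point stays representable in UTF-16.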
What I cannot forgive is coming up with something that allows the same string to have multiple representations, which happens in UCS-2 / UTF-16, and because pre-composed codepoints overlap pre-existing characters. That mistake makes "string1" == "string2" meaningless in the general case. It's already caused exploits.
I also cannot forgive pre-composed codepoints for another reason: they make parsing strings error prone. This is because ("A" in "Åström") no longer works - it will return true if Å is stored as a composite character (base letter plus combining mark).
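Both complaints are easy to reproduce. A short Python sketch, using escapes so the two spellings can't be silently normalized in transit (the variable names are illustrative):

```python
import unicodedata

# "Åström" with precomposed letters (U+00C5, U+00F6).
name_precomposed = "\u00c5str\u00f6m"
# The same name with base letters followed by combining marks.
name_decomposed = unicodedata.normalize("NFD", name_precomposed)

# Substring search sees a plain "A" only in the decomposed spelling.
found_in_precomposed = "A" in name_precomposed
found_in_decomposed = "A" in name_decomposed

# Code point equality treats the two spellings as different strings...
equal_raw = name_precomposed == name_decomposed
# ...unless both sides are brought to the same normal form first.
equal_normalized = (
    unicodedata.normalize("NFC", name_precomposed)
    == unicodedata.normalize("NFC", name_decomposed)
)

print(found_in_precomposed, found_in_decomposed, equal_raw, equal_normalized)
```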
Let me explain again. We have these pesky diacritical marks. How shall they be represented? Well, if you go with combining codepoints, the moment you need to stack more than one diacritical on a base character then you have a normalization problem. Ok then! you say, let us go with precompositions only. But then you get other problems, such as that you can't decompose them so you can cleverly match 'á' when you search for 'a', or that you can't easily come up with new compositions the way one could with overstrike (in ASCII) or with combining codepoints in decomposed forms. Fine!, you say, I'll take that! But now consider Hangul, where the combinations are to form syllables that are themselves glyphs... the problems get much worse now. What might your answer be to that? That the rest of the world should abandon non-English scripts, that we should all romanize all non-English languages, or even better(?), abandon all non-English languages?
(And we haven't even gotten to case issues. You think there are no case issues _outside_ Unicode? That would be quite wrong. Just look at the Turkish dot-less i, or Spanish case folding for accented characters... For example, in French upcasing accented characters does not lose the accent, but in Spanish they do -or used to, or are allowed to be lost, and often are-, which means that accented characters don't round-trip in Spanish.)
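A few of these case quirks show up directly in Python's locale-independent `str.upper()` / `str.lower()`. This is just a sketch of the default Unicode mappings; real Turkish casing needs locale-aware tailoring, which this does not attempt.

```python
# Turkish dotless i (U+0131) uppercases to plain "I", which lowercases
# to dotted "i", so the letter does not survive a round trip.
dotless_i = "\u0131"
round_trip = dotless_i.upper().lower()

# Capital I with dot above (U+0130) lowercases to "i" plus a combining
# dot above: one code point in, two code points out.
dotted_capital_lower = "\u0130".lower()

# German sharp s has no single-letter default uppercase form.
sharp_s_upper = "\u00df".upper()

print(repr(round_trip), len(dotted_capital_lower), sharp_s_upper)
```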
No, I'm sorry, but the equivalence / normalization problem was unavoidable -- it did not happen because of "egos" on the Unicode Consortium, or because of "national egos", or politics, or anything.
The moment you accept that this is an unavoidable problem, life gets better. Now you can make sure that you have suitable Unicode implementations.
What? you think that normalization is expensive? But if you implement a form-insensitive string comparison, you'll see that the vast majority of the time you don't have to normalize at all, and for mostly-ASCII strings the performance of form-insensitivity can approach that of plain old C locale strcmp().
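A rough Python sketch of that idea, coarser than a real character-at-a-time comparison that normalizes lazily: it only shows the fast paths that let most comparisons skip normalization entirely. The function name is made up.

```python
import unicodedata

def form_insensitive_equal(a: str, b: str) -> bool:
    """Compare two strings, treating canonically equivalent text as equal."""
    # Fast path 1: identical code point sequences need no normalization.
    if a == b:
        return True
    # Fast path 2: ASCII text is unchanged by normalization, so two
    # differing all-ASCII strings cannot be equivalent.
    if a.isascii() and b.isascii():
        return False
    # Slow path: only now pay for normalization.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(form_insensitive_equal("resume", "resume"))
print(form_insensitive_equal("r\u00e9sum\u00e9", "re\u0301sume\u0301"))
print(form_insensitive_equal("resume", "r\u00e9sum\u00e9"))
```

A production version would walk both strings in step and normalize only the stretch around the first mismatch, rather than normalizing whole strings as this sketch does.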
I don't have an answer, mostly because I know nothing about Hangul. Maybe decomposition is the right solution there. Frankly I don't care what Unicode does to solve the problems Hangul creates, and as Korea is about 1% of the world's population I doubt many other people here care either.
I'm commenting about Latin languages. There is absolutely no doubt what is easiest for a programmer there: one code point per grapheme. We've tried considering 'A' and 'a' the same in both computer languages and file systems. It was a mess. No one does it any more.
> But then you get other problems, such as that you can't decompose them so you can cleverly match 'á' when you search for 'a'
It's not a problem. We know how to match 'A' and 'a', which are in every sense closer than 'á' and 'a' ('á' and 'a' can be different phonetically, 'A' and 'a' aren't). If matching both 'A' and 'a' isn't a major issue, why would 'á' and 'a' be an issue that Unicode must solve for us?
In fact given its history I'm sort of surprised Unicode didn't try and solve it by adding a composition to change case. shudder
> And we haven't even gotten to case issues.
The "case issues" should not have been Unicode's issue at all. Unicode should have done one thing, well. That one thing was to ensure each visually distinct string had one, and only one, unique encoding.
There is an objective reason for wanting that. Typically programmers do not do much with strings. The most common things they do are move them around, compare them for equality, and sort them. They naturally don't read the Unicode standard. They just expect the binary representation of strings to faithfully follow what their eyes tell them should happen: if two strings look identical, their Unicode representation will be identical. It's not an unreasonable expectation. If it's true, those basic operations of moving and comparing will be simple and, more importantly, efficient on a computer.
The one other thing we have to do a lot less often, but which nonetheless occupies a fair bit of our time, is parsing a string. It occupies our time because it's fiddly, takes a lot of code, and is error prone. I still remember the days when a language's string handling was a selection criterion. (It's still the reason I dislike Fortran.) I'm not talking about complex parsing here - it's usually something like splitting it into words or file system path components, or going looking for a particular token. It invariably means moving along the string one grapheme at a time, sniffing for what you want and extracting it. (Again this quite possibly is only meaningful for Latin based languages - but that's OK because the things we are after are invariably Latin characters in configuration files, file names and the like. The rest can be treated as uninteresting blobs.) And now Unicode's composition has retrospectively made that simple operation far harder to do correctly.
All other text handling programmers do is now delegated to libraries of some sort. You mention one: nobody does case conversion themselves. They call strtolower() for ASCII or a Unicode equivalent. Hell, as soon as you leave Latin, even printing it correctly requires years of expertise to master. The problems that crop up may, as you say, be unavoidable, but that's OK because they are so uncommon that I'm willing to wear the speed penalty of using somebody else's code to do it.
> it did not happen because of "egos" on the Unicode Consortium, or because of "national egos", or politics, or anything.
Did someone say that? Anyway, it's pretty obvious why it happened. When a person invents a new hammer, the second thing they do is go looking for all the other problems it might solve. A little tweak here and it would do that job too! I saw an apprentice sharpen the handle of his Estwing hammer once. It did make it a useful wire cutter in a pinch, but no prizes for guessing what happened when he was just using it as a hammer.
Unicode acquired its warts by attempting to solve everybody's problems. Instead of making it more and more complex, they should have ruthlessly optimised it to make it work near flawlessly for its most common user: a programmer who couldn't give a shit about internationalisation, and wasted the bare minimum of his time on stackoverflow before using it.
The tragedy is it didn't do that.
> Did someone say that?
Yes, u/kazinator did.
> Anyway, it's pretty obvious why it happened. When a person invents a new hammer, the second thing they do is go looking for all the other problems it might solve.
That's not why decomposition happened. It happened because a) decomposition already existed outside Unicode, b) it's useful. Ditto pre-composition.
> Unicode acquired its warts by attempting to solve everybody's problems.
Unicode acquired its warts by attempting to be an incremental upgrade to other codesets. And also by attempting to support disparate scripts with disparate needs. The latter more than the former.
> Instead of making it more and more complex, they should have ruthlessly optimised it to make it work near flawlessly for its most common user: a programmer who couldn't give a shit about internationalisation, ...
They did try to ruthlessly optimize it: by pursuing CJK unification. That failed due to external politics.
As to the idea that programmers who want nothing to do with I18N are the most common users of Unicode, that's rather insulting to the real users: the end users. All of this is to make life easier on end users: so they need not risk errors due to their (or their software) not being able to keep track of what codeset/encoding some document is written in, so they can mix scripts in documents, and so on.
Unicode is not there to make your life harder. It's there to make end users' lives easier. And it's succeeded wildly at that.
> > And we haven't even gotten to case issues.
> The "case issues" should not have been Unicode's issue at all. Unicode should have done one thing, well. That one thing was to ensure each visually distinct string had one, and only one, unique encoding.
You really should educate yourself on I18N.
Oh for Pete's sake. Unicode / ASCII / ISO 8859-1 are encodings that computers, and thus programmers, use to represent text. Users don't read Unicode, they read text. They never, ever have to deal with Unicode and most wouldn't know what it was if it leapt up and hit them in the face, so if Unicode justified adding features to accommodate these non-existent users, I guess that explains how we got into this mess.
Whether you like it or not, Unicode exists to make USERS' lives better. Programmers?? Pfft. We can deal with the complexity of all of that. User needs, on the other hand, simply cannot be met reasonably with any alternatives to Unicode.
They've been using multiple scripts long before computers. When computers came along those users quite reasonably demanded they be able to write the same scripts. This created a problem for the programmers. The obvious solution is a universal set of code points - and ISO 10646 was born. It was not rocket science. But if it had not come along some other hack / kludge would have been used because the market is too large to be abandoned by the computer companies. They would have put us programmers in a special kind of hell, but I can guarantee the users would not have known about that, let alone cared.
Oddly the encoding schemes proposed by ISO 10646 universally sucked. Unicode walked into that vacuum with their first cock up - 16 bits was enough for anybody. It was not just dumb because it was wrong - it was dumb because they didn't propose unique encoding. They gave us BOM markers instead. Double fail. They didn't win because the one thing they added, their UCS-2 encoding, was any better than what came before. They somehow managed to turn it into a Europe vs USA popularity contest. Nicely played.
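To make the BOM complaint concrete, here is a small Python sketch showing that one and the same character has two 16-bit byte serializations, and that a byte order mark is what tells a decoder which one it is looking at:

```python
import codecs

text = "A"

# The same code point serialized in the two possible byte orders.
little_endian = text.encode("utf-16-le")
big_endian = text.encode("utf-16-be")

# Python's generic "utf-16" decoder reads a leading BOM to pick the order.
decoded_le = (codecs.BOM_UTF16_LE + little_endian).decode("utf-16")
decoded_be = (codecs.BOM_UTF16_BE + big_endian).decode("utf-16")

# UTF-8 is a byte-oriented encoding, so there is only one serialization.
utf8_bytes = text.encode("utf-8")

print(little_endian, big_endian, utf8_bytes)
```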
Then Unicode and 10646 became the same thing. They jointly continued on in the same manner as before, inventing new encodings to paper over the UCS-2 mistake. Those new encodings all universally sucked. The encoding we programmers use today, UTF-8, was invented by, surprise, surprise, a programmer who was outside of Unicode / 10646 group think. It was popularised at a programmers' conference, USENIX, and from there on it was obvious it was going to be used regardless of what Unicode / 10646 thought of it, so they accepted it.
If Perl6 is any indication, the programmers are getting the shits with the current mess. Perl6 has explicitly added functions that treat text as a stream of graphemes rather than Unicode code points. The two are the same of course except when your god damned compositions rear their ugly head - all they do is eliminate that mess. Maybe you should take note. Some programmers have already started using something other than Unicode because it makes their jobs easier. If it catches on you are going to find out real quickly just how much the users care about the encoding scheme computers use for text.
None of this is to trivialise the task of assigning code points to graphemes. It's a huge task. But for Pete's sake don't over-inflate your egos by claiming you are doing some great service for mankind. The only thing on the planet that uses the numbers Unicode assigns to characters as is is computers. Computers are programmed by one profession - programmers. Your entire output is consumed by that one profession. Yet for all the world what you've written here seems to say Unicode has some higher purpose, and you are optimising for that, whatever it may be. For god's sake come down to earth.
Please also note that all string handling in Raku (the `Str` class) is based on graphemes, not just some added functions. This e.g. means that a newline is always 1 grapheme (regardless of whether it was a CR or LF or CRLF). And that for é there is no difference between é (LATIN SMALL LETTER E WITH ACUTE aka 0x00E9) and é (LATIN SMALL LETTER E, COMBINING ACUTE ACCENT aka 0x0065, 0x0301).
Please note that for any combination between characters and combiners for which there does not exist a composed version in Unicode, Raku will create synthetic codepoints on the fly. This ensures that you can consider your texts as graphemes, but still be able to roundtrip strange combinations.
- Old grapheme clusters and new extended grapheme clusters would be the same.
Unification was driven by a desire to keep Unicode a 16-bit codespace for as long as possible. Undoing unification meant adding more pressure on an already crowded codespace, which meant abandoning UCS-2, and creating UTF-16 (horror of horrors), and ultimately switching to UTF-8.
In my lifetime, I've seen text systems that choose to include style (bold, italic), color (fore and back), or size with each character. Unicode did not (generally) choose to include those, even though they're arguably part of "global text". Ligatures, too, are generally considered the domain of font rendering, not text storage. Vertical text was, too, until a few months ago. "Historical" writing directions are apparently still considered out of scope for Unicode. Linear A is in scope, though, even though nobody is sure what the characters mean.
Unicode did choose to be backwards compatible with ASCII, and include most of the crazy PETSCII characters (which were pretty popular for just a couple years but not really "global text"), and some mathematical symbols that no mathematician had a use for.
They chose to include both a nice component-combining system and also pre-combined glyphs where legacy codepages had used them. They chose to implement Han unification, but not analogous unifications across other scripts which have even more similar glyphs.
I've dug into the details of Unicode since 3.0 (20 years ago!), and found it's full of arbitrary decisions and legacy workarounds. The contributors are smart but the result looks the same as any committee full of people with conflicting goals.
Legacy support is why it's so well-adopted, and I've never seen a system where piling on legacy support made it "well-designed".
Suppose you wanted to make a system for "universal computation". The analogous method would have been to take Win16, Win32, Mac OS 9, Mac OS X, Linux, and Solaris, define the superset of all of their features, and invent a binary format which supported all of it natively. Legacy support might help get it adopted faster but nobody would call it well-designed. 20 years later, it'd clearly be simultaneously too weak and too powerful for all types of computation we want to do.
Unicode is an amazing political accomplishment. Technically, it seems rather mediocre. Nobody would ever design a text system like this unless held back by mountains of legacy 1970's/1980's systems.
To me having that kind of discussion really doesn’t make a lot of sense. You know, since “we live in a society” (at the risk of quoting a clearly thought-terminating cliche).
Unicode contains within itself thousands of design decisions, many of them trade-offs. After the fact it’s always extremely easy to swoop in and nitpick those trade-offs. No possible world exists where all those trade-offs are made correctly and, what’s more, it is frequently simply impossible to know what a “correct” trade-off even is.
(Just one example to illustrate the scope of this problem: A certain trade-off might be worth making in one direction for use case A and in another direction for use case B, however it’s not really easy to find out whether use case A or use case B are more frequent in the wild. What if both use cases are about equally as important? Now imagine the trade-off space not being a binary space but multi-dimensional. Now imagine not just two but several use cases. Now imagine use case usage changing over time and new use cases emerging in the future.)
It is a mistake to think that Unicode has the ability to change people's text behavior by not supporting things. I mean, maybe it does now that it's so popular, but in order to get adoption it had to support what people were actually doing.
People had use cases to put "𝔄" and "A" in the same document and keep them distinct without unicode. It is not a service for unicode to refuse to let them; and if it tried, it would just lead to people finding a way to do it in a non-standardized or non-unicode-standardized way anyway, which still wouldn't help anyone.
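For what it's worth, Unicode keeps the two distinct under canonical normalization and only folds them together under the lossy compatibility forms. A quick Python check:

```python
import unicodedata

fraktur_a = "\U0001D504"  # the Fraktur capital A used in mathematics
plain_a = "A"

character_name = unicodedata.name(fraktur_a)

# Canonical normalization (NFC) leaves the distinction intact.
after_nfc = unicodedata.normalize("NFC", fraktur_a)
# Compatibility normalization (NFKC) deliberately erases it.
after_nfkc = unicodedata.normalize("NFKC", fraktur_a)

print(character_name)
print(after_nfc == fraktur_a, after_nfkc == plain_a)
```

So a document that needs both letters can keep them apart, and an application that wants to treat them alike (search, say) can opt in to NFKC.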
You might just as well say "I don't understand why we need lowercase AND capital letters, the standard is just complicating things supporting both -- the ancient romans didn't need two cases after all"
You'd have to go back before Unicode to prevent this.
Unicode was created with certain engineering constraints, one of them being round-trip compatibility. This means that it needs to be possible to go from $OTHER_ENCODING -> Unicode -> $OTHER_ENCODING and get a result which is bitwise-identical to the input. In short, Unicode is saddled with the fact pre-Unicode text encoding was a mess, plus the fact people tend to not like irreversible format changes.
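A tiny Python illustration of that constraint, using ISO 8859-1 as the stand-in for $OTHER_ENCODING (any legacy codec would do; this one is just convenient because every byte value is assigned):

```python
# Every possible byte value in ISO 8859-1 (Latin-1).
original = bytes(range(256))

# $OTHER_ENCODING -> Unicode -> $OTHER_ENCODING
as_unicode = original.decode("latin-1")
round_tripped = as_unicode.encode("latin-1")

# Latin-1 stored "é" as the single byte 0xE9, so Unicode needs a single
# precomposed code point for it if the mapping is to be one-to-one.
e_acute = b"\xe9".decode("latin-1")

print(round_tripped == original, len(e_acute), hex(ord(e_acute)))
```

This is one reason precomposed characters exist at all: the legacy encodings already had them as single units.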
The Latin script (that which I write right now) and the Cyrillic script both derived heavily from the Greek script, especially the capital letters--fully 60% of them are identical in Latin and Greek, even more if you include obsolete letters like digamma and lunate sigma (roughly F and C, respectively). Most of these homoglyphs furthermore share identical phonetic values.
In retrospect, treating traditional Chinese, simplified Chinese, and Japanese kanji as different scripts seems like it would have been the better path. I don't know enough about the Korean and Vietnamese usage of Chinese characters to know if those scripts are themselves independent daughter scripts or complete imports of Chinese with a few extra things thrown in (consider Farsi's additions to Arabic, or Icelandic's þ and ð additions to Latin).
The Japanese writing system differs from the Chinese one by having its own distinct scripts (hiragana, katakana), but the subset made of Chinese characters (kanji) is mostly the same as the Chinese script. The most comprehensive Chinese character dictionary is a Japanese one (Daikanwa jiten), which gives definitions and Japanese readings, and this is possible precisely because the script is the same.
The only differences are characters created for use in Japan (kokuji), which can be treated as an extension like the Vietnamese Nôm, characters simplified by the Japanese government (some jôyô kanji), and variation in some glyphs' shapes (黃/黄). So, treating the full inventory of these languages as different scripts wouldn't make more sense than encoding the English, French and Czech alphabets separately because a few characters differ.
My opinion is that Han unification makes sense, but the mistake made was to encode the variant interpretation at application level (e.g. html lang tags), which is not portable. I don't know how Unicode variant forms work in detail (putting a trailing code after a character to indicate precisely which variant is meant) but something like that at text encoding level could ease a lot of pain.
For example, some fonts render A in a way that looks like Cyrillic Л. (Like The Mandalorian title screen.)
This would be incorrect if using the same A for both: https://i.ytimg.com/vi/V8fC7bdV-mI/maxresdefault.jpg
(probably not what you meant, but just in case: the fourth letter is not a Cyrillic A but a D.)
you can see that the A letters are rendered the same as Cyrillic Л.
To further the confusion, in Buenos Aires there are many street signs that use Л as the letter A, and they use П as the letter N.
Exactly. It is more often rendered like that in Bulgaria. But it is still the letter Л.
Which just furthers the point that glyph rendering and character code points are very different problems and the multiple code points in Unicode are the right approach.
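A short Python sketch of the same point: letters that can look identical on screen are separate characters with separate identities and separate behavior.

```python
import unicodedata

# Three capital letters that many fonts draw identically.
lookalikes = {
    "latin": "A",
    "greek": "\u0391",
    "cyrillic": "\u0410",
}

names = {script: unicodedata.name(ch) for script, ch in lookalikes.items()}
# Each one lowercases within its own script, which would be impossible
# if they shared a single code point.
lowercase = {script: ch.lower() for script, ch in lookalikes.items()}

# The letter from the street-sign example is a different character again.
el_name = unicodedata.name("\u041b")

for script in lookalikes:
    print(script, names[script], hex(ord(lowercase[script])))
print(el_name)
```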
The same goes, of course, for writing systems, going back to the first, that we would all do differently in hindsight.
Even today we are making apparently sensible choices we (or our digital successors) will regret as deeply.
ISO 8601 looks good now, but it only delays the transition to a rational calendar which, admittedly, we would certainly get wrong if we tried codifying one today.
Fortunately daylight saving time will be gone worldwide before the next decade passes, but not without some places getting stuck at the wrong timezone. (E.g. Portugal different from Spain, and probably Indiana different from itself.)
Like with most standardization, the people at the helm are the wrong people with the wrong motivations.
Most complexity in Unicode derives from:
- real complexity in human scripts
Confusability is NOT a problem created by Unicode, but by humans. Even just within the Latin character set there are confusables (like 1 and l, which often look the same, so much so that early typewriters exploited such similarities to reduce part count).
Nor were things like normalization avoidable, since even before Unicode we had combining codepoints: ASCII was actually a multibyte Latin character set, where, for example, á could be written as a<BS>' (or '<BS>a), where <BS> is backspace. Many such compositions survive today -- for example, X11 Compose key sequences are based on those ASCII compositions (sadly, many Windows compositions aren't). The moment you have combining marks, you have an equivalence (normalization) problem.
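The equivalence problem shows up as soon as two marks are stacked on one base letter. A minimal Python sketch, with the marks chosen only for illustration:

```python
import unicodedata

ACUTE = "\u0301"      # combining acute accent, drawn above the letter
DOT_BELOW = "\u0323"  # combining dot below, drawn below the letter

# The two marks don't interact visually, so either typing order renders
# the same, but the code point sequences differ.
acute_first = "a" + ACUTE + DOT_BELOW
dot_first = "a" + DOT_BELOW + ACUTE

equal_raw = acute_first == dot_first

# Normalization sorts marks by their canonical combining class, giving
# both spellings one agreed-upon order.
nfd_acute_first = unicodedata.normalize("NFD", acute_first)
nfd_dot_first = unicodedata.normalize("NFD", dot_first)

combining_classes = (unicodedata.combining(DOT_BELOW), unicodedata.combining(ACUTE))

print(equal_raw, nfd_acute_first == nfd_dot_first, combining_classes)
```

The same issue existed with overstrike sequences: once a mark can be written before or after something else, there has to be a rule for deciding when two sequences mean the same thing.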
Emojis did add a bunch of complexity, but... emojis are essentially a new script that was created by developers in Japan. Eventually the Unicode Consortium was bound to have to standardize emojis for technical and political reasons.
Of course, some mistakes were made: UCS-2/BMP, UTF-16, CJK unification, and others. But the lion's share of Unicode's complexity is not due to those mistakes, but to more natural reasons.
User-facing is easy; things go downhill when users have system-facing strings of their own, and some of those strings become other-user-facing strings.
> with only an extremely few things I think might have served better if done differently.
Thus, in spite of disagreeing in the strongest possible terms, you do have some nits to pick.
A "few things" could be far-reaching. For instance, allowing the same semantic character to be encoded in more than one way can count as "one thing". If someone happens to think this is the only problem with Unicode, then that's "extremely few things". Yet, it's pretty major.
What a comic thread!
Also why are you implying that any gripes automatically prove you right? It's kind of ridiculous to suggest that not having UTF-8 was people "deciding to make it complicated to bolster their egos".