Unicode Is Awesome (wisdom.engineering)
147 points by jagracey on Dec 2, 2019 | 152 comments

Unicode is pretty amazing.

People REALLY like to complain about unicode, but where it's complicated, it's because the _problem space_ is complicated. Which it is. People are actually complaining that they wish handling global text wasn't so complicated: that humans had been a lot simpler and more limited with their inventions of alphabets and how they were used in typesetting and printing and whatnot, and that legacy digital text encodings had historically happened differently than they did. They're not actually complaining about unicode at all, which had to play the hand of cards it was dealt.

That unicode worked out as nice a solution as it is to storing global text is pretty amazing, there were some really smart and competent people working on it. When you dig into the details, you will be continually amazed how nice it is. And how well-documented.

One real testament to this is how _well adopted_ Unicode is. There is no actual guarantee that just because you make a standard anyone will use it. Nobody forced anyone to move from whatever they did to Unicode (and in fact most internet standards, for example, don't force Unicode and are technically agnostic as to text encoding). That it has become so universal is because it was so well-designed: it solved real problems, with a feasible migration path for developers that had a cost justified by its benefits. (When people complain about aspects of UTF-8 required by its multi-faceted compatibility with ASCII, they are missing that this is what led to Unicode actually winning.)

The OP, despite the title, doesn't actually serve as a great argument/explanation for how Unicode is awesome. But I'd read some of the Unicode "annex" docs -- they are also great docs!

If we could go back in time to Unicode's beginning and start over but with all that we know today... Unicode would still look a lot like what it looks like today, except that:

  - UTF-8 would have been specified first
  - we'd not have had UCS-2, nor UTF-16
  - we'd have more than 21 bits of codespace
  - CJK unification would not have been attempted
  - we might or might not have pre-composed codepoints[0]
  - a few character-specific mistakes would have gone unmade
which is to say, again, that Unicode would mostly come out the same as it is today.

Everything to do with normalization, graphemes, and all the things that make Unicode complex and painful would still have to be there because they are necessary and not mistakes. Unicode's complexity derives from the complexity of human scripts.

[0] Going back further to create Unicode before there was a crushing need for it would be impossible -- try convincing computer scientists in the 60s to use Unicode... Or IBM in the 30s. For this reason, pre-composed codepoints would still have proven to be very useful, so we'd probably still have them if we started over, and we'd still end up with NFC/NFKC being closed to new additions, which would leave NFD as the better NF just as it is today.

> Or IBM in the 30s

Love an interesting sci-fi scenario. UTF-8 was a really neat technical trick, and a lot of the early UTF-8 technical documentation was already on IBM letterhead. I think if you showed up with the right documents at various points in history IBM would have been ecstatic to have an idea like UTF-8, at least. UTF-8 would have sidestepped a lot of mistakes with code pages and CCSIDs (IBM's attempts at 16-bit characters, encoding both code page and character), and IBM developers likely would have enjoyed that. Also, they might have been delightfully confused about how the memo was coming from inside the house by coworkers not currently on payroll.

Possibly that even extends as far back as the 1930s and formation of the company, because even then IBM aspired to be a truly International company, given the I in its own name.

I'm not sure how much of the rest of Unicode you could have convinced them of, but it's funny imagining explaining say Emoji to IBM suits at various points in history.

I... agree. After all, even before ASCII people already used the English character set and punctuation to type out many non-English Latin characters on typewriters (did typesetters do that with movable type, ever? I dunno, but I imagine so). Just as ASCII was intended to work with overstrike for internationalization, I can imagine combining codepoints having been a thing even earlier.

OTOH, it wouldn't have been UTF-8 -- it would have been an EBCDIC-8 thing, and probably not good :)

There actually is a rarely used, but standardized because it needed to be, UTF-16 variant UTF-EBCDIC, if you needed nightmares about it. [0]

In some small decisions EBCDIC makes as much or more sense than ASCII; the decades of problems have been that ASCII and EBCDIC coexisted from basically the beginning. (IBM could have delayed the System/360 while ASCII was standardized and likely have saved decades of programmer grief.) The reasons that UTF-EBCDIC is so bad (such as that it is always a 16-bit encoding) could likely have been avoided had IBM been aware of UTF-8 ahead of time.

Maybe if IBM had something like UTF-8 as far back as the 1930s, AT&T needing backward compatibility with their teleprinters might not have been as big of a deal and ASCII might have been more IBM dominated. Or since this is a sci-fi scenario, you just impress on IBM that they need to build a telegraph compatible teleprinter model or three in addition to all their focus on punch cards, and maybe they'd get all of it interoperable themselves ahead of ASCII.

Though that starts to raise the question of what happens if you give UTF-8 to early Baudot code developers in the telegraph world. You might have a hard time convincing them they need more than 5 bits, but if you could accomplish that, imagine where telegraphy could have gone. Winking face emoji full stop

[0] https://en.wikipedia.org/wiki/UTF-EBCDIC

> Maybe if IBM had something like UTF-8 as far back as the 1930s, AT&T needing backward compatibility with their teleprinters might not have been as big of a deal

I think this points to why the science fiction scenario really is a science fiction scenario -- I think decoding and interpreting UTF-8, using it to control, say, a teleprinter, is probably so much more expensive than ASCII that it would have been a no go: too hard/expensive or entirely implausible to implement in a teleprinter using even 1960s digital technology.

Yeah, and I was thinking about it this morning: the reliance on FF-FD for surrogate pairs would shred punch cards (too many holes) and probably be a big reason for IBM to dismiss it when they were hugely punch card dependent and hadn't made advances like the smaller square hole punches that could pack holes more densely and with better surrounding integrity.

"Sigh, another swiss cheese card jammed the reader from all these emojis."

Yeah, no thanks to that nightmare, or to UTF-5. Did you know that UTF-7 is not a joke??

Yeah, email is one of those things that in a sci-fi scenario should probably be invented after UTF-8, and hopefully too with a better plan for binary attachments. uuencode/uudecode, MIME, base64, UTF-7, so many fun hacks on top of each other on top of a communications system restricted to ASCII to play things safe.

> - we'd have more than 21 bits of codespace

Just in case, there is an extension to Unicode [1] should we ever run out of code points.

[1] http://ucsx.org/

The answer is to deprecate and obsolete UTF-16, then remove the artificial 21 bit limit on UTF-8. Or remove the artificial 21 bit limit on UTF-8 and let UTF-16 just die. Or... see below.

There's an essential, very necessary ingredient: a sense of urgency at MSFT and various JavaScript implementors (and others) to ditch UTF-16.

Yet legacy is forever, so I wouldn't expect UTF-16 to die.

Most likely, when we run out of codespace we'll see UC assign a "UTF-16 SUX, JUST UPGRADE TO UTF-8 ALREADY" codepoint that newer codepoints can be mapped to when converting to UTF-16, then lift the 21 bit limit on UTF-8.

The proposed UCS-G-8 encoding [1] does exactly that. And in case UTF-16 never dies, the website also proposes extensions to UTF-16 (and UTF-32) as well.

[1] http://ucsx.org/g8

As a lay-developer, I know unicode is what you need for international character support. Oh, so what are the options: utf-8, utf-16, utf-32. I will choose utf-32 just because 32 > 16 or 8.

I'm not sure if that's a joke (maybe that's why it got downvoted).

But the answer is that 32 is not better than 16, which is not better than 8, in this specific case. The bit count here is about memory efficiency. There are developers who think/thought that UTF-32 would improve random access into strings because it's a fixed-sized encoding of codepoints, but in fact it does not because there are glyphs that require multiple codepoints.
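A minimal Python sketch of that point, using two made-up sample strings. Even when every code point occupies exactly one 32-bit unit, index i addresses a code point, not a glyph:

```python
# "é" written as a base letter plus a combining accent: one glyph, two code points.
decomposed = "e\u0301"
# A family emoji: three people joined by ZERO WIDTH JOINER characters.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"


def utf32_units(text: str) -> int:
    """Number of 32-bit code units in the UTF-32 encoding (no BOM)."""
    return len(text.encode("utf-32-le")) // 4


print(utf32_units(decomposed))  # 2 units for what a reader sees as one letter
print(utf32_units(family))      # 5 units for what a reader sees as one emoji
print(decomposed[0])            # "e": indexing silently drops the accent
```

Finding the Nth user-perceived character still requires a scan for grapheme cluster boundaries, whichever UTF is used.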

All who pass must abandon the idea of random access into Unicode strings!

Once you give up on Unicode string random access, you can focus on other reasons to like one UTF over another, and then you quickly realize that UTF-8 is the best by far.

For example there's this misconception that UTF-8 takes significantly (e.g., 50%) more space than UTF-16 when handling CJK, but that's not obviously true -- I've seen statistical analyses showing that UTF-8 is roughly comparable to UTF-16 in this regard.
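As a rough illustration (not one of the statistical analyses mentioned above), here is a Python sketch with an invented sample. Pure CJK text is larger in UTF-8, but real documents usually mix in ASCII markup, digits, and whitespace, which UTF-8 stores in one byte each:

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count without BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


pure_cjk = "日本語のテキスト"                 # 8 BMP characters
with_markup = f'<p lang="ja">{pure_cjk}</p>'  # same text wrapped in ASCII markup

print(encoded_sizes(pure_cjk))     # (24, 16): UTF-8 is 50% larger here
print(encoded_sizes(with_markup))  # (41, 50): UTF-8 is smaller here
```

Which encoding wins for a given corpus depends on its ratio of ASCII to non-ASCII characters, so the sample above only shows the direction of the effect.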

Ultimately UTF-8 is much easier to retrofit into older systems that use ASCII, it's much easier to use in libraries and programs that were designed for ASCII, and doesn't have the fundamental 21-bit codespace limit that UTF-16 has.
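The codespace limit falls out of how UTF-16 surrogate pairs work. The sketch below (function name is just for illustration) does the arithmetic: 1024 high surrogates times 1024 low surrogates address the code points above the 16-bit range, and nothing beyond that.

```python
HIGH_START, LOW_START = 0xD800, 0xDC00  # first high and low surrogate
SURROGATES_PER_HALF = 0x400             # 1024 values in each half


def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not representable as a surrogate pair")
    offset = code_point - 0x10000  # 20 bits of payload
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


# Largest code point UTF-16 can reach: the 16-bit range plus 1024 * 1024 pairs.
max_code_point = 0x10000 + SURROGATES_PER_HALF * SURROGATES_PER_HALF - 1

print(hex(max_code_point), max_code_point.bit_length())       # 0x10ffff 21
print([hex(unit) for unit in to_surrogate_pair(0x1F600)])      # ['0xd83d', '0xde00']
```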

I can possibly forgive them not coming up with UTF-8 first, and a few character specific mistakes. I can even forgive them for not coming up with UTF-8 at all, which is in fact what happened because it was invented outside of the organisation and then common usage forced it onto them.

What I can not forgive is coming up with something that allows the same string to have multiple representations, which happens in UCS-2 / UTF-16, and because pre-composed codepoints overlap pre-existing characters. That mistake makes "string1" == "string2" meaningless in the general case. It's already caused exploits.

I also can not forgive pre-composed codepoints for another reason: they make parsing strings error prone. This is because ("A" in "Åström") no longer works - it will return true if Å is a composite character.
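The substring behaviour described here is easy to reproduce in Python, which compares strings code point by code point. This sketch uses the standard `unicodedata` module to produce both forms of the same name:

```python
import unicodedata

precomposed = "\u00C5str\u00F6m"                        # Å and ö as single code points
decomposed = unicodedata.normalize("NFD", precomposed)  # A + ring above, o + diaeresis

print(len(precomposed), len(decomposed))  # 6 8
print("A" in precomposed)                 # False
print("A" in decomposed)                  # True: the base letter is exposed
print(precomposed == decomposed)          # False, although both render the same
```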

You've not read the comments here explaining why that arose naturally from the way human scripts work, not from choices they made.

Let me explain again. We have these pesky diacritical marks. How shall they be represented? Well, if you go with combining codepoints, the moment you need to stack more than one diacritical on a base character then you have a normalization problem. Ok then! you say, let us go with precompositions only. But then you get other problems, such as that you can't decompose them so you can cleverly match 'á' when you search for 'a', or that you can't easily come up with new compositions the way one could with overstrike (in ASCII) or with combining codepoints in decomposed forms. Fine!, you say, I'll take that! But now consider Hangul, where the combinations are to form syllables that are themselves glyphs... the problems get much worse now. What might your answer be to that? That the rest of the world should abandon non-English scripts, that we should all romanize all non-English languages, or even better(?), abandon all non-English languages?

(And we haven't even gotten to case issues. You think there are no case issues _outside_ Unicode? That would be quite wrong. Just look at the Turkish dot-less i, or Spanish case folding for accented characters... For example, in French upcasing accented characters does not lose the accent, but in Spanish they do -or used to, or are allowed to be lost, and often are-, which means that accented characters don't round-trip in Spanish.)
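A short Python sketch of the case problem. Python's `str.upper()` and `str.lower()` apply Unicode's default, locale-independent mappings, so they cannot know whether the text is Turkish:

```python
dotless_i = "\u0131"  # ı LATIN SMALL LETTER DOTLESS I
dotted_I = "\u0130"   # İ LATIN CAPITAL LETTER I WITH DOT ABOVE

# Upper-casing ı gives plain I, and lower-casing that gives dotted i:
# the Turkish letter does not survive the round trip.
round_trip = dotless_i.upper().lower()

# Lower-casing İ yields i followed by COMBINING DOT ABOVE: two code points.
lowered_dotted = dotted_I.lower()

# Case mapping can also change length: German ß upper-cases to SS.
sharp_s_upper = "\u00DF".upper()

print(round_trip == dotless_i)  # False
print(len(lowered_dotted))      # 2
print(sharp_s_upper)            # SS
```

Language-aware casing needs a locale-sensitive library; nothing in the sketch above does that.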

No, I'm sorry, but the equivalence / normalization problem was unavoidable -- it did not happen because of "egos" on the Unicode Consortium, or because of "national egos", or politics, or anything.

The moment you accept that this is an unavoidable problem, life gets better. Now you can make sure that you have suitable Unicode implementations.

What? you think that normalization is expensive? But if you implement a form-insensitive string comparison, you'll see that the vast majority of the time you don't have to normalize at all, and for mostly-ASCII strings the performance of form-insensitivity can approach that of plain old C locale strcmp().
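A simplified Python sketch of the fast-path idea, assuming NFC as the comparison form. The function name is made up, and a production implementation would more likely normalize lazily, one character at a time, rather than normalizing whole strings on the slow path:

```python
import unicodedata


def form_insensitive_equal(a: str, b: str) -> bool:
    """Compare two strings for canonical equivalence, normalizing only if needed."""
    # Fast path 1: identical code point sequences are equal in every form.
    if a == b:
        return True
    # Fast path 2: ASCII text is unchanged by normalization, so two
    # differing all-ASCII strings can never become equal.
    if a.isascii() and b.isascii():
        return False
    # Slow path: at least one string has non-ASCII content.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(form_insensitive_equal("caf\u00E9", "cafe\u0301"))  # True
print(form_insensitive_equal("abc", "abd"))               # False, no normalization done
print(form_insensitive_equal("K", "\u212A"))              # True: KELVIN SIGN is canonically K
```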

> But now consider Hangul, where the combinations are to form syllables that are themselves glyphs... the problems get much worse now. What might your answer be to that?

I don't have an answer, mostly because I know nothing about Hangul. Maybe decomposition is the right solution there. Frankly I don't care what Unicode does to solve the problems Hangul creates, and as Korea is about 1% of the world's population I doubt many other people here care either.

I'm commenting about Latin languages. There is absolutely no doubt what is easiest for a programmer there: one code point per grapheme. We've tried considering 'A' and 'a' the same in both computer languages and file systems. It was a mess. No one does it any more.

> But then you get other problems, such as that you can't decompose them so you can cleverly match 'á' when you search for 'a'

It's not a problem. We know how to match 'A' and 'a', which are in every sense closer than 'á' and 'a' ('á' and 'a' can be different phonetically, 'A' and 'a' aren't). If matching both 'A' and 'a' isn't a major issue, why would 'á' and 'a' be an issue that Unicode must solve for us?

In fact given its history I'm sort of surprised Unicode didn't try and solve it by adding a composition to change case. shudder

> And we haven't even gotten to case issues.

The "case issues" should not have been Unicode's issue at all. Unicode should have done one thing, well. That one thing was ensure each visually distinct string had one, and only one, unique encoding.

There is an objective reason for wanting that. Typically programmers do not do much with strings. The most common things they do are move them around, compare them for equality, and sort them. They naturally don't read the Unicode standard. They just expect the binary representation of strings to faithfully follow what their eyes tell them should happen: if two strings look identical, their Unicode representation will be identical. It's not an unreasonable expectation. If it's true, those basic operations of moving and comparing will be simple, and more importantly efficient on a computer.

The one other thing we have to do a lot less often, but which nonetheless occupies a fair bit of our time, is parsing a string. It occupies our time because it's fiddly, takes a lot of code, and is error prone. I still remember the days when a language's string handling was a selection criterion. (It's still the reason I dislike Fortran.) I'm not talking about complex parsing here - it's usually something like splitting it into words or file system path components, or going looking for a particular token. It invariably means moving along the string one grapheme at a time, sniffing for what you want and extracting it. (Again this quite possibly is only meaningful for Latin based languages - but that's OK because the things we are after are invariably Latin characters in configuration files, file names and the like. The rest can be treated as uninteresting blobs.) And now Unicode's composition has retrospectively made that simple operation far harder to do correctly.

All other text handling programmers do is now delegated to libraries of some sort. You mention one: nobody does case conversion themselves. They call strtolower() for ASCII or a Unicode equivalent. Hell, as soon as you leave Latin, even printing it correctly requires years of expertise to master. The problems that crop up may, as you say, be unavoidable, but that's OK because they are so uncommon I'm willing to wear the speed penalty to use somebody else's code to do it.

> it did not happen because of "egos" on the Unicode Consortium, or because of "national egos", or politics, or anything.

Did someone say that? Anyway, it's pretty obvious why it happened. When a person invents a new hammer, the second thing they do is go looking for all the other problems it might solve. A little tweak here and it would do that job too! I saw an apprentice sharpen the handle of his Estwing hammer once. It did make it a useful wire cutter in a pinch, but no prizes for guessing what happened when he was just using it as a hammer.

Unicode acquired its warts by attempting to solve everybody's problems. Instead of making it more and more complex, they should have ruthlessly optimised it to make it work near flawlessly for its most common user: a programmer who couldn't give a shit about internationalisation, and wasted the bare minimum of his time on stackoverflow before using it.

The tragedy is it didn't do that.

> > it did not happen because of "egos" on the Unicode Consortium, or because of "national egos", or politics, or anything.

> Did someone say that?

Yes, u/kazinator did.

> Anyway, it's pretty obvious why it happened. When a person invents a new hammer, the second thing they do is go looking for all the other problems it might solve.

That's not why decomposition happened. It happened because a) decomposition already existed outside Unicode, b) it's useful. Ditto pre-composition.

> Unicode acquired its warts by attempting to solve everybody's problems.

Unicode acquired its warts by attempting to be an incremental upgrade to other codesets. And also by attempting to support disparate scripts with disparate needs. The latter more than the former.

> Instead of making it more and more complex, they should have ruthlessly optimised it to make it work near flawlessly for its most common user: a programmer who couldn't give a shit about internationalisation, ...

They did try to ruthlessly optimize it: by pursuing CJK unification. That failed due to external politics.

As to the idea that programmers who want nothing to do with I18N are the most common user of Unicode, that's rather insulting to the real users: the end users. All of this is to make life easier on end users: so they need not risk errors due to their (or their SW) not being able to keep track of what codeset/encoding some document is written in, so they can mix scripts in documents, and so on.

Unicode is not there to make your life harder. It's there to make end users' lives easier. And it's succeeded wildly at that.

> > And we haven't even gotten to case issues.

> The "case issues" should not have been Unicode's issue at all. Unicode should have done one thing, well. That one thing was ensure each visually distinct string had one, and only one, unique encoding.

You really should educate yourself on I18N.

> As to the idea that programmers who want nothing to do with I18N are the most common user of Unicode, that's rather insulting to the real users: the end users.

Oh for Pete's sake. Unicode / ASCII / ISO 8859-1 are encodings that computers, and thus programmers, use to represent text. Users don't read Unicode, they read text. They never, ever have to deal with Unicode and most wouldn't know what it was if it leapt up and hit them in the face, so if Unicode justified adding features to accommodate these non-existent users, I guess that explains how we got into this mess.

They read text in many scripts (because many writers use more than one script, and many readers can read more than one script). Without Unicode you can usually use TWO scripts: ASCII English + one other (e.g., ISO8859-*, SHIFT_JIS, ...). There are NO OTHER CODESETS than Unicode that can support THREE or more scripts, or any combination of TWO where one isn't ASCII English. For example, Indian subcontinent users are very likely to speak multiple languages and use at least two scripts other than ASCII English. Besides normal people all over the world who have to deal with multiple scripts, there are also scholars, diplomats, support staff at many multi-nationals, and many others who need to deal with multiple scripts.

Whether you like it or not, Unicode exists to make USERS' lives better. Programmers?? Pfft. We can deal with the complexity of all of that. User needs, on the other hand, simply cannot be met reasonably with any alternatives to Unicode.

> Whether you like it or not, Unicode exists to make USERS' lives better.

They've been using multiple scripts long before computers. When computers came along those users quite reasonably demanded they be able to write the same scripts. This created a problem for the programmers. The obvious solution is a universal set of code points - and ISO 10646 was born. It was not rocket science. But if it had not come along some other hack / kludge would have been used because the market is too large to be abandoned by the computer companies. They would have put us programmers in a special kind of hell, but I can guarantee the users would not have known about that, let alone cared.

Oddly the encoding schemes proposed by ISO 10646 universally sucked. Unicode walked into that vacuum with their first cock up - 16 bits was enough for anybody. It was not just dumb because it was wrong - it was dumb because they didn't propose unique encoding. They gave us BOM markers instead. Double fail. They didn't win because the one thing they added, their UCS-2 encoding, was any better than what came before. They somehow managed to turn it into a Europe vs USA popularity contest. Nicely played.

Then Unicode and 10646 became the same thing. They jointly continued on in the same manner as before, inventing new encodings to paper over the UCS-2 mistake. Those new encodings all universally sucked. The encoding we programmers use today, UTF-8, was invented by, surprise, surprise, a programmer who was outside of Unicode / 10646 group think. It was popularised at a programmers conference, USENIX, and from there on it was obvious it was going to be used regardless of what Unicode / 10646 thought of it, so they accepted it.

If Perl6 is any indication, the programmers are getting the shits with the current mess. Perl6 has explicitly added functions that treat text as a stream of graphemes rather than Unicode code points. The two are the same of course except when your god damned compositions rear their ugly head - all they do is eliminate that mess. Maybe you should take note. Some programmers have already started using something other than Unicode because it makes their jobs easier. If it catches on you are going to find out real quickly just how much the users care about the encoding scheme computers use for text.

None of this is to trivialise the task of assigning code points to graphemes. It's a huge task. But for Pete's sake don't over inflate your egos by claiming you are doing some great service for mankind. The only thing on the planet that uses the numbers Unicode assigns to characters as is, is computers. Computers are programmed by one profession - programmers. Your entire output is consumed by that one profession. Yet for all the world what you've written here seems to say Unicode has some higher purpose, and you are optimising for that, whatever it may be. For god's sake come down to earth.

Thank you for your reference to Perl 6. Please note that Perl 6 has been renamed to "Raku" (https://raku.org) using #rakulang as a tag for social media.

Please also note that all string handling in Raku (the `Str` class) is based on graphemes, not just some added functions. This e.g. means that a newline is always 1 grapheme (regardless of whether it was a CR or LF or CRLF). And that for é there is no difference between é (LATIN SMALL LETTER E WITH ACUTE aka 0x00E9) and (LATIN SMALL LETTER E, COMBINING ACUTE ACCENT aka 0x0065, 0x0301).

Please note that for any combination between characters and combiners for which there does not exist a composed version in Unicode, Raku will create synthetic codepoints on the fly. This ensures that you can consider your texts as graphemes, but still be able to roundtrip strange combinations.

    - Old grapheme clusters and new extended grapheme clusters would be the same.

Wasn’t CJK unification something that the PRC demanded?

The opposite. China, Korea, and Japan, each have different styles for common characters, and as a result wanted to have different codepoints for each character. The opposite of unification.

Unification was driven by a desire to keep Unicode a 16-bit codespace for as long as possible. Undoing unification meant adding more pressure on an already crowded codespace, which meant abandoning UCS-2, and creating UTF-16 (horror of horrors), and ultimately switching to UTF-8.

The Consortium chose the domain of the problem space, though. "Text" could be either much simpler or much more complex than Unicode chose to model it. They picked what features they wanted to support, and now we all have to live with that.

In my lifetime, I've seen text systems that choose to include style (bold, italic), color (fore and back), or size with each character. Unicode did not (generally) choose to include those, even though they're arguably part of "global text". Ligatures, too, are generally considered the domain of font rendering, not text storage. Vertical text was, too, until a few months ago. "Historical" writing directions are apparently still considered out of scope for Unicode. Linear A is in scope, though, even though nobody is sure what the characters mean.

Unicode did choose to be backwards compatible with ASCII, and include most of the crazy PETSCII characters (which were pretty popular for just a couple years but not really "global text"), and some mathematical symbols that no mathematician had a use for.

They chose to include both a nice component-combining system and also pre-combined glyphs where legacy codepages had used them. They chose to implement Han unification, but not analogous unifications across other scripts which have even more similar glyphs.

I've dug into the details of Unicode since 3.0 (20 years ago!), and found it's full of arbitrary decisions and legacy workarounds. The contributors are smart but the result looks the same as any committee full of people with conflicting goals.

Legacy support is why it's so well-adopted, and I've never seen a system where piling on legacy support made it "well-designed".

Suppose you wanted to make a system for "universal computation". The analogous method would have been to take Win16, Win32, Mac OS 9, Mac OS X, Linux, and Solaris, define the superset of all of their features, and invent a binary format which supported all of it natively. Legacy support might help get it adopted faster but nobody would call it well-designed. 20 years later, it'd clearly be simultaneously too weak and too powerful for all types of computation we want to do.

Unicode is an amazing political accomplishment. Technically, it seems rather mediocre. Nobody would ever design a text system like this unless held back by mountains of legacy 1970's/1980's systems.

I think your fundamental mistake is thinking that you can separate technical and political concerns or that those two things really are always distinct and cleanly separable.

To me having that kind of discussion really doesn’t make a lot of sense. You know, since “we live in a society” (at the risk of quoting a clearly thought-terminating cliche).

Unicode contains within itself thousands of design-decisions, many of them trade-offs. After the fact it’s always extremely easy to swoop in and nitpick those trade-offs. No possible world exists where all those trade-offs are made correctly and what’s more defining what a “correct” trade-off even is is frequently simply impossible to know.

(Just one example to illustrate the scope of this problem: A certain trade-off might be worth making in one direction for use case A and in another direction for use case B, however it’s not really easy to find out whether use case A or use case B are more frequent in the wild. What if both use cases are about equally as important? Now imagine the trade-off space not being a binary space but multi-dimensional. Now imagine not just two but several use cases. Now imagine use case usage changing over time and new use cases emerging in the future.)

There is way more than enough wacky stuff introduced by Unicode. Having dozens of letters A, for example. And giving a Japanese Kanji character the same code as a Chinese one that usually looks similar.

Unicode did not introduce having dozens of letters A, they existed without unicode. Unicode just gives you a way to represent them -- and bonus, often to normalize them all to a normal letter A too.
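For instance, Python's standard `unicodedata` module can fold several of those stylistic variants back to a plain A using the NFKC compatibility normalization. The list of variants is just a small hand-picked sample:

```python
import unicodedata

# Fraktur, fullwidth, circled, and mathematical bold capital A.
variants = ["\U0001D504", "\uFF21", "\u24B6", "\U0001D400"]

folded = [unicodedata.normalize("NFKC", ch) for ch in variants]
print(folded)  # ['A', 'A', 'A', 'A']

# Canonical normalization (NFC) keeps them distinct, so the
# distinction survives unless the caller asks to discard it.
kept = [unicodedata.normalize("NFC", ch) for ch in variants]
print(kept == variants)  # True
```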

It is a mistake to think that Unicode has the ability to change people's text behavior by not supporting things. I mean, maybe it does now that it's so popular, but in order to get adoption it had to support what people were actually doing.

People had use cases to put "𝔄" and "A" in the same document and keep them distinct without unicode. It is not a service for unicode to refuse to let them; and if it tried, it would just lead to people finding a way to do it in a non-standardized or non-unicode-standardized way anyway, which still wouldn't help anyone.

You might just as well say "I don't understand why we need lowercase AND capital letters, the standard is just complicating things supporting both -- the ancient romans didn't need two cases after all"

There's no way to distinguish "A" "uppercase a" and "Α" "uppercase α" in written text, but they're different Unicode letters (and might be rendered differently depending on font).
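A quick way to see the difference is to ask Python's `unicodedata` module for the character names. The three letters below usually render identically but are separate code points, and no normalization form merges them:

```python
import unicodedata

lookalikes = ["A", "\u0391", "\u0410"]  # Latin, Greek, Cyrillic

names = [unicodedata.name(ch) for ch in lookalikes]
for ch, name in zip(lookalikes, names):
    print(f"U+{ord(ch):04X} {name}")

# Even the most aggressive normalization form leaves Greek and Cyrillic alone.
still_distinct = [unicodedata.normalize("NFKC", ch) for ch in lookalikes]
print(len(set(still_distinct)))  # 3
```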

> There is way more than enough wacky stuff introduced by Unicode. Having dozens of letters A, for example.

You'd have to go back before Unicode to prevent this.

Unicode was created with certain engineering constraints, one of them being round-trip compatibility. This means that it needs to be possible to go from $OTHER_ENCODING -> Unicode -> $OTHER_ENCODING and get a result which is bitwise-identical to the input. In short, Unicode is saddled with the fact pre-Unicode text encoding was a mess, plus the fact people tend to not like irreversible format changes.
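A small Python sketch of what that constraint means in practice. The helper name is invented, and the two encodings are only examples of legacy codesets that Python ships codecs for:

```python
def round_trips(legacy_bytes: bytes, encoding: str) -> bool:
    """True if legacy bytes -> Unicode -> legacy bytes is bitwise-identical."""
    as_unicode = legacy_bytes.decode(encoding)
    return as_unicode.encode(encoding) == legacy_bytes


# Every possible ISO 8859-1 byte value survives the trip through Unicode.
print(round_trips(bytes(range(256)), "latin-1"))  # True

# A Cyrillic sample in KOI8-R (bytes produced here from a str for brevity).
koi8_sample = "Привет".encode("koi8_r")
print(round_trips(koi8_sample, "koi8_r"))  # True
```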

Yeah, that is quite inconsistent. Kanji literally means "Chinese Character" so it should be the same for the letter A. Unless a French A isn't equivalent to an English A.

Arabic numerals (0123456789) are not to be confused with the Arabic numerals (٠١٢٣٤٥٦٧٨٩). So the fact that kanji literally means "Chinese character" doesn't mean that kanji and hanzi should be considered the same script.

The Latin script (that which I write right now) and the Cyrillic script both derived heavily from the Greek script, especially the capital letters--fully 60% of them are identical in Latin and Greek, even more if you include obsolete letters like digamma and lunate sigma (roughly F and C, respectively). Most of these homoglyphs furthermore share identical phonetic values.

In retrospect, treating traditional Chinese, simplified Chinese, and Japanese kanji as different scripts seems like it would have been the better path. I don't know enough about the Korean and Vietnamese usage of Chinese characters to know if those scripts are themselves independent daughter scripts or complete imports of Chinese with a few extra things thrown in (consider Farsi's additions to Arabic, or Icelandic's þ and ð additions to Latin).

> So the fact that kanji literally means "Chinese character" doesn't mean that kanji and hanzi should be considered the same script.

The Japanese writing system differs from the Chinese one by having its own distincts scripts (hiragana, katakana), but most of its subset made of Chinese characters (kanji) is the same than the Chinese script. The most comprehensive Chinese character dictionary is a Japanese one (Daikanwa jiten), which give definition and Japanese readings and this is possible precisely because the script is the same.

The only differences are characters created for use in Japan (kokuji), which can be treated as an extension like the Vietnamese Nôm, characters simplified by the Japanese government (some jôyô kanji), and variation in the shapes of some glyphs (黃/黄). So, treating the full inventories of these languages as different scripts wouldn't make more sense than encoding the English, French and Czech alphabets separately because a few characters differ.

My opinion is that Han unification makes sense, but the mistake made was to encode the variant interpretation at application level (e.g. html lang tags), which is not portable. I don't know how Unicode variant forms work in detail (putting a trailing code after a character to indicate precisely which variant is meant) but something like that at text encoding level could ease a lot of pain.

The alternative is worse. Look at all of the problems we have with Turkish I, just because they didn't create new codepoints to make Turkish I and Latin I distinct even though they look the same.
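For anyone who hasn't run into the Turkish I problem, a small Python sketch. `str.lower()` and `str.upper()` apply Unicode's default, locale-independent case mappings; the `turkish_upper` and `turkish_lower` helpers are hypothetical names I made up for illustration, not a real locale-aware API.

```python
# Default (locale-independent) case mappings, as applied by Python's str methods.
assert "I".lower() == "i"            # right for English, wrong for Turkish (should be ı)
assert "i".upper() == "I"            # right for English, wrong for Turkish (should be İ)
assert "\u0131".upper() == "I"       # ı (dotless i) uppercases to plain I
assert "\u0130".lower() == "i\u0307" # İ lowercases to i + COMBINING DOT ABOVE

def turkish_upper(s: str) -> str:
    """Hypothetical helper: uppercase using the Turkish i/İ and ı/I pairs."""
    return s.replace("i", "\u0130").replace("\u0131", "I").upper()

def turkish_lower(s: str) -> str:
    """Hypothetical helper: lowercase using the Turkish İ/i and I/ı pairs."""
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()

print(turkish_upper("istanbul"))  # İSTANBUL
print(turkish_lower("ISPARTA"))   # ısparta
```

The point is that the same code point needs different case mappings depending on the language, which is information the string itself doesn't carry.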

Cyrillic А is not the same as English A.

For example, some fonts render A in a way that looks like Cyrillic Л. (Like The Mandalorian title screen.)

This would be incorrect if using the same A for both: https://i.ytimg.com/vi/V8fC7bdV-mI/maxresdefault.jpg

I don't see anything special in the linked image. The As look as they would in Latin script.

(probably not what you meant, but just in case: the fourth letter is not a Cyrillic A but a D.)

I think the image is not meant to show the problem but show a case where if the Cyrillic A had been stylised the same way that the English A is in the English version, the two distinct letters would become indistinguishable such that the Cyrillic title would effectively read "The Mlndlloriln"

I see, for those out of the loop, the English title screen does not have the horizontal bar in the As.

I should have linked the English title as well, now I read my comment and see it is not sufficiently unambiguous.

Here: https://en.wikipedia.org/wiki/File:The_Mandalorian_logo.jpg you can see that the A letters are rendered the same as Cyrillic Л.

To further the confusion, in Buenos Aires there are many street signs that use Л as the letter A, and they use П as the letter N.

Part of the confusion was because I see the English title's As are actually Λ (capital Greek lambda), rather than Л (at least in the font that HN uses). I'm guessing from context that Л is sometimes rendered as Λ in some Cyrillic fonts.

> I'm guessing from context that Л is sometimes rendered as Λ in some Cyrillic fonts.

Exactly. It is more often rendered like that in Bulgaria. But it is still the letter Л.

Which just furthers the point that glyph rendering and character code points are very different problems and the multiple code points in Unicode are the right approach.
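A quick way to see the separate code points behind identical-looking glyphs, sketched with Python's standard `unicodedata` module (the choice of the three letters is mine):

```python
import unicodedata

# Three capital letters that most fonts draw identically.
lookalikes = ["\u0041", "\u0391", "\u0410"]  # Latin A, Greek Alpha, Cyrillic A

for ch in lookalikes:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  lower -> U+{ord(ch.lower()):04X}")

names = [unicodedata.name(ch) for ch in lookalikes]
lowered = [ch.lower() for ch in lookalikes]
```

Each one lowercases to a different letter (a, α, а), which is one practical consequence of keeping them as distinct characters rather than one shared glyph.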

If the readers were to confuse A and Д, that's a problem with the font, not the letters. Cyrillic, Greek and Latin A are all one letter (in uppercase).

That might be wacky to you but I'm not sure it's wacky to the people to whom it makes a difference.

Lots of the wacky stuff came from the original dream of a purely 16-bit code, and then more wacky stuff to extend it from there. I.e., starting from UTF-8 could have avoided any amount of unpleasantness. But of course UTF-8 wasn't invented until later. The 16-bit representation got encrusted in OSes and languages of a certain period.

The same goes, of course, for writing systems, going back to the first, that we would all do differently in hindsight.

Even today we are making apparently sensible choices we (or our digital successors) will regret as deeply.

ISO 8601 looks good now, but it only delays the transition to a rational calendar which, admittedly, we would certainly get wrong if we tried codifying one today.

Fortunately daylight saving time will be gone worldwide before the next decade passes, but not without some places getting stuck at the wrong timezone. (E.g. Portugal different from Spain, and probably Indiana different from itself.)

Indiana doesn't have special time zones any more. It allows individual counties to choose which standard zone to be in but they all observe the normal DST.

No, where Unicode is complicated is where the Unicode people decided to make it complicated to bolster their egos, to the detriment of everyone downstream of them.

Like with most standardization, the people at the helm are the wrong people with the wrong motivations.

You're demonstrably wrong.

Most complexity in Unicode derives from:

  - real complexity in human scripts
  - politics
neither of which is something that Unicode could have avoided. Complexity in human scripts necessarily leads to complexity in Unicode. Not having Unicode at all would be much worse than Unicode could possibly seem to you -- you'd have to know the codeset/encoding of every string/file/whatever, and never lose track. Not having politics affect Unicode is a pipe dream, and politics has unavoidably led to some duplication.

Confusability is NOT a problem created by Unicode, but by humans. Even just within the Latin character set there are confusables (like 1 and l, which often look the same, so much so that early typewriters exploited such similarities to reduce part count).

Nor were things like normalization avoidable, since even before Unicode we had combining codepoints: ASCII was actually a multibyte Latin character set, where, for example, á could be written as a<BS>' (or '<BS>a), where <BS> is backspace. Many such compositions survive today -- for example, X11 Compose key sequences are based on those ASCII compositions (sadly, many Windows compositions aren't). The moment you have combining marks, you have an equivalence (normalization) problem.

Emojis did add a bunch of complexity, but... emojis are essentially a new script that was created by developers in Japan. Eventually the Unicode Consortium was bound to have to standardize emojis for technical and political reasons.

Of course, some mistakes were made: UCS-2/BMP, UTF-16, CJK unification, and others. But the lion's share of Unicode's complexity is not due to those mistakes, but to more natural reasons.


And the alternative would be what exactly?

As a developer who's been working intimately with user-facing strings for years, I have to disagree in the strongest possible terms. Unicode is one of the borderline zero standards that is almost angelic in its purity, with only an extremely few things I think might have served better if done differently.

> As a developer who's been working intimately with user-facing strings for years, ...

User-facing is easy; things go downhill when users have system-facing strings of their own, and some of those strings become other-user-facing strings.

> with only an extremely few things I think might have served better if done differently.

Thus, in spite of disagreeing in the strongest possible terms, you do have some nits to pick.

A "few things" could be far-reaching. For instance, allowing the same semantic character to be encoded in more than one way can count as "one thing". If someone happens to think this is the only problem with Unicode, then that's "extremely few things". Yet, it's pretty major.

Your idea that either there are NO "nits to pick" (things that could have been done better in a standard, complete perfection), OR it means that the standards-makers "decided to make it complicated to bolster their egos" -- is ABSOLUTELY INSANE.

My point isn't that there must be no nits to pick, but that look, even a self-proclaimed Unicode cheerleader who disagrees with me in the "strongest possible" terms still finds it necessary to mention that he or she has some.

> No, where Unicode is complicated is where the Unicode people decided to make it complicated to bolster their egos, to the detriment of everyone downstream of them.


Interesting statement. Other than maybe han unification, what would you do differently?

u/kazinator is decidedly wrong (see above), but besides not trying CJK unification, I wish we had had UTF-8 from day 0, no UCS-2, no UTF-16, no BMP, no codespace limit as low as 21 bits. That's mostly it. If we could have stood not having precompositions, I'd rather not have had those either, but that would have required a rather large leap in functionality in input modes in the late 80s or early 90s, which would not have been feasible.

kazinator is wrong, decidedly so, but let me take this opportunity to opine my own impractical list of gripes that require going back in history and redesigning Unicode in fundamental ways ...

What a comic thread!

That 'list of gripes' is all about the single change of having UTF-8 from the start. It's not multiple separate problems.

Also why are you implying that any gripes automatically prove you right? It's kind of ridiculous to suggest that not having UTF-8 was people "deciding to make it complicated to bolster their egos".

What we are contesting is your characterisation that the people in charge of Unicode added complexity in order to puff up their egos, rather than making decisions that, with the benefit of our present knowledge, were the incorrect ones.

> Unicode is simply a 16-bit code - Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.

Verity Stob has a great column


where she says that it is wrong to call this a myth, since that was how it was originally designed. It is better characterized as being obsolete, rather than a myth.

It's a myth about the current version of Unicode.

Whether it's true about some obsolete version is hair-splitting at this point.

IMHO the article should mention that UTF-16 was (more or less) a hack to fix Windows and some other systems which didn't see the light and use UTF-8 from the start. UTF-16 has all the disadvantages of UTF-8 (variable length) and UTF-32 (endianness), but none of the advantages (encoding as endian-agnostic, 7-bit ASCII compatible byte stream like UTF-8, or a fixed-width encoding like UTF-32). UTF-16 should really be considered a hack to talk to (mainly) Windows APIs.

Also, obligatory link to: https://utf8everywhere.org/

Windows and many other operating systems and languages (Java) got on board with Unicode back when the character set would fit in 16bits. The character set originally used was UCS-2 (not UTF-16). UTF-16 came next to extend the Unicode character set beyond 65536 code points.

UTF-8 wasn't even invented until well after all these operating systems and languages deployed Unicode.

They didn't see the light of day to use UTF-8 because they didn't have a time machine to make that possible.

I actually checked a while ago when UTF-8 was created, and it was just around the same time when Windows NT was developed with 16-bit "early" Unicode support. UTF-8 was created in September 1992 [1], and Windows NT came out mid 1993, but I guess it was too late for Windows to change to UTF-8 (and I guess the advantages of UTF-8 weren't as clear back then).

But IMHO there's no excuse to not use UTF-8 after around 1995 ;)

[1] https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

Also, UTF-16 was only published in July 1996 (although the need for more than 16 bits was probably apparent a bit earlier). So before that, Unicode was only a 16-bit encoding, and UCS-2 was enough. UTF-8 was initially just a nice trick to keep using ASCII characters for things like directory separators (/) and single-byte NUL terminators. By 1995 its superiority certainly wasn't apparent yet.

Also, Windows internals were completely 16-bit-character based, including e.g. the NTFS disk format, so by 1992 that was already quite hard to change.

That said, it is crazy that NT didn't have full UTF-8 support, including in console windows, by about 2000.

The main point that should be emphasised is that any encoding with fixed size unicode codepoints is mostly unnecessary as you mostly don’t care about the codepoints but about what the resulting glyphs or even glyph runs look like.

My experience is that if you want to implement an efficient unicode-aware text editor, then the right data structure is a list of lines, and you simply have to forget about gap buffers, ropes and whatnot (unless you really care about 32k+ lines/paragraphs, which is when rope-style representation starts to make sense as long as the breaks match unicode semantics)

I mean, awesome for whom? It might be awesome for end users as you can type in or copy/paste things without caring which language you're using. But for programmers, Unicode is a bloated monstrosity and a source of endless nightmare. Eventually, it's not going to be awesome for end users either because it will be plagued by a lot of (subtle) inconsistencies. Unicode looks a lot like a leaky abstraction to me (because of poor foresight), and it's getting worse each year.

If you think Unicode is a "bloated monstrosity and a source of endless nightmare," what would you remove from Unicode?

And if you're going to respond "emoji", I'll point out that removing emoji doesn't actually remove anything that makes text processing with Unicode difficult, just makes it more likely that people will assume that what works for English works for everybody.

(Side note: it is not possible to accurately represent modern English text solely with ASCII, as English does contain several words with accented characters, such as façade and résumé).

How about removing variation selectors? For example it's possible to turn an emoji back into text by appending a code point!

They are very painful to implement and most don't get it right.

See https://twitter.com/ridiculous_fish/status/10894210337932369...
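For readers who haven't met variation selectors, a minimal Python sketch of the mechanism described above. The heart character is my own choice of example, and `strip_variation_selectors` is a hypothetical helper that only covers the U+FE00..U+FE0F block.

```python
import unicodedata

HEART = "\u2764"                  # HEAVY BLACK HEART
text_style = HEART + "\ufe0e"     # request text presentation
emoji_style = HEART + "\ufe0f"    # request emoji presentation

# Same base character, but the strings differ and each is two code points long.
print(unicodedata.name("\ufe0e"), unicodedata.name("\ufe0f"))
print(len(text_style), len(emoji_style), text_style == emoji_style)

def strip_variation_selectors(s: str) -> str:
    """Hypothetical helper: drop selectors in U+FE00..U+FE0F before comparing."""
    return "".join(ch for ch in s if not "\ufe00" <= ch <= "\ufe0f")
```

Whether a renderer actually honours the selector depends on the font and platform, which is part of why implementations get it wrong.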

Unicode doesn't solve the underlying complexity of human languages, as you noted. I think the main contribution of Unicode Consortium is that they brought all the nitty-gritty problems of human languages into one central repository and made them visible to everyone. That itself is an awesome effort, and I personally had a lot of benefit from it (my native language is Japanese). But that doesn't make Unicode as a standard "awesome". Maybe we should be thankful for how messy it is? That's more or less a view that I can agree with.

‘Remove’ is too strong, since Unicode is entrenched. But there are things that should have been done differently. For instance, combining characters and operators should have been placed before the base character rather than after, so that (a) it would be possible to know when you've reached the end of a character^W glyph^W grapheme cluster without reading ahead, and (b) dead keys would be identical to the corresponding characters.

> façade and résumé

ASCII (1967) allowed for them: c BS , or , BS c ↦ ç and e BS ' or ' BS e ↦ é. Encoding ç as 63 CC A7 is not manifestly better than encoding it as 63 08 2C.

> ASCII (1967) allowed for them: c BS , or , BS c ↦ ç and e BS ' or ' BS e ↦ é. Encoding ç as 63 CC A7 is not manifestly better than encoding it as 63 08 2C.

Doesn't work for ñ, since the ASCII ~ is often typeset in the middle of the box instead of in a position to appear above an 'n' character. " is a pretty poor substitute for ◌̈ though, especially when you're trying to write ï as in naïve. And then there's the æ of archæology, which doesn't work with overwriting.

I'll also point out that ç is U+00E7 in Unicode and C3 A7 in UTF-8, not 63 CC A7, since it's a precomposed character (and NFC form is usually understood to be the preferred way to normalize Unicode unless there's a reason to do something else).
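The two byte sequences under discussion can be checked directly. A small Python sketch (the `hex_bytes` helper is just a name I made up for printing):

```python
import unicodedata

precomposed = "\u00e7"    # ç as a single code point
decomposed = "c\u0327"    # c followed by COMBINING CEDILLA

def hex_bytes(s: str) -> str:
    """Return the UTF-8 encoding of s as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in s.encode("utf-8"))

print(hex_bytes(precomposed))  # c3 a7
print(hex_bytes(decomposed))   # 63 cc a7

# Normalization converts between the two spellings.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
```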

Tilde exists in ASCII because of its use as an accent. (In 1967 the non-diacritic interpretation was an overline.) The use in programming languages, and lowering to fit other mathematical operators, came later.

There was never any requirement that ‘n BS ~’ have the same appearance as ‘n’ overprinted with ‘~’, although terminals capable of making the distinction didn't appear until the 70s.

Precomposed characters aren't relevant to illustrating composition mechanisms.

If you extend ASCII to CP1252, which was the most common encoding before UTF-8 took over, then you do get those accented characters (and that's likely responsible for the popularity of '1252.)

In fact, the first 256 characters of Unicode are almost identical to CP1252. I'm pretty sure that's not a coincidence.

> the first 256 characters of Unicode are almost identical to CP1252. I'm pretty sure that's not a coincidence.

That depends on whether you consider the fact that Windows CP 1252 is almost identical to Latin-1 (ISO-8859-1), which is exactly the first 256 characters of Unicode, to be a coincidence.

> This character encoding is a superset of ISO 8859-1 in terms of printable characters, but differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range.

* https://en.wikipedia.org/wiki/Windows-1252
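This is easy to verify with Python's built-in codecs. A rough sketch: bytes that Python's `cp1252` codec leaves undefined are decoded with `errors="replace"` here, which is my choice for the comparison, not something either standard prescribes.

```python
# Latin-1 maps byte N straight to code point N, for all 256 bytes.
all_bytes = bytes(range(256))
latin1_is_identity = all_bytes.decode("latin-1") == "".join(chr(i) for i in range(256))

# Find the bytes where CP1252 and Latin-1 disagree.
differing = [
    b for b in range(256)
    if bytes([b]).decode("cp1252", errors="replace") != bytes([b]).decode("latin-1")
]

print(latin1_is_identity)
print(f"{differing[0]:#04x}..{differing[-1]:#04x}, {len(differing)} bytes differ")
print(b"\x80".decode("cp1252"))  # the euro sign, U+20AC
```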

Most resumes I’ve seen don’t even bother with the accents.

Most are written in Word on Windows, and I’d guess that most people don’t even know how to access the accented characters.

> such as façade and résumé

That's simple: just url encode.





The second one is way easier to comprehend than the first.

You mean www.xn--faebook-vxa.com of course :P

>but for programmers

I think that depends on what level of the stack you work at. I'm a programmer, but strictly at an end-user-facing level. I'm not implementing Unicode support, I'm using programming languages that already have Unicode support. And Unicode support is an absolute godsend. It's amazing to not have to think about any of that, and just treat it as a solved problem.

25 or 50 examples of inconsistencies would help support your tone.

Unicode was originally designed to fit in 16 bits, and this is memorialized in Java APIs that make it easy to mess up.

The unicode character does not specify the glyph to draw. Han unification is the best known, but not only source, of this challenge.

The glyph does not specify the unicode character. Precombined vs combining characters is a source of this challenge. The result is that a name can be entered into a database then unfindable due to a search.
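A minimal sketch of the "entered but unfindable" failure and the usual fix, in Python. The name, the `records` dict standing in for a database, and the `nfc` helper are all made up for illustration.

```python
import unicodedata

stored = "Jos\u00e9"     # é as one precomposed code point
query = "Jose\u0301"     # e followed by COMBINING ACUTE ACCENT

records = {stored: "customer #1"}   # stand-in for a database keyed by name

print(query in records)             # False: same glyphs, different code points

def nfc(s: str) -> str:
    """Normalize to NFC so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", s)

# Normalize on the way in and on the way out.
normalized_records = {nfc(k): v for k, v in records.items()}
print(nfc(query) in normalized_records)  # True
```

Normalizing at both write and lookup time is the simple version; it doesn't address lookalikes from different scripts, which normalization leaves alone.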

This feature has also been a source of security holes. See https://appcheck-ng.com/unicode-normalization-vulnerabilitie... for an explanation of how.

You would think that you could avoid this through banning control and combining characters and not lose anything. Indeed at one point the authors of Go (who included the inventors of UTF-8) thought this. But there are whole languages (particularly from the Indian subcontinent) that cannot be written without combining characters.

There are also lots and lots of invisible characters. This has been used to "fingerprint" text. (Each person gets a different invisible signature. The forwarded email includes the signature.) That's an interesting feature but complicates matching text documents even more.

Need I go on? When I see Unicode, I know that there lie dragons that programmers don't necessarily expect.

One of your points is that an encoding designed to handle languages has support for more than one kind of white space. Given that languages use more than one kind of white space, this is sort of a necessity.

Another one is that a standard designed to support all languages has a feature necessary for supporting some languages.

Those aren't inconsistencies, so do feel free to go on.

> One of your points is that an encoding designed to handle languages has support for more than one kind of white space.

No. It is that there is more than one kind of invisible character. No language has invisible characters.

> Another one is that a standard designed to support all languages has a feature necessary for supporting some languages.

Not sure what point you are misreading here. But that was not among my points.

You said "But there are whole languages (particularly from the Indian subcontinent) that cannot be written without combining characters."

I suppose I didn't consider that they could be written without combining characters given a different design.

As far as invisible characters, I'm not interested in arguing about it. English, as written, has all sorts of different structural uses of white space, it isn't all just style.

> I suppose I didn't consider that they could be written without combining characters given a different design.

They could be.

Likewise European languages can be written without precombined characters. The fact that é can be written in multiple ways was my point.

> As far as invisible characters, I'm not interested in arguing about it. English, as written, has all sorts of different structural uses of white space, it isn't all just style.

You still don't understand. I am not talking about whitespace. I am talking about invisible zero-width characters that can be slipped into text with no sign that they are there. Characters like U+180E, U+200B, U+200C, U+200D, and U+FEFF. Not to mention that you can achieve the same thing with control characters like U+200F U+200E. (The undetectability of the last one is language dependent.)

As I said, this can be used to invisibly sign a document. But I don't see any other particular point to having so many ways to accomplish what looks like nothing.
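For what it's worth, detecting the characters listed above is not hard once you know to look. A Python sketch, where the `INVISIBLE` set is just the code points named in this comment (not a complete list) and both helper names are hypothetical:

```python
# Code points named in the comment above; far from exhaustive.
INVISIBLE = {"\u180e", "\u200b", "\u200c", "\u200d", "\u200e", "\u200f", "\ufeff"}

def find_invisible(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for every listed invisible character in text."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if ch in INVISIBLE]

def strip_invisible(text: str) -> str:
    """Remove the listed characters, e.g. before comparing two documents."""
    return "".join(ch for ch in text if ch not in INVISIBLE)

marked = "in\u200bvisible\u200d"
print(find_invisible(marked))
print(strip_invisible(marked) == "invisible")
```

Blindly stripping U+200C and U+200D is only safe for comparison purposes: some scripts and emoji sequences use them meaningfully, so removing them from text you intend to display can change how it renders.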

Without arguing the details, I have to agree with your statement because the article never really supported its claim of being awesome.

I think Unicode is terrible. Remove everything. Use ASCII and other character sets.

Unicode is OK for searching for data using many different languages (if you omit much of the junk such as emoji and compatibility characters), although might not be best with that too.

You can't effectively use one character set well for everything; different applications have different requirements. Unicode is equally bad for everything, rather than e.g. ASCII which is good for some stuff and not usable for some stuff, and other character set which is a similar thing. Many things you just can't do accurately with Unicode.

> You can't effectively use one character set well for everything; different applications have different requirements.

In our application, our users get data from systems around the world, and might have to change some of it before sending a file with the data to some official system. The data includes names of people and places. How would you do this using character sets?

One file might need to contain names with Cyrillic characters and with Norwegian characters. There's no character set with both. Should each string in the file have an attribute saying which character set the string is encoded in? What are the odds that people implementing that won't mess that up when oh so many can't even get a single encoding attribute right[1]?

Or, just maybe, strings in the file could be Unicode, encoded in say UTF-8, so that the handling of all of them is uniform...

[1]: https://www.w3.org/TR/xml/#charencoding

> Or, just maybe, strings in the file could be Unicode, encoded in say UTF-8, so that the handling of all of them is uniform...

Actually, that won't work. There are cases where a character may be different according to the language, where capitalization may differ depending on the language, where sort order may depend on the language, etc.

If your application is allowing users to edit the text, or if you know which languages will be used, or if you don't care about capitalization, then you don't have to worry about any of those edge cases, and Unicode is useful.

Unicode solves all that. It has case folding rules to handle capitalization differences. It has collation rules to handle sorting differences.
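As a small illustration of the case folding part, here is what Python's built-in `str.casefold()` does. It implements Unicode's default folding, so it is a sketch of the language-independent rules only; language-specific tailoring and collation need a dedicated library (ICU is the usual one), which I'm not showing here.

```python
# Default Unicode case folding, via Python's built-in str.casefold().
words = ["Straße", "STRASSE", "strasse"]
folded = [w.casefold() for w in words]
print(folded)

# lower() is not enough for caseless matching: ß stays ß.
print("Straße".lower() == "STRASSE".lower())        # False
print("Straße".casefold() == "STRASSE".casefold())  # True
```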

Well if you write an application for a 'non-technical' international audience, you'll have to support international text output. And representing text as one of the universal Unicode encodings is still much better than the codepage mess and region-specific multi-byte encodings like Shift-JIS we had before.

UTF-8 is usually the best choice both for simple tools and 'user-facing applications' since it is backward-compatible with 7-bit ASCII (e.g. usually you don't need to change a thing in your code, at least if you just pass strings around).

If you encounter a byte in an UTF-8 encoded string which has the topmost bit cleared, it's an ASCII character and definitely not part of a multi-byte sequence. If the topmost bit is set, the byte is part of a multi-byte-sequence, and such sequences must remain intact.
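The rule above fits in a few lines. A Python sketch, where `classify` is a hypothetical helper name and the sample string is my own:

```python
def classify(byte: int) -> str:
    """Classify one byte of UTF-8 by its top bits."""
    if (byte & 0x80) == 0:
        return "ascii"          # 0xxxxxxx: a complete ASCII character
    if (byte & 0xC0) == 0x80:
        return "continuation"   # 10xxxxxx: inside a multi-byte sequence
    return "lead"               # 11xxxxxx: starts a multi-byte sequence

data = "a/\u00e9\u20ac".encode("utf-8")   # a, /, é (2 bytes), € (3 bytes)
kinds = [classify(b) for b in data]
print(kinds)

# An ASCII byte such as "/" never appears inside a multi-byte sequence,
# so splitting on it cannot cut a character in half.
parts = data.split(b"/")
```

This does not validate the input (overlong forms, stray continuation bytes and so on); it only shows why ASCII-oriented byte code keeps working.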

UTF-8 isn't such a bad encoding (although it isn't ideal for fixed-pitch text; I invented a character set and encoding which would be better for fixed-pitch text). But I was not talking about the encoding; I was talking about the Unicode character set.

> UTF-8 isn't such a bad encoding (although it isn't ideal for fixed-pitch text; I invented a character set and encoding which would be better for fixed-pitch text).

This is utterly incoherent.

Can anyone explain how the statement I responded to makes sense?

I must be wrong, getting so many disagreements.

Well, UTF-8 is an encoding of Unicode, which allows for combining characters and all that jazz, which can be a bad fit for fixed-pitch text.

For example, take a Zalgo text generator[1] and try to make the result make sense in a fixed-pitch (monospace) setting.

At least that's my interpretation of what he tried to convey.

[1]: http://eeemo.net/

> Use ASCII and other character sets.

We have tried that before. It did not work, and it was not pretty. You may not know, but there is a huge demand to be able to use characters from different sets in the same document. How do you do Wikipedia without Unicode? (E.g. this: Alexander Sergeyevich Pushkin (English: /ˈpʊʃkɪn/;[1] Russian: Александр Сергеевич Пушкин[note 1]).)

How would you implement any chat/messaging app for the international audience? Like in my current company, I am sure at least five languages, each with its own alphabet, are used to communicate in Slack.

For me, an app not supporting Unicode is broken.

Wikipedia didn't use Unicode originally; the en, da, sv, and nl language Wikipedias all used windows-1252. This all changed somewhere around 2004, I think, but there is still legacy code to deal with edits from before the switchover point.

I imagine the answer is, it kind of sucked but people made do the best they could with the limited allowed characters. It's not like IPA notation is a critical feature

> You can't effectively use one character set well for everything; different applications have different requirements

So how about an application like Twitter, which has the requirement "has to support all globally currently written languages, often right next to each other", what character set aside from Unicode is appropriate?

And for what application in 2019 is Unicode inappropriate and why?

I swear there should be some rule or law about how Unicode articles will inevitably muddle code units/points / grapheme clusters / bytes together.

> String length is typically determined by counting codepoints.

> This means that surrogate pairs would count as two characters.

If you were counting code points, a surrogate pair would be 1. If it's two, you're counting code units.

> Combining multiple diacritics may be stacked over the same character. a + ̈ == ̈a, increasing length, while only producing a single character.

Not if you're counting code points or code units, which would both produce an answer of "2", and that's a great example of why you shouldn't count with either.
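To pin down the three counts being mixed up, a short Python sketch. Python's `len()` on `str` counts code points; the other two numbers come from encoding. The `counts` helper and the two sample strings are my own.

```python
def counts(s: str) -> dict[str, int]:
    """Count the same string in three different units."""
    return {
        "code_points": len(s),
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        "utf8_bytes": len(s.encode("utf-8")),
    }

a_diaeresis = "a\u0308"   # a + COMBINING DIAERESIS, one visible character
grinning = "\U0001F600"   # one code point outside the BMP

print(counts(a_diaeresis))  # {'code_points': 2, 'utf16_code_units': 2, 'utf8_bytes': 3}
print(counts(grinning))     # {'code_points': 1, 'utf16_code_units': 2, 'utf8_bytes': 4}
```

Neither string gets a count of 1 in every unit, and none of these units is "user-perceived characters"; that needs grapheme cluster segmentation, which Python's standard library doesn't provide.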

The dark blue on black in tables is next to invisible. And then to put that on white on the alternate rows is just eyeball murder.

> Since there are over 1.1 million UTF-8 glphys (sic)

UTF-8 glyphs twitch; aside from that, I'm really curious how they got that number. In some ways, a font has it easy; my understanding is that modern font formats can do one glyph for acute accent, one glyph for all the vowels/letters, and then compose the glyphs into arrangements for having them combined. (IDK if those are also "glyphs" to the font or not.) But it's less drawing, at least. OTOH, some characters have >1 appearance/"image", AIUI.

Also a law that the author is thinking of one and only one programming language.

> String length is typically determined by counting codepoints.

That depends entirely on what "strings" you are talking about.

In C/Go/Rust/Ruby, char*/string/std::string::String/String is bytes.

In Java/JavaScript, java.lang.String/String is UTF-16 code units.

In Python 3, str is code points.

In Swift, String is extended grapheme clusters.

In Haskell, there are various different "string" types in common use.

And in C++, std::basic_string is a generic container for whatever element type you want. (std::string specialization being for bytes.)

EDIT: Clarified that I don't disagree with parent comment; merely pointing out additional less-than-precise language.

Sure, different languages have various, usually bad, definitions of length.

The point is that those two sentences themselves in the article are conflicting with each other, not that we're talking about any language in particular. (But certainly the article could go into a survey of common languages like you have.)

I think Rust strings are all Unicode native, though they can be transmuted to bytes.

It's a bit of both. Rust strings are (guaranteed valid) UTF-8 bytes.

str.len() returns the number of bytes; s.chars().count() returns the number of characters.

I'm in love with Swift's approach, where the default representation is a well defined thing that both users and developers think of as "characters", but all the other representations are trivially accessible.

I disagree. Grapheme clusters are locale-dependent, much like string collation is locale-dependent. What Unicode gives you by default, the (extended) grapheme cluster, is as useful as the DUCET (Default Unicode Collation Element Table); while you can live with them, you would be unsatisfied. In fact there are tons of Unicode bugs that can't be corrected due to the compatibility reason, and can only be fixed via tailored locale-dependent schemes.

I would like to avoid locales in the language core. It would be great to have locale stuff in the standard library, but without locale information you can't treat strings as (human) texts.

Can you give examples of locale-dependent things, or issues with extended grapheme clusters?

Hangul normalization and collation is broken in Unicode, albeit for slightly different reasons. The Unicode Collation Algorithm explicitly devotes two sections related to Hangul; the first section, for "trailing weights" [1], is recommended for the detailed explanation.

The Unicode Text Segmentation standard [2] explicitly mentions that Indic aksaras [3] require the tailoring to grapheme clusters. Depending on the view, you can also consider orthographic digraphs as examples (Dutch "ij" is sometimes considered a single character for example).

[1] https://www.unicode.org/reports/tr10/#Trailing_Weights

[2] https://unicode.org/reports/tr29/

[3] https://en.wikipedia.org/wiki/Aksara#Grammatical_tradition

For example, the text "ch" (U+0063 U+0068) is two grapheme clusters in English contexts, but one grapheme cluster in Czech contexts, collated between "h" and "i". [1]

According to Unicode, the text "Chemie" is written exactly the same whether it's the German or the Czech word. However, a German will say it has six letters and a Czech will say it has five.
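A toy sketch of what "tailoring" means for the Chemie example, in Python. This is not how a real locale-aware library segments text; `count_letters` and its `digraphs` parameter are hypothetical, and the only tailoring modelled is treating "ch" as one letter.

```python
import re

def count_letters(word: str, digraphs: tuple[str, ...] = ()) -> int:
    """Count letters, treating each listed digraph as a single letter."""
    if not digraphs:
        return len(word)
    # Try the digraphs first (longest first), then fall back to any single character.
    alternatives = sorted(digraphs, key=len, reverse=True)
    pattern = "|".join(map(re.escape, alternatives)) + "|."
    return len(re.findall(pattern, word, flags=re.IGNORECASE | re.DOTALL))

print(count_letters("Chemie"))                    # 6, the default (German) answer
print(count_letters("Chemie", digraphs=("ch",)))  # 5, with a Czech-style tailoring
```

The string is identical in both calls; only the externally supplied rule changes the answer.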

Unicode provided a unified way to express international characters within the same text, but the context (i.e. locale) external to the text is still required to sensibly collate and manipulate it according to human sensibilities.

The default definition of grapheme clusters is simply a compromise for a global, locale-less understanding of collation/manipulation of Unicode characters.

> The Unicode definitions of grapheme clusters are defaults: not meant to exclude the use of more sophisticated definitions of tailored grapheme clusters where appropriate. Such definitions may more precisely match the user expectations within individual languages for given processes. For example, “ch” may be considered a grapheme cluster in Slovak, for processes such as collation. The default definitions are, however, designed to provide a much more accurate match to overall user expectations for what the user perceives of as characters than is provided by individual Unicode code points.

> Note: The default Unicode grapheme clusters were previously referred to as "locale-independent graphemes." The term cluster is used to emphasize that the term grapheme is used differently in linguistics. For simplicity and to align terminology with Unicode Technical Standard #10, “Unicode Collation Algorithm” [UTS10], the terms default and tailored are preferred over locale-independent and locale-dependent, respectively.

[1] https://en.wikipedia.org/wiki/Ch_(digraph)

[2] http://www.unicode.org/reports/tr29/

Agreed. Normal "text" operations tend to work quite well.

And the other forms are accessible, e.g. if you write a text-based parser (XML, JSON, etc.), you'll probably want String.unicodeScalars.

> If you were counting code points, a surrogate pair would be 1. If it's two, you're counting code units.

And to be explicit as to why that is: surrogate pairs are a feature of the UTF-16 encoding, where two 16-bit code units ("code units" being the lexemes of the decoder) decode to a single Unicode codepoint.
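A quick sketch of the difference, using Python (whose `len()` on a `str` counts code points). The helper name `unit_counts` is just for illustration.

```python
def unit_counts(text):
    """Return how many units the same text occupies at each level."""
    return {
        "code_points": len(text),
        # Each UTF-16 code unit is 2 bytes, each UTF-32 code unit is 4 bytes.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf32_code_units": len(text.encode("utf-32-le")) // 4,
        "utf8_code_units": len(text.encode("utf-8")),
    }

# U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs a surrogate pair.
print(unit_counts("\U0001F600"))
# {'code_points': 1, 'utf16_code_units': 2, 'utf32_code_units': 1, 'utf8_code_units': 4}
```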

I feel like everything to do with Unicode is clearer if you never bring up how it's encoded; or, alternately, if you pretend for the sake of your tutorial that everybody uses UTF-32, so you can just talk about flinging single-code-unit codepoints around as machine-words, the same way ASCII flings single-code-unit codepoints around as bytes. This being basically what Unicode text-handling libraries are doing underneath anyway.

After all, from the perspective of the Unicode standard itself, all the stuff below the abstraction of "a codepoint" is implementation detail.

The standard has to let the abstraction leak in a few places, like surrogate pairs or BOMs, but these leaks aren't what the Unicode standard is supposed to be "about", and should really be thought of as features of the encodings that have found their way up a layer, rather than features of Unicode per se. Heck, even the categorization of codepoint-ranges into "planes" is just a pragma of UTF-16. Putting these pragma-features front-and-center in a discussion of "what Unicode is", is IMHO entirely backwards.

>I swear there should be some rule or law...

Now is your chance! Distill this comment down into something pithy and deathanatos' law could be a thing.

Naming things is the hardest problem ;)

Thanks for catching that. It's a fairly complex subject, and it's particularly hard to get extra eyeballs willing to check for typos.

- String length is typically measured in code units.
- Funny enough, with Unicode normalization, multiple diacritics can be reduced into a single code point.
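A small sketch of the normalization point, using Python's standard `unicodedata` module. The example character is my own choice: a base letter with two combining marks that NFC composes into one code point.

```python
import unicodedata

# e + COMBINING DOT BELOW + COMBINING CIRCUMFLEX ACCENT: three code points.
decomposed = "e\u0323\u0302"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 3 1
print(f"U+{ord(composed):04X}")        # U+1EC7
```

Not every combination has a precomposed form, so NFC does not always get down to a single code point.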

If you'd like to explore Unicode characters, you can use the Unicode Character finder, a web app I built some years ago: https://www.mclean.net.nz/ucf/

The app allows you to paste in a character to find out more about it, or to search the database of character descriptions to find what you're after.

You can link to a specific character to share with your friends and family: https://www.mclean.net.nz/ucf/?c=U+130BA

For exploration, additionally I'd recommend http://shapecatcher.com/ It allows you to draw the shape you are looking for, and with some form of ML, sorts by similarity. It has come in handy a few times for finding the characters I'm unable to describe.

Page doesn't render without JS enabled. Enabling JS causes questionable CSP requests.

Just a quick and scrappy "Ghost" blog running on a $5 Digital Ocean droplet with the usual analytics.

> data to be transmitted in a byte, word or double word oriented format (i.e. in 8, 16 or 32-bits per code unit)

I don't think I've heard "word" mean "16 bits" since the 1980's ... and apparently neither has Wikipedia: https://en.wikipedia.org/wiki/Word_(computer_architecture)#T...

In Windows world DWORD is 32 bits large: https://docs.microsoft.com/en-us/openspecs/windows_protocols...

The concept of the word size of an architecture is different than "word" which has long been used colloquially in computing to mean two bytes.

2 nibbles are a byte. 2 bytes are a word.

Edit: "Word" as two bytes may actually be a microcomputer-specific colloquialism.

Unicode is great, emoji are a (technically impressive) monstrosity.

Emoji fill an almost critical need in democratizing Unicode. They aren't doing anything that other languages encoded with Unicode don't already do (and haven't already done since the beginning of Unicode). This article itself points out several key existing relatives, such as how Arabic, one of the most common and important written languages in the world, or the very important CJK family of written languages, used ZWJ and ZWNJ well before emoji made it "cool" to other parts of the world, most especially the English-writing contingent that has long thought of Unicode as simply "ASCII plus a bunch of other stuff I might never use". Suddenly a lot of English documents have embedded emoji that deeply matter to the writers, and there are fewer excuses to treat Unicode as "ASCII+" and more cases where doing so is not only wrong (broken surrogate pairs, incorrect codepoint analysis for ZWJ, etc.), but very visibly wrong in a way that users care and complain about.

Excepting some of the weird combining mark tricks used for flags and some more straightforward modifiers for skin tone, they’re basically a bunch of inert codepoints that you can get away with just popping in a high-res PNG for. No complicated typesetting, not even any kerning. If you handle surrogate pairs correctly (which is also needed for some real, widely used languages) common emojis will work fine.

I don’t see what’s technically impressive about them, care to elaborate?

Unicode has two really great features.

* It names and defines things and sets a standard. This seems trivial but is incredibly useful.

* Unicode encodings, mainly UTF-8, are a good storage format for text (as a data structure for editing text, not so much if you want to be universal).

Unicode has one really horrible failing.

The 'user-perceived character' (Unicode terminology) is arguably the most important unit in text. Unicode approximates user-perceived characters using a set of general rules that define grapheme clusters. A grapheme cluster is a sequence of adjacent code points that should be treated as a unit by applications. Unfortunately the ruleset and definition are inadequate. Sometimes you need two grapheme clusters to define one unit.

If you get a UTF-8 encoded and normalized string from some unspecified time and era, and don't know what application wrote it, which version of the Unicode standard it used, or what the locale was, you may lose some information.

Unicode should have added an explicit encoding for user-perceived character boundaries (either a fixed grapheme cluster encoding or a completely different encoding). Let the writing software define it explicitly. It would have been future-proof (new software in the future can understand old strings) and past-proof (ancient software can understand and edit strings written in the future).

Unicode definitely has flaws but that doesn't mean we should throw the baby out with the bathwater and go back to "ASCII and other character sets." There's a reason we moved on from that world. However, I bet we will see another encoding coming up eventually (within 30 years) which solves the problems Unicode currently has and introduces a new set of problems as well. I saw this comment [0] about how that encoding should get started.

> Greek, for example, has a lot of special-casing in Unicode. Korean is devilishly hard to render correctly the way Unicode handles it. And once you get into the right-to-left scripts, scripts that sort-of-sometimes omit vowels, or Devanagari (the script used to write a bunch of widely-spoken languages in India), you start needing very different capabilities than what's involved in Western European writing. _The better approach probably would have been to start with those, and work back to the European scripts_

[0] https://www.reddit.com/r/programming/comments/b09c0j/when_zo...

Funnily enough, URLs still can't do actual Unicode.

Unicode URLs have serious security problems.

The canonical example is google.com vs gооgle.com.

That was solved years ago by IDN/Punycode (implemented by any browser worth their salt).
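For anyone who wants to see the lookalike and its Punycode form side by side, here is a rough sketch using Python's built-in `idna` codec (which implements the older IDNA 2003 rules). The `scripts_used` helper is a crude heuristic of my own that reads the first word of each character name; it is not how browsers decide what to display.

```python
import unicodedata

genuine = "google.com"
spoofed = "g\u043e\u043egle.com"  # both o's are CYRILLIC SMALL LETTER O

def scripts_used(hostname):
    """Crude script guess: first word of each letter's Unicode character name."""
    return {unicodedata.name(ch).split()[0] for ch in hostname if ch.isalpha()}

print(genuine == spoofed)      # False, although the two look identical in many fonts
print(scripts_used(genuine))
print(scripts_used(spoofed))
print(genuine.encode("idna"))  # pure ASCII passes through unchanged
print(spoofed.encode("idna"))  # the non-ASCII label gets an "xn--" Punycode form
```

Whether a user ever sees the `xn--` form depends on the browser's display policy, which is where the remaining arguments come from.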

Commented above, but to follow up from yesterday, here is the next post.

"Hacking GitHub with Unicode" https://news.ycombinator.com/item?id=21693550

I agree. I'll be releasing an article about this tomorrow. There are in fact many security ramifications that have not been solved in practice.

Years ago I posted supporting material [1] for the Hangul filler mentioned in the article, reproduced below:


U+3164 HANGUL FILLER is one of the stupidest choices made by character sets. Hangul is noted for its algorithmic construction and Hangul charsets should ideally follow that. Unfortunately, the predominant methods for multibyte encoding were ISO 2022 and EUC, and both required a rather small repertoire of 94 × 94 = 8,836 characters [0], far fewer than the required 19 × 21 × 28 = 11,172 modern syllables.

The initial KS X 1001 charset, therefore, only contained 2,350 frequent syllables (plus 4,888 Chinese characters with some duplicates, themselves becoming another Unicode headache). Notwithstanding the fact that the remaining syllables are not supported, this resulted in a significant complexity burden for every Hangul-supporting software, and there was confusion and contention between KS X 1001 and less interoperable "compositional" (johab) encodings before Unicode. The standardization committee later acknowledged the charset's shortcoming, but only by adding four-letter (thus eight-byte) ad-hoc combinations for all remaining syllables! The Hangul filler is a designator for such combinations, e.g. `<filler>ㄱㅏ<filler>` denotes `가` and `<filler>ㅂㅞㄺ` denotes `뷁` (not in KS X 1001 per se).

Hangul filler arrived so late on the scene that it had virtually no support from the software industry. Sadly, the filler was there and Unicode had to accept it; technically it can be used to designate a letter (even though Unicode does not support the combinations), so the filler itself should be considered a letter as well. What, the, hell.

[0] It is technically possible to use 94 × 94 × 94 = 830,584 characters with three-byte encoding, but as far as I know there is no known example of such a charset being designed (thus no real support either).
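The "algorithmic construction" and the 19 × 21 × 28 arithmetic above can be sketched in a few lines of Python. This follows the precomposed syllable arithmetic from the Unicode Standard (syllable block starting at U+AC00); the function names are mine, and the indices are zero-based positions in the standard initial/medial/final orders.

```python
import unicodedata

L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # initials, medials, finals (incl. "no final")
S_BASE = 0xAC00                         # first precomposed Hangul syllable

def compose_syllable(l, v, t=0):
    """Compute the precomposed syllable from initial, medial, and final indices."""
    return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)

def compose_via_nfc(l, v, t=0):
    """Cross-check: build conjoining jamo and let NFC normalization compose them."""
    jamo = chr(0x1100 + l) + chr(0x1161 + v) + (chr(0x11A7 + t) if t else "")
    return unicodedata.normalize("NFC", jamo)

print(L_COUNT * V_COUNT * T_COUNT)  # 11172
print(94 * 94)                      # 8836, too small to hold them all
print(compose_syllable(0, 0))       # 가
print(compose_syllable(7, 15, 9) == compose_via_nfc(7, 15, 9))
```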


I should also mention that early Mozilla (and thus Firefox) once supported ad-hoc combinations for KS X 1001, ran into interoperability problems and later dropped the support. Nowadays we treat KS X 1001 as an alias of Windows code page 949 for the sake of compatibility [2].

[1] https://github.com/Wisdom/Awesome-Unicode/issues/4

[2] https://encoding.spec.whatwg.org/#index-euc-kr

Unicode is an inspirational standard. We started with so many different character encodings and wound up pretty universally using Unicode. I wouldn't be surprised to see browsers start to drop support for other encodings - who even uses them at this point?

Are there any scenarios where you wouldn't use Unicode, other than an every-byte-matters embedded system?

Sorry for nitpicking but: Unicode is not an encoding, just (basically) a central registry for numbers, and you can have Unicode strings made of bytes (the UTF-8 encoding), in fact that's the most useful encoding for exchanging text data :)

There's a not-insignificant number of Japanese websites that can only correctly display using EUC-JP or ShiftJIS.

It seems very Latin/ASCII centric to push for disabling non-UTF-8 encodings, especially since the only reason UTF-8 works so well on ASCII websites is due to its backwards compatibility.

If it were the reverse, and UTF-8 were backwards compatible with EUC-JP/CJK, but not ASCII, I doubt you'd be pushing for eschewing other formats since it would break so many English websites.

There is no character in EUC-JP or Shift-JIS that is not in Unicode--the explicit goal of Unicode in its original formulation was to be able to losslessly round-trip any other charset through Unicode, and the initial version of Unicode incorporated the source kanji lists for the EUC-JP/Shift-JIS charsets in their entirety.
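A minimal sketch of that round-trip property using Python's built-in codecs. The sample string and the `transcode` helper are my own; the bytes stand in for a page served with charset=EUC-JP.

```python
def transcode(data, source_encoding, target_encoding="utf-8"):
    """Decode legacy bytes to Unicode text, then re-encode in the target encoding."""
    return data.decode(source_encoding).encode(target_encoding)

# Stand-in for bytes received from a legacy page.
legacy_bytes = "日本語のページ".encode("euc_jp")

utf8_bytes = transcode(legacy_bytes, "euc_jp")
round_tripped = transcode(utf8_bytes, "utf-8", "euc_jp")

print(round_tripped == legacy_bytes)  # True
```

This only shows that the characters survive the trip through Unicode; it says nothing about who would actually re-encode an abandoned site.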

That's true, but you misunderstood what I meant.

The parent comment seemed to be implying that we should drop support for non-utf8 charsets.

To me, that rings like saying a website with 'charset=EUC-JP' (such as http://www.os2.jp/) should be broken, as in browsers should error out or display a large quantity of black boxes due to it using a non-utf-8 encoding.

I'm claiming the only reason the author thinks that's really viable is because in our western-centric world, we see mostly ascii and utf8. Things that, if you flip to only utf-8, both still look fine.

CJK websites, on the other hand, that are using the equivalent of ASCII will have to be manually upgraded to display correctly if browsers drop their support.

Sure, all their characters can be represented in utf-8, but there are large swathes of websites that will never be updated to a new charset, and it's only a western-centric view that can so blithely suggest breaking them all.

Windows-1252/ISO-8859-1 (the two charsets are so commonly conflated that it's often best to treat them as one) was the dominant [non-ASCII] charset of the web until around 2007 or 2008, and their prevalence more recently is only about 5%.

A collection of Usenet messages gathered in 2014 (see http://quetzalcoatal.blogspot.com/2014/03/understanding-emai... for full details) showed that out of 1,000,000 messages, about 530,000 were actually ASCII; 270,000 were ISO-8859-1 or Windows-1252; and only 75,000 were UTF-8. More modern numbers would probably show higher UTF-8 counts, although Usenet is notoriously conservative in terms of technology.

What I'm trying to elucidate here is that the rise of UTF-8 isn't because most text is ASCII, but because there's been a rather more concerted effort to default content generation to UTF-8 and treat other charsets only as legacy inputs. Well, with the exception of the Japanese, who tend to be strongly averse to UTF-8. (I've been told that Japanese email users would rather have their text get silently mangled than silently converted to UTF-8 because you're quoting an email with a smart quote [not present in any of the 3 Japanese charsets], whereas every other locale was happy changing the default charset for writing to UTF-8).

The obvious problem for Shift-JIS roundtripping is 0x5C (is it ¥ or \), not the kanji.

What's the code point for uppercase superscript Z?

There isn't one. Unicode considers superscripting a matter of presentation, which Unicode doesn't cover, except when it does.

More particularly: Presentation variant is not a justification for inclusion in Unicode BUT prior encoding in another character set is.

Unicode sets a high priority on roundtripping. The idea is that if you take some data in any one character set X and convert it to Unicode, you should preserve all the meaning by doing this, such that you could losslessly convert it back to encoding X.

It's like the wordprocessor problem where users say they only want 10% of the features of a popular wordprocessor but it turns out each user wants a different 10% and so the only way to deliver what they all want is to deliver 100% of the features. Likewise, Unicode has all the weird features of every legacy character set which was embraced BUT it doesn't arbitrarily add new weird features, although you could argue that some of the work done for Unicode has that effect e.g. the way flags work or the Fitzpatrick modifiers.

If Unicode had insisted upon never encoding anything that might be a presentation feature, it'd be a long forgotten academic project that never went anywhere and we'd all be using some (probably Microsoft designed) 16-bit ASCII superset today.

Is there a relatively easy way to find the character set that was included for uppercase superscript W? (ᵂ)

In the case of U+1D42 Modifier Letter Capital W I was wrong about the cause. It was in fact specifically added on the rationale that for this purpose (phonetics) the presentation was semantic in nature, and so the plain text (thus Unicode) needed to preserve these symbols, which could otherwise be handled by a presentation layer.

U+1D42 Modifier Letter Capital W was added in Unicode 4.0 as part of the Phonetic Extensions and Wikipedia provides a long list of Unicode committee paperwork regarding this: https://en.wikipedia.org/wiki/Phonetic_Extensions

You can see that initially it would have been numbered differently, and then over the course of several drafts the proposal evolved until it was assigned U+1D42.
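If you want to check which capital letters have a modifier-letter form, here is a rough sketch with Python's `unicodedata`. It assumes the characters follow the "MODIFIER LETTER CAPITAL X" naming pattern, and the result depends on the Unicode version bundled with your Python, so I'm not listing an expected output.

```python
import string
import unicodedata

def modifier_capitals():
    """Map each A-Z letter to its MODIFIER LETTER CAPITAL form, where one exists."""
    found = {}
    for letter in string.ascii_uppercase:
        try:
            found[letter] = unicodedata.lookup(f"MODIFIER LETTER CAPITAL {letter}")
        except KeyError:
            continue  # no character with that name in this Unicode version
    return found

found = modifier_capitals()
missing = [c for c in string.ascii_uppercase if c not in found]
print(unicodedata.unidata_version)
print(found)
print("missing:", missing)
```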

Does that mean there is hope to complete the [a-Z] sub/super set?

The emoji modifiers are pretty cool.
- Skin color modifiers
- Character combiners:
  - man [ZWJ] woman [ZWJ] boy [ZWJ] girl === family of 4
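Here is the family sequence spelled out in Python, using the order given above. Whether it renders as one glyph depends on font and platform support for that particular sequence.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER
MAN, WOMAN, BOY, GIRL = "\U0001F468", "\U0001F469", "\U0001F466", "\U0001F467"

family = ZWJ.join([MAN, WOMAN, BOY, GIRL])

print(len(family))  # 7 code points: 4 people + 3 joiners
for ch in family:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Anything that truncates or splits on code points (or code units) can cut the sequence at a joiner and leave a smaller family behind.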

"How to shrink a family using MySQL" or: "Unicode Emojis, Code Points, and Grepheme Clusters"


Just posted another Unicode article to HN. "Hacking GitHub with Unicode"


Unicode reverse character:

'hello \u{202e} world'; 'hello dlrow' // Visual equivalent
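A sketch of how you might flag such characters before displaying untrusted text, in Python. The set of bidirectional classes treated as "controls" is my own selection (the explicit embedding, override, and isolate classes), and `find_bidi_controls` is a hypothetical helper name.

```python
import unicodedata

# Explicit directional embedding, override, and isolate classes.
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def find_bidi_controls(text):
    """Return (index, character name) for every explicit bidi control in text."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES
    ]

sample = "hello \u202e world"
print(find_bidi_controls(sample))  # [(6, 'RIGHT-TO-LEFT OVERRIDE')]
```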
