
This is a solution to a problem I didn't know existed.



The nature of Unicode is that there's always a problem you didn't (but should) know existed.

And because of this global confusion, everyone important ends up implementing something that somehow does something moronic - so then everyone else has yet another problem they didn't know existed and they all fall into a self-harming spiral of depravity.


Some time ago, I made some ASCII art to illustrate the various steps where things can go wrong:

    [user-perceived characters]
                ^
                |
                v
       [grapheme clusters] <-> [characters]
                ^                   ^
                |                   |
                v                   v
            [glyphs]           [codepoints] <-> [code units] <-> [bytes]


So basically, it goes wrong when someone assumes that any two of the above are "the same thing". The assumption is often implicit.
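To make that concrete, here is a small Python sketch (my own example, not from the diagram's author) showing how one user-perceived character gives a different count at each layer of the diagram:

```python
# One user-perceived character: "é" built from "e" + U+0301 COMBINING ACUTE ACCENT.
s = "e\u0301"

# Grapheme clusters: 1 (what the user sees). Python's stdlib has no
# grapheme segmentation, so that count is by inspection of UAX #29 rules.

# Codepoints: 2 (Python strings are sequences of codepoints).
print(len(s))                           # 2

# UTF-16 code units: 2 (both codepoints fit in one unit each).
print(len(s.encode("utf-16-le")) // 2)  # 2

# Bytes in UTF-8: 3 ("e" is 1 byte, U+0301 is 2).
print(len(s.encode("utf-8")))           # 3
```

Any code that treats one of those numbers as if it were another is where the bugs come from.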


That's certainly one important source of errors. An obvious example would be treating UTF-32 as if its fixed-width codepoints were fixed-width characters: you might end up cutting grapheme clusters in half, and thinking of it that way makes it easy to forget about normalization.
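Both failure modes are easy to show in Python (my sketch, using a combining accent as the example): slicing by codepoint index, which is exactly what UTF-32's fixed width tempts you into, can silently strip a combining mark, and normalization changes the codepoint count without changing the user-perceived text:

```python
import unicodedata

s = "e\u0301"   # "é" in decomposed (NFD) form: base letter + combining accent

# Slicing at a codepoint boundary cuts the grapheme cluster in half:
print(s[:1])    # "e" -- the accent is silently dropped

# Normalization changes the codepoint count but not the perceived text:
nfc = unicodedata.normalize("NFC", s)
print(len(s), len(nfc))   # 2 1
print(nfc == "\u00e9")    # True -- the precomposed "é"
```

So even a "fixed-width" view of the text gives you neither stable character boundaries nor stable equality.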

Then, it's possible to make mistakes when converting between representations, e.g. getting endianness wrong.
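The endianness trap is particularly nasty because nothing fails loudly. A quick Python illustration (my example): UTF-16 bytes decoded with the wrong byte order still produce a perfectly "valid" string, just the wrong one:

```python
data = "A".encode("utf-16-be")   # b'\x00\x41' -- U+0041 in big-endian UTF-16

# Decoding with the wrong endianness doesn't raise; it just yields a
# different, equally valid codepoint: 0x4100 is a CJK ideograph.
wrong = data.decode("utf-16-le")
print(wrong == "\u4100")                 # True
print(data.decode("utf-16-be"))          # "A"
```

Without a BOM or out-of-band knowledge of the byte order, there is no error to catch, which is why this class of bug tends to surface far from where it was introduced.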

Some issues are more subtle: in principle, the decision about what should be considered a single character may depend on the language, never mind the debate about Han unification - but as far as I'm concerned, that's a WONTFIX.



