
This is a solution to a problem I didn't know existed.



The nature of Unicode is that there's always a problem you didn't (but should) know existed.

And because of this global confusion, everyone important ends up implementing something that somehow does something moronic - so then everyone else has yet another problem they didn't know existed and they all fall into a self-harming spiral of depravity.


Some time ago, I made some ASCII art to illustrate the various steps where things can go wrong:

    [user-perceived characters]
                ^
                |
                v
       [grapheme clusters] <-> [characters]
                ^                   ^
                |                   |
                v                   v
            [glyphs]           [codepoints] <-> [code units] <-> [bytes]


So basically, it goes wrong when someone assumes that any two of the above are "the same thing". The assumption is often implicit.
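To make that concrete, here is a small Python sketch (my own example, not from the diagram's author) showing how one user-perceived character gives a different count at each layer of the diagram:

```python
# One user-perceived character: "é" built from "e" + U+0301 COMBINING ACUTE ACCENT.
s = "e\u0301"

# Grapheme clusters: 1 (what the user sees). Python's stdlib has no
# grapheme segmentation, so that count is by inspection of UAX #29 rules.

# Codepoints: 2 (Python strings are sequences of codepoints).
print(len(s))                           # 2

# UTF-16 code units: 2 (both codepoints fit in one unit each).
print(len(s.encode("utf-16-le")) // 2)  # 2

# Bytes in UTF-8: 3 ("e" is 1 byte, U+0301 is 2).
print(len(s.encode("utf-8")))           # 3
```

Any code that treats one of those numbers as if it were another is where the bugs come from.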


That's certainly one important source of errors. An obvious example would be treating UTF-32 as if its fixed-width codepoints were fixed-width characters: you might end up cutting grapheme clusters in half, and thinking of it that way makes it easy to forget about normalization.
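Both failure modes are easy to show in Python (my sketch, using a combining accent as the example): slicing by codepoint index, which is exactly what UTF-32's fixed width tempts you into, can silently strip a combining mark, and normalization changes the codepoint count without changing the user-perceived text:

```python
import unicodedata

s = "e\u0301"   # "é" in decomposed (NFD) form: base letter + combining accent

# Slicing at a codepoint boundary cuts the grapheme cluster in half:
print(s[:1])    # "e" -- the accent is silently dropped

# Normalization changes the codepoint count but not the perceived text:
nfc = unicodedata.normalize("NFC", s)
print(len(s), len(nfc))   # 2 1
print(nfc == "\u00e9")    # True -- the precomposed "é"
```

So even a "fixed-width" view of the text gives you neither stable character boundaries nor stable equality.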

Then, it's possible to make mistakes when converting between representations, e.g. getting endianness wrong.
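The endianness trap is particularly nasty because nothing fails loudly. A quick Python illustration (my example): UTF-16 bytes decoded with the wrong byte order still produce a perfectly "valid" string, just the wrong one:

```python
data = "A".encode("utf-16-be")   # b'\x00\x41' -- U+0041 in big-endian UTF-16

# Decoding with the wrong endianness doesn't raise; it just yields a
# different, equally valid codepoint: 0x4100 is a CJK ideograph.
wrong = data.decode("utf-16-le")
print(wrong == "\u4100")                 # True
print(data.decode("utf-16-be"))          # "A"
```

Without a BOM or out-of-band knowledge of the byte order, there is no error to catch, which is why this class of bug tends to surface far from where it was introduced.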

Some issues are more subtle: in principle, the decision about what should be considered a single character may depend on the language, never mind the debate about Han unification - but as far as I'm concerned, that's a WONTFIX.



