
The Absolute Minimum Every Dev Must Know About Unicode and Character Sets (2003) - krausejj
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
======
nabla9
This article should be retired because it's harmful.

This and other "absolute minimums" like it seem to stop before teaching the
absolute minimum, probably because the authors don't know the absolute minimum.
They just teach the encodings and stop there. That's harmful.

Consider two incorrect statements Joel makes:

> In Unicode, a letter maps to something called a code point

and

> Every platonic letter in every alphabet is assigned a magic number by the
Unicode consortium

These are incorrect statements and Joel does not (or did not) know enough
about Unicode to know that he is wrong.

Above the code points are "grapheme clusters", "extended grapheme clusters" or
“user-perceived characters” (“a basic unit of a writing system for a language”)
that match the "platonic letter" Joel talks about. wchar_t can't represent them.

Extended grapheme clusters can have an arbitrary number of code points in them.
You need to use unicode-segmentation to cut a Unicode string into smaller
strings that represent "platonic characters" if you want to do it right.
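
To make the gap between code points and user-perceived characters concrete, here is a rough Python sketch. The helper name `approx_clusters` is made up for illustration, and it only glues combining marks onto the preceding base character. That is a simplification, not the full extended grapheme cluster rules (it ignores ZWJ emoji sequences, Hangul jamo, regional-indicator flags and more), so it shows the problem rather than solving it.

```python
import unicodedata

def approx_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Hypothetical helper: a crude approximation of grapheme clusters.
    It does NOT implement the full Unicode segmentation rules.
    """
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for combining marks
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

# "e" + COMBINING ACUTE ACCENT, then "a" + COMBINING DIAERESIS + COMBINING DOT BELOW
s = "e\u0301a\u0308\u0323"

print(len(s))                   # 5 code points
print(len(approx_clusters(s)))  # 2 user-perceived characters
```

A reader sees two letters, while `len()` reports five, and nothing limits how many marks can pile onto one base character.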

~~~
krausejj
Interesting. For me, the underlying mechanics won't likely be remembered, but
the big takeaway was a better understanding of what character sets are, why
they exist from a historical perspective, and why I should care... That higher
level context makes character encoding issues seem less scary and more worth
trying to understand.

It's too bad he may have gotten some technical details wrong, but I found it
refreshing to get a somewhat entertaining "story" along with some of the
details.

~~~
nabla9
Basically what you should know about Unicode is that code points don't
conceptually match anything of interest for the users except by accident (for
example, when you restrict yourself to a limited number of languages and
characters).

Fully supporting Unicode is a PITA and most apps and systems choose to cheat.
It's completely OK, but you should know that you are doing it and either
reject everything you can't handle or treat it as an immutable string.
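
A minimal sketch of the "reject everything you can't handle" option, in Python. The function name `accept_simple_text` and the repertoire it allows (non-control code points up to U+00FF after NFC normalization) are assumptions picked purely for illustration; a real app would choose its own limits.

```python
import unicodedata

def accept_simple_text(text, max_code_point=0xFF):
    """Return NFC-normalized text, or raise ValueError if it leaves the repertoire.

    Hypothetical policy: only non-control code points up to max_code_point
    are allowed. The limit is an illustrative assumption, not a recommendation.
    """
    normalized = unicodedata.normalize("NFC", text)
    for ch in normalized:
        # "Cc" is the general category for control characters
        if ord(ch) > max_code_point or unicodedata.category(ch) == "Cc":
            raise ValueError(f"unsupported code point U+{ord(ch):04X}")
    return normalized

# "cafe" + COMBINING ACUTE ACCENT composes to a single precomposed letter
print(accept_simple_text("cafe\u0301") == "caf\u00e9")  # True

try:
    accept_simple_text("\U0001F600")  # an emoji, outside the repertoire
except ValueError as err:
    print(err)
```

The other option needs no code at all: store the text and echo it back unchanged, and never slice, truncate, reverse or count "characters" in it.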

