
What Every Software Developer Must Know About Unicode and Character Sets - tmlee
http://joelonsoftware.com/articles/Unicode.html
======
nabla9
>Every platonic letter in every alphabet is assigned a magic number by the
Unicode consortium which is written like this: U+0639. This magic number is
called a code point.

This is wrong.

A code point does not correspond to a platonic letter (an abstract character,
in Unicode terms), nor to a grapheme. It does for many Western languages and
alphabets, but not in general. A code point is just a unit of information used
in __encoding__.

The mapping from code points to abstract characters is not total, injective, or
surjective. Some abstract characters need more than one code point to express
them, and a grapheme can likewise be a sequence of one or more code points. You
can't cut at code point boundaries to split text into abstract characters or
graphemes.
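
A small Python sketch of the point above, using only the standard `unicodedata` module. The variable names and the helper `code_points` are made up for illustration. It shows the same letter written as one code point and as two, and what happens when you cut between code points:

```python
import unicodedata

# Two ways to write the same letter "é".
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two code points

def code_points(text):
    """Return the code points of text in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Python's len() and == work on code points, so the two spellings differ.
lengths = (len(precomposed), len(decomposed))
equal_as_code_points = precomposed == decomposed

# Normalizing to NFC composes the pair into the single precomposed code point.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# Slicing at a code point boundary cuts the letter in half:
# the accent is left behind and only the bare "e" remains.
first_code_point = decomposed[:1]
```

Here `lengths` is `(1, 2)` and `equal_as_code_points` is `False`, even though a reader sees the same letter both times.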

What every software developer must know beyond code points and code units:

User-perceived character : what the user thinks of as a character.

Grapheme cluster : a sequence of coded characters that ‘should be kept
together’. Grapheme clusters try to represent user-perceived characters in a
language-independent way. Selecting a single character and moving the cursor
happen at this level.
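
To make the idea concrete, here is a deliberately simplified sketch in Python. It is not the Unicode segmentation algorithm (UAX #29): it only attaches combining marks to the preceding character, and it gets emoji ZWJ sequences, Hangul jamo, regional indicator pairs and similar cases wrong. The function name `approximate_clusters` is hypothetical.

```python
import unicodedata

def approximate_clusters(text):
    """Group each base character with the combining marks that follow it.

    Rough approximation only: a real implementation would follow the
    grapheme cluster boundary rules in UAX #29.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a nonzero combining class for
        # combining marks such as U+0301 COMBINING ACUTE ACCENT.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # keep the mark with its base character
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

# "e" + combining acute, then "a": three code points, two clusters.
sample = "e\u0301a"
sample_clusters = approximate_clusters(sample)
```

A cursor that moves by entries of `sample_clusters` steps over the accented letter in one move, whereas moving by code point would stop between the "e" and its accent.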

