
Around the World With Unicode - totallymike
https://norasandler.com/2017/11/02/Around-the-with-Unicode.html
======
wodenokoto
> This [han unification] significantly reduces the number of code points you
> need, and simplifies normalizing and collating CJK text, at the expense of
> undermining the entire point of Unicode.

This is the best description of the pros and cons of Han unification I've ever
seen.

Personally, I really enjoy the ease with which I can look up characters from a
Japanese text in a Chinese dictionary and see how they are understood subtly
differently across the two languages.

~~~
jasode
_> , at the expense of undermining the entire point of Unicode._

I'm not a Unicode history expert, but did the author (Nora Sandler) accurately
represent the _philosophical_ intentions of Unicode?

Specifically, she quotes: _"Unicode provides a unique number for every
character."_

Is "every character" underspecified? What's the philosophy of Unicode? Is it
to:

a) map a codepoint for every _semantic_ character?

b) map a codepoint for every _visual_ character?

Here's a non-CJK example of the single tick (′) character U+2032 [1]:

That has 3 different semantics:

    1) foot mark e.g. 3′ to a yard
    2) prime mark e.g. f′(x)
    3) coordinate minutes e.g. 48°51′24″N

In each case, the character renders identically to screen and printer, so
Unicode used the same codepoint for 3 different meanings. Even if Unicode had
created 3 separate codepoints for the 3 separate meanings of (′), _semantic
fidelity_ would be lost anyway, since authors would often just pick whichever
glyph _visually_ "looked like" the tick mark they wanted. Or they'd probably
use U+0027, the traditional ASCII apostrophe ('), since it's the easiest to
type on a keyboard.

Put another way, was Unicode intended to create codepoints at the level of
_characters-as-visual-forms_ or at the level of _language character sets_? It
looks like Unicode chose the _visual form_, which is why 3 uses of tick marks
and many Han characters collapse to single codepoints. If the "entire point of
Unicode" was a 1-to-many mapping of codepoints onto _language sets_, then
there would be a contiguous array of Han codepoints for Chinese and another
set of Han codepoints for Korean -- with many duplicates. Is there evidence
that that was the "correct" idea for Unicode, and that politics or technical
debates somehow de-duplicated the CJK characters?

Granted, there are some duplicate characters, such as math symbols and Greek
letters, so maybe Unicode philosophy has no single consistent idea of what a
codepoint maps to.

[1]
[http://www.fileformat.info/info/unicode/char/2032/index.htm](http://www.fileformat.info/info/unicode/char/2032/index.htm)

~~~
ranit
I am also not a Unicode history expert, nor a Unicode expert in any way.

> It looks like Unicode chose the visual look

This might be true for Han characters, but it is not for others. For example,
capital Latin A (U+0041), Cyrillic A (U+0410), and Greek Alpha (U+0391) have
the same origin and are visually identical, yet each has its own codepoint.
The same goes for many other symbols.
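You can see this directly from Python's `unicodedata` tables (a quick sketch,
not from the thread): the three visually identical capital A's each carry
their own codepoint and name.

```python
import unicodedata

for ch in ("A", "\u0410", "\u0391"):  # Latin, Cyrillic, Greek
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# → U+0041  LATIN CAPITAL LETTER A
# → U+0410  CYRILLIC CAPITAL LETTER A
# → U+0391  GREEK CAPITAL LETTER ALPHA
```
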

~~~
msla
For Greek, Cyrillic, and Latin specifically, Unicode has to be round-trip
compatible with earlier encoding schemes that distinguish between them.

(There might have been other reasons as well, but the round-trip desideratum
forces it in any case.)

There are a lot of characters which were explicitly encoded for such
compatibility:

[https://en.wikipedia.org/wiki/Unicode_compatibility_characte...](https://en.wikipedia.org/wiki/Unicode_compatibility_characters)

This document discusses equivalence among Unicode characters:

[http://unicode.org/reports/tr15/tr15-18.html](http://unicode.org/reports/tr15/tr15-18.html)
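For a concrete taste of the equivalences that report describes, here is a
small `unicodedata` sketch: NFC applies only canonical equivalence, while
NFKC also folds compatibility characters into their ordinary equivalents.

```python
import unicodedata

# LATIN SMALL LIGATURE FI is a *compatibility* character:
ligature = "\uFB01"
print(unicodedata.normalize("NFC", ligature) == ligature)  # True: NFC keeps it
print(unicodedata.normalize("NFKC", ligature))             # fi (two ASCII letters)

# ANGSTROM SIGN is *canonically* equivalent to Å (U+00C5), so even NFC folds it:
angstrom = "\u212B"
print(unicodedata.normalize("NFC", angstrom) == "\u00C5")  # True
```
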

