
I’m not going to try to minimize the problem here. Han unification was pushed through by western interests, by my understanding.

However, most unified Han characters are identical or nearly identical in Chinese and Japanese. Characters with “significant” visual differences got encoded as different Unicode characters. The same applies to simplified and traditional Chinese characters.

So for a given “Han character”, there might be between one and three different Unicode characters, and there might be between one and three different ways of writing it.

Here’s an illustration: https://japanese.stackexchange.com/questions/64590/why-are-j...

So the issue does come up when mixing Chinese and Japanese text. It’s not one that has a big impact on legibility, but you would definitely be concerned if you were writing a Japanese textbook for Chinese students, or vice versa.

Beyond that, it is usually fairly trivial to distinguish between Japanese and Chinese text, so you could just lean on simple heuristics to get the work done (Japanese text, with the exception of fairly ancient text or very short fragments, contains kana, but Chinese does not).
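That heuristic is easy to sketch. A toy version (hypothetical `looks_japanese` helper, an illustration rather than a production language detector) just checks for any code point in the kana blocks:

```python
def looks_japanese(text: str) -> bool:
    """Crude heuristic: Japanese running text almost always contains kana;
    Chinese text does not."""
    return any(
        "\u3040" <= ch <= "\u309f"      # Hiragana block
        or "\u30a0" <= ch <= "\u30ff"   # Katakana block
        for ch in text
    )

print(looks_japanese("これは日本語です"))  # kana present
print(looks_japanese("这是中文文本"))      # ideographs only
```

As the parent notes, this fails on kana-free fragments (very short strings, classical text), so it is only a first-pass filter.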



> Han unification was pushed through by western interests, by my understanding.

Note that as far as I'm aware, the interest in question was the initial 16-bit limit of the character set and later on the non-proliferation of competing standards.

Also note that while Han unification is the most prominent example, there are technically similar cases, which just aren't as charged culturally. For one, Unicode doesn't encode German Fraktur: While some characters are available due to their use in mathematics, it's lacking the corresponding variants of ä, ö, ü, ß, ſ as well as specific ligatures. So if you want to intermix modern with old German writing, you'll also have to go out-of-band.
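The mathematics-only carve-out is easy to demonstrate: the Mathematical Alphanumeric Symbols block has the bare Fraktur alphabet, but nothing beyond plain Latin letters, so (for example) a Fraktur ä has no character name to look up. A quick sketch using Python's `unicodedata`:

```python
import unicodedata

# The mathematical Fraktur letters exist as distinct code points...
print(unicodedata.lookup("MATHEMATICAL FRAKTUR CAPITAL A"))  # 𝔄, U+1D504
print(unicodedata.lookup("MATHEMATICAL FRAKTUR SMALL A"))    # 𝔞, U+1D51E

# ...but only the plain A-Z/a-z set; umlauts, ß, and Fraktur-specific
# ligatures are absent, so a lookup like this one fails:
try:
    unicodedata.lookup("MATHEMATICAL FRAKTUR SMALL A WITH DIAERESIS")
except KeyError as exc:
    print("no such character:", exc)
```

So mixed modern/Fraktur German text indeed has to carry the distinction out-of-band, e.g. via fonts or markup.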


Let's not excuse the utter irresponsibility of deciding on 16 bits: the initial 16-bit limit is instantly invalidated by looking at any comprehensive Chinese character dictionary, no reasonable choice of which will give you an estimate under about 30k characters, even excluding graphical variants.

Even assuming that we discount 80k+ estimates by collapsing graphical variants, that's over half of your code space right off the bat. For this to seem like a good idea, you'd need to assume that Chinese is a uniquely bad one-off case. Not a good bet to stake your character set on.
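The back-of-the-envelope arithmetic, assuming "code space" means the BMP minus the surrogate range and the private use area (the counts below are my assumptions, not figures from the comment above):

```python
# How much of a 16-bit character set would ~30k Han characters consume?
total = 2 ** 16            # 65,536 code points in a 16-bit space
surrogates = 0x0800        # 2,048 code points (U+D800..U+DFFF), not characters
private_use = 6400         # U+E000..U+F8FF, reserved for private use
usable = total - surrogates - private_use

han = 30_000               # conservative dictionary estimate from above
print(f"{han / usable:.0%} of the usable code points")
```

Under those assumptions the 30k conservative estimate already eats a bit over half of what's actually assignable.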


That's not the same thing. Fraktur is just a style of font; antiqua and Fraktur letters are semantically the same.


It's actually exactly the same thing. Han unification didn't smash together unrelated squiggles that just happened to look similar; they were semantically the same. Scholars of the Han writing system spent a lot of time deciding what is or is not the same squiggle just drawn differently, like Fraktur, and today people are annoyed because, as you'd expect, some of them believed that the "style of fonts" was integral to the meaning anyway.


Chinese characters represent Chinese words or parts thereof; Japanese ones represent Japanese words or parts thereof. That is a semantic difference.


So what you're saying is that because 'chat' in English and 'chat' in French are quite different words with very different meanings, you believe there should be a separate letter 'c' for English and French to enable us to tell those words apart?


The Latin alphabet is not logographic.


It is not logographic, but characters still have meaning: associated phonemes. Although this is less clear in English, it is emphasized in other languages.

And this mapping is different between languages. So 'c' in English has a different meaning than 'c' in Czech.


Not really. Morphemes are considered (defined, even) as the smallest units that have meaning by themselves.


There are differences as well as similarities. I'm no expert, but shouldn't, say, U+4ECA still translate to 'now' no matter if you draw a particular line horizontally or diagonally? There are also some mandatory[1] ligatures in Fraktur unavailable in Unicode. What if I wanted to preserve that distinction in historic writing?

edit:

[1] I think the mandatory ones are actually there (just not in Fraktur), it's some optional ones like ſch that are missing.


> There are differences as well as similarities. I'm no expert, but shouldn't, say, U+4ECA still translate to 'now' no matter if you draw a particular line horizontally or diagonally?

No, since "now" is an English word, not a Japanese or Chinese one.

> There are also some mandatory[1] ligatures in Fraktur unavailable in Unicode.

Unicode doesn't encode ligatures except for backwards compatibility.


> Unicode doesn't encode ligatures except for backwards compatibility.

And it doesn't encode separate variants for unified Han characters. As in, that's not an argument, just a description of the status quo.


Of course it is. Ligatures aren't characters, they're glyphs that represent multiple characters. Unicode does not encode glyphs, that's simply not its job. No more than encoding what font to use or when to render text in italic.


Which is the whole point of Han unification, the argument being that whether or not a particular line in U+4ECA is horizontal or diagonal is just like that. What's the difference?


To the contrary: What any line in any glyph looks like is of no concern because Unicode doesn't deal with glyphs. It deals with abstract characters that don't have appearances to begin with.

"Α" and "A" look exactly the same (at least in most fonts). But each has its own code point because the GREEK CAPITAL LETTER ALPHA simply isn't the LATIN CAPITAL LETTER A or any other Latin letter.
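That pair is easy to inspect from Python; the glyphs may render identically, but the characters are distinct:

```python
import unicodedata

# Greek capital alpha and Latin capital A: same appearance, different characters
for ch in ("Α", "A"):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

This is also why such pairs show up in homoglyph phishing: equality is decided on code points, never on what the font happens to draw.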


As I understand it, Han unification happened because at the time all there was was UCS-2 (no UTF-16, no UTF-8), so codespace was tight and precious, and that motivated codespace-preserving optimizations, of which Han unification is the notable one.

To avoid that, they would have needed to invent UTF-8 many years earlier. Perhaps if the people designing Unicode were more diverse they might have felt the necessity to invent UTF-8 to the point of actually doing it, but then perhaps they might have done it poorly. At any rate, I don't know enough details to really know whether "Han unification was pushed through by western interests" is remotely fair.


UTF-8 was sketched on a placemat as a response to a different idea. It seems likely that had it not arisen in a moment of inspiration by a genius, we would be stuck with another inferior design by committee.

https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt


I agree. But then, necessity is the mother of invention. GP seems to argue that Han unification happened because the Unicode Consortium was not diverse enough. Maybe, and maybe if it had been diverse enough the need would have arisen sooner. But again, the thing they came up with could have been garbage, who knows!

What I do know is that UTF-8 is genius. The Han unification problems seem mostly minor -- I suspect code can detect language and do the right thing, for example, and again, we could revive language tags if need be.


UTF-8 is a magnificent hack.
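The hack fits in a few lines: a prefix code where ASCII passes through unchanged and every continuation byte is self-identifying. A minimal sketch (hypothetical `utf8_encode` helper, ignoring surrogate and range validation that a real encoder needs):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8 (toy version, no input validation)."""
    if cp < 0x80:                       # ASCII: one byte, unchanged
        return bytes([cp])
    if cp < 0x800:                      # two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                    # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12,
                      0x80 | (cp >> 6) & 0x3F,
                      0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,      # four bytes for astral code points
                  0x80 | (cp >> 12) & 0x3F,
                  0x80 | (cp >> 6) & 0x3F,
                  0x80 | cp & 0x3F])

print(utf8_encode(0x41))     # ASCII stays a single byte
print(utf8_encode(0x4ECA))   # U+4ECA ("now") takes three bytes
```

Since no continuation byte can be confused with a lead byte, a decoder can resynchronize mid-stream, which is a big part of why it displaced the fixed-width designs.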


Here's some text I could write about some Japanese characters, that, thanks to Han Unification, may be confusing:

In 1946, the Japanese government created a (non-exhaustive) list of common characters, some of which were simplified from their more traditional form. One of them is 臭. Its older form was 臭. Another character that shares the same root, 嗅, was not part of that list of common characters. It was added later, in 2010, and was never simplified, such that the stroke that was removed in 臭 is still there, making it just slightly different.

If your fonts are biased towards Chinese, 臭 and 臭 will be identical, and you won't know what I'm talking about. The former is 自 above 大, the latter is 自 above 犬.

You could think the difference is trivial, but 大 is big and 犬 is dog. Not that it alters the meaning of 臭, 臭, or 嗅, but when talking about how 嗅 is not 口 alongside 臭 anymore, it does make a difference.
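For what it's worth, Unicode does carry the older form as a separate code point, just not a unified one. Assuming the pre-simplification glyph above corresponds to the compatibility ideograph U+FA5C (my reading of the example, one plausible encoding of it), the catch is that canonical normalization folds it straight back into the unified character:

```python
import unicodedata

new_form = "\u81ed"  # CJK UNIFIED IDEOGRAPH-81ED (自 over 大 in Japanese fonts)
old_form = "\ufa5c"  # CJK COMPATIBILITY IDEOGRAPH-FA5C (自 over 犬)

print(unicodedata.name(new_form))
print(unicodedata.name(old_form))

# Compatibility ideographs have singleton canonical decompositions,
# so any NFC/NFD pass erases the distinction again:
print(unicodedata.normalize("NFC", old_form) == new_form)
```

Which means the distinction survives only as long as nothing in the pipeline normalizes the text, exactly the kind of fragility the thread is complaining about.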


Yes, the real problem is when you start mixing all four (or five) of them together: Traditional Chinese, Simplified Chinese, Korean, and Japanese. Things become extremely problematic.

I think it is by luck that all four writing systems have significant usage within their own regions. Imagine if one of them were significantly smaller and over time were forced (by ease of use or whatever reason) to switch to a different style without knowing it.



