
I’m not going to try to minimize the problem here. Han unification was pushed through by western interests, by my understanding.

However, most unified Han characters are identical or nearly identical in Chinese and Japanese. Characters with “significant” visual differences got encoded as different Unicode characters. The same applies to simplified and traditional Chinese characters.

So for a given “Han character”, there might be between one and three different Unicode characters, and there might be between one and three different ways of writing it.

Here’s an illustration: https://japanese.stackexchange.com/questions/64590/why-are-j...

So the issue does come up when mixing Chinese and Japanese text. It’s not one that has a big impact on legibility, but you would definitely be concerned if you were writing a Japanese textbook for Chinese students, or vice versa.

Beyond that, it is usually fairly trivial to distinguish between Japanese and Chinese text, so you could just lean on simple heuristics to get the work done (Japanese text, with the exception of fairly ancient text or very short fragments, contains kana, but Chinese does not).
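That heuristic is easy to sketch. A toy version (hypothetical `looks_japanese` helper, an illustration rather than a production language detector) just checks for any code point in the kana blocks:

```python
def looks_japanese(text: str) -> bool:
    """Crude heuristic: Japanese running text almost always contains kana;
    Chinese text does not."""
    return any(
        "\u3040" <= ch <= "\u309f"      # Hiragana block
        or "\u30a0" <= ch <= "\u30ff"   # Katakana block
        for ch in text
    )

print(looks_japanese("これは日本語です"))  # kana present
print(looks_japanese("这是中文文本"))      # ideographs only
```

As the parent notes, this fails on kana-free fragments (very short strings, classical text), so it is only a first-pass filter.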



> Han unification was pushed through by western interests, by my understanding.

Note that as far as I'm aware, the interest in question was the initial 16-bit limit of the character set and later on the non-proliferation of competing standards.

Also note that while Han unification is the most prominent example, there are technically similar cases, which just aren't as charged culturally. For one, Unicode doesn't encode German Fraktur: While some characters are available due to their use in mathematics, it's lacking the corresponding variants of ä, ö, ü, ß, ſ as well as specific ligatures. So if you want to intermix modern with old German writing, you'll also have to go out-of-band.
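The mathematics-only carve-out is easy to demonstrate: the Mathematical Alphanumeric Symbols block has the bare Fraktur alphabet, but nothing beyond plain Latin letters, so (for example) a Fraktur ä has no character name to look up. A quick sketch using Python's `unicodedata`:

```python
import unicodedata

# The mathematical Fraktur letters exist as distinct code points...
print(unicodedata.lookup("MATHEMATICAL FRAKTUR CAPITAL A"))  # 𝔄, U+1D504
print(unicodedata.lookup("MATHEMATICAL FRAKTUR SMALL A"))    # 𝔞, U+1D51E

# ...but only the plain A-Z/a-z set; umlauts, ß, and Fraktur-specific
# ligatures are absent, so a lookup like this one fails:
try:
    unicodedata.lookup("MATHEMATICAL FRAKTUR SMALL A WITH DIAERESIS")
except KeyError as exc:
    print("no such character:", exc)
```

So mixed modern/Fraktur German text indeed has to carry the distinction out-of-band, e.g. via fonts or markup.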


Let's not excuse the utter irresponsibility of deciding on 16 bits: the initial 16-bit limit is instantly invalidated by looking at any comprehensive Chinese character dictionary, no reasonable choice of which will give you an estimate under about 30k characters, even excluding graphical variants.

Even assuming that we discount 80k+ estimates by collapsing graphical variants, that's over half of your code space right off the bat. For this to seem like a good idea, you'd need to assume that Chinese is a uniquely bad one-off case. Not a good bet to stake your character set on.
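The back-of-the-envelope arithmetic, assuming "code space" means the BMP minus the surrogate range and the private use area (the counts below are my assumptions, not figures from the comment above):

```python
# How much of a 16-bit character set would ~30k Han characters consume?
total = 2 ** 16            # 65,536 code points in a 16-bit space
surrogates = 0x0800        # 2,048 code points (U+D800..U+DFFF), not characters
private_use = 6400         # U+E000..U+F8FF, reserved for private use
usable = total - surrogates - private_use

han = 30_000               # conservative dictionary estimate from above
print(f"{han / usable:.0%} of the usable code points")
```

Under those assumptions the 30k conservative estimate already eats a bit over half of what's actually assignable.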


That's not the same thing. Fraktur is just a style of font; antiqua and Fraktur letters are semantically the same.


It's actually exactly the same thing. Han unification didn't smash together unrelated squiggles that just happened to look similar; they were semantically the same. Scholars of the Han writing system spent a lot of time deciding what is or is not the same squiggle just drawn differently, like Fraktur, and today people are annoyed because, as you'd expect, some of them believed that the "style of fonts" was integral to the meaning anyway.


Chinese characters represent Chinese words or parts thereof; Japanese ones represent Japanese words or parts thereof. That is a semantic difference.


So what you're saying is that because 'chat' in English and 'chat' in French are quite different words with very different meanings, you believe there should be a separate letter 'c' for English and French to enable us to tell those words apart?


The Latin alphabet is not logographic.


It is not logographic, but characters still have meaning: associated phonemes. Although this is less clear in English, it is emphasized in other languages.

And this mapping is different between languages. So 'c' in English has a different meaning than 'c' in Czech.


Not really. Morphemes are considered (defined, even) as the smallest units that have meaning by themselves.


There are differences as well as similarities. I'm no expert, but shouldn't, say, U+4ECA still translate to 'now' no matter if you draw a particular line horizontally or diagonally? There are also some mandatory[1] ligatures in Fraktur unavailable in Unicode. What if I wanted to preserve that distinction in historic writing?

edit:

[1] I think the mandatory ones are actually there (just not in Fraktur), it's some optional ones like ſch that are missing.


> There are differences as well as similarities. I'm no expert, but shouldn't, say, U+4ECA still translate to 'now' no matter if you draw a particular line horizontally or diagonally?

No, since "now" is an English word, not a Japanese or Chinese one.

> There are also some mandatory[1] ligatures in Fraktur unavailable in Unicode.

Unicode doesn't encode ligatures except for backwards compatibility.


> Unicode doesn't encode ligatures except for backwards compatibility.

And it doesn't encode separate variants for unified Han characters. As in, that's not an argument, just a description of the status quo.


Of course it is. Ligatures aren't characters, they're glyphs that represent multiple characters. Unicode does not encode glyphs, that's simply not its job. No more than encoding what font to use or when to render text in italic.


Which is the whole point of Han unification, the argument being that whether or not a particular line in U+4ECA is horizontal or diagonal is just like that. What's the difference?


To the contrary: What any line in any glyph looks like is of no concern because Unicode doesn't deal with glyphs. It deals with abstract characters that don't have appearances to begin with.

"Α" and "A" look exactly the same (at least in most fonts). But each has its own code point because the GREEK CAPITAL LETTER ALPHA simply isn't the LATIN CAPITAL LETTER A or any other Latin letter.
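That pair is easy to inspect from Python; the glyphs may render identically, but the characters are distinct:

```python
import unicodedata

# Greek capital alpha and Latin capital A: same appearance, different characters
for ch in ("Α", "A"):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

This is also why such pairs show up in homoglyph phishing: equality is decided on code points, never on what the font happens to draw.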


As I understand it, Han unification happened because at the time all there was was UCS-2 (no UTF-16, no UTF-8), so codespace was tight and precious, and that motivated codespace-preserving optimizations, of which Han unification is the notable one.

To avoid that, they would have needed to invent UTF-8 many years earlier. Perhaps if the people designing Unicode were more diverse they might have felt the necessity to invent UTF-8 to the point of actually doing it, but then perhaps they might have done it poorly. At any rate, I don't know enough details to really know whether "Han unification was pushed through by western interests" is remotely fair.


UTF-8 was sketched on a placemat as a response to a different idea. It seems likely that had it not arisen in a moment of inspiration by a genius, we would be stuck with another inferior design by committee.

https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt


I agree. But then, necessity is the mother of invention. GP seems to argue that Han unification happened because the Unicode Consortium was not diverse enough. Maybe, and maybe if it had been diverse enough the need would have arisen sooner. But again, the thing they came up with could have been garbage, who knows!

What I do know is that UTF-8 is genius. The Han unification problems seem mostly minor -- I suspect code can detect language and do the right thing, for example, and again, we could revive language tags if need be.


UTF-8 is a magnificent hack.
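The hack fits in a few lines: a prefix code where ASCII passes through unchanged and every continuation byte is self-identifying. A minimal sketch (hypothetical `utf8_encode` helper, ignoring surrogate and range validation that a real encoder needs):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8 (toy version, no input validation)."""
    if cp < 0x80:                       # ASCII: one byte, unchanged
        return bytes([cp])
    if cp < 0x800:                      # two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                    # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12,
                      0x80 | (cp >> 6) & 0x3F,
                      0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,      # four bytes for astral code points
                  0x80 | (cp >> 12) & 0x3F,
                  0x80 | (cp >> 6) & 0x3F,
                  0x80 | cp & 0x3F])

print(utf8_encode(0x41))     # ASCII stays a single byte
print(utf8_encode(0x4ECA))   # U+4ECA ("now") takes three bytes
```

Since no continuation byte can be confused with a lead byte, a decoder can resynchronize mid-stream, which is a big part of why it displaced the fixed-width designs.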


Here's some text I could write about some Japanese characters, that, thanks to Han Unification, may be confusing:

In 1946, the Japanese government created a (non-exhaustive) list of common characters, some of which were simplified from their more traditional form. One of them is 臭. Its older form was 臭. Another character that shares the same root, 嗅, was not part of that list of common characters. It was added later, in 2010, and was never simplified, such that the stroke that was removed in 臭 is still there, making it just slightly different.

If your fonts are biased towards Chinese, 臭 and 臭 will be identical, and you won't know what I'm talking about. The former is 自 above 大, the latter is 自 above 犬.

You could think the difference is trivial, but 大 is big and 犬 is dog. Not that it alters the meaning of 臭, 臭, or 嗅, but when talking about how 嗅 is not 口 alongside 臭 anymore, it does make a difference.
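For what it's worth, Unicode does carry the older form as a separate code point, just not a unified one. Assuming the pre-simplification glyph above corresponds to the compatibility ideograph U+FA5C (my reading of the example, one plausible encoding of it), the catch is that canonical normalization folds it straight back into the unified character:

```python
import unicodedata

new_form = "\u81ed"  # CJK UNIFIED IDEOGRAPH-81ED (自 over 大 in Japanese fonts)
old_form = "\ufa5c"  # CJK COMPATIBILITY IDEOGRAPH-FA5C (自 over 犬)

print(unicodedata.name(new_form))
print(unicodedata.name(old_form))

# Compatibility ideographs have singleton canonical decompositions,
# so any NFC/NFD pass erases the distinction again:
print(unicodedata.normalize("NFC", old_form) == new_form)
```

Which means the distinction survives only as long as nothing in the pipeline normalizes the text, exactly the kind of fragility the thread is complaining about.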


Yes, the real problem is when you start mixing all four (or five) of them together: Traditional Chinese, Simplified Chinese, Korean, and Japanese. Things become extremely problematic.

I think it is by luck that all four writing systems have significant usage within their own regions. Imagine if one of them were significantly smaller and over time were forced (by ease of use or whatever reason) to switch to a different style without knowing it.



