Please understand that Han unification is _the_ problem. It is clear that Unicode needs to realize that Han unification is wrong and accept what the native writers of those languages think about their scripts.

To make the problem more understandable to the people that are used to alphabetic scripts, suppose that tomorrow an Asian committee starts creating Uniword, a repertoire that maps complete words to numerical IDs. At a certain point they get to "colour".

Uniword committee: Well, that word shares meaning and origin with the other word "color", for which we already have a codepoint, so we will encode them under the same codepoint.

GB, Australia and Canada: Hey! No! To us those are different words; in particular, we do not want Mr. Colours to appear as Mr. Color.

Uniword committee: No problem, just add some out-of-band information like "nationality" or "<span lang='en-GB'>"

"colour"-people: that will not work, there are so many cases in which this can go wrong. Whenever I copy a field from a DB I also have to extract this extra information?

Uniword: yes, is that the problem? C'mon!

"colour"-people: but do you need to do that in your applications?

Uniword: no, we have one code for every single word in our languages, including codes for very old languages that exist only in two palimpsests.

"colour"-people: and why cannot we have the same level of granularity?

Uniword: because you have too many words!!! And when we started we had only 100k available integers.

"colour"-people: and now?

Uniword: now we have 2^32. But, yeah, that is not the point; just do as we suggest. This dialog is getting too long.

"colour"-people: "dialogue", please.
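
For anyone who wants to see the DB-copy problem from the dialogue in concrete terms, here is a minimal Python sketch. It assumes U+76F4 as the example character (a commonly cited unified ideograph whose preferred glyph differs between Chinese and Japanese). The record layout and the `copy_text_only` helper are hypothetical, made up purely for illustration.

```python
import unicodedata

# U+76F4 is a unified Han ideograph: a single code point shared by
# Chinese and Japanese text. (Assumed example character.)
text = "\u76f4"

# Hypothetical records: the language has to travel in a separate,
# out-of-band field because the code point itself does not carry it.
ja_record = {"text": text, "lang": "ja"}
zh_record = {"text": text, "lang": "zh-Hans"}


def copy_text_only(record):
    """Copy just the text field, the way a naive export might."""
    return record["text"]


ja_copy = copy_text_only(ja_record)
zh_copy = copy_text_only(zh_record)

# The character database names the code point as a unified ideograph.
print(unicodedata.name(text))

# Once the "lang" field is dropped, the two copies cannot be told apart,
# neither as strings nor as encoded bytes.
print(ja_copy == zh_copy)
print(ja_copy.encode("utf-8") == zh_copy.encode("utf-8"))
```

Once only the text field is copied, nothing in the string says which language it came from, so whatever renders it has to guess or fall back to a default font.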




The only way I could improve on this dialogue would be accusing the Australians of anti-American prejudice for refusing to accept the English unification.

That was perceived as happening more than a few times in the Han Unification debate.


This is a great summary, thanks for that. I'd never had this explained in a way I could personally relate to.

I remember being concerned about Han unification around the time Ruby 1.9 was released, since this seemed to be one of Ruby's major reasons for being encoding-independent instead of standardizing on Unicode. But I hadn't heard about this issue in a while, except to occasionally hear someone say it's not a problem (maybe it was a Chinese person rather than a Japanese person -- the Wikipedia page says that the Chinese aren't as concerned about Han unification, since Traditional Chinese didn't get unified with Simplified Chinese).


Thanks, I replied a bit too early without reading. I think you capture the problem nicely.



