
Disagree. A good example of the opposite mistake is the “Turkish i” problem. Turkish has a version of I with a dot and one without, in both lowercase and uppercase, so algorithms that uppercase i to I break Turkish text by removing the dot. If the Turkish i were a unique code point, those algorithms would not mess it up.
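
To make the failure concrete, here is a minimal Python sketch. It assumes only the built-in `str.upper()` and `str.lower()`, which are not locale-aware; `turkish_upper` is a hypothetical helper written for this illustration, not a standard library function.

```python
# Python's built-in case mapping knows nothing about Turkish.
DOTTED_CAPITAL_I = "\u0130"   # İ, the Turkish uppercase of i
DOTLESS_SMALL_I = "\u0131"    # ı, the Turkish lowercase of I

plain = "istanbul".upper()    # dot is lost: "ISTANBUL"
round_trip = "ISI".lower()    # Turkish "ısı" (heat) cannot be recovered: "isi"


def turkish_upper(text: str) -> str:
    """Hypothetical helper: uppercase using Turkish rules for i and ı."""
    # Map the two Turkish-specific letters first, then let upper() do the rest.
    return text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I").upper()


print(plain)                      # ISTANBUL
print(turkish_upper("istanbul"))  # İSTANBUL
```

The helper only works because the caller already knows the text is Turkish; the code points themselves do not carry that information.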



Then you have the German ß (sharp S), which does not have an upper case version. While ISO added one for whatever reason, the official upper case is two letters, either "SS" or "SZ". So you have three different ways to upper case ß: one that is guaranteed to be wrong in any official context, and two that lower case to "ss" or "sz" and not back to ß. That is one big ouch, especially the ISO standard adding that invalid upper case variation. Languages are messy; it's best not to transform your input text in any way.
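
A short Python sketch of the round-trip problem described above, using only built-in string methods. The variable names are just for illustration.

```python
# Uppercasing ß expands to two letters, and lowercasing does not undo it.
word = "straße"
upper = word.upper()          # "STRASSE"
back = upper.lower()          # "strasse", not "straße"

# The capital sharp S (U+1E9E) exists as a code point and lowercases to ß,
# but upper() never produces it.
capital_sharp_s = "\u1e9e"
lowered = capital_sharp_s.lower()   # "ß"

# For caseless comparison, casefold() maps ß to "ss" on both sides.
same = "straße".casefold() == "STRASSE".casefold()

print(upper, back, lowered, same)
```

If the goal is only comparison, `casefold()` avoids the transformation of stored text altogether.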


> Then you have the German ß (sharp S) which does not have an upper case version. While ISO added one for whatever reason

It's used in typesetting sometimes, and if a character is used then it should have an encoding.


>It's used in typesetting sometimes, and if a character is used then it should have an encoding.

IMO there's little semantic difference so it doesn't deserve a character. We should have drawn the line between content and formatting, but it's too late and what we have now is emoji and one-use glyphs. [1]

[1] https://en.wikipedia.org/wiki/Multiocular_O


Uppercasing is heavily dependent on context. Even the ASCII characters are context dependent.

a+b=d+d

Shouldn't be uppercased.

It's an insoluble problem to put contextual semantic info into Unicode characters, because individual characters have no context.


This then makes the CJK unification decision even more perplexing. Surely Japanese characters should not be treated the same as Mandarin ones, even if they look the same?


>Surely Japanese characters should not be treated the same as Mandarin ones

No. In Japanese, how you read/pronounce a character depends on context. Sometimes they are the same as Chinese, sometimes not.

Take mountain (山) for example.

Using the Chinese pronunciation it is "san". 富士山 (Mount Fuji) is ふじさん "Fuji san"

Using the Japanese pronunciation it is "yama". 山登り (Mountain Climbing) is やまのぼり "yamanoboru"

(and don't call me Shirley)
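
A small Python sketch of the point above: 山 is a single unified code point, and the reading comes from the word it appears in, not from the character. The `READINGS` table is a hypothetical lookup built from the two examples in this comment, not a real dictionary API.

```python
import unicodedata

yama = "山"
code_point = ord(yama)               # the same value in Chinese or Japanese text
name = unicodedata.name(yama)

# Hypothetical word-level table: the reading is a property of the word.
READINGS = {
    "富士山": "ふじさん",    # Chinese-derived reading of 山: "san"
    "山登り": "やまのぼり",  # native Japanese reading of 山: "yama"
}

print(hex(code_point), name)
for word, kana in READINGS.items():
    print(word, kana)
```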


I can't speak Japanese, only some Chinese, but I'm wondering whether the choice between the (Chinese) Onyomi and the (Japanese) Kunyomi pronunciation is related in any way to whether the 山 comes first or last in the compound. If it comes last, as in 富士山 "Fuji san", the grammar matches the Chinese, and so does the pronunciation ("Fushishan"). If it comes first, as in 山登り "yamanoboru", the grammar is the opposite of the Chinese (which would also have the 山 last, i.e. 跑山).

PS: Isn't り pronounced "ri" and る pronounced "ru"?


As a rule of thumb, I have learned that Onyomi is usually used when a kanji is part of a compound word and Kunyomi is usually used when the kanji is by itself.

Yes, I typoed that and it's too late to fix it. り is "ri", not "ru".



