
The two Turkish letters dotted and dotless i are often confused by users of poorly localised software. Wikipedia links to a murder case allegedly caused by this: http://en.wikipedia.org/wiki/Dotted_and_dotless_I

A real horror story.

(Less seriously, Unicode has counterintuitive case-changing behaviours with those letters. If you are working outside the Turkish locale and uppercase a dotless I and then lowercase it, it gains a dot. I am curious about this design decision, since it seems like a basic error in operating at the level of glyphs rather than symbols. Or maybe the opposite.)
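A quick way to see the round trip described above, as a minimal sketch using Python's built-in `str.upper` and `str.lower`, which apply Unicode's default (non-Turkish) case mappings:

```python
dotless = "\u0131"  # ı LATIN SMALL LETTER DOTLESS I

upper = dotless.upper()     # default mapping sends ı to plain ASCII "I"
round_trip = upper.lower()  # and plain "I" lowercases to dotted "i"

print(repr(upper), repr(round_trip))
print(round_trip == dotless)  # the dotless letter did not survive the round trip
```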




Upper and lower casing can't be assumed to be inverse; there are plenty of other cases where they will change (e.g. precomposed characters that don't have a precomposed upper case). The correct lower-casing of "I" in English is definitely "i"; the correct upper-casing of "ı" in English is maybe a wrong question, because it just isn't an English letter, so I guess you could argue for leaving it unchanged, but converting it to "I" is probably what the person who wrote "ı" would want to happen when it was upper-cased. Maybe?
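One example of the precomposed-character case, sketched in Python. The choice of "ǰ" (U+01F0) is mine, not the commenter's; it has no single-codepoint capital, so uppercasing expands it into two codepoints:

```python
import unicodedata

original = "\u01f0"  # ǰ LATIN SMALL LETTER J WITH CARON (precomposed)

upper = original.upper()  # "J" followed by U+030C COMBINING CARON
back = upper.lower()      # "j" followed by U+030C, not the precomposed letter

print([hex(ord(c)) for c in upper])
print(back == original)                                 # codepoints differ
print(unicodedata.normalize("NFC", back) == original)   # equal after recomposition
```

In this particular case the difference disappears once both strings are normalized to the same form.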


> there are plenty of other cases where they will change (e.g. precomposed characters that don't have a precomposed upper case).

Just curious, are there still some cases if you only consider NFKD strings/characters?


Yes; the Turkish "I"s under discussion here are the most immediate case, but there are other cases where you have two almost-aliases in one case that aren't present in another case even ignoring composition. E.g. the ohm symbol "Ω" lowercases to a standard omega "ω", but that uppercases to a standard uppercase omega "Ω", because there's a distinct codepoint for "ohm symbol" (even though it's "just" omega, perhaps because some legacy codepages included it as a symbol without including a full greek alphabet) but no corresponding lowercase codepoint.
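The ohm example can be checked directly; a small sketch with Python's standard `unicodedata` module:

```python
import unicodedata

ohm = "\u2126"               # Ω OHM SIGN
lower = ohm.lower()          # lowercases to the ordinary Greek omega
upper_again = lower.upper()  # which uppercases to the Greek capital, not the ohm sign

for ch in (ohm, lower, upper_again):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

print(upper_again == ohm)  # the round trip lands on a different codepoint
```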


> E.g. the ohm symbol "Ω" lowercases to a standard omega "ω", but that uppercases to a standard uppercase omega "Ω", because there's a distinct codepoint for "ohm symbol" (even though it's "just" omega, perhaps because some legacy codepages included it as a symbol without including a full greek alphabet) but no corresponding lowercase codepoint.

Except that the NFKD form (which I was specifically asking about) for 'OHM SIGN' is 'GREEK CAPITAL LETTER OMEGA'.
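For reference, a sketch that normalizes the ohm sign under each of the four standard forms. As far as I can tell the mapping to capital omega is a canonical one, so it is not specific to NFKD:

```python
import unicodedata

ohm = "\u2126"  # Ω OHM SIGN

# Normalize the single character under every standard normalization form.
forms = {
    form: unicodedata.normalize(form, ohm)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, result in forms.items():
    print(form, f"U+{ord(result):04X}", unicodedata.name(result))
```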


Ah sorry, I was sure I'd read NFD. Will look more.


Is that actually the right thing to do, or is it another mistake?


As you said and linked elsewhere in the thread, the Unicode Consortium takes the viewpoint that there should be one codepoint for each glyph, even if that glyph has multiple semantic meanings in different languages (e.g. "U"). So by that standard they should probably be the same codepoint, but in that case it's hard to argue that Roman capital I and Turkish capital dotless I should be different codepoints.

Alternately you could argue that ohm symbol shouldn't lowercase to omega, which, maybe. I think the right view is simply that lower- and upper-casing aren't always well defined, are culturally and contextually dependent, and are probably something you should only ever be doing for display, not for semantic purposes. (If you want to do case-insensitive comparisons of strings, Unicode comes with algorithms for that which do a better job than upper- or lower-casing the strings before comparing.)
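A minimal sketch of such a comparison in Python, assuming `str.casefold` (Unicode full case folding) wrapped in NFD normalization is an acceptable stand-in for the caseless matching the comment alludes to. `caseless_equal` is a hypothetical helper name:

```python
import unicodedata


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings ignoring case, using case folding plus NFD."""
    def key(s: str) -> str:
        # Normalize, fold case, then normalize again, since folding
        # can produce sequences that are no longer normalized.
        folded = unicodedata.normalize("NFD", s).casefold()
        return unicodedata.normalize("NFD", folded)
    return key(a) == key(b)


print(caseless_equal("Straße", "STRASSE"))  # ß folds to "ss"
print(caseless_equal("\u2126", "\u03c9"))   # ohm sign vs small omega
print(caseless_equal("I", "\u0131"))        # default folding is not Turkish-aware
```

The last line is a reminder that default folding still treats "I" and "ı" as different letters; Turkish-specific matching needs locale-tailored rules.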


With Unicode, why aren't the two Turkish Is just treated as if they have nothing to do with the normal Latin I? The fact that the glyph for uppercase dotless I resembles the glyph for uppercase Latin I should be irrelevant, surely. It's a kind of typographic false friend situation.

Maybe there's a missing level of indirection in Unicode that prevents it from doing this, but I can't see how there could be.


One answer is that Unicode had to import existing documents; I suspect that a lot of documents are written in a Turkish codepage that would have been an 8-bit encoding with the lower half as ASCII, and that wouldn't have bothered with a different codepoint for "Turkish" I. As I said, you can't rely on upper/lowercasing roundtripping correctly in general.

(I was about to give the example of ß, which is usually uppercased to SS. But interestingly Unicode has now adopted a codepoint for the (disputed, and currently lacking a typographic consensus) capital version, ẞ. So maybe a codepoint for "uppercase Turkish I" is on the way. Turkish users will still expect to be able to lowercase "I" to a dotless lowercase ı though, since a lot of existing documents will have "I"s in them.)
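The ß behaviour is easy to observe; a short sketch using Python's default case mappings:

```python
sharp_s = "\u00df"          # ß LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1e9e"  # ẞ LATIN CAPITAL LETTER SHARP S

upper = sharp_s.upper()           # expands to two letters, "SS"
back = upper.lower()              # "ss", so the ß is lost
from_capital = capital_sharp_s.lower()  # the capital form does lowercase to ß

print(repr(upper), repr(back), repr(from_capital))
print(upper == capital_sharp_s)   # default uppercasing does not produce ẞ
```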


I did a bit of research on this and you're right, legacy encodings are one problem. More seriously there seems to be no established way to manage multilingual text which includes homoglyphs (say by using colour coding) so you would really be replacing one problem with another.

It does seem like this Turkish I problem is the most conspicuous situation, maybe unique, where changing locale changes the behaviour of toupper/tolower. Unicode, on the other hand, has many homoglyphs and duplicate characters which all need to be dealt with.

http://en.m.wikipedia.org/wiki/Homoglyph http://en.m.wikipedia.org/wiki/Duplicate_characters_in_Unico...


Yeah, Turkish I is a bit of a red herring due to it being a quirk in Unicode specifically. From a previous comment of mine:

> They [Unicode consortium] should have specified Turkish alphabet to use ı and a diacritic to make the dotted one. That would have made (in this case) capitalization locale-independent. [...] I dislike the common usage of Turkish i as an example because it is such an obviously fixable (if legacy stuff wasn't a concern) flaw in Unicode rather than a fundamental issue.
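A rough sketch of what that proposal would look like with today's codepoints, assuming the dotted letter were spelled as ı plus U+0307 COMBINING DOT ABOVE. This encoding is hypothetical and only illustrates the uppercase direction:

```python
import unicodedata

DOT_ABOVE = "\u0307"            # COMBINING DOT ABOVE
dotless = "\u0131"              # ı
dotted = dotless + DOT_ABOVE    # hypothetical spelling of the dotted letter

# Default, locale-independent uppercasing handles both spellings.
upper_dotless = dotless.upper()
upper_dotted = unicodedata.normalize("NFC", dotted.upper())

print(f"U+{ord(upper_dotless):04X}", unicodedata.name(upper_dotless))
print(f"U+{ord(upper_dotted):04X}", unicodedata.name(upper_dotted))
```

Lowercasing would still need a change to the existing tables, since plain "I" lowercases to "i" by default.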


That's helpful. The Wikipedia page for "glyph" seems to concur:

"For example, in most languages written in any variety of the Latin alphabet the dot on a lower-case "i" is not a glyph because it does not convey any distinction, and an i in which the dot has been accidentally omitted is still likely to be read as an "i". In Turkish, however, it is a glyph because that language has two distinct versions of the letter "i", with and without a dot."


The Turkish I problem is seriously annoying. We recently tried adding Turkish translations to our website, only to find that .NET suddenly treats DataRow("ID") differently than DataRow("id") when your locale is set to Turkish.
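The mechanism is presumably a case-insensitive lookup that lowercases (or uppercases) with the current culture. A sketch in Python, where `turkish_lower` is a hand-rolled, hypothetical approximation of Turkish lowercasing (Python's `str.lower` is not locale-sensitive) and does not model .NET itself:

```python
def turkish_lower(s: str) -> str:
    """Approximate Turkish lowercasing: İ -> i and I -> ı, then default rules."""
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()


column = "id"
requested = "ID"

# Under default rules the names match; under Turkish rules they do not.
print(requested.lower() == column)         # default lowercasing: "id"
print(turkish_lower(requested) == column)  # Turkish lowercasing: "ıd"
```

The usual remedy is to compare identifiers with locale-independent (invariant or ordinal) rules and reserve culture-aware casing for user-visible text.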



I think if you stab somebody over a text message, there's more going on than just a missing dot. As if it'd be acceptable to kill somebody even if they did call your daughter a prostitute.

The article itself could benefit from corrections: "with a knife on his chest" That doesn't sound so bad.


> Less seriously, Unicode has counterintuitive case-changing behaviours with those letters. If you are working outside the Turkish locale and uppercase a dotless I and then lowercase it, it gains a dot.

AFAIK the only solution would be to error out when uppercasing a dotless i in a non-Turkish locale, which I'm not sure sounds better. Or going back in time and creating a separate category of i and I for the Turkish script.
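For illustration, the "error out" option might look like the following sketch. `strict_upper` and its `turkish` flag are hypothetical names, and the Turkish branch is a hand-rolled approximation rather than a real locale API:

```python
def strict_upper(s: str, turkish: bool = False) -> str:
    """Uppercase s, refusing dotless ı unless Turkish rules are requested."""
    if turkish:
        # Turkish rules: i -> İ and ı -> I, then default rules for the rest.
        return s.replace("i", "\u0130").replace("\u0131", "I").upper()
    if "\u0131" in s:
        raise ValueError("dotless ı cannot be uppercased without Turkish rules")
    return s.upper()


print(strict_upper("ilk", turkish=True))        # İLK
print(strict_upper("\u0131s\u0131", turkish=True))  # ISI
try:
    strict_upper("\u0131s\u0131")
except ValueError as exc:
    print("refused:", exc)
```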



