Hacker News

The problem with text (that Unicode solves only partially) is that text representation, being a representation of human thought, is inherently ambiguous and imprecise.

Some examples:

(1) A == A but A != Α. The last letter is not an uppercase "a", but an uppercase "α". Most of the time the difference is important, but sometimes humans want to ignore it (imagine you can't find an entry in a database because it contains Α, which looks just like A). Google gives different autocomplete suggestions for A and Α. Is this outcome expected? Is it desired?
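To make the point concrete, here is a small sketch using Python's standard `unicodedata` module with exactly the characters from the example: the two letters render identically in most fonts but remain distinct code points, and even compatibility normalization does not unify them.

```python
import unicodedata

latin_a = "A"           # U+0041 LATIN CAPITAL LETTER A
greek_alpha = "\u0391"  # U+0391 GREEK CAPITAL LETTER ALPHA

# Visually identical in most fonts, but distinct code points:
print(latin_a == greek_alpha)         # False
print(unicodedata.name(latin_a))      # LATIN CAPITAL LETTER A
print(unicodedata.name(greek_alpha))  # GREEK CAPITAL LETTER ALPHA

# Normalization (even the aggressive NFKC form) keeps them distinct,
# because they are different letters, not presentation variants:
print(unicodedata.normalize("NFKC", greek_alpha) == latin_a)  # False
```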

(2) The Turkish alphabet is mostly the same as the Latin alphabet, except for the letter "i", which exists in two variants: dotless ı and dotted i (as in Latin). For the sake of consistency, this distinction is kept in the upper case as well: dotless I (as in Latin) and dotted İ. We can see that not even the uppercase <==> lowercase transformation is defined for text independently of language.
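Python's built-in case conversions illustrate this: they follow the default, locale-independent Unicode mappings, which are wrong for Turkish (a sketch; proper Turkish handling needs a locale-aware library such as ICU).

```python
# Default (locale-independent) Unicode case mapping, as Python implements it:
print("i".upper())       # 'I' -- correct for English, wrong for Turkish
print("\u0131".upper())  # dotless ı upper-cases to plain 'I'

# U+0130 (İ) has a *full* lowercase mapping: 'i' followed by
# U+0307 COMBINING DOT ABOVE -- two code points, not one:
print(len("\u0130".lower()))  # 2

# In Turkish, "i".upper() should yield "İ" and "I".lower() should yield "ı",
# but that tailoring is only available through locale-aware case conversion.
```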

These are just two examples of problems with text processing that arise even before all the problems with Unicode (combining characters, ligatures, double-width characters, ...) and without considering all the conventions and exceptions that exist in richer (mostly Asian) alphabets.




I think your first assertion can be strengthened even further. It isn't like this is unique to letters that look the same. That is, sometimes WORD != WORD. Consider a few common words. Time? As in Time of day? As in how long you have? An interesting combination of the two? Day? As in a marker on the calendar? Just the time when the sun is out? Then we get into names. Imagine the joy of having to find someone named "Brad" that isn't famous. From a city named Atlanta, but not the one in GA. (If you really want some fun, consider the joy that is abbreviations. Dr?)


Except these are all well outside the ambit of what programmers usually think of as text processing, so they won't try to solve them using the same tools.

More to the point, they sound hard, so people won't be so quick to claim they've solved them.

On the other hand, case-insensitive string matching sounds easy, even if it's actually somewhat difficult due to the language dependencies mentioned above, so people will claim to have a general solution that fails the first time it's faced with i up-casing to İ instead of I, or the fact the German 'ß' up-cases to 'SS' as opposed to any single character. (Unicode does contain 'ẞ', a single-character capital 'ß', which occurs in the real world but is vanishingly rare. As far as modern German speakers are concerned, the capital form of 'ß' is 'SS'.)

http://en.wikipedia.org/wiki/Capital_%E1%BA%9E

http://opentype.info/blog/2013/11/18/capital-sharp-s-design-...

http://blogs.msdn.com/b/michkap/archive/2009/07/28/9850675.a...

http://www.personal.psu.edu/ejp10/blogs/gotunicode/2008/07/a...
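Python 3 happens to implement the full case mappings described above, so the ß behavior is easy to demonstrate, including why `casefold()` rather than `upper()`/`lower()` is the right tool for case-insensitive matching:

```python
# Full Unicode case mapping: one character can upper-case to two.
print("ß".upper())        # 'SS'
print("straße".upper())   # 'STRASSE'

# The rare capital ẞ (U+1E9E) exists, and it lower-cases back to ß,
# but 'ß'.upper() never produces it:
print("\u1e9e".lower())   # 'ß'

# For case-insensitive comparison, casefold() maps ß to 'ss':
print("straße".casefold() == "strasse".casefold())  # True
```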


Right, I do not disagree. I just feel better treating them the same. That is, both are actually easy and reliable so long as you realize you have to make some gross simplifications. And most of the time your life will be much easier if you start with the gross simplifications and try to expand beyond them only when necessary. (This is also why I'm loath to try programming in Unicode...)


I think (2) is an issue with Unicode specifically. They should have specified the Turkish alphabet to use ı plus a combining diacritic to make the dotted one. That would have made capitalization (in this case) locale-independent.


While that's a problem with Unicode, it's a deliberate one. As the name alludes to, Unicode preserved as much as possible of existing regional encodings, which is why (among other reasons) there's a pre-composed version of basically every accented Latin letter.


Isn't this solving the wrong side of the problem? How about not having to think about such things at all, and just accepting that uppercase/lowercase conversion is never going to be language-agnostic?

That's future-proof and powerful, rather than extra thinking and work...


Most likely case changes need to be locale-aware, that is true. But I still think minimizing the number of locale-specific rules is a reasonable goal, and in that light I dislike the common use of the Turkish i as an example, because it is such an obviously fixable flaw in Unicode (if legacy compatibility weren't a concern) rather than a fundamental issue.


You are right, everything should be as easy as possible. This is a good philosophy for design in general...


Homoglyphs sometimes vary with text style, though. So Α doesn't always have to look like A. Or, more to the point, while T and т might look alike in upright type, their italic forms often do not (italic т often looks like m). So even as humans we need to keep track of the script at times.
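Since homoglyphs can diverge depending on font or style, programs that care have to track the script explicitly. Python's `unicodedata` module exposes enough to tell the characters in this example apart (a minimal sketch; real script detection would consult the Unicode Script property, e.g. via the `regex` module or ICU):

```python
import unicodedata

# Latin capital T versus Cyrillic small te -- lookalikes in upright type:
for ch in ("T", "\u0442"):
    print(repr(ch), unicodedata.name(ch))
# 'T' LATIN CAPITAL LETTER T
# 'т' CYRILLIC SMALL LETTER TE
```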


The funny thing is that, according to "the rules" (the Real Academia de la Lengua Española), in Spanish we should be always using \u0130, but of course no one does...





