Hacker News

The problem with text (that Unicode solves only partially) is that text representation, being a representation of human thought, is inherently ambiguous and imprecise.

Some examples:

(1) A == A but A != Α. The last letter is not an uppercase "a", but an uppercase "α". Most of the time the difference is important, but sometimes humans want to ignore it (imagine you can't find an entry in a database because it contains Α, which looks just like A). Google gives different autocomplete suggestions for A and Α. Is this outcome expected? Is it desired?
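To make the point concrete, here is a small sketch using Python's standard `unicodedata` module with exactly the characters from the example: the two letters render identically in most fonts but remain distinct code points, and even compatibility normalization does not unify them.

```python
import unicodedata

latin_a = "A"           # U+0041 LATIN CAPITAL LETTER A
greek_alpha = "\u0391"  # U+0391 GREEK CAPITAL LETTER ALPHA

# Visually identical in most fonts, but distinct code points:
print(latin_a == greek_alpha)         # False
print(unicodedata.name(latin_a))      # LATIN CAPITAL LETTER A
print(unicodedata.name(greek_alpha))  # GREEK CAPITAL LETTER ALPHA

# Normalization (even the aggressive NFKC form) keeps them distinct,
# because they are different letters, not presentation variants:
print(unicodedata.normalize("NFKC", greek_alpha) == latin_a)  # False
```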

(2) The Turkish alphabet is mostly the same as the Latin alphabet, except for the letter "i", which exists in two variants: dotless ı and dotted i (as in Latin). For the sake of consistency, this distinction is kept in the upper case as well: dotless I (as in Latin) and dotted İ. We can see that not even the uppercase <==> lowercase transformation is defined for text independently of language.
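Python's built-in case conversions illustrate this: they follow the default, locale-independent Unicode mappings, which are wrong for Turkish (a sketch; proper Turkish handling needs a locale-aware library such as ICU).

```python
# Default (locale-independent) Unicode case mapping, as Python implements it:
print("i".upper())       # 'I' -- correct for English, wrong for Turkish
print("\u0131".upper())  # dotless ı upper-cases to plain 'I'

# U+0130 (İ) has a *full* lowercase mapping: 'i' followed by
# U+0307 COMBINING DOT ABOVE -- two code points, not one:
print(len("\u0130".lower()))  # 2

# In Turkish, "i".upper() should yield "İ" and "I".lower() should yield "ı",
# but that tailoring is only available through locale-aware case conversion.
```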

These are just two examples of problems with text processing that arise even before all the problems with Unicode (combining characters, ligatures, double-width characters, ...) and without considering all the conventions and exceptions that exist in richer (mostly Asian) alphabets.




I think your first assertion can be strengthened even further. It isn't like this is unique to letters that look the same. That is, sometimes WORD != WORD. Consider a few common words. Time? As in Time of day? As in how long you have? An interesting combination of the two? Day? As in a marker on the calendar? Just the time when the sun is out? Then we get into names. Imagine the joy of having to find someone named "Brad" that isn't famous. From a city named Atlanta, but not the one in GA. (If you really want some fun, consider the joy that is abbreviations. Dr?)


Except these are all well outside the ambit of what programmers usually think of as text processing, so they won't try to solve them using the same tools.

More to the point, they sound hard, so people won't be so quick to claim they've solved them.

On the other hand, case-insensitive string matching sounds easy, even if it's actually somewhat difficult due to the language dependencies mentioned above, so people will claim to have a general solution that fails the first time it's faced with i up-casing to İ instead of I, or the fact the German 'ß' up-cases to 'SS' as opposed to any single character. (Unicode does contain 'ẞ', a single-character capital 'ß', which occurs in the real world but is vanishingly rare. As far as modern German speakers are concerned, the capital form of 'ß' is 'SS'.)

http://en.wikipedia.org/wiki/Capital_%E1%BA%9E

http://opentype.info/blog/2013/11/18/capital-sharp-s-design-...

http://blogs.msdn.com/b/michkap/archive/2009/07/28/9850675.a...

http://www.personal.psu.edu/ejp10/blogs/gotunicode/2008/07/a...
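Python 3 happens to implement the full case mappings described above, so the ß behavior is easy to demonstrate, including why `casefold()` rather than `upper()`/`lower()` is the right tool for case-insensitive matching:

```python
# Full Unicode case mapping: one character can upper-case to two.
print("ß".upper())        # 'SS'
print("straße".upper())   # 'STRASSE'

# The rare capital ẞ (U+1E9E) exists, and it lower-cases back to ß,
# but 'ß'.upper() never produces it:
print("\u1e9e".lower())   # 'ß'

# For case-insensitive comparison, casefold() maps ß to 'ss':
print("straße".casefold() == "strasse".casefold())  # True
```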


Right, I do not disagree. I just feel better treating them the same. That is, both are actually easy and reliable so long as you realize you have to make some gross simplifications. And most of the time your life will be much easier if you start with the gross simplifications and try to expand beyond them only when necessary. (This is also why I'm loath to try programming in Unicode...)


I think (2) is an issue with Unicode specifically. They should have specified the Turkish alphabet to use ı plus a combining diacritic to make the dotted one. That would have made capitalization (in this case) locale-independent.


While that's a problem with Unicode, it's a deliberate one. As the name alludes to, Unicode preserved as much as possible of existing regional encodings, which is why (among other reasons) there's a pre-composed version of basically every accented Latin letter.


Isn't this solving the wrong side of the problem? How about not having to think about such things at all, and just accepting that uppercase/lowercase conversion is never going to be language-agnostic?

That's future-proof and powerful, rather than extra thinking and work...


Most likely case changes need to be locale-aware, that is true. But I still think minimizing the number of locale-specific rules is a reasonable goal, and in that light I dislike the common use of the Turkish i as an example, because it is such an obviously fixable flaw in Unicode (if legacy compatibility weren't a concern) rather than a fundamental issue.


You are right, everything should be as easy as possible. This is a good philosophy for design in general...


Homoglyphs sometimes vary with text style, though. So Α doesn't always have to look like A. Or, more to the point, while T and т might look alike in upright type, their italic forms often do not (italic т often looks like m). So even as humans we need to keep track of the script at times.
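Since homoglyphs can diverge depending on font or style, programs that care have to track the script explicitly. Python's `unicodedata` module exposes enough to tell the characters in this example apart (a minimal sketch; real script detection would consult the Unicode Script property, e.g. via the `regex` module or ICU):

```python
import unicodedata

# Latin capital T versus Cyrillic small te -- lookalikes in upright type:
for ch in ("T", "\u0442"):
    print(repr(ch), unicodedata.name(ch))
# 'T' LATIN CAPITAL LETTER T
# 'т' CYRILLIC SMALL LETTER TE
```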


The funny thing is that, according to "the rules" (the Real Academia de la Lengua Española), in Spanish we should be always using \u0130, but of course no one does...





