
Nope, big mistake. You cannot just search for "Café" in "Café", as there are multiple representations of the same character: the "é" can be a single precomposed code point or an "e" followed by a combining acute accent. Think also of Cyrillic vs Greek, Han vs Hangul. The popular garbage-in, garbage-out approach doesn't work for Unicode identifiers. That's why the Mac filesystem normalization was correct, and the Unix filesystems are all broken. If it's not identifiable, it's not an identifier. That's why Java and cperl are the only languages with proper and safe Unicode support. Python 3 is a bit better than Python 2, and Perl 6 is better than Perl 5 and the rest, but they still don't apply the Unicode security guidelines for identifiers. The situation is much better in browsers and email clients.
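To make the "Café" in "Café" point concrete, here is a minimal Python sketch. The helper name `contains` is made up for illustration, and NFC is just one reasonable choice of normalization form; it is not what any particular filesystem or language does.

```python
import unicodedata

composed = "Caf\u00e9"      # "é" as one precomposed code point (U+00E9)
decomposed = "Cafe\u0301"   # "e" followed by a combining acute accent (U+0301)

def contains(haystack: str, needle: str, form: str = "NFC") -> bool:
    """Substring test after normalizing both sides to the same form.

    Hypothetical helper: it only handles canonical equivalence,
    not lookalike characters from different scripts.
    """
    h = unicodedata.normalize(form, haystack)
    n = unicodedata.normalize(form, needle)
    return n in h

# Both strings render identically, but the naive comparisons fail.
print(composed == decomposed)          # False
print(composed in decomposed)          # False
print(contains(decomposed, composed))  # True
```

Normalization alone does not address confusables such as a Cyrillic "а" versus a Latin "a"; that needs a separate check along the lines of the Unicode security guidelines.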

And with case-insensitivity it gets worse, as there are some additional locale-dependent rules, for Turkish and Lithuanian. And this depends on "some" global settings.

http://perl11.org/blog/foldcase.html
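A small Python sketch of the case-folding problem mentioned above. Python's `str.lower()` and `str.casefold()` are locale-independent, so the Turkish rule has to be applied by hand; `turkish_lower` is a hypothetical helper written for this example, not a standard library function, and it ignores the Lithuanian rules entirely.

```python
def turkish_lower(s: str) -> str:
    """Lowercase with the Turkish dotted/dotless i rule (illustrative only).

    In Turkish, "I" lowercases to dotless "ı" (U+0131) and
    dotted "İ" (U+0130) lowercases to plain "i".
    """
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()

# Default lowercasing maps "I" to "i", which is wrong for Turkish text.
print("DIYARBAKIR".lower())                # diyarbakir
print(turkish_lower("D\u0130YARBAKIR"))    # diyarbakır

# Caseless matching should use casefold(), not lower():
print("Straße".lower() == "STRASSE".lower())        # False
print("Straße".casefold() == "STRASSE".casefold())  # True
```

The point is that the caller has to know the language of the text; nothing in the string itself says which rule applies.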




Japanese is also great. While there are general rules for when to use hiragana vs. katakana, sometimes the "wrong" one is used for stylistic reasons, or a kanji is "spelled out" in hiragana, and depending on context, you might or might not want to include these in your search results.
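For the hiragana vs. katakana part, a search can fold one script into the other before comparing. A rough Python sketch, relying on the fact that the basic katakana block (U+30A1 to U+30F6) sits exactly 0x60 above the matching hiragana; `kata_to_hira` and `kana_insensitive_find` are made-up names, and this does nothing for kanji spelled out as kana, which needs a dictionary.

```python
def kata_to_hira(s: str) -> str:
    """Map basic katakana to hiragana; leave every other character alone."""
    out = []
    for ch in s:
        cp = ord(ch)
        if 0x30A1 <= cp <= 0x30F6:   # katakana small a .. small ke
            out.append(chr(cp - 0x60))
        else:
            out.append(ch)
    return "".join(out)

def kana_insensitive_find(haystack: str, needle: str) -> bool:
    """Substring test that treats hiragana and katakana as equal."""
    return kata_to_hira(needle) in kata_to_hira(haystack)

print(kata_to_hira("カタカナ"))                          # かたかな
print(kana_insensitive_find("これはラーメンです", "らーめん"))  # True
```

Whether you want this folding at all depends on the search, so it is better as an option than as a default.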

Tbh I just stopped caring and now happily live in my little world where utf8 is the solution to everything, doesn't yield any problems ever, and anyone telling me differently gets completely ignored.



