
Yes, and that's why it's important to choose a tool that handles Unicode correctly. Perl is one example of such a tool.


Even if the language handles it correctly, the lowly programmer is playing with white-hot fire. Unicode adds so many edge cases that it's basically a serrated knife in the middle of your code.

Will your regex match correctly if the text changes direction halfway through? Will the case-insensitive search work in languages where case mapping is ambiguous (where lc(uc(x)) != x even for a lowercase x)? Will you be able to match letters that are effectively identical but sit on different code points? Do you want to? Will it get confused by furigana?
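To make a couple of those questions concrete, here is a rough sketch in Python rather than Perl (str.upper() and str.lower() play the role of uc and lc). The helper names round_trips and same_after are made up for illustration, and the sample characters are just ones I'm sure about; this is nowhere near a full treatment.

```python
import unicodedata

def round_trips(ch: str) -> bool:
    """Hypothetical helper: does lowercasing the uppercased form return ch?"""
    return ch.upper().lower() == ch

def same_after(form: str, a: str, b: str) -> bool:
    """Hypothetical helper: are a and b equal after the given normalization?"""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Case mapping that does not round-trip.
sharp_s = "\u00df"    # German sharp s: uppercases to "SS", which lowercases to "ss"
dotless_i = "\u0131"  # Turkish dotless i: uppercases to "I", which lowercases to "i"

# The same visible letter on different code points.
composed = "\u00e9"      # e with acute as one code point
decomposed = "e\u0301"   # plain e followed by a combining acute accent

# Look-alikes that normalization deliberately leaves alone.
latin_a = "A"            # U+0041 LATIN CAPITAL LETTER A
cyrillic_a = "\u0410"    # U+0410 CYRILLIC CAPITAL LETTER A
```

With these, round_trips("a") is true but round_trips(sharp_s) and round_trips(dotless_i) are not. composed == decomposed is false, yet same_after("NFC", composed, decomposed) is true, so normalizing first handles that case. same_after("NFKC", latin_a, cyrillic_a) stays false, which is the "do you want to?" part: matching look-alikes across scripts is a separate decision that normalization won't make for you. For caseless comparison, str.casefold() is usually a better fit than lower(): "Straße".casefold() == "STRASSE".casefold() is true.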

Unicode's goal of encoding every language in the world as-is means it encompasses every bizarre thing people have ever done to a written language. It is almost impossible to do anything with a block of Unicode code points except treat it like a big binary blob and hope that you never have to personally deal with the strings contained therein. Most everybody gets it wrong in one way or another.



