Tangentially related (rule 9): my girlfriend's surname contains an 'é'. I have yet to see a year go by without receiving mail with 'Ã©' on the address label where the 'é' should be.


    echo "é" | iconv -t utf8 -f iso8859-15
We're Dutch, and the é is part of our language, and even part of the legacy character encoding standard everyone used before Unicode's widespread adoption. This is just a matter of code that works perfectly as long as all characters are in the ASCII set, but fails on the characters whose byte representations differ between UTF-8 and ISO-8859-15.
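For anyone without iconv at hand, here is a rough Python sketch of the same mix-up. It assumes the sender's bytes are UTF-8 and the label printer decodes them as ISO-8859-15; the function name is made up for illustration.

```python
def misread_utf8_as_latin9(text: str) -> str:
    """Encode as UTF-8, then decode those bytes with the wrong codec."""
    return text.encode("utf-8").decode("iso-8859-15")

print(misread_utf8_as_latin9("é"))      # Ã©
print(misread_utf8_as_latin9("Miche"))  # Miche (pure ASCII survives untouched)
```

The é becomes two bytes in UTF-8 (0xC3 0xA9), and each of those bytes is a separate character in ISO-8859-15, which is why one letter turns into two.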

I doubt these issues will go away within even, say, twenty years.

It's getting much better. Almost everything new is in UTF-8.

I've been writing code to clean up a 2013 database dump. The database stored everything in LATIN-1 fields. Not because the data is in LATIN-1, but because LATIN-1 will accept any byte value. This makes error messages during input go away. See this bad advice on Stack Overflow.[1]

Some of the data is ASCII. Some is UTF-8. Some is Windows-1252. Some data is none of those, but is mostly ASCII except that there's a 0x9d once in a while. (Still haven't figured out what character set that is. From context, the ™ or ® symbol is intended.) So I have recognizers for these cases, and convert everything to UTF-8, testing every field value individually.
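A minimal sketch of what such per-field recognizers could look like in Python. This is not my actual cleanup code; the function name and the order of the fallbacks are assumptions for illustration.

```python
from typing import Tuple

def to_text(raw: bytes) -> Tuple[str, str]:
    """Return (decoded_text, guessed_encoding) for one field value."""
    # Try the strictest candidate first: ASCII, then UTF-8, then Windows-1252.
    for encoding in ("ascii", "utf-8", "cp1252"):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: Latin-1 accepts every byte value, so this never fails.
    # Stray bytes such as 0x9d come out as control characters to review by hand.
    return raw.decode("latin-1"), "latin-1"

print(to_text(b"caf\xc3\xa9"))  # ('café', 'utf-8')
print(to_text(b"Acme\x99"))     # ('Acme™', 'cp1252')
print(to_text(b"Acme\x9d")[1])  # latin-1
```

Python's cp1252 codec treats 0x9d as undefined, so those values fall through to the last branch instead of being silently accepted. The order matters: Windows-1252 accepts almost any byte sequence, so it has to come after UTF-8.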

One column has garbaged non-English names. Someone had tried to "normalize" UTF-8 to lower case by using an ASCII lowercasing function on UTF-8 stored in a LATIN-1 field:

    KACMAZLAR MEKANİK  -> kacmazlar mekanä°k
    Anita Calçados -> anita calã§ados
    Felfria Resor för att Koh Lanta -> felfria resor fã¶r att koh lanta
I have the un-"normalized" form and can fix this.
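The damage can be reproduced with a short Python sketch. It assumes the lowercasing behaved like Python's `str.lower()` applied to the Latin-1 view of the UTF-8 bytes (a strictly ASCII-only lowercase would leave byte 0xC3 alone); the function name is hypothetical.

```python
def broken_lowercase(name: str) -> str:
    """View UTF-8 bytes as Latin-1 text, then lowercase that text."""
    as_seen_in_latin1 = name.encode("utf-8").decode("latin-1")
    return as_seen_in_latin1.lower()

damaged = broken_lowercase("Anita Calçados")
print(damaged)  # anita calã§ados

# Reinterpreting the damaged text as UTF-8 no longer works, because
# lowercasing turned the lead byte 0xC3 ("Ã") into 0xE3 ("ã").
try:
    damaged.encode("latin-1").decode("utf-8")
    recoverable = True
except UnicodeDecodeError:
    recoverable = False
print(recoverable)  # False
```

That is why having the original form matters: the lowercased bytes are no longer valid UTF-8, so the usual mojibake round trip cannot undo this.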

[1] https://stackoverflow.com/questions/44251813/unicodedecodeer...

Even if something new is UTF-8, you'd basically have to guarantee that it never interfaces with anything old and/or broken to ensure the data survives intact. I've recently received a package where the ö in my name was written as ̦ which I've never seen before. Things go wrong in the most unexpected places and the sad thing is that even if you use UTF-8 for text storage, you still have to know what you can do and how to do it to not mangle it.

There are a lot of Unicode-hostile environments out there. Java is old enough to always require an explicit encoding declaration for pretty much any tool ... compiler, documentation generator, etc. Forget it at any one point and you get garbage. Code that reads or writes text files should always make the encoding explicit, but rarely does so. C#'s string methods all support, but don't require, a Culture parameter, without which you're practically guaranteed to get things like case conversion or substring searches wrong in the general case. There was an awesome and long answer by tchrist on SO once about what the Perl boilerplate is to properly support Unicode for many or most circumstances (it's complicated and long and I doubt many people are going to those lengths).

Point being, even when using something that supports Unicode well, the programmer still has to care, simply because text and language are messy things and it simply isn't possible to have a magic bullet that does everything right.
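As a small illustration of the "make the encoding explicit" point, here is a Python sketch. The helper names and the file name are made up; the only real API used is the `encoding` argument of the built-in `open()`.

```python
import os
import tempfile

def write_text(path: str, text: str) -> None:
    # Name the encoding instead of relying on the platform default.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)

def read_text(path: str) -> str:
    # Read with the same explicit encoding that was used for writing.
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "names.txt")
    write_text(path, "Miché\n")
    round_tripped = read_text(path)

print(round_tripped == "Miché\n")  # True
```

Leaving out `encoding=` on either side may still work on one machine and produce garbage on another, which is exactly the "forget it at any one point" failure.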

Dutch programmers probably have little incentive to get it right, because Dutch makes relatively light use of accents. I wouldn't be surprised if Czech programmers, for example, were much more meticulous at converting between UTF-8 and legacy encodings.

Here in CJK territory, using the wrong encoding makes the output so obviously broken [1] that mistakes are almost always caught before hitting production.

[1] https://en.wikipedia.org/wiki/Mojibake

I'm French and I see this all the time. Even Outlook, given the right conditions, will give you this in the default folders: Inbox in French is "Boite de réception".

Don't get me started on Outlook folders. The actual names of the folders are localised, not just the way they're presented to the user. The folders are created when you first start Outlook (and not when your account is created).

If you happen to be using Windows configured in a foreign language the first time you start Outlook, your inbox, sent mail, etc, folders are named according to that language, and will never change, and you'll have to live with non-standard names for the folders.

At least it teaches you how to configure folders manually in most email clients.

You can change the folder names after the fact. It is annoying, though.


That's why I take care that my OS doesn't know I'm French. But it's a luxury that most French people can't afford and they have to live with bugs. Ex: Having both a "Download" and a "Téléchargement" folder, with "Download" being sometimes translated.

This is something Apple got right. The underlying folder has a standard name, and the translated name is just a different presentation. If you change language, the names of the translated folders change.

For anyone who doesn't know OSX, this translation happens on the UI level. Typing ls in a terminal gives you the real directory name.

> I doubt these issues will go away within even, say, twenty years.

Much like the printing press, I'm 100% certain that computing (and the internet specifically) is altering human written language across the world.

It is just so much easier to avoid anything outside ASCII because you can be certain ASCII will always work, even through some awful MS Access -> CSV -> SQL -> SQL -> Excel -> SQL -> COBOL ETL pipeline, no matter what version of any software is being used.

Technology has always shaped written language and we should fight to do better but at this point it seems inevitable.

(To be clear: I'm not saying this is a good or desirable state of affairs)

I really doubt that there is any transliteration into ASCII that's acceptable to European speakers.

Eg A Ą Å Æ Ä are all different letters in most languages, not simply a pronunciation guide. I think most European languages use at least two from that list.
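To make that concrete, here is a Python sketch of the usual "strip the accents" approach, assuming the common NFKD-then-drop-non-ASCII trick (the function name is hypothetical):

```python
import unicodedata

def fold_to_ascii(text: str) -> str:
    """Decompose characters, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print([fold_to_ascii(letter) for letter in ["A", "Ą", "Å", "Æ", "Ä"]])
# ['A', 'A', 'A', '', 'A']
```

Four distinct letters collapse into a plain A, and Æ disappears entirely because it has no decomposition, so the result is neither reversible nor correct for a speaker of those languages.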

I wonder what happens if you book an airline ticket as "Miche" instead of "Miché". Will customs/CIA choke on the mismatched name?

I feel your pain. My wife has a French first name with an acute accent over the "e" as well. Many forms will outright reject the accent, so I've taken to just not typing it most of the time. Being American, it doesn't really bother her, though. I know in some languages missing accents and other similar markings changes the pronunciation and meaning of words and can be very irritating to some.

> accents and other similar markings changes the pronunciation and meaning of words

Yes. In some languages those are actually not "markings" but denote proper letters, like in German ä,ö,ü and ß. But even if not, like in French, it can alter the meaning of words. E.g la != là. Therefore, for most Europeans and speakers of other languages that depend on more letters than ASCII provides, it is very annoying when that is not supported properly.

However, my experience in a few cases has been that Americans in particular have a hard time understanding this. The remark about your wife not caring seems to be in this vein, too. Recently, I decided to convert our MySQL DB tables from latin1 to UTF8. (I wasn't even aware that we didn't have some form of Unicode, as our DB is only a few years old, and I thought some form of Unicode was the default everywhere nowadays. But then MySQL...)

Anyway, my CEO (also an American incidentally) was trying to keep me from it because he thought it's not high priority. However, we're about to go live in a French-speaking region, but which also has other indigenous languages (and therefore names), with their own "special" characters (I put "special" in quotes because for those languages, they're not "special" at all -- but I guess you get my gist by now).

Also, in previous jobs I have converted legacy systems to unicode and know what a pain it is down the road. Not to mention all the hard-to-find bugs if you don't do it, because some strings don't compare as they should, or people are just annoyed because their name is not shown correctly.

So I went ahead with the conversion anyway. We may never know for sure, but I'm convinced that I saved us some major customer frustrations, days of bug hunting and weeks of converting everything later, when existing data would need to be migrated.

So please everyone, just use UTF8 or some other unicode variant from the get-go. The few bits you might save otherwise are just not worth it.
