What kind of text do you have to process at your job, that you never meet any unicode in it? Nowadays unicode is everywhere, especially with emojis. Even a simple IRC bot needs to handle that.

A lot of scientific/numeric work (up until quite recently, it's slowly, slowly changing) involves text processing of inputs and outputs of other programs, using Python as the glue language.

This is a lot of old code, and it's all ASCII, no matter what the locale of the system is. And even if the code was updated, all the messages would still be in some text == bytes encoding, because there's no "user data" involved, and the throughput desired is in many gigabytes of text processed per second.

So yeah, unicode is not "everywhere": it may be everywhere on the public internet, but there is a world beyond this.

I deal with file formats that, like plain text files and zip, do not specify an encoding and use different encodings depending on where they come from. I think the generic approach is to guess, which means trying encodings until one successfully converts the unknown input to garbage unicode, resulting in output that is both wrong and different from the original input. Most of the time I can just treat the text contents as a byte array, with a few exceptions that are specified to use ASCII-compatible names.
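To make the failure mode concrete, here is a minimal sketch of such a guessing loop. The function name `guess_decode` and the candidate order are made up for illustration, not taken from any particular library. The first codec that does not raise "wins", even when it is the wrong one:

```python
def guess_decode(raw, candidates=("ascii", "latin-1", "utf-8")):
    """Return (text, encoding) for the first candidate that decodes without error.

    Hypothetical helper: the candidate list and its order are assumptions.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")


original = "café".encode("utf-8")      # the input really is UTF-8
text, used = guess_decode(original)    # ascii fails, latin-1 "succeeds"
roundtrip = text.encode("utf-8")       # writing the result back out as UTF-8

print(used, repr(text), roundtrip != original)
```

latin-1 accepts every possible byte, so it never fails, and the two bytes of "é" come out as two separate characters. The re-encoded output is longer than the input and no longer matches it.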

So you can throw in your emoji and they might not show up correctly in the XML logging metadata I write, because I don't care. But they will end up in the processed file the same way they came in, instead of <?> or some random Chinese or Japanese symbol that the guessing algorithm thought appropriate.

In that case you should be opening files in binary mode ("b"); then you will be operating on bytes.
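A rough sketch of what that looks like in practice, assuming a made-up payload with an ASCII field name followed by bytes in no particular encoding (the file names and the `name=old` field are invented for the example):

```python
import os
import tempfile

# Hypothetical input: an ASCII field, a UTF-8 emoji, and two bytes that are not valid UTF-8.
payload = b"name=old\n" + "\U0001F600".encode("utf-8") + b"\xff\xfe\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.dat")
    dst = os.path.join(tmp, "out.dat")

    with open(src, "wb") as f:
        f.write(payload)

    # "rb" returns bytes objects; nothing is decoded, so nothing can fail or be mangled.
    with open(src, "rb") as f:
        data = f.read()

    # Only the ASCII-compatible part is touched, using bytes literals.
    edited = data.replace(b"name=old", b"name=new")

    with open(dst, "wb") as f:
        f.write(edited)

    with open(dst, "rb") as f:
        result = f.read()

print(result[9:] == payload[9:])
```

Everything after the edited field comes out byte-for-byte identical, including the bytes that no text codec would round-trip cleanly.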

Also, there's no guessing happening in this instance. The locale configured in your environment is used if you open files in text mode.
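You can check this on your own machine. The sketch below assumes CPython 3, where `locale.getpreferredencoding(False)` reports the encoding that text mode falls back to when no `encoding=` argument is given; the file name is arbitrary:

```python
import codecs
import locale
import os
import tempfile

# What the environment (locale settings, or UTF-8 mode if enabled) says the default is.
default_name = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    with open(path, "w") as f:                    # text mode, no encoding given
        implicit_name = f.encoding
        f.write("plain ascii\n")

    with open(path, "r", encoding="utf-8") as f:  # explicit encoding overrides the locale
        explicit_name = f.encoding
        content = f.read()

# Compare through codecs.lookup so spelling differences like "UTF-8" vs "utf-8" don't matter.
same_codec = codecs.lookup(implicit_name).name == codecs.lookup(default_name).name

print(default_name, implicit_name, explicit_name, same_codec)
```

So the encoding is picked deterministically from the environment, not from the file contents. Whether that choice matches what is actually in the file is a separate question, which is why passing `encoding=` explicitly (or using binary mode) is the safer habit.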