> Actually that's the behavior of Python 2: it works fine until you send invalid characters, then it blows up.

Not always. As far as I can tell, writing garbage bytes to various APIs works fine unless they explicitly try to handle encoding issues. The first time I noticed encoding issues in my own code was when writing an XML structure failed on Windows, all because of an umlaut in an error message I couldn't care less about. The solution was to simply strip any non-ASCII character from the string. Not a nice or clean solution, but the issue wasn't worth more effort.
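Roughly something like this (a minimal Python 3 sketch, not the actual code):

    def to_ascii_only(text: str) -> str:
        # 'ignore' silently drops anything that can't be encoded as ASCII,
        # e.g. the umlaut in an error message you don't care about.
        return text.encode("ascii", errors="ignore").decode("ascii")

    print(to_ascii_only("Datei geöffnet"))   # -> "Datei geffnet"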

> In Python 3 it always blows up when you mix bytes with text, so you can catch the issue early on.

That is nice if your job involves dealing with Unicode issues. Mine doesn't; any time I have to deal with it anyway is time wasted.




So you don't have to deal with it until user data includes _any non-ASCII character_ (including emoji, weird spaces copied from elsewhere, or loan words like café).

"Dealing with unicode" is really just about dealing with it at the input/output boundaries (and even then libraries handle it most of the time). But without the clear delineation that Python 3 provides, when you _do_ hit some issue you probably insert a "fix" in the wrong space. Leading to the classic Py2 "I just call decode 1000 times on the same string because I've lost track"


> So you don't have to deal with it until user data includes _any non-ASCII character_ (including emoji, weird spaces copied from elsewhere, or loan words like café).

The interesting text follows company-set naming schemes, which means all English and ASCII. The rest could be random bytes for all I care. Many formats, like plain text or zip, don't have a fixed encoding, and I am not going to start guessing which one it is for every file I have to read; there is no way to do that correctly. Dealing with that mess is exactly what I want to avoid.


What kind of text do you have to process at your job that you never encounter any Unicode in it? Nowadays Unicode is everywhere, especially with emoji. Even a simple IRC bot needs to handle that.


A lot of scientific/numeric work (up until quite recently; it's slowly, slowly changing) involves text processing of the inputs and outputs of other programs, using Python as the glue language.

This is a lot of old code, and it's all ASCII, no matter what the locale of the system is. And even if the code were updated, all the messages would still be in some text == bytes encoding, because there's no "user data" involved, and the desired throughput is many gigabytes of text processed per second.

So yeah, Unicode is not "everywhere": it may be everywhere on the public internet, but there is a world beyond that.
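For illustration, a rough sketch of that glue style, with a made-up solver and output format: parse another program's ASCII output without ever decoding it.

    import subprocess

    # Hypothetical solver printing lines like b"step 42 energy -1.234e+03"
    proc = subprocess.Popen(["./solver", "input.dat"], stdout=subprocess.PIPE)

    total = 0.0
    for line in proc.stdout:                  # each line is bytes, never str
        if line.startswith(b"step "):
            total += float(line.split()[3])   # float() accepts ASCII bytes
    print(total)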


I deal with file formats that, like plain text files and zip, do not specify an encoding and have different encodings depending on where they come from. I think the generic approach is to guess, which means trying encodings until one "successfully" converts unknown input to garbage Unicode, resulting in output that is both wrong and different from the original input. Most of the time I can just treat the text contents as a byte array, with a few exceptions that are specified to use ASCII-compatible names.

So you can throw in your emoji and they might not show up correctly in the XML logging metadata I write, because I don't care. But they will end up in the processed file the same way they came in, instead of as <?> or some random Chinese or Japanese symbol that the guessing algorithm thought appropriate.
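Roughly like this (a sketch with an invented fixed-size header; only the field the spec says is ASCII ever gets decoded):

    def process(in_path, out_path):
        with open(in_path, "rb") as f:
            header = f.read(64)     # fixed-size header, spec'd as ASCII
            payload = f.read()      # arbitrary bytes, unknown encoding

        name = header[:32].rstrip(b"\x00").decode("ascii")
        print("processing", name)   # logging metadata stays ASCII

        with open(out_path, "wb") as f:
            f.write(header)
            f.write(payload)        # emoji or not, bytes go out as they came in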


In that case you should be opening files in binary mode ("b"); then you will be operating on bytes.

Also, there's no guessing happening in this instance. The locale configured in your environment variables is used if you open files in text mode.
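A small sketch of the difference (the file name is just a placeholder):

    import locale

    print(locale.getpreferredencoding())    # what text mode uses by default

    with open("data.txt", "rb") as f:       # binary mode: raw bytes, no decoding
        raw = f.read()

    with open("data.txt", encoding="utf-8") as f:   # text mode, explicit encoding
        text = f.read()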



