
> Actually that's the behavior of python 2, it works fine, until you send invalid characters then it blows up.

> In python 3 it always blows up when you mix bytes with text so you can catch the issue early on.

Sometimes you don't care about weird characters being printed as weird things. In Python 2 it works fine: you receive garbage, you pass garbage. In Python 3 it shuts down your application with a traceback.

Dealing with this was one of my first Python experiences, and it was very frustrating, because I realized that simply using #!/usr/bin/python2 would solve my problem, but people wanted Python 3 just because it was fancier. So we played a lot of whack-a-mole to make it not explode regardless of the input. And the documentation was particularly horrible regarding that; not even the experienced pythoners knew how to deal with it properly.

Those issues are common when you have Python 2 code that uses the unicode datatype and you're tasked with migrating it to Python 3.

You run your Python 2 code on Python 3 and it fails. Most people at that point will place an encode() or decode() wherever the failure happens, when the correct fix would be to place encode/decode at the I/O boundary (writing to files, network traffic, etc.; in Python 3 even files don't need it if you open them in text mode).

Ironically, Python 2 code that doesn't use unicode is easier to port.

When you program in Python 3 from the start, it's very rare to need to encode/decode strings. You only do that when working at the I/O level.

> And the documentation was particularly horrible regarding that, not even the experienced pythoners knew how to deal with it properly.

Because it's not really Python-specific knowledge. It's really about understanding what Unicode is, what bytes are, and when to use each.

The general practice is to keep everything you work with as text, and do the conversion only when doing I/O. You should think of unicode/text as a representation of text, the way you think of a picture or a sound. Just like image and audio, text can be encoded as bytes. Once it is bytes, it can be transmitted over the network, written to a file, etc. When you read the data back, you need to decode it to text again.
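A minimal sketch of that rule in Python 3 (the string here is just a made-up example): text lives inside the program as str, and bytes appear only at the boundary.

```python
# Text is the internal representation; bytes exist only at the boundary.
text = "naïve café"

# Encode only when the data leaves the program (file, socket, pipe)...
payload = text.encode("utf-8")
assert isinstance(payload, bytes)

# ...and decode as soon as it comes back in.
roundtripped = payload.decode("utf-8")
assert roundtripped == text
```

Everything between those two calls manipulates str only, so no other part of the program needs to know which encoding is in use.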

This is what Python 3 is doing:

- by default every string is of type str, which is unicode

- bytes are meant for binary data

- you can open files in text or binary mode; if you open in text mode, the encoding happens for you

- for socket communication, you need to convert strings to bytes and back yourself
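The first three points above can be shown in a few lines (the file name is just a scratch file for illustration):

```python
import os
import tempfile

s = "héllo"                                   # default string type: str, i.e. unicode
assert isinstance(s, str)
assert isinstance(s.encode("utf-8"), bytes)   # bytes are for binary data

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "demo.txt")
    # Text mode: the file object encodes/decodes for you.
    with open(path, "w", encoding="utf-8") as f:
        f.write(s)
    # Binary mode: you get raw bytes and must decode yourself.
    with open(path, "rb") as f:
        raw = f.read()
    assert raw == s.encode("utf-8")
    assert raw.decode("utf-8") == s
```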

Python 2 is a tire fire in this area:

- text is bytes

- text can also be unicode (so two ways to represent the same thing)

- binary data can also be text

- I/O accepts both text and bytes, with no conversion happening

- a lot of (most? all?) stdlib actually expects str/bytes as input and output

- the cherry on top is that Python 2 also implicitly converts between unicode and str, so you can do crazy things like my_string.encode().encode() or my_string.decode()

So now you get Python 2 code where someone wanted to be correct (which is actually quite hard, mainly because of the implicit conversion), so the existing code has plenty of encode() and decode() calls, because some functions expect str and some expect unicode.

At different points in the code you might then have either bytes or unicode standing in for a string.

Now you take such code and try to move it to Python 3, which no longer has implicit conversion and throws an error when it expects text and gets bytes, or vice versa. str is now unicode, the unicode type no longer exists, and bytes is no longer the same thing as str. So your code blows up.
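A sketch of that failure mode in Python 3, and the boundary-level fix (the strings are made up for illustration):

```python
# Python 3 refuses to mix text and bytes where Python 2 would have
# silently converted via ASCII.
try:
    "answer: " + b"42"
except TypeError:
    print("str + bytes raises TypeError")

# The proper fix is one explicit decode at the boundary, not scattering
# encode()/decode() wherever the traceback happens to point.
assert "answer: " + b"42".decode("ascii") == "answer: 42"
```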

Most people see an error and add encode() or decode(), often just trying which one works (like the ones you were removing), when the proper fix would actually be removing encode()s and decode()s in other places in the code.

It's quite a difficult task when your code base is big, which is why Guido put a lot of effort into type annotations and mypy; helping with exactly these issues was supposed to be one of their benefits.

The worst part about Unicode in Python 2 isn't even that everything defaults to bytes. It's that the language will "helpfully" implicitly convert between bytes and str, using a default encoding that makes no sense in practically any context - it's not even the locale encoding. It's ASCII!

Native English speakers are usually the ones blissfully unaware of it, because ASCII just happens to cover all their usual inputs. But as soon as you have so much as an umlaut, surprise! And there are plenty of ways to end up with a unicode string floating around even in Python 2 - JSON, for example. Then it ends up in some place like a + b, and you get an implicit conversion.

I've been struggling with this recently when trying to print stdout from subprocess.communicate with code that runs on both 2 and 3. Such a headache - got any recommended reading around this area?

I don't think this is exactly what you're asking, but a good starting point:


With 2-vs-3 code, the easiest approach is to write your code for Python 3 and then, on 2, import everything you can from the __future__ package, including unicode_literals. That's still not enough, and you might still need to do extra work. Python 3 has an encoding argument that can do the conversion for you, but it doesn't appear to be available in Python 2. So you probably shouldn't use it, and should instead treat all input/output as bytes (i.e. call encode() when sending data to stdin, and decode() on what you get back from stdout and stderr).
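A sketch of that bytes-everywhere approach for subprocess (the child command is just a stand-in): with no text/encoding options set, communicate() returns bytes on both 2 and 3, so the same code runs unchanged.

```python
import subprocess
import sys

# Leave the pipes in binary mode; decode once, at the edge.
proc = subprocess.Popen(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,
)
out, _ = proc.communicate()          # bytes on both Python 2 and 3
assert isinstance(out, bytes)
print(out.decode("utf-8").strip())   # decode explicitly at the boundary
```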

Perhaps that might be enough for your case, although many things are hard to get right in Python 2 even when you know what you should do, because of the implicit conversion.

Edit: this also might be useful: https://unicodebook.readthedocs.io/good_practices.html

Also this could help: https://unicodebook.readthedocs.io/programming_languages.htm...
