

Porting to Python 3 Redux - dous
http://lucumr.pocoo.org/2013/5/21/porting-to-python-3-redux/

======
aidos
It's great to see Armin's continued work on this. Things have obviously
progressed since he wrote his post on the subject [0] (over a year ago). He's
written so much of the code I rely on daily that I was concerned that he'd
lose interest in Python during the 2-3 transition (I know it's not completely
on him to support it but he is a key contributor to Python's use on the web).

[0] <http://lucumr.pocoo.org/2011/12/7/thoughts-on-python3/>

------
kmike84
Great writeup (and a cool metaclass workaround!)

However, I think that "Drop 2.5, 3.1 and 3.2" advice is bad - dropping 2.5 and
3.1 is the way to go (hey, drop 2.5 even if you're not porting to 3.x), but
dropping 3.2 is not necessary in most cases.

In my experience (porting and maintaining 20 open-source packages that work
with Python 2 and Python 3 using a single codebase, including NLTK) Python 3.2
has never been a problem - I don't see how NLTK code and code of my other
packages could be improved by dropping Python 3.2 compatibility.

The main argument for dropping Python 3.2 support seems to be that u'strings'
are not supported in Python 3.2. There are 3 "types" of strings in Python:

* b"bytes",

* "native strings" # bytes in 2.x and unicode in 3.x

* u"unicode"

By adding a `from __future__ import unicode_literals` line to the top of the
file, code compatible with 2.6-3.2 could be written like this:

* b"bytes"

* str("native string") # bytes in 2.x and unicode in 3.x

* "unicode"

In my opinion this is not a hack (unlike six.b and six.u necessary for 2.5
support), and this is arguably closer to Python 3.x semantics (unicode strings
are default). So IMHO while using u"unicode" feature from Python 3.3 makes
porting somewhat easier (less stupid search-replace), it also could make code
worse and more cluttered, and Python 3.2 - compatible syntax is just fine.
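
A minimal sketch of the three literal styles described above, assuming the future import is the first statement in the file. The variable names are only illustrative:

```python
from __future__ import unicode_literals  # no effect on 3.x; bare literals become unicode on 2.6+

raw = b"bytes"          # bytes on 2.x and 3.x
native = str("native")  # bytes on 2.x, unicode on 3.x
text = "unicode"        # unicode on both

# type("") is `unicode` on 2.x (because of the future import) and `str` on 3.x
text_type = type("")

assert isinstance(raw, bytes)
assert isinstance(native, str)
assert isinstance(text, text_type)
```

The same file parses on 2.6, 2.7 and 3.2, since none of the literals needs the `u` prefix.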

It is true that 3.3 brings other improvements (Armin mentioned binary codecs),
but it is quite rare that the library actually needs them (even libraries as
big as NLTK and Django are fine with 3.2 stdlib).

3.2 is the default 3.x Python in the current Ubuntu LTS (EOL in 2017) and the default
3.x Python in the recently released Debian Wheezy; 3.2 will be around for a
long time, and not supporting it will hurt. So if you're doing Python 3.x
porting, please just fix those stupid u'strings' with the unicode_literals future
import - your code will be more idiomatic and also 3.2 compatible.

There is also advice in the article to encode __repr__ and __str__ results to
utf8 under Python 2.x; this is fine (other approaches are not
better), but it has some non-obvious consequences (like breaking the REPL in some
setups) that developers should be aware of; see
<http://kmike.ru/python-with-strings-attached/>
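
For reference, a sketch of what such a utf8-encoding `__str__` can look like, written as a class decorator in the spirit of the article. The `City` class is a made-up example, and the details here are an assumption rather than a copy of anyone's production code:

```python
import sys

PY2 = sys.version_info[0] == 2


def implements_to_string(cls):
    """Class decorator: define __str__ once, returning text."""
    if PY2:
        # On 2.x, expose the text version as __unicode__ and make
        # __str__ return UTF-8 encoded bytes instead.
        cls.__unicode__ = cls.__str__
        cls.__str__ = lambda self: self.__unicode__().encode("utf-8")
    return cls


@implements_to_string
class City(object):  # hypothetical example class
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name
```

On 3.x the decorator does nothing. On 2.x `str(obj)` yields utf8 bytes, which is where the REPL issues on non-utf8 terminals come from.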

For lower-level 2.x-3.x compatible C/C++ extensions Cython is great. In fact,
many libraries (e.g. lxml) are compatible with Python 3.x because they are
written in Cython which generates compatible code (modulo library changes) by
default.

~~~
the_mitsuhiko
> Python 3.2 has never been a problem

It's not a problem if you are willing to litter your code with calls or
upgrade a ton of code in 2.x to unicode accidentally. There are just too many
cases in 2.x where that is a terrible idea and introduces subtle bugs. I very
strongly recommend against `from __future__ import unicode_literals`. If
anything go with six.

In regards to supporting 3.2: I don't think anyone cares. The number of people
currently using Python 3 is pretty low and a lot of libraries are already
dropping 3.2 support. Requests, MarkupSafe, Jinja2 now all dropped 3.2 support
and with that a lot of stuff that pulls in dependencies to those will now also
depend on 3.3.

I still think people should stick to 2.7 for at least another one, two years
and at that point a lot will have changed.

//EDIT: wrt __str__ returning utf-8 data: __str__'s encoding is undefined but
usually accepted to be > ASCII. Django and Jinja2 for instance returned utf-8
there for years and it did not cause any problems.

~~~
kmike84
In case of NLTK unicode_literals ("unicode by default") fixed a lot of bugs
and made other bugs visible, so mileage may vary :)

Could you give an example of cases where unicode_literals is a terrible idea?

3.2 is important for newcomer experience IMHO; it is very common for people
starting with Python to use 3.x version and wonder why the code doesn't work.
It's a pity high-profile packages are dropping 3.2 support, I wasn't aware
Requests and Jinja2 dropped it.

utf8 __str__ definitely caused issues for Django (e.g. `print mymodel`
sometimes fails in REPL in Windows with Russian locale); people using REPL in
Windows are too used to such errors so they don't complain and blame Windows
for this, but that doesn't mean there is no issue.

~~~
the_mitsuhiko
So will latin1 `__str__` on Russian locales. So will ASCII `__str__` on any
locale that is not ASCII compatible. You can't expect the impossible.

As for cases where unicode_literals is a terrible idea: any piece of code
that does not expect a unicode string and then suddenly gets one.
Because unicode coercion in 2.x spreads like a cancer you might not see the
failure until someone uses your API. I still have to fix bugs where people
accidentally send things coerced to unicode to an API that does not support
it.
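
A minimal sketch of guarding such an API boundary, assuming a hypothetical helper named `to_native` (it is not part of six or of any library mentioned in this thread):

```python
import sys

PY2 = sys.version_info[0] == 2


def to_native(value, encoding="utf-8"):
    """Return `value` as the interpreter's native str type.

    Assumes `value` is either bytes or text; other types are not handled.
    """
    if isinstance(value, str):
        return value
    if PY2:
        # On 2.x a non-str value here is assumed to be unicode text.
        return value.encode(encoding)
    # On 3.x a non-str value here is assumed to be bytes.
    return value.decode(encoding)
```

Calling this at the edge of an API that only accepts native strings makes the coercion explicit instead of leaving it to 2.x's implicit ASCII conversion.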

Additionally: newcomers still should not be using Python 3. There are just too
many remaining issues that are annoying to deal with.

~~~
kmike84
Are there non-ascii compatible encodings that are default in any OSes? With
ascii-incompatible system/terminal encoding a lot of software will stop
working. Strange things happen, but this looks like a theoretical issue, and
ascii looks safe. In Python 2.x the __str__ of every standard container type is
ASCII-only (even if the elements have a non-ascii __str__), and the __repr__ of standard
objects is also ASCII-only as far as I can tell. ASCII-only is an option, and
it is not uncommon and relatively safe (but it has its own issues of course).

It was exactly this unicode_literals property (turning everything into
unicode) that helped to reveal bugs :) For example, models were trained on
bytestrings under Python 2.x, and nobody remembers what the encoding of
the training text was. This went unnoticed for several years because
instead of raising an exception functions just handled some edge cases (e.g.
unicode punctuation) in a suboptimal way. This leads to almost correct
results, but with less accuracy/precision/recall. After changing to "unicode
everywhere" the issue became visible.
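
A small illustration of that kind of silent edge case, written for Python 3. This is not NLTK code; the function names are invented for the example:

```python
import string
import unicodedata


def strip_ascii_punctuation(data):
    """Byte-oriented cleanup: only knows the ASCII punctuation characters."""
    ascii_punct = string.punctuation.encode("ascii")
    return data.translate(None, ascii_punct)


def strip_punctuation(text):
    """Text-oriented cleanup: drops every code point in a Unicode P* category."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


quoted = "\u201chello\u201d, world"  # curly quotes around "hello"
raw = quoted.encode("utf-8")

bytes_result = strip_ascii_punctuation(raw)  # comma removed, curly quotes kept
text_result = strip_punctuation(quoted)      # comma and curly quotes removed
```

Neither version raises an exception, which is why the byte-oriented one can go unnoticed: it simply leaves the curly quotes glued to the token.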

The issue was not the cancer-like spread of unicode through the text; the issue was
the code that works with text and doesn't support unicode. The Python 2.x standard
library has such APIs, and this causes trouble, but I don't see how it is a
bug in the code that works with text and returns unicode.

What I'm writing are common words and a standard "unicode mantra", but
anyways.

We could say "programmers should just handle encodings properly, and
unicode_literals have nothing to do with this", but this doesn't always work.
"Unicode everywhere" makes some code changes necessary, but some of these
changes reveal real bugs.

Another story: I took 2 different courses from 2 different top-notch
universities at coursera.org where instructors gave us starter code (written
in Python 2.x) for programming assignments. The code was not bad, but there
were many cases of incorrect encoding handling in most of the provided files
(errors that would be impossible in Python 3.x) - and this was the code that
was supposed to teach students something (including Python programming).

What I like about unicode_literals is that it makes things more consistent and
easier (at least for me) to reason about: if variable is unicode under 3.x, it
is unicode in 2.x, the same applies to bytestrings. In cases where different
behaviour is necessary (e.g. because of non-unicode API in 2.x stdlib),
explicit str("foo") is used; otherwise code is written in Python 3.x and works
with the same semantics under Python 2.x.

Just curious, what newcomer issues are you talking about, and who do you mean
by "newcomers"?

------
wallunit
> If you have a C module written on top of the Python C API: shoot yourself.
> There is no tooling available for that yet from what I know and so much
> stuff changed.

I don't agree with that one. I added Python 3 support to the Python bindings
for libssh2 two years ago and it was straightforward. First of all it
is still C and therefore you don't have to care about the syntax changes in
Python. Just add some #if PY_MAJOR_VERSION < 3 for API calls that have
changed, or even better wrap that code in macros. You probably already have
some backwards compatibility switches/macros like that in your
code anyway, if you support multiple versions of 2.x. So adding some more for
Python 3 isn't that big a deal.

At least at that time, when six and modernizer weren't available, it was way
easier and more straightforward to support Python 2 and 3 with the same codebase
in extension modules than in actual Python code. And it seems that if you
don't want to drop support for Python <= 2.5 or <= 3.2, it still is.

~~~
the_mitsuhiko
That again depends on how much you do with strings and integers and how many
modules you construct. PyInt is gone, PyUnicode is now PyStr, module
construction uses a vastly different system and on 3.x you want to support the
stable ABI which looks a bit different.

~~~
wallunit
Except for the module creation you can easily add some very simple
compatibility macros. I don't see how that would be different from your
_compat module. However, module creation can't be abstracted into a uniform
macro, because it requires defining a PyModuleDef struct and
the module's init function got a return value in Python 3. But I'm fine with
using some #if PY_MAJOR_VERSION >= 3 here.

After all, you have to deal with far fewer compatibility issues in extension
modules than in actual Python code. And if needed you can always do a simple
version switch. You don't have to care about changes in the syntax of Python.
You also don't have to care about changes to the __*__ magic methods, because
you don't call them directly, and when defining classes you use slots for
stuff like that.

~~~
the_mitsuhiko
Fair enough. As I said you can probably get around with some macros. To the
best of my knowledge no such thing currently exists and what markupsafe does
is not particularly nice.

------
mynegation
I also found that maintaining a single code base for both 2 and 3 is the only
sane way. Running 2to3 during build is just too intrusive.

I liked the approach of 'six', but it is not shipped as a system module.
Having something like that as a default system module in Python 2.6, 2.7, and
3.x would go a long way towards adoption of Python 3.

I found that I end up either using six or implementing some subset of it if I
do not want to introduce the dependency.
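
A rough sketch of what such a hand-rolled subset might look like. The names mirror ones six exposes (`text_type`, `string_types`, `integer_types`, `iteritems`), but this is an illustrative module, not six's actual source:

```python
import sys

PY2 = sys.version_info[0] == 2

if PY2:
    # These names only exist on 2.x; this branch never runs on 3.x.
    text_type = unicode  # noqa: F821
    string_types = (str, unicode)  # noqa: F821
    integer_types = (int, long)  # noqa: F821

    def iteritems(d):
        return d.iteritems()
else:
    text_type = str
    string_types = (str,)
    integer_types = (int,)

    def iteritems(d):
        return iter(d.items())
```

Application code then imports these names from the compat module instead of branching on the Python version at every call site.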

------
bdarnell
This is a good writeup. On Tornado I went through a similar transition from
2to3 to a single codebase. As long as you can drop Python 2.5 support you can
probably avoid 2to3, but if you do need it I wrote some tools to make it less
painful:
<http://bdarnell.github.io/blog/2012/03/13/cross-python-development-with-auto2to3/>

------
hoodoof
So Armin what did not come across from your blog post is how you _feel_ about
Python 3.

Until now you have been seen as one of the people holding out strongly against
it. Where are you at now? Are you going to move to Python 3?

What's the future for Jinja2 now? Are you enthused about maintaining it? If
it's headed for the deadpool please let us know as we can move to technologies
that are going to have a future.

What's the future for Flask?

What are your current thoughts on Python 3?

