

Differences between Python 2.7.x and Python 3.x, with examples - rasbt
http://sebastianraschka.com/Articles/2014_python_2_3_key_diff.html

======
ak217
"Now, in Python 3, we finally have Unicode (utf-8) strings, and 2 byte
classes: byte and bytearrays."

No, they are Unicode strings. utf-8 is an encoding, and only comes into play
when you want to encode strings into bytes for sending them somewhere, or
decode them from bytes when receiving them. The interpreter's internal
representation of the string is either UCS-4 or an automatically selected
encoding
([http://legacy.python.org/dev/peps/pep-0393/](http://legacy.python.org/dev/peps/pep-0393/)),
but that is an irrelevant implementation detail. Conceptually, the strings are
sequences of Unicode characters, and it helps to think of them that way.
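
A small sketch of that distinction, assuming Python 3 (the sample text and variable names are only for illustration): the string has one length in characters, while its size in bytes depends on which encoding you pick when you convert it.

```python
# A str is a sequence of Unicode code points; bytes only appear once you encode.
text = "na\u00efve caf\u00e9"  # "naive cafe" with two accented letters

# len() counts characters (code points), not bytes.
char_count = len(text)

# Encoding produces bytes; the byte count depends on the encoding chosen.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# Decoding with the same encoding round-trips back to the identical str.
assert utf8_bytes.decode("utf-8") == text
assert utf16_bytes.decode("utf-16-le") == text

print(char_count, len(utf8_bytes), len(utf16_bytes))  # 10 12 20
```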

Here are the really important facts about Unicode handling differences between
Python 2 and 3 (aside from the obvious str/unicode -> bytes/str move):

\- There is no silent Unicode coercion in Python 3. Unlike Python 2, your
bytes objects won't be decoded for you to str just because you happened to
concatenate them with a Unicode string. Your Unicode strings won't be encoded
silently if you write them to a byte stream (instead, Python 3 will fail with
the cryptic error "TypeError: 'str' does not support the buffer interface").

\- The default encoding in Python 3 is utf-8, instead of the insane ascii
default in Python 2.

\- All text I/O methods by default return decoded strings, except if you open
a stream in binary mode (e.g. open(filename, "rb")), which now actually means
what you'd expect. See the documentation for the io module
([https://docs.python.org/3.4/library/io.html](https://docs.python.org/3.4/library/io.html))
for more information. (You can use the io module in Python 2.7 to write code
that is more forward-compatible with Python 3.)

\- The above I/O semantics includes sys.argv, os.environ, and the standard
streams (sys.stdin/stdout/stderr). The fact that all of these behave
differently between Python 2 and 3 with respect to text encoding makes for a
lot of fun hair pulling when trying to write code compatible with both.
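
The "no silent coercion" point above can be illustrated with a short sketch, assuming Python 3. The variable names are made up for the example, and an in-memory io.BytesIO stands in for any binary stream.

```python
import io

text = "caf\u00e9"       # str: Unicode text
data = b"caf\xc3\xa9"    # bytes: the same text encoded as UTF-8

# Mixing the two types raises TypeError instead of silently coercing.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Writing str to a binary stream fails as well; encode explicitly first.
binary_stream = io.BytesIO()
try:
    binary_stream.write(text)
    wrote_str = True
except TypeError:
    wrote_str = False
binary_stream.write(text.encode("utf-8"))

# Explicit conversion is the only way across the str/bytes boundary.
assert data.decode("utf-8") == text
assert binary_stream.getvalue() == data
```

The exact wording of the TypeError message varies between Python 3 releases, so the sketch only checks the exception type.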

I have built a small library of helper functions to help deal with this stuff
in a sane way:
[https://github.com/kislyuk/eight](https://github.com/kislyuk/eight). Another
project that I can recommend that tries to lessen the pain of writing code
that's compatible with both 2 and 3 is python-future:
[https://github.com/PythonCharmers/python-future](https://github.com/PythonCharmers/python-future).

~~~
rdtsc
> No, they are Unicode strings. utf-8 is an encoding, and only comes into play
> when you want to encode strings into bytes for sending them somewhere,

Isn't sending data "somewhere" pretty common? Unless this is middleware in
the Python ecosystem, data is going to go to a logger, database, console, web
page, or file. Am I misunderstanding it? It seems you are dismissing it as
something one doesn't need to worry about, because it is not done much...

> The above I/O semantics includes sys.argv, os.environ, and the standard
> streams (sys.stdin/stdout/stderr).

Completely ignorant about it, but how would Python 3 know when reading from
stdin what the encoding is? Or what about when reading sys.argv?

~~~
ak217
> Isn't sending data "somewhere" pretty common? Unless this is middleware in
> the Python ecosystem, data is going to go to a logger, database, console, web
> page, or file. Am I misunderstanding it? It seems you are dismissing it as
> something one doesn't need to worry about, because it is not done much...

The point is not to dismiss the encoding. The point is that the OP is
confusing the character representation (Unicode) and the I/O encoding (utf-8).
Yes, sending data somewhere is common, and when you do that, you are taking
your Unicode string and encoding it using whatever encoding is appropriate
(usually utf-8).

As for whether you should worry about what the encoding is, modern systems
(including Python 3, but not Python 2!) use utf-8 everywhere by default, and
save you the headache of specifying the encoding or passing it around. One
important exception is a Linux process which hasn't had the locale variables
set (normally this is done by the PAM environment module, but in a number of
situations all environment variables might be stripped from the process,
leaving it with what's known as a "POSIX C" locale, which is kind of a broken
anachronism). Generally that leaves the system (not just Python) open to all
kinds of brokenness, so keep the locale set by not stripping LANG and LC_ALL
from your environment.

> Completely ignorant about it, but how would Python 3 know when reading from
> stdin what the encoding is? Or what about when reading sys.argv?

Great question! Python guesses the system encoding using logic that looks at
the locale environment variables (most commonly LANG and LC_ALL) defined by
POSIX (or on Windows, detects the console or uses ANSI). For more information,
take a look at
[https://docs.python.org/3.4/library/locale.html#locale.getde...](https://docs.python.org/3.4/library/locale.html#locale.getdefaultlocale)
and
[https://docs.python.org/3.4/library/sys.html#sys.stdin](https://docs.python.org/3.4/library/sys.html#sys.stdin).
It's also possible to override the I/O encoding with the PYTHONIOENCODING
environment variable.

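The output of locale.getpreferredencoding() is used for environment variables,
command line arguments, and I/O streams. More generally, you can inspect the
encoding of any text stream through its .encoding attribute (e.g.
sys.stdin.encoding). The attribute is read-only, so to change the encoding you
re-wrap the underlying binary stream (e.g. sys.stdin.buffer) in a new
io.TextIOWrapper.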
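
To see what Python picked on a given machine, and to choose an encoding explicitly, something like the following could work. It is a sketch assuming Python 3; wrap_as_text is a hypothetical helper name, and an in-memory buffer is used so the result does not depend on the terminal.

```python
import io
import locale
import sys

# What Python inferred from the environment (values vary by machine).
print("preferred encoding:", locale.getpreferredencoding(False))
print("filesystem encoding:", sys.getfilesystemencoding())
print("stdout encoding:", getattr(sys.stdout, "encoding", None))


def wrap_as_text(binary_stream, encoding="utf-8"):
    """Return a text layer over a binary stream with an explicit encoding."""
    return io.TextIOWrapper(binary_stream, encoding=encoding, newline="\n")


# Demonstrated on an in-memory buffer; a real binary stream such as
# sys.stdout.buffer could be wrapped the same way.
raw = io.BytesIO()
writer = wrap_as_text(raw, encoding="utf-8")
writer.write("caf\u00e9\n")
writer.flush()
encoded = raw.getvalue()
```

Setting PYTHONIOENCODING before starting the interpreter achieves a similar effect for the standard streams without code changes.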

~~~
xorcist
> Python guesses the system encoding using logic that looks at the locale
> environment variables

This is not an implementation detail, but means that when Python chooses a
different encoding than you thought, your program crashes spectacularly (and
not on start, but on some unicode operation further down the line).

The same program will run differently depending on who starts it. This should
go at the top of all Python 3 tutorials, since it's not obvious at all what
happened, especially not for a beginner. Non-conformant environment variables
and file names cause all sorts of weird problems.

I can recommend Armin Ronacher's Unicode tutorials for Python 3. They are what
saved my sanity when I first encountered this.

~~~
jessaustin
_...your program crashes spectacularly (and not on start, but on some unicode
operation further down the line)._

ISTM that it could be a good idea to try all potentially dangerous
Unicode-related operations immediately on startup. That could work as a
package or even as an addition to the stdlib.
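
A rough sketch of what such a startup check might look like, assuming Python 3. The function names and the sample string are hypothetical, not an existing package, and treating an unknown encoding as ASCII is an assumption made for the example.

```python
import sys

SAMPLE = "caf\u00e9 \u2603"  # arbitrary non-ASCII sample text


def find_encoding_problems(encodings, sample=SAMPLE):
    """Return (name, encoding) pairs whose encoding cannot represent sample."""
    problems = []
    for name, encoding in encodings.items():
        try:
            # Assumption: an unknown (None) encoding is treated as ASCII.
            sample.encode(encoding or "ascii")
        except (UnicodeEncodeError, LookupError):
            problems.append((name, encoding))
    return problems


def current_encodings():
    """Collect the encodings Python picked for this process."""
    return {
        "stdout": getattr(sys.stdout, "encoding", None),
        "stderr": getattr(sys.stderr, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    # Fail early and loudly instead of deep inside some later operation.
    for name, encoding in find_encoding_problems(current_encodings()):
        print("warning: %s uses %r, which cannot encode non-ASCII text"
              % (name, encoding), file=sys.stderr)
```

Calling this at the top of a program would surface a C/POSIX locale immediately; it does not cover every operation that can fail later, only the encodings listed.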

------
pdknsk
There are other, more subtle differences. In Python 3 this doesn't work,
because tuple parameter unpacking was removed (PEP 3113).

    
    
      >>> filter(lambda (x, y): x > y, ((1, 2), (4, 3)))
    

And 2.7 returns the filter result in the input type when the input is a str
or a tuple (otherwise it returns a list).

    
    
      >>> filter(lambda x: x in 'ABC', 'ABCDEFA')
      'ABCA'
    

In 3 it's an extra step, since filter() returns a lazy iterator.

    
    
      >>> ''.join(filter(lambda x: x in 'ABC', 'ABCDEFA'))
      'ABCA'
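
For reference, here is one way the snippets above could be written so that they run on Python 3 (they also run on 2.7). The variable names are only illustrative.

```python
pairs = ((1, 2), (4, 3))

# No tuple parameters in Python 3, so index into the single argument...
kept = tuple(filter(lambda pair: pair[0] > pair[1], pairs))

# ...or unpack inside a generator expression instead of a lambda.
kept_unpacked = tuple((x, y) for x, y in pairs if x > y)

# filter() is lazy in Python 3; rebuild the str explicitly.
letters = "".join(filter(lambda ch: ch in "ABC", "ABCDEFA"))

print(kept, kept_unpacked, letters)
```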

------
tzury
The `xrange` example suggests that in Python 3.x, `range` (which is equivalent
to xrange in Python 2.x) is slower than range in Python 2.x.

This is a double patently IMHO.

[http://sebastianraschka.com/Articles/2014_python_2_3_key_dif...](http://sebastianraschka.com/Articles/2014_python_2_3_key_diff.html#xrange)
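
This does not settle the timing question, but as a sketch of why Python 3's `range` is compared to `xrange`: assuming Python 3.3 or later, it is a lazy sequence that stores only start, stop, and step, and still supports the usual sequence operations.

```python
import sys

big = range(10**9)  # no list is built; only start, stop and step are stored

size_in_bytes = sys.getsizeof(big)  # small and independent of the length
length = len(big)
last = big[-1]                      # indexing is computed, not looked up
has_value = 10**8 in big            # arithmetic check for ints, no scan
every_third = big[::3]              # slicing yields another range

print(size_in_bytes, length, last, has_value, every_third)
```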

------
rdtsc
Regarding the Unicode issue, here is another post (by Armin Ronacher/mitsuhiko,
creator of the Flask web framework):

[http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/](http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/)

------
wting
> However, it is also possible - in contrast to generators - to iterate over
> those multiple times if needed, it is aonly not so efficient.

You can reuse a generator multiple times via itertools.tee():

[https://docs.python.org/2/library/itertools.html#itertools.t...](https://docs.python.org/2/library/itertools.html#itertools.tee)
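
A minimal sketch of that approach (the squares generator is made up for the example). Strictly speaking, tee() does not rewind the generator; it splits it into independent iterators that buffer items for each other.

```python
import itertools


def squares(n):
    """A one-shot generator: it can only be consumed once."""
    for i in range(n):
        yield i * i


# Split the single generator into two independent iterators.
first, second = itertools.tee(squares(5))

total = sum(first)      # consumes only `first`
values = list(second)   # `second` still yields every item

print(total, values)  # 30 [0, 1, 4, 9, 16]
```

The original generator should not be touched after calling tee(). As the itertools documentation notes, if one iterator consumes most of the data before the other starts, building a list is usually the faster choice.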

------
thomk
Typo in this sentence: However, it is also possible - in contrast to
generators - to iterate over those multiple times if needed, it is aonly not
so efficient.

------
gomesnayagam
Though many improvements have happened, developers still prefer to use 2.x, and
the ecosystem is taking a long time to adopt 3.x. "Who will tie the bell to the cat?"

