"Now, in Python 3, we finally have Unicode (utf-8) strings, and 2 byte classes: byte and bytearrays."
No, they are Unicode strings. utf-8 is an encoding, and only comes into play when you want to encode strings into bytes for sending them somewhere, or decode them from bytes when receiving them. The interpreter's internal representation of the string is either UCS-4 or an automatically selected encoding (http://legacy.python.org/dev/peps/pep-0393/), but that is an irrelevant implementation detail. Conceptually, the strings are sequences of Unicode characters, and it helps to think of them that way.
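A minimal sketch of that boundary (the values in the comments are illustrative):

```python
# The str object is a sequence of Unicode characters; utf-8 only
# appears at the encode/decode boundary.
text = "caf\u00e9"            # a str: four Unicode characters
data = text.encode("utf-8")   # a bytes object: five bytes ("é" is two bytes in utf-8)

print(type(text), len(text))          # <class 'str'> 4
print(type(data), len(data))          # <class 'bytes'> 5
print(data.decode("utf-8") == text)   # True: decoding round-trips
```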
Here are the really important facts about Unicode handling differences between Python 2 and 3 (aside from the obvious str/unicode -> bytes/str move):
- There is no silent Unicode coercion in Python 3. Unlike Python 2, your bytes objects won't be silently decoded to str just because you happened to concatenate them with a Unicode string. Your Unicode strings won't be encoded silently if you write them to a byte stream (instead, Python 3 will fail with the cryptic error "TypeError: 'str' does not support the buffer interface"). See the sketch after this list.
- The default encoding in Python 3 is utf-8, instead of the insane ascii default in Python 2.
- All text I/O methods by default return decoded strings, unless you open a stream in binary mode (open(filename, "rb")), which now actually means what you'd expect. See the documentation for the io module (https://docs.python.org/3.4/library/io.html) for more information. (You can use the io module in Python 2.7 to write code that is more forward-compatible with Python 3.)
- The above I/O semantics also apply to sys.argv, os.environ, and the standard streams (sys.stdin/stdout/stderr). The fact that all of these behave differently between Python 2 and 3 with respect to text encoding makes for a lot of fun hair-pulling when trying to write code compatible with both.
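A short sketch of the first and third bullets (the exact TypeError wording varies between Python 3 versions):

```python
import tempfile

# No silent coercion: mixing str and bytes fails immediately.
try:
    "abc" + b"def"
except TypeError as exc:
    print(exc)  # wording varies by version

# Text mode decodes to str; binary mode returns raw bytes.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", delete=False) as f:
    f.write("caf\u00e9")
    path = f.name

with open(path, encoding="utf-8") as f:  # text mode -> str
    print(repr(f.read()))                # 'café'

with open(path, "rb") as f:              # binary mode -> bytes
    print(repr(f.read()))                # b'caf\xc3\xa9'
```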
> Conceptually, the strings are sequences of Unicode characters, and it helps to think of them that way.
They're not; they're sequences of Unicode code points in Python 3.3+, and of either 16-bit or 32-bit Unicode code units in 3.0-3.2 (depending on whether it's a narrow or wide build), a distinction that is important to make. (Hint: `re.compile("[\U00010000-\U0010FFFF]")` doesn't create a regexp that matches what you think it does on 16-bit builds!)
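A quick illustration of the 3.3+ (PEP 393) behavior; the narrow-build behavior can't be reproduced on a modern interpreter, so it is only described in the comments:

```python
import re

# U+1F600 lies outside the Basic Multilingual Plane.
s = "\U0001F600"

# On 3.3+, str is a sequence of code points: one astral character, length 1.
# On a 3.0-3.2 narrow build this would be 2 (a surrogate pair of code units).
print(len(s))  # 1

# The astral character class works as intended on 3.3+:
print(bool(re.match("[\U00010000-\U0010FFFF]", s)))  # True

# On a narrow build, the \U escapes in the pattern would themselves expand
# to surrogate pairs, so the class would be built from surrogate code units
# and match something quite different.
```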
> No, they are Unicode strings. utf-8 is an encoding, and only comes into play when you want to encode strings into bytes for sending them somewhere,
Isn't sending data "somewhere" pretty common? Unless this is middleware in the Python ecosystem, data is going to go to a logger, database, console, web page, or file. Am I misunderstanding it? It seems you are dismissing it as something one doesn't need to worry about, because it is not done much...
> The above I/O semantics also apply to sys.argv, os.environ, and the standard streams (sys.stdin/stdout/stderr).
Completely ignorant about it, but how would Python 3 know, when reading from stdin, what the encoding is? Or what about when reading sys.argv?
> Isn't sending data "somewhere" pretty common? Unless this is middleware in the Python ecosystem, data is going to go to a logger, database, console, web page, or file. Am I misunderstanding it? It seems you are dismissing it as something one doesn't need to worry about, because it is not done much...
The point is not to dismiss the encoding. The point is that the OP is confusing the character representation (Unicode) and the I/O encoding (utf-8). Yes, sending data somewhere is common, and when you do that, you are taking your Unicode string and encoding it using whatever encoding is appropriate (usually utf-8).
As for whether you should worry about what the encoding is, modern systems (including Python 3, but not Python 2!) use utf-8 everywhere by default, and save you the headache of specifying the encoding or passing it around. One important exception is a Linux process that hasn't had its locale variables set (normally this is done by the PAM environment module, but in a number of situations all environment variables may be stripped from the process, leaving it with what's known as the POSIX "C" locale, which is something of a broken anachronism). Generally that leaves the system (not just Python) open to all kinds of brokenness, so keep the locale set by not stripping LANG and LC_ALL from your environment.
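A trivial way to check whether your process is running with a stripped locale:

```python
import os

# If all three of these are unset (or LANG is "C"/"POSIX"), the process is
# in the broken "POSIX C" situation described above.
for var in ("LC_ALL", "LC_CTYPE", "LANG"):
    print(var, "=", os.environ.get(var))
```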
> Completely ignorant about it, but how would Python 3 know, when reading from stdin, what the encoding is? Or what about when reading sys.argv?
The output of locale.getpreferredencoding() is used for environment variables, command line arguments, and I/O streams. You can inspect the encoding of any character stream through its .encoding attribute (e.g. sys.stdin.encoding); to change it, you have to rewrap the stream's underlying binary buffer.
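For instance (the output depends on your locale):

```python
import io
import locale
import sys

# The locale-derived default used for argv, environ, and default text I/O:
print(locale.getpreferredencoding())   # e.g. 'UTF-8' on a modern Linux/macOS

# Each text stream exposes the encoding it was built with:
print(sys.stdin.encoding, sys.stdout.encoding, sys.stderr.encoding)

# .encoding is read-only; to force a different one, rewrap the buffer:
utf8_stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")
utf8_stdout.write("caf\u00e9\n")
utf8_stdout.flush()
```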
> Python guesses the system encoding using logic that looks at the locale environment variables
This is not an implementation detail; it means that when Python chooses a different encoding than you thought, your program crashes spectacularly (and not on start, but on some Unicode operation further down the line).
The same program will run differently depending on who starts it. This should go at the top of all Python 3 tutorials, since it's not obvious at all what happened, especially not for a beginner. Non-conformant environment variables and file names cause all sorts of weird problems.
I can recommend Armin Ronacher's Unicode tutorials for Python 3. They are what saved my sanity when I first encountered this.
> ...your program crashes spectacularly (and not on start, but on some Unicode operation further down the line).
ISTM that it could be a good idea to try all potentially dangerous Unicode-related operations immediately on startup. That might make a good package, or even an addition to the stdlib.
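A sketch of that idea (check_unicode_environment is a hypothetical name, not an existing package or stdlib function):

```python
import os
import sys

def check_unicode_environment():
    """Hypothetical fail-fast check: exercise the dangerous Unicode
    operations once at startup instead of deep inside the program."""
    enc = sys.stdout.encoding or "ascii"
    # argv and environ were decoded with the locale encoding (using the
    # surrogateescape error handler); strictly re-encoding them surfaces
    # any bytes that didn't decode cleanly.
    for value in list(sys.argv) + list(os.environ.values()):
        value.encode(enc)  # raises UnicodeEncodeError on lone surrogates
    # Verify the output stream's encoding can carry non-ASCII text at all.
    "caf\u00e9".encode(enc)

check_unicode_environment()
```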
> Isn't sending data "somewhere" pretty common? Unless this is middleware in the Python ecosystem, data is going to go to a logger, database, console, web page, or file. Am I misunderstanding it? It seems you are dismissing it as something one doesn't need to worry about, because it is not done much...
In my opinion, the high frequency of sending data "somewhere else" is precisely the reason why unicode strings make so much sense.
Yes, there's a processing overhead. Yes, there is a need to make sure that all strings that go into python are valid unicode. And yes, if you happen to live in a part of the world where you never need characters outside of the ASCII table, it is likely a massive pain in the arse.
I have never enjoyed the privilege of the last one. The single most important thing about unicode - at least for me - is that conversion from unicode to any other encoding is in practice a well-specified operation.
(It may be luck that I've never had to deal with strings that would use multiple diaeresis variants. I've been informed that mixing those symbols gets particularly ugly in terms of string transformations.)
> The default encoding in Python 3 is utf-8, instead of the insane ascii default in Python 2.
It is only the default encoding for Python source code. Library and builtin functions can still use other defaults. For example, the builtin "open" uses "locale.getpreferredencoding".
Yes, thanks. I should have been clearer: the point is that Python 3 obeys the I/O encoding specified by the system locale (which on modern systems is utf-8), whereas Python 2 disobeys it and uses ascii by default instead.
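To make that concrete (data.txt is just an illustrative name):

```python
import locale

# What open() falls back to when you don't pass encoding= yourself:
print(locale.getpreferredencoding(False))  # e.g. 'UTF-8' on a modern system

# Passing it explicitly removes the dependence on the caller's locale:
with open("data.txt", "w", encoding="utf-8") as f:
    f.write("caf\u00e9\n")
```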
Typo in this sentence: However, it is also possible - in contrast to generators - to iterate over those multiple times if needed, it is aonly not so efficient.
I have built a small library of helper functions for dealing with this stuff in a sane way: https://github.com/kislyuk/eight. Another project that I can recommend, which tries to lessen the pain of writing code compatible with both 2 and 3, is python-future: https://github.com/PythonCharmers/python-future.