Why Python 3 Exists (snarky.ca)
418 points by cocoflunchy on Dec 17, 2015 | 266 comments



IMO one of the reasons for all the angst is that .encode() and .decode() are so ambiguous and unintuitive, which makes them incredibly confusing to use. Which direction are you converting? From what to what? The whole unicode thing is hard enough to understand without Python's encoding and decoding functions adding to the mystery. I still have to refer to the documentation to make sure I'm encoding or decoding as expected.

I think there would have been much less of a problem if encode and decode were far more obvious, unambiguous and intuitive to use. Probably without there being two functions.

Still a problem of course today.


Hm, I never saw this as ambiguous at all, except for a few weird encodings that Python has as "convenience" methods.

Here's how you remember it: "Unicode" is not an encoding. It never was, it never will be. Of course, the data must be encoded in memory somehow, but in Python 3, you cannot be sure what encoding that is because it's not really exposed to the user. From what I understand, there are different encodings that string objects will use, transparently, in order to save memory!

You always "encode" something into bytes, and "decode" bytes back into something. There should be exactly two functions, because the functions have different types: "encode" is str -> bytes, "decode" is bytes -> str. Explicit is better than implicit.

    output = input.decode(self.coding)
With Python 3, I instantly know that "input" is bytes and "output" is str.
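
A minimal round trip makes the two directions concrete (the names here are just illustrative):

    text = "café"                   # str: a sequence of code points
    data = text.encode("utf-8")     # str -> bytes: b'caf\xc3\xa9'
    back = data.decode("utf-8")     # bytes -> str
    assert back == text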


> You always "encode" something into bytes, and "decode" bytes back into something. There should be exactly two functions, because the functions have different types: "encode" is str -> bytes, "decode" is bytes -> str. Explicit is better than implicit.

I'm not sure "encoding" tells us that the output is bytes. Surely information can be encoded in other ways. Names like .to_bytes() and .from_bytes() might have saved a lot of trips to the documentation.


I would use .to_str() instead of .from_bytes(), since the action is from the perspective of the object you're calling it on.


> Surely information can be encoded in other ways.

Information - yes. Strings... I'm not sure. What other representations are you thinking about?


urlencode(mapping) --> querystring


Indeed, the Python 3 Unicode string object is fascinatingly clever. Code worth reading:

https://github.com/python/cpython/blob/master/Objects/unicod...


Also, it's incompatible with UTF-8 strings stored in C, which means that when you cross the Python/C API boundary, you have to re-encode all strings. This is a large performance penalty right at the time when you can least afford performance penalties.

IMNSHO, most modern languages should be storing strings as UTF-8 and give up on random access by characters. You almost never need it; in the most frequent case where you do (using indexOf or equivalent to search for a substring, and then breaking on it), you can solve the problem by returning a type-safe iterator or index object that contains a byte offset under the hood, and then slicing on that. Go, Rust, and Swift have all gone this route.


> Also, it's incompatible with UTF-8 strings stored in C

That's not true as of 3.3. It will store the compact representation, which in most cases is ASCII, which is UTF-8 compatible. https://www.python.org/dev/peps/pep-0393/

So unless you go outside of UCS-1, you don't need to reencode anything.


This design doc from DyND (a possible NumPy alternative) has some useful references on this point: https://github.com/libdynd/libdynd/blob/master/docs/string-d...


Nope. This is just a normal short-string implementation, inlined on 64-bit systems, with a tag bit. Very common in any better VM.

This has nothing to do with (inefficient) encodings and their bad historic naming, probably derived from Perl.

Of course any name like byte2utf8 or just to_byte or b2u8 would have been better than encode/decode.

And of course cache size matters much more nowadays than immediate substring access for utf8, so nobody should use ucs-2 or even ucs-4 internally. This is easily benchmarkable.


"Byte2utf8" is a pretty confusing name for a method, considering utf8 is a byte encoding of unicode... :)


However, C also did not agree on UTF-8. There are lots of libraries which use UTF-16 encoding, so you would need to convert for those if you stick to UTF-8. And you may need some conversion anyway, because one system expects zero-terminated strings while another expects explicit length fields. Unfortunately strings are really a mess.


> Indeed, the Python 3 Unicode string object is fascinatingly clever. Code worth reading:

Clever is relative. The Python 3 unicode string object makes very little sense given real-world situations. It wastes an enormous amount of memory in many common scenarios for no reason other than to facilitate O(1) access to codepoints, which is completely inappropriate for text processing anyway.

Worse than that: it can blow up by another 50% the moment someone encodes it to utf-8, as the encoded version gets cached.

The object is too clever for its own good.


You're right that O(1) access is pointless, and UTF-8 is fine. But there are space optimizations on the Python string object which make it so it doesn't waste that much space. Basically, it dynamically selects LATIN-1, UCS-2, or UTF-32 (or you could call them UCS-1, UCS-2, UCS-4, but there isn't really any such thing as UCS-1, and UCS-4 is an obsolete name for UTF-32).

    >>> import sys
    >>> def charsize(x):
    ...     return sys.getsizeof(x + x) - sys.getsizeof(x)
    ...
    >>> charsize(b'A')
    1
    >>> charsize('A')
    1
    >>> charsize('\xff')
    1
    >>> charsize('\uffff')
    2
    >>> charsize('\U0010FFFF')
    4


> But there are space optimizations on the Python string object which make it so it doesn't waste that much space

Except it does not help. When you render out a template it starts out as latin1, then the first time it hits a unicode char the whole thing re-encodes into ucs2 and then when it sees an emoji it re-encodes to ucs4. In practical terms you will see UCS2 and UCS4 all the time.
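
You can watch the widening happen with sys.getsizeof; the exact byte counts vary by CPython version, so treat them as rough, but the jumps are the point:

    import sys
    ascii_only  = "a" * 1000                   # compact 1-byte-per-code-point storage
    with_accent = ascii_only + "é"             # whole string widens to 2 bytes per code point
    with_emoji  = ascii_only + "\U0001F600"    # whole string widens to 4 bytes per code point
    print(sys.getsizeof(ascii_only), sys.getsizeof(with_accent), sys.getsizeof(with_emoji))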


Except UTF-8 is never less compact than UTF-32 (it is one to four octets, and UTF-32 is always four octets), and is often more compact than UTF-16, with the sole exception being BMP CJK text that is not embedded in an ascii-based markup language.


Another alternative would have been to store strings in UTF-8 and generate an array of indices to the starting point of each entity only if needed for random access. Subscripting a string would use the array of indices.

Only a few standard member functions of string (find, rfind, index, rindex) return integer subscripts into a string. Those could return an opaque index object which has an index and a link to the string. Such objects could be used for string operations which need a string index. If an index object is forcibly converted to an integer, the array of indices has to be generated, but otherwise, it is unnecessary. You could even allow adding or subtracting integers from an index object, which would walk the UTF-8 string forwards or backwards as indicated.

This would make UTF-8 strings usable with subscript and slice notation without exploding them to wide characters.
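
A rough sketch of what such an opaque index might look like, in Python (nothing like this exists in CPython; all the names here are made up):

    class StrIndex:
        """Hypothetical opaque position: a byte offset into a UTF-8 buffer."""
        def __init__(self, buf, byte_offset):
            self._buf = buf              # the underlying UTF-8 bytes
            self._off = byte_offset      # always kept on a code point boundary

        def __add__(self, n):            # walk forward n code points
            off = self._off
            for _ in range(n):
                off += 1
                while off < len(self._buf) and (self._buf[off] & 0xC0) == 0x80:
                    off += 1             # skip UTF-8 continuation bytes
            return StrIndex(self._buf, off)

        def __int__(self):               # forcing an integer pays the code-point-counting cost
            return sum((b & 0xC0) != 0x80 for b in self._buf[:self._off])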


Yeah, I'm not reading 15K LOC... What makes it clever? What is even the point of linking a 15K file?


> "Unicode" is not an encoding.

It's important to point out that "Unicode" in Windows is an encoding. It means specifically UTF-16 LE; many people like me who learned the concept "Unicode" on Windows got confused to hell when learning Python.


Microsoft - I don't know whether to hate on them or pity them. They were an early adopter of Unicode, back when there was only one encoding that probably didn't even have a name yet (UCS-2). It's easy to see how they would unleash that confusing nomenclature on the world, but they should have made an attempt to clear it up when other encodings became available, and they didn't.


It is ambiguous when you still don't get encodings. As surprising as it may be, monads clicked for me faster than encodings did (maybe the topic doesn't tease my brain enough for me to retain it).


It is not true that unicode never was an encoding. It was originally a fixed-width 16-bit encoding[1]. This was arguably no longer true in the early '90s and definitely not true by 1996.

1: http://www.unicode.org/history/unicode88.pdf


OK, but is what I read from a file bytes or "something"?

When I read some "things" from a utf16 file and want to save them to Postgresql whose client encoding is set to utf8, what do I do?
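
In Python 3 the usual answer is: decode at the file boundary and let the database driver encode at the other boundary. A sketch, assuming psycopg2 (table and column names are made up; any DB-API driver that accepts str works the same way):

    import psycopg2  # assumed driver

    with open("input.txt", encoding="utf-16") as f:   # bytes on disk -> str in memory
        things = f.read().splitlines()

    conn = psycopg2.connect("dbname=test")
    with conn, conn.cursor() as cur:
        for thing in things:
            # the driver encodes the str to the connection's client encoding (utf8 here)
            cur.execute("INSERT INTO stuff (value) VALUES (%s)", (thing,))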


    In [1]: len("नि")
    Out[1]: 2

So close, but it's not really a string or bytes, but a codepoint array. You can even iterate it:

    In [2]: for c in "नि":
       ...:     print("Character {}".format(c))
       ...:     
    Character न
    Character ि


I'm getting tired of this argument, we had it in the Rust community as well and I was tired of it there, too. I've seen the same argument over and over again, and every argument had the same hole that I see in your post: What, exactly, is a "character"? Please answer me that, and then you can write a proposal for how the API should work, exactly, and why this works with Indic and German and Korean and everything else.

Strings have to be arrays. It is inevitable. We are on Von Neumann machines with L1 cache lines that are something like 64 bytes wide, so making strings into something other than arrays is a complete non-starter. Because strings are arrays, we use array indexes to slice strings. The indexes are integers, because that just makes sense on a Von Neumann machine.

So your complaint is that you can slice a string in "bad" ways. What makes it bad? You're trying to prevent people from slicing up grapheme clusters, which is a noble goal. But in practice, we get the indexes from methods like str.find() or re.match(), and we treat them as more or less opaque, so we don't end up slicing grapheme clusters very often in practice. Grapheme cluster slicing requires a big Unicode table in memory anyway, so in the rare case that you need it, you can pay the high performance cost for using it. In the meantime, formats like JSON and XML are defined in terms of code points, so using code point arrays eliminates a class of bugs where you could accidentally make malformed XML or JSON, which would then get completely REJECTED by the receiving side, causing your Jabber client to quit or your web browser to show a bunch of mojibake.

And let me ask you this: what do I get when I write:

    x = "a"
    y = "\u0301"
Is the resulting len(x + y) == 1? But len(x) == 1 and len(y) == 1? Can you tell me what the correct behavior is? Might you end up with bugs in programs because len(x + y) != len(x) + len(y)? Or do you introduce extra code points into the string when you concatenate them?
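(For reference, here is what Python 3 actually does with that pair, with NFC normalization as the opt-in way to collapse it:)

    >>> import unicodedata
    >>> x, y = "a", "\u0301"
    >>> len(x + y)
    2
    >>> unicodedata.normalize("NFC", x + y)
    'á'
    >>> len(unicodedata.normalize("NFC", x + y))
    1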

Please, tell me what you actually think is correct behavior. It is far, far more useful than pointing out that something is "wrong" on purely semantic arguments.


> Because strings are arrays, we use array indexes to slice strings

But that's a choice and not the only one. Sure you're likely to implement your Unicode string as an array of some type. But that doesn't mean that the only sensible approach is to expose that array directly to the user. Or more specifically as the primary interface.

For my money, Swift has probably the most comprehensive Unicode string API I've seen [0]. What they do is, essentially, have an opaque String type but support various "views" as properties. The main view is called .characters and represents a collection of grapheme clusters. They also have properties that present Unicode scalars, utf8 encoding etc.

Their API is complex, no question. But then proper handling of Unicode is complex. But it does show that there are other options than simply exposing the Unicode scalars.

BTW to your point on len(x) + len(y). The answer is 2 if you define len() on Unicode scalars but 1 if you define it on characters. Why? because len("\u301") should be 0. It's not a character, it's a base modifier. It is, of course, true that getting this right is likely to be substantially more expensive than getting it wrong but that doesn't mean it can't be done.

[0] https://developer.apple.com/library/ios/documentation/Swift/...


Check out Perl 6's strings, they are implemented similarly to the Swift usage of string "views" but with a default that assumes you want NFC with the ability to chose alternate normalizations and raw access to codepoints if you want that instead of grapheme clusters.


> Swift has probably the most comprehensive Unicode string API I've seen

Perl 6 has long specified an interesting and rich approach to Unicode. Have you explored it?

(An initial implementation for an initial release of the Rakudo compiler was worked on over the last few years and declared done, modulo bugs, a few weeks ago. A recent blog post provides a friendly overview.[1])

One notable difference is that Swift adopts an iteration-only view of strings for dealing with characters whereas Perl 6 does not. One consequence is that indexing operations in Perl 6 are O(1) (time) at the expense of potentially (and for the current implementation, typically) greater use of RAM.

> BTW to your point on len(x) + len(y). The answer is 2 if you define len() on Unicode scalars but 1 if you define it on characters. Why? because len("\u301") should be 0.

Are you 100% sure of your last sentence ("should be 0")? I get that it's intuitively reasonable that an isolated base modifier be length 0, but I thought the Unicode standard specified (or at least recommended) that it be 1, and I note that Perl 6 treats isolated base modifiers (or at least this particular one) as length 1:

> ("a").chars 1

> (0x301.chr).chars 1

> ("a" ~ 0x301.chr).chars 1

So both Swift and Perl 6 correctly report the length of "a" as 1 and an "a" concatenated with "\u301" as 1 but disagree on what to do with an isolated base modifier.
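
(For comparison, Python can report the same thing today via the third-party regex module's \X grapheme-cluster pattern, which follows UAX #29 and so sides with Perl 6 on the lone mark:)

    >>> import regex                      # third-party; the stdlib re has no \X
    >>> len(regex.findall(r"\X", "a\u0301"))
    1
    >>> len(regex.findall(r"\X", "\u0301"))
    1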

> It is, of course, true that getting this right is likely to be substantially more expensive than getting it wrong but that doesn't mean it can't be done.

Indeed.

[1] https://perl6advent.wordpress.com/2015/12/07/day-7-unicode-p...


len("\u0301") == 0 is extremely surprising.


Not really. "\u0301".characters should be equivalent to "".characters, so len("\u0301".characters) == 0 == len("".characters) makes sense.


Of course. What you were expecting was 6 right?

You learnt ASCII. Then you learnt about escaping to get around limitations in ASCII. Then you learnt that Unicode got around non-Latin issues with ASCII.

Each step has added cognitive load. It seems surprising to me that the next step wouldn't.


The surprise comes with a cost, and the cost is in bugs. The point of Python 3 is to eliminate some bugs in Python 2, and forcing developers to redesign their code around a completely new string API--no matter if that new API is more semantically sound--will come with its own cost of introduced bugs and more painful conversions, which goes against the purpose of Python 3.

I'm not saying that the string API in Python is perfection, but it had a specific role to fill in the larger Python ecosystem, and it does that very well.

Anyway... I would say that len(x) == 0 should be the same as saying that x is empty. According to the Unicode standard, the number of grapheme clusters in "\u0301" is 1, anyway! And on most systems, the string will display as an acute accent in its own space, and you can place the cursor before or after it.

It's not as if the set of valid "cursor positions" in a string is locale-independent, anyway. Grapheme clusters can be thought of as locale-independent, but they are NOT the same as cursor positions.


Not to sound too facetious, but the decision to redesign the string API was a big reason why Python 3 has been a little on the slow-cooked side. In for a penny and all that.

If I recall correctly the modifier on its own isn't correct. There is a base that should go before it to produce a valid character. I don't recall the base unfortunately and I don't have the spec to hand.


What do you mean by "correct"? The Unicode standard talks about well-formed UTF-8/16/32, but in general, any sequence of code points is permissible. By derivation, any sequence of code points can appear in JSON strings, and any sequence of code points excluding control characters can appear in XML. I personally have read the Unicode spec, and used it to implement some of the algorithms presented. This includes the grapheme cluster breaking algorithm... the case of combining characters at the beginning of a line is, in fact, covered by the spec, and causes a break both before and afterwards. In general, there are other cases where grapheme cluster breaks are not preserved by string concatenation, and you often want a locale-specific version anyway.


> What do you mean by "correct"?

I mean this from the standard: "Nonspacing combining marks used by the Unicode Standard may be exhibited in apparent isolation by applying them to U+00A0 no-break space. "

You can have valid Unicode that doesn't do this. But it's not intended to represent the character as some kind of textual thing. It's intended so that you can build up Unicode strings sensibly. Same idea as allowing mis-matched surrogates.


The correct answer is that strings shouldn't have length methods. There is no such thing as the 'length' of a string because as you pointed out there is no clear definition of what a 'character' is.

https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-s...


The correct answer is that strings shouldn't have length methods

There are plenty of uses for getting the length of a string where you will not be running into combining characters, clusters or any other things which would require special handling.

(remember that on the internet, all the data you receive is initially in string form, and very often we want to receive data representing fixed-length values such as phone numbers or credit card numbers, so "you're wrong for wanting to know the length" is both a non-starter and, well, just plain wrong)


>There are plenty of uses for getting the length of a string where you will not be running into combining characters, clusters or any other things which would require special handling.

How do you know that? Any user supplied text may contain those things.

>remember that on the internet, all the data you receive is initially in string form

It's not. It's bytes.

>very often we want to receive data representing fixed-length values such as phone numbers or credit card numbers

This is a good example of why strings shouldn't have a length property - people don't understand what length actually does in most languages. If you think that a len() function can verify that the contents of a text field is in the format of a CC number then you don't understand the len function. It can't do what you are asking. See the examples given in the GP (len("\u0301") for example).

The fact that the people who advocate for length functions want to use them for things that they are broken for is exactly why length functions shouldn't exist.


How do you know that? Any user supplied text may contain those things.

Which is why, in addition to things like length checks, we use other checks. But length is a quick and easy check for many types of values, and should be supported.

I've been down in the depths of Unicode behavior a time or two. I know what lurks down there, and I know from experience that your "strings should never allow these operations" stance is just as dumb as the more-common "I don't know what Unicode really is" stance. Can length checks and slicing and other operations run into trouble in some corners of Unicode? Sure. But if "this might cause trouble" were grounds for forbidding programmers ever to do something, we wouldn't have computers, any useful software at all, the internet or most of the other things we like and take for granted despite the fact that we occasionally have to fix bugs in it.

Now, get down off your snobby horse and join the rest of the real world, where we know that something might not be guaranteed reliable in all logically-possible cases, but find ways to work with it anyway because in 99.99% (or more) of the cases our code deals with it's good enough.


To add to that: O(1) indexing into strings outside of the ASCII range is completely pointless and makes absolutely no sense. A language should disallow that and not encourage it.


I disagree with it being pointless. Being able to get indices into a string for fast slicing is useful. Consider Regex where references get stored to slices in the array.

What you can drop is the idea of the index being an integer with any particular semantic meaning. Rust uses byte indexes with an assertion that the index lies on a codepoint boundary, for instance.


That isn't an index into a string. It is an opaque indicator of position in a string. That opaque indicator of position happens to be implemented as an index into the underlying array, but it is not itself an index.


That sounds like an index to me, and the operation surely indexing.

I mean, if one has a hashmap

    hash = {"foo": 1, "bar": 2}
we say hash["foo"] is an indexing operation, and "foo" is an index. Seems the same to me.


> Being able to get indices into a string for fast slicing is useful.

Outside of the ASCII range? I doubt it. That most likely means you are doing something fundamentally wrong.


Outside of the anglocentric world we also think in terms of characters and, yet, we consider ASCII as a rather primitive technology.


I think something like a lexer doesn't need to know about graphemes, it just needs to know which codepoints are XID_start and XID_continue for identifiers and hardcoded codepoints for everything else.

EDIT: Or a JSON parser, etc.
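
A sketch of that code-point-only scan, with str.isidentifier() standing in for the XID_Start/XID_Continue tables (a toy, not a real lexer):

    def scan_identifier(src, i):
        # Scan left to right over code points; graphemes never come into it.
        start = i
        while i < len(src) and src[start:i + 1].isidentifier():
            i += 1
        return src[start:i], i   # the token text and the position just past it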


A lexer scans LTR. It never jumps codepoints ahead. Indexing in a lexer never makes sense.


Obviously, it does; instead of allocating a new string for every token, you can just save the start & end index of a token, and actually allocate a new string only in the case where the parser actually requests/uses that token in the AST.


But the type signature of that token should make it clear those are byte offsets and not character offsets. You can use them for byte-wise operations like memcpy, or to explicitly decode them into a string using the original encoding of the string. You cannot use them for string-wise operations like substring without explicitly casting or converting them to a string.

It seems like this whole mess comes from conflating bytes and characters. They are not the same, any more than integers and booleans are the same just because you can jnz in most assembly languages.


> You cannot use them for string-wise operations like substring without explicitly casting or converting them to a string.

Why on earth not? Any slice of a string (with a non-pathological internal encoding) with ends on codepoint boundaries is itself a valid string. Losing that information sounds like a whole lot of pointless hassle and potential for harm. It also forces the internal string encoding to be part of the interface, since taking a slice of a string and getting a byte array requires knowing how the string is encoded to use it, but getting another string means that such details are private.

It's true that they're byte offsets, but that doesn't matter. The parser doesn't care whether it's on the 10543th grapheme or the 10597th. It just wants to know where its stuff is.


Swift has very sane way of doing this: Strings are sequences of Characters. Characters represent extended grapheme clusters. Iteration consumes as many bytes from the string as are necessary to read the next grapheme. In most cases, this is straight-up UTF-8 decoding, which can be implemented very efficiently; you only need to worry about extended grapheme clusters when you get to the non-ASCII subset of Unicode, so the code paths that require unicode table lookups are infrequently exercised.

String searching & matching return an opaque String.Index type. I assume that under the hood they do Boyer-Moore on bytes and String.Index is a byte offset, but the important thing is that String.Index values are not convertible to integers, and so you never run into the case where a user passes in a byte offset that would slice a grapheme in half. Instead, String.Index has properties for the next and previous index and a method to advance by an arbitrary amount, so you'd access the 7th character after a dash in a string as myString[myString.rangeOfString("-").advanceBy(7)]

Swift gives up on the idea that len(x+y) = len(x) + len(y). This is just something you have to remember; in UI programming, however, it makes a lot of sense because adding an accent to an existing string isn't going to make it take up more width in the text box. (x + y).utf8.count == x.utf8.count + y.utf8.count, however.

https://developer.apple.com/library/ios/documentation/Swift/...

If I were to adapt this to a domain like Python's, where string processing is pretty important, I'd allow indexing & slicing by integers (including negative integers), but I'd define them in terms of String.Index types, which iterate over extended graphemes under the hood:

  str[2] == str[str.start.advanceBy(2)]
  str[-3] == str[str.end.advanceBy(-3)]
  str[str.find('foo') + 3] == str[str.find('foo').advanceBy(3)]
  str[0:-3] == str.substring(str.start, str.end.advanceBy(-3))
  del str[0:3] == str.removeRange(str.start, str.start.advanceBy(3))


> Swift gives up on the idea that len(x+y) = len(x) + len(y). This is just something you have to remember; in UI programming, however, it makes a lot of sense because adding an accent to an existing string isn't going to make it take up more width in the text box. (x + y).utf8.count == x.utf8.count + y.utf8.count, however.

So do fullwidth characters have a length of 2?

If not, how is this useful for finding the size of text in a text box anyway?


I'm speaking more to the idea that when you concatenate strings, various important properties might not change. To get the actual size in pixels, you'd do str.sizeWithAttributes([NSFontAttributeName: myFont]).width, which is a whole other can of worms.


It's possible to do better than existing mainstream solutions:

(0) Dependent types: The type of valid indices into a string depends, well, on the string. It's this type, not `uint`, that should be used by functions like `str.find()` and `re.match()`. This makes it a compile error to, say, slice a string with an index to another string. Internally, of course, all indices are represented as `uint`s, but your program has no business knowing this. It's an abstract type. IMO, this solution is particularly useful for systems languages, where the increased flexibility and fine-grained control are worth the increase in complexity.

(1) Coalgebras: `str.find()` shouldn't return an index. Instead, it should return in O(1) time, and allocating O(1) additional space, the left and right slices of the original string, split at the desired index. (If the string can be split in multiple places, `str.find()` should return an iterator whose element type is such left-right pairs.) For this to work, it's fundamental that the left and right slices not be allocated in their own buffers, but instead share the original string's buffer. IMO, this solution is best for high-level languages, where simplifying the API is well worth a little loss in flexibility.

Neither of these APIs allows the programmer to split strings at the wrong places.
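
Python's built-in str.partition already behaves like a limited version of option (1): it returns the pieces around the first match instead of an index, so there is no index to misuse (though CPython copies the pieces rather than sharing the buffer):

    left, sep, right = "key=value".partition("=")   # ('key', '=', 'value')
    left, sep, right = "no match".partition("=")    # ('no match', '', '') -- sep is empty when not found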


Python is an abstraction a long way above the Von Neumann machine. It should not expose implementation details like the number of bytes a string happens to be stored as. The whole point of the Python 2 -> 3 transition was to disallow just treating a string as a random pile of bytes and hoping it all works out.


> Please, tell me what you actually think is correct behavior. It is far, far more useful than pointing out that something is "wrong" on purely semantic arguments.

You say "wrong" as if that's a quote from me. But it's not. Yes, strings have to be arrays, but how they work depends on the semantics of your language.

I'm not entirely happy with Python 3 because I don't think they really took their time to improve on the warts. I doubt there's a one-size-fits-all solution.


> I'm getting tired of this argument

Perhaps this is because Unicode is an exhaustive standard. :)

Seriously, I hear ya.

> every argument had the same hole that I see in your post: What, exactly, is a "character"? Please answer me that

I'd say the answer varies depending on who's talking, what they're talking about, and who's listening.

If it's a programmer interested in listening to the Unicode consortium and interested in what Unicode.org documents define for "what a user thinks of as a character" then the answer is, according to those documents, a "grapheme".

> then you can write a proposal for how the API should work, exactly

Right.

It can get, and has indeed gotten, better than that; one can write specifications, build reference implementations, and try things out in battle for a few years. Several text processing systems (including some programming languages) have gone this route and their results should be taken into account. (ICU is perhaps the go-to reference implementation.)

> and why this works with Indic and German and Korean and everything else.

For designs and implementations of text processing systems that claim an aspiration of progressing toward fully following the Unicode specification, the simplest answer to "why does it work?" (or "why it is expected to work") with Indic and German and Korean and everything else is of course the Unicode standard itself.

For those of us who aren't blessed with the patience of a saint and thus haven't read the entire spec from start to finish a few times, one has to rely to a degree on those who do, even if it's really hard to follow what the heck they're saying.

> You're trying to prevent people from slicing up grapheme clusters, which is a noble goal.

Given that grapheme clusters correspond to what the Unicode standard specifies for identifying and preserving the integrity of "what a user thinks of as a character", it seems less a noble goal than a fundamental eventual goal for any text processing system that explicitly claims to aspire to support the world's human languages via Unicode compliance.

> we don't end up slicing grapheme clusters very often in practice.

Which "we" are you referring to?

The western world that has dominated text processing to date does. But this must not mean that "we", by which I mean all humans, would be best served by continuing practices that marginalize those whose native languages are not well served. (By ignoring the central importance of "what a user thinks of as a character"; again, I'm quoting Unicode consortium documentation.)

Or, more parochially, what about the "we" that's just programmers who want to be able to increase their confidence that they are properly processing Unicode compliant text?

> In the meantime, formats like JSON and XML are defined in terms of code points ...

Code points are the middle level of Unicode, sandwiched between byte encodings at the bottom level and graphemes at the top. There's nothing wrong with a text processing system stopping at the middle level provided folk don't lose sight of the fact that it stops at the middle level.

Thinking that code points are "what a user thinks of as a character" mostly works out OK when processing text that's English or the like, but is definitively not Unicode compliant to the degree it's used too broadly.

> Can you tell me what the correct behavior is?

I would suggest that the most appropriate authority is the Unicode consortium and its Unicode standard.

> Might you end up with bugs in programs because len(x + y) != len(x) + len(y)?

Oh for sure. But it's part of the Unicode standard. It's one of several apparently strange discontinuities that the Unicode effort introduced last century in an attempt to deal with the combination of human language reality and computer system realities. They may have chosen the wrong sweetspot. But now Unicode is a significant part of reality. I, for one, don't think it's wise or even realistic to abandon the Unicode standard as it is...


Although true, the aim of the transition was to enforce encoding-correctness. Operations on strs should never break the encoding itself[1], but preserving the semantic meaning of the text itself is basically impossible for arbitrary data.

[1] Sadly surrogate escapes break this. Rust's division into String:OsString:Vec<u8> makes that aspect a lot cleaner.


The confusion you're talking about is a Python 2 problem, not one of ambiguity.

Encoding and decoding are pretty well defined (though, I've never thought about the formal definition before). When you have an entity in its native form, it needs to be _encoded_ for the purposes of communication (in a broad sense). The encoded message can then be to be _decoded_ back to the natural form. There is no ambiguity.

Really, the reason people get them mixed up in Python is because Python 2 totally stacked it by adding str.encode and unicode.decode.

In Python 2, you can _decode_ unicode to unicode – which it does by silently _encoding_ as ascii first. This operation is total madness.

    >>> u = u'\xe9'
    >>> u.decode('utf8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
The error here is in the step where you _encode_ the unicode to ascii, something you didn't ask it to do at all.

And similarly, you can encode str to str (where str is really bytes in Python 2, another issue that adds to the confusion).

Don't get me wrong. I feel your pain. When I was using Python 2 I also got confused about what form things were in and where I needed to encode / decode.

Honestly, once I switched to Python 3, that cognitive overhead just totally vanished. str is the natural form of text, and if I need to store / communicate it I _encode_ it to bytes (utf8, generally). When I'm loading a stored/transmitted message, I _decode_ it to its natural text form.

There are edge cases that make certain situations more complex, but in terms of general usage, I feel Python 3 really got this stuff right.


Nobody has a problem with encode/decode when talking about, say, json. You have some semantic object that you want to encode to some bytes representation, and some bytes representation that you want to decode into some semantic object.

The confusion is around people viewing byte strings implicitly as ascii codepoints, and not fully understanding what the things they're looking at actually are.


I concur 100%. Most of the problems with unicode are like most problems with floats: tons of people don't get how they work (and the good reasons that make all of that tricky).


99% of encoding problems in Python 2 are because str(), the fundamental type-converting function, is not safe.

    output = ''.join(str(x) for x in some_list)
Can you spot the danger here? Can you imagine how many hidden str() calls there are in every lib/module?
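
(Spelling the danger out: the moment some_list contains a unicode value with a non-ASCII character, the hidden ASCII encode inside str() blows up:)

    >>> some_list = [1, u'caf\xe9']
    >>> ''.join(str(x) for x in some_list)
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)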


>Nobody has a problem with encode/decode when talking about, say, json.

That's maybe because JSON is by the standard supposed to be utf-8, always.


I think you've misunderstood. json encoding is the process of converting native types to utf8 encoded data.

The point of the comment is that everyone knows that you json encode native entities to the utf8 representation, and json decode from utf8 to native entities.

The same is true of unicode. It's not ambiguous at all.


>The same is true of unicode. It's not ambiguous at all.

Only, utf-8 is a very specific encoding, so converting to it is straightforward, whereas unicode is an umbrella term and the most ambiguous thing you can imagine.

Even saying "the same is true of unicode" shows confusion, as utf-8 is itself unicode. As are numerous different encodings, and then you have normalization and other subtleties.


This is not my area of expertise, so my understanding may be incomplete.

If something is "in Unicode" it's a list of codepoints - that is the "thing" to be encoded. I don't see that ambiguity. It's a list of numbers, it's not fluffy kittens in the network or something. UTF-8 is a way of transcribing that list of numbers as 1s and 0s. That is the encoding process. Once it's encoded to 1s and 0s, it's not anything. It's just 1s and 0s.
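
Concretely, the same "thing" and two different transcriptions of it:

    >>> s = "é"
    >>> [hex(ord(c)) for c in s]       # the "thing": a list of code points
    ['0xe9']
    >>> s.encode("utf-8")              # one transcription of it into 1s and 0s
    b'\xc3\xa9'
    >>> s.encode("utf-16-le")          # another transcription of the very same thing
    b'\xe9\x00'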

One person's utf8 encoded 1s and 0s may be another person's audio signal. It doesn't really make sense to talk about the encoded data as being utf8. It's the encoding/decoding process that's utf8. Sure, the data might be utf8-decodable.

In terms of json, it's the same deal. You have the "thing", simple nested objects/lists/numbers in memory. It only makes sense to talk about json when you're talking about the process of transcribing those.

In both cases you have a "thing" (I think of it as "real", but in a way it's an abstract concept - how Python represents a dict, I have no idea), and you can encode those things into a tangible list of 1s and 0s.

That's why they're the same. And people aren't generally confused about the json encoding decoding process.


Encoded things are not meant to be human readable, like some sort of "secret code". The underlying bytes of some Unicode string probably won't be human-readable if you read them byte by byte--they will be like the platform's "secret code" for your string.

Conversely, you can decode the byte message into something readable, ie. your Unicode string.


I know it's a deeply subjective and personal thing unlikely to help anyone other than me, but I've always successfully relied on the following mnemonic for this particular problem:

* unicode -> encode, u -> e (vowels)

* str -> decode, s -> d (consonants)


My way of remembering str -> decode; (s->d); sad. Sad that I have to remember this and that it is still the "future of the language" despite significant and ongoing resistance.

EDIT: This last line of the article pretty much sums it up: "We structured the transition thinking the community would come along with us in leaving Python 2 behind, but that turned out not to be the case and instead we have taken some more time and are using a Python 2/3 compatible subset of the language to manage the transition."

I wonder if python 3.6 or 3.7 will acquiesce to this and give an option for something like "from __past__ import str" like we have for __future__ in 2.7 to make these backward incompatibilities easier to deal with.


> from __past__ import str

As the past did not contain __past__, that is not going to work.


Excellent point. I guess the shims that collinmanderson references would be the ideal.


It would be nice if future python 3.x brought back shims for unicode(), iteritems, iterkeys, itervalues, xrange(), etc. That could help make the transition easier.


Why? What possible reason would there be to use the old behavior, except for laziness?


Because I'd like to be able to use the googleads Python client library within Python 3 without needing to fix 62,000 lines of code? I guess you could call that laziness if you want.


You just use it; if a specific argument requires a Python 2 string you simply call bytes() on it.

For example

old_function(bytes(myarg))

if it needs to be a literal:

old_function(b"literal")

That's what I do, and it works for me so far.


Nice. Thanks for attributing a negative personality trait as the only reason someone would prefer any other way.


I don't know if it's been edited since you posted this, but at the moment, the comment is a question asking why somebody would want this other than laziness.


True. I guess some would consider this a positive personality trait in a programmer. I think that's only for system design though (lazy for minimizing future effort required).


There should be one obvious way to do it.


> Although that way may not be obvious at first unless you're Dutch.

There is one obvious way: strings are encoded to bytes, bytes are decoded to strings.


I don't think it is obvious and intuitive and unambiguous what the difference is between ENcoding and DEcoding.


A string is an abstract bit of text. You need to encode this into a particular memory representation of the text.

Bytes hold a bunch of data in some encoding. It could be an image, UTF-8 or LZMA compressed ASCII. Once you know the encoding, to reconstruct the data you decode into a semantically meaningful form.

To put it another way, imagine the terms were "serialize" and "deserialize". Of course one serializes to and deserializes from binary data. Just replace "{,de}serialize" with "{en,de}code" and you're done.
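
The parallel in code, with json playing the serialize role and encode supplying the binary half:

    import json

    payload = {"name": "æ"}
    wire = json.dumps(payload)          # object -> text   ("serialize")
    data = wire.encode("utf-8")         # text   -> bytes  ("encode")
    assert json.loads(data.decode("utf-8")) == payload   # and back the other way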


You mean the prefixes 'en-' and 'de-'?


Encode and Decode are slightly subjective... why not something like to_bytes and from_bytes? Maybe not the best names, but definitely clearer on the meaning.


Not really.

Veedrac had a good analogy: think of text as something abstract, for example an image or a sound. If you want to store it in bytes you need to encode it, and to read it back you decode it.

As to_bytes/from_bytes, actually python provides it too:

to_bytes -> bytes(<text>, <encoding>)

from_bytes -> str(<bytes>, <encoding>)
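
Both need the encoding spelled out, though; leaving it off str() is a classic pitfall:

    >>> bytes("æ", "utf-8")
    b'\xc3\xa6'
    >>> str(b"\xc3\xa6", "utf-8")
    'æ'
    >>> str(b"\xc3\xa6")              # no encoding: you get the repr, not the text
    "b'\\xc3\\xa6'"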


It makes sense for sure, just isn't super intuitive - if it was people wouldn't be so confused.


I think that's backwards


It's not backwards.

I think that reveals that the names really do have a problem. The problem is that "encode" sounds like "make this Unicode" to people who aren't familiar with Unicode.


I've really tried to figure out how to use these methods properly, but in the end, the only successful strategy I've found essentially boils down to randomly sprinkling encodes and decodes throughout my code until errors stop happening. Half the time I end up using an encode where my interpretation of the documentation suggested a decode would be required, or vice versa. (This is in Python 2. I haven't tried Python 3 yet since I don't do much Python coding these days.)


That's not the proper way. You should rarely need encode/decode if you make sure everything within Python is unicode. You ensure this by specifying the encoding when getting data from external sources such as files, the network, or external libraries.


Of course I'm well aware this isn't the proper way to do it. I'm just saying it's the only way I've found that is successful with any degree of regularity. I've tried following the documentation as best I can, but like I said, half the time an encode ends up working where I thought I needed a decode or vice versa.


This is how to look at it.

First you need to think of text and binary (bytes) as two distinct types.

Most of the time, text is how you interact with the user: things that you will display to the user, or what the user enters.

Now, text is stored as unicode (how is not our concern; Python abstracts it from us), while bytes are the representation of how it's stored in files, sent over a network, etc.

Now, if you need to store text in a file or send it over a network you need to encode it (most of the time as UTF-8), and if you receive data which is supposed to be text you decode it. It might be helpful to think of text as a sound and bytes as an mp3. You encode sound as mp3 to store it in a file, and when you want to play it back you decode it.

There are some things that can cause confusion. For example, if you write text to a file, Python will automatically apply the conversion. Things like that are there so you don't have to do the conversion every single time.
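
For example, in Python 3 text mode does the encode for you, and binary mode shows what actually landed on disk:

    # Text mode: hand Python a str, it does the encode for you
    with open("greeting.txt", "w", encoding="utf-8") as f:
        f.write("héllo")

    # Binary mode shows the encoded form
    with open("greeting.txt", "rb") as f:
        print(f.read())                # b'h\xc3\xa9llo'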


Python 3 really does help with this. You rarely need to use encode and decode, it's done automatically when opening a file in text mode. You do have to specify the encoding you want though.


It is easy to confuse the names, but it takes 2 seconds in the REPL or in Idle to figure it out, and the concept behind them is pretty obvious, IMO.

I'll take an occasional 2 seconds in the REPL over the headache of debugging codec issues any day of the week.


For me this confusion arises because of duck typing together with ambiguous interfaces. Take open() as an example: it can return a file that reads either in binary mode or automatically decodes into unicode strings, just based on whether you passed mode='rb' vs mode='r'; also, if you enable universal newlines it will implicitly assume some encoding. This all kind of makes sense once you know it, but the danger lies in that it's code that worked in py2, and even now it will still run because most functions that work on bytes also work on unicode. It's not until you try to combine these variables with data from other sources (like string literals) which were/weren't unicode that you notice this, and then you don't know which side was correct, so you randomly just shotgun either a decode on one side or an encode on the other side. This could be 100 lines away from where the problem should actually have been solved, and people being people, they solve the symptoms instead of the causes.


> .encode() and .decode() are so ambiguous and unintuitive

They are not. What is unintuitive is the default encoding in Python, where an encode can trigger an implicit decode and the other way round. The `encode()`/`decode()` availability on strings was never a problem, as you have many bytes -> bytes and str -> str codecs.


Disagree. The original commenter is correct in saying that the naming scheme is not obvious. Something like to_bytes(encoding) would be a lot more clear.


> Disagree. The original commenter is correct in saying that the naming scheme is not obvious. Something like to_bytes(encoding) would be a lot more clear.

But then the function would do something completely different. "\x01\x02".encode('zlib') for instance is a bytes to bytes operation. The problem is that "foo".encode('utf-8') does not give you an exception. If the coercion were not enabled you would get an error:

    >>> reload(sys).setdefaultencoding('undefined')
    >>> 'foo'.encode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/Cellar/python/.../encodings/undefined.py", line 22, in decode
        raise UnicodeError("undefined encoding")
    UnicodeError: undefined encoding
That's not any worse than what Python 3 does:

    >>> b'foo'.encode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'bytes' object has no attribute 'encode'


> "\x01\x02".encode('zlib') for instance is a bytes to bytes operation.

I think the inclusion of things like zlib (or rot13 or whatever) was a conceptual error that just fosters confusion.


> I think the inclusion of things like zlib (or rot13 or whatever) was a conceptual error that just fosters confusion.

We should not optimize languages for idiots. There is nothing confusing about such an operation for anyone who can use their brain. Python 3 still contains those operations but instead of x.encode(y) you now do encode(x, y).


The principle of least astonishment and optimizing for idiocy are not the same thing.


How is an attribute error clearer than an exception that says something like: operation does not make sense of this type?


"\x01\x02".encode('zlib') is a really odd API. Just making sure: this is a real thing, and you're in favor of it?


It is in Python 2, but not in Python 3, where str.encode() and bytes.decode() have been restricted to only convert between strings and bytes.

    >>> "foo".encode('zlib')
    LookupError: 'zlib' is not a text encoding; use codecs.encode() to handle arbitrary codecs


> "\x01\x02".encode('zlib') is a really odd API. Just making sure: this is a real thing, and you're in favor of it?

Of course this is a real thing and it's very frequently used. Yes I am in favour of it as the codec system in Python is precisely the place where things like this should live.


I agree that encode() and decode() are ambiguous, I find myself pausing to make sure I'm using the right one.

You can use bytes(string,encoding) to replace encode(). Unfortunately it doesn't have a default encoding, which makes it a pain to use. And str(bstring) isn't symmetric, it can't replace decode().


The problem is that both u''.encode() and ''.encode() (at least in Python 2) exist. Why?


Unicode worked just fine by Python 2.6. I had a whole system with a web crawler and HTML parsers which did everything in Unicode internally. You had to use "unicode()" instead of "str()" in many places, but that wasn't a serious problem.

By Python 2.7, there were types "unicode", "str", and "bytes". That made sense. "str" and "bytes" were still the same thing, for backwards compatibility, but it was clear where things were going. The next step seemed to be a hard break between "str" and "bytes", where "str" would be limited to 0..127 ASCII values. Binary I/O would then return "bytes", which could be decoded into "unicode" or "str" when required. So there was a clear migration path forward.

Python 3 dumped in a whole bunch of incompatible changes that had nothing to do with Unicode, which is why there's still more Python 2 running than Python 3. It was Python's Perl 6 moment.

From the article: "Obviously it will take decades to see if Python 3 code in the world outstrips Python 2 code in terms of lines of code." Right. Seven years in, Python 2.x still has far more use than Python 3. About a year ago, I converted a moderately large system from Python 2 to Python 3, and it took about a month of pain. Not because of the language changes, but because the third-party packages for Python 3 were so buggy. I should not have been the one to discover that the Python connector for MySQL/MariaDB could not do a "LOAD DATA LOCAL" of a large data set. Clearly, no one had ever used that code in production.

One of the big problems with Python and its developers is that the core developers take the position that the quality of third party packages is someone else's problem. Python doesn't even have a third party package repository - PyPI is a link farm of links to packages elsewhere. You can't file a bug report or submit a patch through it. Perl's CPAN is a repository with quality control, bug reporting, and Q/A. Go has good libraries for most server-side tasks, mostly written at Google or used at Google, so you know they've been exercised on lots of data.

That "build it and they will convert" attitude and the growth of alternatives to Python is what killed Python 3.


> That "build it and they will convert" attitude and the growth of alternatives to Python is what killed Python 3.

Well said.


> We have decided as a team that a change as big as unicode/str/bytes will never happen so abruptly again. When we started Python 3 we thought/hoped that the community would do what Python did and do one last feature release supporting Python 2 and then cut over to Python 3 development for feature development while doing bugfix releases only for the Python 2 version.

I'm guessing it's not a coincidence that string encoding was also behind the Great Sadness of Moving From Ruby 1.8 to 1.9. How have other mainstream languages made this jump, if it was needed, and were they able to do it in a non-breaking way?

https://news.ycombinator.com/item?id=1162122


C and C++ are so widely used that transitions like this are made not at the language level but at the level of platforms or other communities. Some parts of the C/C++ world made this transition relatively seamlessly, while others got caught in the same traps as Python.

The key is UTF-8: UTF-8 is a superset of 7-bit ASCII, so as long as you only convert to/from other encodings at the boundaries of your system, unicode can be introduced to the internal components in a gradual and mostly-compatible way. You only get in trouble when you decide that you need separate "byte string" and "character string" data types (which is generally a mistake: due to the existence of combining characters, graphemes are variable-width even if you're using strings composed of unicode code points, so you don't gain much by using UCS-4 character strings instead of UTF-8 byte strings).

My theory is that the python 3 transition would have gone much smoother and still accomplished its goals if they had left the implicit str/bytes conversion in place but just made it use UTF-8 instead of ASCII (although in environments like Windows where UTF-16 is important this may not have worked out as well).


You are correct that there's no real benefit in UTF-32 over UTF-8, which is why Go and Rust (and others) have worked with UTF-8 in memory just fine. However, the actual encoding of a str object is irrelevant, and it's not the point. The whole point of the str/bytes difference in Python is that you make Python keep track of whether you've done the conversion or not. In Python 2, you can be sloppy, and the programs are buggy as a result!

You're putting the cart before the horse here in terms of the Python 3 transition. The str/unicode fix was one of the driving factors for Python 3 to exist in the first place, and if you removed it, then what's the point?

Again, look at Go or Rust. Both of them have separate types for strings and bytes, even though they have the same representation in memory, and as a result we don't have that kind of bug in our program.


> The str/unicode fix was one of the driving factors for Python 3 to exist in the first place, and if you removed it, then what's the point?

The reason python 3 exists is that most python 2 code had latent unicode-related bugs that would only manifest when they were exposed to non-ascii data. The backwards-incompatible barrier between str and bytes was the solution the python 3 team chose for this problem; adopting utf-8 as the standard encoding would have been another solution which I claim would have been more backwards-compatible (essentially moving to the go/rust model, which prove that you don't necessarily need separate byte and character string types for correct unicode handling).


But not every 8-bit byte string is valid UTF-8 so that could still cause a world of pain.
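
For instance, two perfectly ordinary bytes (a UTF-16 byte order mark) that no amount of implicit UTF-8 decoding can digest:

    >>> b"\xff\xfe".decode("utf-8")
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte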


This is all true, but (speaking as one) I think the deeper reason is that C and C++ developers are just already used to great pain associated with manipulating character data, whereas scripting language developers expect these things for free.


But apart from being "Internet compatible", it makes little sense to move from ascii only to utf-8. It only makes sense if you're already trying to deal with stuff that doesn't fit in ascii. Don't get me wrong, I think ascii was a terrible hack (I seem to recall one of the designers called it his biggest mistake/regret - and that they should've gone with some kind of prefix-encoding).

But the thing with unicode-everywhere is that it's more beginner friendly: you can have unicode in your variable names, and strings without giving it another thought.

    $ cat u.py 
    from __future__ import print_function
    å="æ"
    print(å)
    $ python2 u.py 
      File "u.py", line 2
    SyntaxError: Non-ASCII character '\xc3'
      in file u.py on line 2, but no
      encoding declared;
      see http://python.org/dev/peps/pep-0263/
      for details
    $ python3 u.py 
    æ
Sure, the explanation of what's wrong is right there in the error-message, but it seems a bit of an unnecessary hurdle for people to get around to write important programs, that ask you to type in your name, and then prints out the name ten times ;-)


Well, Perl 6 also introduces a distinction between strings (Unicode) and buffers (octets), but it also introduces loads of other changes, to the point where it's its own language.

That said, the Unicode features were a major thing planned for PHP 6, and (afaict) one of the reasons there has never been a PHP 6, but rather they went straight from 5 to 7.

I'm not aware of any graceful transitions.


Perl 5 introduced Unicode support in 5.6.0 (2000) but it was kind of a mess. It was essentially redesigned in 5.8.0 (2002). At that point it was fairly buggy, but by 5.8.8 (2006) it was in pretty good shape. The most recent versions of Perl 5 (5.22.1 was released a few days ago) have excellent Unicode support.

That said, Perl 5 does not have different types for strings and bytes, which is definitely a source of bugs. Since Unicode support is essentially opt-in for a library (you have to explicitly decode/encode data somehow) it's easily possible for a library you use to break your data. Most of the major libraries (database, HTTP, etc.) have long since been patched to respect Unicode in appropriate ways, so the state of Unicode in Perl 5 is good overall.


In the Reddit discussion of this, someone linked to this criticism [1] of Python 3's Unicode handling written by Armin Ronacher, author of the Flask framework.

I am not competent to say whether this is spot on or rubbish or somewhere in between [2], but it seemed interesting at least.

[1] http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/

[2] Almost all of my Python 2 experience is in homework assignments in MOOCs for problems where there was no need to care about whether strings were ASCII, UTF-8, binary, or something else. My Python 3 experience is a handful of small scripts in environments where everything was ASCII.


It's wrong. The whole point of cat is to concatenate files. But if you concatenate two files with different encodings, you end up with an unreadable file. So you want cat to error out if one of the files that was passed has a different encoding from the encoding you told it to use, which is exactly what the python cat will do.


> The whole point of cat is to concatenate files.

Yes.

> So you want cat to error out if one of the files that was passed has a different encoding.

No. I expect it to read bits from stream A until it is exhausted and then read bits from stream B until it is exhausted. All the time just writing the ones and zeros read to the output stream (of bits). And no, a byte does not have to be "8 bit" (http://www.lispworks.com/documentation/HyperSpec/Body/f_by_b...).


And concatenating a file of 9-bit bytes with a file of 8-bit bytes will produce something useful? No. If you don't know what your bits represent then you will corrupt them. Python does not need to faithfully reproduce all the historical oddities of unix.


It might - depending on what I intend to do with the file (I still have the offsets of the individual files because I know the original file sizes).

API-wise, in my humble experience, it's hell to deal with operations that are supposed to work on bit-streams but try to be smart and ask me for encodings of those - this is information I might not even have when building on those basic operations. The "oddities", as you call them, are the result of not over-abstracting the code to handle yet-unknown problems.

You want to concatenate text files with different encodings? Convert them. Expecting a basic tool to do this for you (or carp about "problems") ultimately leads to cat converting image formats for you and demanding to know how to combine those images (addition, subtraction, appending to the left/top/right/bottom of the first image, etc).


> API-wise, in my humble experience, it's hell to deal with operations that are supposed to work on bit-streams but try to be smart and ask me for encodings of those - this is information I might not even have when building on those basic operations. The "oddities", as you call them, are the result of not over-abstracting the code to handle yet-unknown problems.

That's how C ended up as the security nightmare that it is. There are a lot of things you can do in C that you can't do in Python - you can reinterpret strings or floats as integers, you can subtract integers from pointers, you can write to random memory addresses.... Sometimes these things are useful, but most of the time they just lead to your program breaking in an unhelpful, non-obvious way.

Python is not that kind of language; it will go to some pains to help you not make mistakes. If you want to do bit-twiddling in Python there are APIs for it, and you could implement a "bit-level cat" using them, but it's never going to be the primary use case. Arguably there should be better support for accessing stdin/stdout in binary mode, but that would make it very easy to accidentally interleave text and binary output which would again result in (probably silent) corruption. (Writing a "binary cat" that concatenates two files of bytes would not lead to any of the problems in the linked article - it's only trying to use stdin/stdout that's causing the trouble in the link).


> That's how C ended up as the security nightmare that it is.

And that's how I ended up completely wasted after a friend had to throw a party after a GNU version of a common unix tool was able to accept his first name as a valid parameter.

And no, C's problems are not based on encoding issues. Those are not even a first-class symptom.

> Python is not that kind of language

Maybe. I don't care much, even though I do speak Python fluently. Though, I do care about minimal functional units whose documentation I can grasp in minutes, not hours.

> [Python] will go to some pains to help you not make mistakes.

This is not specific to python, this is specific to [language] developers. If all you do is text processing, you will think in characters and their encoding. I don't. A lot of programmers don't - because they deal with real world data that is almost never most efficiently encoded in text.

To reiterate: We are all dealing with bit-streams. Semantics of those are specific to their context. If your context is "human readable text" - deal with it. But please don't make me jump through hoops if I actually just want to deal with bit-streams. If you need magic to make your specific use-case easier, wrap the basic ops in a library and use it.

Last but not least: This all is completely off-topic when the question is about "why python3" - it's great for your use-case, but from an abstract point of view, python3 was just the rational continuation of python2, cleaning up a lot of inherited debt. Though it might fit your world-view, it's not necessarily what it was about.


Do you actually work with systems not using 8 bit bytes or are you just being pedantic?


Yes, I do. But the point I was trying to make was "it's bits, not bytes" (because "byte" has been defined to mean 8 bits only recently).


It seems simple to me - if you want bytes, open the file in binary mode, if you want strings open it in text mode.
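
For instance, a minimal sketch of the two modes in Python 3 (the file names here are made up):

    # text mode: Python decodes bytes to str for you, using the given encoding
    with open('notes.txt', encoding='utf-8') as f:
        text = f.read()        # -> str
    # binary mode: you get the raw bytes back, no encoding involved
    with open('blob.bin', 'rb') as f:
        raw = f.read()         # -> bytes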

The only glitch is with stdin/stdout. They're opened on your behalf before your program even starts, and the assumption is that you'll be reading and writing text in the default OS encoding. This doesn't mesh well with the Unix pipe paradigm.


If you need binary reads/writes for stdin/stdout, you can use the underlying binary buffer objects (the "buffer" attributes on sys.stdin/stdout).
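
For example, a minimal byte-level cat along those lines might look something like this (just a sketch, with no error handling):

    import sys

    def binary_cat(paths):
        out = sys.stdout.buffer            # the raw byte stream under sys.stdout
        for path in paths:
            with open(path, 'rb') as f:    # binary mode: no decoding at all
                out.write(f.read())

    binary_cat(sys.argv[1:])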


IMO the biggest reason to use Python3 is its concurrency support via async + await.

Fixing the unicode mess is nice too of course, but you can get most of the benefits in Python2 as well, by simply putting this at the top of all of your source files:

from __future__ import unicode_literals

Also make sure to decode all data from the outside as early as possible and only encode it again when it goes back to disk or the network etc.
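
A rough sketch of that "decode early, encode late" pattern in Python 2 (the file names and encoding are just placeholders):

    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import io

    # decode on the way in...
    with io.open('input.txt', encoding='utf-8') as f:
        text = f.read()                    # unicode, not bytes

    greeting = "héllo, " + text            # a unicode literal, thanks to the import

    # ...and encode again only on the way back out
    with io.open('output.txt', 'w', encoding='utf-8') as f:
        f.write(greeting)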


So much this! asyncio was the main selling point for me, but in general, why not follow the language?

I never really understood the "rather stay with py2.7" thing. I get it with big old monolithic applications. You don't "just" rewrite those, but _every new python project_ should be done with the latest stable release.

Is anyone starting their PHP projects on PHP4? Any new node projects in 0.10? Of course not, that would be moronic.


Because you don't know what libraries you may depend on in the future when you start a new project. I was a fervent Python 3 supporter, and wrote every of my fresh projects in Python 3 instead of 2, until one day I found I needed LLVM in my project, yet the python port at that time was for 2 only.

I mostly avoid writing Python 3 nowadays, because I don't want to rewrite my project or find painstakingly an alternative solution when I could have just imported a module that runs fine under Python 2.


For the record, you do not get most of the benefits by doing the future import. The silent Unicode/bytes coercion and lack of standard library support is still there, making it almost impossible to write correct non-ascii-handling software in Python 2.


I chose to port from CPython2 to PyPy4, rather than to CPython3. It just made more sense. I for one see no value in Python3 (unicode has been supported since 2.6). My reasons for migrating to PyPy4 instead of Python3-

1) It was easier than porting to CP3.

2) It gave me a tangible benefit by removing all CPU performance worries once and for all. Added "performance" as a feature for Python. Worth the testing involved.

3) It removed the GIL - if you use PyPy4 STM, which is currently a separate JIT that will at some point be merged back into PyPy4.

So for me, Python3 can't possibly compete, and likely never will with PyPy4 once you consider the performance and existing code that runs with it. PyPy3 is old, beta, not production-ready, based on 3.2 and Py3 is moving so fast I don't think PyPy3 would be able to keep up if they tried.

Python3 is dead to me. There's not enough value for a new language. I'm not worried about library support because Py2 is still bigger than 3, and 2.7 will be supported by 3rd party libraries for a very long time, or else those libraries choose irrelevance (Python3 was released in 2008 and is still struggling to justify its existence...). My views on the language changes themselves are stated much better by Mark Lutz[0]. I'm more likely to leave Python entirely for a new platform than I am to migrate to Python3.

PyPy is the future of Python. If the PyPy team announces within the next 5 years that they're taking up the mantle of Python2, that would be the nail in the coffin. All they have to do is sit back and backport into PyPy4 whatever features the Python2/PyPy4 community wants from CPython3, while those guys run off with their experiments bloating their language. I believe it's all desperation, throwing any feature against the wall, yet doing irreparable harm by bloating the language and making the famously "beginner friendly" language the exact opposite.

I already consider myself a PyPy4 programmer, so I hope they make it an official language to match the implementation. There's also Pyston to keep an eye on which is also effectively 2.x only at this time.

[0]http://learning-python.com/books/python-changes-2014-plus.ht...


There's just a bit of hyperbole in your comment. Most major libraries have been ported to Python 3. I wonder if the opposite of what you're saying will happen--i.e., the libraries that don't support Python 3 will be left behind. Fabric is an example of that for me.


Of course you may be right, but obviously I don't think so at this time. I feel I made a smart, safe bet. I'm still writing valid 2.7 code that could be migrated to 3.x at any point if for any reason my current plan would fall through. It would be just as easy for me to migrate to 3.x as it would anyone else with 2.x codebases.

So far, solving CPU performance in Python and removing the GIL is pretty much the holy grail. Python2 would have to completely collapse (no signs of this) for CPython3's ecosystem to outweigh the long-tail of libraries only on 2.x and the truly next-level dynamic language CPU performance.

While I am enjoying "performance as a feature" and GIL-free Python at the moment, I can still migrate to Python3 at any time if I lose my safe bet with PyPy4/Py2.

To me it looked and still looks like a no brainer.


> I'm still writing valid 2.7 code that could be migrated to 3.x at any point

Well, that's kinda true, although there is somewhat of a learning curve doing 2-to-3 migrations. Depending on your situation, this may or may not matter, but if you need to "move fast" on this at some point, it'd at least be beneficial to know what's involved (even though it's not that onerous IMO).


I've used (tested) Python3 many times over the years and releases. I'd consider the changes trivial. I'm not anti-Python3, I'm pro-PyPy4.

In my attempts to test Python3 releases though I've run across bugs and performance regressions from 2.x. After I continually ran into this in a few releases I eventually threw my hands up. The last release I ran tests on was 3.4 and I'm no longer interested in later releases until something as substantial as PyPy's CPU performance and removal of the GIL (PyPySTM) shows up to beat what I have now.

It's slick to have the entire Python2 ecosystem, major performance boost to your code, and still leave the door open for Python3 if they ever stop bloating the language.

Other than being dramatically slower than PyPy4, Python3 is also feature soup. I strongly dislike technical churn rather than true technical innovation (which is what PyPy represents). I'm more in line philosophically with Go, I'd prefer to remove features until you're down to a very concise and stable core. Python3 has many negatives, but from my perspective they keep piling on more.


Writing valid 2.7 is not enough as you can write yourself into a non-portable corner that is hard to escape from.

You need to use all of the __future__ imports and be mindful of constructs that 2to3 will handle wrong. Better yet is to bite the bullet and make your code work with automated tests running 2to3 if all of your library dependencies have py3 support. Then you can continue to write in 2.7 and anything that runs afoul of py3 will be caught early.
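
For reference, "all of the __future__ imports" usually means something like this at the top of every module (a sketch, not an exhaustive list):

    from __future__ import absolute_import    # Py3-style import resolution
    from __future__ import division           # 1 / 2 == 0.5, not 0
    from __future__ import print_function     # print() is a function
    from __future__ import unicode_literals   # "..." literals are unicode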


Yes, I could do that. But as it is, I'm no worse off than anyone else with Python2 codebases.

I used to use the future imports and the things you're suggesting but since my job is Python2.x I stopped. I prefer to just keep my head there all the time. If I were to migrate to CPython3 instead of PyPy4 as I've done, it would be a wholesale move. Employer moves, I move. The underlying incentives aren't there though at the moment for either of us.

I'm certainly not going to push my employer into something that isn't in their best-interests (or mine, work for nothing). If anything, they need to move to PyPy for the same reasons I did.

Py3 needs a bigger ecosystem than 2.x, needs an answer to PyPy4's CPU performance and removal of the GIL, and convince employers it's better than CPython2 and/or PyPy4. Those are 4 potentially insurmountable tasks.

The best thing to do is start removing features rather than adding them, such as this 4th string formatting method in 3.6. It (Python3) is just very unappealing.


Is there something fundamentally different from Py3 that makes a well-performing functional PyPy version impossible or extremely difficult?


Late reply here, but no. The reason there's no non-beta PyPy3 on a later version than 3.2 is that there's not enough demand for it (or no one willing to donate large enough sums of money for it to make it the primary focus).

Industry players who want to remove the Python CPU bottleneck and willing to stick with Python rather than moving on to Go or JS on Node, are donating to the 2.7 based PyPy4.


Going to PyPy4 would be nice. But mod_wsgi doesn't support it. It has something to do with PyPy not implementing Python's embedding API, I believe.

So I'll be sticking with 2.7 for now.


I love it when people with native English skills write monsters like this: "If people's hopes of coding bug-free code in Python 2 actually panned out then I wouldn't consistently hear from basically every person who ports their project to Python 3 that they found latent bugs in their code regarding encoding and decoding of text and binary data."

This should be under penalty ;)

Anyone willing to divide it into a few simpler sentences?

UPDATE: And another one from our run-on-loving author: "We assumed that more code would be written in Python 3 than in Python 2 over a long-enough time frame assuming we didn't botch Python 3 as it would last longer than Python 2 and be used more once Python 2.7 was only used for legacy projects and not new ones."


The first one:

> If people's hopes of coding bug-free code in Python 2 actually panned out

Python2 developers wanted to write bug-free code.

code = for the purpose of processing text and binary data

> then I wouldn't consistently hear from basically every person

Python2 developers could not write bug free code. So they complained.

complained = complained about their algorithms having bugs when they rewrote those algorithms in Python3

> that they found latent bugs in their code regarding encoding and decoding of text and binary data.

Python2 code written by the same developers had bugs that they did not know about.

When the same developers rewrote their code in Python3, they found the bugs.

(If Python3 did not exist, then it would be very hard to write bug-free code in Python2.)

The second one:

> We assumed that more code would be written in Python 3 than in Python 2 over a long-enough time frame assuming we didn't botch Python 3 as it would last longer than Python 2 and be used more once Python 2.7 was only used for legacy projects and not new ones.

If we designed Python 3 correctly, then we expect Python 3 to live longer than Python 2. We also expect more code to be written in Python 3 for the same reason. We also expect only old projects will be written in Python 2.7.


If people's hopes of coding bug-free code in Python 2 actually panned out

then they would have bug free code

thus when they port their code to python 3, the unicode changes would not reveal latent (existing) bugs

thus they would not blog about or tell people about said bugs, hence the author of that sentence would not have heard about such bugs

moral of the story: Python 2 hides certain kind of unicode related bugs that are not exposed until you port to Python 3.


Maybe something like:

"I consistently hear from basically every person who ports their project to Python 3 that they found latent bugs in their code (regarding encoding/decoding of text and binary data). This would not be the case if people's hopes of coding bug free code had actually panned out."

That's a little better, but still not great.


Division of this sentence would reduce its legibility.

It's a simple IF <condition> THEN <result>.

You can argue that he's overly verbose, but breaking the IF/THEN into multiple sentences reduces their connection and the ability to understand.

"If we were actually creating bug free code in Python 2 then porting code to Python 3 would be seamless, which it is not".


Since python3 is not backwards compatible with python2, why didn't the python devs leverage the opportunity for creating a more performant non-GIL runtime for python3?


> more performant non-GIL runtime for python3

Because making one of those is a huge amount of work, may introduce more serious backwards incompatibilities (like breaking C extensions), and not everyone has the knowledge of how to do it, so they'd need to either learn from scratch themselves or find people interested in doing it for them.


Removing the GIL is non-trivial. But I'm surprised that they didn't take advantage of Python 3 to move to a more performant register-based interpreter. Several people expressed interest in building one, and working prototypes were even made, but the core developers didn't seem much interested in performance upgrades.

I think their attitude is generally, "if it's performance-critical, write it in C", which was a good approach 15 years ago. However, Python's competition now comes from languages like Go and Rust that have both good performance and are user-friendly and productive (with features such as expressive syntax and a comprehensive standard library).


Removing the GIL is a difficult problem: http://python-notes.curiousefficiency.org/en/latest/python3/...


I think the question was if you aren't going to be backwards compatible, why not unshackle yourself completely and design a new language without the GIL and make it as pythonic as possible?

A scripting language that supports multi-threading is possible, right? I think TCL does it.


I guess they were happy with breaking backwards compatibility a little bit, but not as much as simply removing the GIL and not adding back any implicit synchronisation at all.


I think this question can only be raised in hindsight. I think nobody suspected the unicode transition would be so slow and painful. As there probably won't be such a big transition ever again, I wouldn't hold my breath for it happening.


It's a shame the GIL won't be removed because it's perceived to be too difficult.

It's trivial to remove the GIL - probably a week of mechanical work. Just don't depend on global variables in the interpreter. No global state, no problem. But the Python C API has to be changed to always store interpreter state into a struct, and a pointer to that interpreter state struct has to be passed as the first argument in all C API calls. Not rocket science; this is what Lua does.

It's a political decision to keep the GIL, not a technical one. As for preserving C API backwards compatibility, it's a straw man argument - the Python API broke from 2.x to 3.x anyway. There's no such thing as "lesser breakage" - only breakage.


Supporting multiple interpreters in a single process is not what most people mean when they talk about removing the GIL. Objects from one interpreter could not be used safely from another interpreter. This would be handy in a few situations but is essentially not all that different to multiprocessing.

What people usually mean when they talk about removing the GIL is having multi-threaded code make use of multiple cpu cores (as it does in Jython). This would involve splitting the GIL into more fine-grained locks. Unfortunately experiments taking this approach have so far shown significant performance impacts for single-threaded code.


It would be inefficient to share data structures or interpreter internal state across threads. This would lead to the poor performance you described with mutex contention.

It is better to have an interpreter instance per thread, each having their own separate variable pool, and pass messages between the different interpreter instances. This model scales well with multicore systems and is faster than the multi-process equivalent.

It's also a simpler model to implement and maintain. Python's current multithreading model is too complicated from an implementation point of view.


Because the runtime with the GIL is more performant and python users know that it's practically impossible to write correct software using threads.


Yes, this is the point. The GIL matters only if you are using threads with CPU-bound tasks (most of my work is I/O bound). When I have CPU-bound problems that need parallelism and are not yet well implemented in numpy/scipy, I would rather use AMQP instead of threads.


Is there a way to do it? My understanding always was that removing GIL means that the code that runs single-threaded (which is like 95% of Python code out there) will get a performance hit.


Yes it does mean that and I believe it is unavoidable, because you need to replace the coarse-grained GIL with fine-grained locking. However, I think the trade-off is now worth it, since multi-core CPUs are ubiquitous. Besides, Python itself was never notable for its performance to begin with, so I've always found the argument very odd.


It is perfectly avoided by not removing the GIL. :-) Multi-core CPUs may be ubiquitous, but multi-threaded Python code is not, by far. So performance of say, my Python code, would decrease, thank you, I don't need that. It's not that Python would have superb performance, I just don't see the reason to pay for something that's not actually needed.

It's also often forgotten that one of the easiest paths to parallelism is to just run multiple processes (and for many problems you only need to split the input data, the processes don't need to even communicate), and this solution naturally uses multiple CPUs without any GIL worries.
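
A minimal sketch of that route with the standard library (the worker function here is just a made-up CPU-bound example):

    from multiprocessing import Pool

    def work(chunk):                       # stand-in for some CPU-bound task
        return sum(x * x for x in chunk)

    if __name__ == '__main__':
        chunks = [range(0, 10000), range(10000, 20000), range(20000, 30000)]
        pool = Pool()                      # one worker process per core by default
        results = pool.map(work, chunks)   # each chunk runs in its own process, no GIL contention
        pool.close()
        pool.join()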

I think the actual use cases that require GIL removal are very little (something like a server, perhaps), and if you actually need to do that, I think you're better served with some JVM language or Go.


What I'm saying is that not only does Python not have superb performance, its performance is actually largely irrelevant. The advantage of Python is in ease of use, capabilities as glue language, readability, etc. So that's why I really don't see why you wouldn't remove the GIL at a reasonable slowdown factor (2x was a figure thrown around IIRC).

People always proclaim that you can just use multiple processes for parallelism. It's nice when it suits your (embarrassingly parallel) problem but when you need to share a large amount of data between processes it's a major hassle.


> but multi-threaded Python code is not, by far.

Eh, isn't that because of the GIL?


You may be right. But it's 2 levels of comments and I am still waiting for some examples of use case. Any use case I can think of can be better served with Java or C++ or C.

In fact, as I understand it, you can actually call into multi-threaded code from Python; it's only Python code (in the same interpreter) that executes single-threaded.

Although, an embedded scripting language (say, in a game) could in theory benefit. But isn't Python already too big for that kind of applications?


A common use case is GUI, where you offload heavy computation into a separate thread to avoid blocking the main thread. I once saw an abandoned Python GUI project on GitHub whose README said the author gave it up because the lack of real threading resulted in an unresponsive GUI all the time.


Because that's difficult to do. But the PyPy team has already done it for Python2 code. Check into PyPy-STM.


So Python 2 did not have super obvious string handling. One of the odd things that they seemingly could have fixed pretty easily is to change the default encoding from 'ascii' to 'utf8'. That would have fixed a bunch of the UnicodeDecodeErrors that were the most obvious problem with strings: http://www.ianbicking.org/illusive-setdefaultencoding.html

If they had to make Python 3 anyway, I think the main thing they were missing is that they should have added a JIT. That makes upgrading to Python 3 a much easier argument. If the only point of the JIT was to add a selling point to Python 3, that probably would have been worth it.


It seems to me if bytes/unicode was the only breaking change we would probably be over the transition by now.

There are a lot of other subtle changes that make the transition harder: comparison changes and keys() returning a view instead of a list, for example. These are good long term changes, but I wish they weren't bundled in with the bytes/unicode changes.


We migrated to Go from Python 2, since instead of incompatible Python 3 we needed faster Python 2 replacement.


Str is the tip of the iceberg. Python before 2.7 and current Python are completely different languages semantically: methods, functions, statements, expressions, global interpreter lock behavior... It is sad that this blog post and the discussions around it didn't mention any of that.


The article isn't covering all the differences between Python 2 and Python 3. Based on this article as well as other articles I've read in the past, the Unicode issue was the original reason they decided it was necessary to break backward-compatibility, but once that decision was made, there was no reason not to make any further backward-incompatible improvements.


IMHO the blog post doesn't really answer its own title.


The reason, I still did not port to Python 3:

(and yes, Unicode in Py2 is a mess ...)

They just broke too many things (unnecessarily!) internally. In particular, they changed many C APIs for extension modules, so that all of them had to be ported before they could be used with Python 3. They did not even consider a portability layer ... why not??

Some (not all) of the bad decisions (like dropping the u"..." string prefix) they did change afterwards, but then it was a little late.

So many modules are still not ported to Python 3 -- so the hurdle is a little too high -- for small to nil benefits!

So, the problem (from my side) is not Unicode at all ... just the lack of reasonable support from the deciders side.

---

Maybe some time later, when I have too much spare time.


Agreed. Also this:

"So expect Python 4 to not do anything more drastic than to remove maybe deprecated modules from the standard library."

But why break all the existing libraries that use those modules, even if there are now "better" ways? In every comparison I've ever seen on performance, robustness, etc., python always loses to the other big languages. Except in one area: the availability of user-submitted packages and extensions. So why break them for a little perf boost?


Sounds a bit weird to me.

I hope that they really think hard about which modules really must be removed.

I would only consider removing a library module for security reasons. Or at least mark it deprecated for a long while until it is only rarely used.


This is a pretty good explanation of unicode in Python: http://nedbatchelder.com/text/unipain.html


I like Python3 personally. It's new and better but a different branch. I'm annoyed by people abbreviating it as "Python" and treating it as a substitute for Python2. In my opinion, the "Python" name should be used exclusively for Python2, and Python3 should always have been used as one word. The whole Python3 situation caused unnecessary confusion for outside (non-Python) people, which I think could have been avoided.


Since I'm trying to keep a small footprint, I rely on the system version of Python on Mac OS X, which is 2.7.10 now.

To use anything newer, I'd have to ask users to install a different interpreter, or bundle a particular version that adds bloat. There's no point. The most I've done is to import a few things from __future__; otherwise, my interest in Python 3 begins when Apple installs it.


The Go authors have solved this problem thoroughly. When working in Go, I usually never have to think about this.

https://blog.golang.org/strings


Go came out in 2009. What's your point? I'd sure hope they'd look at languages older than them.


How long is the transition going to take? Serious question. Because I'm rather tired of starting new work and finding some module that drags me back to 2.x.


Technically, "forever", since there will be people who never port their code. If you're depending on one of those holdouts, it's time to find a new dependency, because if they haven't ported by now they won't and that's your problem because...

in practical terms, the transition is about to be over, since now the Linux distros are all-in on converting to Python 3 for their current or next releases and that will forcibly move the unported libraries into the bin of obsolescence.


From the outside, Python 3 seems like a much better language. I don't have strong views of its object system (I avoid OOP as much as I can) but it seems like the string/bytes handling is much better, and I'm also a fan of map and filter returning generators rather than being hard-coded to a list implementation (stream fusion is a good thing). Also, I fail to see any value in print being a special keyword instead of a regular function (as in 3).

What I don't get is: why has Python 3 adoption been so slow? Is it just backward compatibility, or are there deeper problems with it that I'm not aware of?


why has Python 3 adoption been so slow?

I can tell you about our situation.

We are an animation studio with decades of legacy Python 2 code. We sponsor pycon and are one of the poster children for python.

We have absolutely no plans to switch to Python3.

Here are the various reasons:

- Performance is a big deal, and moving to a version of python that is slower is a no-go off the bat.

- Python3 has no compelling features that matter to us. The GIL was the one thing that should have been tackled in Python3.

- Since the GIL is here to stay, our long-term plan will likely involve removing more python from the pipeline rather than putting a huge effort into a python3 port.

- We have dependencies on 3rd-party applications (Houdini, Maya, Nuke) that do not support Python3

- We have no desire to port code "just because". Each production has the choice of either spending effort on Real Features that get pixels on the screen, or on porting code for No Observable Benefit. Real Features always win.

- Python3 has a Windows-centric "everything is unicode" view of the world that we do not care about. In our use case, the original behavior where "everything is a byte" is closer to UNIX. A lot of the motivation behind Python3 was to fix its Windows implementation. We are a Linux house, and we do not care about Windows.

- Armin's discussion about unicode in Python3 hits many of the points spot on.

Why has adoption been so slow? Simply because we have no desire to adopt the new version whatsoever. We'll be using Python2.7 for _at least_ the next 5 years, if not more.

It's far more likely that we'll adopt Lua as a scripting language before adopting Python3.


It's far more likely that we'll adopt Lua as a scripting language before adopting Python3.

And you're welcome to do that.

But "we want a language frozen in time forever so we never have to maintain code" -- which seems to be what you're aiming for -- is not a goal you can achieve short of developing your own in-house language and never letting it make contact with the public (since as soon as it goes public it will change).

Meanwhile, the libraries are moving on and sooner or later they'll either move to Python 3, or be replaced by equivalent libraries with active maintenance, and the distros are winding down their support for Python 2. Switching to something else, and probably just rolling an in-house language you can control forever, is likely your only option if this is your genuine technical position.


I think you're mischaracterizing the parent's post. Python 3 is slower, offers no compelling features, and uses a string storage model that he doesn't agree with. That criticism does not mean that he or she desires "the language frozen in time".

The real feature needed is to eliminate the GIL. That would be worth breaking compatibility over.


Python 3 is slower, offers no compelling features

Python 3.0 was slower than 2.7 due to several key bits being implemented in pure Python in 3.0. Since the 3.0 release (remember, Python's on 3.5 now) things that needed it have been rewritten in C, and as of Python 3.3 the speed difference is gone. Also, on Python 3.3+ strings use anywhere from one-half to one-fourth the memory they used to.

As for "no compelling features", well...

* New, better-organized standard library modules for quite a few things including networking

* Extended iterable unpacking

* concurrent.futures

* Improved generators and coroutines with 'yield from'

* asyncio and async/await support in the language itself

* The matrix-multiplication operator supported at the language level (kinda important for all the math/science stacks using Python)

* Exception chaining and 'raise from'

* The simplified, Python-accessible rewritten import system

etc., etc., etc.
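
To make a couple of those concrete, a tiny sketch (Python 3.3+):

    # extended iterable unpacking
    first, *rest = [1, 2, 3, 4]        # first == 1, rest == [2, 3, 4]

    # 'yield from' delegates to a sub-generator
    def chain(*iterables):
        for it in iterables:
            yield from it

    list(chain('ab', [1, 2]))          # ['a', 'b', 1, 2]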


I'm 100% not picking on you but, I started programming in Python around 2004. When Py3k came out, I had exactly zero of these gripes. Not once did I ever say, "I wish urlsplit were in urllib.parse instead of urllib", or anywhere else. I had no problems with iterables, the async stuff (honestly with the GIL, why does any of that matter), exceptions or imports. I recognize that some users of Python did, library/framework devs, etc. But Py3k is a huge, incompatible, and confusing change for 90% of Python devs, and the benefits are things they didn't, and still don't care about. It was honestly just a bad decision.


for 90% of Python devs

Given how many people and projects suddenly said "yeah, actually, we want to be on Python 3 now" after seeing the new stuff in 3.4 and 3.5, I think you're overestimating the number of people who don't care about these features.


I take your point, but I have a few counterarguments:

1. Python 3.4 was released more than five years after 3.0. That's a very long time.

2. Assuming you're referring to the async features (none of the other stuff is really momentous), I can understand that for framework devs. It's really not a big deal for most users though, and because of the GIL, it's not as if they'll suddenly reap the benefits of parallelism. All it really means is nicer algorithm expressions and event loops.

3. There's no technical reason the new stuff couldn't have been added to Python 2. The roadblocks are manpower and politics.

4. Even if we stipulate async/await/asyncio are huge, busted Unicode support, 7 years of development, and breaking compatibility with everything is just a terrible tradeoff.

There's just no way this was a good idea, and Python devs could earn a lot of credibility back if they just said "oops". But there's no chance of that.


Short version sounds like: Python is a legacy language. We're not porting our legacy code and we're not starting new projects in Python.

I wonder:

1. Would this have changed if there was a Python 2.8 with some new features (but still a GIL) that ran most 2.7 code?

2. Is your experience representative, i.e. are teams just not starting new projects in any version of Python, even though so many did in the last decade?


Same industry, smaller scope here.

- We have dependencies on 3rd-party applications (Houdini, Maya, Nuke) that do not support Python3

This is our reason. It's a bit of a conundrum meets catch-22 situation. On one hand we would start using python3 if there were support for it; on the other hand no one wants to, because porting legacy code seems bothersome if everything works as it should. To be honest though, all of our python code is there to augment those 3rd party applications. Everything we have that's not tied to those applications is C (C99, more or less).


Thank you for your candid and insightful comments.

I am curious - if there was a GIL-removed version of Python 2, would you change your assessment that your "long-term plan will likely involve removing more python from the pipeline"? i.e. is the GIL the primary (or even sole) factor in that?


I can only speak from my personal experience - I write all my new code in Py3, and almost every still-developed library works great with Python 3. The authors usually ported it several years ago. But sometimes you'll need something that's no longer maintained -

Perhaps you want to parse an old obscure file format, and the only code you find for it is from a usenet post in 1996. That code isn't being updated, and no one has ported it. That means you need to do the work to update it, and that can be hard when you aren't familiar with what it's supposed to do.

The other place I've seen people sticking with Py2 is when they've got a huge chunk of internal code. Some companies have been writing python for 20 years, and the original authors have long since left the company. It can be hard to write a business case for having someone spend several weeks updating all the old code, particularly if it's purely internal, and doesn't touch the internet.


I think you're exaggerating the obscurity of libraries that only run on python 2.

The list I've gone off of is https://python3wos.appspot.com/. I use python a lot, and only now in late 2015 I might finally use python3 if starting a new project. When I started a new project last year, we used python2.

The biggest hold-out for me was gevent, whose Python 3 support was only released 5 months ago. Gunicorn with gevent workers is my preferred stack for running python apps.

If you use protobufs or thrift, those both aren't yet on python3.

The wall of shame currently lists requests as not working on python3, though I think that might be a fluke.

These ones might not be a deal-breaker since you can have a separate environment for infrastructure & app code, but for some reason a lot of the infrastructure tools still haven't updated to python 3 (supervisor, ansible, fabric, graphite).

All together, it adds up to a not-insignificant number of things that aren't yet on python3. And even if nothing you use when you first start a project is python2 only, you have no idea what libraries you might need or want in the future and if those might be python2 only.


That's fair. I only have my experience to draw from, but I've personally found it very straightforward.

Gevent said it looks like they're supporting Python3 as of the 1.1 release - http://www.gevent.org/whatsnew_1_1.html

I've used requests with Python3 quite a bit. I have no idea why it's not listed as supported on the WoS, but their page shows it as working since 2012. https://pypi.python.org/pypi/requests/

Protobuf looks like it's now Py3 compatible, per the devs - https://github.com/google/protobuf/issues/646

ThriftPy has supported Python3 for a while (although ThriftPy is slower than the official lib). Apache Thrift has recently begun working with Python3 as well, however (https://issues.apache.org/jira/browse/THRIFT-1857)

I run Ansible/etc in their own virtualenvs, so they aren't part of system python for me, but you're right - I did have to write an Ansible module recently, and I recall that I did have to use Py2.

The only major package I recall having a problem with was PIL - Eventually I moved to Pillow as a drop-in replacement.


I wouldn't really trust the "Python 3 Wall Of Superpowers", as it doesn't appear to be anywhere near current. The first listed example of a 3.x incompatible package is requests[1], which is not only compatible with 2.7--3.4[2], but has been 3.x compatible since 2012-01-23[3].

[1]: http://docs.python-requests.org/en/latest/

[2]: http://docs.python-requests.org/en/latest/#feature-support

[3]: https://pypi.python.org/pypi/requests


protobuf supports Python 3 (though apparently the pip3 version still requires you to manually run 2to3, but then it works), and there's already a port of fabric (ish) to py3 by the original author. And Ansible should stay python2 until distros start shipping py3 by default (which they now have, so I assume Ansible3 will come out soonTM).

If you're willing to do a bit more than "pip install x" (rather than "well, I guess it doesn't work, back to py2"), you can use almost everything on py3. (and yes, requests is ported too)


A freshly ported library of nontrivial size will include many bugs, especially if the porting is done automatically. Not a risk worth taking.


It's not only about big libraries. Unless you do web-dev and can rely only on currently developed libraries, from time to time you will run into a specific, abandoned library. Even if there is just a single such library, Python 2.7 may be preferable.

Moreover, in machine learning Python 2.7 is still considered the default version of Python. (E.g.: some parts of OpenCV need Python 2.7; until recently Spark supported only Python 2.7.)


>> why has Python 3 adoption been so slow

Really this topic is coming to an end. Most libraries support Python 3, and if they don't, there are better alternatives.

For new users and new projects there's no reason now except personal preference to choose Python 2 and in fact beginners who start with Python 2 are just instantly incurring a learning debt upon themselves to be paid down the track when they have to move to Python 3.

The community has some extremely vocal Python2 diehards but their arguments no longer hold water.

>> why has Python 3 adoption been so slow?

Whatever, it's just history now.


What about Google's AppEngine? I would dearly love to run Python 3 code on it, but 2.7 is the latest version that's supported. Google seems to have no roadmap to Python 3 on the platform (Managed VMs don't solve the problem - the SDK does not support Python 3) which leads me to worry that they're going to deprecate support for the language entirely on it.


History it isn't. There is a vast amount of Python 2 code in existence. I still write a lot of Python 2, doing maintenance on existing projects.


What about MySQLdb? That's the main extension module that has been keeping me on Python 2 for server-side software. That, plus a lack of perceived upside to CPython 3, which is still an interpreter with a GIL after all. PyPy 3 might be compelling if it can run MySQLdb or a bug-compatible replacement. True, PyPy still has a GIL, but at least it also has a JIT compiler.


MySQLdb has other issues besides Python 3 support. It's no longer maintained and hasn't been updated in quite some time. The Django documentation (and I) now recommends mysqlclient-python, which is compatible with systems that rely on MySQLdb

https://github.com/PyMySQL/mysqlclient-python


Another endorsement from me. We migrated with

    import pymysql as MySQLdb
and everything just worked.



Check into PyPy-STM, they were able to remove the GIL. This will eventually be merged back into PyPy4 (the 2.7 compatible version).


Perhaps better to ask, why have Python devs vowed never to do this again? There is a reason it will be "just history" soon.


I'm coding exclusively in python3, and I agree the "least small" changes from python2 made it cleaner BUT

> I'm also a fan of map and filter returning generators rather than being hard-coded to a list implementation

I find myself often having to wrap expressions in list(...) to force the lists. (which is annoying)

Generators make things much more complicated. They are basically a way to make (interacting, by means of side-effects) coroutines, which are difficult to control. In most use cases (scripting) lists are much easier to use (no interleaving of side effects) and there is plenty of memory available to force them.

Generators also go against the "explicit is better than implicit" mantra. It's hard to tell them apart from lists. And often it's just not clear if the code at hand works for lists, or generators, or both.

So IMHO generators by default is a bad choice.

> stream fusion is a good thing

I don't think generators qualify for "stream fusion". I think stream fusion is a notion from compiled languages like Haskell where multiple individual per-item actions can be compiled and optimized to a combined action. Python instead, I guess, just interleaves actions, which might even be less efficient for complicated actions.


> I find myself often having to wrap expressions in list(...) to force the lists.

Out of curiosity, why do you need to force the lists?

> Generators make things much more complicated. They are basically a way to make (interacting, by means of side-effects) coroutines

Huh? Generators are a way to not make expensive computations until you have to--as well as to not use memory that you don't need. Basically, if all you're doing with a collection of items is iterating over it (which covers a lot of use cases--but perhaps not yours), you should use a generator, not a list--your code will run faster and use less memory.

> In most use cases (scripting) lists are much easier to use (no interleaving of side effects) and there is plenty of memory available to force them.

Generators don't have to have side effects. And there are plenty of use cases for which you do not have "plenty of memory available" (again, perhaps not yours).

> IMHO generators by default is a bad choice.

I think lists by default was a bad choice, because it forces everyone to incur the memory and performance overhead of constructing a list whether they need to or not. The default should be the leaner of the two alternatives; people who need or prefer the extra overhead can then get it by using list() (or a list comprehension instead of a generator expression, which is just a matter of typing brackets instead of parentheses).
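
In other words, the difference is literally just the brackets:

    lazy     = (x * x for x in range(10))      # generator expression: computed on demand
    realized = [x * x for x in range(10)]      # list comprehension: built up front
    forced   = list(x * x for x in range(10))  # or realize an existing generator with list()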


I think you're referring to:

  for i in range(large_number):
     ...
(similarly enumerate, zip etc)

i.e., where you don't bind the generator to a variable, but instead consume it immediately, and reading from it as an iterator has no other side-effects.

And that's the only usage of generators that in my usage is both common and practical. Other use has always quickly become a mess for the above named reasons.

I don't disagree generators can be occasionally useful (or often, for your special applications). But mostly it's a pain that they are the only thing many APIs return and that they are not visually distinctive (in usage) from lists and iterators. For example

  c = sqlite3.connect(path)
  rows = c.execute('select blablabla')
  do_some_calculation(rows)
  print_table(rows)
Gotcha! "rows" was probably already empty when print_table was supposed to print it. But how can you know? Hunt down all the code, see what the functions do and what they want to receive (lists, iterators, generators? probably they don't even know). And what if the functions change later? Even subtler bugs occur if the input is consumed only partly.

So by far the common (= no billions of rows) sane thing to do is

  rows = list(c.execute('select blablabla'))
Which is arguably annoying and requires a wrapper for non-trivial things.


None of what you say addresses my main point, which is that a list is always extra overhead, so making a list the default means everyone incurs the extra overhead whether they need it or not. You may not care about the extra overhead, but you're not the only one using the language; making it the default would force everyone who uses the language to pay the price.

Or, to put it another way, since there are two possibilities--realize the list or don't--one of the two is going to have to have a more verbose spelling. The obvious general rule in such cases is that the possibility with less overhead is the one that gets the shortest spelling, i.e., the default.

Also, if you've already realized a list, it's too late to go back and un-realize it, so there can't be any function like make_generator(list) that saves the overhead of a list when you don't need it. So there's no way to make the list alternative have the shorter spelling and still make the generator alternative possible at all.

As far as your sqlite3 example is concerned, why can't do_some_calculation(rows) be a generator itself? Then print_table would just take do_some_calculation(rows) as its argument. Does the whole list really have to be realized in order to print the table? Why can't you just print one row at a time as it's generated?

Basically, the only time you need to realize a list is if you need to do repeated operations that require multiple rows all at the same time. But such cases are, at least in my experience, rare. Most of the time you just need one row at a time, and for that common case, generators are better than lists--they run faster and use less memory.

If you need to do repeated operations on each row, you just chain the generators that do each one (similar to a shell pipeline in Unix), as in the example above. This also makes your code simpler, since each generator is just focused on the one operation it computes; you don't have to keep track of which row you're working on or how many there are or what stage of the operation you're at, the chaining of the generators automatically does all that bookkeeping for you.
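
A rough sketch of that pipeline style, reusing the cursor from the sqlite3 snippet above (the per-row bodies are just stand-ins):

    def do_some_calculation(rows):
        for row in rows:
            yield row[0] * 2                   # stand-in for some per-row computation

    def print_table(rows):
        for row in rows:
            print(row)                         # stand-in for real formatting

    # each row streams through the chain one at a time; nothing is realized as a list
    print_table(do_some_calculation(c.execute('select blablabla')))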


What makes me snarky is the replacing of

    '%s %s' % ('one', 'two')
With

    '%s %s'.format('one', 'two')
The latter is just more annoying to type. Stupid argument I know but I find myself grumbling to myself every time...


The second example should be:

    '{} {}'.format('one', 'two')
Either way, the former example still works in Python 3.5, so that syntax hasn't gone away. `format` is preferred, though. This Stack Overflow question has some good answers as to why: http://stackoverflow.com/questions/5082452/python-string-for...


You mean,

    '{} {}'.format('one', 'two')
Your experience may vary, but in my experience, when I switched to .format(), I found a number of bugs in code that used % instead. As mentioned, you can continue using %.


I had the same experience.

I love love love printf() and its ilk, so switching to something else (no matter how well designed) seemed asinine at first.

But .format() has really started to grow on me and did uncover some subtle bugs in old code.


The thing about the latter form is that you can do something like '{0} {1} {0} {2}'.format('apple', 'banana', 'orange') which results in 'apple banana apple orange'


You can also use keyword arguments, which is really helpful in certain situation. Example from the docs:

  'Coordinates: {latitude}, {longitude}'.format(latitude='37.24N', longitude='-115.81W')


You could do that before with % substitution too. But I prefer .format() because (1) it's the new idiom and (2) will coerce to string without me binding the type in the format string.


Simplified to this in Py 3.6:

    f'{one} {two}'


Do you have a reference for that? Wouldn’t that be introducing a huge security hole in all programs?


PEP 498. Literals only, does not add any additional security problems.


Ah, right; I missed the ‘f’ prefix. And since it’s only done when parsing the expression, it is not a security problem. Thanks!


Is it actually happening? Proper string interpolation? In all its extendable glory?


The PEP was accepted, and I believe it can be tried in a nightly or alpha release. It is not extensible though; GvR didn't see much utility in that, yet at least.


It's not replaced, both are perfectly valid. And this addition also exists in Python 2.


No need to get snarky, because the % formatting is still in Python 3. It is also my preferred formatting method.


Let's not also forget string.Template ;)

https://docs.python.org/2/library/string.html#string.Templat...


I actually found a good use for string.Template, but that doesn't mean I understand why it exists in addition to the other two formatting sub-languages.


It was created long before .format and is better in some cases for i18n where simplicity is best.


The super, super controversial automatic interpolation feature is what I'm really looking forward to.


The main reason for most bigger projects is that they rely on that one library that is not compatible with Python3. Although there are fewer and fewer of those fortunately.

Another reason is that the advantage of switching is just not that big, if you already have everything working in Python2.


It's lack of backward compatibility, and not enough of an upside for most developers to go through the effort to upgrade their existing code. It's especially tedious for open source projects, because making a codebase compatible with both 2 and 3 can be a lot of work.


PEP3003 was a moratorium on language changes in order to allow alternate implementations time to catch up. This meant that Python 3.1 and Python 3.2 didn't include any new language features. Releases since the moratorium ended have been much more compelling, IMO.


I think it's because 3.0 and 3.1 were basically "we broke everything, it's slower, and there are no new compelling features." The versions after that have been fine, but I think the first two versions were such clunkers that it created a bad schism, and now that there's a schism nobody wants to support two languages, so people just stick with 2.7. I think if they had just put some new feature that people had to upgrade for in 3.0, they would have avoided this mess, and people would have grumbled for a few months and adapted.

I don't think the lesson is "never break compatibility". The lesson is "don't compete with yourself by releasing a product that is actively worse than your current version"


I think one thing that stalled adoption was the difficulty of migrating to Unicode string constants. In Python 2, you could make a unicode string as u"Entré", but in Python 3, the 'u' was not permitted. Allowing the superfluous u" notation in Python 3 was a big aid in writing 2 and 3 compatible software. Don't remember when that was introduced, Python 3.2?

Within the last six months we've moved to writing all new code in Python 3 and migrating a fair bit of legacy code as well. Been fairly smooth on Linux -- a bit rockier on Windows.


> I think one thing that stalled adoption was the difficulty of migrating to Unicode string constants.

This wasn't just down to the fact that Python 3 didn't support u"" (it was added back in 3.3, for reference), but also down to the fact that much of the ecosystem still supported RHEL5's default of Python 2.4, which meant `from __future__ import unicode_literals` wasn't an option (it is, almost certainly, a less good option, but it's in many ways good enough).


It feels like there are a whole bunch of factors (though I'm no expert).

It took a few versions of 3 to hit a sweet spot (in some cases features that were removed in the initial version of 3 have slowly been re-added in subsequent versions). There were a lot of crucial libs that needed to be ported. Just general inertia.


I think the reason is that it's basically only now that Linux distros are starting to ship Python 3 as the default. When RHEL, CentOS and Debian moves completely to Python 3, the rest of the world will follow.


Is the shipped default really so important, esp. for third-party software? E.g. for RHEL you get python3 packages through Red Hat Software Collections, with support and "intended for production use".

(It of course limits software that is to be shipped with the distro itself)


RHEL/CentOS 7 was released in June of 2014 with Python 2.7.

Following their glacial release schedule, maybe we'll see Python 3 by 2019 in RHEL 8.


Those who value a more rapid package update schedule shouldn't use those distros. One of their most salient features is the ancient software packages.


One of many reasons could also be devs wanting a faster implementation or a better multi-core story, so Python 3 has probably lost ground to languages like Go, Clojure, and JavaScript.


Ok, fine. Can we have the print statement back?


Genuinely curious, why do you prefer `print` to be a statement rather than a function? I've heard a lot of criticisms of Py3, but this is the first time I've heard this one.


I think a function is "better", but the print statement should have been preserved.

Print is used either for:

1) Writing to stdout/stderr

2) Debugging the hacky way

The print function is better for the former (though more often than not, I use Armin Ronacher's click.echo for compatibility)[0], but I vastly prefer the print statement for (2). Not dealing with parentheses is always a plus; I can add and remove print statements far more quickly.

I find it to be an increase in friction, and I don't see any real downside in leaving it in.

[0] http://click.pocoo.org/5/api/#click.echo
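
For reference, a minimal sketch of what that click.echo usage looks like (assuming click is installed; the messages are just placeholders):

    import click

    # click.echo behaves much like print(), but handles bytes vs. unicode
    # consistently across Python 2 and 3
    click.echo("Hello, world!")
    click.echo("something went wrong", err=True)  # write to stderr instead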


I agree with you on 1), I much prefer `print(..., file=sys.stderr)` to `print >>sys.stderr, ...`.

For 2), though, I have always used auto-inserting parens in my vim, so I never really experienced any pain w.r.t. parentheses. I can see how that would be a pain, though.


I prefer the print function for debugging, because I can just do s/print/log.debug if I want to demote debug printing to the logger.
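
A minimal sketch of that substitution, with placeholder names (the logging setup is just the usual boilerplate, not anything specific to this workflow):

    import logging

    logging.basicConfig(level=logging.DEBUG)
    log = logging.getLogger(__name__)

    value = 42  # stand-in for whatever you're inspecting

    # debugging the hacky way:
    print("value is %s" % value)

    # the same line after s/print/log.debug:
    log.debug("value is %s" % value)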


I've heard this complaint multiple times actually, though it was just because it's not what people are used to.

I'm fine with leaving it out, but I would like it back in. The reason is that in my mind Python has been the language of simplicity and elegance, where you are constantly pleasantly surprised by the ease of use. Python, to me, is the language of childish glee. The parentheses on the print function is Python putting on a suit and going to work.


I'm fairly new to Python (within just a few years of my first use of it) but have already been hacking at some Py2 apps to update them to Py3 because, well, Unicode. Much (not most, but much) of the random code I've tried to run with Py3 works without actual code modification beyond import statements.

Two primary exceptions: byte/string encoding in Py2, and print. So much of the Py2 syntax works without change in Py3, and while I wholeheartedly agree with "print('foo')" as the proper way to print something, "print 'foo'" is littered all over the random Py2 scripts I've found in my travels.

I am, in fact, converted to print(), for the record. The first time (in Py2) I tried to extrapolate what I learned about *args, I attempted to do this:

  print *df.columns
Whoops. Then came Py3:

  print(*df.columns)
Huzzah! Instantly converted.



> why do you prefer `print` to be a statement

The best quip I've seen about this is:

   Python 3 broke "hello, world".
I just tried, and the "hello, world" program from K&R still compiles and runs.

The article claimed that Guido started Python back in 1989. Wiki tells me that Python 3 was released in 2008. So, if after nearly 20 years of existence, you break the quintessential program of your (or rather any) computer language, then, ipso facto, you've just abandoned your current language and switched to a new one.

Others have mentioned that the print statement could have been retained, while also allowing a function. I would have preferred that. Not "the Python way" of having one way to do something, but there's also backward compatibility to consider.


I've heard this one many times, but the suggestion is as brainless as it sounds. I'll be surprised if you get anything resembling a decent reason.


in the beginning,

   >>> print "hello world"
was the best marketing slogan for Python's down-to-earth simplicity versus competitors that one could possibly have devised. Not even any brackets!!

It defined the ethos of the language, was the first thing people came across when considering Python for the first time, and probably punched well above its weight in bringing people in.


I never considered the first impression aspect of print statement vs print function.

I feel that simplicity is a driving factor in python's 'virality' from one programmer to the next, and in that context your case is very compelling.


>>> print("hello world")

And that's so much harder to grok?


Short answer: Because the change breaks a lot of code for no reason other than a stylistic one. It wouldn't have been difficult to make it a function while still having the interpreter accept the old invocation.

Though it does serve as a neat diagnostic "this program wasn't built with python3 in mind" message when you encounter the syntax error, so maybe it balances out...


Print as a function is a definite improvement in terms of functionality and API (`print >>fileobj, ...` always looked like syntax from another language). I too thought I'd be bothered by the extra (), mainly due to muscle memory, but after spending a week with it, I could hardly feel any inconvenience.

Setting up an abbrev or a snippet also helps. I use these a lot:

    p<tab>   -> print(|)
    pp<tab>  -> import pprint; pprint.pprint(|)


> `print >>fileobj, ...` always looked like syntax from another language

Yes, AWK specifically.

