First was the idea that normal feature contributors should not see any b"" or any sign of python3 support for the first couple years of the effort. Huge mistake. You need some b"".
But you don't need all b"" everywhere. That was the second huge mistake. Don't just convert every natural string in the whole codebase to b"". The natural string type is the right type in many places, both for python2 (bytes-like) and python3 (unicode-like). The helpers for converting kwargs keys to/from bytes are a sign that you are way off track. This guy got really hung up on the fact that the python2 natural string type is bytes-like, and tried to force explicit bytes everywhere (dict keys, http headers, etc) and was really tilting at windmills for most of these past 5 years.
Yes, you pretty much had to wait for python-3.4 to be released and for python-2.6 to be mostly retired in favor of python-2.7. Then, starting in early 2014, it was pretty straightforward to make a clean codebase compatible with python-2.7 and python-3.4+, and I saw it done for Tornado, paramiko, and a few other smaller projects.
For many programs, yes. Not for a revision control system that needs to be sure it's working with the exact binary data that's stored in the repository. Repository data is bytes, not Unicode.
I think this article is an excellent illustration of the Python developers' failure to properly recognize this use case in the 2 to 3 transition.
For example, when I converted our existing Subversion repository to Mercurial I had to rename a couple of files that had non-ASCII characters in their names because Mercurial couldn't handle them. At least on Windows, file names would be broken either in Explorer or on the command line.
In fact I just checked and it is STILL broken in Mercurial 4.8.2, which I happened to have installed on my work laptop with Windows. Any file with non-ASCII characters in the name is shown as garbled in the command line interface on Windows.
I remember some mailing list post way back when where mpm said that it was very important that hg was 8-bit clean since a Makefile might contain some random string of bytes that indicated a file and for that Makefile to work the file in question had to have the exact same string of bytes for a name. Of course, if file names are just strings of bytes instead of text, you can't display them, or send them over the internet to a machine with another file name encoding or do hardly anything useful with them. So basic functionality still seems to be broken to support unix systems with non-ascii filenames that aren't in UTF-8.
File names are a different problem because Windows and Unix treat them differently: Unix treats them as bytes and Windows treats them as Unicode. So there is no single data model that will work for any language.
This means that there's more overhead on Windows, but it's much better to normalize what the application programmer sees across POSIX and NT while still roundtripping all paths for both than to make the code unit size difference the application programmer's problem like the C++ file system API does.
Seems like an apt acronym for Windows... :-)
On a more serious note, Python seems to have done something fairly similar with the pathlib standard library module.
Are you sure the issue wasn't something else?
(The remarks in the post here that Mercurial on Python 3 on Windows is not yet stable and showing a lot of issues is possibly even an indicator/canary here. To my understanding, Python 2 Windows used to paper over some of these lowest common denominator encoding compatibility issues with a lot more handholding than they do with the Python 3 Unicode assumption.)
Be that as it may, Mercurial has existing repositories that may use non-unicode filenames, and just crashing whenever you try to operate on them is probably not an acceptable way forward.
(I think that relatively recently it is possible to use utf8 with some new windows interfaces ... but this is probably not widely compatible with older windows releases ...)
You have to convert between them, but neither uses proper Unicode to represent filenames.
But I do see the pain with Python 3 where the runtime tries to hide these kinds of issues from you. That abstraction can make it difficult to have the right behaviour.
Git builds a bunch of logic like this in around handling line endings in text files.
Bytes without an encoding don't have any meaning; they are just... random bytes.
I absolutely agree that a string has meaning without a byte representation. That's the whole point of having it as a distinct type.
For all programs, for the simple reason that:
> Various standard library functionality now wanted unicode str and didn't accept bytes, even though the Python 2 implementation used the equivalent of bytes.
Much of the stdlib works with native strings and will either blow up or misbehave if fed anything else, which means much of your codebase will necessarily be native strings, with a subset being explicitly bytes or unicode.
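Keyword arguments are one concrete place where Python 3 insists on native strings. A minimal sketch, assuming Python 3 (the function name `configure` is made up for illustration), of why an all-bytes codebase ends up needing helpers to convert kwargs keys:

```python
def configure(**options):
    """Hypothetical function that just returns the keyword arguments it got."""
    return options


opts_bytes = {b"verbose": True}

try:
    configure(**opts_bytes)
    bytes_keys_accepted = True
except TypeError:
    # Python 3 rejects keyword names that are not str.
    bytes_keys_accepted = False

# Decoding the keys first makes the call work.
opts_str = {key.decode("ascii"): value for key, value in opts_bytes.items()}
result = configure(**opts_str)
```

On Python 2 the same bytes keys are the native string type, so the first call would go through there.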
> Repository data is bytes, not Unicode.
It's also mostly absent from the source code, and where it is present (e.g. placeholders or separators) it's easy to flag as explicitly bytes.
 though some e.g. the encoding layers or io module want either bytes or unicode depending what you're doing specifically, and not always the most sensible, like baseXY being bytes -> bytes conversions where 95% of the use case is to smuggle binary data through text… oh well
This is a problem with the Python 3 standard library; in many places it requires Unicode when it shouldn't.
str is not Unicode. In fact, if you don't use fancy characters, internally it stores text as a byte array.
You should think of text the same as of image or sound, what you see in the screen or hear in the speaker is the actual thing, but if you need to save it on disk you encode it as for example png or wav.
Feel free to s/Unicode/str/ in what I posted if you prefer that terminology. The problem is still the same.
An example of the problem: Python's standard streams (stdin|out|err) in Python 2 are streams of bytes, but in Python 3 they're streams of Unicode (or str if you prefer that terminology) characters. The problem is twofold: first, if my standard streams are hooked to a console, Python can't always properly detect the encoding of the bytes coming from the console, so it can give me the wrong Unicode characters; second, if my standard streams are hooked to pipes, there is no encoding it can pick that is right, since the bytes aren't even coming from a console (where at least there is some plausible argument for saying the user meant to type Unicode characters, not bytes). What Python 3 should have done was keep the standard streams as bytes, since that's the only common denominator you can rely on, and then let the application decide how to decode them if it decides it needs to, just as in Python 2.
If your application works on binary data, you can use sys.stdin/out/err.buffer to get binary version. Most people will use it for text, so the defaults make sense. Personally I would like if there was no automatic conversion when using files/network/pipes etc. but I guess that would make it more confusing for new users, and would be unnecessary boilerplate for most use cases.
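For what it's worth, here is a rough sketch of the binary route. The helper name `copy_bytes` is made up, and in-memory streams stand in for the real pipes so the snippet runs anywhere; in an actual filter you would pass `sys.stdin.buffer` and `sys.stdout.buffer` instead.

```python
import io


def copy_bytes(src, dst, chunk_size=65536):
    """Copy raw bytes from src to dst without decoding; return the byte count."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# Stand-ins for pipes; real use: copy_bytes(sys.stdin.buffer, sys.stdout.buffer)
payload = b"\xff\xfe not valid UTF-8 \x80"
src = io.BytesIO(payload)
dst = io.BytesIO()
copied = copy_bytes(src, dst)
```

Nothing is decoded at any point, so the invalid UTF-8 in the payload passes through untouched.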
Yes, that's the best you can do, but it's still not always correct. I agree that it should be, but "should be" and "is" aren't always the same.
> If your application works on binary data, you can use sys.stdin/out/err.buffer to get binary version.
Yes, but there are still standard library functions that will use the regular streams, and that might conflict with what your application is doing. There is no way to tell Python as a whole "use binary streams everywhere because they are pipes for this application".
> Personally I would like if there was no automatic conversion when using files/network/pipes etc.
That would work if (a) Python could always detect that condition (it can't) and (b) the entire standard library adjusted itself accordingly.
> I guess that would make it more confusing for new users, and would be unnecessary boilerplate for most use cases.
Python 2 worked fine with the standard streams being binary, and applications wrapping them to decode to Unicode when necessary. Python 2.7 even back ported the TextIOWrapper and similar classes to make the wrapping as simple as possible. A similar approach could have been taken in Python 3 (binary streams and a simple wrapper class), but it wasn't.
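The "binary stream plus a simple wrapper" approach looks roughly like this. It is only a sketch: a `BytesIO` stands in for a binary stdin, and the choice of UTF-8 is an assumption the application would have to make for itself.

```python
import io

# Stand-in for a binary standard stream.
raw = io.BytesIO("naïve café\n".encode("utf-8"))

# The application decides the encoding and the error policy explicitly.
text = io.TextIOWrapper(raw, encoding="utf-8", errors="strict")
line = text.readline()
```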
"%s/%s" % (repository_data_1, repository_data_2)
And have it work on Python 2 and 3, you're screwed.
I don't quite understand your example. `b'%s/%s' % (b'abc', b'def')` works in both 2 and 3. So does `u'%s/%s' % (b'abc'.decode('utf8'), b'def'.decode('utf8'))`, if you wanted to get a unicode string out of it.
We're discussing the linked article, so I'm talking in the context of the linked article. I know it works now, but Python 3 initially removed %-formatting for bytes. I guess I should have used the past tense in my comment, "you were" screwed instead of "you are". From the article:
> Another feature was % formatting of strings. Python 2 allowed use of the % formatting operator on both its string types. But Python 3 initially removed the implementation of % from bytes. Why, I have no clue. It is perfectly reasonable to splice byte sequences into a buffer via use of a formatting string. But the Python language maintainers insisted otherwise. And it wasn't until the community complained about its absence loudly enough that this feature was restored in Python 3.5, which was released in September 2015.
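To make that concrete, a small sketch assuming Python 3.5 or later (the variable names are borrowed from the example upthread and the values are made up):

```python
repository_data_1 = b"abc"
repository_data_2 = b"def"

# %-formatting on bytes: works on Python 2 and on Python 3.5+,
# but raised TypeError on Python 3.0 through 3.4.
joined = b"%s/%s" % (repository_data_1, repository_data_2)

# A spelling that also works on the versions without bytes formatting.
joined_portable = b"/".join([repository_data_1, repository_data_2])
```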
Python 3's behavior as far as forcing you to explicitly recognize data type conversions is more correct, yes.
Python 3's behavior in assuming that nobody would ever need to do "text-like" operations like string formatting on byte sequences was not. At least this particular wart was fixed. But there are still a lot of places where Python makes you use the str "textual" data type when it's not the right one.
Python 3's behavior in making individual elements of a byte string integers instead of length-one byte strings is, frankly, braindead.
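A quick illustration of the difference on Python 3, and of slicing as the spelling that behaves the same on both versions:

```python
data = b"abc"

first_item = data[0]     # an int on Python 3; a length-one str on Python 2
first_slice = data[0:1]  # a length-one byte string on both
items = list(data)       # a list of ints on Python 3
```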
You just need to be aware that in some cases the work is already done for you by the language. For example, in Python, if you open a file without the "b" option, Python will do the translation on the fly and you don't need to worry about it.
Sadly they fucked up that part rather thoroughly, because the default encoding is `locale.getpreferredencoding()`, which ensures it's going to be wrong at the least convenient possible time and on the devices least accessible for debugging.
Do not ever use text-mode `open` without specifying an encoding.
 I seem to recall that it used to default to the locale's preferred encoding, but I could have my wires crossed with other languages' standard libraries there.
Which is absolutely not what you want when, say, opening your own data files. Even when opening the user’s files it’s likely not what you want.
> you can also specify …
And what I’m saying is this is not a “can also” it’s a “must”. Not doing so will bite you in the ass, because “whatever random garbage is on the machine” is really not what you want a default to be.
Of course, if you don't know what encoding the file was opened with, you don't know what characters can be written to the file.
I was bitten by this with Python 3.5 on Windows. I naively assumed the default file encoding would be UTF-8 or UTF-16, but it was actually CP-1252, so my program would crash upon trying to write a non-ASCII character.
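A sketch of that failure mode and the fix. The CP-1252 case is reproduced here by encoding directly rather than by relying on the machine's locale, and the file name is arbitrary; the crash happens for any character CP-1252 cannot represent.

```python
import os
import tempfile

text = "\u4e2d\u6587"  # two CJK characters, not representable in CP-1252

try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    # Roughly what a text-mode file opened with a CP-1252 default does on write.
    cp1252_ok = False

# Naming the encoding makes the result independent of the locale.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()
```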
You can also specify encoding when calling open.
They bolted on a separate set of functions that took UCS-2 and now take UTF-16.
The actual code pages, to this day, are legacy things that are mostly 8 bits. My system is set to code pages 437 and 1252, for example.
They put together a code page for UTF-8 but it's behind a 'beta' warning.
NT actually bolted on 8-bit versions of the native Unicode functions. FooBarA is a wrapper around FooBarW.
> They put together a code page for UTF-8 but it's behind a 'beta' warning.
Codepage 65001 has been a thing for quite a while. It's just that it's variable-width per character and few applications are ready to handle that when they assume a 1:1 or 2:1 relationship between bytes and characters. It does work sort of for applications that don't do too weird stuff to text, though, and can be a useful workaround in such cases to get UTF-8 support into legacy applications.
But in general, Windows is UTF-16LE and the code pages are indeed legacy cruft that no application should touch or even use. Sadly much software ported from Unix-likes notices »Hey, there's a default encoding in Windows too, so let's just use that«.
It was just an example of why implicit conversions in the standard library functions don't save you from having to think about encodings. You get much more robust and user-friendly programs when you explicitly consider your encodings and the error-handling strategies to go with them.
The entire 2 to 3 transition is an excellent illustration of Python developers failing to properly recognize the challenges in transition. What other popular language intentionally broke backwards compatibility? It's hard to think of any.
Python set the entire community back 10 years or more by making this drastic mistake.
TBH I do think the problem is easier to address in a statically typed world.
If we imagine an alternative reality where Rust started only with byte-strings and added unicode as an afterthought like Python did, you'd definitely face a massive amount of churn but at least the compiler would yell at you every time you pass a byte string where unicode is expected and vice-versa. Once you'll have fixed all of the errors in the vast majority of cases there's a good chance that your program would work again. It would be very annoying but at least you know clearly where the problems occur.
In Python on the other hand this type of code refactoring is very painful in my experience. You may end up with the same function being called sometimes with unicode and sometimes with bytes. And then you have to look at the call stack to figure out where it comes from. And then you realize that you end up with, say, a list of records which sometimes contain unicode and sometimes byte arrays depending on whether the code that updated them used the old or the new version etc...
And if it turns out that you can't easily reproduce the problem and you just get a bug report sent from somewhere in production then Good Luck; Have Fun.
I agree with you on the benefits of static typing, but let's be clear: Python didn't add unicode as an "afterthought". The initial release of Python predates the initial release of the Unicode standard by almost a year.
Furthermore, even if this were not the case, it took a while before Unicode got any significant adoption among programming languages, well after the release of Python 1.0. I think Java in 1996 was the first language to adopt Unicode.
Python 2 was after UTF-8 in 2000, so with hindsight could have had the foresight to pull this bandaid off then (before a large influx of users), but a corresponding complaint about UTF-8 is that because it was 8-bit safe, a lot of tools also felt they could kick the can on dealing with it more directly (as a default), and Python 2 seems to be among them. Hindsight has told us a lot about the problems to expect (and exactly why Python 3 did what it felt it had to do), but they probably weren't as clear in 2000. (In further hindsight, imagine if Astral Plane Emoji had been standard and been common around 2000 instead of 2010 how much further we might be in consistent Unicode implementation today. I suppose that makes 2010 another red letter date for Unicode adoption.)
That's true, but I would argue that given the difficulty and backlash we've seen moving from Python 2 to Python 3, such a move would have risked destroying Python's rapid forward momentum and condemned it to the ash heap of programming language history.
I'm just saying the move to Python 3 turned out to be a huge deal to a lot of people (it surprised me), and for that reason, trying such a big jump at Python 2 would have been risky and could have derailed Python's forward progress at a critical point.
Would the downvoters like to share their reasons for disagreement?
It probably would have been a lot less risky with so many fewer daily users, so many fewer huge projects to migrate.
When I read that, I was angry on behalf of the people doing the porting work who had their hands tied by it, and I was angry on behalf of the Mercurial developers who, I think, must have been underestimated. It's normal that platforms don't stand still and coding standards on a project evolve over time. Obviously it's not going to fly for open source contributors to be "voluntold" to do porting work, but to be aware of it and accommodate it and know enough about the new platform to mostly avoid creating new work for the porters seems like a small and reasonable ask, especially when you compare it to the effort required to make high-quality contributions in the first place.
I get that there are people who are bitter to this day about Python having a version 3, but surely by 2017 the vast, vast majority of developers who were going to rage quit the Python community over it were already gone.
Keeping blame details (and line-lengths, ha!) was given as the excuse and that is a nice feature and all. However they could have copied the repo over before porting to keep that information and saved time. Wouldn't be surprised if it was eventually lost anyway.
I had to switch back to treating headers as bytes for as long as possible.
It is a stupid client which doesn't send valid ascii for http headers of course.
What I often want to do when reading user data is not treat it as a "encoded string", but just as a stream of bytes. Most data I work with (HTML files, output of other programs) can't be treated as anything but bytes, because people put junk in files / output of programs.
If you are representing strings as bytes, you are intrinsically using an encoding.
> What I often want to do when reading user data is not treat it as a "encoded string", but just as a stream of bytes. Most data I work with (HTML files, output of other programs) can't be treated as anything but bytes, because people put junk in files / output of programs.
Yes, it makes a mockery of the notion that "human readable data is easy". In many cases, you don't want to work with the actual strings in the data anyway, so bytes is the right thing to do.
But yes, this strategy largely avoids encoding issues... until it doesn't.
It's just binary data that might resemble a string. No encoding necessary.
No I didn't. Those bytes came from an external source. My primary job is to preserve the exact sequence, whether I can make sense of it or not.
Oh, actually there was (either us-ascii or, more likely, iso-8859-1). The bytes are just values 0-255; what these values mean is the encoding. You're confused because the encoding was implicit rather than explicit.
It would perhaps be clearer if you had to choose, for example, between ASCII and legacy EBCDIC encoding.
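A tiny sketch of the ASCII versus EBCDIC point, using `cp037`, one of the EBCDIC codecs that ships with Python (the sample word is arbitrary):

```python
word = "Hello"

ascii_bytes = word.encode("ascii")
ebcdic_bytes = word.encode("cp037")  # an EBCDIC code page

# Same text, different bytes. Reading the EBCDIC bytes with the wrong
# implicit encoding yields different characters, not an error.
misread = ebcdic_bytes.decode("latin-1")
reread = ebcdic_bytes.decode("cp037")
```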
If you want to handle all headers, you have to be prepared to just get binary data.
...or a smart malicious actor.
>assuming the world is Unicode is flat out wrong
True, but Py2's approach makes lots of developers assume the world is Latin-1. I see way too many examples of things broken on a Chinese locale environment, including Python's official IDLE.
(Summary of this bug: in 2.x IDLE, an explicit unicode literal used to still be encoded using system's ANSI encoding instead of, well, unicode.)
> This ground rule meant that a mass insertion of b'' prefixes everywhere was not desirable, as that would require developers to think about whether a type was a bytes or str, a distinction they didn't have to worry about on Python 2 because we practically never used the Unicode-based string type in Mercurial.
Requiring developers to think which one it should be is, of course, the whole point of the changes in Python 3 - and it's what produces better apps that are more aware of i18n issues in general and Unicode in particular.
And the complaint doesn't even make sense if taken at face value - if all strings in Mercurial are byte strings, then what is there to think about? just use b'' throughout, no need to worry about anything else. Of course, the devil is in the details, which is reflected by the word "practically" in that sentence - this kinda implies that there are places where Unicode strings are used. At which point you do want the developers to think about bytes vs Unicode.
So the real complaint is that Python switched the defaults in a way that made bytes-centric code more complicated - because it has to be explicit now, instead of the Python 2 world, where bytes was the default, and Unicode had to be requested explicitly. Which, of course, is the right change for the vast majority of code out there, that operates on higher level of abstraction, where "all strings are Unicode by default" is a perfectly reasonable assumption to force.
The article directly answers that question. Many, many things in the standard library now only accept unicode strings, not byte strings. So a wholesale change to b'' everywhere breaks lots of stuff.
> So the real complaint is that Python switched the defaults in a way that made bytes-centric code more complicated - because it has to be explicit now, instead of the Python 2 world, where bytes was the default, and Unicode had to be requested explicitly.
Once again, the article directly states that the default is not the problem. The lack of escape hatches is. Paths are not unicode strings, and pretending they are does not work. Using bytes when you need bytes works only until you need to call a library function that only accepts strings.
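For context, the partial escape hatch Python 3 does offer for paths is the `surrogateescape` error handler (PEP 383), which `os.fsdecode` and `os.fsencode` use on POSIX. A sketch with the codec spelled out explicitly so it behaves the same on every platform; the file name is made up:

```python
raw_name = b"caf\xe9.txt"  # Latin-1 encoded name, not valid UTF-8

# Undecodable bytes are smuggled through as lone surrogates...
as_str = raw_name.decode("utf-8", errors="surrogateescape")

# ...and come back out unchanged with the same error handler.
back = as_str.encode("utf-8", errors="surrogateescape")

# The resulting str is not real text, though: strict encoding rejects it.
try:
    as_str.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The bytes survive the round trip, but any code that later encodes, prints, or logs the str without the same error handler can still blow up.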
The default should always be Unicode with only people writing low-level backup and security tools dealing with bytes.
On Mac paths are some weird NFKD-ish thing, so equality comparisons are complicated.
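A sketch of why those comparisons get complicated, using standard NFD from `unicodedata` as a stand-in for whatever the filesystem actually applies:

```python
import unicodedata

composed = "caf\u00e9"                               # 'é' as one code point
decomposed = unicodedata.normalize("NFD", composed)  # 'e' plus a combining accent

# They render the same but compare unequal until both are normalized.
equal_raw = composed == decomposed
equal_normalized = unicodedata.normalize("NFC", decomposed) == composed
```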
As a rule, if you think that filesystem paths are easy then you're probably ignoring all the edge cases. In an application where you don't deal with arbitrary user files that's fine. In a programming language that's a huge design error.
The author explains later in the article that many system level python 3 apis that are important to a vcs require unicode and won't accept bytes. So apparently it wasn't as easy as just sticking 'b' in front of every literal.
Furthermore, the way they solve it - by using their own wrapper helpers that allow bytes - means that the end result should be b'' throughout, no?
The author made it clear. The issue wasn't just that the default changed. It was that 3.0 took away the ability to always make your choice explicit.
Changing the default would have no effect on code that was always explicit. Going over the code and making all implicit strings explicit would allow them to know when they had full coverage, and also make the code work with both 2 and 3.
With 3, any implicit had to get b added, while any string with u had to be made implicit (drop the u). You couldn't tell by looking at code if it was converted or not. At least that's how I read it.
It's also not that big of a deal in practice, because you could always write a helper function like u('foo') that would call unicode() on Python 2, and just pass the value through on Python 3. This only breaks when you need a Unicode literal with actual Unicode characters inside, which is a rare case - and should be especially rare in something like Mercurial.
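Something along these lines, as a rough sketch (the helper is hypothetical, and the Python 2 branch is shown for completeness but never runs on Python 3):

```python
import sys

if sys.version_info[0] >= 3:
    def u(s):
        """Python 3: native strings are already Unicode, pass through."""
        return s
else:
    def u(s):
        """Python 2: decode an ASCII-only native string to unicode."""
        return unicode(s)  # noqa: F821 (name only exists on Python 2)

greeting = u("foo")
```

As noted, this only covers ASCII-only literals; on Python 2 `unicode()` would fail on non-ASCII bytes.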
I'm also a "non-latin" user and I will keep repeating this point ad nauseam: there would have been many strictly superior solutions to this problem and most of them would have been closer to what we had in Python 2 than 3.
Both rust and go decided to go with Unicode support that is largely based around utf-8 with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3 but that path was dismissed without a lot of evidence.
A Unicode model that was a bad idea in 2005 was picked and we now have it in 2020 where it's a lot worse because thanks to emojis we now are well outside the basic plane.
Both of those are newer languages that happen to take a stance from the day 1. So not quite comparable.
That said, UTF-8 is one of the best pragmatic solutions to this Unicode problem. Most engineers I meet who throw their hands up in the air complaining about Unicode haven't read the simple Wikipedia page for utf-8.
Python 2 was already halfway there; they just had to tweak a few places where bytes are converted to strings. Of course this is easier for newer languages to solve. We can't blame Python for having to provide backward compatibility.
PS: I also blame all the "encoding detection" libraries which exist to try to solve an unsolvable problem. Nobody can detect an encoding, at least not reliably. If these half-assed libraries did not exist, people would have finally settled on UTF-8 and given up on others by now.
Python 3 predates Rust and Go and I can tell you from personal interactions with people how much opposition there was against UTF-8 as either default or internal encoding. A lot of the arguments against it were already not valid then and they definitely are not today.
Python 3 launched despite a lot of vocal opposition against it. I think many do not even remember how badly broken the URL, HTTP and Email modules were when they were first ported to Python 3. There was a complete misunderstanding of what platform abstractions should look like.
All of this was known back then.
But when Python 3 made its decision, it was known to be the wrong thing. People who had done Unicode in other languages told them it was the wrong thing. People who had taken the effort to do Unicode right in Python 2 told them it was the wrong thing. The only people telling them they were doing the right thing were Python 2 programmers who thought they were going to get Unicode support for free without thinking about it (or worse, who had done horribly wrong things in Python 2 - the mess PyGTK wrote itself into, for example).
Python 3 has no excuses for what are now often unusable APIs when you truly do need to process binary data. And all we gained is that we don't need to type "u" before some string constants anymore. It wasn't worth it, and it's still not good.
What do you mean by "free"? Rust requires you to explicitly convert a string to bytes or vice versa, no? Which is pretty much what you do in Python - the only difference I can see is that you have shortcut methods to encode/decode using UTF-8, but semantically they're no different from encode/decode in Python.
[u8] -> str requires a UTF-8 validity check, but is otherwise also internally equivalent to a type cast (i.e., no allocations). I assume this is what Armin meant by "almost" free.
FWIW, I do think that "internally and externally UTF-8" is the best approach to take. If Rust's string type used, say, a sequence of 32-bit codepoints instead, then lots of lower level string handling implementations would be quite a bit slower than their UTF-8 counterparts. (For at least a few reasons that I can think of.) UTF-8 also happens to be quite practical from a performance perspective because it lets you reuse highly optimized routines like memchr in lots of places.
In any case, the abstraction is not lost completely. Rust's string types provide higher level methods without needing to know the encoding used by strings. e.g., You can iterate over codepoints, search, split and so on. The abstraction is intentionally leaky. e.g., You are permitted to take a substring using byte offsets, and if those byte offsets fall in the middle of a UTF-8 encoded codepoint, then you get a panic (or None, depending on the API you use). You are indeed prevented from doing some things, e.g., indexing a string at a particular position because it doesn't make a lot of semantic sense for a Unicode string.
You might call this a "failure" because it leaks its internal representation, but to me, it's actually a resounding success. Refraining from leaking its internal representation in a way that is zero cost would be absolutely disastrous from a performance perspective when implementing things like regex engines or other lower level text primitives.
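For readers who don't know Rust, here is a loose Python analogue of the byte-offset behaviour described above. It is not Rust's API; the helper `slice_utf8` is made up and returns None where the Rust methods would return None or panic.

```python
def slice_utf8(data, start, end):
    """Decode data[start:end] as UTF-8, or return None if a cut splits a code point."""
    try:
        return data[start:end].decode("utf-8")
    except UnicodeDecodeError:
        return None


utf8 = "h\u00e9llo".encode("utf-8")  # 'é' takes two bytes: b'h\xc3\xa9llo'

whole = slice_utf8(utf8, 0, 3)  # offsets land on code point boundaries
split = slice_utf8(utf8, 0, 2)  # offset 2 falls inside 'é'
```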
> In any case, the abstraction is not lost completely. Rust's string types provide higher level methods without needing to know the encoding used by strings. e.g., You can iterate over codepoints, search, split and so on. The abstraction is intentionally leaky. e.g., You are permitted to take a substring using byte offsets, and if those byte offsets fall in the middle of a UTF-8 encoded codepoint, then you get a panic (or None, depending on the API you use). You are indeed prevented from doing some things, e.g., indexing a string at a particular position because it doesn't make a lot of semantic sense for a Unicode string.
> You might call this a "failure" because it leaks its internal representation, but to me, it's actually a resounding success. Refraining from leaking its internal representation in a way that is zero cost would be absolutely disastrous from a performance perspective when implementing things like regex engines or other lower level text primitives.
I'd argue that offering APIs that can panic is a poor tradeoff in a default/general-use/beginner-facing type. There's maybe a place for a type that implements the same traits as strings while also offering unsafe things like indexing by byte offset (if it's really impossible to achieve what's needed in a safe way, which I'm dubious about), but it's a niche one for specialist use cases (even if it might be the same underlying implementation as the "safe" string type).
And yes, you can index by byte offset in a zero cost way by converting the string to a byte slice first.
Have you used Rust strings (or any similarly designed string abstraction) in anger before? It might help to get some boots-on-the-ground experience with it.
Do you mean that if you have bytes, but you want to send them to a function that expects a string, then it would automatically interpret the bytes as UTF-8?
If so, that violates the "Explicit is better than implicit" part of the Zen of Python. Encoding/Decoding bytes to/from strings shouldn't happen automatically because doing so means you have to make an assumption about the encoding.
No, the types are separate and not implicitly converted P2-style, however "unicode strings" are guaranteed to be proper UTF8 so encoding to UTF-8 is completely free, and decoding from UTF8 just requires validating.
Python's maintainers rejected this approach because "it doesn't provide non-amortised O(1) access to codepoints", and while Python 3 broke a lot of things they sadly refused to break this one completely useless thing, only to have to come up with PEP 393 a few years later.
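The effect of PEP 393 can be observed from Python itself. This is a sketch that relies on a CPython implementation detail (3.3 or later), so treat the comparison as illustrative rather than guaranteed by the language:

```python
import sys

n = 1000

# CPython picks 1, 2 or 4 bytes per code point depending on the widest
# character in the string, which keeps indexing O(1).
ascii_size = sys.getsizeof("a" * n)
bmp_size = sys.getsizeof("\u4e2d" * n)
astral_size = sys.getsizeof("\U0001F600" * n)
```

On CPython the three sizes grow in that order, so a single emoji in an otherwise ASCII string roughly quadruples its memory footprint.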
Also, as explained in those docs, if and when you are absolutely sure that the Vec or slice of bytes is valid UTF-8, you could use the following "unsafe" methods to not incur the overhead of validation (warnings in the docs):
It doesn't. Go's internal string encoding is UTF-8 and it can even be malformed. Go in fact does pretty much what Python 2's byte strings did, except that string operations such as converting to uppercase or iterating over runes understand UTF-8 and Unicode.
> Go in fact does pretty much what Python 2's byte strings did, except that string operations such as converting to uppercase or iterating over runes understand UTF-8 and Unicode.
Why do you care about the internal representation, though? What are you gaining if Go's string and Python's str can both express all characters? In Go you still need to convert a string into bytes when doing I/O.
Hindsight is 20/20 naturally, but in retrospect, they should have just made `bytes` into the name for old `str` and used `from __future__ import` to create a gradual system for moving from 2 to 3 instead of a big bang "we'll break everything once and then never again".
I think this is misreading the author's criticism. The fact that string literals are now Unicode is not the fundamental problem; the fact that standard library APIs that formerly took bytes now incorrectly take Unicode strings is the problem.
IMO it's great that the world is moving towards opaque blobs of Unicode for strings, but that requires understanding when something shouldn't simply be a string in the first place (for reasons of legacy or otherwise).
>Perhaps my least favorite feature of Python 3 is its insistence that the world is Unicode
>standard library APIs that formerly took bytes now incorrectly take Unicode strings
What do you mean by "incorrectly"?
It's nothing to do with "places", points in your program, or entry points into the stdlib. It's entirely about what path names you need to process, and for large classes of software you have zero control over that. If you have a path that doesn't encode properly with your LC_CTYPE, you're in for a bad time with Python 3. (Of course you won't if you control all your own path names, but then you also don't have a problem assuming and enforcing ASCII.)
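To make the failure mode concrete, here is a small sketch of how Python 3 smuggles undecodable path bytes through `str` using the `surrogateescape` error handler (the mechanism behind `os.fsdecode` and `os.fsencode` on POSIX). The file name is a made-up example, the helpers `decode_path`, `encode_path`, and `is_printable_utf8` are hypothetical, and UTF-8 is assumed as the locale encoding:

```python
# Made-up file name: Latin-1 bytes that are not valid UTF-8.
raw_name = b"r\xe9sum\xe9.txt"


def decode_path(raw, encoding="utf-8"):
    """Bytes -> str, mapping undecodable bytes to lone surrogates."""
    return raw.decode(encoding, "surrogateescape")


def encode_path(text, encoding="utf-8"):
    """str -> bytes, turning the lone surrogates back into the original bytes."""
    return text.encode(encoding, "surrogateescape")


def is_printable_utf8(text):
    """True if the string can be strictly encoded as UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


as_text = decode_path(raw_name)

if __name__ == "__main__":
    print(ascii(as_text))                   # contains escaped lone surrogates
    print(encode_path(as_text) == raw_name)
    print(is_printable_utf8(as_text))
```

The round trip is lossless, but the intermediate `str` is not something you can print, log, or serialize strictly, which is where the "bad time" tends to start.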
People were still migrating home systems to Unicode-compatible encodings long after Py3 came out. I still find files in archives with paths in weird (and undeclared/undeclarable) encodings. Lots of people had such files; non-native English speakers were the most likely to have them.
> Python is far from the only popular tool to assume paths must be valid unicode.
It and Java are the only ones I use regularly. Java doesn't have a good reputation for playing well with the outside world, vs. Python which had been sold for years as "better shell scripts."
There’s only every single input from the system at large, no big.
C# char is a UTF-16 code unit, not a Unicode code point.
Most code points "fit" into just one UTF-16 code unit, but not all.
For example: 𝐀 ("Mathematical Bold Capital A", code point U+1D400) is encoded in UTF-16 as a surrogate pair of code units: U+D835 and U+DC00. So reversing "x𝐀y" should produce "y𝐀x" ("y\ud835\udc00x") - note how U+D835 and U+DC00 were not reversed in the result.
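The thread is about C#, but the same example can be sketched in Python by working on UTF-16 code units explicitly. The helper names below are hypothetical. Reversing the code units splits the surrogate pair, while reversing code points keeps it intact:

```python
import struct

SAMPLE = "x\U0001D400y"  # "x", MATHEMATICAL BOLD CAPITAL A, "y"


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def reverse_by_code_unit(text):
    """Reverse the UTF-16 code units; return None if the result is invalid UTF-16."""
    units = utf16_code_units(text)[::-1]
    data = struct.pack("<%dH" % len(units), *units)
    try:
        return data.decode("utf-16-le")
    except UnicodeDecodeError:
        return None


def reverse_by_code_point(text):
    """Python 3 str is a sequence of code points, so slicing keeps pairs intact."""
    return text[::-1]


if __name__ == "__main__":
    print([hex(unit) for unit in utf16_code_units(SAMPLE)])
    print(reverse_by_code_unit(SAMPLE))
    print(ascii(reverse_by_code_point(SAMPLE)))
```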
API members that operate on code points universally take a string and an index.
That being said, treating strings as arrays of characters is fraught with peril in most cases anyway. You can't trivially reverse strings in any encoding, as you need to reverse the sequence of grapheme clusters (to account for diacritics, etc.). You can't trivially truncate strings either, for pretty much the same reason. You can't trivially grab a single character from the middle of a string, again, for the same reason. So basically, indexing, reversing, truncating, copying a subsequence, etc. are all not trivially possible regardless of the encoding. UTF-16 is not the main problem here, as even in UTF-32 it'd be broken.
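A quick illustration of the grapheme point in Python, where `str` already indexes by code point. The `rough_clusters` helper is a deliberately crude approximation that only glues combining marks to the preceding character; it is not a full implementation of Unicode grapheme cluster segmentation:

```python
import unicodedata

WORD = "noe\u0301l"  # "noël" written with a combining acute accent on the "e"


def naive_reverse(text):
    """Reverse code points; combining marks end up on the wrong base character."""
    return text[::-1]


def rough_clusters(text):
    """Group each combining mark with the character before it (approximation only)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def rough_reverse(text):
    """Reverse by approximate cluster instead of by code point."""
    return "".join(reversed(rough_clusters(text)))


if __name__ == "__main__":
    print(ascii(naive_reverse(WORD)))  # the accent now follows the "l"
    print(ascii(rough_reverse(WORD)))  # the accent stays with the "e"
```

Even in a code-point-indexed string the naive reversal is wrong, which is the point being made about UTF-32.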
Making strings Unicode by default is wonderful compared to the alternatives (and OP's assertion that this amounts to "assuming the world is Unicode" is disingenuous: there's nothing stopping programs from handling bytes correctly - Python 3 merely resolved the ambiguity).
The decision of a default encoding surely dates back to Python 1.0 or earlier, which predates not just UTF-8 but even Unicode itself. Python is an old language!
And if the assertion is that Python 2.0 should have made the tumultuous Unicode jump when it released in 2000, I could get behind that (especially in retrospect!), but enthusiasm for both Unicode and UTF-8 was not nearly as high then as it is today, so I don't begrudge them for not jumping at the opportunity.
Ruby 1.8 had "everything as bytes" and there was no concept of encodings.
Ruby 1.9 introduced explicit encodings on every string. By default, strings would be encoded as the same encoding as your source file. The default was ASCII. You could control this explicitly with a magic comment, and so many folks added the "UTF-8" comment, to get strings encoded as utf-8 by default.
Ruby 2.0, which was not as large a transition as Ruby 1.8 -> 1.9, even though it sounds like a larger one, said that encodings of files were UTF-8 by default, and therefore, strings generally became UTF-8 by default as well. Most folks just removed their magic comments.
I feel like this is the essence of the article: specific constraints/choices of Mercurial made their port to Python 3 difficult. Working with early Python 3 certainly did not help. But there seems to have been some stubbornness here mixed with a lot of retroactive justification.
> One was that the added b characters would cause a lot of lines to grow beyond our length limits and we'd have to reformat code.
This is almost ridiculous. You are going to write a JIT partial 2to3 instead of just increasing your length limits and/or using an autoformatter? (Of course, it turns out they eventually did do that... after a bit more stubbornness regarding the autoformatter.)
> So I'm not sure six would have saved enough effort to justify the baggage of integrating a 3rd party package into Mercurial.
Couldn't this have been a very occasional copy and paste, instead of a downstream dependency? [six](https://six.readthedocs.io/) "consists of only one Python file, so it is painless to copy into a project."
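For a sense of scale, here is a sketch of the kind of shim that can be pasted into a project. It is written in the spirit of six but is not six's actual code: `PY2`, `text_type`, and `binary_type` mirror names six exposes, while `to_bytes` and `to_text` are hypothetical helpers for this example.

```python
import sys

PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode  # noqa: F821 (this name only exists on Python 2)
    binary_type = str
else:
    text_type = str
    binary_type = bytes


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding text if necessary."""
    if isinstance(value, binary_type):
        return value
    if isinstance(value, text_type):
        return value.encode(encoding)
    raise TypeError("expected text or bytes, got %r" % type(value))


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding bytes if necessary."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, binary_type):
        return value.decode(encoding)
    raise TypeError("expected text or bytes, got %r" % type(value))


if __name__ == "__main__":
    print(to_bytes(u"abc"), to_text(b"abc"))
```

Whether a shim like this would have covered enough of Mercurial's needs is a separate question; the sketch only shows that the copy-and-paste cost is small.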
> Initially, Python 3 had a rather cavalier attitude towards backwards and forwards compatibility.
Yes, can't disagree. Early adopters who attempted to write 2- and 3- compatible code suffered the most.
Having just done transitions on a number of much smaller projects I had the same thought. Changes to string handling tripped me up and the changes to relative imports took some thinking. But the biggest frustration was the nagging question: Why am I doing this?
edit: missing word
Lack of security updates past 2019 forced our hand. Did you find a way around that?
Amazon is maintaining Python 2 for at least 4 years, as part of their Amazon Linux long term support release. Google app engine will support Python 2 for an unknown amount of time; they haven't announced an end date. PyPy is Python 2, with (to the best of my limited knowledge) no plans to deprecate support. There are also other LTS releases out there which include Python 2 support.
IOW, the forcing function of the PSF no longer supporting Python 2 is not as big a factor as was hoped.
For example, the python-saml package (for managing SAML-based single sign-on) has separate Python 2 and Python 3 versions, and implements a security-sensitive protocol which means it has (in the fairly recent past) gotten security updates for issues serious enough to rate an assigned CVE. If you're using it, having the current maintainers walk away from the Python 2 version is a serious risk...
Is Amazon planning to support pytest for at least 4 years? It will have its last 2.7-supporting release very soon.
It's particularly uncool that Guido brought up the prospect of lawyers (https://github.com/naftaliharris/tauthon/issues/47#issuecomm...) to force it not to be called Python and was opposed to letting people who care about keeping Python 2 alive evolve it as "Python 2". (I know he has the legal right to insist on the name change. Still uncool.)
Besides if the Tauthon people are serious about maintaining their fork long term it needs to become more than a mere fork and a real language ecosystem of its own, in the long run having a different name will probably help with that, assuming that they ever get there.
EDIT: Also reading the rest of the thread I realize that the post that you linked out of context is slightly misleading (but I blame github's aggressive folding more than you here). Guido's answer comes after the following exchange:
stefantalpalaru: "Disregard Guido's objection. The "Python" trademark doesn't extend to "py2" or "py28". Read this for details: https://www.python.org/psf/trademarks/"
Guido: "Isn't the whole point that we're trying to solve this without lawyers?"
stefantalpalaru: "The whole point is that you've been sabotaging Python 2 for years and when someone does what needed to be done from the start, you come up with silly objections."
Guido: "OK, bring in the lawyers."
In that light, and given the other poster's ridiculously inflammatory take, Guido's answer seems rather level headed and appropriate IMO. He stands his ground, so to speak.
Please note that Perl 6 has been renamed to Raku (https://raku.org using the #rakulang tag on social media). So Perl and Raku are now considered to be different languages, albeit from the same inspiration.
Now, if Python 2 people would decide to rename Python 2 to something else, I guess it would be a mirrored parallel :-)
It's not a mirrored parallel, it's the Python folks learning from Perl's mistakes and making sure that this parallel won't come to be.
But I'm very disappointed that the Python Software Foundation isn't explicitly supporting people who want to keep Python 2 compiling and running on modern systems. I think that would be well within their remit to "promote, protect, and advance the Python programming language".
This is particularly so because Python is widely used for scientific purposes, and being able to reproduce old results is valuable.
Even before Python 3.0 appeared, I came across scientists saying "I prefer to stick with Fortran because new Python versions break old code too frequently".
This case is different, because it's a project that uses the Python name, but actively adds features to the language. This is the classic example of brand confusion - someone might try to use it, find something to complain about, and PSF's reputation suffers as the result. They also get support overhead from the users of the fork (even if all they do is tell them to go away, that is still triage time that could be spent on other issues).
You can always download an old version and the respective libraries and use them to reproduce any results you want. That doesn't mean that old version should be supported anymore.
I don't see Guido as in the wrong for that. It'd be a smack in the face when you spend years trying to finally push people to switch (for better or for worse) and then a project like this takes the SEO and gets to run freely with it.
Imagine if Stroustrup had done D and insisted that it be called C++ and wanted everyone to stop using the language everyone knew as C++ on Jan 1st 2020.
They aren't stopping people from using Python 2, the language or Python 2, the software.
They are stopping people from using the name “Python” as the name of forked implementations of Python 2 not maintained by the PSF. No implementation not maintained by PSF is allowed to be called unqualified Python; the name is an important indicator of provenance. There are and have been plenty of third-party Python (2 and otherwise) implementations, the implementations just need their own names.
The effort to claim the binary name python for Python 3 is actively hostile to Python 3 and a thing that runs unmodified Python 2 unmodified on the same operating system installation. (It's unclear to me how much this is a PSF push, but at least the PEP isn't telling distros to refrain from this hostile-to-compatibility action.)
> No implementation not maintained by PSF is allowed to be called unqualified Python
The best situation would be PSF hosting continued Python 2-compatible development by people who want to do the work.
For who? This costs the PSF manpower/overhead that they don't want to expend on a thing they don't want to maintain. It dilutes the language that the PSF are stewards of, and would further cause a schism in the python community. None of those things sounds good for python, its ecosystem, or the PSF. They sound good for, like, a few curmudgeonly companies and individuals that don't want to migrate.
I can't parse your first sentence, so I can't respond to it.
For users of the Python 2 language who have a lot of Python 2 code and for whom migration doesn't make cost/benefit sense on technical merits of Python 3.
There's Tauthon. There's ActiveState's long-term support for Python 2. There's presumably Red Hat's long-term support for Python 2. There are probably others. Also, there's the need to keep the server side of pip up and running for these to work.
It would be great if there was a common venue for collaboration for these by the parties who are interested in keeping Python 2 going. (I'm not suggesting that Python 3 core devs should do the work.) Like a foundation for Python software.
The first sentence meant that claiming the command-line executable name python for Python 3 is hostile to letting an execution environment for Python 3 and an execution environment for Python 2 co-exist going forward without having to modify existing programs that assume that python is for Python 2 and python3 is for Python 3.
Yes, but I don't believe I've seen any (real) suggestions to change PEP 394.
> There's Tauthon.
Which I claim is actively bad for python's ecosystem in the long term. It shouldn't be supported by any organization that wants what is best for Python.
> There are probably others. Also, there's the need to keep the server side of pip up and running for these to work.
That works just fine without any help. pypi continues to support python2 tags and wheels, and I doubt that'll change anytime soon.
> There's ActiveState's long-term support for Python 2. There's presumably Red Hat's long-term support for Python 2.
So the entire reasonable bit here is that the PSF should provide something to help various enterprise companies manage backporting security patches. Which, like, I'm not sure what infrastructure is actually needed for that. They already make security patches public. Unless you're suggesting that LTS enterprise support offerings should co-ordinate additional feature work on python 2, which is both unusual and again I claim actively harmful to the ecosystem.
If you have a large amount of Python 2 code that doesn't make sense to rewrite as Python 3 but does make sense to keep developing as opposed to just keep running as-is, it makes sense to want compatibility-preserving improvements to the language.
That such improvements are considered actively harmful comes from a point of view where there's a top-down imperative to shut down Python 2 in order to make Python 3 succeed. It's not harmful from the point of view of the code people have written in Python 2 being valuable.
The notion that the user community needs to work for Python (by porting to Python 3) and that Python 2 needs to be shut down, as opposed to Python development valuing the existing code that had been developed, is the core problem with Python 3.
But it really doesn't. If the new features are that valuable, you can convert your code. It's not actually that hard (I have a few 100kloc ported forward now, with millions of lines of dependencies that says so).
Any project that forks changes name:
nagios -> icinga
mysql -> mariadb
NetBSD -> OpenBSD
FreeBSD -> DragonflyBSD
Python -> PyPy, Jython, IronPython
It would be crazy for them to keep the same name and not be compatible. It would cause confusion and also lead to increase of support tickets in wrong bug trackers.
Anyway, the core problem is a top-down effort to try to make a programming language of Python 2.x’s level of usage stop to the extent it’s stoppable under its license, because its creators wanted to do something else, as opposed to facilitating its user community to pool resources to continue its development. Does the PSF have a legal obligation to do such facilitation? No. Is the lack of such facilitation bad for parties who bought into Python when it was Python 2? Yes.
They absolutely are. In fact, python 3.9 is in the works right now, which has many new evolutions beyond 2.7.
You're arguing that the psf should treat python2 and 3 as different languages. In their (and my) opinion, this is harmful. It bifurcates python into two incompatible languages. That's bad long term (Perl).
In other words, what's best for python the language, and what's best for python2 the language are not the same. And for the psf, python is more important.
I meant compatible (in the sense that old programs keep running and you can add new stuff to old programs using the new features) evolutions.
> You're arguing that the psf should treat python2 and 3 as different languages.
For practical purposes, they are different languages and the PSF has been treating them as distinct things.
> In their (and my) opinion, this is harmful. It bifurcates python into two incompatible languages. That's bad long term (Perl).
It indeed is bad. I hope that every other programming language community and designer takes a close look at what happened and makes sure never to do a Python 3 analog of their language.
> In other words, what's best for python the language, and what's best for python2 the language are not the same. And for the psf, python is more important.
That's the core problem from the perspective of Python 2 users. The organization that was the steward of the language that they invested in (in the form of writing code in the language) decided not only that a different programming language is more important for the org but that the old language needed to be shut down in order to benefit their new thing.
It's OK for people to get bored with a project and move onto something else, but with the level of usage that Python 2 had and has, it's very problematic for the language steward organization to turn around and seek to shut the language down instead of continuing to evolve it in a way that's respectful of the language users' investment in the language.
You had like 10 years of warning and it's "disrespectful"? I don't think there's a chance of productivity if you're starting from that baseline level of entitlement. Sure, mandates are annoying. But I just can't fathom that.
There are tradeoffs.
That Python 2 and 3 can't co-exist in an app is pretty bad in comparison.
You're mistaken. I have python3 binaries and python2 binaries that share dependencies.
You're correct that fully automatic transpilation is impossible, but that doesn't mean that there can't be shared source. It does however mean that things like per-file flags or whatnot aren't possible. Python became a better language with text vs. bytes support, but that support couldn't be done in a backwards compatible way. Oh well.
> You can add Rust to your app without rewriting all C++.
It's not as good as you seem to think. It's a nonstarter for a lot of people otherwise interested in adopting rust into existing codebases. Certainly not better than the py2/3 situation.
Kotlin interop also is troublesome, although granted better than rust/cpp or py2/3.
> That Python 2 and 3 can't co-exist in an app is pretty bad in comparison.
That python didn't get replaced by a different language is an incredible testament to the foresight of the python language stewards.
How does reusing the name facilitate the development? Every time there is a fork of an open-source project the name changes, precisely to avoid confusions. Reusing the Python name in a fork that is not just a redistribution, but a new version with new features and syntax, is just confusing, unusual and does not help anyone.
There's no `--std=python2` flag you can pass to the interpreter, unfortunately.
Indeed, C++ has rarely made any breaking changes. A decade or so ago, GCC did cause some major ecosystem breakage, by cracking down on C++ constructs which had never been valid according to the spec but which GCC had previously allowed. When that happened, there was a flag to (at least partially) revert to the old behavior: -fpermissive.
This literally does not parse. How do you know "nobody" cares about those exceptions?
Dilution of what is commonly accepted to be Python would not be a good thing, and would further add to confusion.
I know that platform upgrades are painful, but we need to move with the times or we'll all be mired in technical debt and old technology.
The whole point of Tauthon is that it is compatible with Python 2 (in the direction that old programs work).
Consider anyone who wants to build something with Python, whether it's a library, application, or service. What's better, having to build for Python 3 and 2, or just Python 3?
Thank God that Guido did this, despite knowing all the blowback he'd get. To me, that's super cool.
For example, https://blog.khinsen.net/posts/2017/11/16/a-plea-for-stabili... describes the "Molecular Modelling Toolkit (MMTK), which might well be the oldest domain-specific library of the SciPy ecosystem, will probably go away after 2020. Porting it to Python 3 is possible, of course, but an enormous effort (some details are in this Twitter thread) for which resources (funding plus competent staff) are very difficult to find."
The thread at https://twitter.com/khinsen/status/930749714567434240 includes "Lots of C modules written for Python 1.4 are waiting for enthusiastic code archeologists ;-)".
I don't think Hinsen is alone in that situation. I can well believe there are some people who, for example, plan to retire in about 5 years and would rather stick with a Python 2 zombie than spend time to port working code to Python 3.
I'll admit Python 3 is still slower at a lot of things. But that feels like saying your new dog is even worse at math than your old one.
The C extension thing isn't Python's fault. It's the job of library and app authors to update. Do we complain that Vulkan has bad SunOS support? This is totally backwards.
Could Hinsen (and others) not just version their deps? It's not like people are erasing Python 2 off the internet. If his main worry is reproducibility, he should be doing that anyway.
I don't want to give the impression I like the whole Python 3 thing. I think it was a pretty big mistake and a huge missed opportunity. I'm very sympathetic to people who had to put in a lot of work for basically no good reason--Python 3 didn't really offer anything significantly better than 2 until... 3.5 (3.4 if you think the first pass at async was useful, I personally don't).
But I also find the ballyhooing about it really insufferable. Yeah it was a mistake; Armin Ronacher (as usual) was right. It was also over 11 years ago. Time to forget all about this and build cool stuff, please please please.
Try "python -vv -c 'pass'" - I'm only showing the first few dozen lines, and I've trimmed some of the paths for conciseness:
% python -vv -c 'pass'
import _frozen_importlib # frozen
import _imp # builtin
import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
import '_warnings' # <class '_frozen_importlib.BuiltinImporter'>
import '_weakref' # <class '_frozen_importlib.BuiltinImporter'>
# installing zipimport hook
import 'zipimport' # <class '_frozen_importlib.BuiltinImporter'>
# installed zipimport hook
import '_frozen_importlib_external' # <class '_frozen_importlib.FrozenImporter'>
import '_io' # <class '_frozen_importlib.BuiltinImporter'>
import 'marshal' # <class '_frozen_importlib.BuiltinImporter'>
import 'posix' # <class '_frozen_importlib.BuiltinImporter'>
import _thread # previously loaded ('_thread')
import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
import _weakref # previously loaded ('_weakref')
import '_weakref' # <class '_frozen_importlib.BuiltinImporter'>
# miniconda3/lib/python3.7/encodings/__pycache__/__init__.cpython-37.pyc matches miniconda3/lib/python3.7/encodings/__init__.py
# code object from 'miniconda3/lib/python3.7/encodings/__pycache__/__init__.cpython-37.pyc'
# trying miniconda3/lib/python3.7/codecs.cpython-37m-darwin.so
# trying miniconda3/lib/python3.7/codecs.abi3.so
# trying miniconda3/lib/python3.7/codecs.so
# trying miniconda3/lib/python3.7/codecs.py
# miniconda3/lib/python3.7/__pycache__/codecs.cpython-37.pyc matches miniconda3/lib/python3.7/codecs.py
# code object from 'miniconda3/lib/python3.7/__pycache__/codecs.cpython-37.pyc'
import '_codecs' # <class '_frozen_importlib.BuiltinImporter'>
import 'codecs' # <_frozen_importlib_external.SourceFileLoader object at 0x10dd44c90>
# trying miniconda3/lib/python3.7/encodings/aliases.cpython-37m-darwin.so
# trying miniconda3/lib/python3.7/encodings/aliases.abi3.so
# trying miniconda3/lib/python3.7/encodings/aliases.so
# trying miniconda3/lib/python3.7/encodings/aliases.py
# miniconda3/lib/python3.7/encodings/__pycache__/aliases.cpython-37.pyc matches miniconda3/lib/python3.7/encodings/aliases.py
# code object from 'miniconda3/lib/python3.7/encodings/__pycache__/aliases.cpython-37.pyc'
import 'encodings.aliases' # <_frozen_importlib_external.SourceFileLoader object at 0x10dd67d10>
import 'encodings' # <_frozen_importlib_external.SourceFileLoader object at 0x10dd440d0>
# trying miniconda3/lib/python3.7/encodings/utf_8.cpython-37m-darwin.so
# trying miniconda3/lib/python3.7/encodings/utf_8.abi3.so
# trying miniconda3/lib/python3.7/encodings/utf_8.so
# trying miniconda3/lib/python3.7/encodings/utf_8.py
# miniconda3/lib/python3.7/encodings/__pycache__/utf_8.cpython-37.pyc matches miniconda3/lib/python3.7/encodings/utf_8.py
# code object from 'miniconda3/lib/python3.7/encodings/__pycache__/utf_8.cpython-37.pyc'
import 'encodings.utf_8' # <_frozen_importlib_external.SourceFileLoader object at 0x10dd44bd0>
import '_signal' # <class '_frozen_importlib.BuiltinImporter'>
# trying miniconda3/lib/python3.7/encodings/latin_1.cpython-37m-darwin.so
# trying miniconda3/lib/python3.7/encodings/latin_1.abi3.so
# trying miniconda3/lib/python3.7/encodings/latin_1.so
# trying miniconda3/lib/python3.7/encodings/latin_1.py
... many, many more lines omitted ...
One of the things that bugged me in Python2 was that every startup imported UserDict:
# trying python2.7/UserDict.so
# trying python2.7/UserDictmodule.so
# trying python2.7/UserDict.py
# python2.7/UserDict.pyc matches python2.7/UserDict.py
import UserDict # precompiled from python2.7/UserDict.pyc
% python2.7 -c 'import os; print(os.environ.__class__.__bases__)'
(<class UserDict.IterableUserDict at 0x1029b14c8>,)
% python3 -c 'import os; print(os.environ.__class__.__bases__)'
# trying python3.6/collections/abc.cpython-36m-darwin.so
# trying python3.6/collections/abc.abi3.so
# trying python3.6/collections/abc.so
# trying python3.6/collections/abc.py
# python3.6/collections/__pycache__/abc.cpython-36.pyc matches python3.6/collections/abc.py
# code object from 'python3.6/collections/__pycache__/abc.cpython-36.pyc'
import 'collections.abc' # <_frozen_importlib_external.SourceFileLoader object at 0x103c2ecf8>
It's just hard to fix.
I'm not sure the os.environ example I gave is low-hanging fruit now. The collections.abc module might be imported anyway.
This is neat! Python 3.7 added the `PYTHONPROFILEIMPORTTIME=1` environment variable to help track down these sorts of import overheads:
% env PYTHONPROFILEIMPORTTIME=1 python -c pass
import time: self [us] | cumulative | imported package
import time: 523 | 523 | zipimport
import time: 722 | 722 | _frozen_importlib_external
import time: 156 | 156 | _codecs
import time: 2254 | 2409 | codecs
import time: 1293 | 1293 | encodings.aliases
import time: 7192 | 10893 | encodings
import time: 1108 | 1108 | encodings.utf_8
import time: 182 | 182 | _signal
import time: 1069 | 1069 | encodings.latin_1
import time: 395 | 395 | _abc
import time: 1486 | 1881 | abc
import time: 1540 | 3420 | io
import time: 100 | 100 | _stat
import time: 975 | 1075 | stat
import time: 1481 | 1481 | genericpath
import time: 1734 | 3214 | posixpath
import time: 2558 | 2558 | _collections_abc
import time: 2234 | 9079 | os
import time: 1407 | 1407 | _sitebuiltins
import time: 3498 | 3498 | sitecustomize
import time: 85 | 85 | usercustomize
import time: 4129 | 18196 | site
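If you want to sort or post-process that report rather than eyeball it, something like the following sketch can help. It uses `-X importtime`, the command-line equivalent of the environment variable (Python 3.7+), and parses the stderr lines in the format shown above. The function names `import_times` and `slowest` are made up for this example, and the default statement `import json` is arbitrary.

```python
import subprocess
import sys

PREFIX = "import time:"


def import_times(statement="import json"):
    """Run a child interpreter with -X importtime and parse its stderr report.

    Returns a list of (module_name, self_us, cumulative_us) tuples.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", statement],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    rows = []
    for line in proc.stderr.splitlines():
        if not line.startswith(PREFIX):
            continue
        fields = line[len(PREFIX):].split("|")
        if len(fields) != 3:
            continue
        try:
            self_us = int(fields[0])
            cumulative_us = int(fields[1])
        except ValueError:
            continue  # the header line has column titles instead of numbers
        rows.append((fields[2].strip(), self_us, cumulative_us))
    return rows


def slowest(rows, count=5):
    """Return the rows with the largest self time first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:count]


if __name__ == "__main__":
    for name, self_us, cumulative_us in slowest(import_times()):
        print("%8d us self  %8d us cumulative  %s" % (self_us, cumulative_us, name))
```

The numbers vary from run to run and with a warm or cold file cache, so treat a single run as a rough guide only.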
For some of my programs, Python startup time is the main overhead. I avoid NumPy and SciPy if at all possible because they have a huge startup overhead.
Some of this is inherent in those packages. NumPy internally imports everything so someone can do "import numpy as np; np.package.subpackage.module.function()" without doing the intermediate imports.
This means NumPy is optimized for programmers (especially novice programmers) using NumPy in long-lived processes where startup cost is a negligible overhead.
Which isn't all use-cases for numeric computing.
15 years ago I supported a CGI-based web app. It was very important to pull out all the stops (delay imports until needed, use zip packages) because it was easier to do that than to re-write everything for another architecture.
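The "delay imports until needed" part can be as simple as moving an import into the code path that uses it. Here is a sketch with a hypothetical request handler, where `handle_request` and its query format are invented for illustration:

```python
def handle_request(query):
    """Hypothetical CGI-style handler that only pays for imports it uses."""
    if query.get("format") == "json":
        # Deferred import: requests that never ask for JSON never load it.
        import json

        return json.dumps({"value": query.get("value")}, sort_keys=True)
    return "value=%s" % query.get("value")


if __name__ == "__main__":
    print(handle_request({"value": 1}))
    print(handle_request({"value": 1, "format": "json"}))
```

The trade-off is that import errors surface at request time instead of at startup, so this is worth doing only where startup cost actually dominates.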
The dog does count pretty well after all.
> It's the job of library and app authors to update.
Why? Linus Torvalds doesn't agree with you, for one.
As Hinsen points out,
] Unfortunately, the need for long-term stability is rather specific to scientific users, and not even all of them require it (see e.g. these two tweets by Titus Brown). So while Python 3 is probably a step forward for most Python users, it’s mostly a calamity for computational science.
Some scientific code has been able to run unchanged since the 1970s, through multiple new Fortran language releases.
Now, yes, I know the reasons for the changes to Python. I know the funding and organizational realities.
But why not recognize that for some situations Python 3 is not better?
Hinsen also comments on your proposal:
] The implication is that breaking changes in the infrastructure layers are OK and must be absorbed by the maintainers of layers 3 and 4. In view of what I just said about layer 4, it should be obvious that I don’t agree at all with this point of view. But even concerning layer 3, I find it a bit arrogant. The message to research communities with weaker code development traditions, and thus fewer resources, is that their work doesn’t matter.
> Could Hinsen (and others) not just version their deps?
He addresses that, I think. One of the other commenters gives a more complete reply at https://metarabbit.wordpress.com/2017/11/18/numpy-scipy-back... ending "Freezing the versions solves some problems, but does not solve the whole issue of backwards compatibility.".
> Time to forget all about this and build cool stuff, please please please.
I'll quote Hinsen again "I find it a bit arrogant. The message to research communities with weaker code development traditions, and thus fewer resources, is that their work doesn’t matter."
Your implicit statement is that mmtk (Hinsen's code base) isn't "cool stuff". Why? Simply because it's old, or because you don't know about it or need it? What other cool old stuff will die because it's part of a community without the resources to update?
Instead, accept that that loss is part of the trade-offs, be empathetic to those who suffer, and bear those lessons in mind for future work you do.
Second, I admit to engaging in hyperbole when I said "Python 3 is better in every way"; usually I'm on the other side of these, but I'm just so fed up with people complaining. But you're right, there are still ways Python 3 isn't "better". I'd love to have productive, technical discussions about them, but we can't seem to get beyond the "Python 3 was a super bad idea" stuff, and I'm totally uninterested in that.
But beyond that, you and I are mostly talking about different things. Python 3 isn't NumPy or SciPy. If you're building extensions on top of them, you need to look at their compatibility commitments. If you want them to make more commitments, you have to convince them. This isn't specific to software engineering; this is due diligence for anything you're gonna put years of work into.
Django's page is a great example of this. Python has one too. I don't have any idea about SciPy/NumPy; it looks like SciPy 1.2.0 was an LTS release supported until 1/1/2020, but what do I know.
But importantly, the end result of this "hey, do 100x the work otherwise our science won't be reproducible" stuff will be to force people out of producing free software for scientific computing. And the non-free stuff is expensive, good god. Surely this isn't what you want.
A better tactic here is to work with the developers in establishing more compatibility between releases. You probably aren't gonna get Fortran levels of compatibility--a language and platform that's seen very, very little change over the decades. But then again, the core selling point of scientific Python is that you get to use a modern platform with modern features. Asking for that along with a 50 year compatibility guarantee is a laughably tall order: you can't have it both ways without exponential amounts of work. So just like you're asking other engineers to be empathetic and respect your need for more compatibility with your extensions, you need to be more empathetic and respect their resources. And the best place to do that is probably their contact page, not Twitter, HN, or random blogs.
Perhaps your "fed up"-ness means you overlook conversations which do go beyond that? Or do you put me into that category as well?
> Python 3 isn't NumPy or SciPy. ... this is due diligence for anything you're gonna put years of work into.
Hinsen's essay discussed these issues related to "software layers and the lifecycle of digital scientific knowledge". He put Python in layer 1, and NumPy/Scipy in layer 2.
In his essay he also said "I would like to see the SciPy community define its point of view on these issues openly and clearly. ... It’s OK to say that the community’s priority is developing new features and that this leaves no resources for considering stability. But then please say openly and clearly that SciPy is a community for coding-intensive research and that people who don’t have the resources to adapt to breaking changes should look elsewhere. Say openly and clearly that reproducibility beyond a two-year timescale is not the SciPy community’s business, and that those who have such needs should look elsewhere."
So I'm not convinced that we are talking about different things as you are making points I already referred to, albeit indirectly.
I'm also not sure you understood all of Hinsen's points. I say this because you wrote ""hey, do 100x the work otherwise our science won't be reproducible" stuff"
But Hinsen said "Layer 4 code is the focus of the reproducible research movement" and "the best practices recommended for reproducible research can be summarized as “freeze and publish layer 4 code”" -- a solution you mentioned earlier.
It's just that reproducibility isn't the only goal for stability.
Another is to be able to go back to a 15 year old project and keep working on it, without taking the hit of rewriting it to a new, albeit similar, language.
I also have a small amount of umbrage about your comment:
> So just like you're asking other engineers to be empathetic and respect your need for more compatibility with your extensions, you need to be more empathetic and respect their resources.
I earlier wrote "Now, yes, I know the reasons for the changes to Python. I know the funding and organizational realities."
Did you overlook that because of your '"fed up"-ness', or was that not enough for you?
I do put you in that category, because you seem to be focused much more on the negative, rather than being constructive and trying to find solutions to problems.
> I'm also not sure you understood all of Hinsen's points. I say this because you wrote ""hey, do 100x the work otherwise our science won't be reproducible" stuff"
I've read and directly disagreed with his essay. His points are:
- Python 2 going away orphans a lot of software, because there's a lack of resources/willingness to port to Python 3.
- Python 3 didn't provide enough value to the scientific community to justify all the breakage (this is true for almost every community, btw).
- SciPy breaks compatibility roughly every 2-3 years, which is a bad fit for the pace of scientific computing.
- Beyond that, breaking compatibility threatens reproducibility.
- The SciPy community doesn't seem to know or care about compatibility concerns.
- Projects written on top of SciPy libraries ("Layer 3" code) have to keep updating, and they don't always have resources/willingness to do that.
- It would be cool if SciPy laid out a support schedule.
- It isn't cool that SciPy says, "hey use us", and then breaks compat all the time.
- There are some languages/platforms that haven't changed in decades, this isn't an excuse.
Here's what I've said:
- Agree Python 3 didn't provide enough value.
- If you want to build something on SciPy that you expect to last for decades, you should look for a compat guarantee. If you don't, that's on you.
- If you want new features plus decades of compat, that's a ludicrous amount of work.
- If you want to find a way forward, start a dialogue with SciPy devs.
Hinsen's examples of Fortran and Java are illuminating. Fortran's a platform that's seen very minimal evolution over its history. That's exactly the reason people want to use SciPy instead of Fortran. Java's a platform with... billions of engineering hours? It's ironic that a guy who doesn't want to spend the resources to update his own software is asserting that someone else can continually deliver a modern scientific computing platform with new features while never breaking compat, they just don't feel like it ("It's all a matter of policy, not technology"). That's wrong, it's a question of resources.
My diagnosis here is communication breakdown. Everyone here wants the same thing: use a modern software stack for scientific computing. So again I'll say get on the mailing lists, get on IRC, go to the conferences, and talk to the engineers. Be constructive.
Guido has absolutely every right here.
More amazing to me is that in Catalina, the release famous for breaking just about everything else, “Python 2” is still there and works as it always has! Of course, Apple did announce that it will be ripped out in the next release. :)
What I don't think people realize is that not only are you expected to move to 3.x, but you'll have to keep up or fall behind with new 3.x releases. During that same period (since 2008) 3.x has had 9 big releases. Of course that 2.x stability was done with the assumption you'd move to 3.x and isn't sustainable for PSF indefinitely.
They did? Damn, I was using that...
Every huge task, like porting from Python 2 to Python 3, is either everybody's task or just a small group's. And although the latter seems more reasonable because it doesn't interfere with ongoing development, the former is the only way I have seen such tasks succeed.
Artificial rules to create comfort for one group at the expense of another group, like the following
>> This ground rule meant that a mass insertion of b'' prefixes everywhere was not desirable, as that would require developers to think about whether a type was a bytes or str, a distinction they didn't have to worry about on Python 2 because we practically never used the Unicode-based string type in Mercurial.
sound pretty much wrong to me.
If there is a pain, it should become everybody's pain, or otherwise people will simply burn out and hate their own work, like the author did. There is no way porting to Python 3 can be harder than porting to Rust. Rust is statically typed and not garbage collected. Everyone would have to think about whether they need a string or an array of bytes anyway, and also about who owns them.
Overall, the described situation looks like a management issue and not a technical one to me.
The author addresses this. The difference is that when porting to Rust you'd likely get a faster and more correct program in the end. (Huge caveat of big rewrites, of course). Whereas with Python 3 they feel like they did all the porting work and got nothing valuable in return.
The Rust compiler statically checks those decisions, while in Python issues with string types will only be caught at run-time, so everywhere your test suite has missing coverage, porting is likely to introduce regressions. That is one way in which a Rust port would be easier.
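To illustrate the run-time half of that, here is a minimal sketch (the function and its `verbose` flag are made up for the example): the str/bytes mix-up sits in a branch the happy path never touches, so nothing complains until that branch actually runs.

```python
def describe(path: bytes, verbose: bool = False) -> bytes:
    """Hypothetical helper with a latent str/bytes bug in one branch."""
    if verbose:
        # Bug: str + bytes. Python only notices when this line executes.
        return "path: " + path
    return path


# The common path works, so a test suite that never sets verbose passes.
ok = describe(b"a.txt")

# The rarely used path fails at run time with a TypeError.
try:
    describe(b"a.txt", verbose=True)
    error = None
except TypeError as exc:
    error = exc
```

A type checker run over the annotations would flag that line ahead of time, but that is opt-in tooling rather than something the interpreter enforces.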
It would take quite a bit of change in a language for a port to be safer than an upgrade, but it's not completely impossible.
The end result of this is that I just spent a good chunk of last week reviewing a pull request with 70,000 lines of changes, which was one of the final in a series of ~10k line pull requests that came in through the fall.
All of this was the heroic effort of one of my coworkers who had the unenviable task of combing through our entire codebase to determine "This is unicode. This is bytes. Here is an api boundary where we need to encode / decode." etc.
It was a nightmare of effort that I'm glad to have behind us.
The issue is they changed the types out from underneath you.
And then left it to each library to decide which type it was actually going to accept.
Really the string transition was just a poor choice in my opinion. Python2 already had unicode strings that were easy enough to specify (just prefix with a `u`).
It would have been better to just delineate that barrier better from an API standpoint.
I understand the appeal of having unicode for the default string literal type, but it was actively hostile to existing projects.
You do, but it's easy: run a compile, fix the errors, repeat until no more errors.
> It would have been better to just delineate that barrier better from an API standpoint.
Isn't that exactly what the Python 3 transition was? i.e. stop accepting non-unicode "strings" (actually just arbitrary byte sequences) for APIs that semantically require a string, reserve them for APIs that actually want a byte sequence.
The reason this doesn't work is that previously the double-quote literal was a "string" type. The string type was, yes, just a sequence of bytes, but in an ascii-centric world that also meant text.
Python2 added unicode string literals that accepted unicode code points. Most APIs were happy to sloppily mix the two and generally work quite adequately.
Python3 then made the hard distinction between byte-string and unicode-string. Not an unreasonable position to take on the face of it. The issue is many python2 APIs were written from the perspective of "accepts string literal types", where that could be either bytestring or unicode string.
Now suppose you have a large codebase in python that spans the entire stack from database interaction, to webserver, to desktop application. All built on double quoted string literals, accepting unicode strings in the places that needed them (mainly user-facing places) and utf-8 bytestrings anywhere data is stored on disk or sent over the network.
Then you go to switch to python3, and suddenly all of your string literals are interpreted as unicode instead of bytestring / ascii sequences. So now you need to go through every place in your codebase that accepts strings as an argument and determine, "is this a user-facing string, or a utf-8 bytestring", because they used to be basically the same thing, and now they aren't.
It's not "difficult" really, it's just a pain in the neck.
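A minimal sketch of what that audit tends to converge on, assuming UTF-8 at the edges (all function names here are invented for illustration): decode once on the way in, work on `str` internally, encode once on the way out.

```python
def read_record(raw: bytes) -> str:
    """Boundary in: bytes from disk or the network become text once."""
    return raw.decode("utf-8")


def write_record(text: str) -> bytes:
    """Boundary out: text becomes UTF-8 bytes once."""
    return text.encode("utf-8")


def greet(name: str) -> str:
    # Internal, user-facing logic only ever sees str.
    return "Hello, " + name


wire = "Zo\u00eb".encode("utf-8")  # what arrives from the network
reply = write_record(greet(read_record(wire)))
```

The tedious part is not writing these helpers; it is deciding, for every existing function, which side of the boundary it lives on.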
Python is dynamically typed and weakly typed, but still typed. That's precisely the problem! The difference is just that a statically typed language gives you all the information, and a dynamically typed language doesn't, but still fails. Just without providing you the necessary information up-front.
There's a nice explanation here: https://existentialtype.wordpress.com/2011/03/19/dynamic-lan...
People who claim that dynamic typing is a thing claim that Python is strongly typed. (This is of course nonsense; there's no such thing as dynamic typing, because types are by definition something that expressions in a language have, not something that runtime values have).
> There's a nice explanation here: https://existentialtype.wordpress.com/2011/03/19/dynamic-lan....
That is not a "nice explanation". It is writing to obscure rather than to clarify. And it certainly acknowledges that one cannot have differently typed values in a dynamic language.
However the makers of Delphi spent many years preparing for this, so when the time came for us to switch we only had to spend half a day or so to migrate our half a million lines of code.
No because much of the stdlib works in terms of native strings and will choke (or worse silently fuck up) on the other. Yes also in python 2, the stdlib was absolutely not “unicode clean”.
So a transitional / polyglot codebase has and needs not 2 but 3 string types: bytes, unicode, and native. And neither “unicode literals” nor “bytes literals” were good things to apply across the board.
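As a rough sketch of the third type, this is the sort of helper a polyglot codebase ends up carrying (the name `to_native` and the UTF-8 default are assumptions for the example, not anything from Mercurial or six):

```python
import sys

PY2 = sys.version_info[0] == 2


def to_native(value, encoding="utf-8"):
    """Return value as the interpreter's native str type.

    On Python 2 the native str is bytes-like; on Python 3 it is
    unicode-like. Hypothetical helper for illustration only.
    """
    if isinstance(value, str):
        return value  # already native on this interpreter
    if PY2:
        return value.encode(encoding)  # unicode -> native (bytes-like) str
    return value.decode(encoding)  # bytes -> native (unicode-like) str


from_bytes = to_native(b"Content-Type")
from_text = to_native(u"Content-Type")
```

Anything handed to a stdlib API that insists on native strings goes through a helper like this, which is exactly the baggage that neither blanket "unicode literals" nor blanket "bytes literals" removes.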
It took years before the advent of six, Python 3 u’’ literals, and modernize. The author discusses this at length.
The compiler support for C++11 (and especially inconsistencies in Debian packages, compiled flags, etc) was a very painful issue for several years. But auto is that useful ...
Python 3 could have required all strings began with u" or b", but they didn't - they did something which encouraged breakage.
> So I'm not sure six would have saved enough effort to justify the baggage of integrating a 3rd party package into Mercurial. (When Mercurial accepts a 3rd party package, downstream packagers like Debian get all hot and bothered and end up making questionable patches to our source code. So we prefer to minimize the surface area for problems by minimizing dependencies on 3rd party packages.)
Isn't this more a problem with Python not easily differentiating between String and Byte types? Both Go and Rust ("""systems""" level languages) have decided that "utf-8 ought to be enough for anybody" and that seems to be a good decision.
There was this assumption that Unicode code points were the correct single unit to talk about Unicode. You iterate over code points, you talk about string lengths in terms of code points, you slice in terms of code points. Much like the infamy of 16-bit Unicode, this is an assumption that has kinda gotten worse over time. Now we can and do want to talk about bytes, code points, and newer sets like extended grapheme clusters. I think this is probably the big failing of Python 3's Unicode model. Making a string type operate on extended grapheme clusters might fix it, but we'd be in for the same sort of pain, and the flexibility of "everything is bytes, we can iterate over it differently" of Go and Rust is much nicer in comparison.
The second thing was this assumption that everything remotely looking like text was Unicode, despite this maybe not being true. HTTP has parts that look like plain text, like "GET" and "POST" and the headers like "Content-Type: text/html". But the correct way to view this is as ASCII bytes, and no other encoding makes sense; binary data intermixed with "plain text" definitely happens, and the need to pick and choose between either Unicode or Bytes caused major damage in the standard library which still persists to this day -- some parts definitely chose the wrong side. Take a look at the craziness in the "zipfile" module for one other example. It's probably fixed now, but back then, I basically had to rewrite it from scratch in one of my other projects.
They eventually relented and added back a lot of the conveniences to blur the line between bytes and unicode again, like adding the % formatting operator for bytes, which I think shows that their insistence on separating the two didn't really pan out in practice. And yet, migration is still a pain.
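For reference, the bytes `%` operator came back in Python 3.5 (PEP 461). A small sketch of what it does and does not allow:

```python
# %-formatting on bytes works again on Python 3.5+ (PEP 461).
header = b"%s: %d\r\n" % (b"Content-Length", 42)

# It still refuses to smuggle text in: str is not accepted for %s.
try:
    b"%s" % ("text",)
    mixing_error = None
except TypeError as exc:
    mixing_error = exc
```

So the convenience returned, but only bytes-to-bytes; the str/bytes wall itself stayed up.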
It would "kinda work out", if your Unicode strings were ASCII in practice, and only then. Because whenever a Unicode and a non-Unicode string had to be combined, it used ASCII as the default encoding to converge them.
Which is to say, it only worked out for English input, and even then only until the point where you hit a foreign name, or something like "naïve". Then you'd suddenly get an exception - and it happened not at the point where the offending input was generated, but at the point where two strings happened to be combined.
This was a horrible state of affairs for basically everybody except the English speakers, because there was a lot of Python code out there that was written against and tested solely on inputs that wouldn't break it like that.
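Python 2 itself can't be shown running here, so this is a Python 3 emulation of that behaviour under the stated assumption that the implicit step was an ASCII decode (the helper name is made up):

```python
def py2_style_concat(text: str, raw: bytes) -> str:
    """Mimic Python 2's implicit coercion: decode the byte string as ASCII."""
    return text + raw.decode("ascii")


# Pure-ASCII input: works, so the bug stays hidden in testing.
fine = py2_style_concat("name: ", b"Smith")

# First non-ASCII input: blows up at the point of concatenation,
# far away from wherever the bytes were originally produced.
try:
    py2_style_concat("name: ", "na\u00efve".encode("utf-8"))
    failure = None
except UnicodeDecodeError as exc:
    failure = exc
```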
Intermixing binary data with text can be represented just fine in a type system where the two are different. For your HTTP example, the obvious answer is that the values that are fundamentally binary, like the method name or the headers, should be bytes, while the parts that have a known encoding should be str - there's nothing there that requires actually mixing them in a single value. In those very rare cases where you genuinely do have something like Unicode followed by binary followed by Unicode in a single value, that is trivially represented by a (str, bytes, str) tuple.
The problem with the Python stdlib isn't that bytes and Unicode are distinct. It's that it's overly strict about only accepting Unicode in some places where bytes should be legal, too. This is orthogonal to them being separate types.
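A minimal sketch of that split, not a real HTTP parser (the request line, header dict, and hard-coded charset are all illustrative assumptions):

```python
# Protocol-level pieces stay as bytes.
request_line = b"GET /index.html HTTP/1.1"
method, target, version = request_line.split(b" ")
headers = {b"Content-Type": b"text/plain; charset=utf-8"}

# Only the part with a known encoding becomes str.
body = "na\u00efve".encode("utf-8")
text = body.decode("utf-8")  # charset assumed to be utf-8 for this sketch

# The rare "text, then binary, then text" value is just a tuple.
mixed = ("header text", b"\x00\xff\x10", "trailer text")
```

Nothing here needs a single value that is half text and half bytes; the two types sit side by side.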
They could have just changed the default encoding to utf8. (For those too lazy to configure their Python properly.)
There, problem solved - and no need for a breaking Python 3.
Even such a breaking change would be a molehill compared to the mountain of breaking changes in Python 3.
Point is, they had one job, and they failed.
The most messed-up thing about Python 3 is that it's supposed to be justified by doing Unicode right and they still got it wrong.
Having strings be sequences of Unicode code points is a super-bizarre design. That is, Python 3 strings indeed are semantically sequences of Unicode code points rather than sequences of Unicode scalar values. You can not only materialize lone surrogates (defensible for compatibility with UTF-16) but you can also materialize surrogate pairs in addition to actual astral characters. You still can't materialize units that are above the Unicode range, though, so it's not like C++'s std::u32string.
Looking at the old PEPs, it appears to have arisen by accident rather than as an actual design.
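The claims above are easy to check in an interpreter. A short sketch:

```python
lone = "\ud800"          # a lone surrogate is a legal str
pair = "\ud83d\ude00"    # a surrogate pair, kept as two code points
astral = "\U0001F600"    # the actual astral character

lengths = (len(lone), len(pair), len(astral))
same = pair == astral    # the pair is not folded into the astral character

# A lone surrogate is not a scalar value, so strict UTF-8 encoding fails.
try:
    lone.encode("utf-8")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

# Values above the Unicode range cannot be materialized at all.
try:
    chr(0x110000)
    range_error = None
except ValueError as exc:
    range_error = exc
```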
BTW: I believe HTTP headers are supposed to be encoded using ISO-8859-1. It's essentially the same thing as US-ASCII, but it covers an entire byte.
Go has string and byte, and you can't mix them; you have to cast. Java has String, char and byte, and similarly you need to cast. Rust has Bytes and String (I don't know Rust well enough, but I'm pretty sure it doesn't do implicit conversion between them).
Also, Python 3 doesn't distinguish between Bytes and Unicode; Python 3 distinguishes between bytes and text (str - BTW: Guido actually expressed regret that he used "str" instead of "text", because it would have been much clearer).
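Leaving aside what the HTTP specs actually require, the "covers an entire byte" property is simple to demonstrate: every byte value decodes under ISO-8859-1 and round-trips unchanged, while ASCII stops at 0x7F.

```python
every_byte = bytes(range(256))

# ISO-8859-1 maps byte N to code point N, so all 256 values decode.
as_text = every_byte.decode("iso-8859-1")
round_trip = as_text.encode("iso-8859-1")

# ASCII only covers the lower half.
try:
    b"\xe9".decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
```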
In Python 3 you don't have Unicode (as far as you should be concerned), you have text and bytes, how the bytes are stored internally is an implementation detail, if you need to write to a file or to network, you encode the text using various encodings (most popular is UTF-8) and you decode it back when reading.
> In Python 3 you don't have Unicode (as far as you should be concerned), you have text and bytes
Python 3 strings store Unicode code points. When you iterate over a Python 3 str, you get back Unicode code points. As mentioned elsewhere, this is not a Unicode scalar value, and can include things like unpaired surrogates. This is also not an extended grapheme cluster, which is the current best-effort description as to what counts as a "single character".
So, you really do need to be concerned about what your strings contain. If you don't want people to care, don't give them the ability to iterate, slice, or index into str to retrieve Unicode code points, and leave them as opaque blobs, as some of those other languages do.
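A small sketch of why this leaks into everyday code: the same visible character can be one code point or two, and `len`, indexing, and equality all see the difference.

```python
import unicodedata

decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

code_points = list(decomposed)  # iteration yields code points
first = decomposed[0]           # indexing splits the visible character
equal = decomposed == composed  # str equality compares code points
```

Both strings render as the same single character, yet they have different lengths and compare unequal unless you normalize first.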
Yes, but at this point you're arguing about implementation details. The idea is that if you use it as a string it is string, if you need bytes, you need to perform a conversion. It shouldn't be your concern how it is stored internally.
If we are going into Python internals, the string can be stored as multiple versions from basic C-string to unicode code points. If you perform conversion it will cache the result so it can be used in other places. I don't remember the details, since I looked at the code a long time ago, but it isn't that simple.
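For what it's worth, on CPython 3.3 and later (PEP 393) the storage width depends on the widest code point in the string. This sketch is CPython-specific and only compares relative sizes; other implementations may behave differently.

```python
import sys

ascii_text = "a" * 1000             # fits in 1 byte per code point
bmp_text = "\u0100" * 1000          # needs 2 bytes per code point
astral_text = "\U00010000" * 1000   # needs 4 bytes per code point

# Same length in code points, different memory footprint on CPython.
lengths = [len(s) for s in (ascii_text, bmp_text, astral_text)]
sizes = [sys.getsizeof(s) for s in (ascii_text, bmp_text, astral_text)]
```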
This is not an implementation detail, it is fundamental to how the str type in Python 3 operates. I have not talked at any point about the internal storage of this type, just the interface it publicly exposes.
When working with e.g. filepaths, Rust has an OsStr type.