Mercurial’s journey to and reflections on Python 3 (gregoryszorc.com)
411 points by ngoldbaum on Jan 13, 2020 | 367 comments



I've been involved in multiple non-trivial libraries and frameworks that supported both python2 and python3 for many years with the same codebase ... and it really wasn't anything like this. The python3 "adaptation" effort for mercurial was just bungled by multiple terrible decisions.

First was the idea that normal feature contributors should not see any b"" or any sign of python3 support for the first couple years of the effort. Huge mistake. You need some b"".

But you don't need all b"" everywhere. That was the second huge mistake. Don't just convert every natural string in the whole codebase to b"". The natural string type is the right type in many places, both for python2 (bytes-like) and python3 (unicode-like). The helpers for converting kwargs keys to/from bytes are a sign that you are way off track. This guy got really hung up on the fact that the python2 natural string type is bytes-like, and tried to force explicit bytes everywhere (dict keys, http headers, etc), and was really tilting at windmills for most of these past 5 years.
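To make that concrete, here is a rough sketch of the kind of kwargs-key shim I mean (the helper names are made up for illustration, not Mercurial's actual API). It exists only because f(**kwargs) requires str keys on Python 3, so a bytes-keyed options dict has to be converted at every call site:

    import sys

    if sys.version_info[0] >= 3:
        def strkwargs(opts):
            # bytes keys -> str keys, so the dict can be splatted as **kwargs
            return {k.decode('latin-1'): v for k, v in opts.items()}

        def byteskwargs(opts):
            # str keys (from a **kwargs signature) -> bytes keys for internal use
            return {k.encode('latin-1'): v for k, v in opts.items()}
    else:
        def strkwargs(opts):
            return opts

        def byteskwargs(opts):
            return opts

    def _internal(**opts):
        opts = byteskwargs(opts)           # back to bytes keys internally
        return opts[b'rev']

    opts = {b'rev': b'tip'}                # "everything is bytes" style dict
    print(_internal(**strkwargs(opts)))    # conversion needed at every call site

Needing a pair of helpers like this just to pass options around is the smell being described.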

Yes, you pretty much had to wait for python-3.4 to be released and for python-2.6 to be mostly retired in favor of python-2.7. Then, starting in early 2014, it was pretty straightforward to make a clean codebase compatible with python-2.7 and python-3.4+, and I saw it done for Tornado, paramiko, and a few other smaller projects.
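For what it's worth, the single-codebase style that worked for those projects looked roughly like this (an illustrative sketch of the common idioms, not any particular project's code):

    from __future__ import absolute_import, division, print_function, unicode_literals
    import sys

    PY3 = sys.version_info[0] >= 3

    if PY3:
        text_type = str
        binary_type = bytes
    else:
        text_type = unicode        # noqa: F821 (only defined on Python 2)
        binary_type = str

    def to_text(value, encoding='utf-8'):
        """Coerce bytes to text at the boundary; leave text alone."""
        if isinstance(value, binary_type):
            return value.decode(encoding)
        return value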


> The natural string type is the right type in many places

For many programs, yes. Not for a revision control system that needs to be sure it's working with the exact binary data that's stored in the repository. Repository data is bytes, not Unicode.

I think this article is an excellent illustration of the Python developers' failure to properly recognize this use case in the 2 to 3 transition.


I was an early adopter of Mercurial and the team's insistence that file names were byte strings was the cause of lots of bugs when it came to Unicode support.

For example, when I converted our existing Subversion repository to Mercurial I had to rename a couple of files that had non-ASCII characters in their names because Mercurial couldn't handle it. On Windows, at least, the file names would be broken either in Explorer or in the command line.

In fact I just checked and it is STILL broken in Mercurial 4.8.2, which I happened to have installed on my work laptop with Windows. Any file with non-ASCII characters in the name is shown as garbled in the command line interface on Windows.

I remember some mailing list post way back when where mpm said that it was very important that hg was 8-bit clean since a Makefile might contain some random string of bytes that indicated a file and for that Makefile to work the file in question had to have the exact same string of bytes for a name. Of course, if file names are just strings of bytes instead of text, you can't display them, or send them over the internet to a machine with another file name encoding or do hardly anything useful with them. So basic functionality still seems to be broken to support unix systems with non-ascii filenames that aren't in UTF-8.


> the team's insistence that file names were byte strings was the cause of lots of bugs when it came to Unicode support

File names are a different problem because Windows and Unix treat them differently: Unix treats them as bytes and Windows treats them as Unicode. So there is no single data model that will work for any language.


The Rust standard library has a solution for this that actually works: on Unix-like systems file paths are sequences of bytes, and most of the time the bytes are UTF-8. On Windows, they are WTF-8, so the API user sees a sequence of bytes and most of the time they match UTF-8.

This means that there's more overhead on Windows, but it's much better to normalize what the application programmer sees across POSIX and NT while still roundtripping all paths for both than to make the code unit size difference the application programmer's problem like the C++ file system API does.


> On Windows, they are WTF-8

Seems like an apt acronym for Windows... :-)

On a more serious note, Python seems to have done something fairly similar with the pathlib standard library module.
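If I recall correctly, the underlying mechanism is PEP 383's surrogateescape error handler, which os.fsdecode and os.fsencode use so that undecodable filename bytes still round-trip through str. A minimal sketch (assuming a Linux system with a UTF-8 locale):

    import os

    raw = b'caf\xe9.txt'             # latin-1 bytes, not valid UTF-8
    name = os.fsdecode(raw)          # undecodable bytes become lone surrogates
    assert os.fsencode(name) == raw  # ...and round-trip back to the same bytes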


Not to mention case-sensitivity issues. Can you have two files, one named "FILE.txt" and the other "file.txt" in the same directory for instance?


On windows? Of course you can.


I'm certain you can on Linux as well. Only the Mac's old HFS would not allow it.


Isn't this a fairly recent change?


NTFS has always been case sensitive, Windows API just lets you treat it as case insensitive. If you pass `FILE_FLAG_POSIX_SEMANTICS` to `CreateFile` you can make files that differ only in case.
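A rough ctypes sketch of what that looks like (illustrative only; whether two case-differing files actually get created also depends on the kernel's ObCaseInsensitive setting):

    import ctypes

    GENERIC_WRITE = 0x40000000
    CREATE_ALWAYS = 2
    FILE_ATTRIBUTE_NORMAL = 0x80
    FILE_FLAG_POSIX_SEMANTICS = 0x01000000
    INVALID_HANDLE_VALUE = ctypes.c_void_p(-1).value

    kernel32 = ctypes.windll.kernel32
    kernel32.CreateFileW.restype = ctypes.c_void_p

    def create_case_sensitive(name):
        handle = kernel32.CreateFileW(
            name,                 # lpFileName
            GENERIC_WRITE,        # dwDesiredAccess
            0,                    # dwShareMode
            None,                 # lpSecurityAttributes
            CREATE_ALWAYS,        # dwCreationDisposition
            FILE_ATTRIBUTE_NORMAL | FILE_FLAG_POSIX_SEMANTICS,
            None,                 # hTemplateFile
        )
        if handle == INVALID_HANDLE_VALUE:
            raise ctypes.WinError()
        kernel32.CloseHandle(ctypes.c_void_p(handle))

    create_case_sensitive('FILE.txt')
    create_case_sensitive('file.txt')  # a distinct file, if case sensitivity is honored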


Good luck using those in some tools which use the API differently though. Windows filenames are endless fun. What's the maximum length of the absolute path of a file? Why, that depends on which API you're using to access it!


Even worse on Unix where it depends on the mount type. Haven't seen much proper long filename support in Unix apps or libs, it's much better in Windows land. Garbage in garbage out is also a security nightmare as names are not identifiable anymore. You can easily spoof such names.


Hum, any program that doesn't treat filenames as bytestreams on unix is broken. Doubly so if its primary purpose is preserving and archiving files.

Are you sure the issue wasn't something else?


The point is that filenames aren't bytestreams on windows, and if you treat them as such then your program won't work.


By this point, any cross-platform file tool that isn't using Unicode as a lowest-common denominator for filenames and similar things to insure maximal compatibility is likely ready to cause havoc.

(The remarks in the post here that Mercurial on Python 3 on Windows is not yet stable and showing a lot of issues is possibly even an indicator/canary here. To my understanding, Python 2 Windows used to paper over some of these lowest common denominator encoding compatibility issues with a lot more handholding than they do with the Python 3 Unicode assumption.)


> By this point, any cross-platform file tool that isn't using Unicode as a lowest-common denominator for filenames and similar things to insure maximal compatibility is likely ready to cause havoc.

Be that as it may, Mercurial has existing repositories that may use non-unicode filenames, and just crashing whenever you try to operate on them is probably not an acceptable way forward.


Sure, but that's also not the only resulting option; instead of erroring you could also do something nice like help those users migrate to cleaner Unicode encodings of their filenames by asking them to correct mistakes or provide information about the original encoding. It takes more code to do that than just throwing an error, of course, but who knows how many users that might help that don't even realize why their repositories don't work correctly on, say, Windows.


Windows filenames basically are bytestreams. But the bytes come in pairs.


Not really. Certain byte sequences are invalid.


Certain byte sequences are invalid in unix filenames too. So that can't be the factor that decides if they are bytestreams or not.


If hg borked on non-ascii characters, it sounds like the problem was rather that it didn't treat that data as a bag-of-bytes. Not the other way around?


He was trying to use Windows. For Windows, you pretty much have to go through unicode to utf-16, can't be arbitrary bytes, can't be utf8.

(I think that relatively recently it is possible to use utf8 with some new windows interfaces ... but this is probably not widely compatible with older windows releases ...)


Windows uses arbitrary shorts that are sort of supposed to be utf-16. Just like Unix uses arbitrary bytes that are sort of supposed to be utf-8.

You have to convert between them, but neither uses proper Unicode to represent filenames.


Yeah, but utf-16 is still bytes. It's just bytes with a different encoding.

But I do see the pain with Python 3 where the runtime tries to hide these kinds of issues from you. That abstraction can make it difficult to have the right behaviour.


Everything is bytes, but the meaning assigned to the bytes matters. Let's say I create a file named «Файл» on Unix in UTF-8 and put it into a git repo. For Unix it is a sequence of bytes that is the UTF-8 representation of Russian letters. So far so good. Now I clone this repo to Windows. What should happen? The file cannot be restored with the name as encoded into bytes on Unix; that will be garbage (which even has a special name, "Mojibake") in the best case, or fail outright in the worst. What should happen is decoding those bytes from UTF-8 back into the original Unicode code points, then encoding them using Windows' native encoding (UTF-16).
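In code, the round trip being described is just this (illustrative Python):

    stored = 'Файл'.encode('utf-8')          # bytes as they exist in the repo on Unix
    name = stored.decode('utf-8')            # recover the Unicode code points
    windows_name = name.encode('utf-16-le')  # what the Win32 W-APIs ultimately store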


True, but one of those representations still needs to be canonical one in the repo for the purposes of hashing into the commits and so on.

Git builds a bunch of logic like this in around handling line endings in text files.


Everything isn't bytes. Strings without an encoding don't have a specific byte representation.


It's the other way around. Strings always have meanings and always reference the same characters. You use encoding to encode strings into bytes.

Bytes without encoding, don't have any meaning, they are just... random bytes.


We're actually saying the same thing. You're saying without an encoding you can't turn bytes into a string (technically, in Python terminology, that's a decoding, but you know... ;-). I'm saying a string doesn't have a byte representation without an encoding. That's two perspectives on the same truth.

I absolutely agree that a string has meaning without a byte representation. That's the whole point of having it as a distinct type.


UTF-16 is not "just bytes". There are sequences of bytes that are not valid UTF-16, so if you want to roundtrip bytes through UTF-16 you have to do something smarter than just pretending the byte sequence is UTF-16.


Sorry, I wasn't trying to imply that any permutation of bytes would work. If you encode it improperly, it's not going to work.


> For many programs, yes.

For all programs, for the simple reason that:

> Various standard library functionality now wanted unicode str and didn't accept bytes, even though the Python 2 implementation used the equivalent of bytes.

Much of the stdlib works with native strings and will either blow up or misbehave if fed anything else[0], which means much of your codebase will necessarily be native strings, with a subset being explicitly bytes or unicode.

> Repository data is bytes, not Unicode.

It's also mostly absent from the source code, and where it is present (e.g. placeholders or separators) it's easy to flag as explicitly bytes.

[0] though some e.g. the encoding layers or io module want either bytes or unicode depending what you're doing specifically, and not always the most sensible, like baseXY being bytes -> bytes conversions where 95% of the use case is to smuggle binary data through text… oh well


> For all programs, for the simple reason that:

> Various standard library functionality now wanted unicode str and didn't accept bytes, even though the Python 2 implementation used the equivalent of bytes.

This is a problem with the Python 3 standard library; in many places it requires Unicode when it shouldn't.


This is a really bad way of thinking. The distinction in Python 3 is between text (str) and bytes.

str is not Unicode; in fact, if you don't use fancy characters, it internally stores the text as a byte array.

You should think of text the same as of image or sound, what you see in the screen or hear in the speaker is the actual thing, but if you need to save it on disk you encode it as for example png or wav.


You can just read that as "requires text when it shouldn't". But I don't recommend this terminology: in most modern computer programs, including Python 3 implementations, "text" and "Unicode" mean the same thing, but outside of this context Unicode is just more precise: sometimes "text" means ASCII and sometimes it means things non-representable in the current version of Unicode.


> The distinction in Python 3 is between text (str) and bytes.

Feel free to s/Unicode/str/ in what I posted if you prefer that terminology. The problem is still the same.

An example of the problem: Python's standard streams (stdin|out|err) in Python 2 are streams of bytes, but in Python 3 they're streams of Unicode (or str if you prefer that terminology) characters. The problem is twofold: first, if my standard streams are hooked to a console, Python can't always properly detect the encoding of the bytes coming from the console, so it can give me the wrong Unicode characters; second, if my standard streams are hooked to pipes, there is no encoding it can pick that is right, since the bytes aren't even coming from a console (where at least there is some plausible argument for saying the user meant to type Unicode characters, not bytes). What Python 3 should have done was keep the standard streams as bytes, since that's the only common denominator you can rely on, and then let the application decide how to decode them if it decides it needs to, just as in Python 2.


I believe the behavior is correct though. Python uses the encoding specified through LANG/LC_*, which is the encoding that is supposed to be used, and all properly behaved applications use it.

If your application works on binary data, you can use sys.stdin/out/err.buffer to get binary version. Most people will use it for text, so the defaults make sense. Personally I would like if there was no automatic conversion when using files/network/pipes etc. but I guess that would make it more confusing for new users, and would be unnecessary boilerplate for most use cases.
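For example (a minimal sketch of the escape hatch mentioned above):

    import sys

    data = sys.stdin.buffer.read()     # raw bytes, no decoding applied
    sys.stdout.buffer.write(data)      # raw bytes out, e.g. for a pipeline filter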


> Python uses encoding specified through LANG/LC_

Yes, that's the best you can do, but it's still not always correct. I agree that it should be, but "should be" and "is" aren't always the same.

> If your application works on binary data, you can use sys.stdin/out/err.buffer to get binary version.

Yes, but there are still standard library functions that will use the regular streams, and that might conflict with what your application is doing. There is no way to tell Python as a whole "use binary streams everywhere because they are pipes for this application".

> Personally I would like if there was no automatic conversion when using files/network/pipes etc.

That would work if (a) Python could always detect that condition (it can't) and (b) the entire standard library adjusted itself accordingly.

> I guess that would make it more confusing for new users, and would be unnecessary boilerplate for most use cases.

Python 2 worked fine with the standard streams being binary, and applications wrapping them to decode to Unicode when necessary. Python 2.7 even backported the TextIOWrapper and similar classes to make the wrapping as simple as possible. A similar approach could have been taken in Python 3 (binary streams and a simple wrapper class), but it wasn't.
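The wrapping I mean looks roughly like this in Python 3 terms (illustrative sketch; the application, not the runtime, decides the encoding):

    import io
    import sys

    # Keep the stream binary by default; opt into text explicitly where wanted.
    out = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', errors='replace')
    out.write('decoded where the application decided to\n')
    out.flush()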


Complaining that the world is not as it should be does not solve the issue.


Repository data bytes does not show up as string literals in your code, or keyword argument names, or http header names. The vast majority of code involved in this struggle is misc business logic, not repository tracked file contents itself.


Python 3's approach means bytes/str poisons the whole expression. So if you want to do something like:

"%s/%s" % (repository_data_1, repository_data_2)

And have it work on Python 2 and 3, you're screwed.


And Python 3's behavior is more correct—You can't just intermix binary and textual data, they're two different things. Python 2 would let you do that, and it would often cause subtle bugs with non-ASCII data. Python 3 requires you to encode/decode, so you're working consistently and explicitly with binary or text.

I don't quite understand your example. `b'%s/%s' % (b'abc', b'def')` works in both 2 and 3. So does `u'%s/%s' % (b'abc'.decode('utf8'), b'def'.decode('utf8'))`, if you wanted to get a unicode string out of it.


> I don't quite understand your example. `b'%s/%s' % (b'abc', b'def')` works in both 2 and 3. So does `u'%s/%s' % (b'abc'.decode('utf8'), b'def'.decode('utf8'))`, if you wanted to get a unicode string out of it.

We're discussing the linked article, so I'm talking in the context of the linked article. I know it works now, but Python 3 initially removed %-formatting for bytes. I guess I should have used the past tense in my comment: "you were" screwed instead of "you are". From the article:

> Another feature was % formatting of strings. Python 2 allowed use of the % formatting operator on both its string types. But Python 3 initially removed the implementation of % from bytes. Why, I have no clue. It is perfectly reasonable to splice byte sequences into a buffer via use of a formatting string. But the Python language maintainers insisted otherwise. And it wasn't until the community complained about its absence loudly enough that this feature was restored in Python 3.5, which was released in September 2015.
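For concreteness, the operation in question is just this (works on 2.x and on 3.5+, but raised TypeError on 3.0 through 3.4):

    path = b'%s/%s' % (b'dir', b'file')
    assert path == b'dir/file'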


> Python 3's behavior is more correct—You can't just intermix binary and textual data, they're two different things.

Python 3's behavior as far as forcing you to explicitly recognize data type conversions is more correct, yes.

Python 3's behavior in assuming that nobody would ever need to do "text-like" operations like string formatting on byte sequences was not. At least this particular wart was fixed. But there are still a lot of places where Python makes you use the str "textual" data type when it's not the right one.

Python 3's behavior in making individual elements of a byte string integers instead of length-one byte strings is, frankly, braindead.


That example works fine in both Python 2 and 3 if you’re not mixing types incorrectly. If you are, it will appear to work on Python 2 before failing the first time you encounter non-ASCII data, and it tends to greatly confuse people with errors which would have been caught immediately on Python 3. I’ve seen teams waste hours trying to track down errors like that.


Exactly this. The number of times I saw juniors fixing these sorts of obscure, subtle bugs with str_var.decode("utf-8").encode("latin-1"), and that only after attempting every possible combination of the above two de/encode operations, is mind-boggling.


It works after Python 3.5. From the article:

> Another feature was % formatting of strings. Python 2 allowed use of the % formatting operator on both its string types. But Python 3 initially removed the implementation of % from bytes. Why, I have no clue. It is perfectly reasonable to splice byte sequences into a buffer via use of a formatting string. But the Python language maintainers insisted otherwise. And it wasn't until the community complained about its absence loudly enough that this feature was restored in Python 3.5, which was released in September 2015.


The rule of thumb (not just for Python, but for anything that deals with encoding) is to deal with the binary encoding at the boundaries of your program (reading/writing files, sending/receiving data over the network, etc.); it applies to everything, including tools like this. If you follow it, your life will be simpler.

You just need to be aware that in some cases the work is already done for you by the language: for example, in Python, if you open a file without the "b" option, Python will do the translation on the fly and you don't need to worry about it.


> You just need to be aware that in some cases the work is already done for you by the language: for example, in Python, if you open a file without the "b" option, Python will do the translation on the fly and you don't need to worry about it

Sadly they fucked up that part rather thoroughly, because the default encoding is `locale.getpreferredencoding()`, which ensures it's going to be wrong at the least convenient possible time and on the devices least accessible for debugging.

Do not ever use text-mode `open` without specifying an encoding.
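That is, always do this (the filename is illustrative):

    # Pin the encoding explicitly instead of inheriting whatever the locale says.
    with open('data.json', 'r', encoding='utf-8') as f:
        text = f.read()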


Node.js tries to be helpful in defaulting file writes to UTF-8, but defaults file reads to returning a raw byte buffer [0]. So you have to either remember to treat the two operations differently, or, like in Python, manually specify the encoding for both.

[0] I seem to recall that it used to default to the locale's preferred encoding, but I could have my wires crossed with other languages' standard libraries there.


The locales are provided by LANG and other locale variables, so Python will use whatever is set in the environment; you can also specify the encoding in one of open()'s parameters.


> The locales are provided by LANG and other locale variables

Which is absolutely not what you want when, say, opening your own data files. Even when opening the user’s files it’s likely not what you want.

> you can also specify …

And what I’m saying is this is not a “can also” it’s a “must”. Not doing so will bite you in the ass, because “whatever random garbage is on the machine” is really not what you want a default to be.


Oh I see your point. Looks like they changed the behavior in 3.7 (they added -X UTF-8 option), but being able to set it from the application would be great.


> in Python, if you open a file without the "b" option, Python will do the translation on the fly and you don't need to worry about it

Of course, if you don't know what encoding the file was opened with, you don't know what characters can be written to the file.

I was bitten by this with Python 3.5 on Windows. I naively assumed the default file encoding would be UTF-8 or UTF-16, but it was actually CP-1252, so my program would crash upon trying to write a non-ASCII character.
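The minimal repro is basically this (assuming a Western Windows install where the locale default is cp1252):

    # encoding silently defaults to locale.getpreferredencoding(), i.e. cp1252 here
    with open('out.txt', 'w') as f:
        f.write('snowman \u2603')    # UnicodeEncodeError: '\u2603' not in cp1252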


Every Python program should be tested with Emoji characters, they're a real torture test.


Note that you need to test on every platform, as the default file encoding may vary. I missed that bug in part because it worked correctly on Linux.


Good point. I do almost all of my Python on Windows where it's much easier to get an error.


Every program in general should be tested with Emoji characters at this point.


Not a bad idea, but I think Python is more likely to have hidden bugs that this will uncover. A language that accepts bytes as input and emits the same on output will probably work fine on UTF-8 for example.


That's the Python 2 mentality, and a large part of this discussion was that it didn't work in hindsight, that you can't just be "encoding oblivious", but it usually doesn't show up as an obvious problem until you least expect it. Our input and output devices aren't always homogeneous in their byte encodings (and quite possibly very rarely are; we have decades and decades of kludges around this), and testing every program with Emoji has become one of my favorite pastimes for finding failure cases.


It defaults to the system encoding. I don't use Python on Windows, but Windows evolved its default encoding over time, the code pages were popular in Windows 9x, starting with NT based (2000, XP...) They used UTF-16 I believe and then Windows 7? It became UTF-8. Perhaps Python needs to be updated to reflect that?

You can also specify encoding when calling open.


> Windows evolved its default encoding over time, the code pages were popular in Windows 9x, starting with NT based (2000, XP...) They used UTF-16 I believe and then Windows 7? It became UTF-8.

They bolted on a separate set of functions that took UCS-2 and now take UTF-16.

The actual code pages, to this day, are legacy things that are mostly 8 bits. My system is set to code pages 437 and 1252, for example.

They put together a code page for UTF-8 but it's behind a 'beta' warning.


> They bolted on a separate set of functions that took UCS-2 and now take UTF-16.

NT actually bolted on 8-bit versions of the native Unicode functions. FooBarA is a wrapper around FooBarW.

> They put together a code page for UTF-8 but it's behind a 'beta' warning.

Codepage 65001 has been a thing for quite a while. It's just that it's variable-width per character, and few applications are ready to handle that when they assume a 1:1 or 2:1 relationship between bytes and characters. It does sort of work for applications that don't do anything too weird with text, though, and can be a useful workaround in such cases to get UTF-8 support into legacy applications.

But in general, Windows is UTF-16LE and the code pages are indeed legacy cruft that no application should touch or even use. Sadly much software ported from Unix-likes notices »Hey, there's a default encoding in Windows too, so let's just use that«.


The default file encoding for Windows was changed to UTF-8 in Python 3.6. That particular problem on that particular platform is now a thing of the past.

It was just an example of why implicit conversions in the standard library functions don't save you from having to think about encodings. You get much more robust and user-friendly programs when you explicitly consider your encodings and the error-handling strategies to go with them.


To be fair... the problem was more in Python 2 where this stuff was often conflated. Python 3 really just brought the problem in to stark relief.

TBH I do think the problem is easier to address in a statically typed world.


> I think this article is an excellent illustration of the Python developers' failure to properly recognize this use case in the 2 to 3 transition.

The entire 2 to 3 transition is an excellent illustration of Python developers failing to properly recognize the challenges of the transition. What other popular language intentionally broke backwards compatibility? It's hard to think of any.

Python set the entire community back 10 years or more by making this drastic mistake.


It might be my own pro-typed-language bias showing but this migration from byte strings to unicode strings is really where dynamically typed languages really don't shine.

If we imagine an alternative reality where Rust started only with byte-strings and added unicode as an afterthought like Python did, you'd definitely face a massive amount of churn but at least the compiler would yell at you every time you pass a byte string where unicode is expected and vice-versa. Once you'll have fixed all of the errors in the vast majority of cases there's a good chance that your program would work again. It would be very annoying but at least you know clearly where the problems occur.

In Python on the other hand this type of code refactoring is very painful in my experience. You may end up with the same function being called sometimes with unicode and sometimes with bytes. And then you have to look at the call stack to figure out where it comes from. And then you realize that you end up with, say, a list of records which sometimes contain unicode and sometimes byte arrays depending on whether the code that updated them used the old or the new version etc...

And if it turns out that you can't easily reproduce the problem and you just get a bug report sent from somewhere in production then Good Luck; Have Fun.
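A tiny illustration of the failure mode (function and field names are made up): the type error surfaces far from where the bytes crept in, and nothing complains at the boundary.

    def load_record(raw):
        return {'author': raw}              # oops: still bytes, nobody notices

    def render(record):
        return 'by ' + record['author']     # TypeError: can only concatenate str

    records = [load_record(b'alice'), load_record('bob')]
    for r in records:
        print(render(r))                    # blows up only for the bytes record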


> added unicode as an afterthought like Python did

I agree with you on the benefits of static typing, but let's be clear: Python didn't add unicode as an "afterthought". The initial release of Python predates the initial release of the Unicode standard by almost a year.

Furthermore, even if this were not the case, it took a while before Unicode got any significant adoption among programming languages, well after the release of Python 1.0. I think Java in 1996 was the first language to adopt Unicode.


Another useful red letter date for language/tool adoption is the standardization of UTF-8 in 1993. Before UTF-8 there were a lot of tools, especially in the POSIX world, that didn't feel comfortable without an 8-bit safe encoding format.

Python 2 was after UTF-8 in 2000, so with hindsight could have had the foresight to pull this bandaid off then (before a large influx of users), but a corresponding complaint about UTF-8 is that because it was 8-bit safe, a lot of tools also felt they could kick the can on dealing with it more directly (as a default), and Python 2 seems to be among them. Hindsight has told us a lot about the problems to expect (and exactly why Python 3 did what it felt it had to do), but they probably weren't as clear in 2000. (In further hindsight, imagine if Astral Plane Emoji had been standard and been common around 2000 instead of 2010 how much further we might be in consistent Unicode implementation today. I suppose that makes 2010 another red letter date for Unicode adoption.)


And it was much later than 1993 that unicode conclusively defeated latin-1. Something like 2010?


> Python 2 was after UTF-8 in 2000, so with hindsight could have had the foresight to pull this bandaid off then

That's true, but I would argue that given the difficulty and backlash we've seen moving from Python 2 to Python 3, such a move would have risked destroying Python's rapid forward momentum and condemned it to the ash heap of programming language history.


To add on to this, I'm not agreeing with the backlash from Python 2 to 3. And I wouldn't want it in the ash heap of history - I definitely think there's a place for nice, quick, easy dynamic langs like Python, particularly for exploratory programming.

I'm just saying the move to Python 3 turned out to be a huge deal to a lot of people (it surprised me), and for that reason, trying such a big jump at Python 2 would have been risky and could have derailed Python's forward progress at a critical point.

Would the downvoters like to share their reasons for disagreement?


I think the question goes back to the size and scale of users at the 1 to 2 jump versus the 2 to 3 jump. Python didn't really hit most of its "forward progress", in terms of both user adoption and being so deeply integrated into systems, until the Python 2 era. There was no Django for Python 1, for one example. As another example, I'm pretty sure Debian and its heavy reliance on Python for so much of its system scripting didn't happen until Python 2, either, but a quick search didn't turn up a reliable date.

It probably would have been a lot less risky with so many fewer daily users, so many fewer huge projects to migrate.


You may be right. I first used Python on a regular basis in 2002 (after release of Python 2), so I wasn't aware it had so little adoption prior to Python 2. But it definitely was picking up by 2002.


> First was the idea that normal feature contributors should not see any b"" or any sign of python3 support for the first couple years of the effort. Huge mistake. You need some b"".

When I read that, I was angry on behalf of the people doing the porting work who had their hands tied by it, and I was angry on behalf of the Mercurial developers who, I think, must have been underestimated. It's normal that platforms don't stand still and coding standards on a project evolve over time. Obviously it's not going to fly for open source contributors to be "voluntold" to do porting work, but to be aware of it and accommodate it and know enough about the new platform to mostly avoid creating new work for the porters seems like a small and reasonable ask, especially when you compare it to the effort required to make high-quality contributions in the first place.

I get that there are people who are bitter to this day about Python having a version 3, but surely by 2017 the vast, vast majority of developers who were going to rage quit the Python community over it were already gone.


Yes, I was really surprised that they avoided upgrading to Python 2.7-level best practices and future statements for as long as they did and tried to hide it from most developers thru custom compatibility layers. Huh? That's step 0, getting except, stdlib imports, and print statements up to date. Folks can deal with that, that's the easy part.

Keeping blame details (and line-lengths, ha!) was given as the excuse and that is a nice feature and all. However they could have copied the repo over before porting to keep that information and saved time. Wouldn't be surprised if it was eventually lost anyway.


The late start was mostly due to having to retain Python 2.4/2.5 compatibility until May 2015 and it was literally impossible to use some future statements or some Python 3 syntax until 2.6 was required. I have updated the post to reflect this.


IC, that’s unfortunate. Believe that is the time to cut a legacy branch/release rather than block progress for a decade.


Interesting you mention http headers. I had a program converted from Python 2 to Python 3 which was crashing occasionally, and it turned out it was because I was being sent an http request which wasn't valid unicode, so decoding failed.

I had to switch back to treating headers as bytes for as long as possible.

It is a stupid client which doesn't send valid ascii for http headers of course.


I believe the headers are encoded using ISO-8859-1, not Unicode. That encoding has a 1:1 mapping with bytes, so it wouldn't break this way. Treating them as UTF-8 was the bug.


This is exactly the sort of encoding issues that the python 2 to 3 transition has flushed out. People get frustrated with python 3, yet the actual failure was their mishandling of encoding issues -- papered over by python 2.


But that's not what frustrates people with the transition. It's that they suddenly get encoding issues where there should have been no encoding to begin with!


No observed encoding issues.


When I treated headers as bytes, there wasn't an "encoding".

What I often want to do when reading user data is not treat it as a "encoded string", but just as a stream of bytes. Most data I work with (HTML files, output of other programs) can't be treated as anything but bytes, because people put junk in files / output of programs.


> When I treated headers as bytes, there wasn't an "encoding".

If you are representing strings as bytes, you are intrinsically using an encoding.

> What I often want to do when reading user data is not treat it as a "encoded string", but just as a stream of bytes. Most data I work with (HTML files, output of other programs) can't be treated as anything but bytes, because people put junk in files / output of programs.

Yes, it makes a mockery of the notion that "human readable data is easy". In many cases, you don't want to work with the actual strings in the data anyway, so bytes is the right thing to do.

But yes, this strategy largely avoids encoding issues... until it doesn't.


> If you are representing strings as bytes, you are intrinsically using an encoding.

It's just binary data that might resemble a string. No encoding necessary.


This is false more often than not. Many programs taking user input will treat it as a string, assuming a specific encoding or compatibility with screen output/some API, at least in some code paths. For example, if you print an error message when you can't open some file, you are very likely to assume it's encoded in a way the terminal can handle, so it's no longer "just binary data".


Yes, I have to worry about how to make a "best effort" to show it to users, but in all internal code paths it must stay as "just binary data", else I lose information. This is exactly how chrome and Firefox handle headers internally.


It might resemble a particular encoding of a string... and the way you got that string to that particular sequence of bytes is by... encoding it.


> and the way you got that string to that particular sequence of bytes

No I didn't. Those bytes came from an external source. My primary job is to preserve the exact sequence, whether I can make sense of it or not.


In that context, you aren't using strings. You are using bytes. HTML without interpreting it as strings isn't really HTML, nor is it a string. It's just a blob that is passing through.


> When I treated headers as bytes, there wasn't an "encoding".

oh, actually there was (either US-ASCII or, more likely, ISO-8859-1). The bytes are just values 0-255; what those values mean is the encoding. You're confused because the encoding was implicit, rather than explicit.

It would perhaps be clearer to see it if you, for example, had to choose between ASCII and the legacy EBCDIC encoding.


I'll admit, I'm not positive what the encoding should be. However, there are a bunch of people who clearly do send UTF-8, and I can also promise you there are headers out there which just have binary nonsense in them. See for example https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/Web...

If you want to handle all headers, you have to be prepared to just get binary data.


Yes, and using ISO-8859-1 is the way to handle them without issues. You will never get error when decoding it that way. If you are using UTF-8 there are character combinations that are invalid.
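Illustratively: latin-1 maps every byte value to a code point, so decoding can never fail and round-trips losslessly, whereas UTF-8 rejects malformed sequences.

    header = b'X-Junk: \xff\xfe binary noise'
    header.decode('iso-8859-1')          # always succeeds, bytes round-trip exactly
    try:
        header.decode('utf-8')
    except UnicodeDecodeError:
        pass                             # \xff can never begin a valid UTF-8 sequence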


> It is a stupid client which doesn't send valid ascii for http headers of course.

...or a smart malicious actor.


> But you don't need all b"" everywhere.

as a mercurial user i never understood this decision. for instance look at this recent commit: https://www.mercurial-scm.org/repo/hg/rev/b4c82b704180

would anyone disagree with the fact that an error message should be a string?

a source transformer to add b'' all over the place? really?

and i still don't understand why the hg transition had to be more complex than: https://docs.djangoproject.com/en/1.11/topics/python3/

... and of course now this: https://www.mercurial-scm.org/wiki/OxidationPlan

i wonder what does matt mackall think of all these developments?


Why are you so certain about your assertions here about when they did and did not need to use explicit byte strings?


I understand the author's reasoning in the context of a transition, but as a "non-Latin" language user, defaulting str to unicode literals is the best change in Python 3. Coming from C#, I never got used to Python 2's approach. It's a pain in the ass working with non-Latin characters in Py2, starting from simply outputting to the console, especially on Windows.

>assuming the world is Unicode is flat out wrong

True, but Py2's approach makes lots of developers assume the world is Latin-1. I see way too many examples of things broken on a Chinese locale environment, including Python's official IDLE ([1]).

[1] https://bugs.python.org/issue15809 (Summary of this bug: in 2.x IDLE, an explicit unicode literal used to still be encoded using system's ANSI encoding instead of, well, unicode.)


The most amusing quote in the entire article is this (emphasis mine):

> This ground rule meant that a mass insertion of b'' prefixes everywhere was not desirable, as that would require developers to think about whether a type was a bytes or str, a distinction they didn't have to worry about on Python 2 because we practically never used the Unicode-based string type in Mercurial.

Requiring developers to think which one it should be is, of course, the whole point of the changes in Python 3 - and it's what produces better apps that are more aware of i18n issues in general and Unicode in particular.

And the complaint doesn't even make sense if taken at face value - if all strings in Mercurial are byte strings, then what is there to think about? just use b'' throughout, no need to worry about anything else. Of course, the devil is in the details, which is reflected by the word "practically" in that sentence - this kinda implies that there are places where Unicode strings are used. At which point you do want the developers to think about bytes vs Unicode.

So the real complaint is that Python switched the defaults in a way that made bytes-centric code more complicated - because it has to be explicit now, instead of the Python 2 world, where bytes was the default, and Unicode had to be requested explicitly. Which, of course, is the right change for the vast majority of code out there, that operates on higher level of abstraction, where "all strings are Unicode by default" is a perfectly reasonable assumption to force.


> And the complaint doesn't even make sense if taken at face value - if all strings in Mercurial are byte strings, then what is there to think about? just use b'' throughout, no need to worry about anything else.

The article directly answers that question. Many, many things in the standard library now only accept unicode strings, not byte strings. So a wholesale change to b'' everywhere breaks lots of stuff.

> So the real complaint is that Python switched the defaults in a way that made bytes-centric code more complicated - because it has to be explicit now, instead of the Python 2 world, where bytes was the default, and Unicode had to be requested explicitly.

Once again, the article directly states that the default is not the problem. The lack of escape hatches is. Paths are not unicode strings, and pretending they are does not work. Using bytes when you need bytes works only until you need to call a library function that only accepts strings.


Paths ARE Unicode strings on 99% of the computers with humans sitting in front of them. NTFS, HFS+, and APFS all use Unicode but more importantly, the experience of not using valid Unicode where that’s possible is horrible: undeletable files, crashes, etc. I’ve seen that many times over the years (it was popular with malware authors) but never a time where this was a desirable behavior.

The default should always be Unicode with only people writing low-level backup and security tools dealing with bytes.


This just isn't true. In Windows, paths are UCS-2, i.e. arbitrary sequences of Unicode code units, including unpaired surrogates. This means that there are paths that will work on Windows but cannot be encoded as, e.g., valid UTF-8. As a result Rust has a bespoke encoding just for representing Windows paths in a way that's compatible with UTF-8 ("WTF-8"). It also means that you can't make a guaranteed lossless conversion from a filesystem path to a Rust string; you have to handle the possibility of errors.

On Mac paths are some weird NFKD-ish thing, so equality comparisons are complicated.

As a rule, if you think that filesystem paths are easy then you're probably ignoring all the edge cases. In applications where you don't deal with arbitrary user files that's fine. In a programming language that's a huge design error.


This all - including complicated equality comparisons - is why paths should have their own dedicated type, and not just be raw strings. Thankfully, Python has had pathlib for a while now.


Paths are Unicode strings on Windows. Yes, POSIX adds a lot more spice to the mix, but if the intent is a cross-platform tool, then Unicode is a reasonable lowest-common-denominator assumption for filenames in 2020.


Paths are Unicode strings everywhere but Unix/Linux. And I would even argue that this is a broken aspect of POSIX today. We should make Unicode the baseline for paths in POSIX-compliant systems, but there's probably too much hand-wringing for that to ever happen.


Paths are sequences of 16-bit values on Windows, not necessarily valid UTF-16. It's basically the same as in POSIX, just one byte wider per character.


> if all strings in Mercurial are byte strings, then what is there to think about? just use b'' throughout, no need to worry about anything else.

The author explains later in the article that many system level python 3 apis that are important to a vcs require unicode and won't accept bytes. So apparently it wasn't as easy as just sticking 'b' in front of every literal.


Right. But that's a very different issue, and it's not at all about string literals as such.

Furthermore, the way they solve it - by using their own wrapper helpers that allow bytes - means that the end result should be b'' throughout, no?


>> So the real complaint is that Python switched the defaults in a way that made bytes-centric code more complicated

The author made it clear. The issue wasn't just that the default changed. It was that 3.0 took away the ability to always make your choice explicit.

Changing the default would have no effect on code that was always explicit. Going over the code and making all implicit strings explicit would allow them to know when they had full coverage, and also make the code work with both 2 and 3.

With 3, any implicit had to get b added, while any string with u had to be made implicit (drop the u). You couldn't tell by looking at code if it was converted or not. At least that's how I read it.


The lack of u'' in early versions of Python 3 is a valid complaint, but it's a separate one.

It's also not that big of a deal in practice, because you could always write a helper function like u('foo') that would call unicode() on Python 2, and just pass the value through on Python 3. This only breaks when you need a Unicode literal with actual Unicode characters inside, which is a rare case - and should be especially rare in something like Mercurial.
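Something like this (a sketch of the helper pattern, not Mercurial's actual code):

    import sys

    if sys.version_info[0] >= 3:
        def u(s):
            return s                        # already a unicode str on Python 3
    else:
        def u(s):
            return unicode(s, 'ascii')      # noqa: F821 - promote the native literal

    message = u("abort: no changes found")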


Another reason the complaint doesn't make sense is that the author then praises Rust which is more similar to Python 3 than 2.


From other comments, the annoyances for the author were about the standard library using Unicode for system-level APIs; Rust has an OsString type that works with the GIGO model of POSIX.


> but as a "non-Latin" language user, defaulting str to unicode literals is the best change in Python 3

I'm also a "non-latin" user and I will keep repeating this point ad nauseam: there would have been many strictly superious solutions to solving this problem and most of them would have been closer to what we had in Python 2 than 3.

Both rust and go decided to go with Unicode support that is largely based around utf-8 with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3 but that path was dismissed without a lot of evidence.

A Unicode model that was a bad idea in 2005 was picked and we now have it in 2020 where it's a lot worse because thanks to emojis we now are well outside the basic plane.


> Both rust and go decided to go with Unicode support that is largely based around utf-8 with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3 but that path was dismissed without a lot of evidence.

Both of those are newer languages that happen to take a stance from the day 1. So not quite comparable.

That said, UTF-8 is one of the best pragmatic solutions to this Unicode problem. Most engineers I meet who throw their hands up in the air complaining about Unicode haven't read the simple Wikipedia page for utf-8.

Python 2 was already halfway there; they just had to tweak a few places where bytes are converted to strings. Of course this is easier for newer languages to solve. We can't blame Python for having to provide backward compatibility.

PS: I also blame all the "encoding detection" libraries which exist to try to solve an unsolvable problem. Nobody can detect an encoding, at least not reliably. If these half-assed libraries did not exist, people would have finally settled on UTF-8 and given up on others by now.


> Both of those are newer languages that happen to take a stance from the day 1. So not quite comparable.

Python 3 predates Rust and Go and I can tell you from personal interactions with people how much opposition there was against UTF-8 as either default or internal encoding. A lot of the arguments against it were already not valid then and they definitely are not today.

Python 3 launched despite a lot of vocal opposition against it. I think many do not even remember how badly broken the URL, HTTP and Email modules were when they were first ported to Python 3. There was a complete misunderstanding of how platform abstractions should look like.

All of this was known back then.


Is there any hope of "fixing" it now without going through another massive migration struggle (which will simply not happen)?


No one is complaining that Python 2 didn't DTRT when it comes to Unicode.

But when Python 3 made its decision, it was known to be the wrong thing. People who had done Unicode in other languages told them it was the wrong thing. People who had taken the effort to do Unicode right in Python 2 told them it was the wrong thing. The only people telling them they were doing the right thing were Python 2 programmers who thought they were going to get Unicode support for free without thinking about it (or worse, who had done horribly wrong things in Python 2 - the mess PyGTK wrote itself into, for example).

Python 3 has no excuses for what are now often unusable APIs when you truly do need to process binary data. And all we gained is that we don't need to type "u" before some string constants anymore. It wasn't worth it, and it's still not good.


> Both rust and go decided to go with Unicode support that is largely based around utf-8 with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3 but that path was dismissed without a lot of evidence.

What do you mean by "free"? Rust requires you to explicitly convert a string to bytes or vice versa, no? Which is pretty much what you do in Python - the only difference I can see is that you have shortcut methods to encode/decode using UTF-8, but semantically they're no different from encode/decode in Python.

I'm pretty dubious about specifying that the internal representation must be UTF-8. That's a failure of abstraction (because the program shouldn't know or care what the internal representation is), leads to inherent performance/interop problems on several compile targets (Windows, the JVM, Javascript), and seems to imply that Han unification is forced at the language level.


str -> [u8] is free from a performance perspective. It is internally equivalent to a type cast.

[u8] -> str requires a UTF-8 validity check, but is otherwise also internally equivalent to a type cast (i.e., no allocations). I assume this is what Armin meant by "almost" free.

FWIW, I do think that "internally and externally UTF-8" is the best approach to take. If Rust's string type used, say, a sequence of 32-bit codepoints instead, then lots of lower level string handling implementations would be quite a bit slower than their UTF-8 counterparts. (For at least a few reasons that I can think of.) UTF-8 also happens to be quite practical from a performance perspective because it lets you reuse highly optimized routines like memchr in lots of places.

In any case, the abstraction is not lost completely. Rust's string types provide higher level methods without needing to know the encoding used by strings. e.g., You can iterate over codepoints, search, split and so on. The abstraction is intentionally leaky. e.g., You are permitted to take a substring using byte offsets, and if those byte offsets fall in the middle of a UTF-8 encoded codepoint, then you get a panic (or None, depending on the API you use). You are indeed prevented from doing some things, e.g., indexing a string at a particular position because it doesn't make a lot of semantic sense for a Unicode string.

You might call this a "failure" because it leaks its internal representation, but to me, it's actually a resounding success. Refraining from leaking its internal representation in a way that is zero cost would be absolutely disastrous from a performance perspective when implementing things like regex engines or other lower level text primitives.


With an opaque string type there's nothing stopping a particular Python implementation from using UTF-8 as an internal representation - it would likely perform worse than CPython at iterating over the code units of a string, but that's likely an acceptable cost. Particularly for a language like Python, defining the precise performance characteristics is rarely the priority, especially if it comes at the cost of confusing the semantics.

> In any case, the abstraction is not lost completely. Rust's string types provide higher level methods without needing to know the encoding used by strings. e.g., You can iterate over codepoints, search, split and so on. The abstraction is intentionally leaky. e.g., You are permitted to take a substring using byte offsets, and if those byte offsets fall in the middle of a UTF-8 encoded codepoint, then you get a panic (or None, depending on the API you use). You are indeed prevented from doing some things, e.g., indexing a string at a particular position because it doesn't make a lot of semantic sense for a Unicode string.

> You might call this a "failure" because it leaks its internal representation, but to me, it's actually a resounding success. Refraining from leaking its internal representation in a way that is zero cost would be absolutely disastrous from a performance perspective when implementing things like regex engines or other lower level text primitives.

I'd argue that offering APIs that can panic is a poor tradeoff in a default/general-use/beginner-facing type. There's maybe a place for a type that implements the same traits as strings while also offering unsafe things like indexing by byte offset (if it's really impossible to achieve what's needed in a safe way, which I'm dubious about), but it's a niche one for specialist use cases (even if it might be the same underlying implementation as the "safe" string type).


I feel like you picked at the least interesting aspects of my comment. It continues to be frustrating to talk to you. :-(

And yes, you can index by byte offset in a zero cost way by converting the string to a byte slice first.

Have you used Rust strings (or any similarly designed string abstraction) in anger before? It might help to get some boots-on-the-ground experience with it.


> Both rust and go decided to go with Unicode support that is largely based around utf-8 with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3 but that path was dismissed without a lot of evidence.

Do you mean that if you have bytes, but you want to send them to a function that expects a string, then it would automatically interpret the bytes as UTF-8?

If so, that violates the "Explicit is better than implicit" part of the Zen of Python. Encoding/Decoding bytes to/from strings shouldn't happen automatically because doing so means you have to make an assumption about the encoding.


> Do you mean that if you have bytes, but you want to send them to a function that expects a string, then it would automatically interpret the bytes as UTF-8?

No, the types are separate and not implicitly converted P2-style, however "unicode strings" are guaranteed to be proper UTF8 so encoding to UTF-8 is completely free, and decoding from UTF8 just requires validating.

Python's maintainers rejected this approach because "it doesn't provide non-amortised O(1) access to codepoints", and while Python 3 broke a lot of things they sadly refused to break this one completely useless thing, only to have to come up with PEP 393 a few years later.


Ah, that makes sense. Thank you for the clarification.


To add to your earlier dialog partner, here are the doc pages for the relevant Rust functions/methods, embedded with runnable examples:

https://doc.rust-lang.org/std/string/struct.String.html#meth...

https://doc.rust-lang.org/std/string/struct.String.html#meth...

https://doc.rust-lang.org/std/primitive.str.html#method.as_b...

https://doc.rust-lang.org/std/str/fn.from_utf8.html

Also, as explained in those docs, if and when you are absolutely sure that the Vec or slice of bytes is valid UTF-8, you could use the following "unsafe" methods to not incur the overhead of validation (warnings in the docs):

https://doc.rust-lang.org/std/string/struct.String.html#meth...

https://doc.rust-lang.org/std/str/fn.from_utf8_unchecked.htm...


IMO Python is doing exactly the same thing that Go does (I know too little about Rust to comment) the only difference is that Python respects the LANG variable while Go is just fixed on using UTF-8.


> Python is doing exactly the same thing that Go does

It doesn't. Go's internal string encoding is UTF-8 and it can even be malformed. Go in fact does pretty much what Python 2's byte strings did, except that string operations such as converting to uppercase or iterating over runes understand UTF-8 and Unicode.


Here's your problem: you should not care how python is representing it internally.

> Go in fact does pretty much what Python 2's byte strings did, except that string operations such as converting to uppercase or iterating over runes understand UTF-8 and Unicode.

Why do you care about the internal representation, though? What are you gaining if both Go's string and Python's str can express all characters? In Go you still need to convert string to []byte when doing I/O.


Python 2's approach was bad, no argument, but the transition plan for 2-to-3 just didn't work. They thought everyone would run 2to3 in a big bang, and then we'd all switch over to 3 in a few years. Instead it dragged out over a decade because in reality we needed to write code that was compatible with both 2 and 3 (the "6" approach) until enough things were on 3 to drop 2 support.

Hindsight is 20/20 naturally, but in retrospect, they should have just made `bytes` into the name for old `str` and used `from __future__ import` to create a gradual system for moving from 2 to 3 instead of a big bang "we'll break everything once and then never again".
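For reference, the gradual mechanism that did exist looked roughly like this; a minimal single-source sketch, not a complete migration recipe:

    # Runs unchanged on Python 2.7 and Python 3.x
    from __future__ import absolute_import, division, print_function, unicode_literals

    print('string literals are unicode on both interpreters now')
    print(1 / 2)  # 0.5 on both, instead of truncating to 0 on Python 2

Each of those __future__ features could be adopted file by file, which is exactly the kind of gradualism being asked for here.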


I'm not sure they really thought 2to3 would be used for a big bang. I seem to recall the general initial messaging was that Python 3 was a new language and you would need to do a language port to get to it.


> I understand author's reasoning in the context of a transition, but as a "non-Latin" language user, defaulting str to unicode literals is the best change in Python 3

I think this is misreading the author's criticism. The fact that string literals are now Unicode is not the fundamental problem; the fact that standard library APIs that formerly took bytes now incorrectly take Unicode strings is the problem.

IMO it's great that the world is moving towards opaque blobs of Unicode for strings, but that requires understanding when something shouldn't simply be a string in the first place (for reasons of legacy or otherwise).


My comment is about this sentence:

>Perhaps my least favorite feature of Python 3 is its insistence that the world is Unicode

>standard library APIs that formerly took bytes now incorrectly take Unicode strings

What do you mean by "incorrectly"?


POSIX APIs take bytes, generally. Python wraps these APIs to take unicode and doesn't allow you to pass bytes, even if you need to. Filenames, for example, are just bytes, and if you force them to always be valid unicode you will make it so that you can't interact with files that have names that aren't valid unicode. That's just one example.



An extremely frustrating part of the Python 3 migration is how many times Python module maintainers have had to hear "oh, now it's safe to migrate." This page currently leads off with a comment saying it's been fine any time since 3.4. You say 3.6. When I was maintaining a popular Python module, I heard the same at 3.1, and 3.2. (I didn't maintain it long after that.)


There are very few places where the bytes/string difference matters for posix paths. Python is far from the only popular tool to assume paths must be valid unicode.


> There are very few places where the bytes/string difference matters for posix paths.

It's nothing to do with "places", points in your program, or entry points into the stdlib. It's entirely about what path names you need to process, and for large classes of software you have zero control over that. If you have a path that doesn't encode properly with your LC_CTYPE, you're in for a bad time with Python 3. (Of course you won't if you control all your own path names, but then you also don't have a problem assuming and enforcing ASCII.)

People were still migrating home systems to Unicode-compatible encodings long after Py3 came out. I still find files in archives with paths in weird (and undeclared/undeclarable) encodings. Lots of people had such files; non-native English speakers were the most likely to have them.
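To make the failure mode concrete, here's a minimal sketch (it assumes a POSIX filesystem and a UTF-8 locale; the filename is hypothetical):

    import os

    # A file whose name is not valid UTF-8, e.g. created long ago under a
    # latin-1 locale:
    os.mkdir(b'demo')
    open(b'demo/caf\xe9.txt', 'w').close()

    name = os.listdir('demo')[0]  # 'caf\udce9.txt' -- surrogateescape smuggles the raw byte in
    name.encode('utf-8')          # UnicodeEncodeError: surrogates not allowed

The surrogateescape trick keeps the name round-trippable through os.fsencode()/os.fsdecode(), but the moment such a string hits a strict UTF-8 encode somewhere else (a socket write, for instance), it blows up far from where the path entered the program.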

> Python is far from the only popular tool to assume paths must be valid unicode.

It and Java are the only ones I use regularly. Java doesn't have a good reputation for playing well with the outside world, vs. Python which had been sold for years as "better shell scripts."


> There are very few places where the bytes/string difference matters for posix paths.

There’s only every single input from the system at large, no big deal.


I don't quite agree. There's lots of systems where it's always unicode, and a lot of systems where it's always ASCII, and then some systems where stuff is weird (and should be unicode :x)


There was a different API to get this behavior since 3.4: https://www.python.org/dev/peps/pep-0428/#id39


Which means it's been true (and broken) for many many years until maintainers finally succumbed to external pressure and unbroke the API.


Just beware that C# is not exactly "Unicode" either.

C# char is a UTF-16 code unit, not a Unicode code point.

Most code points "fit" into just one UTF-16 code unit, but not all.

For example: 𝐀 ("Mathematical Bold Capital A", code point U+1D400) is encoded in UTF-16 as a surrogate pair of code units: U+D835 and U+DC00. So reversing "x𝐀y" should produce "y𝐀x" ("y\ud835\udc00x") - note how U+D835 and U+DC00 were not reversed in the result.


C# isn't exactly quiet about this property, and yes, it can be annoying from an API perspective, but in C# this was likely a pragmatic choice to remain compatible (and familiar) with C++, COM, etc. where most developers would be coming from.

API members that operate on code points universally take a string and an index.

That being said, treating strings as arrays of characters is fraught with peril in most cases anyway. You can't trivially reverse strings in any encoding, as you need to reverse the sequence of grapheme clusters (to account for diacritics, etc.). You can't trivially truncate strings either, for pretty much the same reason. You can't trivially grab a single character from the middle of a string, again, for the same reason. So basically, indexing, reversing, truncating, copying a subsequence, etc. are all not trivially possible regardless of the encoding. UTF-16 is not the main problem here, as even in UTF-32 it'd be broken.
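The same point is easy to demonstrate in Python, whose str indexes by code point rather than UTF-16 code unit, so none of this is specific to C#:

    s = 'e\u0301'     # 'é' built from 'e' plus a combining acute accent
    print(len(s))     # 2 code points, but 1 grapheme cluster
    print(s[::-1])    # the accent now precedes the 'e' and attaches to nothing
    print(s[:1])      # 'e' -- naive truncation silently drops the accent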


I think the actual pain in Python 2 came from the misguided decision not to adopt UTF-8 as the default character encoding, combined with silent coercion between unicode/bytes whenever needed. Those two features in combination made Python brittle and dangerous when handling non-ascii characters, not the "strings are bytes" default.

Making strings Unicode by default is wonderful compared to the alternatives (and OP's assertion that this amounts to "assuming the world is Unicode" is disingenuous: there's nothing stopping programs from handling bytes correctly - Python 3 merely resolved the ambiguity).
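A minimal Python 2 sketch of that silent coercion (deliberately not valid Python 3):

    # Python 2 only
    name = u'caf\xe9'          # u'café'
    print('hello ' + name)     # works: the byte string is silently decoded as ASCII
    print(name + '\xe9')       # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9

Whether the implicit decode succeeds depends on the data, so the failure only shows up when a non-ASCII byte happens to flow through, typically in production.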


> I think the actual pain in Python 2 came from the misguided decision not to adopt UTF-8 as the default character encoding

The decision of a default encoding surely dates back to Python 1.0 or earlier, which predates not just UTF-8 but even Unicode itself. Python is an old language!

And if the assertion is that Python 2.0 should have made the tumultuous Unicode jump when it released in 2000, I could get behind that (especially in retrospect!), but enthusiasm for both Unicode and UTF-8 was not nearly as high then as it is today, so I don't begrudge them for not jumping at the opportunity.


Interestingly enough, Ruby 1.8 -> 1.9, the big version jump there, there was this kind of transition. The remainder of this post is all IIRC, it's been a while...

Ruby 1.8 had "everything as bytes" and there was no concept of encodings.

Ruby 1.9 introduced explicit encodings on every string. By default, strings would have the same encoding as your source file, and the default source encoding was ASCII. You could control this explicitly with a magic comment, and so many folks added the "UTF-8" comment to get strings encoded as UTF-8 by default.

Ruby 2.0, which was not as large a transition as Ruby 1.8 -> 1.9, even though it sounds like a larger one, said that encodings of files were UTF-8 by default, and therefore, strings generally became UTF-8 by default as well. Most folks just removed their magic comments.


It's surprising how many people believe that you can use a magic comment to make Python use UTF-8 encoding as the default. All the magic comment affects is the encoding of the source file, not the run-time.
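A quick way to see the distinction on Python 2 (a small sketch; the runtime default stays ASCII no matter what the comment says):

    # -*- coding: utf-8 -*-
    # The magic comment only tells the parser how to decode THIS source file,
    # so the byte-string literal below is read correctly...
    s = 'héllo'
    import sys
    print(sys.getdefaultencoding())  # ...but the runtime default is still 'ascii'
    s.decode()                       # UnicodeDecodeError: implicit decoding still assumes ASCII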


Enforcing UTF-8 as the default encoding, barring a magic comment otherwise, would hardly have been the biggest compatibility break in the 2.x line. It could have been done in any minor release, IMO.


To be fair, IDLE is pretty garbage in most ways.


> in Mercurial's code base, most of our string types are binary by design: use of a Unicode based str for representing data is flat out wrong for our use case.

I feel like this is the essence of the article: specific constraints/choices of Mercurial made their port to Python 3 difficult. Working with early Python 3 certainly did not help. But there seems to have been some stubbornness here mixed with a lot of retroactive justification.

> One was that the added b characters would cause a lot of lines to grow beyond our length limits and we'd have to reformat code.

This is almost ridiculous. You are going to write a JIT partial 2to3 instead of just increasing your length limits and/or using an autoformatter? (Of course, it turns out they eventually did do that... after a bit more stubbornness regarding the autoformatter.)

> So I'm not sure six would have saved enough effort to justify the baggage of integrating a 3rd party package into Mercurial.

Couldn't this have been a very occasional copy and paste, instead of a downstream dependency? [six](https://six.readthedocs.io/) "consists of only one Python file, so it is painless to copy into a project."

> Initially, Python 3 had a rather cavalier attitude towards backwards and forwards compatibility.

Yes, can't disagree. Early adopters who attempted to write 2- and 3- compatible code suffered the most.


> Matt knew that it would be years before the Python 3 port was either necessary or resulted in a meaningful return on investment (the value proposition of Python 3 has always been weak to Mercurial because Python 3 doesn't demonstrate a compelling advantage over Python 2 for our use case). What Matt was trying to do was minimize the externalized costs that a Python 3 port would inflict on the project. He correctly recognized that maintaining the existing product and supporting existing users was more important than a long-term bet in its infancy.

Having just done transitions on a number of much smaller projects I had the same thought. Changes to string handling tripped me up and the changes to relative imports took some thinking. But the biggest frustration was the nagging question: Why am I doing this?

edit: missing word


> Why am [I] doing this?

Lack of security updates past 2019 forced our hand. Did you find a way around that?


> Lack of security updates past 2019 forced our hand. Did you find a way around that?

Amazon is maintaining Python 2 for at least 4 years, as part of their Amazon Linux long term support release. Google app engine will support Python 2 for an unknown amount of time; they haven't announced an end date. PyPy is Python 2, with (to the best of my limited knowledge) no plans to deprecate support. There are also other LTS releases out there which include Python 2 support.

IOW, the forcing function of the PSF no longer supporting Python 2 is not as big a factor as was hoped.


Security updates in Python itself aren't the only issue; a Python 2 project may also depend on packages with security issues of their own, which require continued upstream maintenance.

For example, the python-saml package (for managing SAML-based single sign-on) has separate Python 2 and Python 3 versions, and implements a security-sensitive protocol which means it has (in the fairly recent past) gotten security updates for issues serious enough to rate an assigned CVE. If you're using it, having the current maintainers walk away from the Python 2 version is a serious risk...


I'm a maintainer for a somewhat popular Python package that had support all the way back to 2.4, but I've had to systematically remove support for those versions. The problem is all the CI infrastructure and testing packages are removing support.

Is Amazon planning to support pytest for at least 4 years? It will have its last 2.7-supporting release very soon.


This would only help the server side of Mercurial though. There's no client-side supported distribution really. PyPy is not that popular yet.


I don't know about others, but when I used Mercurial, it was via installing it through brew. And if brew installed pypy as a dependency so Mercurial could still use Python 2, I probably wouldn't have noticed.


You'd notice, because it wouldn't work: https://www.mercurial-scm.org/wiki/PyPyPlan


PyPy is keeping Python 2 support indefinitely, I believe.


There's a project for keeping Python 2 alive: https://github.com/naftaliharris/tauthon

It's particularly uncool that Guido brought up the prospect of lawyers (https://github.com/naftaliharris/tauthon/issues/47#issuecomm...) to force it not to be called Python and opposed letting people who care about keeping Python 2 alive evolve it as "Python 2". (I know he has the legal right to insist on the name change. Still uncool.)


I understand your point of view but on the other hand we can make the parallel with Perl 5 and 6. Having incompatible forks of the language share the same name is a pain for everyone involved. I can completely understand the "mainline" python maintainers not wanting to have to deal with that.

Besides, if the Tauthon people are serious about maintaining their fork long term, it needs to become more than a mere fork and a real language ecosystem of its own; in the long run, having a different name will probably help with that, assuming they ever get there.

EDIT: Also reading the rest of the thread I realize that the post that you linked out of context is slightly misleading (but I blame github's aggressive folding more than you here). Guido's answer comes after the following exchange:

stefantalpalaru: "Disregard Guido's objection. The "Python" trademark doesn't extend to "py2" or "py28". Read this for details: https://www.python.org/psf/trademarks/"

Guido: "Isn't the whole point that we're trying to solve this without lawyers?"

stefantalpalaru: "The whole point is that you've been sabotaging Python 2 for years and when someone does what needed to be done from the start, you come up with silly objections."

Guido: "OK, bring in the lawyers."

In that light, and given the other poster's ridiculously inflammatory take, Guido's answer seems rather level headed and appropriate IMO. He stands his ground, so to speak.


Re: I understand your point of view but on the other hand we can make the parallel with Perl 5 and 6.

Please note that Perl 6 has been renamed to Raku (https://raku.org using the #rakulang tag on social media). So Perl and Raku are now considered to be different languages, albeit from the same inspiration.

Now, if Python 2 people would decide to rename Python 2 to something else, I guess it would be a mirrored parallel :-)


That's precisely what's happening with the third-party Python 2 forks. The renaming of Perl 6 occurred last October, specifically because of the confusion between the incompatible Perl 5 and 6, which caused a lot of trouble for the Perl people on either side for many years.

It's not a mirrored parallel, it's the Python folks learning from Perl's mistakes and making sure that this parallel won't come to be.


I think it makes a great deal of sense for the Python core team to say "we're finished with Python 2 and want nothing more to do with it".

But I'm very disappointed that the Python Software Foundation isn't explicitly supporting people who want to keep Python 2 compiling and running on modern systems. I think that would be well within their remit to "promote, protect, and advance the Python programming language".

This is particularly so because Python is widely used for scientific purposes, and being able to reproduce old results is valuable.

Even before Python 3.0 appeared, I came across scientists saying "I prefer to stick with Fortran because new Python versions break old code too frequently".


PSF does not object to people who keep Python 2 compiling and running, such as ActiveState (https://www.activestate.com/company/press/press-releases/act...).

This case is different, because it's a project that uses the Python name, but actively adds features to the language. This is the classic example of brand confusion - someone might try to use it, find something to complain about, and PSF's reputation suffers as the result. They also get support overhead from the users of the fork (even if all they do is tell them to go away, that is still triage time that could be spent on other issues).


"Does not object" is better than nothing, but I think it would be better if the PSF actively helped to coordinate this work (again, without bothering the core team). As far as I'm concerned, this is exactly the sort of thing that the PSF exists for.


>This is particularly so because Python is widely used for scientific purposes, and being able to reproduce old results is valuable.

You can always download an old version and the respective libraries and use them to reproduce any results you want. That doesn't mean that old version should be supported anymore.


Longtime Python dev who was also annoyed by the 2 - 3 transition here.

I don't see Guido as in the wrong for that. It'd be a smack in the face when you spend years trying to finally push people to switch (for better or for worse) and then a project like this takes the SEO and gets to run freely with it.


Why should Guido or PSF get to tell people to stop using Python 2 even if they no longer want to work on it? It's ungraceful not to hand off maintainership on good terms to someone who wants to do the work.

Imagine if Stroustrup had done D and insisted that it be called C++ and wanted everyone to stop using the language everyone knew as C++ on Jan 1st 2020.


> Why should Guido or PSF get to tell people to stop using Python 2 even if they no longer want to work on it?

They aren't stopping people from using Python 2, the language or Python 2, the software.

They are stopping people from using the name “Python” as the name of forked implementations of Python 2 not maintained by the PSF. No implementation not maintained by PSF is allowed to be called unqualified Python; the name is an important indicator of provenance. There are and have been plenty of third-party Python (2 and otherwise) implementations, the implementations just need their own names.


> They aren't stopping people from using Python 2, the language or Python 2, the software.

The effort to claim the binary name python for Python 3 is actively hostile to having Python 3 and a thing that runs unmodified Python 2 coexist on the same operating system installation. (It's unclear to me how much this is a PSF push, but at least the PEP isn't telling distros to refrain from this hostile-to-compatibility action.)

> No implementation not maintained by PSF is allowed to be called unqualified Python

The best situation would be PSF hosting continued Python 2-compatible development by people who want to do the work.


> The best situation would be PSF hosting continued Python 2-compatible development by people who want to do the work.

For who? This costs the PSF manpower/overhead that they don't want to expend on a thing they don't want to maintain. It dilutes the language that the PSF are stewards of, and would further cause a schism in the python community. None of those things sounds good for python, its ecosystem, or the PSF. They sound good for, like, a few curmudgeonly companies and individuals that don't want to migrate.

I can't parse your first sentence, so I can't respond to it.


> For who?

For users of the Python 2 language who have a lot of Python 2 code and for whom migration doesn't make cost/benefit sense on technical merits of Python 3.

There's Tauthon. There's Active State's long-term support for Python 2. There's presumably Red Hat's long-term support for Python 2. There are probably others. Also, there's the need to keep the server side of pip up and running for these to work.

It would be great if there was a common venue for collaboration for these by the parties who are interested in keeping Python 2 going. (I'm not suggesting that Python 3 core devs should do the work.) Like a foundation for Python software.

The first sentence meant that claiming the command-line executable name python for Python 3 is hostile to letting an execution environment for Python 3 and an execution environment for Python 2 co-exist going forward without having to modify existing programs that assume that python is for Python 2 and python3 is for Python 3.


> The first sentence meant that claiming the command-line executable name python for Python 3 is hostile to letting an execution environment for Python 3 and an execution environment for Python 2 co-exist going forward without having to modify existing programs that assume that python is for Python 2 and python3 is for Python 3.

Yes, but I don't believe I've seen any (real) suggestions to change PEP 394.

> There's Tauthon.

Which I claim is actively bad for python's ecosystem in the long term. It shouldn't be supported by any organization that wants what is best for Python.

> There are probably others. Also, there's the need to keep the server side of pip up and running for these to work.

That works just fine without any help. pypi continues to support python2 tags and wheels, and I doubt that'll change anytime soon.

> There's Active State's long-term support for Python 2. There's presumably Red Hat's long-term for Python 2.

So the entire reasonable bit here is that the PSF should provide something to help various enterprise companies manage backporting security patches. Which, like, I'm not sure what infrastructure is actually needed for that. They already make security patches public. Unless you're suggesting that LTS enterprise support offerings should co-ordinate additional feature work on python 2, which is both unusual and again I claim actively harmful to the ecosystem.


> Unless you're suggesting that LTS enterprise support offerings should co-ordinate additional feature work on python 2, which is both unusual and again I claim actively harmful to the ecosystem.

If you have a large amount of Python 2 code that doesn't make sense to rewrite as Python 3 but does make sense to keep developing as opposed to just keep running as-is, it makes sense to want compatibility-preserving improvements to the language.

That such improvements are considered actively harmful comes from a point of view where there's a top-down imperative to shut down Python 2 in order to make Python 3 succeed. It's not harmful from the point of view of the code people have written in Python 2 being valuable.

The notion that the user community needs to work for Python (by porting to Python 3), and that Python 2 needs to be shut down as opposed to Python development valuing the existing code that had been developed, is the core problem with Python 3.


> If you have a large amount of Python 2 code that doesn't make sense to rewrite as Python 3 but does make sense to keep developing as opposed to just keep running as-is, it makes sense to want compatibility-preserving improvements to the language.

But it really doesn't. If the new features are that valuable, you can convert your code. It's not actually that hard (I have a few 100kloc ported forward now, with millions of lines of dependencies that says so).


That project is not Python 2 though, it added features that made it incompatible with both Python 2 and Python 3. Just look at their effort to add wheel support.

Any project that forks changes name:

nagios -> icinga

mysql -> mariadb

NetBSD -> OpenBSD

FreeBSD -> DragonflyBSD

Python -> PyPy, Jython, IronPython

It would be crazy for them to keep the same name and not be compatible. It would cause confusion and also lead to increase of support tickets in wrong bug trackers.


In those cases, the original project lived on. Here Python 3 is the incompatible fork, but because the technical fork was done by the folks who control the name and who want to shut the old thing down, the compatible evolution of Python 2 had to change its name.


Your analogy is not appropriate. The actual situation with Tauthon is as if someone was not happy with C++17, so they forked C++14, added new features, changed syntax, and then insisted on calling it C++. It's just confusion for the users and it's in the best interest of the PSF to protect the Python name.


I'd agree if the Python core devs were still interested in evolving Python 2.x. But they aren't, so now no one else gets to do Python 2.8, either. It would be the best if the PSF provided a venue for Python 2.x development even if the folks who went on to do Python 3 weren't the people working on it.

Anyway, the core problem is a top-down effort to make a programming language with Python 2.x’s level of usage stop, to the extent it’s stoppable under its license, because its creators wanted to do something else, as opposed to facilitating its user community pooling resources to continue its development. Does the PSF have a legal obligation to do such facilitation? No. Is the lack of such facilitation bad for parties who bought into Python when it was Python 2? Yes.


> core devs were still interested in evolving Python 2.x.

They absolutely are. In fact, python 3.9 is in the works right now, which has many new evolutions beyond 2.7.

You're arguing that the psf should treat python2 and 3 as different languages. In their (and my) opinion, this is harmful. It bifurcates python into two incompatible languages. That's bad long term (Perl).

In other words, what's best for python the language, and what's best for python2 the language are not the same. And for the psf, python is more important.


> They absolutely are. In fact, python 3.9 is in the works right now, which has many new evolutions beyond 2.7.

I meant compatible (in the sense that old programs keep running and you can add new stuff to old programs using the new features) evolutions.

> You're arguing that the psf should treat python2 and 3 as different languages.

For practical purposes, they are different languages and the PSF has been treating them as distinct things.

> In their (any my) opinion, this is harmful. It bifurcates python into two incompatible languages. That's bad long term (Perl).

It indeed is bad. I hope that every other programming language community and designer takes a close look at what happened and makes sure never to do a Python 3 analog of their language.

> In other words, what's best for python the language, and what's best for python2 the language are not the same. And for the psf, python is more important.

That's the core problem from the perspective of Python 2 users. The organization that was the steward of the language that they invested in (in the form of writing code in the language) decided not only that a different programming language is more important for the org but that the old language needed to be shut down in order to benefit their new thing.

It's OK for people to get bored with a project and move onto something else, but with the level of usage that Python 2 had and has, it's very problematic for the language steward organization to turn around and seek to shut the language down instead of continuing to evolve it in a way that's respectful of the language users' investment in the language.


> a way that's respectful

You had like 10 years of warning and it's "disrespectful"? I don't think there's a chance of productivity if you're starting from that baseline level of entitlement. Sure, mandates are annoying. But I just can't fathom that.


It's not about how many years of warning there was. It's about making users of the language rewrite by mandate, as opposed to the new features being incrementally adoptable into existing code bases. Sure, that means there are some language changes you never get to make.

Java, JavaScript, C, and C++, for example don't break investment in old code like Python 3 did. They form a reasonable baseline.


And we have Kotlin, TypeScript, and Rust due to those languages' unwillingness to make breaking changes. The C++ committee's unwillingness to remove old garbage from the language is, IIRC, the most cited issue with the language by longtime users.

There are tradeoffs.


You can add Kotlin to your app without rewriting all Java. You can add TypeScript to your app without rewriting all JavaScript. You can add Rust to your app without rewriting all C++. Seems reasonable.

That Python 2 and 3 can't co-exist in an app is pretty bad in comparison.


> That Python 2 and 3 can't co-exist in an app is pretty bad in comparison.

You're mistaken. I have python3 binaries and python2 binaries that share dependencies.

You're correct that fully automatic transpilation is impossible, but that doesn't mean that there can't be shared source. It does however mean that things like per-file flags or whatnot aren't possible. Python became a better language with text vs. bytes support, but that support couldn't be done in a backwards compatible way. Oh well.

> You can add Rust to your app without rewriting all C++.

It's not as good as you seem to think. It's a nonstarter for a lot of people otherwise interested in adopting rust into existing codebases. Certainly not better than the py2/3 situation.

Kotlin interop also is troublesome, although granted better than rust/cpp or py2/3.

> That Python 2 and 3 can't co-exist in an app is pretty bad in comparison.

That python didn't get replaced by a different language is an incredible testament to the foresight of the python language stewards.


> as opposed to facilitating its user community to pool resources to continue its development

How does reusing the name facilitate the development? Every time there is a fork of an open-source project the name changes, precisely to avoid confusions. Reusing the Python name in a fork that is not just a redistribution, but a new version with new features and syntax, is just confusing, unusual and does not help anyone.


The difference is that the major C++14 implementations are still supported, and probably will be for as long as the concept of C++ exists.

There's no `--std=python2` flag you can pass to the interpreter, unfortunately.


There is no '--std=C++14-with-arbitrary-things-from-c++20' flag either, which is what this fork does. We can discuss whether breaking backwards compatibility was bad or necessary, but creating another fork of Python that backports some features of Python 3 is just adding confusion. If their primary purpose is supporting Python 2.7 applications, they can do just fine without calling it Python.


There doesn’t need to be such a flag because C++17 was fully backwards compatible with 14, with some tiny exceptions nobody cares about.

Indeed, C++ has rarely made any breaking changes. A decade or so ago, GCC did cause some major ecosystem breakage, by cracking down on C++ constructs which had never been valid according to the spec but which GCC had previously allowed. When that happened, there was a flag to (at least partially) revert to the old behavior: -fpermissive.


> There doesn’t need to be such a flag because C++17 was fully backwards compatible with 14, with some tiny exceptions nobody cares about.

This literally does not parse. How do you know "nobody" cares about those exceptions?


'-std=C++14' already includes a few extensions, it is not pure C++14 but a superset. And then there is '-std=gnu++14' too.


No, this is like if someone was not happy with C++17 AND gcc removed its support for C++14. Instead, I can still happily compile C++14, C++11, C++03 and C++98 code with gcc.


I disagree. If someone else wants to continue to develop Python 2 outside the Python foundation and formalised development community, then that is their prerogative, but Python has the right to decide what is and is not Python.

Dilution of what is commonly accepted to be Python would not be a good thing, and would further add to confusion.

I know that platform upgrades are painful, but we need to move with the times or we'll all be mired in technical debt and old technology.


Yes, them keeping Python 2 alive for the 10 years while Python 3 was developed caused a lot of issues; it would be extremely short-sighted to allow a third (incompatible) Python into the mix.


> it would be extremely short sighted to allow third (incompatible) python into the mix

The whole point of Tauthon is that it is compatible with Python 2 (in the direction that old programs work).


Letting "Python 2" zombie around is unacceptable. Python 3 is better in every way, and has been since 3.3 (which Armin deserves a lot of credit for).

Consider anyone who wants to build something with Python, whether it's a library, application, or service. What's better, having to build for Python 3 and 2, or just Python 3?

Thank God that Guido did this, despite knowing all the blowback he'd get. To me, that's super cool.


"better in every way" ... except for 1) startup time (according to the linked-to article), 2) support for existing Python 2 code, and 3) support for Python 2 C extensions.

For example, https://blog.khinsen.net/posts/2017/11/16/a-plea-for-stabili... describes the "Molecular Modelling Toolkit (MMTK), which might well be the oldest domain-specific library of the SciPy ecosystem, will probably go away after 2020. Porting it to Python 3 is possible, of course, but an enormous effort (some details are in this Twitter thread[1]) for which resources (funding plus competent staff) are very difficult to find."

[1] The thread at https://twitter.com/khinsen/status/930749714567434240 includes "Lots of C modules written for Python 1.4 are waiting for enthusiastic code archeologists ;-)".

I don't think Hinsen is alone in that situation. I can well believe there are some people who, for example, plan to retire in about 5 years and would rather stick with a Python 2 zombie than spend time porting working code to Python 3.


Startup time is complex, but base startup only increased about 20ms, and that's being generous.

I'll admit Python 3 is still slower at a lot of things. But that feels like saying your new dog is even worse at math than your old one.

The C extension thing isn't Python's fault. It's the job of library and app authors to update. Do we complain that Vulkan has bad SunOS support? This is totally backwards.

Could Hinsen (and others) not just version their deps? It's not like people are erasing Python 2 off the internet. If his main worry is reproducibility, he should be doing that anyway.

---

I don't want to give the impression I like the whole Python 3 thing. I think it was a pretty big mistake and a huge missed opportunity. I'm very sympathetic to people who had to put in a lot of work for basically no good reason--Python 3 didn't really offer anything significantly better than 2 until... 3.5 (3.4 if you think the first pass at async was useful, I personally don't).

But I also find the ballyhooing about it really insufferable. Yeah it was a mistake; Armin Ronacher (as usual) was right. It was also over 11 years ago. Time to forget all about this and build cool stuff, please please please.


It takes me 37ms to load Hacker News; a 20ms start time is embarrassing. What is it even doing?


It does a lot of module imports. Most of those probe the filesystem.

Try "python -vv -c 'pass'" - I'm only showing the first few dozen lines, and I've trimmed some of the paths for conciseness:

    % python -vv -c 'pass'
    import _frozen_importlib # frozen
    import _imp # builtin
    import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
    import '_warnings' # <class '_frozen_importlib.BuiltinImporter'>
    import '_weakref' # <class '_frozen_importlib.BuiltinImporter'>
    # installing zipimport hook
    import 'zipimport' # <class '_frozen_importlib.BuiltinImporter'>
    # installed zipimport hook
    import '_frozen_importlib_external' # <class '_frozen_importlib.FrozenImporter'>
    import '_io' # <class '_frozen_importlib.BuiltinImporter'>
    import 'marshal' # <class '_frozen_importlib.BuiltinImporter'>
    import 'posix' # <class '_frozen_importlib.BuiltinImporter'>
    import _thread # previously loaded ('_thread')
    import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
    import _weakref # previously loaded ('_weakref')
    import '_weakref' # <class '_frozen_importlib.BuiltinImporter'>
    # miniconda3/lib/python3.7/encodings/__pycache__/__init__.cpython-37.pyc matches miniconda3/lib/python3.7/encodings/__init__.py
    # code object from 'miniconda3/lib/python3.7/encodings/__pycache__/__init__.cpython-37.pyc'
    # trying miniconda3/lib/python3.7/codecs.cpython-37m-darwin.so
    # trying miniconda3/lib/python3.7/codecs.abi3.so
    # trying miniconda3/lib/python3.7/codecs.so
    # trying miniconda3/lib/python3.7/codecs.py
    # miniconda3/lib/python3.7/__pycache__/codecs.cpython-37.pyc matches miniconda3/lib/python3.7/codecs.py
    # code object from 'miniconda3/lib/python3.7/__pycache__/codecs.cpython-37.pyc'
    import '_codecs' # <class '_frozen_importlib.BuiltinImporter'>
    import 'codecs' # <_frozen_importlib_external.SourceFileLoader object at 0x10dd44c90>
    # trying miniconda3/lib/python3.7/encodings/aliases.cpython-37m-darwin.so
    # trying miniconda3/lib/python3.7/encodings/aliases.abi3.so
    # trying miniconda3/lib/python3.7/encodings/aliases.so
    # trying miniconda3/lib/python3.7/encodings/aliases.py
    # miniconda3/lib/python3.7/encodings/__pycache__/aliases.cpython-37.pyc matches miniconda3/lib/python3.7/encodings/aliases.py
    # code object from 'miniconda3/lib/python3.7/encodings/__pycache__/aliases.cpython-37.pyc'
    import 'encodings.aliases' # <_frozen_importlib_external.SourceFileLoader object at 0x10dd67d10>
    import 'encodings' # <_frozen_importlib_external.SourceFileLoader object at 0x10dd440d0>
    # trying miniconda3/lib/python3.7/encodings/utf_8.cpython-37m-darwin.so
    # trying miniconda3/lib/python3.7/encodings/utf_8.abi3.so
    # trying miniconda3/lib/python3.7/encodings/utf_8.so
    # trying miniconda3/lib/python3.7/encodings/utf_8.py
    # miniconda3/lib/python3.7/encodings/__pycache__/utf_8.cpython-37.pyc matches miniconda3/lib/python3.7/encodings/utf_8.py
    # code object from 'miniconda3/lib/python3.7/encodings/__pycache__/utf_8.cpython-37.pyc'
    import 'encodings.utf_8' # <_frozen_importlib_external.SourceFileLoader object at 0x10dd44bd0>
    import '_signal' # <class '_frozen_importlib.BuiltinImporter'>
    # trying miniconda3/lib/python3.7/encodings/latin_1.cpython-37m-darwin.so
    # trying miniconda3/lib/python3.7/encodings/latin_1.abi3.so
    # trying miniconda3/lib/python3.7/encodings/latin_1.so
    # trying miniconda3/lib/python3.7/encodings/latin_1.py
    # miniconda3/lib/python3.7/encodings/__pycache__/latin_1.cpython-37.pyc matches miniconda3/lib/python3.7/encodings/latin_1
       ... many, many more lines omitted ...
This can be sped up a lot using a zipimport of the Python standard library, https://docs.python.org/3/library/zipimport.html?highlight=z... , where all of the standard library is put in a zipfile. Then there's only a single file access to get the zip metadata.

One of the things that bugged me in Python2 was that every startup imported UserDict:

    # trying python2.7/UserDict.so
    # trying python2.7/UserDictmodule.so
    # trying python2.7/UserDict.py
    # python2.7/UserDict.pyc matches python2.7/UserDict.py
    import UserDict # precompiled from python2.7/UserDict.pyc
This is because os.environ was an instance of UserDict:

    % python2.7 -c 'import os; print(os.environ.__class__.__bases__)'
    (<class UserDict.IterableUserDict at 0x1029b14c8>,)
Under Python3 this is spelled collections.abc.MutableMapping:

    % python3 -c 'import os; print(os.environ.__class__.__bases__)'
    (<class 'collections.abc.MutableMapping'>,)
which triggers its own set of imports:

    # trying python3.6/collections/abc.cpython-36m-darwin.so
    # trying python3.6/collections/abc.abi3.so
    # trying python3.6/collections/abc.so
    # trying python3.6/collections/abc.py
    # python3.6/collections/__pycache__/abc.cpython-36.pyc matches python3.6/collections/abc.py
    # code object from 'python3.6/collections/__pycache__/abc.cpython-36.pyc'
    import 'collections.abc' # <_frozen_importlib_external.SourceFileLoader object at 0x103c2ecf8>
There's better performance using an SSD than an HDD, which is in turn better than using a networked filesystem.


I'm guessing startup time just isn't an important goal for CPython then? I know they've refused to implement some optimizations that would significantly increase complexity, but this seems like low-hanging fruit?


Oh, there's been plenty of work to reduce the Python startup cost.

It's just hard to fix.

I'm not sure the os.environ example I gave is low-hanging fruit now. The collections.abc module might be imported anyway.

This is neat! Python 3.7 added the `PYTHONPROFILEIMPORTTIME=1` environment variable to help track down these sorts of import overheads:

  % env PYTHONPROFILEIMPORTTIME=1 python -c pass
  import time: self [us] | cumulative | imported package
  import time:       523 |        523 | zipimport
  import time:       722 |        722 | _frozen_importlib_external
  import time:       156 |        156 |     _codecs
  import time:      2254 |       2409 |   codecs
  import time:      1293 |       1293 |   encodings.aliases
  import time:      7192 |      10893 | encodings
  import time:      1108 |       1108 | encodings.utf_8
  import time:       182 |        182 | _signal
  import time:      1069 |       1069 | encodings.latin_1
  import time:       395 |        395 |     _abc
  import time:      1486 |       1881 |   abc
  import time:      1540 |       3420 | io
  import time:       100 |        100 |       _stat
  import time:       975 |       1075 |     stat
  import time:      1481 |       1481 |       genericpath
  import time:      1734 |       3214 |     posixpath
  import time:      2558 |       2558 |     _collections_abc
  import time:      2234 |       9079 |   os
  import time:      1407 |       1407 |   _sitebuiltins
  import time:      3498 |       3498 |   sitecustomize
  import time:        85 |         85 |   usercustomize
  import time:      4129 |      18196 | site
Investigating further, the "import os" which triggered the UserDict/collections.abc is a consequence of "import site". If I use "python -S" then those aren't imported.


I can't help but interpret your response as "Python 3 is better in every way ... for the ways I think are important."

For some of my programs, Python startup time is the main overhead. I avoid NumPy and SciPy if at all possible because they have a huge startup overhead.

Some of this is inherent in those packages. NumPy internally imports everything so someone can do "import numpy as np; np.package.subpackage.module.function()" without doing the intermediate imports.

This means NumPy is optimized for programmers (especially novice programmers) using NumPy in long-lived processes where startup cost is a negligible overhead.

Which isn't all use-cases for numeric computing.

15 years ago I supported a CGI-based web app. It was very important to pull out all the stops (delay imports until needed, use zip packages) because it was easier to do that than to re-write everything for another architecture.
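(The "delay imports until needed" trick is nothing exotic; a sketch, with numpy standing in for any expensive dependency:)

    def load_matrix(path):
        # Importing inside the function means short-lived invocations that never
        # reach this code path don't pay the import cost at startup.
        import numpy as np
        return np.loadtxt(path)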

The dog does count pretty well after all.

> It's the job of library and app authors to update.

Why? Linus Torvalds doesn't agree with you, for one.

As Hinsen points out,

] Unfortunately, the need for long-term stability is rather specific to scientific users, and not even all of them require it (see e.g. these two tweets by Titus Brown). So while Python 3 is probably a step forward for most Python users, it’s mostly a calamity for computational science.

Some scientific code has been able to run unchanged since the 1970s, through multiple new Fortran language releases.

Now, yes, I know the reasons for the changes to Python. I know the funding and organizational realities.

But why not recognize that for some situations Python 3 is not better?

Hinsen also comments on your proposal:

] The implication is that breaking changes in the infrastructure layers are OK and must be absorbed by the maintainers of layers 3 and 4. In view of what I just said about layer 4, it should be obvious that I don’t agree at all with this point of view. But even concerning layer 3, I find it a bit arrogant. The message to research communities with weaker code development traditions, and thus fewer resources, is that their work doesn’t matter.

> Could Hinsen (and others) not just version their deps?

He addresses that, I think. One of the other commenters gives a more complete reply at https://metarabbit.wordpress.com/2017/11/18/numpy-scipy-back... ending "Freezing the versions solves some problems, but does not solve the whole issue of backwards compatibility.".

> Time to forget all about this and build cool stuff, please please please.

I'll quote Hinsen again "I find it a bit arrogant. The message to research communities with weaker code development traditions, and thus fewer resources, is that their work doesn’t matter."

Your implicit statement is that mmtk (Hinsen's code base) isn't "cool stuff". Why? Simply because it's old, or because you don't know about it or need it? What other cool old stuff will die because it's part of a community without the resources to update?

Instead, accept that that loss is part of the trade-offs, be empathetic to those who suffer, and bear those lessons in mind for future work you do.


Well, I'd like to start off by saying I think we agree overall. I do think Python 3's advantages didn't merit its disadvantages until many years after the initial release.

Second, I admit to engaging in hyperbole when I said "Python 3 is better in every way"; usually I'm on the other side of these, but I'm just so fed up with people complaining. But you're right, there are still ways Python 3 isn't "better". I'd love to have productive, technical discussions about them, but we can't seem to get beyond the "Python 3 was a super bad idea" stuff, and I'm totally uninterested in that.

But beyond that, you and I are mostly talking about different things. Python 3 isn't NumPy or SciPy. If you're building extensions on top of them, you need to look at their compatibility commitments. If you want them to make more commitments, you have to convince them. This isn't specific to software engineering; this is due diligence for anything you're gonna put years of work into.

Django's page [1] is a great example of this. Python has one too [2]. I don't have any idea about SciPy/NumPy; It looks like SciPy 1.2.0 was an LTS release supported until 1/1/2020, but what do I know.

But importantly, the end result of this "hey, do 100x the work otherwise our science won't be reproducible" stuff will be to force people out of producing free software for scientific computing. And the non-free stuff is expensive, good god. Surely this isn't what you want.

A better tactic here is to work with the developers in establishing more compatibility between releases. You probably aren't gonna get Fortran levels of compatibility--a language and platform that's seen very, very little change over the decades. But then again, the core selling point of scientific Python is that you get to use a modern platform with modern features. Asking for that along with a 50 year compatibility guarantee is a laughably tall order: you can't have it both ways without exponential amounts of work. So just like you're asking other engineers to be empathetic and respect your need for more compatibility with your extensions, you need to be more empathetic and respect their resources. And the best place to do that is probably their contact page [3], not Twitter, HN, or random blogs.

[1]: https://www.djangoproject.com/download/#supported-versions

[2]: https://devguide.python.org/#status-of-python-branches

[3]: https://www.scipy.org/scipylib/mailing-lists.html


You write "we can't seem to get beyond the "Python 3 was a super bad idea" stuff, and I'm totally uninterested in that."

Perhaps your "fed up"-ness means you overlook conversations which do go beyond that? Or do you put me into that category as well?

> Python 3 isn't NumPy or SciPy. ... this is due diligence for anything you're gonna put years of work into.

Hinsen's essay discussed these issues related to "software layers and the lifecycle of digital scientific knowledge". He put Python in layer 1, and NumPy/Scipy in layer 2.

In his essay he also said "I would like to see the SciPy community define its point of view on these issues openly and clearly. ... It’s OK to say that the community’s priority is developing new features and that this leaves no resources for considering stability. But then please say openly and clearly that SciPy is a community for coding-intensive research and that people who don’t have the resources to adapt to breaking changes should look elsewhere. Say openly and clearly that reproducibility beyond a two-year timescale is not the SciPy community’s business, and that those who have such needs should look elsewhere."

So I'm not convinced that we are talking about different things as you are making points I already referred to, albeit indirectly.

I'm also not sure you understood all of Hinsen's points. I say this because you wrote ""hey, do 100x the work otherwise our science won't be reproducible" stuff"

But Hinsen said "Layer 4 code is the focus of the reproducible research movement" and "the best practices recommended for reproducible research can be summarized as “freeze and publish layer 4 code” -- a solution you mentioned earlier.

It's just that reproducibility isn't the only goal for stability.

Another is to be able to go back to a 15 year old project and keep working on it, without taking the hit of rewriting it to a new, albeit similar, language.

I also have a small amount of umbrage about your comment:

> So just like you're asking other engineers to be empathetic and respect your need for more compatibility with your extensions, you need to be more empathetic and respect their resources.

I earlier wrote "Now, yes, I know the reasons for the changes to Python. I know the funding and organizational realities."

Did you overlook that because of your '"fed up"-ness', or was that not enough for you?


> Perhaps your "fed up"-ness means you overlook conversations which do go beyond that? Or do you put me into that category as well?

I do put you in that category, because you seem to be focused much more on the negative, rather than being constructive and trying to find solutions to problems.

> I'm also not sure you understood all of Hinsen's points. I say this because you wrote ""hey, do 100x the work otherwise our science won't be reproducible" stuff"

I've read and directly disagreed with his essay. His points are:

- Python 2 going away orphans a lot of software, because there's a lack of resources/willingness to port to Python 3.

- Python 3 didn't provide enough value to the scientific community to justify all the breakage (this is true for almost every community, btw).

- SciPy breaks compatibility roughly every 2-3 years, which is a bad fit for the pace of scientific computing.

- Beyond that, breaking compatibility threatens reproducibility.

- The SciPy community doesn't seem to know or care about compatibility concerns.

- Projects written on top of SciPy libraries ("Layer 3" code) have to keep updating, and they don't always have resources/willingness to do that.

- It would be cool if SciPy laid out a support schedule.

- It isn't cool that SciPy says, "hey use us", and then breaks compat all the time.

- There are some languages/platforms that haven't changed in decades, this isn't an excuse.

Here's what I've said:

- Agree Python 3 didn't provide enough value.

- If you want to build something on SciPy that you expect to last for decades, you should look for a compat guarantee. If you don't, that's on you.

- If you want new features plus decades of compat, that's a ludicrous amount of work.

- If you want to find a way forward, start a dialogue with SciPy devs.

Hinsen's examples of Fortran and Java are illuminating. Fortran's a platform that's seen very minimal evolution over its history. That's exactly the reason people want to use SciPy instead of Fortran. Java's a platform with... billions of engineering hours? It's ironic that a guy who doesn't want to spend the resources to update his own software is asserting that someone else can continually deliver a modern scientific computing platform with new features while never breaking compat, and that they just don't feel like it ("It's all a matter of policy, not technology"). That's wrong; it's a question of resources.

---

My diagnosis here is communication breakdown. Everyone here wants the same thing: use a modern software stack for scientific computing. So again I'll say get on the mailing lists, get on IRC, go to the conferences, and talk to the engineers. Be constructive.


Did you see the other replies on the thread?

Guido has absolutely every right here.


I downgraded Deluge to the Python 2 version because the new one doesn't work in Windows and I use both operating systems.


It’s funny, on the Mac one becomes used to constant changes, rewriting damn near everything just to stand still. Yet I designed my Mac app long ago to depend on the system “Python 2” (bound to C++), because it seemed that both the installation itself and the Python language and libraries were very stable. Looking back, this turned out to be sustainable for a remarkably long time, as “Python 2” really did evolve only additively and there was almost no reason to even touch 15-year-old code that was relying on Python 2. For the Mac platform especially, this reliability is unheard of.

More amazing to me is that in Catalina, the release famous for breaking just about everything else, “Python 2” is still there and works as it always has! Of course, Apple did announce that it will be ripped out in the next release. :)


I think this weird thing happened with Python 2. I believe Python 2.6 (Oct-2008) was the last "feature release" and 2.7 (Jul-2010) was intended as a bridge. So since 2008, 2.x users have been shielded from most all of the normal churn of any widely used language that's in active development.

What I don't think people realize is that not only are you expected to move to 3.x, but you'll have to keep up or fall behind with new 3.x releases. During that same period (since 2008) 3.x has had 9 big releases. Of course that 2.x stability was done with the assumption you'd move to 3.x and isn't sustainable for PSF indefinitely.


> Of course, Apple did announce that it will be ripped out in the next release

They did? Damn, I was using that...


I have never seen such rejection in the Django community, despite real problems like the WSGI design and handling I/O, and thus working with bytes a lot.

Every huge task, like porting from Python 2 to Python 3, is either everybody's task or just a small group's. And since the latter seems more reasonable so as not to interfere with ongoing development, the former is the only way I have seen such tasks succeed.

Artificial rules to create comfort for one group at the expense of another group, like the following

>> This ground rule meant that a mass insertion of b'' prefixes everywhere was not desirable, as that would require developers to think about whether a type was a bytes or str, a distinction they didn't have to worry about on Python 2 because we practically never used the Unicode-based string type in Mercurial.

sound pretty much wrong to me.

If there is a pain, it should become everybody's pain, or otherwise people will simply burn out and hate their own work, like the author did. There is no way porting to Python 3 can be harder than porting to Rust. Rust is statically typed and not garbage collected. Everyone would have to think about whether they need a string or an array of bytes anyway, but also about who owns them.

Overall, described situation looks like management issue and not a technical one to me.

Edit: typos.


> There is no way porting to Python 3 can be harder than porting to Rust. Rust is statically typed and not garbage collected. Everyone would have to think if they need string or array of bytes anyway, but also, who owns them.

The author addresses this. The difference is that when porting to Rust you'd likely get a faster and more correct program in the end. (Huge caveat of big rewrites, of course). Whereas with Python 3 they feel like they did all the porting work and got nothing valuable in return.


> There is no way porting to Python 3 can be harder than porting to Rust. Rust is statically typed and not garbage collected. Everyone would have to think if they need string or array of bytes anyway, but also, who owns them.

The Rust compiler statically checks those decisions, while in Python issues with string types will only be caught at run-time, so everywhere your test suite has missing coverage, porting is likely to introduce regressions. That is one way in which a Rust port would be easier.


I once switched our unit tests from jasmine 1.3 to mocha because jasmine is kind of a mess, and jasmine 1.3 tests look too much like they should still work in jasmine 2.0, except some of the corner cases on equivalence of objects are wrong. So some of your tests would go red with no code change, but others would be green and stay green even when the code no longer functions properly. Like cutting the wires to your smoke detector.

It would take quite a bit of change in a language for a port to be safer than an upgrade, but it's not completely impossible.


We are on the brink of completing the transition to python3 at my work.

The end result of this is that I just spent a good chunk of last week reviewing a pull request with 70,000 lines of changes, which was one of the final ones in a series of ~10k-line pull requests that came in through the fall.

All of this was the heroic effort of one of my coworkers who had the unenviable task of combing through our entire codebase to determine "This is unicode. This is bytes. Here is an api boundary where we need to encode / decode." etc.

It was a nightmare of effort that I'm glad to have behind us.


> All of this was the heroic effort of one of my coworkers who had the unenviable task of combing through our entire codebase to determine "This is unicode. This is bytes.

Dynamic typing!


Not dynamic typing's fault.

The issue is they changed the types out from underneath you.

And then left it to each library to decide which type it was actually going to accept.


But with static typing, the compiler can let you know when you're doing something wrong with the new Unicode-based string.


Well, it can, except you then need to go through and update all of your internal APIs to be correct.

Really the string transition was just a poor choice in my opinion. Python2 already had unicode strings that were easy enough to specify (just prefix with a `u`).

It would have been better to delineate that barrier more clearly from an API standpoint.

I understand the appeal of having unicode for the default string literal type, but it was actively hostile to existing projects.


> Well, it can, except you then need to go through and update all of your internal APIs to be correct.

You do, but it's easy: run a compile, fix the errors, repeat until no more errors.

> It would have been better to just delineate that barrier better from an API standpoint.

Isn't that exactly what the Python 3 transition was? i.e. stop accepting non-unicode "strings" (actually just arbitrary byte sequences) for APIs that semantically require a string, reserve them for APIs that actually want a byte sequence.


> You do, but it's easy: run a compile, fix the errors, repeat until no more errors.

The reason this doesn't work is that previously the double-quote literal was a "string" type. The string type was, yes, just a sequence of bytes, but in an ascii-centric world that also meant text.

Python2 added unicode string literals that accepted unicode code points. Most APIs were happy to sloppily mix the two and generally work quite adequately.

Python3 then made the hard distinction between byte-string and unicode-string. Not an unreasonable position to take on the face of it. The issue is many python2 APIs were written from the perspective of "accepts string literal types", where that could be either bytestring or unicode string.

Now suppose you have a large codebase in python that spans the entire stack from database interaction, to webserver, to desktop application, all built on double-quoted string literals, accepting unicode strings only in the places that needed them (user-facing places mainly, with utf-8 bytestrings anywhere data was stored on disk or sent over the network).

Then you go to switch to python3, and suddenly all of your string literals are interpreted as unicode instead of bytestring / ascii sequences. So now you need to go through every place in your codebase that accepts strings as an argument and determine, "is this a user-facing string, or a utf-8 bytestring", because they used to be basically the same thing, and now they aren't.

It's not "difficult" really, it's just a pain in the neck.


None of that would be a problem in a typed language. The ultimate destination of any string literal is some standard library function, whether that's write to network socket, display to user, or something else. So you just ripple backwards from that through your own functions that are calling those standard library functions, until you get to the point where you're passing in the literal, and then you know what kind of literal it needs to be.
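
Roughly the same ripple effect can be simulated in Python itself with type annotations and a checker like mypy (a hedged sketch; the function and values here are made up for illustration):

  import socket

  def send_header(sock: socket.socket, name: str, value: str) -> None:
      # The bytes/text decision is forced here, where the stdlib wants bytes...
      sock.sendall("{}: {}\r\n".format(name, value).encode("ascii"))

  # ...and the checker ripples it back to every caller: mypy flags
  # send_header(sock, "Content-Type", b"text/html") as passing bytes
  # where str is expected, before the program ever runs.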


> None of that would be a problem in a typed language.

Python is dynamically typed and weakly typed, but still typed. That's precisely the problem! The difference is just that a statically typed language gives you all the information, and a dynamically typed language doesn't, but still fails. Just without providing you the necessary information up-front.

There's a nice explanation here: https://existentialtype.wordpress.com/2011/03/19/dynamic-lan...


> Python is dynamically typed and weakly typed, but still typed.

People who claim that dynamic typing is a thing claim that Python is strongly typed. (This is of course nonsense; there's no such thing as dynamic typing, because types are by definition something that expressions in a language have, not something that runtime values have).

> There's a nice explanation here: https://existentialtype.wordpress.com/2011/03/19/dynamic-lan....

That is not a "nice explanation". It is writing to obscure rather than to clarify. And it certainly acknowledges that one cannot have differently typed values in a dynamic language.


That's not really much different from the path we took. It's just that instead of running the compiler, we ran the linter and test suite until things passed. When you have a million lines of code, that takes quite a while.


Delphi also went through a similar transition from "strings are in whatever the local code page says" with one byte chars to Unicode strings (Windows-style).

However the makers of Delphi spent many years preparing for this, so when the time came for us to switch we only had to spend half a day or so to migrate our half a million lines of code.


Something is wrong if there is no third type: the "natural" string (bytes on Python 2, unicode on Python 3).


I assume many of the strings were left untouched. But you still have to audit all of it to know which needs to be used where.


I believe that's included in the "etc."


Surely any "natural" string would be better represented as unicode in Python 2? What is an example that wouldn't be?


> Surely any "natural" string would be better represented as unicode in Python 2?

No, because much of the stdlib works in terms of native strings and will choke (or, worse, silently fuck up) on the other type. Yes, even in Python 2 the stdlib was absolutely not “unicode clean”.

So a transitional / polyglot codebase has and needs not 2 but 3 string types: bytes, unicode, and native. And neither “unicode literals” nor “bytes literals” were good things to apply across the board.


I've found myself defining `native_str = bytes if PY2 else str` (with `if PY2: str = unicode` at the top of the file, as in all my py2/py3 polyglot code) because there are some things that need bytes on Python 2 and unicode on Python 3 - e.g. the `__file__` attribute of a dynamically created module or other low-level things.
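
Spelled out, the shim looks roughly like this (a sketch of the pattern described above, not a standard library helper):

  import sys

  PY2 = sys.version_info[0] == 2

  if PY2:
      str = unicode  # noqa: F821 - make "str" mean text on both versions

  # The "native" string type: bytes on Python 2, unicode on Python 3.
  # Low-level attributes like a dynamically created module's __file__ want this type.
  native_str = bytes if PY2 else str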


I believe what they meant was that for many strings it really shouldn't matter if they were bytes or unicode. They would perform their function correctly either way. That's completely true, but you do still have to go through and find the cases where that doesn't work.


There is. It looks like this:

  u"Hello World"


The biggest problem with the Python 2 to Python 3 transition was not that breaking changes were made. It’s that breaking changes were made in a way such that you could not easily have code that worked both on Python 2 and Python 3.

It took years before the advent of six, Python 3 u’’ literals, and modernize. The author discusses this at length.


Another big problem is there was no significant incentive to adopt Python 3. That’s why it took so long for large projects to transition. In comparison, during the last decade, C++ went from dodgy C++11 toy projects to all new code being written in modern C++. The modern feature set is that good.


C++ doesn't mandate you switch from std::cout to fmt in order to use lambdas. If they did that, I think we'd see a lot less modern C++.


That’s a find-and-replace fix that can be addressed reliably. A relatively smaller problem versus moving off of boost.

The compiler support for C++11 (and especially inconsistencies in Debian packages, compiled flags, etc) was a very painful issue for several years. But auto is that useful ...


Right, moving to std::cout to fmt could be as simple as a find-and-replace fix. That the C++ committee could have inflicted this minimal pain on their users, but chose not to do it, shows some amount of concern for backwards compatibility and old codebases. By comparison, Python 3 changed the entire text model and dropped the mic, and waited for 8 years to start to pick the pieces back up.


I guess I don’t understand your argument that “Python 3 changed the entire text model and dropped the mic.” Format strings are optional; the old % operator still works fine. The change to unicode is dramatic, but personally I haven’t run into major problems. I’ve had unit tests break because of it, but that’s why one has unit tests. I’ve also worked on a very large python webapp that underwent painful internationalization, and in that case we ended up using unicode strings everywhere anyways.


The % didn't use to work fine. .iteritems() was made for no good reason.

Python 3 could have required that all strings begin with u" or b", but they didn't - they did something which encouraged breakage.


Six was available for years (2011) before Mercurial even started porting (2015).

https://github.com/benjaminp/six/graphs/contributors


That was part of the "discusses this at length". Part of the relevant discussion is:

> So I'm not sure six would have saved enough effort to justify the baggage of integrating a 3rd party package into Mercurial. (When Mercurial accepts a 3rd party package, downstream packagers like Debian get all hot and bothered and end up making questionable patches to our source code. So we prefer to minimize the surface area for problems by minimizing dependencies on 3rd party packages.)


> Perhaps my least favorite feature of Python 3 is its insistence that the world is Unicode. [..] However, the approach of assuming the world is Unicode is flat out wrong and has significant implications for systems level applications (like version control tools).

Isn't this more a problem with Python not easily differentiating between String and Byte types? Both Go and Rust ("""systems""" level languages) have decided that "utf-8 ought to be enough for anybody" and that seems to be a good decision.


Yes, but that insistence that Bytes and Unicode are two different things that Shall Not Be Mixed was mostly a Python 3-ism. Python 2 had different types but you could be sloppy and it would kinda work out.

There was this assumption that Unicode code points were the correct single unit to talk about Unicode. You iterate over code points, you talk about string lengths in terms of code points, you slice in terms of code points. Much like the infamy of 16-bit Unicode, this is an assumption that has kinda gotten worse over time. Now we can and do want to talk about bytes, code points, and newer sets like extended grapheme clusters. I think this is probably the big failing of Python 3's Unicode model. Making a string type operate on extended grapheme clusters might fix it, but we'd be in for the same sort of pain, and the flexibility of "everything is bytes, we can iterate over it differently" of Go and Rust is much nicer in comparison.

The second thing was this assumption that everything remotely looking like text was Unicode, despite this maybe not being true. HTTP has parts that look like plain text, like "GET" and "POST" and the headers like "Content-Type: text/html". But the correct way to view this is as ASCII bytes, and no other encoding makes sense; binary data intermixed with "plain text" definitely happens, and the need to pick and choose between either Unicode or Bytes caused major damage in the standard library which still persists to this day -- some parts definitely chose the wrong side. Take a look at the craziness in the "zipfile" module for one other example. It's probably fixed now, but back then, I basically had to rewrite it from scratch in one of my other projects.

They eventually relented and added back a lot of the conveniences to blur the line between bytes and unicode again, like adding the % formatting operator for bytes, which I think shows that their insistence on separating the two didn't really pan out in practice. And yet, migration is still a pain.
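
For instance, %-formatting for bytes came back in Python 3.5 (PEP 461), so this works again:

  request = b"GET %s HTTP/1.1\r\n" % b"/index.html"
  print(request)  # b'GET /index.html HTTP/1.1\r\n'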


> Python 2 had different types but you could be sloppy and it would kinda work out.

It would "kinda work out", if your Unicode strings were ASCII in practice, and only then. Because whenever a Unicode and a non-Unicode string had to be combined, it used ASCII as the default encoding to converge them.

Which is to say, it only worked out for English input, and even then only until the point where you hit a foreign name, or something like "naïve". Then you'd suddenly get an exception - and it happened not at the point where the offending input was generated, but at the point where two strings happened to be combined.
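
An illustrative Python 2 session (the byte string here is the UTF-8 encoding of "naïve"):

  >>> u"Hello " + "world"          # both sides are ASCII, so the implicit coercion "works"
  u'Hello world'
  >>> u"Hello " + "na\xc3\xafve"   # non-ASCII bytes arrive, and it blows up far from their source
  Traceback (most recent call last):
    ...
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)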

This was a horrible state of affairs for basically everybody except the English speakers, because there was a lot of Python code out there that was written against and tested solely on inputs that wouldn't break it like that.

Intermixing binary data with text can be represented just fine in a type system where the two are different. For your HTTP example, the obvious answer is that the values that are fundamentally binary, like the method name or the headers, should be bytes, while the parts that have a known encoding should be str - there's nothing there that requires actually mixing them in a single value. In those very rare cases where you genuinely do have something like Unicode followed by binary followed by Unicode in a single value, that is trivially represented by a (str, bytes, str) tuple.

The problem with the Python stdlib isn't that bytes and Unicode are distinct. It's that it's overly strict about only accepting Unicode in some places where bytes should be legal, too. This is orthogonal to them being separate types.


> Because whenever a Unicode and a non-Unicode string had to be combined, it used ASCII as the default encoding to converge them.

They could have just changed the default encoding to utf8. (For those too lazy to configure their Python properly.)

There, problem solved - and no need for a breaking Python 3.


It would still be a mess any time you have to deal with byte strings that aren't UTF-8. The problem is with the implicit conversion itself - it shouldn't happen, because there's no way to properly guess the encoding. But there was no way to get rid of it without breaking things.


> But there was no way to get rid of it without breaking things.

Even such a breaking change would be a molehill compared to the mountain of breaking changes in Python 3.

Point is, they had one job, and they failed.


That change was at the heart of the breaking changes around strings in Python 3. If the conversions remained implicit, most people would probably have never even noticed that string literals default to Unicode, or that some library functions now require Unicode strings.


> There was this assumption that Unicode code points were the correct single unit to talk about Unicode.

The most messed-up thing about Python 3 is that it's supposed to be justified by doing Unicode right and they still got it wrong.

Having strings be sequences of Unicode code points is a super-bizarre design. That is, Python 3 strings indeed are semantically sequences of Unicode code points rather than sequences of Unicode scalar values. You can not only materialize lone surrogates (defensible for compatibility with UTF-16) but you can also materialize surrogate pairs in addition to actual astral characters. You still can't materialize units that are above the Unicode range, though, so it's not like C++'s std::u32string.
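
A small Python 3 illustration of that:

  >>> s = "\ud800"          # a lone surrogate: a valid code point, not a valid scalar value
  >>> len(s)
  1
  >>> s.encode("utf-8")     # yet it cannot be encoded as actual UTF-8
  Traceback (most recent call last):
    ...
  UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed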

Looking at the old PEPs, it appears to have arisen by accident rather than as an actual design.


I'm confused, there isn't an insistence that everything is unicode. Http headers are treated as bytes before you decode them, but you can totally decode an http request or response as ASCII. At least until you're interacting with a website that has unicode codepoints in its url.


I think the issue is with people being used to the python 2 approach, where the distinction was between str (bytes) and unicode. In python 3 you should not think of bytes vs unicode, you should think of text vs bytes, and you should use text for as long as possible.

BTW: I believe the http headers are supposed to be encoded using ISO-8859-1; it's essentially the same thing as US-ASCII, but it covers the entire byte range.
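
An illustrative example (the header name is made up): latin-1 maps every byte value 0-255 to the code point with the same number, so it round-trips arbitrary header bytes:

  >>> b"X-Custom: caf\xe9".decode("latin-1")
  'X-Custom: café'
  >>> "X-Custom: café".encode("latin-1")
  b'X-Custom: caf\xe9'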


> Yes, but that insistence that Bytes and Unicode are two different things that Shall Not Be Mixed was mostly a Python 3-ism

Go has string and []byte, and you can't mix them; you have to convert. Java has String, char[] and byte[], and similarly you need to cast. Rust has Bytes and String (I don't know Rust enough, but I'm pretty sure it doesn't do implicit conversion between them).

Also, Python 3 doesn't distinguish between Bytes and Unicode; Python 3 has a distinction between bytes and text (str - BTW: Guido actually expressed regret that he used "str" instead of "text", because it would be much clearer).

In Python 3 you don't have Unicode (as far as you should be concerned), you have text and bytes; how the text is stored internally is an implementation detail. If you need to write to a file or to the network, you encode the text using one of various encodings (the most popular is UTF-8) and you decode it back when reading.
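
For example:

  >>> "naïve".encode("utf-8")          # text -> bytes at the file/network boundary
  b'na\xc3\xafve'
  >>> b"na\xc3\xafve".decode("utf-8")  # bytes -> text when reading back
  'naïve'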


Go's string is guaranteed to be a series of bytes, not Unicode code points. I'm unsure about Java. Rust has a more complicated text model that I won't summarize in this post, but it's far better than Python 3's.

> In Python 3 you don't have Unicode (as far as you should be concerned), you have text and bytes

Python 3 strings store Unicode code points. When you iterate over a Python 3 str, you get back Unicode code points. As mentioned elsewhere, this is not a Unicode scalar value, and can include things like unpaired surrogates. This is also not an extended grapheme cluster, which is the current best-effort description as to what counts as a "single character".

So, you really do need to be concerned about what your strings contain. If you don't want people to care, don't give them the ability to iterate, slice, or index into str to retrieve Unicode code points, and leave them as opaque blobs, as some of those other languages do.
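
A small illustration of why that distinction matters (Python 3):

  >>> s = "cafe\u0301"      # "café" written with a combining acute accent
  >>> len(s)                # 5 code points, though it renders as 4 characters
  5
  >>> len("caf\xe9")        # the precomposed spelling is 4 code points
  4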


> Go's string is guaranteed to be a series of bytes, not Unicode code points. I'm unsure about Java. Rust has a more complicated text model that I won't summarize in this post, but it's far better than Python 3's.

Yes, but at this point you're arguing about implementation details. The idea is that if you use it as a string it is string, if you need bytes, you need to perform a conversion. It shouldn't be your concern how it is stored internally.

If we are going into Python internals, a string can be stored in multiple representations, from a basic C string to unicode code points. If you perform a conversion it will cache the result so it can be reused in other places. I don't remember the details, since I looked at the code a long time ago, but it isn't that simple.


I don't know how to explain it any simpler. Iterating over a str type in Python 3 enumerates Unicode code points. The length of a str type is the number of code points it contains. Reversing a str will reverse the Unicode code points it contains (not guaranteed to be a sane operation). Indexing into a str with foo[0] gives you back a str containing a single Unicode code point.

This is not an implementation detail, it is fundamental to how the str type in Python 3 operates. I have not talked at any point about the internal storage of this type, just the interface it publicly exposes.


This is called a leaky abstraction. I can't see how doing it this way is good behavior for a high-level language. If you index into a string you will always get something that can be invalid; at least in Python or Java you get code points.


Python 3 strs should not be iterated over, sure. Ban that in your linter, then you're in the same position you would be in Rust. It's a misfeature but it's still a detail.


Zipfile has always been a mess. I have no idea why, but its interfaces have been consistently poor from a usability perspective. This well before py3 was a factor.


The blog post talks about this a bit with respect to Rust, but we don't actually say that. We do make that the default, but we also give you the ability to get at the underlying things as well. There's a lot of interesting work here, actually, like WTF-8...


In the wild WTF-8 and its 16-bit equivalent show up more often than you'd expect. I ended up discovering a case recently where part of the .NET executable file format is actually encoding strings as WTF-16 (not UTF-16) and any internal lowering needs to map them to WTF-8 instead of UTF-8. Until that point I had expected to only ever encounter WTF-8 in web browsers!


> Both Go and Rust ("""systems""" level languages) have decided that "utf-8 ought to be enough for anybody" and that seems to be a good decision.

When working with e.g. filepaths, Rust has an OsStr type.


A go string is just a sequence of bytes, which is usually/by convention utf8. But you can store anything you want in there, if necessary.


I would say that it is just shitty design to not differentiate between bytestrings and regular strings in a way that causes problems. The biggest design flaw here was not forcing people to understand the difference in python2.

