Mercurial’s journey to and reflections on Python 3 (gregoryszorc.com)
411 points by ngoldbaum on Jan 13, 2020 | 367 comments



I've been involved in multiple non-trivial libraries and frameworks that supported both python2 and python3 for many years with the same codebase ... and it really wasn't anything like this. The python3 "adaptation" effort for mercurial was just bungled by multiple terrible decisions.

First was the idea that normal feature contributors should not see any b"" or any sign of python3 support for the first couple years of the effort. Huge mistake. You need some b"".

But you don't need all b"" everywhere. That was the second huge mistake. Don't just convert every natural string in the whole codebase to b"". The natural string type is the right type in many places, both for python2 (bytes-like) and python3 (unicode-like). The helpers for converting kwargs keys to/from bytes are a sign that you are way off track. This guy got really hung up on the fact that the python2 natural string type is bytes-like, and tried to force explicit bytes everywhere (dict keys, http headers, etc), and was really tilting at windmills for most of these past 5 years.
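To make that concrete, here is a rough sketch of the kind of kwargs-key shim I mean (the helper names are made up for illustration, not Mercurial's actual API). It exists only because f(**kwargs) requires str keys on Python 3, so a bytes-keyed options dict has to be converted at every call site:

    import sys

    if sys.version_info[0] >= 3:
        def strkwargs(opts):
            # bytes keys -> str keys, so the dict can be splatted as **kwargs
            return {k.decode('latin-1'): v for k, v in opts.items()}

        def byteskwargs(opts):
            # str keys (from a **kwargs signature) -> bytes keys for internal use
            return {k.encode('latin-1'): v for k, v in opts.items()}
    else:
        def strkwargs(opts):
            return opts

        def byteskwargs(opts):
            return opts

    def _internal(**opts):
        opts = byteskwargs(opts)           # back to bytes keys internally
        return opts[b'rev']

    opts = {b'rev': b'tip'}                # "everything is bytes" style dict
    print(_internal(**strkwargs(opts)))    # conversion needed at every call site

Needing a pair of helpers like this just to pass options around is the smell being described.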

Yes, you pretty much had to wait for python-3.4 to be released and for python-2.6 to be mostly retired in favor of python-2.7. Then, starting in early 2014, it was pretty straightforward to make a clean codebase compatible with python-2.7 and python-3.4+, and I saw it done for Tornado, paramiko, and a few other smaller projects.
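For what it's worth, the single-codebase style that worked for those projects looked roughly like this (an illustrative sketch of the common idioms, not any particular project's code):

    from __future__ import absolute_import, division, print_function, unicode_literals
    import sys

    PY3 = sys.version_info[0] >= 3

    if PY3:
        text_type = str
        binary_type = bytes
    else:
        text_type = unicode        # noqa: F821 (only defined on Python 2)
        binary_type = str

    def to_text(value, encoding='utf-8'):
        """Coerce bytes to text at the boundary; leave text alone."""
        if isinstance(value, binary_type):
            return value.decode(encoding)
        return value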


> The natural string type is the right type in many places

For many programs, yes. Not for a revision control system that needs to be sure it's working with the exact binary data that's stored in the repository. Repository data is bytes, not Unicode.

I think this article is an excellent illustration of the Python developers' failure to properly recognize this use case in the 2 to 3 transition.


I was an early adopter of Mercurial and the team's insistence that file names were byte strings was the cause of lots of bugs when it came to Unicode support.

For example, when I converted our existing Subversion repository to Mercurial I had to rename a couple of files that had non-ASCII characters in their names because Mercurial couldn't handle it. On Windows, at least, the file names would be broken either in Explorer or in the command line.

In fact I just checked and it is STILL broken in Mercurial 4.8.2, which I happened to have installed on my work laptop with Windows. Any file with non-ASCII characters in the name is shown as garbled in the command line interface on Windows.

I remember some mailing list post way back when where mpm said that it was very important that hg was 8-bit clean since a Makefile might contain some random string of bytes that indicated a file and for that Makefile to work the file in question had to have the exact same string of bytes for a name. Of course, if file names are just strings of bytes instead of text, you can't display them, or send them over the internet to a machine with another file name encoding or do hardly anything useful with them. So basic functionality still seems to be broken to support unix systems with non-ascii filenames that aren't in UTF-8.


> the team's insistence that file names were byte strings was the cause of lots of bugs when it came to Unicode support

File names are a different problem because Windows and Unix treat them differently: Unix treats them as bytes and Windows treats them as Unicode. So there is no single data model that will work for any language.


The Rust standard library has a solution for this that actually works: on Unix-like systems file paths are sequences of bytes, and most of the time the bytes are UTF-8. On Windows, they are WTF-8, so the API user sees a sequence of bytes and most of the time they match UTF-8.

This means that there's more overhead on Windows, but it's much better to normalize what the application programmer sees across POSIX and NT while still roundtripping all paths for both than to make the code unit size difference the application programmer's problem like the C++ file system API does.


> On Windows, they are WTF-8

Seems like an apt acronym for Windows... :-)

On a more serious note, Python seems to have done something fairly similar with the pathlib standard library module.
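If I recall correctly, the underlying mechanism is PEP 383's surrogateescape error handler, which os.fsdecode and os.fsencode use so that undecodable filename bytes still round-trip through str. A minimal sketch (assuming a Linux system with a UTF-8 locale):

    import os

    raw = b'caf\xe9.txt'             # latin-1 bytes, not valid UTF-8
    name = os.fsdecode(raw)          # undecodable bytes become lone surrogates
    assert os.fsencode(name) == raw  # ...and round-trip back to the same bytes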


Not to mention case-sensitivity issues. Can you have two files, one named "FILE.txt" and the other "file.txt" in the same directory for instance?


On windows? Of course you can.


I'm certain you can on Linux as well. Only the Mac's old HFS would not allow it.


Isn't this a fairly recent change?


NTFS has always been case sensitive, Windows API just lets you treat it as case insensitive. If you pass `FILE_FLAG_POSIX_SEMANTICS` to `CreateFile` you can make files that differ only in case.
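A rough ctypes sketch of what that looks like (illustrative only; whether two case-differing files actually get created also depends on the kernel's ObCaseInsensitive setting):

    import ctypes

    GENERIC_WRITE = 0x40000000
    CREATE_ALWAYS = 2
    FILE_ATTRIBUTE_NORMAL = 0x80
    FILE_FLAG_POSIX_SEMANTICS = 0x01000000
    INVALID_HANDLE_VALUE = ctypes.c_void_p(-1).value

    kernel32 = ctypes.windll.kernel32
    kernel32.CreateFileW.restype = ctypes.c_void_p

    def create_case_sensitive(name):
        handle = kernel32.CreateFileW(
            name,                 # lpFileName
            GENERIC_WRITE,        # dwDesiredAccess
            0,                    # dwShareMode
            None,                 # lpSecurityAttributes
            CREATE_ALWAYS,        # dwCreationDisposition
            FILE_ATTRIBUTE_NORMAL | FILE_FLAG_POSIX_SEMANTICS,
            None,                 # hTemplateFile
        )
        if handle == INVALID_HANDLE_VALUE:
            raise ctypes.WinError()
        kernel32.CloseHandle(ctypes.c_void_p(handle))

    create_case_sensitive('FILE.txt')
    create_case_sensitive('file.txt')  # a distinct file, if case sensitivity is honored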


Good luck using those in some tools which use the API differently though. Windows filenames are endless fun. What's the maximum length of the absolute path of a file? Why, that depends on which API you're using to access it!


Even worse on Unix where it depends on the mount type. Haven't seen much proper long filename support in Unix apps or libs, it's much better in Windows land. Garbage in garbage out is also a security nightmare as names are not identifiable anymore. You can easily spoof such names.


Hum, any program that doesn't treat filenames as bytestreams on unix is broken. Doubly so if its primary purpose is preserving and archiving files.

Are you sure the issue wasn't something else?


The point is that filenames aren't bytestreams on windows, and if you treat them as such then your program won't work.


By this point, any cross-platform file tool that isn't using Unicode as a lowest-common denominator for filenames and similar things to insure maximal compatibility is likely ready to cause havoc.

(The remarks in the post here that Mercurial on Python 3 on Windows is not yet stable and showing a lot of issues is possibly even an indicator/canary here. To my understanding, Python 2 Windows used to paper over some of these lowest common denominator encoding compatibility issues with a lot more handholding than they do with the Python 3 Unicode assumption.)


> By this point, any cross-platform file tool that isn't using Unicode as a lowest-common denominator for filenames and similar things to insure maximal compatibility is likely ready to cause havoc.

Be that as it may, Mercurial has existing repositories that may use non-unicode filenames, and just crashing whenever you try to operate on them is probably not an acceptable way forward.


Sure, but that's also not the only resulting option; instead of erroring you could also do something nice like help those users migrate to cleaner Unicode encodings of their filenames by asking them to correct mistakes or provide information about the original encoding. It takes more code to do that than just throwing an error, of course, but who knows how many users that might help that don't even realize why their repositories don't work correctly on, say, Windows.


Windows filenames basically are bytestreams. But the bytes come in pairs.


Not really. Certain byte sequences are invalid.


Certain byte sequences are invalid in unix filenames too. So that can't be the factor that decides if they are bytestreams or not.


If hg borked on non-ascii characters, it sounds like the problem was rather that it didn't treat that data as a bag-of-bytes. Not the other way around?


He was trying to use Windows. For Windows, you pretty much have to go through unicode to utf-16, can't be arbitrary bytes, can't be utf8.

(I think that relatively recently it is possible to use utf8 with some new windows interfaces ... but this is probably not widely compatible with older windows releases ...)


Windows uses arbitrary shorts that are sort of supposed to be utf-16. Just like Unix uses arbitrary bytes that are sort of supposed to be utf-8.

You have to convert between them, but neither uses proper Unicode to represent filenames.


Yeah, but utf-16 is still bytes. It's just bytes with a different encoding.

But I do see the pain with Python 3 where the runtime tries to hide these kinds of issues from you. That abstraction can make it difficult to have the right behaviour.


Everything is bytes, but the meaning assigned to the bytes matters. Let's say I create a file named «Файл» on Unix in UTF-8 and put it into a git repo. For Unix it is a sequence of bytes that is the UTF-8 representation of Russian letters. So far so good. Now I clone this repo to Windows. What should happen? The file cannot be restored with the name as encoded into bytes on Unix; that will be garbage (which even has a special name, "Mojibake") in the best case, or fail outright in the worst. What should happen is decoding those bytes from UTF-8 back into the original Unicode code points, then encoding them using Windows' native encoding (UTF-16).
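In code, the round trip being described is just this (illustrative Python):

    stored = 'Файл'.encode('utf-8')          # bytes as they exist in the repo on Unix
    name = stored.decode('utf-8')            # recover the Unicode code points
    windows_name = name.encode('utf-16-le')  # what the Win32 W-APIs ultimately store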


True, but one of those representations still needs to be canonical one in the repo for the purposes of hashing into the commits and so on.

Git builds a bunch of logic like this in around handling line endings in text files.


Everything isn't bytes. Strings without an encoding don't have a specific byte representation.


It's the other way around. Strings always have meanings and always reference the same characters. You use encoding to encode strings into bytes.

Bytes without encoding, don't have any meaning, they are just... random bytes.


We're actually saying the same thing. You're saying without an encoding you can't turn bytes into a string (technically, in Python terminology, that's a decoding, but you know... ;-). I'm saying a string doesn't have a byte representation without an encoding. That's two perspectives on the same truth.

I absolutely agree that a string has meaning without a byte representation. That's the whole point of having it as a distinct type.


UTF-16 is not "just bytes". There are sequences of bytes that are not valid UTF-16, so if you want to roundtrip bytes through UTF-16 you have to do something smarter than just pretending the byte sequence is UTF-16.


Sorry, I wasn't trying to imply that any permutation of bytes would work. If you encode it improperly, it's not going to work.


> For many programs, yes.

For all programs, for the simple reason that:

> Various standard library functionality now wanted unicode str and didn't accept bytes, even though the Python 2 implementation used the equivalent of bytes.

Much of the stdlib works with native strings and will either blow up or misbehave if fed anything else[0], which means much of your codebase will necessarily be native strings, with a subset being explicitly bytes or unicode.

> Repository data is bytes, not Unicode.

It's also mostly absent from the source code, and where it is present (e.g. placeholders or separators) it's easy to flag as explicitly bytes.

[0] though some e.g. the encoding layers or io module want either bytes or unicode depending what you're doing specifically, and not always the most sensible, like baseXY being bytes -> bytes conversions where 95% of the use case is to smuggle binary data through text… oh well


> For all programs, for the simple reason that:

> Various standard library functionality now wanted unicode str and didn't accept bytes, even though the Python 2 implementation used the equivalent of bytes.

This is a problem with the Python 3 standard library; in many places it requires Unicode when it shouldn't.


This is a really bad way of thinking. The distinction in Python 3 is between text (str) and bytes.

str is not Unicode; in fact, if you don't use fancy characters, it internally stores the text as a byte array.

You should think of text the same as of image or sound, what you see in the screen or hear in the speaker is the actual thing, but if you need to save it on disk you encode it as for example png or wav.


You can just read that as "requires text when it shouldn't". But I don't recommend this terminology: in most modern computer programs, including Python 3 implementations, "text" and "Unicode" mean the same thing, but outside of this context Unicode is just more precise: sometimes "text" means ASCII and sometimes it means things non-representable in the current version of Unicode.


> The distinction in Python 3 is between text (str) and bytes.

Feel free to s/Unicode/str/ in what I posted if you prefer that terminology. The problem is still the same.

An example of the problem: Python's standard streams (stdin|out|err) in Python 2 are streams of bytes, but in Python 3 they're streams of Unicode (or str if you prefer that terminology) characters. The problem is twofold: first, if my standard streams are hooked to a console, Python can't always properly detect the encoding of the bytes coming from the console, so it can give me the wrong Unicode characters; second, if my standard streams are hooked to pipes, there is no encoding it can pick that is right, since the bytes aren't even coming from a console (where at least there is some plausible argument for saying the user meant to type Unicode characters, not bytes). What Python 3 should have done was keep the standard streams as bytes, since that's the only common denominator you can rely on, and then let the application decide how to decode them if it decides it needs to, just as in Python 2.


I believe the behavior is correct though. Python uses the encoding specified through LANG/LC_*, which is the encoding that is supposed to be used, and all properly behaved applications use it.

If your application works on binary data, you can use sys.stdin/out/err.buffer to get binary version. Most people will use it for text, so the defaults make sense. Personally I would like if there was no automatic conversion when using files/network/pipes etc. but I guess that would make it more confusing for new users, and would be unnecessary boilerplate for most use cases.
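For example (a minimal sketch of the escape hatch mentioned above):

    import sys

    data = sys.stdin.buffer.read()     # raw bytes, no decoding applied
    sys.stdout.buffer.write(data)      # raw bytes out, e.g. for a pipeline filter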


> Python uses encoding specified through LANG/LC_

Yes, that's the best you can do, but it's still not always correct. I agree that it should be, but "should be" and "is" aren't always the same.

> If your application works on binary data, you can use sys.stdin/out/err.buffer to get binary version.

Yes, but there are still standard library functions that will use the regular streams, and that might conflict with what your application is doing. There is no way to tell Python as a whole "use binary streams everywhere because they are pipes for this application".

> Personally I would like if there was no automatic conversion when using files/network/pipes etc.

That would work if (a) Python could always detect that condition (it can't) and (b) the entire standard library adjusted itself accordingly.

> I guess that would make it more confusing for new users, and would be unnecessary boilerplate for most use cases.

Python 2 worked fine with the standard streams being binary, and applications wrapping them to decode to Unicode when necessary. Python 2.7 even backported the TextIOWrapper and similar classes to make the wrapping as simple as possible. A similar approach could have been taken in Python 3 (binary streams and a simple wrapper class), but it wasn't.
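The wrapping I mean looks roughly like this in Python 3 terms (illustrative sketch; the application, not the runtime, decides the encoding):

    import io
    import sys

    # Keep the stream binary by default; opt into text explicitly where wanted.
    out = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', errors='replace')
    out.write('decoded where the application decided to\n')
    out.flush()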


Complaining that the world is not as it should be does not solve the issue.


Repository data bytes does not show up as string literals in your code, or keyword argument names, or http header names. The vast majority of code involved in this struggle is misc business logic, not repository tracked file contents itself.


Python 3's approach means bytes/str poisons the whole expression. So if you want to do something like:

"%s/%s" % (repository_data_1, repository_data_2)

And have it work on Python 2 and 3, you're screwed.


And Python 3's behavior is more correct—You can't just intermix binary and textual data, they're two different things. Python 2 would let you do that, and it would often cause subtle bugs with non-ASCII data. Python 3 requires you to encode/decode, so you're working consistently and explicitly with binary or text.

I don't quite understand your example. `b'%s/%s' % (b'abc', b'def')` works in both 2 and 3. So does `u'%s/%s' % (b'abc'.decode('utf8'), b'def'.decode('utf8'))`, if you wanted to get a unicode string out of it.


> I don't quite understand your example. `b'%s/%s' % (b'abc', b'def')` works in both 2 and 3. So does `u'%s/%s' % (b'abc'.decode('utf8'), b'def'.decode('utf8'))`, if you wanted to get a unicode string out of it.

We're discussing the linked article, so I'm talking in the context of the linked article. I know it works now, but Python 3 initially removed %-formatting for bytes. I guess I should have used the past tense in my comment: "you were" screwed instead of "you are". From the article:

> Another feature was % formatting of strings. Python 2 allowed use of the % formatting operator on both its string types. But Python 3 initially removed the implementation of % from bytes. Why, I have no clue. It is perfectly reasonable to splice byte sequences into a buffer via use of a formatting string. But the Python language maintainers insisted otherwise. And it wasn't until the community complained about its absence loudly enough that this feature was restored in Python 3.5, which was released in September 2015.
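For concreteness, the operation in question is just this (works on 2.x and on 3.5+, but raised TypeError on 3.0 through 3.4):

    path = b'%s/%s' % (b'dir', b'file')
    assert path == b'dir/file'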


> Python 3's behavior is more correct—You can't just intermix binary and textual data, they're two different things.

Python 3's behavior as far as forcing you to explicitly recognize data type conversions is more correct, yes.

Python 3's behavior in assuming that nobody would ever need to do "text-like" operations like string formatting on byte sequences was not. At least this particular wart was fixed. But there are still a lot of places where Python makes you use the str "textual" data type when it's not the right one.

Python 3's behavior in making individual elements of a byte string integers instead of length-one byte strings is, frankly, braindead.


That example works fine in both Python 2 and 3 if you’re not mixing types incorrectly. If you are, it will appear to work on Python 2 before failing the first time you encounter non-ASCII data, and it tends to greatly confuse people with errors which would have been caught immediately on Python 3. I’ve seen teams waste hours trying to track down errors like that.


Exactly this. The number of times I saw juniors fixing these sorts of obscure, subtle bugs with str_var.decode("utf-8").encode("latin-1"), and that only after attempting every possible combination of the above two de/encode operations, is mind-boggling.


It works after Python 3.5. From the article:

> Another feature was % formatting of strings. Python 2 allowed use of the % formatting operator on both its string types. But Python 3 initially removed the implementation of % from bytes. Why, I have no clue. It is perfectly reasonable to splice byte sequences into a buffer via use of a formatting string. But the Python language maintainers insisted otherwise. And it wasn't until the community complained about its absence loudly enough that this feature was restored in Python 3.5, which was released in September 2015.


The rule of thumb (not just for Python, but for anything that deals with encoding) is to deal with the binary encoding at the boundaries of your program (reading/writing files, sending/receiving data over the network, etc.); it applies to everything, including tools like this. If you follow it, your life will be simpler.

You just need to be aware that in some cases the work is already done for you by the language: for example, in Python, if you open a file without the "b" option, Python will do the translation on the fly and you don't need to worry about it.


> You just need to be aware that in some cases the work is already done for you by the language: for example, in Python, if you open a file without the "b" option, Python will do the translation on the fly and you don't need to worry about it

Sadly they fucked up that part rather thoroughly, because the default encoding is `locale.getpreferredencoding()`, which ensures it's going to be wrong at the least convenient possible time and on the devices least accessible for debugging.

Do not ever use text-mode `open` without specifying an encoding.
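That is, always do this (the filename is illustrative):

    # Pin the encoding explicitly instead of inheriting whatever the locale says.
    with open('data.json', 'r', encoding='utf-8') as f:
        text = f.read()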


Node.js tries to be helpful in defaulting file writes to UTF-8, but defaults file reads to returning a raw byte buffer [0]. So you have to either remember to treat the two operations differently, or, like in Python, manually specify the encoding for both.

[0] I seem to recall that it used to default to the locale's preferred encoding, but I could have my wires crossed with other languages' standard libraries there.


The locales are provided by LANG and other locale variables, so Python will use whatever is set in the environment; you can also specify the encoding in one of open()'s parameters.


> The locales are provided by LANG and other locale variables

Which is absolutely not what you want when, say, opening your own data files. Even when opening the user’s files it’s likely not what you want.

> you can also specify …

And what I’m saying is this is not a “can also” it’s a “must”. Not doing so will bite you in the ass, because “whatever random garbage is on the machine” is really not what you want a default to be.


Oh I see your point. Looks like they changed the behavior in 3.7 (they added -X UTF-8 option), but being able to set it from the application would be great.


> in Python, if you open a file without the "b" option, Python will do the translation on the fly and you don't need to worry about it

Of course, if you don't know what encoding the file was opened with, you don't know what characters can be written to the file.

I was bitten by this with Python 3.5 on Windows. I naively assumed the default file encoding would be UTF-8 or UTF-16, but it was actually CP-1252, so my program would crash upon trying to write a non-ASCII character.
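The minimal repro is basically this (assuming a Western Windows install where the locale default is cp1252):

    # encoding silently defaults to locale.getpreferredencoding(), i.e. cp1252 here
    with open('out.txt', 'w') as f:
        f.write('snowman \u2603')    # UnicodeEncodeError: '\u2603' not in cp1252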


Every Python program should be tested with Emoji characters, they're a real torture test.


Note that you need to test on every platform, as the default file encoding may vary. I missed that bug in part because it worked correctly on Linux.


Good point. I do almost all of my Python on Windows where it's much easier to get an error.


Every program in general should be tested with Emoji characters at this point.


Not a bad idea, but I think Python is more likely to have hidden bugs that this will uncover. A language that accepts bytes as input and emits the same on output will probably work fine on UTF-8 for example.


That's the Python 2 mentality, and a large part of this discussion was that it didn't work in hindsight, that you can't just be "encoding oblivious", but it usually doesn't show up as an obvious problem until you least expect it. Our input and output devices aren't always homogeneous in their byte encodings (and quite possibly very rarely are; we have decades and decades of kludges around this), and testing every program with Emoji has become one of my favorite pastimes for finding failure cases.


It defaults to the system encoding. I don't use Python on Windows, but Windows evolved its default encoding over time, the code pages were popular in Windows 9x, starting with NT based (2000, XP...) They used UTF-16 I believe and then Windows 7? It became UTF-8. Perhaps Python needs to be updated to reflect that?

You can also specify encoding when calling open.


> Windows evolved its default encoding over time, the code pages were popular in Windows 9x, starting with NT based (2000, XP...) They used UTF-16 I believe and then Windows 7? It became UTF-8.

They bolted on a separate set of functions that took UCS-2 and now take UTF-16.

The actual code pages, to this day, are legacy things that are mostly 8 bits. My system is set to code pages 437 and 1252, for example.

They put together a code page for UTF-8 but it's behind a 'beta' warning.


> They bolted on a separate set of functions that took UCS-2 and now take UTF-16.

NT actually bolted on 8-bit versions of the native Unicode functions. FooBarA is a wrapper around FooBarW.

> They put together a code page for UTF-8 but it's behind a 'beta' warning.

Codepage 65001 has been a thing for quite a while. It's just that it's variable-width per character, and few applications are ready to handle that when they assume a 1:1 or 2:1 relationship between bytes and characters. It does sort of work for applications that don't do anything too weird with text, though, and can be a useful workaround in such cases to get UTF-8 support into legacy applications.

But in general, Windows is UTF-16LE and the code pages are indeed legacy cruft that no application should touch or even use. Sadly much software ported from Unix-likes notices »Hey, there's a default encoding in Windows too, so let's just use that«.


The default file encoding for Windows was changed to UTF-8 in Python 3.6. That particular problem on that particular platform is now a thing of the past.

It was just an example of why implicit conversions in the standard library functions don't save you from having to think about encodings. You get much more robust and user-friendly programs when you explicitly consider your encodings and the error-handling strategies to go with them.


To be fair... the problem was more in Python 2 where this stuff was often conflated. Python 3 really just brought the problem in to stark relief.

TBH I do think the problem is easier to address in a statically typed world.


> I think this article is an excellent illustration of the Python developers' failure to properly recognize this use case in the 2 to 3 transition.

The entire 2 to 3 transition is an excellent illustration of Python developers failing to properly recognize the challenges of the transition. What other popular language intentionally broke backwards compatibility? It's hard to think of any.

Python set the entire community back 10 years or more by making this drastic mistake.


It might be my own pro-typed-language bias showing but this migration from byte strings to unicode strings is really where dynamically typed languages really don't shine.

If we imagine an alternative reality where Rust started only with byte-strings and added unicode as an afterthought like Python did, you'd definitely face a massive amount of churn but at least the compiler would yell at you every time you pass a byte string where unicode is expected and vice-versa. Once you'll have fixed all of the errors in the vast majority of cases there's a good chance that your program would work again. It would be very annoying but at least you know clearly where the problems occur.

In Python on the other hand this type of code refactoring is very painful in my experience. You may end up with the same function being called sometimes with unicode and sometimes with bytes. And then you have to look at the call stack to figure out where it comes from. And then you realize that you end up with, say, a list of records which sometimes contain unicode and sometimes byte arrays depending on whether the code that updated them used the old or the new version etc...

And if it turns out that you can't easily reproduce the problem and you just get a bug report sent from somewhere in production then Good Luck; Have Fun.
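A tiny illustration of the failure mode (function and field names are made up): the type error surfaces far from where the bytes crept in, and nothing complains at the boundary.

    def load_record(raw):
        return {'author': raw}              # oops: still bytes, nobody notices

    def render(record):
        return 'by ' + record['author']     # TypeError: can only concatenate str

    records = [load_record(b'alice'), load_record('bob')]
    for r in records:
        print(render(r))                    # blows up only for the bytes record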


> added unicode as an afterthought like Python did

I agree with you on the benefits of static typing, but let's be clear: Python didn't add unicode as an "afterthought". The initial release of Python predates the initial release of the Unicode standard by almost a year.

Furthermore, even if this were not the case, it took a while before Unicode got any significant adoption among programming languages, well after the release of Python 1.0. I think Java in 1996 was the first language to adopt Unicode.


Another useful red letter date for language/tool adoption is the standardization of UTF-8 in 1993. Before UTF-8 there were a lot of tools, especially in the POSIX world, that didn't feel comfortable without an 8-bit safe encoding format.

Python 2 was after UTF-8 in 2000, so with hindsight could have had the foresight to pull this bandaid off then (before a large influx of users), but a corresponding complaint about UTF-8 is that because it was 8-bit safe, a lot of tools also felt they could kick the can on dealing with it more directly (as a default), and Python 2 seems to be among them. Hindsight has told us a lot about the problems to expect (and exactly why Python 3 did what it felt it had to do), but they probably weren't as clear in 2000. (In further hindsight, imagine if Astral Plane Emoji had been standard and been common around 2000 instead of 2010 how much further we might be in consistent Unicode implementation today. I suppose that makes 2010 another red letter date for Unicode adoption.)


And it was much later than 1993 that unicode conclusively defeated latin-1. Something like 2010?


> Python 2 was after UTF-8 in 2000, so with hindsight could have had the foresight to pull this bandaid off then

That's true, but I would argue that given the difficulty and backlash we've seen moving from Python 2 to Python 3, such a move would have risked destroying Python's rapid forward momentum and condemned it to the ash heap of programming language history.


To add on to this, I'm not agreeing with the backlash from Python 2 to 3. And I wouldn't want it in the ash heap of history - I definitely think there's a place for nice, quick, easy dynamic langs like Python, particularly for exploratory programming.

I'm just saying the move to Python 3 turned out to be a huge deal to a lot of people (it surprised me), and for that reason, trying such a big jump at Python 2 would have been risky and could have derailed Python's forward progress at a critical point.

Would the downvoters like to share their reasons for disagreement?


I think the question goes back to the size and scale of users at the 1 to 2 jump versus the 2 to 3 jump. Python didn't really hit most of its "forward progress", in terms of both user adoption and being so deeply integrated into systems, until the Python 2 era. There was no Django for Python 1, for one example. As another example, I'm pretty sure Debian and its heavy reliance on Python for so much of its system scripting didn't happen until Python 2, either, but a quick search didn't turn up a reliable date.

It probably would have been a lot less risky with so many fewer daily users, so many fewer huge projects to migrate.


You may be right. I first used Python on a regular basis in 2002 (after release of Python 2), so I wasn't aware it had so little adoption prior to Python 2. But it definitely was picking up by 2002.


> First was the idea that normal feature contributors should not see any b"" or any sign of python3 support for the first couple years of the effort. Huge mistake. You need some b"".

When I read that, I was angry on behalf of the people doing the porting work who had their hands tied by it, and I was angry on behalf of the Mercurial developers who, I think, must have been underestimated. It's normal that platforms don't stand still and coding standards on a project evolve over time. Obviously it's not going to fly for open source contributors to be "voluntold" to do porting work, but to be aware of it and accommodate it and know enough about the new platform to mostly avoid creating new work for the porters seems like a small and reasonable ask, especially when you compare it to the effort required to make high-quality contributions in the first place.

I get that there are people who are bitter to this day about Python having a version 3, but surely by 2017 the vast, vast majority of developers who were going to rage quit the Python community over it were already gone.


Yes, I was really surprised that they avoided upgrading to Python 2.7-level best practices and future statements for as long as they did and tried to hide it from most developers thru custom compatibility layers. Huh? That's step 0, getting except, stdlib imports, and print statements up to date. Folks can deal with that, that's the easy part.

Keeping blame details (and line-lengths, ha!) was given as the excuse and that is a nice feature and all. However they could have copied the repo over before porting to keep that information and saved time. Wouldn't be surprised if it was eventually lost anyway.


The late start was mostly due to having to retain Python 2.4/2.5 compatibility until May 2015 and it was literally impossible to use some future statements or some Python 3 syntax until 2.6 was required. I have updated the post to reflect this.


IC, that’s unfortunate. Believe that is the time to cut a legacy branch/release rather than block progress for a decade.


Interesting you mention http headers. I had a program converted from Python 2 to Python 3 which was crashing occasionally, and it turned out it was because I was being sent an http request which wasn't valid unicode, so decoding failed.

I had to switch back to treating headers as bytes for as long as possible.

It is a stupid client which doesn't send valid ascii for http headers of course.


I believe the headers are encoded using ISO-8859-1, not Unicode. That encoding has a 1:1 mapping with bytes, so it wouldn't break this way. Treating them as UTF-8 was the bug.


This is exactly the sort of encoding issues that the python 2 to 3 transition has flushed out. People get frustrated with python 3, yet the actual failure was their mishandling of encoding issues -- papered over by python 2.


But that's not what frustrates people with the transition. It's that they suddenly get encoding issues where there should have been no encoding to begin with!


No observed encoding issues.


When I treated headers as bytes, there wasn't an "encoding".

What I often want to do when reading user data is not treat it as a "encoded string", but just as a stream of bytes. Most data I work with (HTML files, output of other programs) can't be treated as anything but bytes, because people put junk in files / output of programs.


> When I treated headers as bytes, there wasn't an "encoding".

If you are representing strings as bytes, you are intrinsically using an encoding.

> What I often want to do when reading user data is not treat it as a "encoded string", but just as a stream of bytes. Most data I work with (HTML files, output of other programs) can't be treated as anything but bytes, because people put junk in files / output of programs.

Yes, it makes a mockery of the notion that "human readable data is easy". In many cases, you don't want to work with the actual strings in the data anyway, so bytes is the right thing to do.

But yes, this strategy largely avoids encoding issues... until it doesn't.


> If you are representing strings as bytes, you are intrinsically using an encoding.

It's just binary data that might resemble a string. No encoding necessary.


This is false more often than not. Many programs taking user input will treat it as a string, assuming a specific encoding or compatibility with screen output/some API, at least in some code paths. For example, if you print an error message when you can't open some file, you are very likely to assume it's encoded in a way the terminal can handle, so it's no longer "just binary data".


Yes, I have to worry about how to make a "best effort" to show it to users, but in all internal code paths it must stay as "just binary data", else I lose information. This is exactly how chrome and Firefox handle headers internally.


It might resemble a particular encoding of a string... and the way you got that string to that particular sequence of bytes is by... encoding it.


> and the way you got that string to that particular sequence of bytes

No I didn't. Those bytes came from an external source. My primary job is to preserve the exact sequence, whether I can make sense of it or not.


In that context, you aren't using strings. You are using bytes. HTML without interpreting it as strings isn't really HTML, nor is it a string. It's just a blob that is passing through.


> When I treated headers as bytes, there wasn't an "encoding".

oh, actually there was (either US-ASCII or, more likely, ISO-8859-1). The bytes are just values 0-255; what those values mean is the encoding. You're confused because the encoding was implicit, rather than explicit.

It would perhaps be clearer to see it if you, for example, had to choose between ASCII and the legacy EBCDIC encoding.


I'll admit, I'm not positive what the encoding should be. However, there are a bunch of people who clearly do send UTF-8, and I can also promise you there are headers out there which just have binary nonsense in them. See for example https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/Web...

If you want to handle all headers, you have to be prepared to just get binary data.


Yes, and using ISO-8859-1 is the way to handle them without issues. You will never get error when decoding it that way. If you are using UTF-8 there are character combinations that are invalid.
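Illustratively: latin-1 maps every byte value to a code point, so decoding can never fail and round-trips losslessly, whereas UTF-8 rejects malformed sequences.

    header = b'X-Junk: \xff\xfe binary noise'
    header.decode('iso-8859-1')          # always succeeds, bytes round-trip exactly
    try:
        header.decode('utf-8')
    except UnicodeDecodeError:
        pass                             # \xff can never begin a valid UTF-8 sequence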


> It is a stupid client which doesn't send valid ascii for http headers of course.

...or a smart malicious actor.


> But you don't need all b"" everywhere.

as a mercurial user i never understood this decision. for instance look at this recent commit: https://www.mercurial-scm.org/repo/hg/rev/b4c82b704180

would anyone disagree with the fact that an error message should be a string?

a source transformer to add b'' all over the place? really?

and i still don't understand why the hg transition had to be more complex than: https://docs.djangoproject.com/en/1.11/topics/python3/

... and of course now this: https://www.mercurial-scm.org/wiki/OxidationPlan

i wonder what does matt mackall think of all these developments?


Why are you so certain about your assertions here about when they did and did not need to use explicit byte strings?


I understand the author's reasoning in the context of a transition, but as a "non-Latin" language user, defaulting str to unicode literals is the best change in Python 3. Coming from C#, I never got used to Python 2's approach. It's a pain in the ass working with non-Latin characters in Py2, starting from simply outputting to the console, especially on Windows.

>assuming the world is Unicode is flat out wrong

True, but Py2's approach makes lots of developers assume the world is Latin-1. I see way too many examples of things broken on a Chinese locale environment, including Python's official IDLE ([1]).

[1] https://bugs.python.org/issue15809 (Summary of this bug: in 2.x IDLE, an explicit unicode literal used to still be encoded using system's ANSI encoding instead of, well, unicode.)


The most amusing quote in the entire article is this (emphasis mine):

> This ground rule meant that a mass insertion of b'' prefixes everywhere was not desirable, as that would require developers to think about whether a type was a bytes or str, a distinction they didn't have to worry about on Python 2 because we practically never used the Unicode-based string type in Mercurial.

Requiring developers to think which one it should be is, of course, the whole point of the changes in Python 3 - and it's what produces better apps that are more aware of i18n issues in general and Unicode in particular.

And the complaint doesn't even make sense if taken at face value - if all strings in Mercurial are byte strings, then what is there to think about? just use b'' throughout, no need to worry about anything else. Of course, the devil is in the details, which is reflected by the word "practically" in that sentence - this kinda implies that there are places where Unicode strings are used. At which point you do want the developers to think about bytes vs Unicode.

So the real complaint is that Python switched the defaults in a way that made bytes-centric code more complicated - because it has to be explicit now, instead of the Python 2 world, where bytes was the default, and Unicode had to be requested explicitly. Which, of course, is the right change for the vast majority of code out there, that operates on higher level of abstraction, where "all strings are Unicode by default" is a perfectly reasonable assumption to force.


> And the complaint doesn't even make sense if taken at face value - if all strings in Mercurial are byte strings, then what is there to think about? just use b'' throughout, no need to worry about anything else.

The article directly answers that question. Many, many things in the standard library now only accept unicode strings, not byte strings. So a wholesale change to b'' everywhere breaks lots of stuff.

> So the real complaint is that Python switched the defaults in a way that made bytes-centric code more complicated - because it has to be explicit now, instead of the Python 2 world, where bytes was the default, and Unicode had to be requested explicitly.

Once again, the article directly states that the default is not the problem. The lack of escape hatches is. Paths are not unicode strings, and pretending they are does not work. Using bytes when you need bytes works only until you need to call a library function that only accepts strings.


Paths ARE Unicode strings on 99% of the computers with humans sitting in front of them. NTFS, HFS+, and APFS all use Unicode but more importantly, the experience of not using valid Unicode where that’s possible is horrible: undeletable files, crashes, etc. I’ve seen that many times over the years (it was popular with malware authors) but never a time where this was a desirable behavior.

The default should always be Unicode with only people writing low-level backup and security tools dealing with bytes.


This just isn't true. In Windows, paths are UCS-2, i.e. arbitrary sequences of Unicode code units, including unpaired surrogates. This means that there are paths that will work on Windows but cannot be encoded as, e.g., valid UTF-8. As a result Rust has a bespoke encoding just for representing Windows paths in a way that's compatible with UTF-8 ("WTF-8"). It also means that you can't make a guaranteed lossless conversion from a filesystem path to a Rust string; you have to handle the possibility of errors.

On Mac paths are some weird NFKD-ish thing, so equality comparisons are complicated.

As a rule, if you think that filesystem paths are easy then you're probably ignoring all the edge cases. In applications where you don't deal with arbitrary user files that's fine. In a programming language that's a huge design error.


This all - including complicated equality comparisons - is why paths should have their own dedicated type, and not just be raw strings. Thankfully, Python has had pathlib for a while now.


Paths are Unicode strings on Windows. Yes, POSIX adds a lot more spice to the mix, but if the intent is a cross-platform tool, then Unicode is a reasonable lowest-common-denominator assumption for filenames in 2020.


Paths are Unicode strings everywhere but Unix/Linux. And I would even argue that this is a broken aspect of POSIX today. We should make Unicode the baseline for paths in POSIX-compliant systems, but there's probably too much hand-wringing for that to ever happen.


Paths are sequences of 16-bit values on Windows, not necessarily valid UTF-16. It's basically the same as in POSIX, just one byte wider per character.


> if all strings in Mercurial are byte strings, then what is there to think about? just use b'' throughout, no need to worry about anything else.

The author explains later in the article that many system level python 3 apis that are important to a vcs require unicode and won't accept bytes. So apparently it wasn't as easy as just sticking 'b' in front of every literal.


Right. But that's a very different issue, and it's not at all about string literals as such.

Furthermore, the way they solve it - by using their own wrapper helpers that allow bytes - means that the end result should be b'' throughout, no?


>> So the real complaint is that Python switched the defaults in a way that made bytes-centric code more complicated

The author made it clear. The issue wasn't just that the default changed. It was that 3.0 took away the ability to always make your choice explicit.

Changing the default would have no effect on code that was always explicit. Going over the code and making all implicit strings explicit would allow them to know when they had full coverage, and also make the code work with both 2 and 3.

With 3, any implicit had to get b added, while any string with u had to be made implicit (drop the u). You couldn't tell by looking at code if it was converted or not. At least that's how I read it.


The lack of u'' in early versions of Python 3 is a valid complaint, but it's a separate one.

It's also not that big of a deal in practice, because you could always write a helper function like u('foo') that would call unicode() on Python 2, and just pass the value through on Python 3. This only breaks when you need a Unicode literal with actual Unicode characters inside, which is a rare case - and should be especially rare in something like Mercurial.
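Something like this (a sketch of the helper pattern, not Mercurial's actual code):

    import sys

    if sys.version_info[0] >= 3:
        def u(s):
            return s                        # already a unicode str on Python 3
    else:
        def u(s):
            return unicode(s, 'ascii')      # noqa: F821 - promote the native literal

    message = u("abort: no changes found")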


Another reason the complaint doesn't make sense is that the author then praises Rust which is more similar to Python 3 than 2.


From other comments, the annoyances for the author were about the standard library using Unicode for system-level APIs; Rust has an OsString type that works with the GIGO model of POSIX.


> but as a "non-Latin" language user, defaulting str to unicode literals is the best change in Python 3

I'm also a "non-latin" user and I will keep repeating this point ad nauseam: there would have been many strictly superious solutions to solving this problem and most of them would have been closer to what we had in Python 2 than 3.

Both rust and go decided to go with Unicode support that is largely based around utf-8 with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3 but that path was dismissed without a lot of evidence.

A Unicode model that was a bad idea in 2005 was picked and we now have it in 2020 where it's a lot worse because thanks to emojis we now are well outside the basic plane.


> Both rust and go decided to go with Unicode support that is largely based around utf-8 with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3 but that path was dismissed without a lot of evidence.

Both of those are newer languages that happen to take a stance from the day 1. So not quite comparable.

That said, UTF-8 is one of the best pragmatic solutions to this Unicode problem. Most engineers I meet who throw their hands up in the air complaining about Unicode haven't read the simple Wikipedia page for utf-8.

Python 2 was already halfway there; they just had to tweak a few places where bytes are converted to strings. Of course this is easier for newer languages to solve. We can't blame Python for having to provide backward compatibility.

PS: I also blame all the "encoding detection" libraries which exist to try to solve an unsolvable problem. Nobody can detect an encoding, at least not reliably. If these half-assed libraries did not exist, people would have finally settled on UTF-8 and given up on others by now.


> Both of those are newer languages that happen to take a stance from the day 1. So not quite comparable.

Python 3 predates Rust and Go and I can tell you from personal interactions with people how much opposition there was against UTF-8 as either default or internal encoding. A lot of the arguments against it were already not valid then and they definitely are not today.

Python 3 launched despite a lot of vocal opposition against it. I think many do not even remember how badly broken the URL, HTTP and Email modules were when they were first ported to Python 3. There was a complete misunderstanding of how platform abstractions should look like.

All of this was known back then.


Is there any hope of "fixing" it now without going through another massive migration struggle (which will simply not happen)?


No one is complaining that Python 2 didn't DTRT when it comes to Unicode.

But when Python 3 made its decision, it was known to be the wrong thing. People who had done Unicode in other languages told them it was the wrong thing. People who had taken the effort to do Unicode right in Python 2 told them it was the wrong thing. The only people telling them they were doing the right thing were Python 2 programmers who thought they were going to get Unicode support for free without thinking about it (or worse, who had done horribly wrong things in Python 2 - the mess PyGTK wrote itself into, for example).

Python 3 has no excuses for what are now often unusable APIs when you truly do need to process binary data. And all we gained is that we don't need to type "u" before some string constants anymore. It wasn't worth it, and it's still not good.


> Both rust and go decided to go with Unicode support that is largely based around utf-8 with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3 but that path was dismissed without a lot of evidence.

What do you mean by "free"? Rust requires you to explicitly convert a string to bytes or vice versa, no? Which is pretty much what you do in Python - the only difference I can see is that you have shortcut methods to encode/decode using UTF-8, but semantically they're no different from encode/decode in Python.

I'm pretty dubious about specifying that the internal representation must be UTF-8. That's a failure of abstraction (because the program shouldn't know or care what the internal representation is), leads to inherent performance/interop problems on several compile targets (Windows, the JVM, Javascript), and seems to imply that Han unification is forced at the language level.


str -> [u8] is free from a performance perspective. It is internally equivalent to a type cast.

[u8] -> str requires a UTF-8 validity check, but is otherwise also internally equivalent to a type cast (i.e., no allocations). I assume this is what Armin meant by "almost" free.

FWIW, I do think that "internally and externally UTF-8" is the best approach to take. If Rust's string type used, say, a sequence of 32-bit codepoints instead, then lots of lower level string handling implementations would be quite a bit slower than their UTF-8 counterparts. (For at least a few reasons that I can think of.) UTF-8 also happens to be quite practical from a performance perspective because it lets you reuse highly optimized routines like memchr in lots of places.

In any case, the abstraction is not lost completely. Rust's string types provide higher level methods without needing to know the encoding used by strings. e.g., You can iterate over codepoints, search, split and so on. The abstraction is intentionally leaky. e.g., You are permitted to take a substring using byte offsets, and if those byte offsets fall in the middle of a UTF-8 encoded codepoint, then you get a panic (or None, depending on the API you use). You are indeed prevented from doing some things, e.g., indexing a string at a particular position because it doesn't make a lot of semantic sense for a Unicode string.

You might call this a "failure" because it leaks its internal representation, but to me, it's actually a resounding success. Refraining from leaking its internal representation in a way that is zero cost would be absolutely disastrous from a performance perspective when implementing things like regex engines or other lower level text primitives.


With an opaque string type there's nothing stopping a particular Python implementation from using UTF-8 as an internal representation - it would likely perform worse than CPython at iterating over the code units of a string, but that's likely an acceptable cost. Particularly for a language like Python, defining the precise performance characteristics is rarely the priority, especially if it comes at the cost of confusing the semantics.

> In any case, the abstraction is not lost completely. Rust's string types provide higher level methods without needing to know the encoding used by strings. e.g., You can iterate over codepoints, search, split and so on. The abstraction is intentionally leaky. e.g., You are permitted to take a substring using byte offsets, and if those byte offsets fall in the middle of a UTF-8 encoded codepoint, then you get a panic (or None, depending on the API you use). You are indeed prevented from doing some things, e.g., indexing a string at a particular position because it doesn't make a lot of semantic sense for a Unicode string.

> You might call this a "failure" because it leaks its internal representation, but to me, it's actually a resounding success. Refraining from leaking its internal representation in a way that is zero cost would be absolutely disastrous from a performance perspective when implementing things like regex engines or other lower level text primitives.

I'd argue that offering APIs that can panic is a poor tradeoff in a default/general-use/beginner-facing type. There's maybe a place for a type that implements the same traits as strings while also offering unsafe things like indexing by byte offset (if it's really impossible to achieve what's needed in a safe way, which I'm dubious about), but it's a niche one for specialist use cases (even if it might be the same underlying implementation as the "safe" string type).


I feel like you picked at the least interesting aspects of my comment. It continues to be frustrating to talk to you. :-(

And yes, you can index by byte offset in a zero cost way by converting the string to a byte slice first.

Have you used Rust strings (or any similarly designed string abstraction) in anger before? It might help to get some boots-on-the-ground experience with it.


> Both rust and go decided to go with Unicode support that is largely based around utf-8 with free transmutations to bytes and almost free conversions from bytes. This could have been Python 3 but that path was dismissed without a lot of evidence.

Do you mean that if you have bytes, but you want to send them to a function that expects a string, then it would automatically interpret the bytes as UTF-8?

If so, that violates the "Explicit is better than implicit" part of the Zen of Python. Encoding/Decoding bytes to/from strings shouldn't happen automatically because doing so means you have to make an assumption about the encoding.


> Do you mean that if you have bytes, but you want to send them to a function that expects a string, then it would automatically interpret the bytes as UTF-8?

No, the types are separate and not implicitly converted P2-style, however "unicode strings" are guaranteed to be proper UTF8 so encoding to UTF-8 is completely free, and decoding from UTF8 just requires validating.

Python's maintainers rejected this approach because "it doesn't provide non-amortised O(1) access to codepoints", and while Python 3 broke a lot of things they sadly refused to break this one completely useless thing, only to have to come up with PEP 393 a few years later.


Ah, that makes sense. Thank you for the clarification.


To add to your earlier dialog partner, here are the doc pages for the relevant Rust functions/methods, embedded with runnable examples:

https://doc.rust-lang.org/std/string/struct.String.html#meth...

https://doc.rust-lang.org/std/string/struct.String.html#meth...

https://doc.rust-lang.org/std/primitive.str.html#method.as_b...

https://doc.rust-lang.org/std/str/fn.from_utf8.html

Also, as explained in those docs, if and when you are absolutely sure that the Vec or slice of bytes is valid UTF-8, you could use the following "unsafe" methods to not incur the overhead of validation (warnings in the docs):

https://doc.rust-lang.org/std/string/struct.String.html#meth...

https://doc.rust-lang.org/std/str/fn.from_utf8_unchecked.htm...


IMO Python is doing exactly the same thing that Go does (I know too little about Rust to comment) the only difference is that Python respects the LANG variable while Go is just fixed on using UTF-8.


> Python is doing exactly the same thing that Go does

It doesn't. Go's internal string encoding is UTF-8 and it can even be malformed. Go in fact does pretty much what Python 2's byte strings did, except that string operations such as converting to uppercase or iterating over runes understand UTF-8 and Unicode.


Here's your problem: you should not care how python is representing it internally.

> Go in fact does pretty much what Python 2's byte strings did, except that string operations such as converting to uppercase or iterating over runes understand UTF-8 and Unicode.

Why do you care about the internal representation, though? What are you gaining if both Go's string and Python's str can express all characters? In Go you still need to convert string to []byte when doing I/O.


Python 2's approach was bad, no argument, but the transition plan for 2-to-3 just didn't work. They thought everyone would run 2to3 in a big bang, and then we'd all switch over to 3 in a few years. Instead it dragged out over a decade because in reality we needed to write code that was compatible with both 2 and 3 (the "6" approach) until enough things were on 3 to drop 2 support.

Hindsight is 20/20 naturally, but in retrospect, they should have just made `bytes` into the name for old `str` and used `from __future__ import` to create a gradual system for moving from 2 to 3 instead of a big bang "we'll break everything once and then never again".
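For reference, the gradual mechanism that did exist looked roughly like this; a minimal single-source sketch, not a complete migration recipe:

    # Runs unchanged on Python 2.7 and Python 3.x
    from __future__ import absolute_import, division, print_function, unicode_literals

    print('string literals are unicode on both interpreters now')
    print(1 / 2)  # 0.5 on both, instead of truncating to 0 on Python 2

Each of those __future__ features could be adopted file by file, which is exactly the kind of gradualism being asked for here.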


I'm not sure they really thought 2to3 would be used for a big bang. I seem to recall the general initial messaging was that Python 3 was a new language and you would need to do a language port to get to it.


> I understand author's reasoning in the context of a transition, but as a "non-Latin" language user, defaulting str to unicode literals is the best change in Python 3

I think this is misreading the author's criticism. The fact that string literals are now Unicode is not the fundamental problem; the fact that standard library APIs that formerly took bytes now incorrectly take Unicode strings is the problem.

IMO it's great that the world is moving towards opaque blobs of Unicode for strings, but that requires understanding when something shouldn't simply be a string in the first place (for reasons of legacy or otherwise).


My comment is about this sentence:

>Perhaps my least favorite feature of Python 3 is its insistence that the world is Unicode

>standard library APIs that formerly took bytes now incorrectly take Unicode strings

What do you mean by "incorrectly"?


POSIX APIs take bytes, generally. Python wraps these APIs to take unicode and doesn't allow you to pass bytes, even if you need to. Filenames, for example, are just bytes, and if you force them to always be valid unicode you will make it so that you can't interact with files that have names that aren't valid unicode. That's just one example.



An extremely frustrating part of the Python 3 migration is how many times Python module maintainers have had to hear "oh, now it's safe to migrate." This page currently leads off with a comment saying it's been fine any time since 3.4. You say 3.6. When I was maintaining a popular Python module, I heard the same at 3.1, and 3.2. (I didn't maintain it long after that.)


There are very few places where the bytes/string difference matters for posix paths. Python is far from the only popular tool to assume paths must be valid unicode.


> There are very few places where the bytes/string difference matters for posix paths.

It's nothing to do with "places", points in your program, or entry points into the stdlib. It's entirely about what path names you need to process, and for large classes of software you have zero control over that. If you have a path that doesn't encode properly with your LC_CTYPE, you're in for a bad time with Python 3. (Of course you won't if you control all your own path names, but then you also don't have a problem assuming and enforcing ASCII.)

People were still migrating home systems to Unicode-compatible encodings long after Py3 came out. I still find files in archives with paths in weird (and undeclared/undeclarable) encodings. Lots of people had such files; non-native English speakers were the most likely to have them.
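To make the failure mode concrete, here's a minimal sketch (it assumes a POSIX filesystem and a UTF-8 locale; the filename is hypothetical):

    import os

    # A file whose name is not valid UTF-8, e.g. created long ago under a
    # latin-1 locale:
    os.mkdir(b'demo')
    open(b'demo/caf\xe9.txt', 'w').close()

    name = os.listdir('demo')[0]  # 'caf\udce9.txt' -- surrogateescape smuggles the raw byte in
    name.encode('utf-8')          # UnicodeEncodeError: surrogates not allowed

The surrogateescape trick keeps the name round-trippable through os.fsencode()/os.fsdecode(), but the moment such a string hits a strict UTF-8 encode somewhere else (a socket write, for instance), it blows up far from where the path entered the program.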

> Python is far from the only popular tool to assume paths must be valid unicode.

It and Java are the only ones I use regularly. Java doesn't have a good reputation for playing well with the outside world, vs. Python which had been sold for years as "better shell scripts."


> There are very few places where the bytes/string difference matters for posix paths.

There’s only every single input from the system at large, no big deal.


I don't quite agree. There's lots of systems where it's always unicode, and a lot of systems where it's always ASCII, and then some systems where stuff is weird (and should be unicode :x)


There was a different API to get this behavior since 3.4: https://www.python.org/dev/peps/pep-0428/#id39


Which means it's been true (and broken) for many many years until maintainers finally succumbed to external pressure and unbroke the API.


Just beware that C# is not exactly "Unicode" either.

C# char is a UTF-16 code unit, not a Unicode code point.

Most code points "fit" into just one UTF-16 code unit, but not all.

For example: 𝐀 ("Mathematical Bold Capital A", code point U+1D400) is encoded in UTF-16 as a surrogate pair of code units: U+D835 and U+DC00. So reversing "x𝐀y" should produce "y𝐀x" ("y\ud835\udc00x") - note how U+D835 and U+DC00 were not reversed in the result.


C# isn't exactly quiet about this property, and yes, it can be annoying from an API perspective, but in C# this was likely a pragmatic choice to remain compatible (and familiar) with C++, COM, etc. where most developers would be coming from.

API members that operate on code points universally take a string and an index.

That being said, treating strings as arrays of characters is fraught with peril in most cases anyway. You can't trivially reverse strings in any encoding, as you need to reverse the sequence of grapheme clusters (to account for diacritics, etc.). You can't trivially truncate strings either, for pretty much the same reason. You can't trivially grab a single character from the middle of a string, again, for the same reason. So basically, indexing, reversing, truncating, copying a subsequence, etc. are all not trivially possible regardless of the encoding. UTF-16 is not the main problem here, as even in UTF-32 it'd be broken.
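The same point is easy to demonstrate in Python, whose str indexes by code point rather than UTF-16 code unit, so none of this is specific to C#:

    s = 'e\u0301'     # 'é' built from 'e' plus a combining acute accent
    print(len(s))     # 2 code points, but 1 grapheme cluster
    print(s[::-1])    # the accent now precedes the 'e' and attaches to nothing
    print(s[:1])      # 'e' -- naive truncation silently drops the accent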


I think the actual pain in Python 2 came from the misguided decision not to adopt UTF-8 as the default character encoding, combined with silent coercion between unicode/bytes whenever needed. Those two features in combination made Python brittle and dangerous when handling non-ascii characters, not the "strings are bytes" default.

Making strings Unicode by default is wonderful compared to the alternatives (and OP's assertion that this amounts to "assuming the world is Unicode" is disingenuous: there's nothing stopping programs from handling bytes correctly - Python 3 merely resolved the ambiguity).
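A minimal Python 2 sketch of that silent coercion (deliberately not valid Python 3):

    # Python 2 only
    name = u'caf\xe9'          # u'café'
    print('hello ' + name)     # works: the byte string is silently decoded as ASCII
    print(name + '\xe9')       # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9

Whether the implicit decode succeeds depends on the data, so the failure only shows up when a non-ASCII byte happens to flow through, typically in production.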


> I think the actual pain in Python 2 came from the misguided decision not to adopt UTF-8 as the default character encoding

The decision of a default encoding surely dates back to Python 1.0 or earlier, which predates not just UTF-8 but even Unicode itself. Python is an old language!

And if the assertion is that Python 2.0 should have made the tumultuous Unicode jump when it released in 2000, I could get behind that (especially in retrospect!), but enthusiasm for both Unicode and UTF-8 was not nearly as high then as it is today, so I don't begrudge them for not jumping at the opportunity.


Interestingly enough, Ruby 1.8 -> 1.9, the big version jump there, there was this kind of transition. The remainder of this post is all IIRC, it's been a while...

Ruby 1.8 had "everything as bytes" and there was no concept of encodings.

Ruby 1.9 introduced explicit encodings on every string. By default, strings would have the same encoding as your source file, and the default source encoding was ASCII. You could control this explicitly with a magic comment, and so many folks added the "UTF-8" comment to get strings encoded as UTF-8 by default.

Ruby 2.0, which was not as large a transition as Ruby 1.8 -> 1.9, even though it sounds like a larger one, said that encodings of files were UTF-8 by default, and therefore, strings generally became UTF-8 by default as well. Most folks just removed their magic comments.


It's surprising how many people believe that you can use a magic comment to make Python use UTF-8 encoding as the default. All the magic comment affects is the encoding of the source file, not the run-time.
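A quick way to see the distinction on Python 2 (a small sketch; the runtime default stays ASCII no matter what the comment says):

    # -*- coding: utf-8 -*-
    # The magic comment only tells the parser how to decode THIS source file,
    # so the byte-string literal below is read correctly...
    s = 'héllo'
    import sys
    print(sys.getdefaultencoding())  # ...but the runtime default is still 'ascii'
    s.decode()                       # UnicodeDecodeError: implicit decoding still assumes ASCII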


Enforcing UTF-8 as the default encoding, barring a magic comment otherwise, would hardly have been the biggest compatibility break in the 2.x line. It could have been done in any minor release, IMO.


To be fair, IDLE is pretty garbage in most ways.


> in Mercurial's code base, most of our string types are binary by design: use of a Unicode based str for representing data is flat out wrong for our use case.

I feel like this is the essence of the article: specific constraints/choices of Mercurial made their port to Python 3 difficult. Working with early Python 3 certainly did not help. But there seems to have been some stubbornness here mixed with a lot of retroactive justification.

> One was that the added b characters would cause a lot of lines to grow beyond our length limits and we'd have to reformat code.

This is almost ridiculous. You are going to write a JIT partial 2to3 instead of just increasing your length limits and/or using an autoformatter? (Of course, it turns out they eventually did do that... after a bit more stubbornness regarding the autoformatter.)

> So I'm not sure six would have saved enough effort to justify the baggage of integrating a 3rd party package into Mercurial.

Couldn't this have been a very occasional copy and paste, instead of a downstream dependency? [six](https://six.readthedocs.io/) "consists of only one Python file, so it is painless to copy into a project."

> Initially, Python 3 had a rather cavalier attitude towards backwards and forwards compatibility.

Yes, can't disagree. Early adopters who attempted to write 2- and 3- compatible code suffered the most.


> Matt knew that it would be years before the Python 3 port was either necessary or resulted in a meaningful return on investment (the value proposition of Python 3 has always been weak to Mercurial because Python 3 doesn't demonstrate a compelling advantage over Python 2 for our use case). What Matt was trying to do was minimize the externalized costs that a Python 3 port would inflict on the project. He correctly recognized that maintaining the existing product and supporting existing users was more important than a long-term bet in its infancy.

Having just done transitions on a number of much smaller projects I had the same thought. Changes to string handling tripped me up and the changes to relative imports took some thinking. But the biggest frustration was the nagging question: Why am I doing this?

edit: missing word


> Why am [I] doing this?

Lack of security updates past 2019 forced our hand. Did you find a way around that?


> Lack of security updates past 2019 forced our hand. Did you find a way around that?

Amazon is maintaining Python 2 for at least 4 years, as part of their Amazon Linux long term support release. Google app engine will support Python 2 for an unknown amount of time; they haven't announced an end date. PyPy is Python 2, with (to the best of my limited knowledge) no plans to deprecate support. There are also other LTS releases out there which include Python 2 support.

IOW, the forcing function of the PSF no longer supporting Python 2 is not as big a factor as was hoped.


Security updates in Python itself aren't the only issue; a Python 2 project may also depend on packages with security issues of their own, which require continued upstream maintenance.

For example, the python-saml package (for managing SAML-based single sign-on) has separate Python 2 and Python 3 versions, and implements a security-sensitive protocol which means it has (in the fairly recent past) gotten security updates for issues serious enough to rate an assigned CVE. If you're using it, having the current maintainers walk away from the Python 2 version is a serious risk...


I'm a maintainer for a somewhat popular Python package that had support all the way back to 2.4, but I've had to systematically remove support for those versions. The problem is all the CI infrastructure and testing packages are removing support.

Is Amazon planning to support pytest for at least 4 years? It will have its last 2.7-supporting release very soon.


This would only help the server side of Mercurial though. There's no client-side supported distribution really. PyPy is not that popular yet.


I don't know about others, but when I used Mercurial, it was via installing it through brew. And if brew installed pypy as a dependency so Mercurial could still use Python 2, I probably wouldn't have noticed.


You'd notice, because it wouldn't work: https://www.mercurial-scm.org/wiki/PyPyPlan


PyPy is keeping Python 2 support indefinitely, I believe.


There's a project for keeping Python 2 alive: https://github.com/naftaliharris/tauthon

It's particularly uncool that Guido brought up the prospect of lawyers (https://github.com/naftaliharris/tauthon/issues/47#issuecomm...) to force it not to be called Python and opposed letting people who care about keeping Python 2 alive evolve it as "Python 2". (I know he has the legal right to insist on the name change. Still uncool.)


I understand your point of view but on the other hand we can make the parallel with Perl 5 and 6. Having incompatible forks of the language share the same name is a pain for everyone involved. I can completely understand the "mainline" python maintainers not wanting to have to deal with that.

Besides, if the Tauthon people are serious about maintaining their fork long term, it needs to become more than a mere fork and a real language ecosystem of its own; in the long run, having a different name will probably help with that, assuming they ever get there.

EDIT: Also reading the rest of the thread I realize that the post that you linked out of context is slightly misleading (but I blame github's aggressive folding more than you here). Guido's answer comes after the following exchange:

stefantalpalaru: "Disregard Guido's objection. The "Python" trademark doesn't extend to "py2" or "py28". Read this for details: https://www.python.org/psf/trademarks/"

Guido: "Isn't the whole point that we're trying to solve this without lawyers?"

stefantalpalaru: "The whole point is that you've been sabotaging Python 2 for years and when someone does what needed to be done from the start, you come up with silly objections."

Guido: "OK, bring in the lawyers."

In that light, and given the other poster's ridiculously inflammatory take, Guido's answer seems rather level headed and appropriate IMO. He stands his ground, so to speak.


Re: I understand your point of view but on the other hand we can make the parallel with Perl 5 and 6.

Please note that Perl 6 has been renamed to Raku (https://raku.org using the #rakulang tag on social media). So Perl and Raku are now considered to be different languages, albeit from the same inspiration.

Now, if Python 2 people would decide to rename Python 2 to something else, I guess it would be a mirrored parallel :-)


That's precisely what's happening with the third-party Python 2 forks. The renaming of Perl 6 occurred last October, specifically because of the confusion between the incompatible Perl 5 and 6, which caused a lot of trouble for the Perl people on either side for many years.

It's not a mirrored parallel, it's the Python folks learning from Perl's mistakes and making sure that this parallel won't come to be.


I think it makes a great deal of sense for the Python core team to say "we're finished with Python 2 and want nothing more to do with it".

But I'm very disappointed that the Python Software Foundation isn't explicitly supporting people who want to keep Python 2 compiling and running on modern systems. I think that would be well within their remit to "promote, protect, and advance the Python programming language".

This is particularly so because Python is widely used for scientific purposes, and being able to reproduce old results is valuable.

Even before Python 3.0 appeared, I came across scientists saying "I prefer to stick with Fortran because new Python versions break old code too frequently".


PSF does not object to people who keep Python 2 compiling and running, such as ActiveState (https://www.activestate.com/company/press/press-releases/act...).

This case is different, because it's a project that uses the Python name, but actively adds features to the language. This is the classic example of brand confusion - someone might try to use it, find something to complain about, and PSF's reputation suffers as the result. They also get support overhead from the users of the fork (even if all they do is tell them to go away, that is still triage time that could be spent on other issues).


"Does not object" is better than nothing, but I think it would be better if the PSF actively helped to coordinate this work (again, without bothering the core team). As far as I'm concerned, this is exactly the sort of thing that the PSF exists for.


>This is particularly so because Python is widely used for scientific purposes, and being able to reproduce old results is valuable.

You can always download an old version and the respective libraries and use them to reproduce any results you want. That doesn't mean that old version should be supported anymore.


Longtime Python dev who was also annoyed by the 2 - 3 transition here.

I don't see Guido as in the wrong for that. It'd be a smack in the face when you spend years trying to finally push people to switch (for better or for worse) and then a project like this takes the SEO and gets to run freely with it.


Why should Guido or PSF get to tell people to stop using Python 2 even if they no longer want to work on it? It's ungraceful not to hand off maintainership on good terms to someone who wants to do the work.

Imagine if Stroustrup had done D and insisted that it be called C++ and wanted everyone to stop using the language everyone knew as C++ on Jan 1st 2020.


> Why should Guido or PSF get to tell people to stop using Python 2 even if they no longer want to work on it?

They aren't stopping people from using Python 2, the language or Python 2, the software.

They are stopping people from using the name “Python” as the name of forked implementations of Python 2 not maintained by the PSF. No implementation not maintained by PSF is allowed to be called unqualified Python; the name is an important indicator of provenance. There are and have been plenty of third-party Python (2 and otherwise) implementations, the implementations just need their own names.


> They aren't stopping people from using Python 2, the language or Python 2, the software.

The effort to claim the binary name python for Python 3 is actively hostile to having Python 3 and a thing that runs unmodified Python 2 coexist on the same operating system installation. (It's unclear to me how much this is a PSF push, but at least the PEP isn't telling distros to refrain from this hostile-to-compatibility action.)

> No implementation not maintained by PSF is allowed to be called unqualified Python

The best situation would be PSF hosting continued Python 2-compatible development by people who want to do the work.


> The best situation would be PSF hosting continued Python 2-compatible development by people who want to do the work.

For who? This costs the PSF manpower/overhead that they don't want to expend on a thing they don't want to maintain. It dilutes the language that the PSF are stewards of, and would further cause a schism in the python community. None of those things sounds good for python, its ecosystem, or the PSF. They sound good for, like, a few curmudgeonly companies and individuals that don't want to migrate.

I can't parse your first sentence, so I can't respond to it.


> For who?

For users of the Python 2 language who have a lot of Python 2 code and for whom migration doesn't make cost/benefit sense on technical merits of Python 3.

There's Tauthon. There's Active State's long-term support for Python 2. There's presumably Red Hat's long-term support for Python 2. There are probably others. Also, there's the need to keep the server side of pip up and running for these to work.

It would be great if there was a common venue for collaboration for these by the parties who are interested in keeping Python 2 going. (I'm not suggesting that Python 3 core devs should do the work.) Like a foundation for Python software.

The first sentence meant that claiming the command-line executable name python for Python 3 is hostile to letting an execution environment for Python 3 and an execution environment for Python 2 co-exist going forward without having to modify existing programs that assume that python is for Python 2 and python3 is for Python 3.


> The first sentence meant that claiming the command-line executable name python for Python 3 is hostile to letting an execution environment for Python 3 and an execution environment for Python 2 co-exist going forward without having to modify existing programs that assume that python is for Python 2 and python3 is for Python 3.

Yes, but I don't believe I've seen any (real) suggestions to change PEP 394.

> There's Tauthon.

Which I claim is actively bad for python's ecosystem in the long term. It shouldn't be supported by any organization that wants what is best for Python.

> There are probably others. Also, there's the need to keep the server side of pip up and running for these to work.

That works just fine without any help. pypi continues to support python2 tags and wheels, and I doubt that'll change anytime soon.

> There's Active State's long-term support for Python 2. There's presumably Red Hat's long-term for Python 2.

So the entire reasonable bit here is that the PSF should provide something to help various enterprise companies manage backporting security patches. Which, like, I'm not sure what infrastructure is actually needed for that. They already make security patches public. Unless you're suggesting that LTS enterprise support offerings should co-ordinate additional feature work on python 2, which is both unusual and again I claim actively harmful to the ecosystem.


> Unless you're suggesting that LTS enterprise support offerings should co-ordinate additional feature work on python 2, which is both unusual and again I claim actively harmful to the ecosystem.

If you have a large amount of Python 2 code that doesn't make sense to rewrite as Python 3 but does make sense to keep developing as opposed to just keep running as-is, it makes sense to want compatibility-preserving improvements to the language.

That such improvements are considered actively harmful comes from a point of view where there's a top-down imperative to shut down Python 2 in order to make Python 3 succeed. It's not harmful from the point of view of the code people have written in Python 2 being valuable.

The notion that the user community needs to work for Python (by porting to Python 3), and that Python 2 needs to be shut down as opposed to Python development valuing the existing code that had been developed, is the core problem with Python 3.


> If you have a large amount of Python 2 code that doesn't make sense to rewrite as Python 3 but does make sense to keep developing as opposed to just keep running as-is, it makes sense to want compatibility-preserving improvements to the language.

But it really doesn't. If the new features are that valuable, you can convert your code. It's not actually that hard (I have a few 100kloc ported forward now, with millions of lines of dependencies that says so).


That project is not Python 2 though, it added features that made it incompatible with both Python 2 and Python 3. Just look at their effort to add wheel support.

Any project that forks changes name:

nagios -> icinga

mysql -> mariadb

NetBSD -> OpenBSD

FreeBSD -> DragonflyBSD

Python -> PyPy, Jython, IronPython

It would be crazy for them to keep the same name and not be compatible. It would cause confusion and also lead to increase of support tickets in wrong bug trackers.


In those cases, the original project lived on. Here Python 3 is the incompatible fork, but because the technical fork was done by the folks who control the name and who want to shut the old thing down, the compatible evolution of Python 2 had to change its name.


Your analogy is not appropriate. The actual situation with Tauthon is as if someone was not happy with C++17, so they forked C++14, added new features, changed syntax, and then insisted on calling it C++. It's just confusion for the users and it's in the best interest of the PSF to protect the Python name.


I'd agree if the Python core devs were still interested in evolving Python 2.x. But they aren't, so now no one else gets to do Python 2.8, either. It would be the best if the PSF provided a venue for Python 2.x development even if the folks who went on to do Python 3 weren't the people working on it.

Anyway, the core problem is a top-down effort to make a programming language with Python 2.x’s level of usage stop, to the extent it’s stoppable under its license, because its creators wanted to do something else, as opposed to facilitating its user community pooling resources to continue its development. Does the PSF have a legal obligation to do such facilitation? No. Is the lack of such facilitation bad for parties who bought into Python when it was Python 2? Yes.


> core devs were still interested in evolving Python 2.x.

They absolutely are. In fact, python 3.9 is in the works right now, which has many new evolutions beyond 2.7.

You're arguing that the psf should treat python2 and 3 as different languages. In their (and my) opinion, this is harmful. It bifurcates python into two incompatible languages. That's bad long term (Perl).

In other words, what's best for python the language, and what's best for python2 the language are not the same. And for the psf, python is more important.


> They absolutely are. In fact, python 3.9 is in the works right now, which has many new evolutions beyond 2.7.

I meant compatible (in the sense that old programs keep running and you can add new stuff to old programs using the new features) evolutions.

> You're arguing that the psf should treat python2 and 3 as different languages.

For practical purposes, they are different languages and the PSF has been treating them as distinct things.

> In their (any my) opinion, this is harmful. It bifurcates python into two incompatible languages. That's bad long term (Perl).

It indeed is bad. I hope that every other programming language community and designer takes a close look at what happened and makes sure never to do a Python 3 analog of their language.

> In other words, what's best for python the language, and what's best for python2 the language are not the same. And for the psf, python is more important.

That's the core problem from the perspective of Python 2 users. The organization that was the steward of the language that they invested in (in the form of writing code in the language) decided not only that a different programming language is more important for the org but that the old language needed to be shut down in order to benefit their new thing.

It's OK for people to get bored with a project and move onto something else, but with the level of usage that Python 2 had and has, it's very problematic for the language steward organization to turn around and seek to shut the language down instead of continuing to evolve it in a way that's respectful of the language users' investment in the language.


> a way that's respectful

You had like 10 years of warning and it's "disrespectful"? I don't think there's a chance of productivity if you're starting from that baseline level of entitlement. Sure, mandates are annoying. But I just can't fathom that.


It's not about how many years of warning there was. It's about making users of the language rewrite by mandate, as opposed to the new features being incrementally adoptable into existing code bases. Sure, that means there are some language changes you never get to make.

Java, JavaScript, C, and C++, for example don't break investment in old code like Python 3 did. They form a reasonable baseline.


And we have Kotlin, TypeScript, and Rust due to those languages' unwillingness to make breaking changes. The C++ committee's unwillingness to remove old garbage from the language is, IIRC, the most cited issue with the language by longtime users.

There are tradeoffs.


You can add Kotlin to your app without rewriting all Java. You can add TypeScript to your app without rewriting all JavaScript. You can add Rust to your app without rewriting all C++. Seems reasonable.

That Python 2 and 3 can't co-exist in an app is pretty bad in comparison.


> That Python 2 and 3 can't co-exist in an app is pretty bad in comparison.

You're mistaken. I have python3 binaries and python2 binaries that share dependencies.

You're correct that fully automatic transpilation is impossible, but that doesn't mean that there can't be shared source. It does however mean that things like per-file flags or whatnot aren't possible. Python became a better language with text vs. bytes support, but that support couldn't be done in a backwards compatible way. Oh well.

> You can add Rust to your app without rewriting all C++.

It's not as good as you seem to think. It's a nonstarter for a lot of people otherwise interested in adopting rust into existing codebases. Certainly not better than the py2/3 situation.

Kotlin interop also is troublesome, although granted better than rust/cpp or py2/3.

> That Python 2 and 3 can't co-exist in an app is pretty bad in comparison.

That python didn't get replaced by a different language is an incredible testament to the foresight of the python language stewards.


> as opposed to facilitating its user community to pool resources to continue its development

How does reusing the name facilitate the development? Every time there is a fork of an open-source project the name changes, precisely to avoid confusions. Reusing the Python name in a fork that is not just a redistribution, but a new version with new features and syntax, is just confusing, unusual and does not help anyone.


The difference is that the major C++14 implementations are still supported, and probably will be for as long as the concept of C++ exists.

There's no `--std=python2` flag you can pass to the interpreter, unfortunately.


There is no '--std=C++14-with-arbitrary-things-from-c++20' flag either, which is what this fork does. We can discuss whether breaking backwards compatibility was bad or necessary, but creating another fork of Python that backports some features of Python 3 is just adding confusion. If their primary purpose is supporting Python 2.7 applications, they can do just fine without calling it Python.


There doesn’t need to be such a flag because C++17 was fully backwards compatible with 14, with some tiny exceptions nobody cares about.

Indeed, C++ has rarely made any breaking changes. A decade or so ago, GCC did cause some major ecosystem breakage, by cracking down on C++ constructs which had never been valid according to the spec but which GCC had previously allowed. When that happened, there was a flag to (at least partially) revert to the old behavior: -fpermissive.


> There doesn’t need to be such a flag because C++17 was fully backwards compatible with 14, with some tiny exceptions nobody cares about.

This literally does not parse. How do you know "nobody" cares about those exceptions?


'-std=C++14' already includes a few extensions, it is not pure C++14 but a superset. And then there is '-std=gnu++14' too.


No, this is like if someone was not happy with C++17 AND gcc removed its support for C++14. Instead, I can still happily compile C++14, C++11, C++03 and C++98 code with gcc.


I disagree. If someone else wants to continue to develop Python 2 outside the Python foundation and formalised development community, then that is their prerogative, but Python has the right to decide what is and is not Python.

Dilution of what is commonly accepted to be Python would not be a good thing, and would further add to confusion.

I know that platform upgrades are painful, but we need to move with the times or we'll all be mired in technical debt and old technology.


Yes, them keeping Python 2 alive for the 10 years while Python 3 was developed caused a lot of issues; it would be extremely short-sighted to allow a third (incompatible) Python into the mix.


> it would be extremely short sighted to allow third (incompatible) python into the mix

The whole point of Tauthon is that it is compatible with Python 2 (in the direction that old programs work).


Letting "Python 2" zombie around is unacceptable. Python 3 is better in every way, and has been since 3.3 (which Armin deserves a lot of credit for).

Consider anyone who wants to build something with Python, whether it's a library, application, or service. What's better, having to build for Python 3 and 2, or just Python 3?

Thank God that Guido did this, despite knowing all the blowback he'd get. To me, that's super cool.


"better in every way" ... except for 1) startup time (according to the linked-to article), 2) support for existing Python 2 code, and 3) support for Python 2 C extensions.

For example, https://blog.khinsen.net/posts/2017/11/16/a-plea-for-stabili... describes the "Molecular Modelling Toolkit (MMTK), which might well be the oldest domain-specific library of the SciPy ecosystem, will probably go away after 2020. Porting it to Python 3 is possible, of course, but an enormous effort (some details are in this Twitter thread[1]) for which resources (funding plus competent staff) are very difficult to find."

[1] The thread at https://twitter.com/khinsen/status/930749714567434240 includes "Lots of C modules written for Python 1.4 are waiting for enthusiastic code archeologists ;-)".

I don't think Hinsen is alone in that situation. I can well believe there are some people who, for example, plan to retire in about 5 years and would rather stick with a Python 2 zombie than spend time porting working code to Python 3.


Startup time is complex, but base startup only increased about 20ms, and that's being generous.

I'll admit Python 3 is still slower at a lot of things. But that feels like saying your new dog is even worse at math than your old one.

The C extension thing isn't Python's fault. It's the job of library and app authors to update. Do we complain that Vulkan has bad SunOS support? This is totally backwards.

Could Hinsen (and others) not just version their deps? It's not like people are erasing Python 2 off the internet. If his main worry is reproducibility, he should be doing that anyway.

---

I don't want to give the impression I like the whole Python 3 thing. I think it was a pretty big mistake and a huge missed opportunity. I'm very sympathetic to people who had to put in a lot of work for basically no good reason--Python 3 didn't really offer anything significantly better than 2 until... 3.5 (3.4 if you think the first pass at async was useful, I personally don't).

But I also find the ballyhooing about it really insufferable. Yeah it was a mistake; Armin Ronacher (as usual) was right. It was also over 11 years ago. Time to forget all about this and build cool stuff, please please please.


It takes me 37ms to load Hacker News; a 20ms start time is embarrassing. What is it even doing?


It does a lot of module imports. Most of those probe the filesystem.

Try "python -vv -c 'pass'" - I'm only showing the first few dozen lines, and I've trimmed some of the paths for conciseness:

    % python -vv -c 'pass'
    import _frozen_importlib # frozen
    import _imp # builtin
    import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
    import '_warnings' # <class '_frozen_importlib.BuiltinImporter'>
    import '_weakref' # <class '_frozen_importlib.BuiltinImporter'>
    # installing zipimport hook
    import 'zipimport' # <class '_frozen_importlib.BuiltinImporter'>
    # installed zipimport hook
    import '_frozen_importlib_external' # <class '_frozen_importlib.FrozenImporter'>
    import '_io' # <class '_frozen_importlib.BuiltinImporter'>
    import 'marshal' # <class '_frozen_importlib.BuiltinImporter'>
    import 'posix' # <class '_frozen_importlib.BuiltinImporter'>
    import _thread # previously loaded ('_thread')
    import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
    import _weakref # previously loaded ('_weakref')
    import '_weakref' # <class '_frozen_importlib.BuiltinImporter'>
    # miniconda3/lib/python3.7/encodings/__pycache__/__init__.cpython-37.pyc matches miniconda3/lib/python3.7/encodings/__init__.py
    # code object from 'miniconda3/lib/python3.7/encodings/__pycache__/__init__.cpython-37.pyc'
    # trying miniconda3/lib/python3.7/codecs.cpython-37m-darwin.so
    # trying miniconda3/lib/python3.7/codecs.abi3.so
    # trying miniconda3/lib/python3.7/codecs.so
    # trying miniconda3/lib/python3.7/codecs.py
    # miniconda3/lib/python3.7/__pycache__/codecs.cpython-37.pyc matches miniconda3/lib/python3.7/codecs.py
    # code object from 'miniconda3/lib/python3.7/__pycache__/codecs.cpython-37.pyc'
    import '_codecs' # <class '_frozen_importlib.BuiltinImporter'>
    import 'codecs' # <_frozen_importlib_external.SourceFileLoader object at 0x10dd44c90>
    # trying miniconda3/lib/python3.7/encodings/aliases.cpython-37m-darwin.so
    # trying miniconda3/lib/python3.7/encodings/aliases.abi3.so
    # trying miniconda3/lib/python3.7/encodings/aliases.so
    # trying miniconda3/lib/python3.7/encodings/aliases.py
    # miniconda3/lib/python3.7/encodings/__pycache__/aliases.cpython-37.pyc matches miniconda3/lib/python3.7/encodings/aliases.py
    # code object from 'miniconda3/lib/python3.7/encodings/__pycache__/aliases.cpython-37.pyc'
    import 'encodings.aliases' # <_frozen_importlib_external.SourceFileLoader object at 0x10dd67d10>
    import 'encodings' # <_frozen_importlib_external.SourceFileLoader object at 0x10dd440d0>
    # trying miniconda3/lib/python3.7/encodings/utf_8.cpython-37m-darwin.so
    # trying miniconda3/lib/python3.7/encodings/utf_8.abi3.so
    # trying miniconda3/lib/python3.7/encodings/utf_8.so
    # trying miniconda3/lib/python3.7/encodings/utf_8.py
    # miniconda3/lib/python3.7/encodings/__pycache__/utf_8.cpython-37.pyc matches miniconda3/lib/python3.7/encodings/utf_8.py
    # code object from 'miniconda3/lib/python3.7/encodings/__pycache__/utf_8.cpython-37.pyc'
    import 'encodings.utf_8' # <_frozen_importlib_external.SourceFileLoader object at 0x10dd44bd0>
    import '_signal' # <class '_frozen_importlib.BuiltinImporter'>
    # trying miniconda3/lib/python3.7/encodings/latin_1.cpython-37m-darwin.so
    # trying miniconda3/lib/python3.7/encodings/latin_1.abi3.so
    # trying miniconda3/lib/python3.7/encodings/latin_1.so
    # trying miniconda3/lib/python3.7/encodings/latin_1.py
    # miniconda3/lib/python3.7/encodings/__pycache__/latin_1.cpython-37.pyc matches miniconda3/lib/python3.7/encodings/latin_1
       ... many, many more lines omitted ...
This can be sped up a lot using a zipimport of the Python standard library, https://docs.python.org/3/library/zipimport.html?highlight=z... , where all of the standard library is put in a zipfile. Then there's only a single file access to get the zip metadata.

One of the things that bugged me in Python2 was that every startup imported UserDict:

    # trying python2.7/UserDict.so
    # trying python2.7/UserDictmodule.so
    # trying python2.7/UserDict.py
    # python2.7/UserDict.pyc matches python2.7/UserDict.py
    import UserDict # precompiled from python2.7/UserDict.pyc
This is because os.environ was an instance of UserDict:

    % python2.7 -c 'import os; print(os.environ.__class__.__bases__)'
    (<class UserDict.IterableUserDict at 0x1029b14c8>,)
Under Python3 this is spelled collections.abc.MutableMapping:

    % python3 -c 'import os; print(os.environ.__class__.__bases__)'
    (<class 'collections.abc.MutableMapping'>,)
which triggers its own set of imports:

    # trying python3.6/collections/abc.cpython-36m-darwin.so
    # trying python3.6/collections/abc.abi3.so
    # trying python3.6/collections/abc.so
    # trying python3.6/collections/abc.py
    # python3.6/collections/__pycache__/abc.cpython-36.pyc matches python3.6/collections/abc.py
    # code object from 'python3.6/collections/__pycache__/abc.cpython-36.pyc'
    import 'collections.abc' # <_frozen_importlib_external.SourceFileLoader object at 0x103c2ecf8>
There's better performance using an SSD than an HDD, which is in turn better than using a networked filesystem.


I'm guessing startup time just isn't an important goal for CPython then? I know they've refused to implement some optimizations that would significantly increase complexity, but this seems like low-hanging fruit?


Oh, there's been plenty of work to reduce the Python startup cost.

It's just hard to fix.

I'm not sure the os.environ example I gave is low-hanging fruit now. The collections.abc module might be imported anyway.

This is neat! Python 3.7 added the `PYTHONPROFILEIMPORTTIME=1` environment variable to help track down these sorts of import overheads:

  % env PYTHONPROFILEIMPORTTIME=1 python -c pass
  import time: self [us] | cumulative | imported package
  import time:       523 |        523 | zipimport
  import time:       722 |        722 | _frozen_importlib_external
  import time:       156 |        156 |     _codecs
  import time:      2254 |       2409 |   codecs
  import time:      1293 |       1293 |   encodings.aliases
  import time:      7192 |      10893 | encodings
  import time:      1108 |       1108 | encodings.utf_8
  import time:       182 |        182 | _signal
  import time:      1069 |       1069 | encodings.latin_1
  import time:       395 |        395 |     _abc
  import time:      1486 |       1881 |   abc
  import time:      1540 |       3420 | io
  import time:       100 |        100 |       _stat
  import time:       975 |       1075 |     stat
  import time:      1481 |       1481 |       genericpath
  import time:      1734 |       3214 |     posixpath
  import time:      2558 |       2558 |     _collections_abc
  import time:      2234 |       9079 |   os
  import time:      1407 |       1407 |   _sitebuiltins
  import time:      3498 |       3498 |   sitecustomize
  import time:        85 |         85 |   usercustomize
  import time:      4129 |      18196 | site
Investigating further, the "import os" which triggered the UserDict/collections.abc is a consequence of "import site". If I use "python -S" then those aren't imported.


I can't help but interpret your response as "Python 3 is better in every way ... for the ways I think are important."

For some of my programs, Python startup time is the main overhead. I avoid NumPy and SciPy if at all possible because they have a huge startup overhead.

Some of this is inherent in those packages. NumPy internally imports everything so someone can do "import numpy as np; np.package.subpackage.module.function()" without doing the intermediate imports.

This means NumPy is optimized for programmers (especially novice programmers) using NumPy in long-lived processes where startup cost is a negligible overhead.

Which isn't all use-cases for numeric computing.

15 years ago I supported a CGI-based web app. It was very important to pull out all the stops (delay imports until needed, use zip packages) because it was easier to do that than to re-write everything for another architecture.
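(The "delay imports until needed" trick is nothing exotic; a sketch, with numpy standing in for any expensive dependency:)

    def load_matrix(path):
        # Importing inside the function means short-lived invocations that never
        # reach this code path don't pay the import cost at startup.
        import numpy as np
        return np.loadtxt(path)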

The dog does count pretty well after all.

> It's the job of library and app authors to update.

Why? Linus Torvalds doesn't agree with you, for one.

As Hinsen points out,

] Unfortunately, the need for long-term stability is rather specific to scientific users, and not even all of them require it (see e.g. these two tweets by Titus Brown). So while Python 3 is probably a step forward for most Python users, it’s mostly a calamity for computational science.

Some scientific code has been able to run unchanged since the 1970s, through multiple new Fortran language releases.

Now, yes, I know the reasons for the changes to Python. I know the funding and organizational realities.

But why not recognize that for some situations Python 3 is not better?

Hinsen also comments on your proposal:

] The implication is that breaking changes in the infrastructure layers are OK and must be absorbed by the maintainers of layers 3 and 4. In view of what I just said about layer 4, it should be obvious that I don’t agree at all with this point of view. But even concerning layer 3, I find it a bit arrogant. The message to research communities with weaker code development traditions, and thus fewer resources, is that their work doesn’t matter.

> Could Hinsen (and others) not just version their deps?

He addresses that, I think. One of the other commenters gives a more complete reply at https://metarabbit.wordpress.com/2017/11/18/numpy-scipy-back... ending "Freezing the versions solves some problems, but does not solve the whole issue of backwards compatibility.".

> Time to forget all about this and build cool stuff, please please please.

I'll quote Hinsen again "I find it a bit arrogant. The message to research communities with weaker code development traditions, and thus fewer resources, is that their work doesn’t matter."

Your implicit statement is that mmtk (Hinsen's code base) isn't "cool stuff". Why? Simply because it's old, or because you don't know about it or need it? What other cool old stuff will die because it's part of a community without the resources to update?

Instead, accept that that loss is part of the trade-offs, be empathetic to those who suffer, and bear those lessons in mind for future work you do.


Well, I'd like to start off by saying I think we agree overall. I do think Python 3's advantages didn't merit its disadvantages until many years after the initial release.

Second, I admit to engaging in hyperbole when I said "Python 3 is better in every way"; usually I'm on the other side of these, but I'm just so fed up with people complaining. But you're right, there are still ways Python 3 isn't "better". I'd love to have productive, technical discussions about them, but we can't seem to get beyond the "Python 3 was a super bad idea" stuff, and I'm totally uninterested in that.

But beyond that, you and I are mostly talking about different things. Python 3 isn't NumPy or SciPy. If you're building extensions on top of them, you need to look at their compatibility commitments. If you want them to make more commitments, you have to convince them. This isn't specific to software engineering; this is due diligence for anything you're gonna put years of work into.

Django's page [1] is a great example of this. Python has one too [2]. I don't have any idea about SciPy/NumPy; It looks like SciPy 1.2.0 was an LTS release supported until 1/1/2020, but what do I know.

But importantly, the end result of this "hey, do 100x the work otherwise our science won't be reproducible" stuff will be to force people out of producing free software for scientific computing. And the non-free stuff is expensive, good god. Surely this isn't what you want.

A better tactic here is to work with the developers in establishing more compatibility between releases. You probably aren't gonna get Fortran levels of compatibility--a language and platform that's seen very, very little change over the decades. But then again, the core selling point of scientific Python is that you get to use a modern platform with modern features. Asking for that along with a 50 year compatibility guarantee is a laughably tall order: you can't have it both ways without exponential amounts of work. So just like you're asking other engineers to be empathetic and respect your need for more compatibility with your extensions, you need to be more empathetic and respect their resources. And the best place to do that is probably their contact page [3], not Twitter, HN, or random blogs.

[1]: https://www.djangoproject.com/download/#supported-versions

[2]: https://devguide.python.org/#status-of-python-branches

[3]: https://www.scipy.org/scipylib/mailing-lists.html


You write "we can't seem to get beyond the "Python 3 was a super bad idea" stuff, and I'm totally uninterested in that."

Perhaps your "fed up"-ness means you overlook conversations which do go beyond that? Or do you put me into that category as well?

> Python 3 isn't NumPy or SciPy. ... this is due diligence for anything you're gonna put years of work into.

Hinsen's essay discussed these issues related to "software layers and the lifecycle of digital scientific knowledge". He put Python in layer 1, and NumPy/Scipy in layer 2.

In his essay he also said "I would like to see the SciPy community define its point of view on these issues openly and clearly. ... It’s OK to say that the community’s priority is developing new features and that this leaves no resources for considering stability. But then please say openly and clearly that SciPy is a community for coding-intensive research and that people who don’t have the resources to adapt to breaking changes should look elsewhere. Say openly and clearly that reproducibility beyond a two-year timescale is not the SciPy community’s business, and that those who have such needs should look elsewhere."

So I'm not convinced that we are talking about different things as you are making points I already referred to, albeit indirectly.

I'm also not sure you understood all of Hinsen's points. I say this because you wrote ""hey, do 100x the work otherwise our science won't be reproducible" stuff"

But Hinsen said "Layer 4 code is the focus of the reproducible research movement" and "the best practices recommended for reproducible research can be summarized as “freeze and publish layer 4 code” -- a solution you mentioned earlier.

It's just that reproducibility isn't the only goal for stability.

Another is to be able to go back to a 15 year old project and keep working on it, without taking the hit of rewriting it to a new, albeit similar, language.

I also have a small amount of umbrage about your comment:

> So just like you're asking other engineers to be empathetic and respect your need for more compatibility with your extensions, you need to be more empathetic and respect their resources.

I earlier wrote "Now, yes, I know the reasons for the changes to Python. I know the funding and organizational realities."

Did you overlook that because of your '"fed up"-ness', or was that not enough for you?


> Perhaps your "fed up"-ness means you overlook conversations which do go beyond that? Or do you put me into that category as well?

I do put you in that category, because you seem to be focused much more on the negative, rather than being constructive and trying to find solutions to problems.

> I'm also not sure you understood all of Hinsen's points. I say this because you wrote ""hey, do 100x the work otherwise our science won't be reproducible" stuff"

I've read and directly disagreed with his essay. His points are:

- Python 2 going away orphans a lot of software, because there's a lack of resources/willingness to port to Python 3.

- Python 3 didn't provide enough value to the scientific community to justify all the breakage (this is true for almost every community, btw).

- SciPy breaks compatibility roughly every 2-3 years, which is a bad fit for the pace of scientific computing.

- Beyond that, breaking compatibility threatens reproducibility.

- The SciPy community doesn't seem to know or care about compatibility concerns.

- Projects written on top of SciPy libraries ("Layer 3" code) have to keep updating, and they don't always have resources/willingness to do that.

- It would be cool if SciPy laid out a support schedule.

- It isn't cool that SciPy says, "hey use us", and then breaks compat all the time.

- There are some languages/platforms that haven't changed in decades, this isn't an excuse.

Here's what I've said:

- Agree Python 3 didn't provide enough value.

- If you want to build something on SciPy that you expect to last for decades, you should look for a compat guarantee. If you don't, that's on you.

- If you want new features plus decades of compat, that's a ludicrous amount of work.

- If you want to find a way forward, start a dialogue with SciPy devs.

Hinsen's examples of Fortran and Java are illuminating. Fortran's a platform that's seen very minimal evolution over its history. That's exactly the reason people want to use SciPy instead of Fortran. Java's a platform with... billions of engineering hours? It's ironic that a guy who doesn't want to spend the resources to update his own software is asserting that someone else can continually deliver a modern scientific computing platform with new features while never breaking compat, and that they just don't feel like it ("It's all a matter of policy, not technology"). That's wrong; it's a question of resources.

---

My diagnosis here is communication breakdown. Everyone here wants the same thing: use a modern software stack for scientific computing. So again I'll say get on the mailing lists, get on IRC, go to the conferences, and talk to the engineers. Be constructive.


Did you see the other replies on the thread?

Guido has absolutely every right here.


I downgraded Deluge to the Python 2 version because the new one doesn't work in Windows and I use both operating systems.


It’s funny, on the Mac one becomes used to constant changes, rewriting damn near everything just to stand still. Yet I designed my Mac app long ago to depend on the system “Python 2” (bound to C++), because it seemed that both the installation itself and the Python language and libraries were very stable. Looking back, this turned out to be sustainable for a remarkably long time, as “Python 2” really did evolve only additively and there was almost no reason to even touch 15-year-old code that was relying on Python 2. For the Mac platform especially, this reliability is unheard of.

More amazing to me is that in Catalina, the release famous for breaking just about everything else, “Python 2” is still there and works as it always has! Of course, Apple did announce that it will be ripped out in the next release. :)


I think this weird thing happened with Python 2. I believe Python 2.6 (Oct-2008) was the last "feature release" and 2.7 (Jul-2010) was intended as a bridge. So since 2008, 2.x users have been shielded from most all of the normal churn of any widely used language that's in active development.

What I don't think people realize is that not only are you expected to move to 3.x, but you'll have to keep up or fall behind with new 3.x releases. During that same period (since 2008) 3.x has had 9 big releases. Of course that 2.x stability was done with the assumption you'd move to 3.x and isn't sustainable for PSF indefinitely.


> Of course, Apple did announce that it will be ripped out in the next release

They did? Damn, I was using that...


I have never seen such rejection in the Django community, despite real problems like the WSGI design and handling I/O, and thus working with bytes a lot.

Every huge task, like porting from Python 2 to Python 3, is either everybody's task or just a small group's. And since the latter seems more reasonable so as not to interfere with ongoing development, the former is the only way I have seen such tasks succeed.

Artificial rules to create comfort for one group at the expense of another group, like the following

>> This ground rule meant that a mass insertion of b'' prefixes everywhere was not desirable, as that would require developers to think about whether a type was a bytes or str, a distinction they didn't have to worry about on Python 2 because we practically never used the Unicode-based string type in Mercurial.

sound pretty much wrong to me.

If there is a pain, it should become everybody's pain, or otherwise people will simply burn out and hate their own work, like the author did. There is no way porting to Python 3 can be harder than porting to Rust. Rust is statically typed and not garbage collected. Everyone would have to think about whether they need a string or an array of bytes anyway, but also about who owns them.

Overall, described situation looks like management issue and not a technical one to me.

Edit: typos.


> There is no way porting to Python 3 can be harder than porting to Rust. Rust is statically typed and not garbage collected. Everyone would have to think if they need string or array of bytes anyway, but also, who owns them.

The author addresses this. The difference is that when porting to Rust you'd likely get a faster and more correct program in the end. (Huge caveat of big rewrites, of course). Whereas with Python 3 they feel like they did all the porting work and got nothing valuable in return.


> There is no way porting to Python 3 can be harder than porting to Rust. Rust is statically typed and not garbage collected. Everyone would have to think if they need string or array of bytes anyway, but also, who owns them.

The Rust compiler statically checks those decisions, while in Python issues with string types will only be caught at run-time, so everywhere your test suite has missing coverage, porting is likely to introduce regressions. That is one way in which a Rust port would be easier.


I once switched our unit tests from jasmine 1.3 to mocha because jasmine is kind of a mess, and jasmine 1.3 tests look too much like they should still work in jasmine 2.0, except some of the corner cases on equivalence of objects are wrong. So some of your tests would go red with no code change, but others would be green and stay green even when the code no longer functions properly. Like cutting the wires to your smoke detector.

It would take quite a bit of change in a language for a port to be safer than an upgrade, but it's not completely impossible.


We are on the brink of completing the transition to python3 at my work.

The end result of this is that I just spent a good chunk of last week reviewing a pull request with 70,000 lines of changes, which was one of the final ones in a series of ~10k-line pull requests that came in through the fall.

All of this was the heroic effort of one of my coworkers who had the unenviable task of combing through our entire codebase to determine "This is unicode. This is bytes. Here is an api boundary where we need to encode / decode." etc.

It was a nightmare of effort that I'm glad to have behind us.


> All of this was the heroic effort of one of my coworkers who had the unenviable task of combing through our entire codebase to determine "This is unicode. This is bytes.

Dynamic typing!


Not dynamic typing's fault.

The issue is they changed the types out from underneath you.

And then left it to each library to decide which type it was actually going to accept.


But with static typing, the compiler can let you know when you're doing something wrong with the new Unicode-based string.


Well, it can, except you then need to go through and update all of your internal APIs to be correct.

Really the string transition was just a poor choice in my opinion. Python2 already had unicode strings that were easy enough to specify (just prefix with a `u`).

It would have been better to delineate that barrier more clearly from an API standpoint.

I understand the appeal of having unicode for the default string literal type, but it was actively hostile to existing projects.


> Well, it can, except you then need to go through and update all of your internal APIs to be correct.

You do, but it's easy: run a compile, fix the errors, repeat until no more errors.

> It would have been better to just delineate that barrier better from an API standpoint.

Isn't that exactly what the Python 3 transition was? i.e. stop accepting non-unicode "strings" (actually just arbitrary byte sequences) for APIs that semantically require a string, reserve them for APIs that actually want a byte sequence.


> You do, but it's easy: run a compile, fix the errors, repeat until no more errors.

The reason this doesn't work is that previously the double-quote literal was a "string" type. The string type was, yes, just a sequence of bytes, but in an ascii-centric world that also meant text.

Python2 added unicode string literals that accepted unicode code points. Most APIs were happy to sloppily mix the two and generally work quite adequately.

Python3 then made the hard distinction between byte-string and unicode-string. Not an unreasonable position to take on the face of it. The issue is many python2 APIs were written from the perspective of "accepts string literal types", where that could be either bytestring or unicode string.

Now suppose you have a large codebase in python that spans the entire stack from database interaction, to webserver, to desktop application, all built on double-quoted string literals, accepting unicode strings only in the places that needed them (user-facing places mainly, with utf-8 bytestrings anywhere data was stored on disk or sent over the network).

Then you go to switch to python3, and suddenly all of your string literals are interpreted as unicode instead of bytestring / ascii sequences. So now you need to go through every place in your codebase that accepts strings as an argument and determine, "is this a user-facing string, or a utf-8 bytestring", because they used to be basically the same thing, and now they aren't.

It's not "difficult" really, it's just a pain in the neck.


None of that would be a problem in a typed language. The ultimate destination of any string literal is some standard library function, whether that's write to network socket, display to user, or something else. So you just ripple backwards from that through your own functions that are calling those standard library functions, until you get to the point where you're passing in the literal, and then you know what kind of literal it needs to be.
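
Roughly the same ripple effect can be simulated in Python itself with type annotations and a checker like mypy (a hedged sketch; the function and values here are made up for illustration):

  import socket

  def send_header(sock: socket.socket, name: str, value: str) -> None:
      # The bytes/text decision is forced here, where the stdlib wants bytes...
      sock.sendall("{}: {}\r\n".format(name, value).encode("ascii"))

  # ...and the checker ripples it back to every caller: mypy flags
  # send_header(sock, "Content-Type", b"text/html") as passing bytes
  # where str is expected, before the program ever runs.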


> None of that would be a problem in a typed language.

Python is dynamically typed and weakly typed, but still typed. That's precisely the problem! The difference is just that a statically typed language gives you all the information, and a dynamically typed language doesn't, but still fails. Just without providing you the necessary information up-front.

There's a nice explanation here: https://existentialtype.wordpress.com/2011/03/19/dynamic-lan...


> Python is dynamically typed and weakly typed, but still typed.

People who claim that dynamic typing is a thing claim that Python is strongly typed. (This is of course nonsense; there's no such thing as dynamic typing, because types are by definition something that expressions in a language have, not something that runtime values have).

> There's a nice explanation here: https://existentialtype.wordpress.com/2011/03/19/dynamic-lan....

That is not a "nice explanation". It is writing to obscure rather than to clarify. And it certainly acknowledges that one cannot have differently typed values in a dynamic language.


That's not really much different from the path we took. It's just that instead of running the compiler, we ran the linter and test suite until things passed. When you have a million lines of code, that takes quite a while.


Delphi also went through a similar transition from "strings are in whatever the local code page says" with one byte chars to Unicode strings (Windows-style).

However the makers of Delphi spent many years preparing for this, so when the time came for us to switch we only had to spend half a day or so to migrate our half a million lines of code.


Something is wrong if there is no third type: the "natural" string (bytes on Python 2, unicode on Python 3).


I assume many of the strings were left untouched. But you still have to audit all of it to know which needs to be used where.


I believe that's included in the "etc."


Surely any "natural" string would be better represented as unicode in Python 2? What is an example that wouldn't be?


> Surely any "natural" string would be better represented as unicode in Python 2?

No, because much of the stdlib works in terms of native strings and will choke (or, worse, silently fuck up) on the other type. Yes, even in Python 2 the stdlib was absolutely not “unicode clean”.

So a transitional / polyglot codebase has and needs not 2 but 3 string types: bytes, unicode, and native. And neither “unicode literals” nor “bytes literals” were good things to apply across the board.


I've found myself defining `native_str = bytes if PY2 else str` (with `if PY2: str = unicode` at the top of the file, as in all my py2/py3 polyglot code) because there are some things that need bytes on Python 2 and unicode on Python 3 - e.g. the `__file__` attribute of a dynamically created module or other low-level things.
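
Spelled out, the shim looks roughly like this (a sketch of the pattern described above, not a standard library helper):

  import sys

  PY2 = sys.version_info[0] == 2

  if PY2:
      str = unicode  # noqa: F821 - make "str" mean text on both versions

  # The "native" string type: bytes on Python 2, unicode on Python 3.
  # Low-level attributes like a dynamically created module's __file__ want this type.
  native_str = bytes if PY2 else str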


I believe what they meant was that for many strings it really shouldn't matter if they were bytes or unicode. They would perform their function correctly either way. That's completely true, but you do still have to go through and find the cases where that doesn't work.


There is. It looks like this:

  u"Hello World"


The biggest problem with the Python 2 to Python 3 transition was not that breaking changes were made. It’s that breaking changes were made in a way such that you could not easily have code that worked both on Python 2 and Python 3.

It took years before the advent of six, Python 3 u’’ literals, and modernize. The author discusses this at length.


Another big problem is there was no significant incentive to adopt Python 3. That’s why it took so long for large projects to transition. In comparison, during the last decade, C++ went from dodgy C++11 toy projects to all new code being written in modern C++. The modern feature set is that good.


C++ doesn't mandate you switch from std::cout to fmt in order to use lambdas. If they did that, I think we'd see a lot less modern C++.


That’s a find-and-replace fix that can be addressed reliably. A relatively smaller problem versus moving off of boost.

The compiler support for C++11 (and especially inconsistencies in Debian packages, compiled flags, etc) was a very painful issue for several years. But auto is that useful ...


Right, moving to std::cout to fmt could be as simple as a find-and-replace fix. That the C++ committee could have inflicted this minimal pain on their users, but chose not to do it, shows some amount of concern for backwards compatibility and old codebases. By comparison, Python 3 changed the entire text model and dropped the mic, and waited for 8 years to start to pick the pieces back up.


I guess I don’t understand your argument that “Python 3 changed the entire text model and dropped the mic.” Format strings are optional; the old % operator still works fine. The change to unicode is dramatic, but personally I haven’t run into major problems. I’ve had unit tests break because of it, but that’s why one has unit tests. I’ve also worked on a very large python webapp that underwent painful internationalization, and in that case we ended up using unicode strings everywhere anyways.


The % didn't use to work fine. .iteritems() was made for no good reason.

Python 3 could have required that all strings begin with u" or b", but they didn't - they did something which encouraged breakage.


Six was available for years (2011) before Mercurial even started porting (2015).

https://github.com/benjaminp/six/graphs/contributors


That was part of the "discusses this at length". Part of the relevant discussion is:

> So I'm not sure six would have saved enough effort to justify the baggage of integrating a 3rd party package into Mercurial. (When Mercurial accepts a 3rd party package, downstream packagers like Debian get all hot and bothered and end up making questionable patches to our source code. So we prefer to minimize the surface area for problems by minimizing dependencies on 3rd party packages.)


> Perhaps my least favorite feature of Python 3 is its insistence that the world is Unicode. [..] However, the approach of assuming the world is Unicode is flat out wrong and has significant implications for systems level applications (like version control tools).

Isn't this more a problem with Python not easily differentiating between String and Byte types? Both Go and Rust ("""systems""" level languages) have decided that "utf-8 ought to be enough for anybody" and that seems to be a good decision.


Yes, but that insistence that Bytes and Unicode are two different things that Shall Not Be Mixed was mostly a Python 3-ism. Python 2 had different types but you could be sloppy and it would kinda work out.

There was this assumption that Unicode code points were the correct single unit to talk about Unicode. You iterate over code points, you talk about string lengths in terms of code points, you slice in terms of code points. Much like the infamy of 16-bit Unicode, this is an assumption that has kinda gotten worse over time. Now we can and do want to talk about bytes, code points, and newer sets like extended grapheme clusters. I think this is probably the big failing of Python 3's Unicode model. Making a string type operate on extended grapheme clusters might fix it, but we'd be in for the same sort of pain, and the flexibility of "everything is bytes, we can iterate over it differently" of Go and Rust is much nicer in comparison.

The second thing was this assumption that everything remotely looking like text was Unicode, despite this maybe not being true. HTTP has parts that look like plain text, like "GET" and "POST" and the headers like "Content-Type: text/html". But the correct way to view this is as ASCII bytes, and no other encoding makes sense; binary data intermixed with "plain text" definitely happens, and the need to pick and choose between either Unicode or Bytes caused major damage in the standard library which still persists to this day -- some parts definitely chose the wrong side. Take a look at the craziness in the "zipfile" module for one other example. It's probably fixed now, but back then, I basically had to rewrite it from scratch in one of my other projects.

They eventually relented and added back a lot of the conveniences to blur the line between bytes and unicode again, like adding the % formatting operator for bytes, which I think shows that their insistence on separating the two didn't really pan out in practice. And yet, migration is still a pain.
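
For instance, %-formatting for bytes came back in Python 3.5 (PEP 461), so this works again:

  request = b"GET %s HTTP/1.1\r\n" % b"/index.html"
  print(request)  # b'GET /index.html HTTP/1.1\r\n'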


> Python 2 had different types but you could be sloppy and it would kinda work out.

It would "kinda work out", if your Unicode strings were ASCII in practice, and only then. Because whenever a Unicode and a non-Unicode string had to be combined, it used ASCII as the default encoding to converge them.

Which is to say, it only worked out for English input, and even then only until the point where you hit a foreign name, or something like "naïve". Then you'd suddenly get an exception - and it happened not at the point where the offending input was generated, but at the point where two strings happened to be combined.
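
An illustrative Python 2 session (the byte string here is the UTF-8 encoding of "naïve"):

  >>> u"Hello " + "world"          # both sides are ASCII, so the implicit coercion "works"
  u'Hello world'
  >>> u"Hello " + "na\xc3\xafve"   # non-ASCII bytes arrive, and it blows up far from their source
  Traceback (most recent call last):
    ...
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)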

This was a horrible state of affairs for basically everybody except the English speakers, because there was a lot of Python code out there that was written against and tested solely on inputs that wouldn't break it like that.

Intermixing binary data with text can be represented just fine in a type system where the two are different. For your HTTP example, the obvious answer is that the values that are fundamentally binary, like the method name or the headers, should be bytes, while the parts that have a known encoding should be str - there's nothing there that requires actually mixing them in a single value. In those very rare cases where you genuinely do have something like Unicode followed by binary followed by Unicode in a single value, that is trivially represented by a (str, bytes, str) tuple.

The problem with the Python stdlib isn't that bytes and Unicode are distinct. It's that it's overly strict about only accepting Unicode in some places where bytes should be legal, too. This is orthogonal to them being separate types.


> Because whenever a Unicode and a non-Unicode string had to be combined, it used ASCII as the default encoding to converge them.

They could have just changed the default encoding to utf8. (For those too lazy to configure their Python properly.)

There, problem solved - and no need for a breaking Python 3.


It would still be a mess any time you have to deal with byte strings that aren't UTF-8. The problem is with the implicit conversion itself - it shouldn't happen, because there's no way to properly guess the encoding. But there was no way to get rid of it without breaking things.


> But there was no way to get rid of it without breaking things.

Even such a breaking change would be a molehill compared to the mountain of breaking changes in Python 3.

Point is, they had one job, and they failed.


That change was at the heart of the breaking changes around strings in Python 3. If the conversions remained implicit, most people would probably have never even noticed that string literals default to Unicode, or that some library functions now require Unicode strings.


> There was this assumption that Unicode code points were the correct single unit to talk about Unicode.

The most messed-up thing about Python 3 is that it's supposed to be justified by doing Unicode right and they still got it wrong.

Having strings be sequences of Unicode code points is a super-bizarre design. That is, Python 3 strings indeed are semantically sequences of Unicode code points rather than sequences of Unicode scalar values. You can not only materialize lone surrogates (defensible for compatibility with UTF-16) but you can also materialize surrogate pairs in addition to actual astral characters. You still can't materialize units that are above the Unicode range, though, so it's not like C++'s std::u32string.
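
A small Python 3 illustration of that:

  >>> s = "\ud800"          # a lone surrogate: a valid code point, not a valid scalar value
  >>> len(s)
  1
  >>> s.encode("utf-8")     # yet it cannot be encoded as actual UTF-8
  Traceback (most recent call last):
    ...
  UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed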

Looking at the old PEPs, it appears to have arisen by accident rather than as an actual design.


I'm confused, there isn't an insistence that everything is unicode. Http headers are treated as bytes before you decode them, but you can totally decode an http request or response as ASCII. At least until you're interacting with a website that has unicode codepoints in its url.


I think the issue is with people being used to the python 2 approach, where the distinction was between str (bytes) and unicode. In python 3 you should not think of bytes vs unicode, you should think of text vs bytes, and you should use text for as long as possible.

BTW: I believe the http headers are supposed to be encoded using ISO-8859-1; it's essentially the same thing as US-ASCII, but it covers the entire byte range.
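
An illustrative example (the header name is made up): latin-1 maps every byte value 0-255 to the code point with the same number, so it round-trips arbitrary header bytes:

  >>> b"X-Custom: caf\xe9".decode("latin-1")
  'X-Custom: café'
  >>> "X-Custom: café".encode("latin-1")
  b'X-Custom: caf\xe9'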


> Yes, but that insistence that Bytes and Unicode are two different things that Shall Not Be Mixed was mostly a Python 3-ism

Go has string and []byte, and you can't mix them; you have to convert. Java has String, char[] and byte[], and similarly you need to cast. Rust has Bytes and String (I don't know Rust enough, but I'm pretty sure it doesn't do implicit conversion between them).

Also, Python 3 doesn't distinguish between Bytes and Unicode; Python 3 has a distinction between bytes and text (str - BTW: Guido actually expressed regret that he used "str" instead of "text", because it would be much clearer).

In Python 3 you don't have Unicode (as far as you should be concerned), you have text and bytes; how the text is stored internally is an implementation detail. If you need to write to a file or to the network, you encode the text using one of various encodings (the most popular is UTF-8) and you decode it back when reading.
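
For example:

  >>> "naïve".encode("utf-8")          # text -> bytes at the file/network boundary
  b'na\xc3\xafve'
  >>> b"na\xc3\xafve".decode("utf-8")  # bytes -> text when reading back
  'naïve'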


Go's string is guaranteed to be a series of bytes, not Unicode code points. I'm unsure about Java. Rust has a more complicated text model that I won't summarize in this post, but it's far better than Python 3's.

> In Python 3 you don't have Unicode (as far as you should be concerned), you have text and bytes

Python 3 strings store Unicode code points. When you iterate over a Python 3 str, you get back Unicode code points. As mentioned elsewhere, this is not a Unicode scalar value, and can include things like unpaired surrogates. This is also not an extended grapheme cluster, which is the current best-effort description as to what counts as a "single character".

So, you really do need to be concerned about what your strings contain. If you don't want people to care, don't give them the ability to iterate, slice, or index into str to retrieve Unicode code points, and leave them as opaque blobs, as some of those other languages do.
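
A small illustration of why that distinction matters (Python 3):

  >>> s = "cafe\u0301"      # "café" written with a combining acute accent
  >>> len(s)                # 5 code points, though it renders as 4 characters
  5
  >>> len("caf\xe9")        # the precomposed spelling is 4 code points
  4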


> Go's string is guaranteed to be a series of bytes, not Unicode code points. I'm unsure about Java. Rust has a more complicated text model that I won't summarize in this post, but it's far better than Python 3's.

Yes, but at this point you're arguing about implementation details. The idea is that if you use it as a string it is string, if you need bytes, you need to perform a conversion. It shouldn't be your concern how it is stored internally.

If we are going into Python internals, a string can be stored in multiple representations, from a basic C string to unicode code points. If you perform a conversion it will cache the result so it can be reused in other places. I don't remember the details, since I looked at the code a long time ago, but it isn't that simple.


I don't know how to explain it any simpler. Iterating over a str type in Python 3 enumerates Unicode code points. The length of a str type is the number of code points it contains. Reversing a str will reverse the Unicode code points it contains (not guaranteed to be a sane operation). Indexing into a str with foo[0] gives you back a str containing a single Unicode code point.

This is not an implementation detail, it is fundamental to how the str type in Python 3 operates. I have not talked at any point about the internal storage of this type, just the interface it publicly exposes.


This is called a leaky abstraction. I can't see how doing it this way is good behavior for a high-level language. If you index into a string you will always get something that can be invalid; at least in Python or Java you get code points.


Python 3 strs should not be iterated over, sure. Ban that in your linter, then you're in the same position you would be in Rust. It's a misfeature but it's still a detail.


Zipfile has always been a mess. I have no idea why, but its interfaces have been consistently poor from a usability perspective. This well before py3 was a factor.


The blog post talks about this a bit with respect to Rust, but we don't actually say that. We do make that the default, but we also give you the ability to get at the underlying things as well. There's a lot of interesting work here, actually, like WTF-8...


In the wild WTF-8 and its 16-bit equivalent show up more often than you'd expect. I ended up discovering a case recently where part of the .NET executable file format is actually encoding strings as WTF-16 (not UTF-16) and any internal lowering needs to map them to WTF-8 instead of UTF-8. Until that point I had expected to only ever encounter WTF-8 in web browsers!


> Both Go and Rust ("""systems""" level languages) have decided that "utf-8 ought to be enough for anybody" and that seems to be a good decision.

When working with e.g. filepaths, Rust has an OsStr type.


A go string is just a sequence of bytes, which is usually/by convention utf8. But you can store anything you want in there, if necessary.


I would say that it is just shitty design to not differentiate between bytestrings and regular strings in a way that causes problems. The biggest design flaw here was not forcing people to understand the difference in python2.

