The Disaster of Python 3 (complete.org)
72 points by goranmoomin on Nov 22, 2019 | 49 comments

I suggest taking this article with a grain of salt. The assumption seems to be that it's totally fine that Linux can have filenames that are arbitrary byte strings and that don't convert to valid Unicode text. First, Python 3 has a good way to deal with those. See PEP 383: Non-decodable Bytes in System Character Interfaces.
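
For the curious, a minimal sketch of what PEP 383's `surrogateescape` handler does (the filename byte here is just an illustrative example):

```python
# A POSIX filename containing a byte that is not valid UTF-8:
raw = b"caf\xe9.txt"  # 0xE9 is Latin-1 'é', invalid as standalone UTF-8

# PEP 383: undecodable bytes become lone surrogates in U+DC80..U+DCFF.
name = raw.decode("utf-8", errors="surrogateescape")
assert name == "caf\udce9.txt"

# Encoding with the same handler round-trips the original bytes losslessly.
assert name.encode("utf-8", errors="surrogateescape") == raw
```

This is the mechanism behind `os.fsdecode()`/`os.fsencode()` on POSIX systems with a UTF-8 locale.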

Second, having filenames that are not valid Unicode text (even if Python 3 has a way of handling them and round-tripping them) is going to cause you a lot of pain. No one who has thought through all the issues thinks it's a good idea. The modern computing world uses Unicode text all over the place. Filenames are manipulated by humans and we deal with them as text.

The idea that 8-bit byte strings are the ideal way to deal with text is a dead end. I expect we are going to see more of these kinds of articles now that Python 2's EOL is coming. In retrospect, you could argue that Python 3 should store Unicode text in memory as UTF-8. However, at the time those decisions were made, UTF-8 was not as dominant as it is now.

So just handwave it and it eventually goes away? Nope, as long as:

1. Python standard library itself does not adhere to PEP 383. (Which is not likely to be fixed if everyone has such dismissive attitude.)


2. Operating systems do not enforce valid UTF-8 on filenames. (This is unlikely to change sooner than in a few decades, if at all.)

> Operating systems do not enforce valid UTF-8 on filenames.

macOS does. In fact, it goes a step farther and normalizes filenames (NFD).

I'm not aware of any serious problems this has caused. It turns out that -- to a reasonable approximation -- nobody actually writes software which puts arbitrary binary data in filenames.

Does this also involve case sensitivity?

I've encountered errors with nodejs imports where getting the filename casing wrong didn't matter on OSX (the dev environment) but caused errors in CI/prod (Linux). Never gave it much thought after fixing it, but this reminded me of it.

Not precisely. File systems are case-insensitive by default on macOS, but this can be changed on a FS-by-FS basis. Unicode normalization is -- as far as I'm aware -- mandatory.
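
To illustrate what NFD normalization means in practice (a sketch of the Unicode behavior, not of any particular macOS filesystem API):

```python
import unicodedata

nfc = "caf\u00e9"  # 'é' as a single precomposed code point (NFC form)
nfd = unicodedata.normalize("NFD", nfc)  # 'e' plus a combining acute accent

assert nfc != nfd                    # different code point sequences...
assert len(nfc) == 4 and len(nfd) == 5
assert unicodedata.normalize("NFC", nfd) == nfc  # ...but the same text
```

A filename written in NFC form can come back from the filesystem in NFD form, so naive string equality on filenames can fail unless you normalize both sides first.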

> Operating systems do not enforce valid UTF-8 on filenames.

Should they? There is no difference from a file system perspective. We'd still run into problems even if they did: the line feed is a valid UTF-8 character and is one of the characters with special meaning in many programs.

Dealing with file names properly is a chore even in bash.


> We'd still run into problems even if they did: the line feed is a valid UTF-8 character and is one of the characters with special meaning in many programs.

They could ban new line from file names too. See this proposal: http://austingroupbugs.net/view.php?id=251

It looks to me like the OP thinks we should, "because it's a pain otherwise".

My own opinion is more like - it's tied to the C paradigm (where you don't have any reliable runtime typing/encoding info; everything is just a bag of bytes), and it can't really be solved without moving on to something else. Not sure if that will ever happen.

Python 3’s Unicode support has its issues, e.g. try using `len()` on a string containing surrogate pairs. But it generally works and it’s internally consistent, which is far more than can be said for Python 2’s unholy shambles where str and unicode could be unsafely mixed for all kinds of exciting data corruption and errors. Python 2-to-3 was a move in the right direction. I switched early, and the number of encoding-related problems I’ve had in my Python 3 code has been a fraction of those under Python 2.
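
A quick illustration of that `len()` edge case (the specific characters here are just examples):

```python
# An astral (non-BMP) character is a single code point in Python 3 (PEP 393):
emoji = "\U0001F600"
assert len(emoji) == 1

# But lone surrogates can still sneak into a str (e.g. via surrogateescape
# or sloppy JSON), and then len() counts them individually:
pair = "\ud83d\ude00"
assert len(pair) == 2
assert pair != emoji

# Such strings also cannot be encoded as strict UTF-8:
try:
    pair.encode("utf-8")
except UnicodeEncodeError:
    pass  # lone surrogates are not representable in well-formed UTF-8
```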

Article author’s clickbait-y hysterics are a ridiculous overreaction to the usual IO hassles when working with outside data in arbitrary legacy encodings. But Python 3.1 was released a decade ago and we’re now up to 3.8, so it’s not like folks who still deal with legacy data haven’t had plenty time to file tickets and patches on Python’s famously grotty stdlibs. e.g. Adding optional `path_encoding` and `data_encoding` parameters to `ZipFile()`, similar to `open(f,encoding='…')`, would easily address the described problem, no sturm-und-drang required.
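
In the meantime, the original bytes of a legacy member name can already be recovered, because `zipfile` decodes non-UTF-8 names as cp437, which maps every byte to some character. A sketch (`original_name_bytes` is a hypothetical helper name, not a stdlib API):

```python
import io
import zipfile

def original_name_bytes(info: zipfile.ZipInfo) -> bytes:
    """Recover the raw name bytes stored in the archive."""
    if info.flag_bits & 0x800:  # bit 11: the archiver declared UTF-8 names
        return info.filename.encode("utf-8")
    # Otherwise zipfile decoded the name as cp437; encoding it back
    # reproduces the original bytes exactly, whatever encoding they were.
    return info.filename.encode("cp437")

# Demo with an in-memory archive:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("hello.txt", b"hi")
with zipfile.ZipFile(buf) as z:
    info = z.infolist()[0]
    assert original_name_bytes(info) == b"hello.txt"
```

From those bytes you can then decode with whatever legacy encoding the archive actually used.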

Or heck, in the time it took him to write his post, he could’ve easily knocked out a Python 2 script that rewrites all his legacy .zip files to use modern UTF8 instead. But some folks just prefer complaining, I guess.

Obligatory: https://i.ytimg.com/vi/tJ-LivK4-78/hqdefault.jpg

(Honestly, from the HN post title I thought it was going to be about something Python 3 really did fuck up, like the 2-to-3 migration, which has been an absolute sucking swamp of unnecessary complexity and make-work for years, and a fine demonstration of how not to manage lifecycle.)

It's easy to poke at taking a decade to do a major upgrade to a tool.

But it's been:

a) a smashing success in comparison with Perl5 => Perl6, and

b) one doubts that a future Python3 => Python4 transition would go significantly better.

“Didn’t fuck it up quite as badly as Perl” is hardly a robust defense.

Contrast the upgrade cycle of Apple’s Swift. Same amount of user howling, of course, yet Swift has moved forward by 4 breaking releases in the time it’s taken Python to do 1. They designed their upgrade process; tooled, timelined, and taught it appropriately; and ran it to schedule.

With Python there was no clear process, no clear schedule, no clear endpoint. They cared about the code but the logistics were an afterthought. They designed, implemented, and tested all their language changes; they should have designed, implemented, and tested their rollout procedures with equal rigor.

In the end, there was no transition of Perl 5 to Perl 6. Recently it was decided to rename Perl 6 to Raku, freeing up both languages from decades long entanglement. https://raku.org

Not sure what this would imply for a future Python 3 -> Python 4 transition.

Doing something like offloading the GIL could be a bridge too far, and trigger a new language?

> No one who has thought through all the issues thinks its a good idea.

As far as it goes, no reasonable person thinks that even having spaces in filenames is a good idea. Nonetheless, POSIX is what it is, and if you're writing a program meant to be generally useful, you can't just handle the POSIX filenames that you like--you have to handle them all.

There are many consequences for failing this, but the most obvious ones are buggy behaviour in general, and security holes in particular. This cannot be ignored or wished away.

As always, the one true link is this:


Read, understand, and be glad that the interpreter isn't trying to "help" any more!

I just want to chime in and say that Ned Batchelder is one of the greatest human beings alive.

Or, at least, to me he is. This man helps so many people with nothing in return. He's a regular on some python IRC channels, he has personally helped me so much. He makes difficult concepts easy to understand. I encourage everyone to watch his pycon talks. Start with his talk on loops: https://www.youtube.com/watch?v=EnSu9hHGq5o

There's a few outstanding people that make me love the Python ecosystem, Nedbat is definitely in that list.

FunkyBob (Curtis Maloney) is also a staple with Django, he's spent years and years helping people and asking for nothing in return.

I'll throw in Raymond Hettinger & Jack Diederich -- excellent peeps.

Agreed. I run into Ned every 3-4 years, he always remembers me, and he's super nice and helpful.

The problem isn't that there's no implicit conversion between bytes and unicode anymore. The problem is that almost no code beyond the lowest level interfaces is correctly handling it when you need to use bytes. In this case, filepaths (which are bytes in POSIX, and no, using LOCALE is not good enough) don't work. Other examples of things that are bytes:

- Command line arguments

- Environment variables

- stdin/stdout/stderr

- Files in general

- Many expected-to-be-human-readable fields of popular network protocols

In short, there are many situations where you want to treat a bytestring as a string, not as an array of integers.

If bytes in python 3 had acted like str in python 2 (except for the implicit conversions / comparisons with unicode strings), the situation would be a lot better. As it is, they feel like a second-class citizen designed to discourage use, and as a result are unsupported in most libraries that NEED to support them.

(edited for formatting)

Python 3 making 'bytes' an array of integers was a minor mistake, IMHO. I.e. b'abc'[0] should be b'a', not 97. That change made it harder to port code and also makes the bytes() object a bit unwieldy to use. Much too late to change that now.

However, as someone who has done a mixture of low level (e.g. system tools), high level (e.g. web apps) and network protocol programming, the Python 3 bytes/str model works well. If you really want to treat an 8-bit byte string as a string, you can always decode it as "latin1". In my modern Python 3 code, I don't find a good reason to ever do that anymore.
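
The latin1 trick works because that codec maps the 256 byte values one-to-one onto U+0000..U+00FF, so arbitrary binary survives the round trip:

```python
# latin-1 can smuggle any byte sequence through str and back without loss:
raw = bytes(range(256))
text = raw.decode("latin-1")
assert text.encode("latin-1") == raw
assert len(text) == 256  # one character per byte, always
```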

> Read, understand, and be glad that the interpreter isn't trying to "help" any more!

As someone who thinks the Python 3 behaviour is largely the right behaviour (I'm unconvinced the solution used for filenames on POSIX is the right one, nor that assuming stdin/out isn't arbitrary bytes is the right choice), I still have a lot of issues with the Python 2 -> 3 migration. (Note I haven't read the article because it won't load here, nor on archive.is.)

As someone who has dealt with a fair number of codebases migrating over the past decade, I would like to have seen a clearer migration path. The route taken basically asked developers to go from:

    def foo(x):
        return x == b"a"
    print foo(u"a")
The fact this went from printing True (Python 2) to False (Python 3) without there ever being any way to know your codebase was doing this, unless you had tests for all such codepaths, meant it was hard to have confidence behaviour was maintained after porting (and I've worked with enough projects that have used Python for scripting without extensive testing of the scripts, often because they're largely doing I/O).

If Python 2.6/7 had a mode like -b (which warns when bytearray and unicode are compared) that warned when str/unicode are compared, that would already have been a big improvement for the migration path. As it is, people have written tools that do this (unicode-nazi), but then you quickly run into the fact that the Python 2 stdlib does this all the time, making it hard to just try and resolve such comparisons within a Python 2 codebase. (Note Python 3's -b does warn for bytes/str!)

Now, at the same time as the behaviour of u"a" == u"b" changed, Python also changed the return type of (e.g.) os.listdir(). This means if you want to compare a different list loaded from elsewhere, you need to have that list in different types depending on whether you're running on Python 2 or Python 3. In a dynamically typed language, it's hard to make all these changes with confidence that you're actually fixing everywhere.
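
For reference, Python 3's os.listdir() mirrors the type of its argument, which is exactly the split the commenter describes (a small self-contained sketch):

```python
import os
import tempfile

# The return type of os.listdir() follows the argument's type:
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "a.txt"), "w").close()
    str_names = os.listdir(d)                     # str in, str out
    bytes_names = os.listdir(os.fsencode(d))      # bytes in, bytes out

assert str_names == ["a.txt"]
assert bytes_names == [b"a.txt"]
```

So any list you compare against must be kept in the matching type, per interpreter version.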

> If Python 2.6/7 had a mode like -b...

Agreed. This was a major mistake in the migration story for Py3. They bet on the static translation approach of 2to3, which is just inappropriate for a dynamic language like Python. Better to have doubled down on Python's dynamism by adding modes to the interpreter to suss out the code that wouldn't work on Py3.

Nah, even with tooling the sloppy-2-to-tighter-3 migration was always going to be a semi-manual job. The right thing to do was embrace that transition: do it early, once, and never touch it again. Get everyone developing their code solely in Python 3, and provide fully automated 3-to-2 conversion for those who still need to deploy on Python 2. (Ideally as part of the module packaging, so that all Python packages automatically support 2 and 3.)

Instead of which, we got a lot of “write-once, run-everywhere” nonsense, with everyone vying to bend their code out of shape in the most creatively unproductive ways possible. Absolutely ridiculous makework, and the Python community should’ve called itself on it. Unfortunately, the geeks love a challenge, far more than being told when they’re having a brain fart. Oh well, at least that whole shambles has finally just about run its course; here’s hoping its lessons are learned for Python 4.:)

If I had to port a code base from 2 to 3 today, I think the first thing I would do is add type annotations to the code base via typing/mypy and go from there. A fully typed Python 2 code base shouldn't be too difficult to port with the help of mypy and proper editor support.

Someone in the past must have invented a wide encoding unlike the popular transition format. I wonder if using that for some use cases would have been less pain because it would remove the whack-a-mole symptom: nothing would work at all until you did all the work to consume/transcode as appropriate.

I suppose it's moot at this point.

The problem, which is always the problem, is that like 12 someones invented different wide encodings at many times in the past.

The result is the chaos of the present day.

I don't think today's chaos is related to other wide encodings (those are probably very rarely used). Today's chaos is like Batchelder describes, but I'm suggesting that some of that is due to the ambiguity of the encoding: is this data I'm consuming iso-8859-x or is it utf-8? It's this ambiguity that contributes to the whack-a-mole (and this is a big part of the chaos IMO).

That said, would anyone have been interested in a totally new encoding? For European languages which use mostly the same 26 latin characters with occasional diacritics and accents, UTF-8-with-incompatible-consumer degrades into occasional unreadable characters. But if your out-of-date browser or application gave you a "cannot decode this encoding" error, that might have caused a whole lot of pain during that transition. Not to mention that some of the same issues with OS/filesystem/language library interaction would probably remain.

Okay, another flamebait article. I'll byte, uh no, bite.

> Inconsistencies in Types

There's just no such thing as a builtin character type in Python. No, that's not new to Python 3. '/' is a str. b'/' is a bytes. Indexing str gives you a str because there's no such thing as a character type, and introducing it would be pointless. Indexing bytes gives you a byte (an int) instead of a bytes because if you're working with a raw byte sequence you probably want to access the bytes individually? If this wasn't the case people working with raw byte sequences would probably be even more displeased during the transition period.
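
Concretely, the indexing behavior being described:

```python
s = "abc"
b = b"abc"
assert s[0] == "a"     # indexing str yields a length-1 str (no char type)
assert b[0] == 97      # indexing bytes yields an int
assert b[0:1] == b"a"  # slicing bytes yields bytes, like Python 2's str
```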

"But I can't port the code by prepending b to every string literal!" Sorry, that's not the correct way to port.

> Bugs in the standard library

Yeah, those might be Unicode-related bugs in the Python standard library. They were users' problems in the Python 2 era; now the Python core team shoulders the burden. Instead of every single programmer making Unicode errors in their own code, if you find an error in the standard library you can fix it once and for all.

This problem in Python 3 is not limited to OS file names, that’s just one way to get invalid Unicode data. But invalid data happens all the time when working with real data. The Python 3 string design requires that all strings must be valid Unicode or Python will raise an error. This is a really unfortunate property that has bitten every single data scientist I know who uses Python 3. At some point, often hours or days into a long, expensive computation, one of their programs has suddenly encountered just a single invalid byte and crashed, costing them days of time and work. The only recourse for writing robust programs that can gracefully and correctly handle invalid data is not to use strings, which, frankly, makes the string type seem pretty useless.
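
(The usual Python 3 mitigation, for what it's worth, is an explicit errors handler on open(); the file contents here are an invented example:)

```python
import os
import tempfile

# A "mostly UTF-8" file with one stray byte, the kind that aborts a long job:
payload = b"row1,ok\nrow2,\xffbad\n"
fd, path = tempfile.mkstemp()
os.write(fd, payload)
os.close(fd)

# surrogateescape turns the bad byte into U+DCFF instead of raising:
with open(path, encoding="utf-8", errors="surrogateescape") as f:
    lines = f.readlines()

assert len(lines) == 2
assert "\udcff" in lines[1]  # the invalid byte survives, round-trippable
os.remove(path)
```

Of course this only postpones the problem until the surrogate hits code that insists on valid Unicode, which is arguably the commenter's point.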

The Python 3 string design also necessitates scanning and often transcoding every piece of string data that it encounters, both on the way in and again on the way out. That means that not only is the string type inappropriate for any data that might not be valid Unicode, it is also inappropriate for any data that might be large.

I’ve been meaning to write a blog post about how Julia handles strings, but haven’t yet gotten around to it. Among other benefits:

- You can process any data as strings and characters, whether it’s valid Unicode or not.

- If you read any data as strings or characters and write it back out, you get the exact same data back, no matter what it is, valid or not.

- Invalid characters are parsed according to the Unicode 10 spec.

- You only get an error if you actually ask for the code point of an invalid character, which is a fairly rare operation and must error since there is no correct answer.

- The standard library generally handles invalid Unicode gracefully.

- You can use strings for large data: there’s no need to look at, let alone transcode string data—if you don’t need to access something no work is required.

It's open source. If you are having so much trouble with your ultra niche filename use case, just open a pr and clean up some of the code.

This is a nostalgic article, as underlined in the closing section about XON/XOFF and mainframe-compatible escape sequences.

The world is moving on, and while historic systems are beautiful (I still have a 2.11 BSD emulator running - or rather runnable - somewhere), at some point you need to weigh the breakage for legacy users against the cost of maintaining the compatibility.

Indeed, POSIX is still mandating that filenames are arbitrary byte sequences. But it is just becoming impractical, and in the end it's up to whoever has the motivation to have it working to keep it working, and if there's not enough people with this motivation it's just going to inevitably rot.

It's likely that 10 years from now, anything non-Unicode will be completely broken on modern (desktop, at least) systems and perhaps Linux even gets an opt-in mount option for enforcing filenames to be utf-8-compatible (which may change to opt-out another 10 years on, just as POSIX is going to evolve too in this regard).

Yes, it's a pity, and I likely still have some ISO-8859-2 files from 1999 on my filesystem. But I think it's unreasonable for anyone to waste time on that support. And I wouldn't advise anyone to waste an extra 20 hours of their developer life building things around ncurses instead of a more direct approach - build a cool feature in that time instead!

Not a disaster, been loving it at least five years. Still, a few niche bugs could be fixed, why not?

In the meantime as a workaround, make fs links with ascii names and/or subclass ZipFile. I had to monkey patch a Py2 stdlib module once to fix it for a year or so until it was fixed. Probably httplib if memory serves.

I just don’t see the problem here - most of the piece completely ignores documented ways to deal with encodings.

For instance, I export PYTHONIOENCODING=UTF_8:replace in some machines where I know the default locale and terminal settings might cause problems with logging.

Edit: premature posting from mobile

The site seems to be having some trouble keeping up; here's an archive/cache/mirror: https://archive.is/efTT9

The string encoding is actually the best part of python 3. There's a large number of small feature regressions that really irritate me, like the removal of comparators, and gratuitous changes like the removal of print statements and moving shit around without providing aliases. But the bytes/str distinction is actually really useful for anybody who uses unicode, which is everybody.

If Python 3 had just made that change and no other breaking changes, the transition would have been much faster and the value proposition much clearer.

I’m fascinated by the idea that someone is using an IBM 3151 terminal in 2019. Other than for nostalgia, should not those have been retired about a decade ago?

I bought it off eBay a few weeks ago.

As to why - sitting in front of emacs with a clicky model M keyboard produces a very different frame of mind. I am more focused and more deliberate in what I type (one doesn't just type ls /usr/bin on such a thing). Although it's by no means my primary computing device, I do find myself going down there for at least a little while on most days. It is a pleasant break, a change of scenery, a different mental state.

I got it, and my vt420 and vt510, after thinking about the bifurcated nature of computing history. Although I started with computers in the 80s, it was the PC side of things. The Unix/"big iron" simply wasn't accessible to many in those days. I have spent decades doing work day in, day out in what amounts to a fancy vt510 emulator (xterm). I wanted to use the real thing. Also it got my son to play zork with me.

I wrote about it here: https://changelog.complete.org/archives/10013-connecting-a-p...

and here: https://changelog.complete.org/archives/10031-resurrecting-a...

Normally when I see a headline of this sort, I expect I will find another over-enthusiastic bombastic smoke-and-mirrors hit-piece.

The fact that this article hits home and is right scares me a tiny bit. For example:

"I should note that a simple open(b"foo\x7f.txt", "w") works. The lowest-level calls are smart enough to handle this, but the ecosystem built atop them is uneven at best."

Oh Crap...

I do not understand why the unicode type is needed inside the program. Why can't you treat everything as bytes? It's not like you can't concatenate two byte strings!

If these bytes mean something or something else, this is a concern for the user of the program that feeds it these bytes. The program itself could be oblivious to that.

Title is overblown. At best this might be "disaster" for string/bytes types but many would argue even that is not the case.

This one (Python 3) does not spark joy.

Somehow python, as a whole, started to feel like it needed too much boilerplate and special fiddling to do simple things. I felt like I spent way too much time keeping track of different environments or versions to make each project work, and was always dissatisfied.

The author's problem starts right at the beginning: he mentions that POSIX filenames consist of 8-bit bytes, but then uses a UTF-8 string as the example filename in the first code block.

Python 3 is a disaster for many other reasons, UTF-8 bugs is just one of them. So far I'm sticking with Tauthon[1] which seems to be the best of both worlds.

1.: https://github.com/naftaliharris/tauthon

I don't wish the Tauthon project ill but I suspect people are underestimating how much work goes into maintaining Python 3. If the Tauthon project scope was limited to taking Python 2.7.X and doing bug fix only releases of it, I think it could be a successful project. Since they seem to be backporting features from the 3.X branch, I don't see them keeping up. Python 3 has too many users at this point. If you look at the Tauthon commit log, it seems clear they are being left behind.

There is another problem with trying to backport selected Python 3 features. How do you decide what gets backported? New features will introduce incompatibility. Even if a feature is forward compatible, you end up with code that will run on Tauthon 2.X but not on Tauthon 2.X-1. If it is just a better Python 2, that's fine. When it is some 3rd kind of thing with a relatively tiny user base, who is going to use it?

> I suspect people are underestimating how much work goes into maintaining Python 3.

And I suspect people are vastly overestimating it. The latest additions to Python look like they were made by a committee. It is the committee-style work that eats up all the man hours. The programming itself is rather simple.

> Python 3 has too many users at this point.

Most were dragged along by force. Many people are happy that 2.x versions like Tauthon are still maintained. To all projects I'm personally involved in (mainly scientific) python 3 offers literally ZERO advantages and only causes additional costs.

> I don't see them keeping up.

> How do you decide what gets backported? New features will introduce incompatibility.

You gave your own answer. All features that don't introduce incompatibility are going to be backported. In that regard "keeping up" is also not the top priority.

Py3 is currently more performant than py2.

There are a number of py3 only features pushed for by the scientific community (@ is the most obvious, but there are others).

> Py3 is currently more performant than py2.

No it's not. It is a clear regression. The minuscule performance improvements stem from the enforcement of xrange() vs range(). But using xrange() on 2.x is still faster.

No, starting in 3.6 or 3.7, there are general performance improvements at the c level in the interpreter.
