Second, having filenames that are not valid Unicode text (even if Python 3 has a way of handling them and round-tripping them) is going to cause you a lot of pain. No one who has thought through all the issues thinks it's a good idea. The modern computing world uses Unicode text all over the place. Filenames are manipulated by humans and we deal with them as text.
The idea that 8-bit byte strings are the ideal way to deal with text is a dead end. I expect we are going to see more of these kinds of articles now that Python 2's EOL is coming. In retrospect, you could argue that Python 3 should store Unicode text in memory as UTF-8. However, at the time those decisions were made, UTF-8 was not as dominant as it is now.
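For anyone who hasn't seen the round-tripping mentioned above, here is a minimal sketch of the `surrogateescape` error handler from PEP 383. The byte string is a made-up example, and the codec is passed explicitly so the result doesn't depend on the platform's filesystem encoding.

```python
# A made-up filename that is legal on POSIX but is not valid UTF-8
# (the byte 0xFF never occurs in well-formed UTF-8).
raw = b"report-\xff.txt"

# Strict decoding rejects it.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles the bad byte through as a lone surrogate (U+DCFF)...
name = raw.decode("utf-8", errors="surrogateescape")

# ...and encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", errors="surrogateescape")

# The catch: the str is not valid Unicode text, so a strict encode fails.
try:
    name.encode("utf-8")
    strictly_encodable = True
except UnicodeEncodeError:
    strictly_encodable = False
```

So the bytes survive the trip, but anything that encodes the name strictly along the way can still blow up.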
1. The Python standard library itself does not adhere to PEP 383. (Which is not likely to be fixed if everyone has such a dismissive attitude.)
2. Operating systems do not enforce valid UTF-8 on filenames. (This is unlikely to change sooner than in a few decades, if at all.)
macOS does. In fact, it goes a step farther and normalizes filenames (NFD).
I'm not aware of any serious problems this has caused. It turns out that -- to a reasonable approximation -- nobody actually writes software which puts arbitrary binary data in filenames.
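To illustrate what NFD normalization means for a filename, here is a small sketch using the standard `unicodedata` module. The string "café" is just an example; the point is that the same visible name can have two different code point sequences.

```python
import unicodedata

# "café" with 'é' as a single precomposed code point (U+00E9).
composed = "caf\u00e9"

# NFD splits it into 'e' followed by a combining acute accent (U+0301).
decomposed = unicodedata.normalize("NFD", composed)

# The two strings render the same but compare unequal and differ in length.
same_text = composed == decomposed
lengths = (len(composed), len(decomposed))

# Their UTF-8 encodings differ too, which is what a byte-oriented filesystem sees.
bytes_differ = composed.encode("utf-8") != decomposed.encode("utf-8")

# Normalizing both sides to one form before comparing makes them match again.
equal_after_normalizing = unicodedata.normalize("NFC", decomposed) == composed
```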
I've encountered errors with nodejs imports where getting filename casing wrong didn't matter on OSX (dev environment) but caused errors in ci/prod (Linux). Never gave it much thought after fixing, but this reminded me.
Should they? There is no difference from a file system perspective. We'd still run into problems even if they did: the line feed is a valid UTF-8 character and is one of the characters with special meaning in many programs.
Dealing with file names properly is a chore even in bash.
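A quick sketch of why the line feed is the troublesome one. The names below are invented; the example only shows that a newline-separated listing cannot be parsed back reliably, while a NUL-separated one can, since NUL is one of the two bytes a POSIX filename may not contain.

```python
# Two made-up filenames; the second contains a perfectly legal line feed.
names = ["notes.txt", "evil\nname.txt"]

# A newline-delimited listing is ambiguous...
newline_listing = "\n".join(names)
parsed_by_newline = newline_listing.split("\n")  # yields three entries, not two

# ...while NUL cannot appear inside a POSIX filename, so it is a safe separator.
nul_listing = "\0".join(names)
parsed_by_nul = nul_listing.split("\0")  # yields the original two names
```

This is the same reason tools offer NUL-delimited modes such as `find -print0` and `xargs -0`.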
They could ban newlines from file names too. See this proposal: http://austingroupbugs.net/view.php?id=251
My own opinion is more like - it is tied to C paradigm (where you don't have any reliable runtime typing/encoding info, everything is just a bag of bytes), and it can't really be solved without moving on to something else. Not sure if that ever happens.
Article author’s clickbait-y hysterics are a ridiculous overreaction to the usual IO hassles when working with outside data in arbitrary legacy encodings. But Python 3.1 was released a decade ago and we’re now up to 3.8, so it’s not like folks who still deal with legacy data haven’t had plenty of time to file tickets and patches on Python’s famously grotty stdlibs. e.g. Adding optional `path_encoding` and `data_encoding` parameters to `ZipFile()`, similar to `open(f,encoding='…')`, would easily address the described problem, no Sturm und Drang required.
Or heck, in the time it took him to write his post, he could’ve easily knocked out a Python 2 script that rewrites all his legacy .zip files to use modern UTF-8 instead. But some folks just prefer complaining, I guess.
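For what it's worth, the name-decoding half of such a fix is small. A rough sketch, assuming the archive's names were written in a known legacy code page: Python 3's `zipfile` decodes member names as cp437 when the UTF-8 flag (bit 11 of the general-purpose flags) is not set, so the raw bytes can be recovered by encoding back to cp437 and decoding with the real encoding. The helper name and the choice of cp866 are illustrative assumptions, not part of any library.

```python
UTF8_FLAG = 0x800  # general-purpose bit 11: member name is stored as UTF-8


def recover_member_name(name, flag_bits, legacy_encoding):
    """Hypothetical helper: undo zipfile's cp437 fallback for a legacy name."""
    if flag_bits & UTF8_FLAG:
        # The name was stored as UTF-8 and is already decoded correctly.
        return name
    # cp437 maps all 256 byte values, so this recovers the raw bytes exactly.
    return name.encode("cp437").decode(legacy_encoding)


# Simulate what zipfile would report for a Cyrillic name stored as cp866.
original = "\u0444\u0430\u0439\u043b.txt"  # Russian for "file", plus ".txt"
as_seen_by_zipfile = original.encode("cp866").decode("cp437")

recovered = recover_member_name(as_seen_by_zipfile, 0, "cp866")
```

With a real archive you would pass `info.filename` and `info.flag_bits` from each `ZipInfo` entry. You still have to know or guess the legacy encoding, since the zip format does not record it.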
(Honestly, from the HN post title I thought it was going to be about something Python 3 really did fuck up, like the 2-to-3 migration which has been an absolute sucking swamp of unnecessary complexity and make-work for years, and a fine demonstration of how not to manage lifecycle.)
But it's been:
a) a smashing success in comparison with Perl5 => Perl6, and
b) one doubts that a future Python3 => Python4 transition would go significantly better.
Contrast the upgrade cycle of Apple’s Swift. Same amount of user howling, of course, yet Swift has moved forward by 4 breaking releases in the time it’s taken Python to do 1. They designed their upgrade process; tooled, timelined, and taught it appropriately; and ran it to schedule.
With Python there was no clear process, no clear schedule, no clear endpoint. They cared about the code but the logistics were an afterthought. They designed, implemented, and tested all their language changes; they should have designed, implemented, and tested their rollout procedures with equal rigor.
Not sure what this would imply for a future Python 3 -> Python 4 transition.
As far as it goes, no reasonable person thinks that even having spaces in filenames is a good idea. Nonetheless, POSIX is what it is, and if you're writing a program meant to be generally useful, you can't just handle the POSIX filenames that you like--you have to handle them all.
There are many consequences for failing this, but the most obvious ones are buggy behaviour in general, and security holes in particular. This cannot be ignored or wished away.