Second, having filenames that are not valid Unicode text (even if Python 3 has a way of handling them and round-tripping them) is going to cause you a lot of pain. No one who has thought through all the issues thinks it's a good idea. The modern computing world uses Unicode text all over the place. Filenames are manipulated by humans and we deal with them as text.
The idea that 8-bit byte strings are the ideal way to deal with text is a dead end. I expect we are going to see more of these kinds of articles now that Python 2's EOL is coming. In retrospect, you could argue that Python 3 should store Unicode text in memory as UTF-8. However, at the time those decisions were made, UTF-8 was not as dominant as it is now.
1. The Python standard library itself does not adhere to PEP 383. (Which is not likely to be fixed if everyone has such a dismissive attitude.)
2. Operating systems do not enforce valid UTF-8 for filenames. (This is unlikely to change sooner than in a few decades, if at all.)
macOS does. In fact, it goes a step farther and normalizes filenames (NFD).
I'm not aware of any serious problems this has caused. It turns out that -- to a reasonable approximation -- nobody actually writes software which puts arbitrary binary data in filenames.
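The NFD normalization macOS applies is easy to see from Python's unicodedata module; "café" here is just an illustrative example:

```python
import unicodedata

nfc = "caf\u00e9"                        # 'café' with a single code point é (NFC)
nfd = unicodedata.normalize("NFD", nfc)  # 'e' followed by a combining acute accent

assert nfc != nfd                        # different code point sequences...
assert len(nfc) == 4 and len(nfd) == 5
assert unicodedata.normalize("NFC", nfd) == nfc  # ...but the same text once renormalized
```

Two byte-for-byte different filenames can therefore name the same text, which is exactly the kind of thing a normalizing filesystem resolves for you.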
I've encountered errors with nodejs imports where getting filename casing wrong didn't matter on OSX (dev environment) but caused errors in ci/prod (Linux). Never gave it much thought after fixing it, but this reminded me of it.
Should they? There is no difference from a file system perspective. We'd still run into problems even if they did: the line feed is a valid UTF-8 character and is one of the characters with special meaning in many programs.
Dealing with file names properly is a chore even in bash.
They could ban new line from file names too. See this proposal: http://austingroupbugs.net/view.php?id=251
My own opinion is more like - it is tied to C paradigm (where you don't have any reliable runtime typing/encoding info, everything is just a bag of bytes), and it can't really be solved without moving on to something else. Not sure if that ever happens.
Article author’s clickbait-y hysterics are a ridiculous overreaction to the usual IO hassles when working with outside data in arbitrary legacy encodings. But Python 3.1 was released a decade ago and we’re now up to 3.8, so it’s not like folks who still deal with legacy data haven’t had plenty of time to file tickets and patches on Python’s famously grotty stdlibs. e.g. Adding optional `path_encoding` and `data_encoding` parameters to `ZipFile()`, similar to `open(f, encoding='…')`, would easily address the described problem, no sturm-und-drang required.
Or heck, in the time it took him to write his post, he could’ve easily knocked out a Python 2 script that rewrites all his legacy .zip files to use modern UTF8 instead. But some folks just prefer complaining, I guess.
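For what it’s worth, such a rewriting script is short even in Python 3. The stdlib zipfile decodes legacy entry names as cp437 when the UTF-8 flag is absent, so recovering the intended name means re-encoding to cp437 and decoding with the real legacy encoding. A sketch (cp866 is just an illustrative assumption; substitute whatever encoding your archives actually used):

```python
import zipfile

def fix_name(name, legacy_encoding):
    """Undo zipfile's implicit cp437 decoding and apply the real legacy encoding."""
    return name.encode("cp437").decode(legacy_encoding)

def rewrite_zip(src_path, dst_path, legacy_encoding):
    """Copy a zip archive, fixing legacy entry names so the output uses UTF-8."""
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        for info in src.infolist():
            name = info.filename
            if not info.flag_bits & 0x800:  # UTF-8 flag absent: zipfile used cp437
                name = fix_name(name, legacy_encoding)
            # writestr sets the UTF-8 flag automatically for non-ASCII names
            dst.writestr(name, src.read(info))
```

This drops metadata like timestamps and permissions for brevity; a real script would copy the ZipInfo fields across too.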
(Honestly, from the HN post title I thought it was going to be about something Python 3 really did fuck up, like the 2-to-3 migration, which has been an absolute sucking swamp of unnecessary complexity and make-work for years, and a fine demonstration of how not to manage lifecycle.)
But:
a) it's been a smashing success in comparison with Perl 5 => Perl 6, and
b) one doubts that a future Python 3 => Python 4 transition would go significantly better.
Contrast the upgrade cycle of Apple’s Swift. Same amount of user howling, of course, yet Swift has moved forward by 4 breaking releases in the time it’s taken Python to do 1. They designed their upgrade process; tooled, timelined, and taught it appropriately; and ran it to schedule.
With Python there was no clear process, no clear schedule, no clear endpoint. They cared about the code but the logistics were an afterthought. They designed, implemented, and tested all their language changes; they should have designed, implemented, and tested their rollout procedures with equal rigor.
Not sure what this would imply for a future Python 3 -> Python 4 transition.
As far as it goes, no reasonable person thinks that even having spaces in filenames is a good idea. Nonetheless, POSIX is what it is, and if you're writing a program meant to be generally useful, you can't just handle the POSIX filenames that you like--you have to handle them all.
There are many consequences for failing this, but the most obvious ones are buggy behaviour in general, and security holes in particular. This cannot be ignored or wished away.
Read, understand, and be glad that the interpreter isn't trying to "help" any more!
Or, at least, to me he is. This man helps so many people with nothing in return. He's a regular on some python IRC channels, he has personally helped me so much. He makes difficult concepts easy to understand. I encourage everyone to watch his pycon talks. Start with his talk on loops: https://www.youtube.com/watch?v=EnSu9hHGq5o
FunkyBob (Curtis Maloney) is also a staple with Django, he's spent years and years helping people and asking for nothing in return.
- Command line arguments
- Environment variables
- Files in general
- Many expected-to-be-human-readable fields of popular network protocols
In short, there are many situations where you want to treat a bytestring as a string, not as an array of integers.
If bytes in python 3 had acted like str in python 2 (except for the implicit conversions / comparisons with unicode strings), the situation would be a lot better. As it is, they feel like a second-class citizen designed to discourage use, and as a result are unsupported in most libraries that NEED to support them.
(edited for formatting)
However, as someone who has done a mixture of low level (e.g. system tools), high level (e.g. web apps) and network protocol programming, the Python 3 bytes/str model works well. If you really want to treat an 8-bit byte string as a string, you can always decode it as "latin1". In my modern Python 3 code, I don't find a good reason to ever do that anymore.
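The latin1 trick works because Latin-1 maps each of the 256 byte values to the code point with the same number, so the round-trip is lossless:

```python
raw = bytes(range(256))              # every possible byte value
s = raw.decode("latin1")             # one code point per byte, never fails
assert s.encode("latin1") == raw     # lossless round-trip
assert all(ord(c) == b for c, b in zip(s, raw))  # code point == byte value
```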
As someone who thinks the Python 3 behaviour is largely the right behaviour (I'm unconvinced the solution used for filenames on POSIX is the right one, nor that assuming stdin/out isn't arbitrary bytes is the right choice), I still have a lot of issues with the Python 2 -> 3 migration. (Note I haven't read the article because it won't load here, nor on archive.is.)
As someone who has dealt with a fair number of codebases migrating over the past decade, I would like to have seen a clearer migration path. The route taken basically asked developers to go from:
return x == b"a"
If Python 2.6/7 had a mode like -b (which warns when bytearray and unicode are compared) that warned when str/unicode are compared, that would already have been a big improvement for the migration path. As it is, people have written tools that do this (unicode-nazi), but then you quickly run into the fact that the Python 2 stdlib does this all the time, making it hard to just try and resolve such comparisons within a Python 2 codebase. (Note Python 3's -b does warn for bytes/str!)
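Python 3 does have this today: -b warns on bytes/str comparisons and -bb turns the warning into an error. A sketch, spawning a child interpreter just to capture the behaviour:

```python
import subprocess
import sys

# By default, the comparison is silently False:
assert (b"a" == "a") is False

# With -bb, the same comparison raises BytesWarning:
result = subprocess.run(
    [sys.executable, "-bb", "-c", 'b"a" == "a"'],
    capture_output=True, text=True,
)
assert result.returncode != 0
assert "BytesWarning" in result.stderr
```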
Now, at the same time as the behaviour of u"a" == u"b" changed, Python also changed the return type of (e.g.) os.listdir(). This means if you want to compare a different list loaded from elsewhere, you need to have that list in different types depending on whether you're running on Python 2 or Python 3. In a dynamically typed language, it's hard to make all these changes with confidence that you're actually fixing everywhere.
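The type switch follows the argument type: pass os.listdir a str path and you get str names back; pass bytes and you get bytes. A small sketch:

```python
import os
import tempfile

d = tempfile.mkdtemp()
open(os.path.join(d, "a.txt"), "w").close()

assert all(isinstance(n, str) for n in os.listdir(d))                 # str in, str out
assert all(isinstance(n, bytes) for n in os.listdir(os.fsencode(d)))  # bytes in, bytes out
```

So any list you compare against listdir results has to match that type, which is exactly the migration hazard described above.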
Agreed. This was a major mistake in the migration story for Py3. They bet on the static translation approach of 2to3, which is just inappropriate for a dynamic language like Python. Better to have doubled down on Python's dynamism by adding modes to the interpreter to suss out the code that wouldn't work on Py3.
Instead of which, we got a lot of “write-once, run-everywhere” nonsense, with everyone vying to bend their code out of shape in the most creatively unproductive ways possible. Absolutely ridiculous makework, and the Python community should’ve called itself on it. Unfortunately, the geeks love a challenge, far more than being told when they’re having a brain fart. Oh well, at least that whole shambles has finally just about run its course; here’s hoping its lessons are learned for Python 4.:)
I suppose it's moot at this point.
The result is the chaos of the present day.
That said, would anyone have been interested in a totally new encoding? For European languages which use mostly the same 26 latin characters with occasional diacritics and accents, UTF-8-with-incompatible-consumer degrades into occasional unreadable characters. But if your out-of-date browser or application gave you a "cannot decode this encoding" error, that might have caused a whole lot of pain during that transition. Not to mention that some of the same issues with OS/filesystem/language library interaction would probably remain.
> Inconsistencies in Types
There's just no such thing as a builtin character type in Python. No, that's not new to Python 3. '/' is a str. b'/' is a bytes. Indexing str gives you a str because there's no such thing as a character type, and introducing it would be pointless. Indexing bytes gives you a byte (an int) instead of a bytes because if you're working with a raw byte sequence you probably want to access the bytes individually? If this wasn't the case people working with raw byte sequences would probably be even more displeased during the transition period.
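Concretely:

```python
s, b = "/", b"/"

assert isinstance(s[0], str) and s[0] == "/"   # indexing str yields a 1-char str
assert isinstance(b[0], int) and b[0] == 0x2F  # indexing bytes yields an int
assert b[0:1] == b"/"                          # slicing keeps the bytes type
```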
"But I can't port the code by prepending b to every string literal!" Sorry, that's not the correct way to port.
> Bugs in the standard library
Yeah, there might be Unicode-related bugs in the PSL. They were users' problems in the Python 2 era; now the Python core team shoulders the burden. Instead of every single programmer making Unicode errors in their own code, if you find an error in the PSL you can fix it once and for all.
The Python 3 string design also necessitates scanning and often transcoding every piece of string data that it encounters, both on the way in and again on the way out. That means that not only is the string type inappropriate for any data that might not be valid Unicode, it is also inappropriate for any data that might be large.
I’ve been meaning to write a blog post about how Julia handles strings, but haven’t yet gotten around to it. Among other benefits:
- You can process any data as strings and characters, whether it’s valid Unicode or not.
- If you read any data as strings or characters and write it back out, you get the exact same data back, no matter what it is, valid or not.
- Invalid characters are parsed according to the Unicode 10 spec.
- You only get an error if you actually ask for the code point of an invalid character, which is a fairly rare operation and must error since there is no correct answer.
- The standard library generally handles invalid Unicode gracefully.
- You can use strings for large data: there’s no need to look at, let alone transcode string data—if you don’t need to access something no work is required.
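Python's closest analogue is the surrogateescape error handler from PEP 383, which also round-trips arbitrary bytes exactly, though only when you opt in at each decode/encode:

```python
raw = b"ok\xff\xfe"  # not valid UTF-8
s = raw.decode("utf-8", errors="surrogateescape")
# the invalid bytes become lone surrogates, and encoding restores them exactly
assert s.encode("utf-8", errors="surrogateescape") == raw
```

The difference is that Julia makes this the default behaviour of its string type, rather than a mode you must remember to request.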
The world is moving on, and while historic systems are beautiful (I still have a 2.11 BSD emulator running - or rather runnable - somewhere), at some point you need to weigh the breakage for legacy users against the cost of maintaining the compatibility.
Indeed, POSIX is still mandating that filenames are arbitrary byte sequences. But it is just becoming impractical, and in the end it's up to whoever has the motivation to have it working to keep it working, and if there's not enough people with this motivation it's just going to inevitably rot.
It's likely that 10 years from now, anything non-Unicode will be completely broken on modern (desktop, at least) systems and perhaps Linux even gets an opt-in mount option for enforcing filenames to be utf-8-compatible (which may change to opt-out another 10 years on, just as POSIX is going to evolve too in this regard).
Yes, it's a pity, and I likely still have some ISO-8859-2 files from 1999 on my filesystem. But I think it's unreasonable for anyone to waste time on that support. And I wouldn't advise anyone to waste an extra 20 hours of their developer life on building things around ncurses instead of a more direct approach - build a cool feature in that time instead!
In the meantime as a workaround, make fs links with ascii names and/or subclass ZipFile. I had to monkey patch a Py2 stdlib module once to fix it for a year or so until it was fixed. Probably httplib if memory serves.
For instance, I export PYTHONIOENCODING=UTF_8:replace in some machines where I know the default locale and terminal settings might cause problems with logging.
Edit: premature posting from mobile
If Python 3 had just made that change and no other breaking changes, the transition would have been much faster and the value proposition much clearer.
As to why - sitting in front of emacs with a clicky model M keyboard produces a very different frame of mind. I am more focused and more deliberate in what I type (one doesn't just type ls /usr/bin on such a thing). Although it's by no means my primary computing device, I do find myself going down there for at least a little while on most days. It is a pleasant break, a change of scenery, a different mental state.
I got it, and my vt420 and vt510, after thinking about the bifurcated nature of computing history. Although I started with computers in the 80s, it was the PC side of things. The Unix/"big iron" simply wasn't accessible to many in those days. I have spent decades doing work day in, day out in what amounts to a fancy vt510 emulator (xterm). I wanted to use the real thing. Also it got my son to play zork with me.
I wrote about it here: https://changelog.complete.org/archives/10013-connecting-a-p...
and here: https://changelog.complete.org/archives/10031-resurrecting-a...
The fact that this article hits home and is right scares me a tiny bit. For example:
"I should note that a simple open(b"foo\x7f.txt", "w") works. The lowest-level calls are smart enough to handle this, but the ecosystem built atop them is uneven at best."
If these bytes mean something or something else, this is a concern for the user of the program that feeds it these bytes. The program itself could be oblivious to that.
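A sketch of that low-level path, showing a bytes name round-tripping through the filesystem (\x7f is the DEL control character, which is valid ASCII but unusual in a name):

```python
import os
import tempfile

os.chdir(tempfile.mkdtemp())
name = b"foo\x7f.txt"

with open(name, "w") as f:   # bytes filenames go straight to the OS call
    f.write("hi")

assert name in os.listdir(b".")              # bytes in, bytes out
assert os.fsdecode(name) in os.listdir(".")  # the str view of the same entry
```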
Somehow python, as a whole, started to feel like it needed too much boilerplate and special fiddling to do simple things. I felt like I spent way too much time keeping track of different environments or versions to make each project work, and was always dissatisfied.
There is another problem with trying to backport selected Python 3 features. How do you decide what gets backported? New features will introduce incompatibility. Even if the feature is forward compatible, you end up with code that will run on Tauthon 2.X but not on Tauthon 2.X-1. If it is just a better Python 2, that's fine. When it is some 3rd kind of thing with a relatively tiny user base, who is going to use it?
And I suspect people are vastly overestimating it. The latest additions to Python look like they were made by a committee. It is the committee style work that eats up all the man hours. The programming itself is rather simple.
> Python 3 has too many users at this point.
Most were dragged along by force. Many people are happy that 2.x versions like Tauthon are still maintained. To all projects I'm personally involved in (mainly scientific) python 3 offers literally ZERO advantages and only causes additional costs.
> I don't see them keeping up.
> How do you decide what gets backported? New features will introduce incompatibility.
You gave your own answer. All features that don't introduce incompatibility are going to be backported. In that regard "keeping up" is also not the top priority.
There are a number of py3 only features pushed for by the scientific community (@ is the most obvious, but there are others).
No it's not. It is a clear regression. The minuscule performance improvements stem from the enforcement of xrange() vs range(). But using xrange() on 2.x is still faster.
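For context, Python 3's range has the same lazy semantics Python 2's xrange had: it is a constant-size object that computes length and membership arithmetically instead of materializing a list:

```python
r = range(10**12)            # constant memory; no trillion-element list is built
assert len(r) == 10**12
assert 999_999_999_999 in r  # membership is O(1) arithmetic, not a scan
```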