I suggest taking this article with a grain of salt. The assumption seems to be that it's totally fine that Linux can have filenames that are arbitrary byte strings and that don't convert to valid Unicode text. First, Python 3 has a good way to deal with those. See PEP 383: Non-decodable Bytes in System Character Interfaces.
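
For readers who have not seen PEP 383 in action, here is a minimal sketch of the `surrogateescape` error handler it introduced. The sample bytes and the helper name `is_clean_text` are made up for illustration; only the codec error handlers are part of Python itself.

```python
# A filename as raw bytes: Latin-1 encoded "café.txt", which is NOT valid UTF-8.
raw_name = b"caf\xe9.txt"

# PEP 383: undecodable bytes become lone surrogates (U+DC80..U+DCFF)
# instead of raising UnicodeDecodeError.
text_name = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text_name.encode("utf-8", errors="surrogateescape")


def is_clean_text(name: str) -> bool:
    """Hypothetical helper: True if `name` holds no escaped raw bytes."""
    try:
        name.encode("utf-8")  # strict mode rejects lone surrogates
        return True
    except UnicodeEncodeError:
        return False
```

On POSIX systems `os.fsdecode()` and `os.fsencode()` apply the same handler using the filesystem encoding, so a name like this can be passed back to `open()` unchanged even though it cannot be printed or written to a strict UTF-8 stream.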

Second, having filenames that are not valid Unicode text (even if Python 3 has a way of handling them and round-tripping them) is going to cause you a lot of pain. No one who has thought through all the issues thinks it's a good idea. The modern computing world uses Unicode text all over the place. Filenames are manipulated by humans and we deal with them as text.

The idea that 8-bit byte strings are the ideal way to deal with text is a dead end. I expect we are going to see more of these kinds of articles now that Python 2's EOL is coming. In retrospect, you could argue that Python 3 should store Unicode text in memory as UTF-8. However, at the time those decisions were made, UTF-8 was not as dominant as it is now.

So just handwave it and it eventually goes away? Nope, as long as:

1. The Python standard library itself does not adhere to PEP 383. (Which is not likely to be fixed if everyone has such a dismissive attitude.)

2. Operating systems do not enforce valid UTF-8 on filenames. (This is unlikely to change sooner than in a few decades, if at all.)

> Operating systems do not enforce valid UTF-8 on filenames.

macOS does. In fact, it goes a step farther and normalizes filenames (NFD).

I'm not aware of any serious problems this has caused. It turns out that -- to a reasonable approximation -- nobody actually writes software which puts arbitrary binary data in filenames.
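
To make the normalization point concrete, here is a small sketch of how the same visible filename can exist as two different code point (and byte) sequences. The sample name and the `same_name` helper are illustrative assumptions, not anything macOS exposes.

```python
import unicodedata

# "café.txt" with a precomposed é (NFC) ...
nfc_name = "caf\u00e9.txt"
# ... and the decomposed form (NFD): plain "e" followed by a combining acute accent.
nfd_name = unicodedata.normalize("NFD", nfc_name)

# Same text on screen, different bytes on disk.
nfc_bytes = nfc_name.encode("utf-8")
nfd_bytes = nfd_name.encode("utf-8")


def same_name(a: str, b: str) -> bool:
    """Hypothetical helper: compare two filenames after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

A program that compares a user-typed name against a directory listing byte-for-byte can miss a match when one side has been normalized and the other has not, which is why comparing after normalization is the safer habit.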

Does this also involve case sensitivity?

I've encountered errors with nodejs imports where getting filename casing wrong didn't matter on OSX (dev environment) but caused errors in ci/prod (Linux). Never gave it much thought after fixing, but this reminded me.

Not precisely. File systems are case-insensitive by default on macOS, but this can be changed on a FS-by-FS basis. Unicode normalization is -- as far as I'm aware -- mandatory.

> Operating systems do not enforce valid UTF-8 on filenames.

Should they? There is no difference from a file system perspective. We'd still run into problems even if they did: the line feed is a valid UTF-8 character and is one of the characters with special meaning in many programs.

Dealing with file names properly is a chore even in bash.
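
A short sketch of why the line feed is the troublesome one: a newline-separated list of names cannot be split back reliably, while a NUL-separated one can, since NUL is not allowed inside a POSIX filename. The sample names and the `pack_names`/`unpack_names` helpers are invented for illustration.

```python
# Three perfectly legal POSIX filenames; the second contains a line feed.
names = ["report.txt", "evil\nname.txt", "with space.txt"]

# Newline-delimited output (what many line-oriented tools assume) is ambiguous:
# the embedded line feed makes one name look like two.
newline_split = "\n".join(names).split("\n")


def pack_names(file_names):
    """Hypothetical helper: join names with NUL, which cannot occur in a filename."""
    return "\0".join(file_names)


def unpack_names(blob):
    """Inverse of pack_names; an empty blob means no names."""
    return blob.split("\0") if blob else []
```

This is the same idea behind `find -print0` paired with `xargs -0` in shell pipelines.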


> We'd still run into problems even if they did: the line feed is a valid UTF-8 character and is one of the characters with special meaning in many programs.

They could ban newlines in file names too. See this proposal: http://austingroupbugs.net/view.php?id=251

It looks to me like OP thinks we should, "because it's pain otherwise".

My own opinion is more like this: it is tied to the C paradigm (where you don't have any reliable runtime typing/encoding info, everything is just a bag of bytes), and it can't really be solved without moving on to something else. Not sure if that ever happens.

Python 3’s Unicode support has its issues, e.g. try using `len()` on a string containing surrogate pairs. But it generally works and it’s internally consistent, which is far more than can be said for Python 2’s unholy shambles where str and unicode could be unsafely mixed for all kinds of exciting data corruption and errors. Python 2-to-3 was a move in the right direction. I switched early, and the number of encoding-related problems I’ve had in my Python 3 code has been a fraction of those under Python 2.
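
A hedged note on the `len()` remark: on CPython 3.3 and later, `str` is indexed by code point, so a character outside the Basic Multilingual Plane counts as one, not as a surrogate pair. The surprises that remain are combining sequences and code that expects UTF-16 units. The sketch below uses arbitrary sample characters and assumes Python 3.3+.

```python
# One emoji, outside the Basic Multilingual Plane.
emoji = "\U0001F600"

# len() counts code points, so this is a single "character" to Python.
emoji_code_points = len(emoji)

# The same character needs two 16-bit units (a surrogate pair) in UTF-16.
emoji_utf16_units = len(emoji.encode("utf-16-le")) // 2

# One visible glyph built from a base letter plus a combining accent.
accented = "e\u0301"

# len() reports two code points even though a reader sees one letter.
accented_code_points = len(accented)
```

So `len()` answers "how many code points", which is neither "how many UTF-16 units" nor "how many characters a human would count".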

The article author’s clickbait-y hysterics are a ridiculous overreaction to the usual IO hassles when working with outside data in arbitrary legacy encodings. But Python 3.1 was released a decade ago and we’re now up to 3.8, so it’s not like folks who still deal with legacy data haven’t had plenty of time to file tickets and patches on Python’s famously grotty stdlibs. For example, adding optional `path_encoding` and `data_encoding` parameters to `ZipFile()`, similar to `open(f,encoding='…')`, would easily address the described problem, no sturm-und-drang required.
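
The `path_encoding` parameter above is a proposal, not an existing API, so the sketch below shows a commonly used workaround instead. It relies on the fact that `zipfile` decodes member names as cp437 when the archive's UTF-8 flag (bit 11 of `flag_bits`) is not set. The helper names `recover_name` and `list_names`, and the choice of legacy encoding, are assumptions for illustration.

```python
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: member name is stored as UTF-8


def recover_name(info: zipfile.ZipInfo, legacy_encoding: str) -> str:
    """Hypothetical helper: return the member name decoded with the real encoding."""
    if info.flag_bits & UTF8_FLAG:
        # The archive declared UTF-8, so zipfile already decoded it correctly.
        return info.filename
    # Otherwise zipfile decoded the raw bytes as cp437. cp437 maps every byte
    # value, so re-encoding recovers the original bytes, which we redecode.
    return info.filename.encode("cp437").decode(legacy_encoding)


def list_names(path: str, legacy_encoding: str) -> list:
    """Hypothetical helper: list member names of a legacy archive."""
    with zipfile.ZipFile(path) as archive:
        return [recover_name(info, legacy_encoding) for info in archive.infolist()]
```

The caller still has to know or guess the legacy encoding (for example `cp866` or `shift_jis`), since the ZIP format does not record it.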

Or heck, in the time it took him to write his post, he could’ve easily knocked out a Python 2 script that rewrites all his legacy .zip files to use modern UTF-8 instead. But some folks just prefer complaining, I guess.

Obligatory: https://i.ytimg.com/vi/tJ-LivK4-78/hqdefault.jpg

(Honestly, from the HN post title I thought it was going to be about something Python 3 really did fuck up, like the 2-to-3 migration which has been an absolute sucking swamp of unnecessary complexity and make-work for years, and a fine demonstration of how not to manage lifecycle.)

It's easy to poke at taking a decade to do a major upgrade to a tool.

But it's been:

a) a smashing success in comparison with Perl5 => Perl6, and

b) one doubts that a future Python3 => Python4 transition would go significantly better.

“Didn’t fuck it up quite as badly as Perl” is hardly a robust defense.

Contrast the upgrade cycle of Apple’s Swift. Same amount of user howling, of course, yet Swift has moved forward by 4 breaking releases in the time it’s taken Python to do 1. They designed their upgrade process; tooled, timelined, and taught it appropriately; and ran it to schedule.

With Python there was no clear process, no clear schedule, no clear endpoint. They cared about the code but the logistics were an afterthought. They designed, implemented, and tested all their language changes; they should have designed, implemented, and tested their rollout procedures with equal rigor.

In the end, there was no transition of Perl 5 to Perl 6. Recently it was decided to rename Perl 6 to Raku, freeing up both languages from a decades-long entanglement. https://raku.org

Not sure what this would imply for a future Python 3 -> Python 4 transition.

Doing something like offloading the GIL could be a bridge too far, and trigger a new language?

> No one who has thought through all the issues thinks it's a good idea.

As far as it goes, no reasonable person thinks that even having spaces in filenames is a good idea. Nonetheless, POSIX is what it is, and if you're writing a program meant to be generally useful, you can't just handle the POSIX filenames that you like--you have to handle them all.

There are many consequences for failing this, but the most obvious ones are buggy behaviour in general, and security holes in particular. This cannot be ignored or wished away.
