
The Fundamental Problem in Python 3 - psibi
https://changelog.complete.org/archives/10063-the-fundamental-problem-in-python-3
======
mikl
"Python does not cater to my favourite edge case" != fundamental problem

In the days before UTF-8-everywhere, file names with anything besides
alphanumerics and safe symbols like dashes or underscores were always a
problem. If you had special characters in your filenames, you were almost
certain to run into problems, since their meaning varied greatly depending on
how the system was set up - codepages and whatnot.

But this is only a problem if you have such files, and unless you’ve kept
files around for decades, you don’t. So young programmers can grow up never
having problems with this. Everything will be UTF-8 and it’ll just work.

And as for broken old file names, who cares? Fix your file names and move on.
There’s no reason that Python 3 should have workarounds for problems that were
solved over a decade ago.

~~~
KaiserPro
> And as for broken old file names, who cares? Fix your file names and move
> on.

what if you are trying to process "foreign" (as in, not created by you)
filenames, trying to validate/conform with python? I mean it's a great DoS
vector, which is difficult to protect against in python.

It'll crash. Which is the point the article is trying to make.

> unless you’ve kept files around for decades, you don’t.

That's both untrue and not very helpful. It's perfectly possible to bump into
files like this.
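
For what it's worth, the crash is easy to reproduce without touching the
filesystem; a minimal sketch, assuming a UTF-8 locale (the filename bytes are
made up):

```python
# A made-up POSIX filename containing the raw byte 0xDD decodes, via
# surrogateescape, into a str carrying a lone surrogate.
name = b"report-\xdd.txt".decode("utf-8", "surrogateescape")
assert name == "report-\udcdd.txt"

# A lone surrogate cannot be encoded as strict UTF-8, so a bare
# print(name) raises UnicodeEncodeError even on a UTF-8 terminal.
try:
    name.encode("utf-8")  # roughly what print() does under the hood
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```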

~~~
oefrha
> what if you are trying to process "foreign"(as in, not created by you)
> filenames, trying to validate/conform with python? I mean its a great DoS
> vector, which is difficult to protect against with python.

1. Have catch-all exception handling. The exception may not be your fault;
exception-caused "crashing", whatever that means, is entirely your fault.

2. Use os.fsdecode.
[https://docs.python.org/3/library/os.html#os.fsdecode](https://docs.python.org/3/library/os.html#os.fsdecode)

3. Don't process random untrusted filenames. Sanitize if it's some sort of
HTTP upload, for instance.
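
A minimal sketch of point 2, assuming a POSIX system where the filesystem
encoding is UTF-8 (so the surrogateescape handler is in effect); the filename
is made up:

```python
import os

# A made-up POSIX filename: raw bytes that are not valid UTF-8.
raw = b"report-\xdd.txt"

# os.fsdecode turns arbitrary filename bytes into str; undecodable
# bytes become lone surrogates (surrogateescape) instead of raising.
name = os.fsdecode(raw)

# os.fsencode reverses the mapping exactly, so nothing is lost.
assert os.fsencode(name) == raw
```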

------
oefrha
Just a rehash of

[https://changelog.complete.org/archives/10053-the-incredible...](https://changelog.complete.org/archives/10053-the-incredible-disaster-of-python-3)

[https://news.ycombinator.com/item?id=21606416](https://news.ycombinator.com/item?id=21606416)

Arguing that python3’s str model doesn’t work well with POSIX’s “any bag of
bytes can be a filename” model. Plus new rants about surrogateescape, which
the author has learned about since publishing the last article.

The author sure has a penchant for flamebait titles.

~~~
mehrdadn
No comment on the titles, but the issues are real (the linked page explains
some of it decently [1]). I know I've found it excruciatingly difficult to
write Python code that handles non-ASCII stdio correctly, especially one that
might display on a terminal, especially in a portable manner. Some of the
compatibility issues are inherently hard problems in any language, but others
are Python-related, and it didn't get better in Python 3.

[1] [http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/](http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/)

~~~
oefrha
> I've found it excruciatingly difficult to write Python code that handles
> non-ASCII stdio correctly, especially one that might display on a terminal,
> especially in a portable manner, and it didn't get better in Python 3.

I've been doing just that in Python 3 for half a decade at least -- got a
fairly popular open source CLI application that hasn't seen a Unicode
complaint for years. It doesn't come for free on all possible configurations
(yeah, every developer with a wide enough userbase has seen the dreaded
"'ascii' codec can't encode characters in position ...: ordinal not in range"
at some point) but it's definitely not "excruciatingly difficult". Bad things
only happen when sys.stdout.encoding isn't utf-8, which is rare on *nix
systems -- fixed by setting PYTHONIOENCODING or setting the locale to
*.UTF-8. Encoding on Windows is always a headache (not specific to Python at
all) but somehow we seem to have managed to steer clear too. Anyway, just
check sys.stdout.encoding.
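
A sketch of that check (safe_print is a made-up helper name):

```python
import sys

def safe_print(text: str) -> None:
    # Made-up helper: consult sys.stdout.encoding before printing.
    enc = sys.stdout.encoding or "ascii"
    try:
        text.encode(enc)
    except UnicodeEncodeError:
        # Degrade gracefully instead of crashing with the dreaded
        # "'ascii' codec can't encode characters..." traceback.
        text = text.encode(enc, errors="replace").decode(enc)
    print(text)

safe_print("naïve café")
```

Under a UTF-8 locale this prints the text intact; under an ASCII/C locale it
prints "na?ve caf?" instead of raising.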

Meanwhile, this article is about problems that arise when you have garbage
filenames that are neither valid Unicode nor whatever encoding Microsoft uses; I
wouldn't expect the average user to deal with files like that in day-to-day
usage.

~~~
mehrdadn
> Bad things only happen when sys.stdout.encoding isn't utf-8

Yes, so your code will break when the caller or user sets it to something
else.

> which is rare on *nix systems

"UNIX only" is not exactly what I meant by "portable"...

> It doesn't come for free but it's definitely not "excruciatingly difficult".

It's not when you're ignoring the difficult parts!

> Encoding on windows is always a headache (not specific to Python at all)

It's a headache on Windows, but in some cases encoding is worse in Python 3.
Sadly I've never sat down to make a library of all the examples I come across,
but here's one:

Try predicting what this shell script should print on each platform. (It's
Bash/Batch cross-compatible, so once you tell me your prediction, try running
in both.)

    (python2 -c "import sys; sys.stdout.write('\xDD')" && python3 -c "import sys; sys.stdout.buffer.write(b'\xDD'); sys.stdout.write('\xDD')") > Temp.bin && python3 -c "import binascii; print(binascii.hexlify(open('Temp.bin', 'rb').read()).decode('ascii'))" && rm Temp.bin

~~~
oefrha
> Yes, so your code will break when the caller or user sets it to something
> else.

No, it will break when the encoding is set to something where the _output text
simply can't be encoded_. There's simply no such thing as “Ý” in the ASCII
locale, or the C locale, so when you try to print those you get an encoding
error; if the encoding is set to cp1252 it would be fine.

If you don't want to encode _text_ according to the user's preferred encoding,
just encode as utf-8 regardless and write the bytes. Let me say this again: if
you want to print text, print text; if you want to print a byte stream, print
a byte stream. Python3 doesn't make this trivial but it's also not terribly
hard.
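
A sketch of the distinction (the sample string is made up):

```python
import sys

text = "Ý snowman: ☃"

# Text path: print(text) encodes with sys.stdout.encoding, i.e. the
# user's preference; under a C/ASCII locale this raises
# UnicodeEncodeError.
#
# Byte path: encode explicitly and write to the raw buffer, ignoring
# sys.stdout.encoding entirely.
payload = text.encode("utf-8")
sys.stdout.buffer.write(payload + b"\n")
sys.stdout.buffer.flush()
```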

But by printing a utf-8 byte stream under all scenarios, you would be
conveniently ignoring the fact that the default console encoding on Windows is
cp437, not cp65001, so the non-ASCII parts of the utf-8 byte stream would be
garbled on Windows consoles by default. (And I see garbled Unicode filenames
from programs written in other languages all too often when I'm on a Windows
console.) It's a damned if you do, damned if you don't situation.
Interestingly, the default now is utf-8 on Windows consoles, unless
PYTHONLEGACYWINDOWSSTDIO is set. They probably decided that printing (the
occasional) garbage is better than correctness.

> Try predicting what this shell script should print on each platform.

Again, not surprising when you realize the default encoding on (en_US) Windows
is cp1252. If you want to write a byte 0xdd, don't use '\xDD' which is U+00DD
which of course is encoded differently in utf-8 (0xc3 0x9d) and cp1252 (0xdd).
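
The distinction is easy to check; a minimal sketch:

```python
import sys

# '\xDD' in Python 3 is the character U+00DD (Ý), not the byte 0xDD.
ch = "\xDD"

assert ch == "Ý"
assert ch.encode("utf-8") == b"\xc3\x9d"   # two bytes under UTF-8
assert ch.encode("cp1252") == b"\xdd"      # one byte under cp1252

# To emit the literal byte 0xDD, write bytes, not text:
sys.stdout.buffer.write(b"\xdd")
```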

~~~
mehrdadn
> Again, not surprising when you realize the default encoding on (en_US)
> Windows is cp1252.

By "the encoding" you're referring to PYTHONIOENCODING, right?

Somehow PYTHONIOENCODING is _"not specific to Python at all"_?

~~~
oefrha
PYTHONIOENCODING is an override env var you can set. cp1252 being the default
on Windows is of course not Python-specific, it’s convention set by Microsoft.
In fact, you do realize cp1252, a superset of Latin-1, was even the default
encoding used by the HTML spec before the advent of HTML5 (which postdates
Python3)? Creators of Python3 certainly didn’t set this default to mess with
you.

~~~
mehrdadn
No, this is just insane Python behavior. There's clearly nothing in the
Windows I/O path that's doing this, and if it's a "convention", it's not even
one that Python 2 has followed!

PHP doesn't do it this way:

    php -r "echo json_decode(chr(34).'\u00DD'.chr(34));"

Ruby doesn't do it this way either:

    ruby -e "puts \"\u00DD\""

Neither does MSYS2's Bash:

    printf '\u00DD'

Neither does Node.js:

    node --eval="process.stdout.write('\u00DD');"

Hell, even Python 2's behavior of raising an error is more sane than just
encoding in cp1252 silently and expecting every developer to somehow _know_
the resulting byte sequence is going to be different on each platform:

    python3 -c "print(u'\u00DD')"

Obviously, I'm not the only one who thinks that whatever this (ancient?) so-
called "convention" is, it's actively harmful enough to avoid in 2019.

And as if that's not enough, that's not even the end of it. Even literally
writing a _string_ to a _file_ (with _no_ console or stdio in between!!) ends
up producing _a completely different file_ depending on which platform the
code is run on:

    python3 -c "open('Temp.bin', 'w+').write(u'\u00DD')" && xxd Temp.bin && rm Temp.bin

How is any programmer supposed to deal with this insanity? A program writes to
a file, but produces files with _completely different contents_ depending on
the platform? Heaven help you if you try interacting with a program in
another language. At least with newlines, as painful as they are, most people
_know_ and _recognize_ how to deal with them. Dealing with that mess is
already enough of a burden, not a reason to make the situation even worse! It's
almost as if they took the CRLF issue and then went "Hmm, that's not fair to CR
or LF, we need to do this to _even more characters_."
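
For completeness: the platform dependence comes from open() in text mode
defaulting to locale.getpreferredencoding(False); passing encoding= explicitly
makes the bytes deterministic. A sketch (the scratch filename is made up):

```python
import locale
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "scratch.bin")
default = locale.getpreferredencoding(False)

# Implicit encoding: the bytes on disk depend on the platform/locale.
try:
    with open(path, "w") as f:
        f.write("\u00DD")
    assert open(path, "rb").read() == "\u00DD".encode(default)
except UnicodeEncodeError:
    pass  # e.g. under the C/ASCII locale the write itself fails

# Explicit encoding: the same bytes on every platform.
with open(path, "w", encoding="utf-8") as f:
    f.write("\u00DD")
assert open(path, "rb").read() == b"\xc3\x9d"
os.remove(path)
```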

Somehow we have all this nonsense in the name of "fixing" strings from Python
2. Instead of finding a better convention or at least leaving the existing
behavior alone, Python 3—which is intended to let people write cross-platform
code!—actively _embraced_ it and _botched_ Python 2's safe behavior, making
programs produce garbled output completely silently... and blaming it on the
stupid developer for trying to write a broken cross-platform program without
first getting a PhD in the history of code pages on Windows.

------
adrian17
"into what is fundamentally a weakly-typed, dynamic language."

Nit: isn't Python generally considered to be strongly, dynamically typed?

~~~
shean_massey
Exactly. I stopped looking for insights as soon as I read that one.

~~~
coldtea
Yes, because any post containing an error can never also include insights...

------
kresten
It takes a specific type of personality to remain angry about Python 3 in Dec
2019.

Didn’t that story finish?

~~~
jasondclinton
I know of at least a few big tech companies that haven't migrated from 2 yet.
So, a lot of folks are discovering Python 3 for the first time.

~~~
philipov
In my experience, companies haven't yet migrated from Python 2 for the same
reason they haven't yet migrated from COBOL. It has nothing to do with Python
3's benefits or flaws.

------
xapata
Use pathlib?

------
smitty1e
The lack of any Pathlib discussion seemed curious.

------
kthejoker2
> This was the most common objection raised to my prior post. “Get over it,
> the world’s moved on.”

Gee I wonder why ...

------
goatinaboat
The truth is that 99% of programmers can do everything they need to in ASCII
and the other 1% are working on tools to handle Unicode itself. It’s a mistake,
and the sooner it goes the way of the <blink> tag the better. At least that
tag was amusing for a short while...

~~~
bildung
This is only true for native English speakers, and 96% of the world
population aren't.

~~~
doteka
Always felt like a strange argument to me. I grew up bilingual, and neither of
these languages was English. Never in my life has it occurred to me to name a
file I created with anything other than ASCII characters.
~~~
goatinaboat
_neither of these languages was English. Never in my life has it occurred to
me to name a file I created with anything other than ASCII characters._

This is the totally normal experience of every programmer from every country.

The only people pushing Unicode, ironically, are white native English speakers
who think they’re saving the world.

~~~
coldtea
> _This is the totally normal experience of every programmer from every
> country._

What experience? Never naming a file in "anything other than ASCII
characters"?

It might be true for programmers. But users, 99.9% of whom are not
programmers, do it ALL THE TIME, in every country in the world.

And why shouldn't they? Should they learn English just to name files, or use
transliterations of their language's names for the contents of the file?

> _The only people pushing Unicode, ironically, are white native English
> speakers who think they’re saving the world._

Spoken like a person with no experience whatsoever outside an English-speaking
country or developer echo chamber.

