
The Disaster of Python 3 - pcr910303
https://changelog.complete.org/archives/10053-the-incredible-disaster-of-python-3
======
nas
I suggest taking this article with a grain of salt. The assumption seems to be
that it's totally fine that Linux can have filenames that are arbitrary byte
strings and that don't convert to valid Unicode text. First, Python 3 has a
good way to deal with those. See PEP 383: Non-decodable Bytes in System
Character Interfaces.
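
For readers who have not met PEP 383, here is a minimal sketch of the round trip it enables. The filename is made up for illustration, and the encoding is spelled out as UTF-8 rather than taken from the filesystem encoding (`os.fsdecode()` and `os.fsencode()` apply the same idea using the platform's settings).

```python
# A hypothetical filename as raw bytes. 0xff can never appear in valid UTF-8.
raw_name = b"report-\xff.txt"

# Decode with the PEP 383 error handler instead of the default "strict".
as_text = raw_name.decode("utf-8", errors="surrogateescape")

# The undecodable byte 0xff is smuggled through as the lone surrogate U+DCFF.
assert as_text == "report-\udcff.txt"

# Encoding with the same handler restores the original bytes exactly.
round_tripped = as_text.encode("utf-8", errors="surrogateescape")
assert round_tripped == raw_name

# A strict encode refuses the lone surrogate, which is where code that
# forgets the handler (for example when printing or logging) breaks.
try:
    as_text.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

The str value is usable for joins, comparisons, and passing back to the OS, but it is not valid Unicode text, so anything that encodes it strictly will still raise.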

Second, having filenames that are not valid Unicode text (even if Python 3 has
a way of handling them and round-tripping them) is going to cause you a lot of
pain. No one who has thought through all the issues thinks it's a good idea.
The modern computing world uses Unicode text all over the place. Filenames are
manipulated by humans and we deal with them as text.

The idea that 8-bit byte strings are the ideal way to deal with text is a dead
end. I expect we are going to see more of these kinds of articles now that
Python 2's EOL is coming. In retrospect, you could argue that Python 3 should
store Unicode text in memory as UTF-8. However, at the time the decisions were
made, UTF-8 was not as dominant as it is now.

~~~
rini17
So just handwave it and it eventually goes away? Nope, as long as:

1\. The Python standard library itself does not adhere to PEP 383. (Which is
not likely to be fixed if everyone has such a dismissive attitude.)

and

2\. Operating systems do not enforce valid UTF-8 on filenames. (This is
unlikely to change sooner than in a few decades, if at all.)

~~~
matheusmoreira
> Operating systems do not enforce valid UTF-8 on filenames.

Should they? There is no difference from a file system perspective. We'd still
run into problems even if they did: the line feed is a valid UTF-8 character
and is one of the characters with special meaning in many programs.

Dealing with file names properly is a chore even on bash.

[https://mywiki.wooledge.org/BashPitfalls](https://mywiki.wooledge.org/BashPitfalls)

~~~
skissane
> We'd still run into problems even if they did: the line feed is a valid
> UTF-8 character and is one of the characters with special meaning in many
> programs.

They could ban new line from file names too. See this proposal:
[http://austingroupbugs.net/view.php?id=251](http://austingroupbugs.net/view.php?id=251)

------
hprotagonist
As always, the one true link is this:

[https://nedbatchelder.com/text/unipain.html](https://nedbatchelder.com/text/unipain.html)

Read, understand, and be glad that the interpreter isn't trying to "help" any
more!

~~~
pen2l
I just want to chime in and say that Ned Batchelder is one of the greatest
human beings alive.

Or, at least, to me he is. This man helps so many people with nothing in
return. He's a regular on some python IRC channels, he has personally helped
me so much. He makes difficult concepts easy to understand. I encourage
everyone to watch his pycon talks. Start with his talk on loops:
[https://www.youtube.com/watch?v=EnSu9hHGq5o](https://www.youtube.com/watch?v=EnSu9hHGq5o)

~~~
stevenjohns
There's a few outstanding people that make me love the Python ecosystem,
Nedbat is definitely in that list.

FunkyBob (Curtis Maloney) is also a staple with Django, he's spent years and
years helping people and asking for nothing in return.

~~~
commandersaki
I'll throw in Raymond Hettinger & Jack Diederich -- excellent peeps.

------
oefrha
Okay, another flamebait article. I'll byte, uh no, bite.

> Inconsistencies in Types

There's just no such thing as a builtin character type in Python. No, that's
not new to Python 3. '/' is a str. b'/' is a bytes. Indexing str gives you a
str because there's no such thing as a character type, and introducing it
would be pointless. Indexing bytes gives you a byte (an int) instead of a
bytes because if you're working with a raw byte sequence you probably want to
access the bytes individually? If this wasn't the case people working with raw
byte sequences would probably be even more displeased during the transition
period.
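
A short sketch of the indexing behaviour described above, using an arbitrary path as sample data:

```python
text = "/usr/bin"
data = b"/usr/bin"

first_char = text[0]     # indexing str gives a str of length 1
first_byte = data[0]     # indexing bytes gives an int (the byte value)
first_slice = data[0:1]  # slicing bytes gives bytes

assert isinstance(first_char, str) and first_char == "/"
assert isinstance(first_byte, int) and first_byte == 47
assert isinstance(first_slice, bytes) and first_slice == b"/"

# Code ported from Python 2 often compared data[0] to a one-byte string.
# In Python 3, compare against the integer value or use a slice/startswith.
assert data[0] == ord("/")
assert data.startswith(b"/")

# Iterating over bytes yields ints as well.
first_four = list(data[:4])
```

The slice form `data[0:1]` behaves the same on Python 2 and 3, which is why it is the usual choice in code that has to run on both.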

"But I can't port the code by prepending b to every string literal!" Sorry,
that's not the correct way to port.

> Bugs in the standard library

Yeah, there might be Unicode-related bugs in the standard library (PSL). They
were users' problems in the Python 2 era; now the Python core team shoulders the burden. Instead of every
single programmer making Unicode errors in their own code, if you find an
error in PSL you can fix it once and for all.

------
StefanKarpinski
This problem in Python 3 is not limited to OS file names, that’s just one way
to get invalid Unicode data. But invalid data happens all the time when
working with real data. The Python 3 string design requires that all strings
must be valid Unicode or Python will raise an error. This is a really
unfortunate property that has bitten every single data scientist I know who
uses Python 3. At some point, often hours or days into a long, expensive
computation, one of their programs has suddenly encountered just a single
invalid byte and crashed, costing them days of time and work. The only
recourse for writing robust programs that can gracefully and correctly handle
invalid data is not to use strings, which, frankly makes the string type seem
pretty useless.
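
To make the failure mode concrete, here is a minimal sketch of a single stray byte breaking a strict decode, together with two lossy-but-non-crashing alternatives that `bytes.decode()` accepts. The sample line is invented; whether these handlers are acceptable depends on the data pipeline, so treat this as an illustration rather than a recommendation.

```python
# A hypothetical input line: Latin-1 encoded text, which is invalid as UTF-8.
line = b"caf\xe9 latte\n"

# The default handler is "strict", so one bad byte raises.
try:
    line.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD for the bad byte (the original byte is lost).
replaced = line.decode("utf-8", errors="replace")

# "backslashreplace" keeps the byte value visible as an escape sequence.
escaped = line.decode("utf-8", errors="backslashreplace")
```

The same `errors=` argument can be given to `open()`, so a long-running job can choose its policy once when it opens the file instead of failing hours in.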

The Python 3 string design also necessitates scanning and often transcoding
every piece of string data that it encounters, both on the way in and again on
the way out. That means that not only is the string type inappropriate for any
data that might not be valid Unicode, it is also inappropriate for any data
that might be large.

I’ve been meaning to write a blog post about how Julia handles strings, but
haven’t yet gotten around to it. Among other benefits:

\- You can process any data as strings and characters, whether it’s valid
Unicode or not.

\- If you read any data as strings or characters and write it back out, you
get the exact same data back, no matter what it is, valid or not.

\- Invalid characters are parsed according to the Unicode 10 spec.

\- You only get an error if you actually ask for the code point of an invalid
character, which is a fairly rare operation and must error since there is no
correct answer.

\- The standard library generally handles invalid Unicode gracefully.

\- You can use strings for large data: there’s no need to look at, let alone
transcode string data—if you don’t need to access something no work is
required.

------
Areading314
It's open source. If you are having so much trouble with your ultra niche
filename use case, just open a pr and clean up some of the code.

------
xpasky
This is a nostalgic article, as underlined in the closing section about
XON/XOFF and mainframe-compatible escape sequences.

The world is moving on, and while historic systems are beautiful (I still have
a 2.11 BSD emulator running - or rather runnable - somewhere), at some point
you need to weigh the breakage for legacy users against the cost of
maintenance of the compatibility.

Indeed, POSIX is still mandating that filenames are arbitrary byte sequences.
But it is just becoming impractical, and in the end it's up to whoever has the
motivation to have it working to keep it working, and if there's not enough
people with this motivation it's just going to inevitably rot.

It's likely that 10 years from now, anything non-Unicode will be completely
broken on modern (desktop, at least) systems and perhaps Linux even gets an
opt-in mount option for enforcing filenames to be utf-8-compatible (which may
change to opt-out another 10 years on, just as POSIX is going to evolve too in
this regard).

Yes, it's a pity and I likely still have some ISO-8859-2 files from 1999 on my
filesystem. But I think it's unreasonable for anyone to waste time with that
support. And I wouldn't advise anyone to waste an extra 20 hours of their
developer life on building things around ncurses instead of a more direct approach -
build a cool feature in that time instead!

------
mixmastamyk
Not a disaster, been loving it at least five years. Still, a few niche bugs
could be fixed, why not?

In the meantime as a workaround, make fs links with ascii names and/or
subclass ZipFile. I had to monkey patch a Py2 stdlib module once to fix it for
a year or so until it was fixed. Probably httplib if memory serves.

------
rcarmo
I just don’t see the problem here - most of the piece completely ignores
documented ways to deal with encodings.

For instance, I export PYTHONIOENCODING=UTF_8:replace in some machines where I
know the default locale and terminal settings might cause problems with
logging.
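
As a rough in-process illustration of what the `:replace` part of that setting does, the sketch below wraps an in-memory buffer (standing in for stdout) in a text stream with the same encoding and error handler. The log message is made up, and the buffer is only there so the result can be inspected.

```python
import io

# Stand-in for the process's stdout: a byte buffer behind a text wrapper.
buffer = io.BytesIO()
stream = io.TextIOWrapper(
    buffer,
    encoding="utf-8",
    errors="replace",   # same handler as PYTHONIOENCODING=...:replace
    newline="\n",       # keep line endings identical on every platform
)

# A string containing a lone surrogate, e.g. from an undecodable filename.
stream.write("log: bad name \udcff\n")
stream.flush()

# With errors="replace" the unencodable character is written as "?"
# instead of raising UnicodeEncodeError.
written = buffer.getvalue()
```

On Python 3.7 and later, `sys.stdout.reconfigure(errors="replace")` changes the handler of the real stream from inside the program.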

Edit: premature posting from mobile

------
3JPLW
The site seems to be having some trouble keeping up; here's an
archive/cache/mirror: [https://archive.is/efTT9](https://archive.is/efTT9)

------
mantap
The string encoding is actually the best part of python 3. There's a large
number of small feature regressions that really irritate me, like the removal
of comparators, and gratuitous changes like the removal of print statements
and moving shit around without providing aliases. But the bytes/str
distinction is actually really useful for anybody who uses unicode, which is
everybody.
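
For the comparator removal specifically, the standard library does ship an adapter, `functools.cmp_to_key`. A minimal sketch, with a made-up comparison function and sample data:

```python
from functools import cmp_to_key


def compare_versions(a, b):
    """Old-style comparator: negative, zero, or positive, like Python 2's cmp."""
    parts_a = [int(p) for p in a.split(".")]
    parts_b = [int(p) for p in b.split(".")]
    # (x > y) - (x < y) is the usual replacement for the removed cmp() builtin.
    return (parts_a > parts_b) - (parts_a < parts_b)


versions = ["3.10", "2.7", "3.7", "3.6"]

# sorted() no longer accepts cmp=, so wrap the comparator as a key function.
ordered = sorted(versions, key=cmp_to_key(compare_versions))
```

It is more typing than `sorted(versions, cmp=compare_versions)` was, which is roughly the complaint above.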

If Python 3 had just made that change and no other breaking changes, the
transition would have been much faster and the value proposition much clearer.

------
drivingmenuts
I’m fascinated by the idea that someone is using an IBM 3151 terminal in 2019.
Other than for nostalgia, should not those have been retired about a decade
ago?

~~~
jgoerzen
I bought it off eBay a few weeks ago.

As to why - sitting in front of emacs with a clicky model M keyboard produces
a very different frame of mind. I am more focused and more deliberate in what
I type (one doesn't just type ls /usr/bin on such a thing). Although it's by
no means my primary computing device, I do find myself going down there for at
least a little while on most days. It is a pleasant break, a change of
scenery, a different mental state.

I got it, and my vt420 and vt510, after thinking about the bifurcated nature
of computing history. Although I started with computers in the 80s, it was the
PC side of things. The Unix/"big iron" simply wasn't accessible to many in
those days. I have spent decades doing work day in, day out in what amounts to
a fancy vt510 emulator (xterm). I wanted to use the real thing. Also it got my
son to play zork with me.

I wrote about it here:
[https://changelog.complete.org/archives/10013-connecting-a-p...](https://changelog.complete.org/archives/10013-connecting-a-physical-dec-vt420-to-linux)

and here:
[https://changelog.complete.org/archives/10031-resurrecting-a...](https://changelog.complete.org/archives/10031-resurrecting-ancient-operating-systems-on-debian-raspberry-pi-and-docker)

------
fargle
Normally when I see a headline of this sort, I expect I will find another
over-enthusiastic bombastic smoke-and-mirrors hit-piece.

The fact that this article hits home and is right scares me a tiny bit. For
example:

"I should note that a simple open(b"foo\x7f.txt", "w") works. The lowest-level
calls are smart enough to handle this, but the ecosystem built atop them is
uneven at best."

Oh Crap...
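
A small sketch of the low-level behaviour the quote refers to, assuming a POSIX-style filesystem. It uses the same byte-string name as the quote, inside a temporary directory so nothing is left behind.

```python
import os
import tempfile

name = b"foo\x7f.txt"  # the byte-string filename from the quoted article

with tempfile.TemporaryDirectory() as tmp:
    # Build the whole path as bytes so no decoding step is involved.
    path = os.path.join(os.fsencode(tmp), name)

    with open(path, "w") as handle:
        handle.write("hello")

    # Passing a bytes directory to os.listdir() returns bytes names.
    listed = os.listdir(os.fsencode(tmp))

    with open(path) as handle:
        content = handle.read()
```

`open()`, `os.listdir()`, and `os.path.join()` accept bytes throughout. The unevenness the article complains about is in higher-level modules that assume str.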

------
enriquto
I do not understand why the Unicode type is needed inside the program. Why
can't you treat everything as bytes? It's not like you can't concatenate two
strings of bytes!

If these bytes mean something or something else, this is a concern for the
user of the program that feeds it these bytes. The program itself could be
oblivious to that.
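
One hedged illustration of where the bytes-only view runs into trouble: concatenation is indeed fine, but length, slicing, and case conversion operate on bytes rather than characters. The sample word is arbitrary.

```python
word = "na\u00efve"            # "naïve": five characters
encoded = word.encode("utf-8")  # six bytes, because U+00EF takes two bytes

char_count = len(word)
byte_count = len(encoded)

# Slicing bytes can cut a multi-byte character in half.
truncated = encoded[:3]
try:
    truncated.decode("utf-8")
    truncated_is_valid = True
except UnicodeDecodeError:
    truncated_is_valid = False

# bytes.upper() only touches ASCII letters; str.upper() knows about U+00EF.
upper_bytes = encoded.upper()
upper_text = word.upper()
```

A program that only copies or concatenates data can stay oblivious. One that truncates, measures, or transforms text needs to know the encoding somewhere.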

------
sytelus
Title is overblown. At best this might be "disaster" for string/bytes types
but many would argue even that is not the case.

------
pontifier
This one (Python 3) does not spark joy.

Somehow python, as a whole, started to feel like it needed too much
boilerplate and special fiddling to do simple things. I felt like I spent way
too much time keeping track of different environments or versions to make each
project work, and was always dissatisfied.

------
bildung
The author's problem starts right at the beginning: he mentions that POSIX
filenames consist of 8-bit _bytes_, but then uses a UTF-8 _string_ as the
example filename in the first code block.

------
sprash
Python 3 is a disaster for many other reasons; UTF-8 bugs are just one of them.
So far I'm sticking with Tauthon[1] which seems to be the best of both worlds.

1.:
[https://github.com/naftaliharris/tauthon](https://github.com/naftaliharris/tauthon)

~~~
nas
I don't wish the Tauthon project ill but I suspect people are underestimating
how much work goes into maintaining Python 3. If the Tauthon project scope was
limited to taking Python 2.7.X and doing bug fix only releases of it, I think
it could be a successful project. Since they seem to be backporting features
from the 3.X branch, I don't see them keeping up. Python 3 has too many users
at this point. If you look at the Tauthon commit log, it seems clear they are
being left behind.

There is another problem with trying to backport selected Python 3 features.
How do you decide what gets backported? New features will introduce
incompatibility. Even if the feature is forward compatible, you end up with
code that will run on Tauthon 2.X but not on Tauthon 2.X-1. If it is just a
better Python 2, that's fine. When it is some 3rd kind of thing with a
relatively tiny user base, who is going to use it?

~~~
sprash
> I suspect people are underestimating how much work goes into maintaining
> Python 3.

And I suspect people are vastly overestimating it. The latest additions to
Python look like they were made by a committee. It is the committee style work
that eats up all the man hours. The programming itself is rather simple.

> Python 3 has too many users at this point.

Most were dragged along by force. Many people are happy that 2.x versions like
Tauthon are still maintained. To all projects I'm personally involved in
(mainly scientific) python 3 offers literally ZERO advantages and only causes
additional costs.

> I don't see them keeping up.

> How do you decide what gets backported? New
> features will introduce incompatibility.

You gave your own answer. All features that don't introduce incompatibility
are going to be backported. In that regard "keeping up" is also not the top
priority.

~~~
joshuamorton
Py3 is currently more performant than py2.

There are a number of py3 only features pushed for by the scientific community
(@ is the most obvious, but there are others).
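
For anyone who has not used it, `@` is the matrix-multiplication operator added in Python 3.5 (PEP 465), dispatched through `__matmul__`. The toy `Matrix` class below is purely hypothetical and exists only to show the syntax without requiring NumPy.

```python
class Matrix:
    """Toy matrix type, for illustration only (real code would use NumPy)."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def __matmul__(self, other):
        # Transpose the right-hand side so its columns are easy to iterate.
        columns = list(zip(*other.rows))
        return Matrix(
            [[sum(x * y for x, y in zip(row, col)) for col in columns]
             for row in self.rows]
        )


a = Matrix([[1, 2], [3, 4]])
b = Matrix([[0, 1], [1, 0]])

# Python 3.5+ only: on Python 2 this line is a syntax error.
product = (a @ b).rows
```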

~~~
sprash
> Py3 is currently more performant than py2.

No it's not. It is a clear regression. The minuscule performance improvements
stem from the enforcement of xrange() vs range(). But using xrange() on 2.x is
still faster.

~~~
joshuamorton
No, starting in 3.6 or 3.7, there are general performance improvements at the
C level in the interpreter.

