
It's possible things are changing. I just remember things like this, which was extremely frustrating at the time: https://lwn.net/Articles/754163 . Guido got visibly upset at this talk.

In roughly the last 10 years, Google, Instagram, Dropbox, Facebook, and reddit all invested a lot of time into trying to make Python fast and were pushed away with harsh no's. There are dozens of forks and branches with basic optimizations that were rejected.

I wish everyone the best of luck. But there's a side of me that's disappointed since this somewhat shows that it's only when these ideas come from inside the core maintainer team that they get acceptance. That's going to make it difficult to grow the team in the future.




GvR also got visibly upset at the Nuitka author:

https://news.ycombinator.com/item?id=8772124

It is amazing that someone who has been unfriendly (sometimes in a nasty political manner) to so many people still has a cult following, and is ostensibly a CoC proponent, now that the CoC is used to protect the inner circle.

If I were an expert assigned to this project at Microsoft, I'd try to get out of it as soon as possible. CPython is too political.


> In roughly the last 10 years, Google, Instagram, Dropbox, Facebook, and reddit all invested a lot of time into trying to make Python fast and were pushed away with harsh no's. There are dozens of forks and branches with basic optimizations that were rejected.

Is that true?

The talk you link was about a secretive fork of Python that Instagram was adamant about not open-sourcing until very recently (in fact it's not clear that what's been open-sourced is their full project).

Other than that, until recently all I've ever seen is either breaking the C API or no clear commitment from the maintainers that this was a sustained effort they would support, rather than a drive-by complicated patch.

In more recent times we have Cinder, Pyjion, Pyston, and others, and CPython devs are upstreaming either their work or their general approach.


Yes, it is true. Many years ago someone showed up on python-dev with a very ambitious patch for improved string handling, especially slicing. He had impressive benchmark numbers to back up his claims. The patch was rejected because it was large and would have doubled the size of the string.c file.

What goes into CPython depends on whether Guido and his seconds like it or not. And if it's their ideas and their implementations, it's treated differently than if it comes from outsiders. Then sometimes they suddenly change their minds and something that previously was never going to happen gets implemented. JIT compilation, which is apparently now on their roadmap, is one such example. It works for them, and Python continues to be a great language, but it's frustrating for contributors.


Do you have the link? I would be fascinated to read it. The way you describe it, though, it sounds like it falls into the "drive-by complicated patch" category.

I think we've seen in recent languages that strings are basically one of the hardest things to "get right". Just look at Swift's and Rust's multiple attempts to implement strings, even though their authors all had many years of experience with string implementations in other mature languages.

Python doesn't have the luxury of messing up strings: it was the number-one reason migrating from Python 2 to 3 took a decade. Unlike lower-level compiled languages, it can't ship multiple string attempts and tell users to pick one. So the core devs are on the hook for maintaining any implementation for the lifetime of Python.

But maybe this example really was a concerted effort by someone willing to maintain Python strings for as long as needed.


Not to revive the Unicode wars again, but a big portion of Python 3's disastrous transition to Unicode wasn't caused by "strings being surprisingly hard", but rather by the Python 3 team having chosen an unworkable text model, which multiple people (including me, way back in 2005) said wouldn't work in practice. Our hard-earned experience was ignored, and it took a long time for later Python releases to eat crow and roll back the mistakes.

A large part of why Python 3.6 is a better porting target than 3.0 is that they slowly and silently stepped away from their very strong opinions about "what is text" and "what is bytes", and what each of those things meant. Even back when the model was being debated, it was clear that the operations the core maintainers cared about (manipulation of individual code points [0]) were a dead end, as the Unicode standards committee was just then realizing that code points weren't an ideal unit, and grapheme clusters were added to the spec. (my dates might be a bit off here, it's all a blur I've mostly forgotten)

Python 3.0 was released as an essentially nonfunctional piece of software. You could barely write anything that touched a binary file in it. The built-in zipfile module, I remember, crashed if you tried to store or load a binary file inside a .zip; the only tests at the time stored .txt files, and the bug wasn't fixed for a long time. I remember others having trouble with the email and mime modules, though I didn't personally have any code that worked with those.

I had left the community by that point, but I have heard tales of the mountains Armin Ronacher had to move in order to add the u'' prefix back to Python 3; the core team was against it, instead believing that if 2to3 wasn't working for people, it should be fixed. Alice Bevan–McGregor spent years making a version of the WSGI specification that worked with Python 3's text model. We honestly could have shaved 5-6 years off that disaster if they had just listened to the community; pretty much every opinion we shared eventually came true, and the Python 3 text model these days is in a very different, and much healthier, place.

[0] One of their major strongholds was that they really needed O(1) indexing of codepoints (it wasn't clear why this was a priority, and I remember arguing that wanting this is a sign of poorly written code). I think in Python 3.8 they finally caved on this, with a PEP and an implementation that can use UTF-16 or UTF-8 internally.


I'd super appreciate it if you have sources (mail archive links or whatever); I'd love to read through Python history like this, and the PEP you were referring to? (I read through the release notes and couldn't find anything; maybe it relates to PEP 538 in Python 3.7?)

It's pretty clear that there were a lot of mistakes in the Python 2 to 3 transition; I'm certainly not trying to defend every choice of the developers.


The PEP I was thinking of was actually PEP-393 [0], which was done for Python 3.3 (well, I did say I'd mostly left the community, and that my dates were probably wrong). It seems it didn't include UTF-8 as a representation, only Latin-1, UCS-2, and UCS-4, and it only ever picks a representation that maintains the O(1) indexing property. So I was wrong on that one; I guess that's the one thing they haven't given up on :/
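For what it's worth, PEP-393's per-string representation choice is easy to observe from CPython itself; this is a minimal sketch (exact byte counts are CPython- and version-specific, so only the relative growth matters):

```python
import sys

ascii_s = "a" * 1000            # all code points <= U+00FF: 1 byte each (Latin-1)
bmp_s = "\u0100" * 1000         # needs the 2-byte (UCS-2) representation
astral_s = "\U0001F600" * 1000  # needs the 4-byte (UCS-4) representation

# len() is 1000 for all three, but the memory footprint roughly doubles
# at each step: the widest code point in the string picks the storage
# width, so that s[i] stays O(1).
sizes = [sys.getsizeof(s) for s in (ascii_s, bmp_s, astral_s)]
print(sizes)
```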

As for the rest, I'll see if I can dig up some logs, but a lot of the discussion I was involved in took place on IRC, and my logs were lost a few server wipes ago. You might find some old posts of mine on python-ideas.

I've been planning to write a fuller blog post on why I think Python's text model is the wrong one, but Manish has a very good post [1] which covers most of the reasons. Basically, Unicode code points don't net you anything over bytes other than O(1) indexing, which is completely useless, and they are in fact in many ways a worse representation. There is no operation where having a list of Unicode code points helps you more than having a list of bytes; well, none that isn't completely broken in practice.
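To make the "code points aren't characters" point concrete, here is a small illustration of how code-point indexing can split a user-perceived character:

```python
import unicodedata

s = "e\u0301"      # 'e' + COMBINING ACUTE ACCENT: one grapheme, two code points
print(len(s))      # 2 -- code-point length disagrees with what the user sees
print(repr(s[0]))  # 'e' -- slicing by code point tears the grapheme apart

# Normalization rescues this particular case because a precomposed form
# exists, but many graphemes (flag emoji, ZWJ sequences) have none:
nfc = unicodedata.normalize("NFC", s)
print(len(nfc))    # 1
```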

More drastically, the team wanting "Unicode everywhere" meant forcing things that clearly weren't Unicode into Unicode. File paths are not guaranteed to be valid Unicode on any popular system: Windows uses its "UTF-16", which is really closer to UCS-2 and allows unpaired surrogates, and POSIX says path components are just bytes that can't contain ASCII NUL or '/'. These limitations were hit days into the prototyping process on real-world use cases, which is why there's an algorithm to shove arbitrary binary data into unpaired surrogate characters, to be used on file paths (see PEP-383 [2]). This should have blown a huge hole in the whole scheme, but they pushed onwards. The utopian future of "Unicode everywhere" was more important than dealing with the practical reality of existing systems.
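The PEP-383 smuggling trick can be seen directly with the surrogateescape error handler (the filename here is just an illustrative example of a Latin-1-era path):

```python
raw = b"caf\xe9.txt"  # a Latin-1 filename: 0xE9 is not valid UTF-8
name = raw.decode("utf-8", "surrogateescape")
print(repr(name))     # 'caf\udce9.txt' -- the stray byte hides in a lone surrogate

# The original bytes round-trip losslessly, which is the whole point:
assert name.encode("utf-8", "surrogateescape") == raw
```

This is the same mechanism os.fsdecode/os.fsencode use so that not-actually-Unicode paths can pass through str-typed APIs.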

Similar things happened with console output, where it's now assumed that LANG will be set correctly. But no: when I ssh into a server in Japan that's set to use EUC-JP, Python outputs EUC-JP-encoded bytes, which get shovelled over ssh as bytes, which the terminal emulator on my laptop misinterprets as UTF-8. Yes, I know ssh is supposed to tunnel certain LANG envvars as well. No, I don't remember why it didn't in this case. But so many people were hitting this that in Python 3.7 they finally added UTF-8 mode (PEP-540 [3]). There's still a lot of people for whom this general "Unicode console" idea breaks [4] [5].
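The EUC-JP-over-ssh failure mode is easy to reproduce in miniature; this sketch shows what a UTF-8 terminal makes of bytes an EUC-JP locale emitted:

```python
text = "日本語"
wire = text.encode("euc_jp")               # what a EUC-JP locale would emit
garbled = wire.decode("utf-8", "replace")  # what a UTF-8 terminal sees: mojibake

# The bytes were fine all along; only the assumed encoding was wrong.
print(repr(garbled))  # full of U+FFFD replacement characters
```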

Pretty much every part of the standard library was riddled with bugs when they rolled this out (as mentioned, I hit bugs in zipfile; others found bugs in email/mime). For a long time, the only way to know whether a file-like object you'd been handed would produce bytes was isinstance(file.read(0), bytes), which is generally ugly, so a lot of modules didn't support bytes even when they should have.
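That sniffing trick looks like this (the helper name is mine, just for illustration):

```python
import io

def wants_bytes(f):
    # Read zero units and inspect the type of the empty result:
    # b'' means a binary stream, '' means a text stream.
    return isinstance(f.read(0), bytes)

print(wants_bytes(io.BytesIO(b"PK\x03\x04")))  # True
print(wants_bytes(io.StringIO("hello")))       # False
```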

Practically everything since has been walking back this text model in favor of one where bytes and str aren't even that different anymore. Python 3.5 brought with it PEP-461 [6], which added % formatting back to bytes (after a very long and very tiring discussion with the core maintainers [7]).
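The restored bytes formatting looks like this (the HTTP status line is just an illustrative use case; wire protocols are the motivating example in the PEP):

```python
# Since Python 3.5, %-interpolation works on bytes again, so protocol
# lines can be built without round-tripping through str:
status = b"HTTP/1.1 %d %s\r\n" % (200, b"OK")
print(status)
```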

I could go on and find more PEPs, but I'm done for now.

[0] https://www.python.org/dev/peps/pep-0393/

[1] https://manishearth.github.io/blog/2017/01/14/stop-ascribing...

[2] https://www.python.org/dev/peps/pep-0383/

[3] https://www.python.org/dev/peps/pep-0540/

[4] On POSIX systems, https://stackoverflow.com/questions/11741574/how-to-print-ut...

[5] On Windows, https://stackoverflow.com/questions/17918746/print-unicode-s...

[6] https://www.python.org/dev/peps/pep-0461/

[7] https://bugs.python.org/issue3982


> the operations the core maintainers cared about (manipulation of individual codepoints [0]) were a dead end, as the Unicode Standards committee was just about realizing that codepoints weren't an ideal unit, and grapheme clusters were added to the spec.

Can you please expand on this? I thought the problems with Python's new text model were due to backwards incompatibility.


The Python developers' original plan for Python 3's new text model was an expansion of the model introduced partway through Python 2's life. Python 2 has two types: str, a sequence of bytes, and unicode, a sequence of Unicode code points. An unfortunate model, but an understandable one. The difficult part was that since unicode was introduced some time after 2.0, it needed to be backwards compatible with existing code, so they decided that any time an operation wanted unicode and received str, the runtime would automatically decode the str to unicode, and vice versa. This implicit decode/encode became a large hassle, so one aim of Python 3 was to remove it.
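The removal of that implicit coercion is visible in any Python 3 interpreter (the strings here are just illustrative):

```python
# Python 2 would implicitly ASCII-decode the str operand here;
# Python 3 refuses to mix the two types:
try:
    b"status: " + "ok"
    mixed = "allowed"
except TypeError:
    mixed = "refused"
print(mixed)  # refused

# The bytes/text boundary must now be crossed explicitly:
joined = b"status: " + "ok".encode("ascii")
```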

Unfortunately, while the developers removed the implicit encode/decode in Python 3, they also made the gulf between str and unicode (now called bytes and str) much larger, removing large swaths of useful functionality from bytes and doubling down on the code-point nature of the new str type, despite it being very apparent by then that this was no longer a good text model to base a language on.

Backwards compatibility was a large issue indeed, but IMO they broke backwards compatibility all to introduce a subpar, stricter text model that delivered on far fewer promises than they were hoping, and that in my opinion is worse than Python 2's text model. A more reasonable approach can be found in other languages, which keep strings as bytes and decode code points or graphemes on the fly while iterating. Swift, Rust, and Go all get this closer to right.
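Python can approximate that decode-while-iterating style with the codecs module's incremental decoders, which handle multi-byte characters split across chunk boundaries (a sketch of the idea, not how CPython's str works internally):

```python
import codecs

# UTF-8 for "日本", deliberately split in the middle of a character:
chunks = [b"\xe6\x97", b"\xa5\xe6\x9c\xac"]

dec = codecs.getincrementaldecoder("utf-8")()
out = "".join(dec.decode(chunk) for chunk in chunks)
out += dec.decode(b"", final=True)  # flush; raises if bytes were left dangling
print(out)
```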


> as the Unicode Standards committee was just about realizing that codepoints weren't an ideal unit

Where 'just about' means 1996 at the latest, the release of Unicode 2.0 (cf chapter 5 [1], p. 21, Character Boundaries).

[1] http://www.unicode.org/versions/Unicode2.0.0/ch05.pdf


Exactly this. I still remember all of these pain points. It's one of the main reasons why my team moved to Go.


> Rusts multiple attempts to implement strings

What are you referring to? Rust's str type is the only one I know about [1], and it's really good as a general-purpose string representation. Were there other string implementations before the 1.0 release?

[1] There's also String, which is the heap-allocated, growable counterpart of str, and OsStr/OsString, which represent the operating system's native string type for file system paths.


I can believe it. The PSF and its hangers-on have historically fulminated against improving CPython performance at the cost of complexity in the implementation. They've caused flamewars on HN in the past, though Guido's recent about-face seems to have settled the field.


Honestly, this is why I don't use Python anymore. Someone comes along with real improvements and GvR gets upset about the tone. The core devs just seem entirely full of themselves, and even the most standard interpreter improvements are too wild for them.

I'd love to see the core team develop some humility, realize their engineering is barely even mediocre, and accept patches from well-meaning people who know what they're doing.




