Hacker News
The CPython Bytecode Compiler Is Dumb (nullprogram.com)
221 points by ingve 29 days ago | 137 comments

There is a reason why basically every attempt to make this kind of language fast has to support some form of on-stack replacement.

For example, it's hard to optimize even local variable dataflow in Python, since it's part of the API: you can inspect the local frame of your caller, so you have a problem as soon as your function contains a single call. And no, you can't know the call target statically, since it can be replaced dynamically.
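The frame-introspection hazard is easy to demonstrate with the stdlib (the function names here are made up):

```python
import sys

def show_caller_locals():
    # sys._getframe(1) is the *caller's* frame. Any function Python calls
    # could do this, so the compiler can't eliminate "unused" locals
    # around a call site.
    return dict(sys._getframe(1).f_locals)

def f():
    x = 42  # a dead store that a smarter compiler might want to drop
    return show_caller_locals()

print(f())  # {'x': 42}
```

The store to `x` looks dead, yet the callee can observe it, so removing it would be visible.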

So either you perform the optimization anyway and then have to support reconstructing values correctly for introspection, or you do that generically with OSR and use the simple introspection on the interpreter.

Either way, there are not a lot of "simple optimizations" when the language is this dynamic.

You could support markers stating "I swear this part is never going to use those dynamic features", disabling them for you and others.

E.g., I'd have no problem adding a few annotations telling Python that I'm not going to override builtins, and won't allow it, in my program and its dependencies.

We could make sure those markers can only be set in __main__, and crash with very explicit errors in the unlikely event anything down the line decides to do otherwise.

Indeed, many dynamic features are rarely used. They are handy from time to time, but I wouldn't miss them in most code.

"You could support markers stating "I swear this part is never going to use those dynamic features", disabling them for you and others."

Right now, the swing is in general to statically-typed languages that are more convenient to use, but I've thought there's room for a new dynamic scripting language that is still dynamically typed, but is written from the beginning to focus on speed. You can see some of the ideas in Julia or LuaJIT, but you pay a penalty on what can be dynamic.

One of the ideas I've had is more like the "pledge" feature that OpenBSD recently introduced. Rather than stating up front "I will not use this feature", you initialize your program, do all the dynamic stuff, then push the "OK, now I'm done being dynamic" button. After that, the program "freezes" into place, and calling a function with a new type of argument it has never seen before or something becomes an error.
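A toy sketch of what such a "pledge" could feel like in today's Python; the `Frozen` class and `seal()` method are entirely hypothetical, and a real design would need interpreter support rather than a userland `__setattr__` hook:

```python
class Frozen:
    """Hypothetical pledge-style namespace: after seal(), rebinding
    attributes becomes an error, so a compiler could rely on them."""
    _sealed = False

    def seal(self):
        # "OK, now I'm done being dynamic"
        object.__setattr__(self, "_sealed", True)

    def __setattr__(self, name, value):
        if self._sealed:
            raise AttributeError(f"sealed: cannot rebind {name!r}")
        object.__setattr__(self, name, value)

cfg = Frozen()
cfg.handler = print   # dynamic phase: wire things up freely
cfg.seal()
try:
    cfg.handler = len  # post-pledge mutation is now an error
except AttributeError as e:
    print(e)  # sealed: cannot rebind 'handler'
```

A real implementation would freeze classes and modules too, not just one object, but the two-phase shape is the same.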

My reasoning here is that the dynamic scripting languages tend not to use their dynamism evenly. The vast bulk of "dynamic" behavior is all done in an informal initialization phase, but then, for the bulk of the program's execution, you continue to pay for all the dynamism because the interpreter has to constantly follow the dynamic chains of functions, or even if the code is JIT'ed, the JIT has to be written to handle functions suddenly getting the "wrong" type, which at the very least means you pay for a check the static programs don't need to pay for, and generally, you may have to pay more. You set up the dynamism once at the start of the program, but pay for the ability to be dynamic later billions and trillions and so on of times over the course of the program.

(Don't just think about how poorly this would work if bodged onto Python or Javascript or something, because I know such an attempt would absolutely be a disaster. That's why I'm hypothesizing someone sitting down and designing this language from scratch with these ideas in mind, so it'll have the correct affordances and paved cow paths and such to make this work. While you're at it, give your new dynamic scripting language a solid concurrency story, since none of the current ones have one, since they all grotesquely predate that as a concern. I think there's a hole in the programming language landscape here right now. Another way to think of this is "write a dynamic scripting language that is designed to have a good, simple JIT".)

(Actually, for all the languages there are, I think there's several holes in the programming language landscape right now. You'd think everything would be covered, but it really isn't.)

> a dynamic scripting language that is designed to have a good, simple JIT

Julia is designed along these lines - its JIT compiler really just does AOT compilation at runtime. JavaScript JITs tend to work along the lines of:

    1. Interpret (and gather profiling data including type information)
    2. JIT compile hot code, making assumptions based on profiling data
    3. Deoptimize (fall back to the interpreter) if an assumption is broken
Julia does none of this - it can only work with types that are explicitly stated or that can be inferred statically. For extremely dynamic source code, Julia's compiler must emit slow, highly generic machine code.

This is why Julia code tends to contain many more explicit types than code in other dynamic languages. Compared to other JIT-compiled dynamic languages, Julia trades off some dynamism for compiler simplicity.

That's what I thought Julia did.

I'm proposing that you could probably get back a lot of that dynamicness if you could explicitly say "OK, I'm done being dynamic now", because in general, even Python programs that do things like introspect databases and dynamically create classes on the fly based on that still have a distinct initialization phase. Reifying that into something the interpreter/compiler (honestly, in this language design, that distinction barely matters...) understands might let you have both worlds.

Scheme is a decent first cut at such a language. You would want to restrict or eliminate call/cc and mutation of pairs, and add contracts or some other kind of gradual typing, but most of the other dynamicity of JS/Ruby/Python/etc. is missing, without cutting out metaprogramming.

In practice Chez Scheme is fast, though I guess not quite up to LuaJIT's level even though Lua seems less designed for efficiency than Scheme. (E.g. Lua tables ought to be more expensive than Scheme vectors, and people use metatables a lot where Schemers might use macros...)

> you initialize your program, do all the dynamic stuff, then push the "OK, now I'm done being dynamic" button. After that, the program "freezes" into place, and calling a function with a new type of argument it has never seen before or something becomes an error.

We could even come up with names for those two phases. We could call the first phase "compile time" or "preprocessor/macro evaluation time", and the second phase "run time" :)

> You can see some of the ideas in Julia or LuaJIT, but you pay a penalty on what can be dynamic.

I'm sorry but I think I'm missing something here - what sort of penalty are you paying in terms of dynamicness? Can you be more specific?

I was under the impression that there are things you can't do in LuaJIT that you can do in Lua, but I'm not sure now, since I can't google it up.

Regardless, Perl, Python, and Ruby have simply absurd amounts of dynamicness; you can dynamically add operator overloads to the super-superclass of some object you have in hand, or write new metaclasses then create new classes from them, or stick new implementations of __setattr__ in the middle of a class chain, etc., and the interpreter is ready for you to do it, at great runtime cost. For that sort of thing, the JIT will just give up, generally. You'll note that a lot of these things are things you don't generally have a need to do, which is precisely the thing I'm suggesting be exploited more. My understanding is that Julia is not this dynamic. (For those who may be inclined to jump to the "defense" of Julia, bear in mind I'm citing this as a good thing.)
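For a concrete taste of that dynamism in Python (the class names are arbitrary):

```python
class Base: pass
class Mid(Base): pass
class Leaf(Mid): pass

x = Leaf()
try:
    x + x
except TypeError:
    print("no __add__ yet")

# Retroactively bolt an operator overload onto the super-superclass;
# instances that already exist pick it up immediately, so the
# interpreter must re-check dispatch on every operation.
Base.__add__ = lambda self, other: "patched"
print(x + x)  # patched
```

Every `+` in the program has to be prepared for this to have happened since the last time it ran.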

Another way of looking at my suggestion is to forbid all the things that blow out the JITs, but giving the JIT a phase it is allowed to depend on, which will take some work because, again, simply retrofitting that on to an existing language will be a nightmare in all sorts of ways. You want some deep integration into the language for this. None of the current scripting languages were designed with JITs in mind; JITs are always laboriously wrapped around them years after the fact. What if you went the other way around? (You may get Julia in that case; what if you add the ability to have a pre-pledge phase in too, though?)

(Another idea: What if you write a language, or possibly language family, for truly heterogeneous computing models? What would a language that natively has a concept of CPU vs. GPU code look like? Some sort of explicit, visible management of shared data? We also see embryonic use cases where we want a low-power, cheap CPU that's still a CPU, to do some low-power basic maintenance cases, but has a high-power CPU friend that can be turned on if necessary. If there was a language that afforded such management, would we see more of that thing in cell phones and such? There's a lot of interesting places for programming languages to be created, but we tend to see so much of "it's ${some existing language}, with some slight twist, and no standard library".)

How can you know the builtins and the libraries you call don't use these features, to make such guarantees? I suspect it's rather more common than it appears.

You can't at first, but you don't need to. You just write normal code, with regular unit tests. Once you are satisfied, you can start playing with optimization and see if things break or get faster.

Things like monkey patching and stack inspection are the exception rather than the rule. You may very well be able to safely disable many features at some local level, or in prod but not dev, etc.

One of the event workers for Django monkey patches ‘socket’. A colleague and I discovered that an hour into debugging why something was misbehaving when we switched workers.

> You could support markers stating "I swear this part is never going to use those dynamic features"

You shouldn't have to!

The VM should be able to work this all out for you. One person puts the effort into improving the VM, rather than everyone having to work around it in their programs.

Yet Smalltalk, Common Lisp, and Dylan managed to do it.

Do you want to break all assumptions a Smalltalk JIT managed to reach about a live object?

Just send it a become: message.

Yeah, the dynamism part is probably not a good argument there (JITs were built for Self, after all).

It's really a combination of dynamism and the desire for approachability of the codebase: the CPython codebase is not beautiful, but it's definitely approachable; there is very little use of tricks or complex macros, and the average Python developer can jump around and find things.

> For example, it's hard to optimize even local variable dataflow in python since it's part of the API : you can inspect the local frame of your caller,

This seems like the sort of thing that should be absolutely not relied upon outside of debugging/profiling and other special circumstances. At least in my experience, I've had to write this sort of thing a few times and it's always been a last resort (and mainly for those purposes I mentioned).

Not only is it problematic because it disallows compiler optimizations, it also seems like it would enforce an unchangeable set of local vars.

The author hints at the problem

> Don’t count on your operator overloads to work here, though.

> by keeping these variables around, debugging is more straightforward

The transformations the author is suggesting are not in general legal. Missing operator overloads and inconsistent debugging states aren’t something to gloss over - they’re showing you your optimisations are just wrong!

You would need a sophisticated deoptimisation system with frame states, like JVMs or V8 has, to make them legal.

The Python compiler isn’t dumb - it’s correct.
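A tiny illustration of why such a transformation is unsound; the `Weird` class is contrived:

```python
class Weird:
    def __add__(self, other):
        return "surprise"

def f(x):
    # `x + 0` looks like a no-op a compiler could fold away,
    # but for types with overloaded __add__ it is observable.
    return x + 0

print(f(Weird()))  # surprise
print(f(5))        # 5
```

Folding `x + 0` into `x` would be correct for `int` and wrong for `Weird`, and the compiler can't tell which it will get.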

From halfway into the 2nd paragraph, it was obvious that the author had been looking at Python all wrong

> I wonder if the code I’m writing is putting undue constraints on the bytecode compiler and limiting its options

If that's what you're wondering while writing Python, then you probably shouldn't be writing Python.

Ooooor, maybe you like to think about what's actually going on when you write code?

I certainly basically mentally execute code to some extent as I write it. Is there a way to program that doesn't do something like that?

Nobody has any idea what a modern x86 cpu does under the hood because we have no access to the microcode.

The only thing you can do is run your code and get an empirical answer to the question 'Is it fast enough'.

A lot of the relevant low level optimizations are documented - like micro and macro op fusion. The published optimization guide is thousands of pages long.

Technically true, but also a useless point of view, because it's essentially saying "it's all arbitrary chaos underneath". But it isn't. You can get 90% there with a basic understanding of x86 assembly and some CPU-level abstractions.

(I too execute the code in my head, but these days usually a couple of layers above x86 ASM, in my own mental "bytecode" that tracks how expensive some of the programming language's operations and stdlib functions are.)

Whether relevant things fit into L1 cache will make an order of magnitude more difference than whether you're using a nominally 2-cycle or 4-cycle assembly instruction. I don't believe any human programmer on a modern CPU is able to keep track of what cache level their code operates at entirely in their head; certainly such a human would not be thinking in terms of x86 assembly or normal language stdlib functions. So attempting to execute the code in one's head is overwhelmingly likely to be a waste of time.

Sometimes, you have to.

An example - some code I was working with gave subtly different results on two processors. It turned out this was because one of them implemented FMA - Fused Multiply Add.


It wasn't particularly painful to find this, but if you were attempting to debug this kind of problem without considering lower level issues, you'd spend a whole lot of time banging your head against a wall.

If the explicit fused instruction (so VFMADD..., we're not talking micro-code fusing, are we?) doesn't have the same semantics as a separate multiply and add then it's a compiler bug to select it, isn't it?

I think if you're talking micro-code fusing then Intel does guarantee it has exactly the same semantics as a separate multiply and add.

Interesting fact - I believe modern Intel architectures actually only have fused multiply add. If you do just a multiply it'll do a fused multiply and add zero.

Compilers almost never emit instructions that are microcoded on any modern processor.

Please explain what you mean. The microcode is the “firmware” of sorts of the CPU. What instructions we tell the CPU to execute and what the CPU decides to actually do to achieve the same result you are asking for can be different. What do you mean by the instructions not being microcoded?

> What do you mean by the instructions not being microcoded?

Most common instructions in modern CPU cores are decoded directly into the corresponding micro-operation(s), without involving the microcode ROM. Only the most complex instructions, involving many micro-operations, will use the microcode decoder. Also, there are often several copies of the simpler instruction decoder, so the CPU can decode several instructions in a single cycle, while there's normally a single copy of the complex (microcode-using) instruction decoder.

A good resource if you want to read more is part 3 of https://www.agner.org/optimize/ which describes the decoder (and other parts) of several families of x86 processors.


There is no rule that every instruction has to be implemented with microcode. Simple instructions like adding two numbers don't need it. Microcode is generally used for more complicated instructions, often those with a variable number of cycles. They are also usually not the fastest way to achieve a result.

> Is there a way to program that doesn't do something like that?

Logic and relational programming (Prolog, SQL) comes pretty close. "Here's some rules about the data. Execute."

And yet for any but toy-level (and logic/relational math research) programming, you have to pay attention to how it works under the hood.

I remember university classes in Prolog. Solving logical puzzles tended to be declarative, but trying to write any actual program involved shifting to the mindset that you're dealing with a regular programming language with a built-in DFS engine underneath (the same way programming in JS involves learning you have a hidden event loop running in the background). For practical use of SQL, you need to be able to choose between equivalent queries based on how the database engine actually turns them into lookups.

> And yet for any but toy-level (and logic/relational math research) programming, you have to pay attention to how it works under the hood.

Maybe 30 years ago. It's really not so important nowadays. Computers are fast.

> For practical use of SQL, you need to be able to choose between equivalent queries based on how the database engine actually turns them into lookups.

Maybe there are some cases where you need to, but there are also million-dollar businesses that have succeeded while only ever writing their SQL queries as naively as possible. The notion that all non-toy practical uses need to understand such low-level details is just utterly false.

> there are also million-dollar businesses that have succeeded while only ever writing their SQL queries as naively as possible.

Citation needed. Given engineering staff sizes and the amount of performance problems RDBMSes have, that seems highly doubtful.

Personal working experience; afraid I'm not willing to name specific past employers here.

Even then you need to understand the underlying execution model. Prolog has cuts that guide the backtracking engine. SQL performance is hard to optimize without looking at query plans.
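The query-plan point can be seen with nothing more than sqlite3 from the stdlib (the table and index names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")

# Same declarative query, two very different execution strategies:
print(con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE v = 'x'").fetchall())
# -> a full table SCAN

con.execute("CREATE INDEX idx_v ON t(v)")
print(con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE v = 'x'").fetchall())
# -> a SEARCH using idx_v
```

The SQL never changed; only by looking at the plan do you see which physical lookup the engine picked.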

If you are working with large data sets and want it fast then yes.

But in regards to "Is there a way to program that doesn't do something like that?", I would say relational is that way.

When writing SQL, I think of "what is the DB engine going to do when I ask this". I don't see that as being substantially different.


You must be psychic if you can figure out what the DB engine is going to do without running it.

Bitmap scan, parallel index scan, merge join, nested loop join, etc. I always have to try and see.

It's not a superpower at all--instead, I'd argue that it's the most important skillset (after understanding the relational model itself) for interacting with RDBMSes at scale.

You don't necessarily have to know whether an index scan is going to be parallelized or whether the CBO is going to switch to a loop because its heuristics think your result set is likely tiny to think about things like "is this data being accessed via the right indexes? Is the data likely in memory or not? Will my joins, given my consistency level, impose locks or concurrency considerations on the tables I expect?"

Oh, I'm not generally correct, but my process of "writing" generally goes:

  - Guess what I think is going to happen  
  - Ohhh, it did [thing]. Lemme tweak if I can convince it to do [other thing] that should work better  
  - <Repeat>
I'd argue testing against the DB is a core part of the process for writing a query.

The Python bytecode is not where the work is occurring. The bytecode is just a sentinel for a polymorphic dispatch table. Even if the bytecode were infinitely fast, it might speed up your program by 20%.

Human curiosity and meeting a task's requirements can coexist in this instance (or any other).

Also, the author points out in the first sentence that they are subject to external constraints that require Python.

Dumb was not used in a derogatory way in this article.

As a long time Pythonista, I find the article quite balanced actually.

+1. I'm reminded of a tweet of yours from a few years back!


Try doing something like that at compile time.

... I am not sure I am impressed. The rules for partially evaluating eval are the same as for partially evaluating any other code. I am just not sure anybody else bothered to do something like that with eval.

Edit: I have seen Guile's optimizer/peval turn 30 lines of messy (to human eyes) code into a single atom result. This is actually rather simple as long as you can reason enough about the code and it has no side effects. The hard part is developing heuristics so it doesn't impact compile time too much.

> I am not sure I am impressed

My PhD defence committee was impressed, so that was good enough for me.

In Ruby almost no code is referentially transparent, so simple PE is not enough. You need to speculate in some complex ways, and writing that to be practical is not easy.

Then I am impressed :) I don't know enough Ruby to know exactly why, though. I am just used to seeing this kind of behaviour in Guile, and for Guile I don't really see why that would be hard, at least if you have enough visibility.

I might have gotten it wrong for Guile as well :) Sorry to sound so dismissive. I would love to read your thesis!

Common Lisp does that for breakfast :).

I think it might also be a "lazy" solution to a more complex problem, which this post does not address: Python allows functions to be nested to, IIRC, an unspecified depth (probably limited by something like stack size). I imagine that, given how Python implements variable lookups, this could be problematic for optimization: in the face of nested functions and control paths, it would be very difficult (though probably not impossible) to prove variables are not used.
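A small sketch of the lookup machinery involved; the closed-over variable lives in a real runtime cell object:

```python
def outer():
    x = 1
    def inner():
        return x  # compiled as a cell lookup, not a plain local
    return inner

f = outer()
# The captured variable survives in f.__closure__ after outer() returns,
# so proving it unused across arbitrary nesting would take
# whole-program analysis.
print(f.__closure__[0].cell_contents)  # 1
```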

Well, yes. It's a naive interpreter. The source is translated to the obvious bytecode. At run time, everything is a PyObject, including integers and floats. Everything which you'd expect to be a dict really is a dict. So you spend a lot of run time checking the types of objects, dispatching, and looking up variables. There's a bit of memoization, I think, to cut down some of the lookups.
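You can watch the naive translation with the stdlib `dis` module:

```python
import dis

def add(a, b):
    return a + b

# Even trivial code compiles to fully generic bytecode: the binary-add
# opcode dispatches on the runtime types of both operands every time.
dis.dis(add)
```

The listing is just LOAD_FAST for each argument, one generic add opcode, and a return; nothing is specialized to ints, floats, or strings.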

It's tough to optimize a language where any thread can change the internal variables of any other thread at any time. Python has gratuitous dynamism. Rarely does code muck with the state of another thread, but it can, and the code has to handle the worst case. There's no such thing as a thread-local variable in Python.

> any thread can change the internal variables of any other thread at any time

Why is this different from Java, where the JVM can assume that local variables cannot be changed by other threads?

That is, in the absence of special cases, which I believe include `volatile` and function calls. (Any other special cases?)

Am I missing something, or does your first question answer itself? The question asserts a specific difference. (I don't remember how it works in Java to know if the assertion is correct.)

As I understand it, the JVM can assume that local variables are not modified by other threads. My question is why can't CPython make the same assumption?

Because it's not true! Threads in Python can muck around with the state of any other thread, at any time – it would be a language change to assume otherwise.
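A sketch of that cross-thread access using only stdlib introspection (the sleep durations are arbitrary):

```python
import sys
import threading
import time

def worker():
    secret = 1234
    time.sleep(0.5)   # keep the frame alive while we peek at it

t = threading.Thread(target=worker)
t.start()
time.sleep(0.1)       # give the worker time to reach its sleep

# sys._current_frames() hands any thread the live frame of every other
# thread, locals included, so the interpreter must keep them accurate.
frame = sys._current_frames()[t.ident]
seen = frame.f_locals.get("secret")
print(seen)  # 1234
t.join()
```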

What the OP means is that it's not true in Java either, but the JVM assumes it is true until it detects otherwise

And some of the optimizations that do exist are interesting. I wonder if, post-Spectre/Meltdown, the old side channel results involving Python's integer behavior would be more widely viewed as problematic.

I'm not sure which "side channel results" you refer to, but I don't think Python tries to prevent Spectre-style attacks at all. I don't believe it's necessary because, unlike JavaScript and Lua, Python doesn't try to support running within an in-process sandbox.

They seem to have been pretty well known among my former co-workers so I'm not sure if there's an original paper on it or what, but things like detecting the difference between small (cached) integer creation time and the full object instantiation, computational complexity attacks on dictionaries, etc.

These are distinct from eg Spectre because the point isn't isolation breaking, it's leakage. But at least a few years ago there was always a question of "are side channel attacks real", and I think we're pretty clearly over that question now.

That's a package to create a namespace. The compiler doesn't know you're doing that. Nor are those variables really thread-local. You can pass a reference to them to another thread.
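A minimal sketch of that point using the stdlib `threading.local`:

```python
import threading

local = threading.local()
local.data = [1, 2, 3]   # the *binding* is per-thread...
shared = local.data      # ...but the object itself can escape

def worker(obj):
    obj.append(4)        # another thread mutates it freely

t = threading.Thread(target=worker, args=(shared,))
t.start()
t.join()
print(local.data)  # [1, 2, 3, 4]
```

The namespace lookup is per-thread, but nothing stops the referenced object from being shared and mutated elsewhere, so the compiler can't treat it as truly private.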

How is what you've said not applicable to C/C++?

You can easily pass the address of a thread_local variable around. Threads can easily muck around with objects from multiple threads. Hell that's even true with Java too.

The reason threading is weird on Python is due to the GIL. Optimizing for threads is tough because Python explicitly avoids defining a threading model & assumes all implementations have something like the GIL. That doesn't mean no optimization is possible.

One big difference is that the C/C++ standards say messing with a variable from multiple threads without locks is undefined behaviour, while Python claims there at least won't be an interpreter crash.

Right. The options in this area are:

- Don't check. Programs will crash if a thread-local variable is passed outside the thread (C, C++)

- Naive interpreter. Everything works, slowly, because the worst case code is used for everything. (CPython)

- Really clever just-in-time recompilation when what seemed to be thread-local suddenly gets accessed from another thread. (PyPy, Java?)

- Compile time checking to prevent this. (Rust)

- No threads (classic Javascript)

- Explicit shared areas into which only certain types can be placed (ECMAscript 2018)

I fail to see how what you said is relevant. All you've said is that Python & C/C++'s memory models are different from each other, which yes they are & no one has stated otherwise. My interpretation of the claims made were:

1. Python doesn't have thread-local variables

2. Python's memory model around threading is what prevents any optimization.

Both are clearly incorrect. For example, take Java. You have the same behaviour as Python: your VM won't crash & your code won't even generate a panic. Java clearly has thread-local variables & an optimizing compiler.

Python's inhibition around optimizing is partially about the guarantees it makes around APIs that do self-introspection, but mainly about keeping the interpreter simple. For a counterpoint, see PyPy, Cython, Jython, IronPython, etc., which optimize Python just fine. They mostly trip over extension modules being tied directly to the non-standardized CPython module API, and that API is tied intimately to implementation decisions of CPython.

There is essentially no such thing as threads in Python, unless I am completely missing something.

The default Python interpreter does absolutely nothing to make its execution thread-safe, and that appears to be a deliberate decision on their part to keep the complexity of the interpreter down.

On the contrary, the Global Interpreter Lock provides many thread-safety guarantees. These are based on the fact that threads can only be interrupted between, not within, byte code instructions.

For example, a single list can safely be extended (as in `l += [1]`) concurrently from several threads.
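A quick check of that guarantee (the thread and iteration counts are arbitrary):

```python
import threading

l = []

def worker():
    global l
    for _ in range(100_000):
        # list.__iadd__ runs as a single C-level call, which the GIL
        # makes atomic, so concurrent extends never lose elements.
        l += [1]

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(l))  # 400000
```

A language without this guarantee could legitimately lose updates here; CPython's GIL is why it doesn't.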

Ah, you are correct, it's the non-threadedness of the interpreter itself I was thinking of.

On the contrary. For example, each python thread you create results in the creation of a POSIX thread in the interpreter, when you're running CPython on Linux.

The GIL just means that only one of those threads will be executing pure-Python code at any time. But they're still normal threads.

If you want details about the GIL and Python threads, this is a nice presentation:


JavaScript is really lucky. Imagine having your pet language's execution continuously optimized for decades by the likes of Apple, Google, Microsoft and Mozilla.

The amount of brainpower that has gone into making JavaScript fast is amazing.

Well, there have been multiple attempts in Python's history to make it faster, yet they all failed.

Most noteworthy one is Unladen Swallow from Google:


And Dropbox's now dead Pyston.

While both are pretty dynamic, Python has a rich ecosystem of C extensions, which exposes a lot of interpreter details to developers, making moving away from the CPython implementation much harder, or outright impossible if compatibility is required.

Unladen Swallow had a small fraction of the resources that Google allocated to V8, though. I think it was only worked on by a few interns.

Pyston was better supported, but I think just too ambitious - IIRC it was initially intended to be a full rewrite and to be compatible with C-extensions.

Pyston, as far as I was tracking it, seemed to be mainly 2 devs from Dropbox - hardly better off.

You have a point though: no company can replace JavaScript, the cost is simply prohibitive. But Python, being mainly a backend language at the time, can be more realistically replaced with newer, more performant alternatives like Golang and, to some extent, Node.js.

But it has its own stronghold, which is data/ML land. However, that community has gotten around Python's limits in its own way: either they tolerate the slowness because it happens behind the scenes, or they push the performance bottlenecks down into C extensions.

So in the end, I guess people love complaining about Python's performance, myself included, but it never reaches the breaking point where they say enough is enough.

The number of devs is not so important. One author left alone, free from corporate BS, for three years can achieve more than a collection of 10 randomly hired programmers who just disturb the lead dev.

As someone who has worked on language runtimes in a corporate environment, I believe there's some truth to this. One successful example of this would be Mike Pall, of LuaJIT fame.

The amount of resources can't be compared.

Hundreds of millions have been poured into Chrome, IE, and Firefox. The best specialists in the world in the field of interpreters and JITs worked on them. For years.

The Python projects were mostly personal initiatives approved by the companies.

PyPy is faring pretty well in terms of performance and compatibility.

Performance maybe, but not without gotchas.

However, compatibility is certainly not:


Scipy/scikit-learn/pandas/matplotlib/h5py aren't supported, nor are all the packages that depend on them, which makes it not an option for probably the most important segment of Python's user base.

Those packages work just fine on pypy. They just don't support Python 2.

There is https://github.com/graalvm/graalpython too, doesn’t seem to be dead yet.

Would it help make things more cross-interpreter if new Python extensions that communicate with C code were written using the standard ctypes package instead of C extensions that depend on CPython implementation details? Or even porting the existing extensions? Would this be technically possible?
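A minimal sketch of the ctypes approach, assuming a POSIX system where `strlen` can be resolved from the running process:

```python
import ctypes

# ctypes goes through a generic FFI rather than the CPython C-API,
# so code written this way is far easier for PyPy and friends
# to support than a hand-written C extension module.
libc = ctypes.CDLL(None)  # the running process; exposes libc on POSIX
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello"))  # 5
```

The trade-off is per-call overhead: every argument is converted at the boundary, where a C extension works on PyObjects directly.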

> The amount of brainpower that has gone into making JavaScript fast is amazing.

It really is, and it really makes me sad that the manpower wasn't instead invested in a better underlying language.

If only Eich had been allowed to implement a Scheme for Netscape Navigator! I really dislike Scheme (it's under-specified, and has some poor design decisions like a single namespace), but having a homoiconic language available within every browser in the world would have been amazing. The manager who squashed that is single-handedly responsible for setting back the development of computing by at least twenty years (and more likely thirty, since I predict that it'll be at least another decade before we're able to escape the JavaScript rut).

The best part is that they cannot afford to stop doing that.

Yeah, but I think we are in the diminishing returns phase, already. I hope I'm wrong, though.

At least Python has multithreading (even though it has the GIL). JavaScript only has some kind of decoupled worker processes that can communicate only through a narrow message channel, or through a low level shared memory array.
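To illustrate the contrast: Python threads share one heap and can mutate the same objects directly, something JS workers can't do outside of SharedArrayBuffer. A minimal sketch:

```python
import threading

# a plain object shared by all threads; a lock guards the increment
counter = {"n": 0}
lock = threading.Lock()

def work():
    for _ in range(1000):
        with lock:
            counter["n"] += 1

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter["n"])  # 4000
```

Of course, the GIL means these threads don't run the bytecode in parallel; they only interleave, which is exactly the trade-off under discussion.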

Worklets are going to add more parallelism to JavaScript.

Their usefulness is very restricted.

What we want instead is:

- general concurrency

- access to any (garbage collected) JS variable/object from any thread.

- a garbage collector that works across all threads.

- no global locking of threads by the runtime system

And then you get into the problem that Python has. The reason why Python has the GIL is that all of those things are really hard to do right in a performant way.

I think JS is making the right call here: a small subset just to engage the extra cores.

Indeed, a patch I created in 2010 for the peepholer to optimize 'a,b=b,a' was rejected with one particular reason being "changing order of execution starts to venture into territory that we've stayed away from (on purpose)." https://bugs.python.org/issue10648

Not that I disagree with the patch being rejected, only that this is an example of the compiler's philosophy

I know you don't disagree with the patch being rejected, but I have to say that the reviewer gave you a firm example in which your patch changes expected behavior.

Philosophy aside, that is a fine reason to reject the patch unless you can convince the reviewer (and the committee) that you are in the right (you very well may be).

Being in the right is very much a matter of taste. Python has always erred on the side of less optimization and more obvious behavior. So something that optimizes for performance while introducing subtle possibilities for bugs would likely be accepted by the C++ committee, but not the Python one.

And I think that's ok. Python wants to be simple and straightforward and performance was never a goal.

If you need performance don't use Python, or write a python library in C/C++/Rust and do the heavy lifting there.

Yes, I agree that the patch was wrong in that it shouldn't optimize NAME/GLOBAL. I was unaware of those semantics when I wrote it. However, that doesn't apply to FAST, the most common LOAD/STORE ops. The other reasons are why the patch wouldn't be accepted if it were fixed.

Also, just want to also say Python reviewers are great. They're very good at explaining issues with patches & are willing to collaborate on improving a patch when necessary. There's a long list of backlog issues because there is simply not enough review to go around, not because of a lack of quality. Highly recommend CPython to people who want to get involved in contributing to an open source project, even just testing patches helps

Guido himself said "Python is about having the simplest, dumbest compiler imaginable." Even a subset of this dumbest thing turned out to be a lot to cover when I dug into it at https://codewords.recurse.com/issues/seven/dragon-taming-wit...

Fast bytecode-based implementations of dynamic languages, like JavaScriptCore, do optimizations after the bytecode is generated. The bytecode is just the common frame of reference for profiling and OSR.

LuaJIT is the same, as is V8 (at least with Ignition & TurboFan). The article's assumption that fewer instructions = faster does not necessarily apply to these implementations, but is approximately correct for CPython (because it's a simple bytecode interpreter).

Python's slowness has been a very real problem for me. It's a great language to write data-wrangling scripts in, but when there's a little bit more data, it's terribly slow. And then you have to rewrite your script in a faster language, like C++ or Rust. I've observed speedups on the order of 20x this way. Which means that I could've saved myself a lot of time by writing them in another language at the start.

>And then you have to rewrite your script in a faster language, like C++ or Rust. I've observed speedups on the order of 20x this way.

My assumption of possible speedups (from past experience) is roughly.

Python -5x-> Perl -2x-> C#/F# -5x-> C++ -2x-> Rust

Rust is usually a lot faster than C++, because in my data wrangling I often need a lot of string operations that are much easier to do with zero-allocation in Rust than in C++. While you can write the same in C++ it becomes very hard to manage without the borrow checker.

I don't like Perl, so I basically never use that, and I find it's easier to write something that works in Rust than in C++. So usually I'll pick F# or Rust, depending on how fast it needs to be and how easy it needs to be to write it. F# is super nice for writing and Ionide code completion is just miles ahead of Rust code completion at this point. Being forgetful I barely need to look anything up in F#, but in Rust I always need to look at the docs. This will probably improve in the future (I hope). If Rust code completion becomes amazing I think I won't even start in Python anymore. (Pandas is the only thing that pulls me back into Python usually).

An F# like language hosted within Rust would be a dream. With a GC and stack allocation, optional lifetime annotations when you want to remove the GC overhead.

I think with the newer versions of F# you can do some manual memory management if needed.

I doubt it.

If you start by writing it as a C++ program first, it will take a lot of time.

Python allows you to get a good idea of how the thing works. And if, eventually, you find out that it's not fast enough (even with PyPy, numpy, a few lines of Cython, etc.), you may rewrite it in C++. But the rewrite is going to be much, much faster to do, because you now know what's up, and can just translate it to C++ and focus on its gotchas.

But usually, I know how the thing works. I have data in this format, and I need data in this other format. The transformation may include cross-referencing among the data, processing strings in various ways, computing sums and counts, calling external libraries, etc. These are simple to code and well-defined tasks. In my experience, they rarely benefit from prototyping.

My recent example is processing around 10m chess games to get statistics on all positions that occurred inside them (above an occurrence threshold). This required parsing the games in chess notation, using a chess library to simulate the moves to get positions, and counting how often each position occurred, in which matches, etc. My first try was with Python. After I realized it's unbearably slow, I tried using PyPy, running multiple processes for each core, etc., and in the end my approximation was that the job would finish in a couple of hours. I tried more optimizations and nothing helped. And there's no number crunching to use numpy for. Then I wrote the same script in Rust, and it ran in a couple of minutes, finishing well before the original Python script would have finished, had I left it to run. I arguably didn't save time by using Python here.

If you knew how it worked, and it was fast to write in Rust, why didn't you write it in Rust in the first place?

Because it was easy in Python.

It was easy to write the same script in Rust, then, because you already got the Python version working.

Or using a language that supports AOT/JIT from the get go, which most likely will avoid the need of any rewrite.

And of course this write-it-twice dilemma is one major impetus behind the invention of Julia, which seeks high performance in a language more user-friendly than C/C++.

The local variable elimination optimization breaks introspection features. Consider:

    def foo():
        a = 1024
        return a

Other code can inspect foo's local variables through its frame object, so the compiler can't silently eliminate 'a'.
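A minimal sketch of that kind of introspection, using sys._getframe (a CPython implementation detail, hence the leading underscore):

```python
import sys

def inspect_caller():
    # sys._getframe(1) returns the caller's frame; f_locals exposes
    # its local variables by name
    return dict(sys._getframe(1).f_locals)

def foo():
    a = 1024
    # another function can observe this local through the frame,
    # so the compiler is not free to eliminate it
    seen = inspect_caller()
    return a, seen

print(foo())  # (1024, {'a': 1024})
```

Any optimization that dropped the assignment to 'a' would change what inspect_caller observes, which is why CPython leaves it alone.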

On a related note, I believe it was also a deliberate decision to keep the name of all local variables during compilation. This is of course very different from C.

Actually, C compilers with -O0 keep all of the local variables, for debugging purposes. -O0 is deliberately unoptimized. And if you file bugs complaining about debugging at higher optimization levels, well, if you want loop fusion and splitting for high performance, you're going to lose some debugging.

No, that's qualitatively different. There's no way for a piece of C code to look at the local variables of another C function. But in Python you can do that. What you mean is that C compilers keep the local variables on the stack, and do not eliminate loads/stores, so that a debugger can see what the variables are and can read DWARF information to figure out their names. There is no way for the code itself to learn these without implementing DWARF parsing yourself.

I wonder if Python could similarly have opt-in optimization (‘-Osomething’) for those who are sure it won't break anything. Though I guess the standard lib might get in the way immediately.

CPython already has -O and -OO, they just don't do much that's useful (and are mostly detrimental): -O will skip asserts and set __debug__ to False, -OO will also remove docstrings.

CPython's compilation pipeline just has a fairly straightforward peephole optimiser: https://github.com/python/cpython/blob/master/Python/peephol...
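A small sketch demonstrating the -O behavior by invoking the interpreter in a subprocess:

```python
import subprocess
import sys

# under -O, CPython compiles asserts away entirely and sets
# __debug__ to False
cmd = "assert False; print(__debug__)"

normal = subprocess.run([sys.executable, "-c", cmd],
                        capture_output=True, text=True)
optimized = subprocess.run([sys.executable, "-O", "-c", cmd],
                           capture_output=True, text=True)

print(normal.returncode)         # non-zero: the assert fired
print(optimized.stdout.strip())  # False: the assert was stripped
```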

You're describing opt-in pessimisation, not opt-in optimization.

That's an incredibly stupid thing to put into production code, though, and should be regarded as a debugging feature.

It's acceptable to have different compilation modes for different levels/fidelities of debugging introspection.

> Unless the Python language specification is specific about this case

No, there's no such thing as a Python language specification. Every implementation of Python is based on CPython. This also leads to the sad fact that every non-CPython implementation has to provide a C API that's compatible with CPython if they want to be adopted widely.

> Every implementation of Python is based on CPython.

Not quite true - there are notes in the documentation that call out parts of CPython's behaviour as "implementation-specific".

These are quite rare though. I seem to recall the suggestion being made that MicroPython shouldn't call itself "an implementation of Python" simply because its internal string encoding is UTF-8.

> This also leads to the sad fact that every non-CPython implementation has to provide C API that's compatible with CPython if they want to be adopted widely.

A specification for the C API wouldn't necessarily help with this. The problem for alternative implementations is that the C API is closely tied to CPython internals - there is no reason that having a specification would change that.

Designing and specifying a more abstracted API would be useful - but I don't see it happening.

> there are notes in the documentation that call out parts of CPython's behaviour as "implementation-specific"

In a document like this, that means "doing whatever CPython does in a way that we are not bothering to commit to documentation" (and that any other implementation will have to reverse engineer and implement).

Closest thing to a spec is the Python Language Reference: https://docs.python.org/3/reference/ which is actually pretty good.

As the article points out, CPython is dumb on purpose, which benefits transparency, maintainability, and development. I've always appreciated this -- it's obvious when you're doing something expensive in Python because you know the interpreter is going to do pretty much what you'd expect it to.

"...in this case for my wife’s photo blog". Stunning bird photographs, really exceptional. And nice work on that responsive static album generator. Now time to continue reading about CPython byte code...

Some people are working on optimizing ruby "bytecode" (what the VM interprets): https://developers.redhat.com/blog/2019/02/19/register-trans...

I'm pretty sure the same concepts could be applied to CPython.

Seems like part of the problem is that there might not be much overlap between the types of things he points out that aren't optimized - which are obscure and useless - and what could "legally" be optimized.

So, yes, something like this isn't optimized:

    def foo(): return [1024][0]

But it's also pretty unlikely to see something like that in actual code. One could argue that that's just an example of a /type/ of a more general case, but I think you'd find that the more general case can't be safely optimized because Python is insanely dynamic. So e.g.

    def foo(): return SomeArrayLikeThing(1024)[0]

can't be optimized because not only might the behavior be very different from a typical Python array, the behavior could very easily not be determined until the exact moment when that code is run.

IOW, the things the author points out are things that, in practice, end up being so narrow and rare that there's no real point in trying to optimize them.
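To make this concrete, a sketch of what CPython emits for the first example (exact opcode names vary between versions, e.g. BINARY_SUBSCR vs. BINARY_OP):

```python
import dis

def foo():
    return [1024][0]

# the list construction and the indexing appear literally in the
# bytecode; the expression is not folded down to the constant 1024
dis.dis(foo)
```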

It hasn’t been updated in 5 years, and IIRC Python2 only - but Russell Power’s “Falcon” was an attempt to solve these issues that showed a lot of promise.


Thanks for this - I hadn't seen it before. Looks like an implementation of the core Python interpreter designed to work more like the reference interpreter for Lua.

Unfortunately it seems to be incomplete - and even if support for exceptions etc were added, I suspect some of Python's more highly dynamic features (e.g. inspection of stack frame objects) would be extremely difficult to support in a compatible way.

I think supporting such niche dynamic features would be essential - I've found the Python community to be strongly against the idea of changing (or removing) such features in the pursuit of increased performance.

Have you seen this twitter thread? https://twitter.com/mitsuhiko/status/1091802711908106240?s=2...

Lots of high profile python developers not so against a version of python that was less dynamic and more performant.

If you want to see an advanced Python compiler, have a look at PyPy. While the CPython bytecode compiler is "dumb", it's as smart as it can be. You can do crazy stuff with speculative replacing of bytecodes with guards and type-specific bytecodes, but then you might as well create a full just-in-time compiler. It's incredibly hard to create optimizations at the bytecode level without changing semantics, which Python rightly stayed away from.

This [1] is a bytecode optimizer for Python - it takes the bytecode from CPython's compiler, and outputs bytecode with a few select optimizations.

Anyone know of someone taking this idea further?

[1] http://code.activestate.com/recipes/277940-decorator-for-bin... - it's old, for Python2.4

I work on a library for doing Python bytecode transformations like this, but with a more abstract API. Here is a similar transformation with this library, which works with Python 3: https://github.com/llllllllll/codetransformer/blob/master/co...

That's why PHP 7 is now 2x faster: just simple escape analysis, optimizing to stack allocation, and omitting unnecessary refcounting.

But you need to be aware that optimization passes are costly, and with a dynamic language this cost adds to overall run time, unlike with static languages where the optimizer may spend seconds but run time is unaffected.

The optimization is a one time cost at load time that Python already caches anyway.

Even the Java compiler is pretty stupid: it’s HotSpot doing amazing things under the hood that makes it fast.

Some optimizations can be enabled

"python -OO -m compileall mydir"

But it merely strips asserts and docstrings

I think CPython is quite clear about not optimising python code, no? I think both the docs and python programmers quickly disabused me of that notion (although this was some years ago so I guess things could have changed).

After all PyPy's selling point is that it does optimise, isn't it?

Of course he does not intend to criticize it, but to show how dumb the compiler is.

> To be clear: This isn’t to say CPython is bad, or even that it should necessarily change. In fact, as I’ll show, dumb bytecode compilers are par for the course. In the past I’ve lamented how the Emacs Lisp compiler could do a better job, but CPython and Lua are operating at the same level. There are benefits to a dumb and straightforward bytecode compiler: the compiler itself is simpler, easier to maintain, and more amenable to modification (e.g. as Python continues to evolve). It’s also easier to debug Python (pdb) because it’s such a close match to the source listing.

Right, the CPython runtime is first trying to be correct (and IMHO it is). Very few optimizations are made by the compiler/interpreter. This has long been known by Python devs, and exploited by specifically using APIs known to be written in C/C++, e.g. using map instead of a list comprehension. map was generally faster (at least in the 2.x days) than an equivalent list comprehension.

Just FYI this specific wisdom you're suggesting is exactly backwards, or at least could be depending how how you read it.

    timeit.timeit(setup='x = range(10000); l = lambda n: n + 1', stmt='map(l, x)', number=1000)
takes approximately 1.2 seconds. The same with a listcomp:

    timeit.timeit(setup='x = range(10000)', stmt='[n+1 for n in x]', number=1000)
runs in a half second.

Although dropping the inlined function runs slower still (1.6s):

    timeit.timeit(setup='x = range(10000); l = lambda n: n + 1', stmt='[l(n) for n in x]', number=1000)
The second is the fastest because you drop the overhead of the function call, which does stack pushes and pops, and instead do them locally. The last is the slowest because it does the stack pushes and pops, as well as loads extra globals, which is slow.

On 3.6 the map version runs in 0.00067 seconds, because it doesn't actually do anything, it just constructs a generator object.

The general point is well taken. My favorite bit of trivia like this is the fast matrix transpose: zip(*matrix), which does everything in pure C, and is also likely the shortest way to do a transpose in Python.
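For example:

```python
matrix = [[1, 2, 3],
          [4, 5, 6]]

# zip(*matrix) unpacks the rows as arguments and walks them in
# lockstep, entirely inside zip's C implementation
transposed = list(zip(*matrix))
print(transposed)  # [(1, 4), (2, 5), (3, 6)]
```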

> Just FYI this specific wisdom you're suggesting is exactly backwards, or at least could be depending how how you read it.

Yeah. map is usually faster if you already have a function to invoke (especially if that function is a builtin or in C). If you have an expression and have to wrap it in a lambda to get it in map, it's going to be way slower than the listcomp because all things considered CPython's function calls are expensive.
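A sketch of that case, timing map with a C-implemented builtin against the equivalent listcomp (absolute numbers are machine-dependent; on CPython the map version usually wins here because no Python-level call happens per item):

```python
import timeit

# map with the builtin str: iteration and calls stay in C
t_map = timeit.timeit('list(map(str, x))',
                      setup='x = range(10000)', number=100)

# listcomp calling str: one bytecode-level call per element
t_comp = timeit.timeit('[str(n) for n in x]',
                       setup='x = range(10000)', number=100)

print(t_map, t_comp)
```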

zip(*matrix) is the coolest pythonism.

Aye, I did not think he was criticising. However on first read it did sound to me like he was writing as though he was uncovering some arcane fact.
