
The CPython Bytecode Compiler Is Dumb - ingve
https://nullprogram.com/blog/2019/02/24/
======
obl
There is a reason why basically every attempt to make this kind of language
fast has to support some form of on-stack replacement.

For example, it's hard to optimize even local variable dataflow in Python,
since it's part of the API: you can inspect the local frame of your caller,
so you have a problem as soon as your function contains a single call. And no,
you can't know statically what the call target is, since it can be replaced
dynamically.
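
As a minimal illustration of that introspection (this uses CPython's real
sys._getframe; the function names are just for the example), a callee really
can read its caller's locals:

    
    
        import sys

        def callee():
            # Reach into the caller's live stack frame and read its locals.
            print(sys._getframe(1).f_locals["x"])

        def caller():
            x = 42   # if this local were eliminated or renamed, the lookup above would break
            callee()

        caller()     # prints 42
    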

So either you perform the optimization anyway and then have to support
reconstructing values correctly for introspection, or you do that generically
with OSR and fall back to the simple introspection of the interpreter.

Either way, there are not a lot of "simple optimizations" when the language is
so dynamic.

~~~
sametmax
You could support markers stating "I swear this part is never going to use
those dynamic features", disabling them for you and others.

E.g.: I have no problem putting in a few annotations telling Python that
neither my program nor its dependencies are going to override builtins.

We could make sure those markers can only be set in __main__, and crash with
very explicit errors in the unlikely event that anything down the stack
decides to do otherwise.

Indeed, many dynamic features are rarely used. They are handy from time to
time, but I wouldn't miss them in most code.

~~~
jerf
"You could support markers stating "I swear this part is never going to use
those dynamic features", disabling them for you and others."

Right now, the swing in general is toward statically-typed languages that are
more convenient to use, but I've thought there's room for a new dynamic
scripting language that is still dynamically typed, but is written from the
beginning to focus on speed. You can see some of the ideas in Julia or LuaJIT,
but you pay a penalty in what can be dynamic.

One of the ideas I've had is more like the "pledge" feature that OpenBSD
recently introduced. Rather than stating up front "I will not use this
feature", you initialize your program, do all the dynamic stuff, then push the
"OK, now I'm done being dynamic" button. After that, the program "freezes"
into place, and calling a function with a new type of argument it has never
seen before or something becomes an error.
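
Something like this purely hypothetical sketch - runtime.freeze() exists
nowhere, it's only here to make the idea concrete:

    
    
        # Hypothetical: no current language or runtime provides freeze().
        import importlib, sys

        # Phase 1: be as dynamic as you like - import plugins named on the
        # command line, generate classes from data, monkey-patch, register
        # handlers...
        plugins = [importlib.import_module(name) for name in sys.argv[1:]]

        # Phase 2: declare the dynamic phase over.
        runtime.freeze()   # hypothetical call; afterwards, rebinding globals or
                           # redefining classes raises, so the compiler can
                           # specialize the remaining (long) run of the program.
    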

My reasoning here is that the dynamic scripting languages tend not to use
their dynamism evenly. The vast bulk of "dynamic" behavior is all done in an
informal initialization phase, but then, for the bulk of the program's
execution, you continue to pay for all the dynamism because the interpreter
has to constantly follow the dynamic chains of functions, or even if the code
is JIT'ed, the JIT has to be written to handle functions suddenly getting the
"wrong" type, which at the _very_ least means you pay for a check the static
programs don't need to pay for, and generally, you may have to pay more. You
set up the dynamism once at the start of the program, but pay for the ability
to be dynamic later billions and trillions and so on of times over the course
of the program.

(Don't just think about how poorly this would work if bodged onto Python or
Javascript or something, because I know such an attempt would absolutely be a
disaster. That's why I'm hypothesizing someone sitting down and designing this
language from scratch with these ideas in mind, so it'll have the correct
affordances and paved cow paths and such to make this work. While you're at
it, give your new dynamic scripting language a solid concurrency story, since
none of the current ones have one, since they all grotesquely predate that as
a concern. I think there's a hole in the programming language landscape here
right now. Another way to think of this is "write a dynamic scripting language
that is _designed_ to have a good, simple JIT".)

(Actually, for all the languages there are, I think there's several holes in
the programming language landscape right now. You'd think everything would be
covered, but it really isn't.)

~~~
bakery2k
> a dynamic scripting language that is designed to have a good, simple JIT

Julia is designed along these lines - its JIT compiler really just does AOT
compilation at runtime. JavaScript JITs tend to work along the lines of:

    
    
        1. Interpret (and gather profiling data including type information)
        2. JIT compile hot code, making assumptions based on profiling data
        3. Deoptimize (fall back to the interpreter) if an assumption is broken
    

Julia does none of this - it can only work with types that are explicitly
stated or that can be inferred statically. For extremely dynamic source code,
Julia's compiler must emit slow, highly generic machine code.

This is why Julia code tends to contain many more explicit types than code in
other dynamic languages. Compared to other JIT-compiled dynamic languages,
Julia trades off some dynamism for compiler simplicity.

~~~
jerf
That's what I thought Julia did.

I'm proposing that you could probably get back a lot of that dynamicness if
you could explicitly say "OK, I'm done being dynamic now", because in general,
even Python programs that do things like introspect databases and dynamically
create classes on the fly based on that still have a distinct initialization
phase. Reifying that into something the interpreter/compiler (honestly, in
this language design, that distinction barely matters...) understands might
let you have both worlds.

------
chrisseaton
The author hints at the problem

> Don’t count on your operator overloads to work here, though.

> by keeping these variables around, debugging is more straightforward

The transformations the author is suggesting are not in general legal. Missing
operator overloads and inconsistent debugging states aren’t something to gloss
over - they’re showing you your optimisations are just wrong!
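
A small, made-up illustration of the operator-overload point: folding
`x + 1 + 1` into `x + 2` changes observable behaviour, so it isn't a legal
transformation.

    
    
        class Chatty(int):
            def __add__(self, other):
                print("side effect!")             # observable on every +
                return Chatty(int(self) + other)

        def f(x):
            return x + 1 + 1   # two __add__ calls, two prints; "x + 2" would give one

        f(Chatty(0))
    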

You would need a sophisticated deoptimisation system with frame states, like
JVMs or V8 has, to make them legal.

The Python compiler isn’t dumb - it’s correct.

~~~
ehsankia
From halfway through the 2nd paragraph, it was obvious that the author had
been looking at Python all wrong

> I wonder if the code I’m writing is putting undue constraints on the
> bytecode compiler and limiting its options

If that's what you're wondering while writing Python, then you probably
shouldn't be writing Python.

~~~
fake-name
Ooooor, maybe you like to think about what's actually going on when you write
code?

I certainly basically mentally execute code to some extent as I write it. Is
there a way to program that _doesn't_ do something like that?

~~~
antt
Nobody has any idea what a modern x86 cpu does under the hood because we have
no access to the microcode.

The only thing you can do is run your code and get an empirical answer to the
question 'Is it fast enough'.

~~~
TeMPOraL
Technically true, but also a useless point of view, because it's essentially
saying "it's all arbitrary chaos underneath". But it isn't. You can get 90% of
the way there with a basic understanding of x86 assembly and some CPU-level
abstractions.

(I too execute the code in my head, but these days usually a couple layers
above x86 ASM, in my own mental "bytecode" that tracks how expensive some of
the programming language's operations and stdlib functions are.)

~~~
lmm
Whether relevant things fit into L1 cache will make an order of magnitude more
difference than whether you're using a nominally 2-cycle or 4-cycle assembly
instruction. I don't believe any human programmer on a modern CPU is able to
keep track of what cache level their code operates at entirely in their head;
certainly such a human would not be thinking in terms of x86 assembly or
normal language stdlib functions. So attempting to execute the code in one's
head is overwhelmingly likely to be a waste of time.

------
Animats
Well, yes. It's a naive interpreter. The source is translated to the obvious
bytecode. At run time, everything is a PyObject, including integers and floats.
Everything which you'd expect to be a dict really is a dict. So you spend a
lot of run time checking the types of objects, dispatching, and looking up
variables. There's a bit of memoization, I think, to cut down some of the
lookups.

It's tough to optimize a language where any thread can change the internal
variables of any other thread at any time. Python has gratuitous dynamism.
Rarely does code muck with the state of another thread, but it can, and the
code has to handle the worst case. There's no such thing as a thread-local
variable in Python.
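
A concrete illustration of that cross-thread visibility - the sleeps are only
there to keep the worker's frame alive long enough to look at it:

    
    
        import sys
        import threading
        import time

        def worker():
            secret = 42        # an ordinary local variable
            time.sleep(1.0)    # keep the frame alive

        t = threading.Thread(target=worker)
        t.start()
        time.sleep(0.1)

        # From the main thread, peek into the worker thread's live stack frame.
        frame = sys._current_frames()[t.ident]
        print(frame.f_locals)  # {'secret': 42} - another thread's locals
        t.join()
    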

~~~
bakery2k
> any thread can change the internal variables of any other thread at any time

Why is this different from Java, where the JVM can assume that local variables
cannot be changed by other threads?

That is, in the absence of special cases, which I believe include `volatile`
and function calls. (Any other special cases?)

~~~
mkl
Am I missing something, or does your first question answer itself? The
question asserts a specific difference. (I don't remember how it works in Java
to know if the assertion is correct.)

~~~
bakery2k
As I understand it, the JVM can assume that local variables are not modified
by other threads. My question is _why_ can't CPython make the same assumption?

~~~
anowlcalledjosh
Because it's not true! Threads in Python can muck around with the state of any
other thread, at any time – it would be a language change to assume otherwise.

~~~
maayank
What the OP means is that it’s not true in Java either, but the JVM assumes it
is true until it detects otherwise.

------
tambourine_man
JavaScript is really lucky. Imagine having your pet language’s execution
continuously optimized for decades by the likes of Apple, Google, Microsoft
and Mozilla.

The amount of brainpower that has gone into making JavaScript fast is amazing.

~~~
tanilama
Well, there have been multiple attempts in Python's history to make it faster,
all of which failed.

The most noteworthy one is Unladen Swallow from Google:

[https://www.python.org/dev/peps/pep-3146/](https://www.python.org/dev/peps/pep-3146/)

And Dropbox's now dead Pyston.

While both languages are pretty dynamic, Python has a rich ecosystem of
C extensions, which exposes a lot of interpreter details to developers, making
a move away from the CPython implementation much harder, or outright
impossible if compatibility is required.

~~~
bakery2k
Unladen Swallow had a small fraction of the resources that Google allocated to
V8, though. I think it was only worked on by a few interns.

Pyston was better supported, but I think just too ambitious - IIRC it was
initially intended to be a full rewrite _and_ to be compatible with
C-extensions.

~~~
tanilama
As far as I was tracking it, Pyston was mainly 2 devs from Dropbox - hardly
better off.

You have a point though: no company can replace JavaScript, the cost is simply
prohibitive. But Python, being mainly a backend language at the time, could
more realistically be replaced with newer, more performant alternatives like
Golang, and to some extent node.js.

But it has its own stronghold, which is data/ML land. However, that community
has gotten around Python's slowness in its own way: either they tolerate it
because it happens behind the scenes, or they push the performance bottlenecks
out to C extensions.

So in the end, I guess people love complaining about Python's performance,
including myself, but it never reaches the breaking point where they say
enough is enough.

~~~
cbxxx
The number of devs is not so important. One author who is left alone from
corporate BS for three years can achieve more than a collection of 10 randomly
hired programmers who just disturb the lead dev.

~~~
woadwarrior01
As someone who has worked on language runtimes in a corporate environment, I
believe there’s some truth to this. One successful example of this would be
Mike Pall, of LuaJIT fame.

------
__s
Indeed, a patch I created in 2010 for the peepholer to optimize 'a,b=b,a' was
rejected with one particular reason being "changing order of execution starts
to venture into territory that we've stayed away from (on purpose)."
[https://bugs.python.org/issue10648](https://bugs.python.org/issue10648)

Not that I disagree with the patch being rejected, only that this is an
example of the compiler's philosophy
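
Not the case from that bug report, but a tiny illustration of why the order of
execution in unpacking assignments is observable at all:

    
    
        x = [0, 0]
        i = 0
        # The right-hand tuple is evaluated first, then the targets are
        # assigned left to right: i becomes 1, and only then is x[i] (now
        # x[1]) stored.
        i, x[i] = 1, 5
        print(i, x)   # 1 [0, 5] - reordering the stores would give [5, 0]
    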

~~~
isatty2
I know you don't disagree with the patch being rejected, but I have to say
that the reviewer gave you a firm example in which your patch changes expected
behavior.

Philosophy aside, that is a fine reason to reject the patch unless you can
convince the reviewer (and the committee) that you are in the right (you very
well may be).

~~~
carlmr
Being in the right is very much a matter of taste. Python has always erred on
the side of less optimization and more obvious behavior. So something that
optimizes for performance while introducing subtle possibilities for bugs
would likely be accepted by the C++ committee, but not the Python committee.

And I think that's ok. Python wants to be simple and straightforward and
performance was never a goal.

If you need performance, don't use Python, or write a Python library in
C/C++/Rust and do the heavy lifting there.

------
abecedarius
Guido himself said "Python is about having the simplest, dumbest compiler
imaginable." Even a subset of this dumbest thing turned out to be a lot to
cover when I dug into it at
[https://codewords.recurse.com/issues/seven/dragon-taming-with-tailbiter-a-bytecode-compiler](https://codewords.recurse.com/issues/seven/dragon-taming-with-tailbiter-a-bytecode-compiler)

------
pizlonator
Fast bytecode-based implementations of dynamic languages, like JavaScriptCore,
do optimizations _after_ the bytecode is generated. The bytecode is just the
common frame of reference for profiling and OSR.

~~~
bakery2k
LuaJIT is the same, as is V8 (at least with Ignition & TurboFan). The
article's assumption that _fewer instructions = faster_ does not necessarily
apply to these implementations, but is approximately correct for CPython
(because it's a simple bytecode interpreter).
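
(For anyone who wants to look at the instructions being counted, the stdlib
`dis` module prints them; the function here is just an example:)

    
    
        import dis

        def f():
            x = 1
            return x

        # Prints the LOAD_CONST / STORE_FAST / LOAD_FAST / RETURN_VALUE
        # sequence - CPython pays a dispatch cost for each instruction.
        dis.dis(f)
    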

------
thegeomaster
Python's slowness has been a very real problem for me. It's a great language
to write data-wrangling scripts in, but when there's a little bit more data,
it's terribly slow. And then you have to rewrite your script in a faster
language, like C++ or Rust. I've observed speedups on the order of 20x this
way. Which means that I could've saved myself a lot of time by writing them in
another language at the start.

~~~
sametmax
I doubt it.

If you start by writing a C++ program first, it will take a lot of time.

Python allows you to get a good idea of how the thing works. And if,
eventually, you find out that it's not fast enough (even with PyPy, numpy, a
few lines of Cython, etc.), you may rewrite it in C++. But the rewrite is
going to be much, much faster to do, because now you know what's up, and can
just translate it to C++ and focus on that language's gotchas.

~~~
thegeomaster
But usually, I know how the thing works. I have data in this format, and I
need data in this other format. The transformation may include cross-
referencing among the data, processing strings in various ways, computing sums
and counts, calling external libraries, etc. These are simple to code and
well-defined tasks. In my experience, they rarely benefit from prototyping.

My recent example is processing around 10m chess games to get statistics on
all positions that occurred inside them (above an occurrence threshold). This
required parsing the games in chess notation, using a chess library to
simulate the moves to get positions, and counting how often each position
occurred, in which matches, etc. My first try was with Python. After I
realized it was unbearably slow, I tried using PyPy, running multiple
processes (one per core), etc., and in the end my estimate was that the job
would finish in a couple of hours. I tried more optimizations and nothing helped.
And there's no number crunching to use numpy for. Then I wrote the same script
in Rust, and it ran in a couple of minutes, finishing well before the original
Python script would have finished, had I left it to run. I arguably didn't
save time by using Python here.

~~~
sametmax
If you knew how it worked, and it was fast to write it in Rust, why didn't you
write it in Rust in the first place?

Because it was easy in Python.

It was easy to write the same script in Rust, then, because you had already
gotten the Python version working.

------
kccqzy
The local variable elimination optimization breaks introspection features.

    
    
        def foo():
          a=1
          return a
    

Then we can inspect the local variables:

    
    
        print(foo.__code__.co_nlocals)
        print(foo.__code__.co_varnames)
    

On a related note, I believe it was also a deliberate decision to keep the
names of all local variables during compilation. This is of course very
different from C.
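
For the function above, those calls report that the local was kept (a sketch
of the output on current CPython); had the compiler eliminated `a` and
returned the constant directly, they would show 0 and () instead - an
observable difference:

    
    
        >>> print(foo.__code__.co_nlocals)
        1
        >>> print(foo.__code__.co_varnames)
        ('a',)
    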

~~~
greglindahl
Actually, C compilers with -O0 keep all of the local variables, for debugging
purposes. -O0 is deliberately bad. And if you file bugs complaining about
debugging at higher optimization levels, well: if you want loop fusion and
splitting for high performance, you're going to lose some debuggability.

~~~
aasasd
I wonder if Python could similarly have opt-in optimization (‘-Osomething’)
for those who are sure it won't break anything. Though I guess the standard
lib might get in the way immediately.

~~~
masklinn
CPython already has -O and -OO, they just don't do much that's useful (and are
mostly detrimental): -O will skip asserts and set __debug__ to False, -OO will
also remove docstrings.

CPython's compilation pipeline just has a fairly straightforward peephole
optimiser:
[https://github.com/python/cpython/blob/master/Python/peephol...](https://github.com/python/cpython/blob/master/Python/peephole.c)
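
For reference, a tiny script that makes the (limited) effect of those flags
visible - the file name is just for the example:

    
    
        # optflags.py
        # python optflags.py      -> __debug__ True,  docstring kept,    assert fires
        # python -O optflags.py   -> __debug__ False, docstring kept,    assert skipped
        # python -OO optflags.py  -> __debug__ False, docstring removed, assert skipped
        def f():
            """Stripped by -OO."""

        print("__debug__ =", __debug__)
        print("docstring =", f.__doc__)
        assert False, "only raises when run without -O/-OO"
    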

~~~
greglindahl
You're describing opt-in pessimisation, not opt-in optimization.

------
yegle
> Unless the Python language specification is specific about this case

No, there's no such thing as a Python language specification. Every
implementation of Python is based on CPython. This also leads to the sad fact
that every non-CPython implementation has to provide a C API that's compatible
with CPython's if it wants to be adopted widely.

~~~
bakery2k
> Every implementation of Python is based on CPython.

Not _quite_ true - there are notes in the documentation that call out parts of
CPython's behaviour as "implementation-specific".

These are quite rare though. I seem to recall the suggestion being made that
MicroPython shouldn't call itself "an implementation of Python" simply because
its internal string encoding is UTF-8.

> This also leads to the sad fact that every non-CPython implementation has to
> provide C API that's compatible with CPython if they want to be adopted
> widely.

A specification for the C API wouldn't necessarily help with this. The problem
for alternative implementations is that the C API is closely tied to CPython
internals - there is no reason that having a specification would change that.

Designing and specifying a more abstracted API would be useful - but I don't
see it happening.

~~~
kazinator
> _there are notes in the documentation that call out parts of CPython's
> behaviour as "implementation-specific"_

In a document like this, that means "doing whatever CPython does in a way that
we are not bothering to commit to documentation" (and that any other
implementation will have to reverse engineer and implement).

------
sevensor
As the article points out, CPython is dumb _on purpose_ , which benefits
transparency, maintainability, and development. I've always appreciated this
-- it's obvious when you're doing something expensive in Python because you
know the interpreter is going to do pretty much what you'd expect it to.

------
rawmodz
"...in this case for my wife’s photo blog". Stunning bird photographs, really
exceptional. And nice work on that responsive static album generator. Now time
to continue reading about CPython byte code...

------
aboutruby
Some people are working on optimizing ruby "bytecode" (what the VM
interprets): [https://developers.redhat.com/blog/2019/02/19/register-transfer-language-for-cruby](https://developers.redhat.com/blog/2019/02/19/register-transfer-language-for-cruby)

I'm pretty sure the same concepts could be applied to CPython.

------
dbrueck
Seems like part of the problem is that there might not be much overlap between
the types of things he points out that aren't optimized - which are obscure
and useless - and what could "legally" be optimized.

So, yes, something like this isn't optimized:

def foo(): return [1024][0]

But it's also pretty unlikely to see something like that in actual code. One
could argue that that's just an example of a /type/ of a more general case,
but I think you'd find that the more general case can't be safely optimized
because Python is insanely dynamic. So e.g.

def foo(): return SomeArrayLikeThing(1024)[0]

can't be optimized because not only might the behavior be very different from
a typical Python array, the behavior could very easily not be determined until
the exact moment when that code is run.
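
For instance (the class is made up, just to show how late the behaviour can be
decided):

    
    
        class SomeArrayLikeThing:
            def __init__(self, value):
                self.value = value
            def __getitem__(self, index):
                # Decided only when this line runs: it could log, do I/O,
                # raise, or return something of a completely different type.
                return self.value + index

        def foo():
            return SomeArrayLikeThing(1024)[0]   # not foldable to 1024

        # Even the name itself can be rebound before foo ever runs:
        SomeArrayLikeThing = lambda value: {0: "surprise"}
        print(foo())                             # prints: surprise
    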

IOW, the things the author points out are things that, in practice, end up
being so narrow and rare that there's no real point in trying to optimize
them.

------
beagle3
It hasn’t been updated in 5 years, and IIRC Python2 only - but Russell Power’s
“Falcon” was an attempt to solve these issues that showed a lot of promise.

[https://github.com/rjpower/falcon](https://github.com/rjpower/falcon)

~~~
bakery2k
Thanks for this - I hadn't seen it before. Looks like an implementation of the
core Python interpreter designed to work more like the reference interpreter
for Lua.

Unfortunately it seems to be incomplete - and even if support for exceptions
etc were added, I suspect some of Python's more highly dynamic features (e.g.
inspection of stack frame objects) would be extremely difficult to support in
a compatible way.

I think supporting such niche dynamic features would be essential - I've found
the Python community to be strongly against the idea of changing (or removing)
such features in the pursuit of increased performance.

~~~
jsmeaton
Have you seen this twitter thread?
[https://twitter.com/mitsuhiko/status/1091802711908106240?s=2...](https://twitter.com/mitsuhiko/status/1091802711908106240?s=21)

Lots of high-profile Python developers are not so against a version of Python
that is less dynamic and more performant.

------
fijal
If you want to see an advanced Python compiler, have a look at PyPy. While the
CPython bytecode compiler is "dumb", it's about as smart as it can be. You can
do crazy stuff with speculative replacement of bytecodes using guards and
type-specific bytecodes, but then you might as well create a full just-in-time
compiler. It's incredibly hard to create optimizations at the bytecode level
without changing semantics, which Python has rightly stayed away from.

------
andrewf
This [1] is a bytecode optimizer for Python - it takes the bytecode from
CPython's compiler, and outputs bytecode with a few select optimizations.

Anyone know of someone taking this idea further?

[1] [http://code.activestate.com/recipes/277940-decorator-for-bindingconstants-at-compile-time/](http://code.activestate.com/recipes/277940-decorator-for-bindingconstants-at-compile-time/) - it's old, for Python 2.4

~~~
joejev
I work on a library for doing Python bytecode transformations like this, but
with a more abstract API. Here is a similar transformation with this library,
which works with Python 3:
[https://github.com/llllllllll/codetransformer/blob/master/co...](https://github.com/llllllllll/codetransformer/blob/master/codetransformer/transformers/constants.py)

------
rurban
That's why PHP 7 is now 2x faster: just simple escape analysis, optimizing to
stack allocation, and omitting unnecessary refcounting.

But you need to be aware that optimization passes are costly, and with a
dynamic language this cost adds to the overall run time, unlike with static
languages where the optimizer may spend seconds at compile time but run time
is unaffected.

~~~
jnwatson
The optimization is a one time cost at load time that Python already caches
anyway.

------
saagarjha
Even the Java compiler is pretty stupid: it’s HotSpot doing amazing things
under the hood that makes it fast.

------
alsadi
Some optimizations can be enabled

"python -OO -m compileall mydir"

But it merely skips asserts and docstrings

------
ChrisSD
I think CPython is quite clear about not optimising Python code, no? Both the
docs and Python programmers quickly disabused me of any notion that it did
(although this was some years ago, so I guess things could have changed).

After all, PyPy's selling point is that it does optimise, isn't it?

~~~
kbumsik
Of course the author does not intend to criticize it, but to show what a dumb
compiler looks like.

> To be clear: This isn’t to say CPython is bad, or even that it should
> necessarily change. In fact, as I’ll show, dumb bytecode compilers are par
> for the course. In the past I’ve lamented how the Emacs Lisp compiler could
> do a better job, but CPython and Lua are operating at the same level. There
> are benefits to a dumb and straightforward bytecode compiler: the compiler
> itself is simpler, easier to maintain, and more amenable to modification
> (e.g. as Python continues to evolve). It’s also easier to debug Python (pdb)
> because it’s such a close match to the source listing.

~~~
hermitdev
Right, the CPython runtime is first trying to be correct (which IMHO is also
the right call). Very few optimizations are made by the compiler/interpreter.
This has long been known by Python devs, and exploited by specifically using
APIs known to be written in C/C++, e.g. using 'map' instead of a list
comprehension. Map was generally faster (at least in the 2.x days) than an
equivalent list comprehension.

~~~
joshuamorton
Just FYI, this specific wisdom you're suggesting is exactly backwards, or at
least could be, depending on how you read it.

    
    
        timeit.timeit(setup='x = range(10000); l = lambda n: n + 1', stmt='map(l, x)', number=1000)
    

takes approximately 1.2 seconds to do that. The same with a listcomp:

    
    
        timeit.timeit(setup='x = range(10000)', stmt='[n+1 for n in x]', number=1000)
    

runs in a half second.

Although calling the lambda from inside the listcomp, instead of inlining the
expression, is slower still (1.6s):

    
    
        timeit.timeit(setup='x = range(10000); l = lambda n: n + 1', stmt='[l(n) for n in x]', number=1000)
    

The second is the fastest because you drop the overhead of the function call,
which does stack pushes and pops, and instead evaluate the expression inline.
The last is the slowest because it does the stack pushes and pops, and also
loads extra globals, which is slow.

On 3.6 the map version runs in 0.00067 seconds, because it doesn't actually do
anything; it just constructs a lazy map object.

The general point is well taken. My favorite bit of trivia like this is the
fast matrix transpose: `zip(*matrix)`, which does everything in pure C, and is
also likely the shortest way to do the transpose in Python.
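
For reference, it looks like this (the outer list() is only needed on
Python 3, where zip is lazy):

    
    
        matrix = [[1, 2, 3],
                  [4, 5, 6]]
        # zip(*matrix) unpacks the rows as arguments and pairs them up
        # column-wise, all in C.
        print(list(zip(*matrix)))   # [(1, 4), (2, 5), (3, 6)]
    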

~~~
masklinn
> Just FYI, this specific wisdom you're suggesting is exactly backwards, or at
> least could be, depending on how you read it.

Yeah. map is usually faster _if you already have a function to invoke_
(especially if that function is a builtin or in C). If you have an expression
and have to wrap it in a lambda to get it in map, it's going to be way slower
than the listcomp because all things considered CPython's function calls are
expensive.
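
A rough way to see both effects (exact numbers vary by machine and Python
version, but on a typical CPython the first pair favours map and the second
favours the listcomp):

    
    
        import timeit

        setup = 'x = range(10000)'
        # Already have a callable (a C builtin): map tends to win.
        print(timeit.timeit('list(map(abs, x))', setup=setup, number=1000))
        print(timeit.timeit('[abs(n) for n in x]', setup=setup, number=1000))
        # Only have an expression: the listcomp wins; wrapping it in a lambda
        # for map pays a Python-level call per element.
        print(timeit.timeit('[n + 1 for n in x]', setup=setup, number=1000))
        print(timeit.timeit('list(map(lambda n: n + 1, x))', setup=setup, number=1000))
    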

