Hi! This is part 4 of my Python behind the scenes series. The goal of this post is to understand how the CPython VM executes Python bytecode. You'll learn:
- what the evaluation loop is and how it's implemented (a minimal sketch follows below)
- when and how a thread may stop executing the bytecode to release the GIL
- how CPython computes things
- how CPython handles exceptions and implements statements like try-except, try-finally and with
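To make the idea concrete up front, here is a deliberately tiny sketch of what such an evaluation loop looks like. This is not CPython's actual code (the real loop in ceval.c dispatches on CPython's own opcodes and works with PyObject pointers); the opcodes and the little program below are made up for illustration:

    /* toy evaluation loop: fetch an opcode, dispatch on it, manipulate a value stack */
    #include <stdio.h>

    enum { LOAD_CONST, BINARY_ADD, UNARY_NEGATIVE, RETURN_VALUE };

    int main(void) {
        /* toy bytecode: each instruction is an (opcode, argument) pair */
        int code[] = { LOAD_CONST, 2, LOAD_CONST, 3, BINARY_ADD, 0,
                       UNARY_NEGATIVE, 0, RETURN_VALUE, 0 };
        int stack[16];
        int *sp = stack;      /* value stack pointer */
        int *ip = code;       /* instruction pointer */

        for (;;) {            /* the evaluation loop */
            int opcode = *ip++;
            int oparg  = *ip++;
            switch (opcode) {
            case LOAD_CONST:
                *sp++ = oparg;
                break;
            case BINARY_ADD: {
                int r = *--sp;
                int l = *--sp;
                *sp++ = l + r;
                break;
            }
            case UNARY_NEGATIVE: {
                int v = *--sp;
                *sp++ = -v;
                break;
            }
            case RETURN_VALUE:
                printf("result: %d\n", *--sp);   /* prints -5 */
                return 0;
            }
        }
    }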
Really enjoyed the read! Thanks for taking the time to write this down.
It's easy to forget that if you are running "Python" in production, you are actually running CPython on an x86 VM, configured with a bunch of *.py files. When things go sideways, you might find yourself in a situation where knowing CPython internals becomes relevant.
Would the loop be amenable to speedups from parallel execution? Some kind of parallel, idempotent run-ahead version of the loop might at least prime caches and resolve some dynamic dispatch in advance. A bit like runahead execution in CPU design.
If you squint, this is basically what a JIT compiler such as PyPy does. The compiled code has a fast hot path that assumes that every branch and dynamic dispatch will work the way you expect it to, and if it doesn't then it falls back into the interpreter.
They also use "inline caching" where the result of method lookup is cached in the bytecode itself. Most of the time there's no need to try to pre-execute it.
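The gist of inline caching, in made-up toy form (this is not the real cache layout of CPython or PyPy): each call site remembers the type it saw last time and the method that the lookup produced, so a cheap guard replaces the full lookup on the fast path.

    #include <stdio.h>

    typedef struct {
        unsigned type_version;    /* version tag of the type seen last time */
        void   (*method)(void);   /* cached result of the lookup */
    } InlineCache;

    static void greet(void) { puts("hello"); }

    /* stand-in for the full, slow method lookup */
    static void (*slow_lookup(unsigned type_version))(void) {
        printf("slow lookup for type version %u\n", type_version);
        return greet;
    }

    static void call_method(InlineCache *cache, unsigned type_version) {
        if (cache->type_version != type_version) {   /* guard failed: refill the cache */
            cache->method = slow_lookup(type_version);
            cache->type_version = type_version;
        }
        cache->method();                             /* fast path: no lookup */
    }

    int main(void) {
        InlineCache cache = { 0, NULL };
        call_method(&cache, 7);   /* miss: slow lookup, then call */
        call_method(&cache, 7);   /* hit: call directly */
        return 0;
    }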
The ultimate version of this idea of executing everything you can before running the dynamic bit is probably partial evaluation, which is how GraalPython works.
Yep, Psyco was also based on partial evaluation and Futamura projections, IIRC; lots of potential there.
I guess the very ultimate version is a transformation that takes your program AND its inputs, and completely specializes everything to those inputs :)
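A toy illustration of what "specializing to known inputs" means (the names are made up and this has nothing to do with any real partial evaluator): if one input is known ahead of time, the generic program can be reduced to a residual program with that input baked in.

    #include <stdio.h>

    /* generic program: both inputs are dynamic */
    static int power(int base, int exp) {
        int result = 1;
        while (exp-- > 0)
            result *= base;
        return result;
    }

    /* residual program after specializing on exp == 3: no loop, no exp argument */
    static int power_3(int base) {
        return base * base * base;
    }

    int main(void) {
        printf("%d %d\n", power(5, 3), power_3(5));   /* both print 125 */
        return 0;
    }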
An advantage of the runahead idea is that, with luck, the change could be small and simple, and might realistically be accepted as an incremental change into the CPython codebase - iff the runahead could be guaranteed to be idempotent and obviously correct.
> The UNARY_NEGATIVE opcode pops a value from the stack, negates it, and pushes the result.
Why not have a top-of-stack register? Then all you need to put in your case statement (naively) is
tos = -tos;
thereby avoiding a pop and a push. In a virtual machine for a dynamically typed language such as Python, you'll need to handle different types, and for a statically typed language you'll need separate instructions for int and float, but in any case you avoid the pop and push. If the instruction takes more than one item from the stack, you at least have one fewer pop.
Incidentally, Burroughs mainframes were hardware stack machines and had two top-of-stack registers, A and B.
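Here's a runnable toy version of that idea (made-up opcodes, ints instead of Python objects, not how CPython actually does it): the top of the stack lives in a local variable, so a unary op touches neither the stack memory nor the stack pointer.

    #include <stdio.h>

    enum { PUSH_CONST, ADD, NEGATE, PRINT_AND_STOP };

    int main(void) {
        int code[] = { PUSH_CONST, 10, PUSH_CONST, 4, ADD, 0,
                       NEGATE, 0, PRINT_AND_STOP, 0 };
        int stack[16];
        int *sp = stack;   /* everything below the top lives in memory */
        int tos = 0;       /* the top of the stack lives in this local/register */
        int *ip = code;

        for (;;) {
            int opcode = *ip++, oparg = *ip++;
            switch (opcode) {
            case PUSH_CONST:     *sp++ = tos; tos = oparg; break;  /* spill the old top */
            case ADD:            tos += *--sp; break;              /* one pop, no push */
            case NEGATE:         tos = -tos;   break;              /* no pop, no push */
            case PRINT_AND_STOP: printf("%d\n", tos); return 0;    /* prints -14 */
            }
        }
    }

(The very first PUSH_CONST spills the dummy initial tos; it is never read again, which keeps the toy short.)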
If you look a bit further down in the original article, you'll see that the BINARY_ADD instruction does something similar. It pops (a pointer to) the first operand, and modifies (a pointer to) the second one in-place.
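In toy form (ints instead of PyObject pointers, no error handling, so just the idea rather than CPython's real code):

    #include <stdio.h>

    /* BINARY_ADD without a separate pop/pop/push: pop the right operand,
       then overwrite the left operand's stack slot with the result */
    static int *binary_add(int *stack_pointer) {
        int right = *--stack_pointer;
        stack_pointer[-1] += right;
        return stack_pointer;
    }

    int main(void) {
        int stack[4] = { 2, 3 };     /* value stack: 2 below, 3 on top */
        int *sp = stack + 2;         /* points one past the top */
        sp = binary_add(sp);
        printf("top of stack: %d\n", sp[-1]);   /* prints 5 */
        return 0;
    }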
Semantically, it makes sense to define operations as popping the operand(s) and pushing a result, for simplicity. But there's no reason the interpreter has to actually be implemented that way, as long as the observable behavior is the same.
In any case, I wouldn't be surprised if an extra push/pop ended up having very little performance impact. The compiler might be able to optimize away the pointer increment/decrement instructions, and if not, the stack pointer is pretty much guaranteed to be in the L1 cache.
As teraflop pointed out, CPython doesn't actually do the pop and the push in this case. Sorry for being misleading. I've added an update note to clarify this.
This is where "we" (in general) want to improve CPython for performance - a JIT or a faster interpreter would be interesting. Anyone have ideas about this? Pyston 2 seems to use DynASM, and then there's the core developer Mark Shannon, who is proposing a paid project to speed up Python - he hasn't revealed what he has implemented yet.
GraalVM is ~5x faster for regular Python code, but considerably slower for native modules, because it actually executes the native code, recompiled into LLVM bitcode, in a kind of VM. Or at least that's partly the reason.