- what the evaluation loop is and how it's implemented
- when and how a thread may stop executing the bytecode to release the GIL
- how CPython computes things
- how CPython handles exceptions and implements statements like try-except, try-finally and with
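For a first look at the bytecode the evaluation loop dispatches on, the standard `dis` module is enough (exact opcode names vary between CPython versions, so no output is shown):

```python
import dis

def negate(x):
    return -x

# Print the stack-machine instructions the evaluation loop executes
# for this function, e.g. a load of the argument, a unary-negate
# instruction, and a return.
dis.dis(negate)
```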
It's easy to forget that if you are running "Python" in production, you are actually running CPython on an x86 VM configured with a bunch of *.py files. When things go sideways, you might find yourself in a situation where knowing CPython internals becomes relevant.
They also use "inline caching", where the result of a method lookup is cached in the bytecode itself, so most of the time there's no need to try to pre-execute it.
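A toy sketch of the idea behind inline caching, assuming a monomorphic call site; the class and attribute names here are made up for illustration, not CPython's actual cache layout:

```python
class InlineCache:
    """Toy per-call-site cache: remember the receiver's type and the
    method found on it, and skip the lookup while the type matches."""

    def __init__(self, name):
        self.name = name
        self.cached_type = None
        self.cached_method = None

    def lookup(self, obj):
        tp = type(obj)
        if tp is not self.cached_type:
            # Cache miss: do the slow lookup and remember the result.
            self.cached_type = tp
            self.cached_method = getattr(tp, self.name)
        # Cache hit: no attribute/dict lookup at all.
        return self.cached_method

site = InlineCache("upper")
print(site.lookup("abc")("abc"))  # → ABC (slow path, fills the cache)
print(site.lookup("def")("def"))  # → DEF (fast path, same type)
```

CPython's real caches additionally check a per-type version tag, so that mutating the class invalidates stale entries.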
The ultimate version of this idea of executing everything you can before running the dynamic bit is probably Partial Evaluation, which is how GraalPython works.
I guess the very ultimate version is a transformation that takes your program AND its inputs, and completely specializes everything to those inputs :)
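A minimal sketch of that specialization idea, using the classic power-function example: given the program (exponentiation by squaring) and one of its inputs (a fixed exponent `n`), the loop is folded away and only a straight-line residual program over the dynamic input `x` remains. All names here are invented for illustration:

```python
def specialize_power(n):
    """Toy partial evaluation: fold the square-and-multiply loop over a
    known exponent n into a straight-line expression in x alone."""
    expr = "1"
    base = "x"
    while n:
        if n & 1:
            expr = f"({expr} * {base})"
        base = f"({base} * {base})"
        n >>= 1
    source = f"lambda x: {expr}"
    # The residual program no longer contains a loop or a test on n.
    return eval(source), source

power5, source = specialize_power(5)
print(source)      # the residual program, specialized to n=5
print(power5(2))   # → 32
```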
An advantage of the runahead idea is that, with luck, the change could be small and simple, and might realistically be accepted as an incremental change into the CPython codebase, iff the runahead could be made truly guaranteed idempotent and obviously correct.
Why not have a top of stack register? Then all you need to put in your case statement (naively) is
tos = -tos;
Incidentally, Burroughs mainframes were hardware stack machines and had two top of stack registers, A and B.
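To make the suggestion concrete, here is a toy stack-machine loop (written in Python rather than C, with invented opcode names) that keeps the top of stack in a local variable, so a unary negation never touches the real stack:

```python
def run(code, arg):
    """Toy bytecode interpreter with a top-of-stack register: `tos`
    holds the topmost value, `stack` holds everything beneath it."""
    stack = []
    tos = arg
    for op in code:
        if op == "UNARY_NEGATE":
            tos = -tos                 # no push/pop at all
        elif op == "DUP":
            stack.append(tos)          # spill a copy below the register
        elif op == "BINARY_ADD":
            tos = stack.pop() + tos    # one pop instead of two pops + a push
    return tos

print(run(["DUP", "UNARY_NEGATE", "BINARY_ADD"], 7))  # → 0
```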
If you look a bit further down in the original article, you'll see that the BINARY_ADD instruction does something similar. It pops (a pointer to) the first operand, and modifies (a pointer to) the second one in-place.
Semantically, it makes sense to define operations as popping the operand(s) and pushing a result, for simplicity. But there's no reason the interpreter has to actually be implemented that way, as long as the observable behavior is the same.
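A small illustration that the two strategies are observably equivalent, assuming a list-based value stack (function names are made up):

```python
def add_pop_push(stack):
    """Textbook semantics: pop both operands, push the result."""
    b = stack.pop()
    a = stack.pop()
    stack.append(a + b)

def add_in_place(stack):
    """The BINARY_ADD-style shortcut: pop one operand and overwrite
    the new top of stack with the result."""
    b = stack.pop()
    stack[-1] = stack[-1] + b

s1, s2 = [1, 2, 3], [1, 2, 3]
add_pop_push(s1)
add_in_place(s2)
print(s1 == s2)  # → True (both leave [1, 5])
```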
In any case, I wouldn't be surprised if an extra push/pop ended up having very little performance impact. The compiler might be able to optimize away the pointer increment/decrement instructions, and if not, the stack pointer is pretty much guaranteed to be in the L1 cache.
But raw Python is super speedy.