Were Python calls uncached before this month? That seems extraordinary and the world's lowest hanging fruit and I'm amazed nobody ever picked it. Dropbox had a whole team working on a Python JIT and didn't try that first?
The Python core developers have been working on caching in the interpreter for close to a decade[1].
They're very conservative, though, both about making the interpreter more complex and about ensuring that performance doesn't regress unless a change is absolutely necessary to fix something broken. So I wouldn't exactly describe it as "low hanging fruit".
I’m afraid I don’t know the details. However, this post seems relevant - someone who worked on the JIT at Dropbox describes how expensive function calls in CPython can be:
To my eye the follow-up comments seem to imply it's specifically for dict objects?
> Methods already have their own fast path, so this is just for reading attributes from instances with __dict__. If __slots__ are defined, there is an additional 35% improvement.
But clearly it’s beyond just that:
> Also, imports are now faster for modules that have type annotations. The annotations are initially loaded as strings and only evaluated later if needed. The only downside is that it is easy to forget to make the relevant typing imports.
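You can see the string-annotation behavior today with the PEP 563 future import, which stores annotations as plain strings until something explicitly evaluates them (a rough sketch; the exact mechanism the quote refers to may differ):

```python
from __future__ import annotations  # keep annotations as strings (PEP 563)
import typing

def greet(name: str) -> str:
    return f"hello {name}"

# Stored as a plain string; nothing is imported or evaluated at def time.
print(greet.__annotations__['name'])           # 'str'

# Evaluated only on demand:
print(typing.get_type_hints(greet)['name'])    # <class 'str'>
```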
__slots__ is an alternative for when your object has a fixed set of member variables. It's mostly a memory optimization:
>>> class MyFoo:
... __slots__=['bar']
...
>>> f=MyFoo()
>>> f.bar='baz'
>>> f.__dict__
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'MyFoo' object has no attribute '__dict__'
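To make the memory point concrete, here's a rough comparison (class names are made up for illustration; exact byte counts vary by Python version and platform):

```python
import sys

class WithDict:
    def __init__(self):
        self.bar = 'baz'

class WithSlots:
    __slots__ = ['bar']
    def __init__(self):
        self.bar = 'baz'

d = WithDict()
s = WithSlots()

# A slotted instance carries no per-instance __dict__, so it's smaller
# than a regular instance plus the dict that stores its attributes.
print(sys.getsizeof(s))
print(sys.getsizeof(d) + sys.getsizeof(d.__dict__))
```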
Some types of objects have neither, these tend to be builtins or objects implemented using the C API:
>>> foo = int()
>>> foo.__dict__
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'int' object has no attribute '__dict__'
>>> foo.__slots__
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'int' object has no attribute '__slots__'
Thanks for the clarification. I always wished Python made it easier to auto-detect the slots from the constructor. I don't think it's just a memory optimization, though: as mentioned above, attribute access is faster when __slots__ is defined and benefits more from this optimization.
> improvements that could be made [...] Make a template interpreter like in the JVM. This will allow your specialized opcodes to directly make use of the call stack.
What is a “template interpreter”? How does it differ from a normal bytecode interpreter?
As I understand it, template interpreters work by generating machine code corresponding to the virtual machine instructions and jumping to that machine code, as opposed to the big switch statement you'd get in a normal bytecode interpreter.
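For contrast, here's a toy interpreter in the "big switch" style, where every instruction funnels back through one central dispatch point (opcodes and encoding are invented for illustration). A template interpreter replaces this dispatch with direct jumps into per-opcode machine code:

```python
# Hypothetical three-opcode stack machine.
PUSH, ADD, MUL = range(3)

def run(code):
    stack = []
    pc = 0
    while pc < len(code):       # central dispatch loop
        op = code[pc]
        if op == PUSH:          # the "switch": one test per opcode
            pc += 1
            stack.append(code[pc])
        elif op == ADD:
            stack.append(stack.pop() + stack.pop())
        elif op == MUL:
            stack.append(stack.pop() * stack.pop())
        pc += 1
    return stack.pop()

# (2 + 3) * 4
print(run([PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL]))  # 20
```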
Thanks. Is there a performance advantage to the template approach - compared to writing the handler for each opcode directly in assembly language? Or is it just easier to implement?
You cut out the code between the opcodes that's used for figuring out what to do next. (Removing an indirect jump at the end of each opcode execution.)
Yes people in this thread are confusing the two. It reduces overhead because it directly refers to machine registers and the machine stack to pass information between instructions, and directly uses machine jumps, rather than going through a switch and method calls.
Sounds like it’s similar to how LuaJIT works [0], using a machine-code handler for each VM opcode. AFAICT LuaJIT also uses a templating system to build these, except it generates them at compile-time instead of at run-time. I assume that’s easier than explicitly writing out each handler in assembly language.
A template JIT is a JIT that just generates machine code corresponding directly to the bytecode, instruction by instruction. That is in contrast to something like a tracing JIT, which uses runtime information to generate more optimal machine code.
One benefit of a template JIT is that the step to native compilation is a lot shorter than with a tracing JIT, apart from it being A LOT simpler. Debugging a tracing JIT is a special kind of hell reserved for extra evil programmers.
That's a 'template compiler' you're describing there, not a 'template interpreter'. OpenJDK doesn't include a 'template compiler'. Its lowest tier, C1, has a basic IR. OpenJDK does use a 'template interpreter'.
I should be clearer and link to more resources. Check my PL resources page for example. But I also plan on writing a post about template interpreters soon!
[0] https://mobile.twitter.com/raymondh/status/13574784866470051...
[1] https://github.com/markshannon/faster-cpython/blob/master/pl...