Inline Caching: Quickening (bernsteinbear.com)
36 points by r4um on Feb 8, 2021 | 19 comments



In Python, “Inline caching has been a huge success” [0]. I believe quickening has been proposed as part of a plan for further speed improvements [1].

[0] https://mobile.twitter.com/raymondh/status/13574784866470051...

[1] https://github.com/markshannon/faster-cpython/blob/master/pl...


Were Python calls uncached before this month? That seems extraordinary; it sounds like the world's lowest-hanging fruit, and I'm amazed nobody ever picked it. Dropbox had a whole team working on a Python JIT and didn't try that first?


The Python core developers have been working on caching in the interpreter for close to a decade[1].

They're very conservative, though, both about making the interpreter more complex and about ensuring that nothing regresses in performance unless it's absolutely necessary to fix something broken. So I wouldn't exactly describe it as "low-hanging fruit".

[1] https://bugs.python.org/issue14757


I’m afraid I don’t know the details. However, this post seems relevant - someone who worked on the JIT at Dropbox describes how expensive function calls in CPython can be:

https://blog.kevmod.com/2020/05/python-performance-its-not-j...


To my eye, the follow-up comments seem to imply it's specifically for dict objects?

> Methods already have their own fast path, so this is just for reading attributes from instances with __dict__. If __slots__ are defined, there is an additional 35% improvement.

But clearly it’s beyond just that:

> Also, imports are now faster for modules that have type annotations. The annotations are initially loaded as strings and only evaluated later if needed. The only downside is that it is easy to forget to make the relevant typing imports.
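
For anyone curious what the string-annotation behavior looks like, here's a small REPL sketch using the opt-in from __future__ import annotations (PEP 563), where annotations stay stored as strings until something like typing.get_type_hints evaluates them. (I can't say whether the interpreter speedup mentioned in the tweet uses exactly this mechanism; the Point class here is just made up for illustration.)

    >>> from __future__ import annotations
    >>> import typing
    >>> class Point:
    ...     x: int
    ...     y: int
    ...
    >>> Point.__annotations__
    {'x': 'int', 'y': 'int'}
    >>> typing.get_type_hints(Point)
    {'x': <class 'int'>, 'y': <class 'int'>}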


By far most Python objects use the __dict__ mechanism, not just dicts. For example:

    >>> class MyFoo:
    ...     pass
    ...
    >>> foo=MyFoo()
    >>> foo.bar='baz'
    >>> foo.__dict__
    {'bar': 'baz'}
    >>> foo.__dict__['bar']=42
    >>> foo.bar
    42
__slots__ is an alternative in case your object has a fixed number of member variables. It's mostly a memory optimization:

    >>> class MyFoo:
    ...     __slots__=['bar']
    ...
    >>> f=MyFoo()
    >>> f.bar='baz'
    >>> f.__dict__
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'MyFoo' object has no attribute '__dict__'
Some types of objects have neither; these tend to be builtins or objects implemented using the C API:

    >>> foo = int()
    >>> foo.__dict__
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'int' object has no attribute '__dict__'
    >>> foo.__slots__
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'int' object has no attribute '__slots__'


Thanks for the clarification. I always wished Python would make it easier somehow to auto-detect the slots from the constructor. I don't think it's just a memory optimization, though, since they mention that attribute access is faster when __slots__ is defined and benefits more from this optimization.
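
For what it's worth, newer Python gets close to that wish: on 3.10+ dataclasses can generate __slots__ from the declared fields, so you don't have to list them by hand. A quick sketch (assuming Python 3.10 or later; Point is just an example class):

    >>> from dataclasses import dataclass
    >>> @dataclass(slots=True)
    ... class Point:
    ...     x: int
    ...     y: int
    ...
    >>> Point.__slots__
    ('x', 'y')
    >>> p = Point(1, 2)
    >>> p.__dict__
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'Point' object has no attribute '__dict__'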


> improvements that could be made [...] Make a template interpreter like in the JVM. This will allow your specialized opcodes to directly make use of the call stack.

What is a “template interpreter”? How does it differ from a normal bytecode interpreter?


As I understand it, template interpreters work by generating machine code corresponding to the virtual machine instructions and jumping to that machine code, as opposed to the sort of big switch statement that you'd get in a normal bytecode interpreter.

A good intro to it here: https://programmersought.com/article/5521858566/
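
To make the contrast concrete, here's a toy sketch (nothing like CPython's real loop; the opcode names and the run function are made up for the example) of the switch-style dispatch a normal bytecode interpreter does on every instruction. A template interpreter instead jumps straight into pre-generated machine code for each opcode:

    # Toy stack-based interpreter with switch-style dispatch (illustrative only).
    def run(bytecode, consts):
        stack, pc = [], 0
        while pc < len(bytecode):
            op, arg = bytecode[pc]           # fetch
            pc += 1
            if op == "LOAD_CONST":           # decode + dispatch: the "big switch"
                stack.append(consts[arg])
            elif op == "BINARY_ADD":
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == "RETURN_VALUE":
                return stack.pop()
            else:
                raise ValueError(f"unknown opcode {op}")

    # Computes 1 + 2; every iteration pays the fetch/decode cost above.
    print(run([("LOAD_CONST", 0), ("LOAD_CONST", 1),
               ("BINARY_ADD", None), ("RETURN_VALUE", None)], consts=[1, 2]))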


Thanks. Is there a performance advantage to the template approach - compared to writing the handler for each opcode directly in assembly language? Or is it just easier to implement?


You cut out the code between the opcodes that's used for figuring out what to do next. (Removing an indirect jump at the end of each opcode execution.)


That sounds more like an advantage of a template compiler, not a template interpreter?


Yes, people in this thread are confusing the two. A template interpreter reduces overhead because it refers directly to machine registers and the machine stack to pass information between instructions, and it uses direct machine jumps rather than going through a switch and method calls.


Sounds like it’s similar to how LuaJIT works [0], using a machine-code handler for each VM opcode. AFAICT LuaJIT also uses a templating system to build these, except it generates them at compile-time instead of at run-time. I assume that’s easier than explicitly writing out each handler in assembly language.

[0] https://www.reddit.com/r/programming/comments/badl2/luajit_2...


Yep, this is very similar.


A template JIT is one that just generates machine code corresponding directly to the bytecode. That is in contrast to something like a tracing JIT, which uses runtime information to generate more optimal machine code.

One benefit of a template JIT is that the step to native compilation is a lot shorter than with a tracing JIT, apart from being A LOT simpler. Debugging a tracing JIT is a special kind of hell for extra evil programmers.
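
To illustrate the "exactly like the bytecode" part, here's a toy sketch in Python (opcode names, TEMPLATES, and template_compile are all invented for the example; a real template JIT emits machine code, not Python source). Each opcode has a fixed code template, and "compiling" just pastes the templates together in bytecode order, so there's no dispatch loop at run time:

    # Toy "template JIT": one fixed code template per opcode, stitched together
    # in bytecode order and compiled once (illustrative only).
    TEMPLATES = {
        "LOAD_CONST":   "stack.append(consts[{arg}])",
        "BINARY_ADD":   "b = stack.pop(); a = stack.pop(); stack.append(a + b)",
        "RETURN_VALUE": "return stack.pop()",
    }

    def template_compile(bytecode):
        lines = [TEMPLATES[op].format(arg=arg) for op, arg in bytecode]
        src = "def jitted(consts):\n    stack = []\n" + \
              "".join("    " + line + "\n" for line in lines)
        ns = {}
        exec(src, ns)   # "emit" the stitched-together code
        return ns["jitted"]

    fn = template_compile([("LOAD_CONST", 0), ("LOAD_CONST", 1),
                           ("BINARY_ADD", None), ("RETURN_VALUE", None)])
    print(fn(consts=[1, 2]))   # -> 3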


That's a 'template compiler' you're describing there, not a 'template interpreter'. OpenJDK doesn't include a 'template compiler'. Its lowest tier, C1, has a basic IR. OpenJDK does use a 'template interpreter'.

https://github.com/openjdk/jdk/blob/master/src/hotspot/share...


Sorry about that. Thanks for putting it straight.


I should be clearer and link to more resources. Check my PL resources page for example. But I also plan on writing a post about template interpreters soon!



