
Baseline JIT and inline caches - zellyn
https://blog.pyston.org/2016/06/30/baseline-jit-and-inline-caches/
======
trishume
Interesting idea that I just came across that is somewhat relevant: The author
of LuaJIT 2.0 says an interpreter written in assembly is just as fast as a
baseline JIT and way easier to write.

He also has measurements to back this up: [http://lambda-the-ultimate.org/node/3851#comment-57761](http://lambda-the-ultimate.org/node/3851#comment-57761) (this is just one place he talks about it; there are others in that thread and elsewhere)

~~~
gsnedders
> Interesting idea that I just came across that is somewhat relevant: The
> author of LuaJIT 2.0 says an interpreter written in assembly is just as fast
> as a baseline JIT and way easier to write.

I believe this is essentially the long-term goal of V8 with Ignition: an
interpreter written in what is essentially a macro-assembler (which mostly
just relies on the code-gen from the JIT, as I understand it, so you get as
much portability as the JIT has for free) and TurboFan as the only JIT tier
(unless they have some plan for multiple tiers varying only in what
optimizations are enabled within TurboFan, similar to Chakra?). Of course,
this is still very different insofar as TurboFan still ultimately works at the
function level, whereas LuaJIT 2.0 uses traces.

An interesting counterpoint is JavaScriptCore, which has both an interpreter
written in a portable assembly _and_ a baseline JIT. Presumably they believe
the baseline JIT delivers enough of a gain to justify maintaining it?

What surprises me with Pyston is the fact that they're heralding one of the
big gains of the baseline JIT as being inline caching, "[transforming] the
bjit from only being able to remove the interpretation overhead to a JIT which
actually is able to improve the performance by a much larger factor". Surely
the better fix then is to use inline caches in the interpreter, given that'll
give most of the speedup?
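To make that suggestion concrete, here is a minimal sketch (in C, with invented names; this is neither Pyston's nor JSC's actual design) of a monomorphic inline cache used from an interpreter: the call site remembers the last receiver shape and its slot offset, so repeat lookups skip the full scan entirely.

```c
/* Hypothetical sketch of a monomorphic inline cache for attribute
   lookup inside an interpreter. All names are illustrative. */
#include <assert.h>
#include <string.h>

typedef struct { const char *name; int value; } Slot;
typedef struct { int shape_id; Slot slots[4]; int nslots; } Object;

/* Per-call-site cache: remembers the last shape and slot offset. */
typedef struct { int cached_shape; int cached_slot; } InlineCache;

static int slow_lookup(Object *obj, const char *name) {
    for (int i = 0; i < obj->nslots; i++)
        if (strcmp(obj->slots[i].name, name) == 0) return i;
    return -1;
}

int get_attr(Object *obj, const char *name, InlineCache *ic) {
    if (ic->cached_shape == obj->shape_id)     /* fast path: cache hit */
        return obj->slots[ic->cached_slot].value;
    int slot = slow_lookup(obj, name);         /* slow path: full lookup */
    assert(slot >= 0);
    ic->cached_shape = obj->shape_id;          /* fill cache for next time */
    ic->cached_slot = slot;
    return obj->slots[slot].value;
}
```

The point being: nothing in this structure requires JIT-compiled code; an interpreter can consult the same per-site cache.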

~~~
saynsedit
Not sure if a portable jit-constructed interpreter can reach the performance
of the hand-written assembly interpreter of LuaJIT. My assumption is that what
makes LuaJIT fast is careful consideration of architecture specific details.

Maybe you've seen this already:
[http://nominolo.blogspot.co.uk/2012/07/implementing-fast-int...](http://nominolo.blogspot.co.uk/2012/07/implementing-fast-interpreters.html) (notice the part about portability)
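For a flavor of the "fast interpreter" techniques that post discusses, short of full hand-written assembly, here is a minimal direct-threaded dispatch sketch in C (bytecodes and names are invented; computed goto is a GCC/Clang extension, which is exactly the portability tension being discussed). Each opcode handler ends with its own indirect jump, which branch predictors handle better than one shared switch dispatch.

```c
/* Sketch of direct-threaded bytecode dispatch via computed goto
   (a GCC/Clang extension, not standard C). Illustrative only. */
#include <assert.h>

enum { OP_PUSH, OP_ADD, OP_HALT };

int run(const int *code) {
    static void *labels[] = { &&op_push, &&op_add, &&op_halt };
    int stack[16], sp = 0;
    #define DISPATCH() goto *labels[*code++]
    DISPATCH();
op_push:                          /* push the next literal */
    stack[sp++] = *code++;
    DISPATCH();
op_add:                           /* pop two, push their sum */
    sp--; stack[sp - 1] += stack[sp];
    DISPATCH();
op_halt:                          /* return top of stack */
    return stack[sp - 1];
    #undef DISPATCH
}
```

A hand-written assembly interpreter goes further still (fixed register assignment, tuned instruction scheduling), but this is the shape of the portable middle ground.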

~~~
gsnedders
> Not sure if a portable jit-constructed interpreter can reach the performance
> of the hand-written assembly interpreter of LuaJIT. My assumption is that
> what makes LuaJIT fast is careful consideration of architecture specific
> details.

> Maybe you've seen this already:
> [http://nominolo.blogspot.co.uk/2012/07/implementing-fast-int...](http://nominolo.blogspot.co.uk/2012/07/implementing-fast-interpreters.html) (notice the part about portability)

My assumption would be that architecture specific details contribute
relatively little towards the performance of LuaJIT outside of register
constrained architectures (x86-32, most obviously). Mike's comment on Reddit
years ago about the performance of LuaJIT's interpreter seems to be consistent
with that:
[https://www.reddit.com/r/programming/comments/badl2/luajit_2...](https://www.reddit.com/r/programming/comments/badl2/luajit_2_beta_3_is_out_support_both_x32_x64/c0lrus0?st=iq43mp4g&sh=6b15e555).
We're now reaching a point where x86-32 performance isn't such a consideration
any more (simply because it's an increasingly rare architecture), so you can
bound your number of virtual registers by the lowest register count among the
architectures you care about, and thereby maintain a 1:1 register mapping on
all of them.

As that post says in its section on portability (LLInt is JavaScriptCore's
interpreter, mentioned in my post above, for those unaware), most of these
portable-assembly implementations have a way to drop down and hand-write
architecture-specific code for the cases where it matters, and in general
instruction selection can be smart enough that it makes little difference
elsewhere.

Some of the big gains come from having control over stack layout (so you don't
need to maintain an interpreter stack separate from the thread's), which
accounted for a lot of the overhead in JavaScriptCore's old interpreter. LLInt
deliberately uses the same stack layout as the JITs, which allows on-stack
replacement (OSR) between the interpreter and the JIT. This is something that
Ignition aims to do, and something that LuaJIT already does.

~~~
saynsedit
Thanks for replying, I agree with this.

------
eval-everything
A while ago, I posted a link to this talk at Stanford by Eliot Miranda and
Clément Béra on Pharo's (and Squeak's) new VM, Spur, and its Sista optimizer:
[http://www.youtube.com/watch?v=f4Cvia-HZ-w](http://www.youtube.com/watch?v=f4Cvia-HZ-w)

In short, it leverages the inherent type information of a PIC to speculatively
inline jitted methods directly at their call sites.

Unfortunately, HN's spam filter eats threads started by new accounts, so I
don't think anyone saw it. Maybe a regular user with good karma can re-post
it?

~~~
chrisseaton
> In short, it leverages the inherent type information of a PIC to
> speculatively inline jitted methods directly at their call sites.

I'm not sure if you were suggesting that this was unique or novel, but the
technique has been around since 1991 and most JITs for languages like Java,
Smalltalk, Python, Ruby, etc already do this.

In many cases in those languages, if we couldn't do this there wouldn't be any
inlining at all and performance would be terrible.

~~~
evaleverything2
I mean that it inlines the method itself at the call site, eliding the
overhead of a method invocation altogether, whereas a normal (P)IC simply
elides the overhead of a class+selector method lookup.
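A minimal sketch of the distinction (illustrative C, invented names): a plain IC caches the result of the lookup but still performs a call, while speculative inlining pastes the guarded method body directly at the call site.

```c
/* Contrast: IC-cached dispatch vs. speculatively inlined body.
   All names and the class_id guard scheme are illustrative. */
#include <assert.h>

typedef struct { int class_id; int width, height; } Obj;

static int area_method(Obj *o) { return o->width * o->height; }

/* Plain IC: elides the class+selector lookup, keeps the call. */
int call_with_ic(Obj *o, int (**cached)(Obj *)) {
    if (!*cached) *cached = area_method;   /* fill cache on first miss */
    return (*cached)(o);                   /* still an indirect call */
}

/* Speculative inlining: the method body sits at the call site,
   behind a type guard; on guard failure, fall back (deoptimize). */
int call_inlined(Obj *o) {
    if (o->class_id == 1)                  /* guard from the PIC's type info */
        return o->width * o->height;       /* inlined body, no call at all */
    return area_method(o);                 /* fallback / deopt path */
}
```

The second form is what opens the door to further optimization across the former call boundary.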

~~~
chrisseaton
Yeah I get that - but this inlining of the method itself within an IC,
removing the method call overhead, has been done in every non-trivial dynamic
language VM since the early 90s.

------
zellyn
Curious that they don't mention actual performance numbers in this blog post…

~~~
eiopa
They never do

