Interesting idea I just came across that seems somewhat relevant: the author of LuaJIT 2.0 says an interpreter written in assembly is just as fast as a baseline JIT and way easier to write.
> Interesting idea I just came across that seems somewhat relevant: the author of LuaJIT 2.0 says an interpreter written in assembly is just as fast as a baseline JIT and way easier to write.
I believe this is essentially the long-term goal of V8 with Ignition: an interpreter written in what amounts to a macro-assembler (which, as I understand it, mostly reuses the code-gen from the JIT, so you get as much portability as the JIT has for free), with TurboFan as the only JIT tier (unless they have some plan for multiple tiers varying only in which optimizations are enabled within TurboFan, similar to Chakra?). Of course, this is still very different insofar as TurboFan still ultimately works at the function level, whereas LuaJIT 2.0 uses traces.
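(For anyone wondering what the interpretation overhead in question actually looks like: below is a minimal sketch in C of a direct-threaded dispatch loop, using GCC/Clang's computed-goto extension and a made-up four-opcode bytecode. A hand-written assembly interpreter like LuaJIT's goes further than a compiler reliably will, e.g. by pinning the VM state — pc, stack pointer, dispatch table — in fixed machine registers.)

    /* Minimal direct-threaded dispatch loop (GCC/Clang computed gotos).
       Opcodes and bytecode format are invented for illustration. */
    #include <stdint.h>
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    void run(const uint8_t *code) {
        static void *dispatch[] = {
            &&op_push, &&op_add, &&op_print, &&op_halt
        };
        int64_t stack[64], *sp = stack;
        const uint8_t *pc = code;

    #define NEXT() goto *dispatch[*pc++]   /* one indirect jump per op */
        NEXT();
    op_push:  *sp++ = *pc++;                        NEXT();
    op_add:   sp--; sp[-1] += sp[0];                NEXT();
    op_print: printf("%lld\n", (long long)sp[-1]);  NEXT();
    op_halt:  return;
    #undef NEXT
    }

    int main(void) {
        const uint8_t prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(prog);   /* prints 5 */
        return 0;
    }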
An interesting counterpoint is JavaScriptCore, which has both an interpreter written in a portable assembly and a baseline JIT. Presumably they believe the baseline JIT gains them enough to justify maintaining both?
What surprises me about Pyston is that they herald inline caching as one of the big gains of the baseline JIT, "[transforming] the bjit from only being able to remove the interpretation overhead to a JIT which actually is able to improve the performance by a much larger factor". Surely the better fix, then, is to use inline caches in the interpreter, given that would give most of the speedup?
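(To make that concrete, here's a rough sketch of a per-bytecode-site inline cache at the interpreter level; the types and names are all invented for illustration, not Pyston's actual code. The point is just that the fast path is a single pointer compare instead of a hash-table lookup, and an interpreter can do that just as well as a JIT can.)

    /* Monomorphic inline cache per bytecode site (names invented).
       Real VMs also invalidate caches when classes are mutated. */
    typedef struct Class Class;
    typedef struct Object { Class *klass; } Object;

    typedef struct ICache {
        Class *cached_klass;   /* class seen last time at this site */
        void  *cached_slot;    /* result of the expensive lookup    */
    } ICache;

    /* Stub standing in for the real hash-table attribute lookup. */
    static void *lookup_slow(Class *k, const char *name) {
        (void)k; (void)name;
        return 0;
    }

    void *get_attr(Object *obj, const char *name, ICache *ic) {
        if (obj->klass == ic->cached_klass)       /* fast path: one compare */
            return ic->cached_slot;
        void *slot = lookup_slow(obj->klass, name);   /* slow path */
        ic->cached_klass = obj->klass;            /* refill the cache */
        ic->cached_slot  = slot;
        return slot;
    }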
Not sure a portable, JIT-constructed interpreter can reach the performance of LuaJIT's hand-written assembly interpreter. My assumption is that what makes LuaJIT fast is careful consideration of architecture-specific details.
> Not sure a portable, JIT-constructed interpreter can reach the performance of LuaJIT's hand-written assembly interpreter. My assumption is that what makes LuaJIT fast is careful consideration of architecture-specific details.
My assumption would be that architecture-specific details contribute relatively little to LuaJIT's performance outside of register-constrained architectures (x86-32, most obviously). Mike's comment on Reddit years ago about the performance of LuaJIT's interpreter seems consistent with that: https://www.reddit.com/r/programming/comments/badl2/luajit_2.... We're now reaching the point where x86-32 performance isn't such a consideration any more (simply because it's an increasingly rare architecture), so you can just bound your number of virtual registers by the most register-constrained architecture you still care about, and thereby keep a 1:1 virtual-to-machine register mapping on every architecture that matters to you.
As that post says in its bit about portability (LLInt is JavaScriptCore's interpreter, mentioned in my post above, for those unaware), most of these portable assembly implementations have a way to drop down to hand-written architecture-specific code for the cases where it makes a real difference, and in general the instruction selection can be done smartly enough that elsewhere it matters little.
Some of the big gains come from having control over stack layout (you don't need to maintain an interpreter stack separate from the thread's own), which accounted for a lot of the overhead in JavaScriptCore's old interpreter. LLInt deliberately uses the same stack layout as the JITs, which allows on-stack replacement (OSR) between the interpreter and the JIT. This is something Ignition is aiming to do, and something LuaJIT already does.
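(A crude sketch of that shared-layout idea, with invented field names: if both tiers agree on one frame shape, tiering up mid-function needs no frame conversion or copying at all.)

    /* One frame layout shared by the interpreter and the JIT (field
       names invented). Because both tiers agree on where the caller
       link, return address and locals live, OSR is a jump into JIT
       code rather than a frame rewrite. */
    #include <stdint.h>

    typedef struct Frame {
        struct Frame *caller;     /* link to the calling frame       */
        void         *return_pc;  /* where to resume in the caller   */
        void         *callee;     /* function object being executed  */
        intptr_t      locals[];   /* virtual registers / local slots */
    } Frame;

    /* JIT code is emitted against the same Frame layout, so entering
       it mid-function is just a call that never returns to the
       interpreter loop. */
    typedef void (*JitEntry)(Frame *frame, int resume_offset);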
A while ago, I posted a link to this talk at Stanford by Eliot Miranda and Clément Béra on Pharo's (and Squeak's) new VM, Spur, and its Sista optimizer: http://www.youtube.com/watch?v=f4Cvia-HZ-w
In short, it leverages the inherent type information of a PIC to speculatively inline jitted methods directly at their call sites.
Unfortunately, HN's spam filter eats threads started by new accounts, so I don't think anyone saw it. Maybe a regular user with good karma can re-post it?
> In short, it leverages the inherent type information of a PIC to speculatively inline jitted methods directly at their call sites.
I'm not sure if you were suggesting that this was unique or novel, but the technique has been around since 1991, and most JITs for languages like Java, Smalltalk, Python, and Ruby already do this.
In many cases in those languages, if we couldn't do this there wouldn't be any inlining at all, and performance would be terrible.
I mean that it inlines the method itself at the call site, eliding the overhead of a method invocation altogether, whereas a normal (P)IC simply elides the overhead of a class+selector method lookup.
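(Spelled out in C with invented names, assuming a monomorphic cache: the first function shows what the emitted code for a plain IC buys you, the second what speculative inlining buys you.)

    typedef struct Class Class;
    typedef struct Obj { Class *klass; long field; } Obj;

    extern Class *CACHED_KLASS;                       /* guard value   */
    extern long cached_method(Obj *obj, long arg);    /* cached target */
    extern long lookup_and_call(Obj *obj, const char *sel, long arg);
    extern long deoptimize_and_call(Obj *obj, const char *sel, long arg);

    /* A plain (P)IC: the class+selector lookup is elided, but a full
       method invocation remains. */
    long send_with_ic(Obj *obj, long arg) {
        if (obj->klass == CACHED_KLASS)
            return cached_method(obj, arg);        /* still a call */
        return lookup_and_call(obj, "selector", arg);
    }

    /* Speculative inlining: the same type guard, but the method body
       is pasted in at the call site, so there is no call at all. */
    long send_inlined(Obj *obj, long arg) {
        if (obj->klass == CACHED_KLASS)
            return obj->field + arg;               /* inlined body */
        return deoptimize_and_call(obj, "selector", arg);
    }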
Yeah I get that - but this inlining of the method itself within an IC, removing the method call overhead, has been done in every non-trivial dynamic language VM since the early 90s.
He also has measurements to back this up: http://lambda-the-ultimate.org/node/3851#comment-57761 (this is just one place he talks about it; there are others in that thread and elsewhere)