He also has measurements to back this up: http://lambda-the-ultimate.org/node/3851#comment-57761 (this is just one place he talks about it; there are others in that thread and elsewhere)
I believe this is essentially the long-term goal of V8 with Ignition: an interpreter written in what amounts to a macro-assembler (which mostly just relies on the code-gen from the JIT, as I understand it, so you get as much portability as the JIT has for free) and TurboFan as the only JIT tier (unless they have some plan for multiple tiers varying only in which optimizations are enabled within TurboFan, similar to Chakra?). Of course, this is still very different insofar as TurboFan still ultimately works at the function level, whereas LuaJIT 2.0 uses traces.
What surprises me with Pyston is that they're heralding inline caching as one of the big gains of the baseline JIT, "[transforming] the bjit from only being able to remove the interpretation overhead to a JIT which actually is able to improve the performance by a much larger factor". Surely the better fix then is to use inline caches in the interpreter, given that would capture most of the speedup?
Maybe you've seen this already: http://nominolo.blogspot.co.uk/2012/07/implementing-fast-int... (notice the part about portability)
> Maybe you've seen this already: http://nominolo.blogspot.co.uk/2012/07/implementing-fast-int... (notice the part about portability)
My assumption would be that architecture-specific details contribute relatively little towards the performance of LuaJIT outside of register-constrained architectures (x86-32, most obviously). Mike's comment on Reddit years ago about the performance of LuaJIT's interpreter seems to be consistent with that: https://www.reddit.com/r/programming/comments/badl2/luajit_2.... We're now reaching a point where x86-32 performance isn't such a consideration any more (simply because it's an increasingly rare architecture), so you can just bound your number of virtual registers by the register count of the most constrained architecture you care about, thereby maintaining a 1:1 register mapping on all of them.
If you consider LuaJIT 1.x to be a baseline JIT, the interpreter mode of LuaJIT 2.0 is actually on par there.
EDIT: actually I should say that in x86 mode they are on par, while in x64 mode the interpreter of LuaJIT 2.0 is always faster
In short, it leverages the inherent type information of a PIC to speculatively inline jitted methods directly at their call sites.
Unfortunately, HN's spam filter eats threads started by new accounts, so I don't think anyone saw it. Maybe a regular user with good karma can re-post it?
I'm not sure if you were suggesting that this was unique or novel, but the technique has been around since 1991, and most JITs for languages like Java, Smalltalk, Python, Ruby, etc. already do this.
In many cases in those languages, if we couldn't do this there wouldn't be any inlining at all and performance would be terrible.