Isn't that exactly what would happen if you wrote an interpreter in any language with a tracing VM (e.g. LuaJIT)? How is writing an interpreter in RPython better than writing one with LuaJIT? RPython makes you insert these hints to the trace compiler (can_enter_jit, jit_merge_point) about when to start/stop running a trace; does this buy you anything? If I had to guess, I'd suspect that this is actually a net loss, because you have to guess ahead-of-time where it would make sense to start tracing. This sort of guessing is notoriously hard to do. An implementation like LuaJIT automatically decides when to start tracing based on run-time behavior, which seems like a more robust approach.
The one thing I do find very interesting about RPython is how it subsets a dynamic language such that types can be statically analyzed. I always wondered whether this was possible and what kind of restrictions you'd have to enforce. It's great to see an actual example of this -- it will be very instructive to anyone trying to do a similar thing.
But as far as using RPython as a "meta-tracing system," I'm not seeing what's ground-breaking here. I'd bet 50 cents that writing an interpreter in LuaJIT will be faster than writing it in RPython. And if I'm wrong about that, I'd bet 50 more cents that the reason RPython wins is because it's statically analyzable, not because of anything that's unique about its "meta-tracing system." I'm not sure that term really means anything.
No. Say you have an interpreter loop written as a switch:
L0: bc = next_bytecode()
L1: switch (bc) { case OP_ADD: ... }
When LuaJIT JITs the process, it will generate native code to replace L0 and directly implement L1. The end result is the same as if you had written your interpreter in C and compiled it directly to native code.
When PyPy JITs the process, it will use your hints to collapse both levels of interpretation. It will replace both L0 and L1 with native code that implements the bytecode being interpreted.
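To make "two levels of interpretation" concrete, here's a toy bytecode interpreter (the opcodes and program are invented for illustration). L0 is the dispatch, L1 is the opcode bodies; LuaJIT would compile this dispatch loop as ordinary code, while PyPy's hints let the tracer specialize on the particular bytecode being run, so both the dispatch and the opcode bodies drop out of the trace.

```python
# Toy stack-machine interpreter; PUSH/ADD/HALT are hypothetical opcodes.
PUSH, ADD, HALT = 0, 1, 2

def run(bytecode):
    stack, pc = [], 0
    while True:
        op = bytecode[pc]                  # L0: fetch/dispatch
        if op == PUSH:                     # L1: opcode bodies
            stack.append(bytecode[pc + 1])
            pc += 2
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
            pc += 1
        elif op == HALT:
            return stack.pop()

# computes 2 + 3
program = [PUSH, 2, PUSH, 3, ADD, HALT]
```

Running `run(program)` as plain Python executes both levels; a meta-tracing JIT given the right hints would, for a hot loop in `program`, emit native code for the user-level loop with the dispatch gone entirely.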
I don't think that follows; in the LuaJIT case wouldn't traces of the interpreter span the execution of several (custom) byte-code instructions? Guards would be inserted that effectively ensure that the next byte-code instruction is as expected; the net result would be a trace that indeed corresponds to a region of the interpreted program.
> When PyPy JITs the process, it will use your hints to collapse both levels of interpretation. It will replace both L0 and L1 with native code that implements the bytecode being interpreted.
It's just not clear to me how PyPy is going to do better when the only extra information it has is some hints. They must be powerful hints that help in a way I don't understand if the PyPy approach truly outperforms a run-of-the-mill tracing VM.
No, it wouldn't do that. If the tracer worked like that, a simple loop from 0...10,000 would generate a huge trace. A trace doesn't (normally) unroll an inner loop like that. Instead it flattens all the function calls and linearizes all the control flow for a single iteration of the loop.
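A small illustration of "flattening calls and linearizing control flow": below is an ordinary loop with a function call, followed by a hand-written sketch of what a tracer might record for one iteration. The trace op names are invented; the point is that the call is inlined and the loop condition becomes a guard rather than a branch.

```python
def square(x):
    return x * x

def loop(n):
    total = 0
    i = 0
    while i < n:
        total += square(i)   # this call gets inlined ("flattened") in the trace
        i += 1
    return total

# Hypothetical linearized trace of ONE iteration -- illustrative
# pseudo-ops, not real JIT output:
one_iteration_trace = [
    "guard(i < n)",        # loop condition becomes a guard, not a branch
    "t1 = i * i",          # square() inlined: no call left in the trace
    "total = total + t1",
    "i = i + 1",
    "jump(loop_header)",   # back-edge closes the compiled loop
]
```

Note the trace covers a single iteration and ends with a jump back to its own header; it is not 10,000 unrolled copies of the body.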
What happens depends on the interpreter, but the basic gist of the algorithm is:
1) Every time you do a backwards jump, you increment a hotness counter associated with the target of the jump.
2) If a target becomes hot, it's considered a loop header and tracing starts from the header.
3) You keep accumulating the trace until you jump back to the original target; if the trace gets too long before that happens, you abort.
4) You compile the trace to native code, inserting guards to ensure that branches go in the same direction as expected.
5) In the native code, if a guard fails, you can either extend the trace (if you're doing something like trace trees) or abort and fall back to the interpreter.
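Steps 1 and 2 above (the hotness-counter trigger) can be sketched in a few lines; the threshold and the idea of calling a hook on every backward jump are invented for the sketch, and real VMs use much larger thresholds.

```python
HOT_THRESHOLD = 2  # tiny for demonstration; real VMs use far larger values

counters = {}        # jump target pc -> hotness count
traced_headers = []  # targets that became hot enough to trace

def on_backward_jump(target_pc):
    """Hypothetical hook the interpreter calls on every backward jump.
    Step 1: bump the counter for the jump target.
    Step 2: once it crosses the threshold, treat it as a loop header
    and (here, just pretend to) start recording a trace from it."""
    counters[target_pc] = counters.get(target_pc, 0) + 1
    if counters[target_pc] == HOT_THRESHOLD:
        traced_headers.append(target_pc)  # real JIT: begin tracing here

# simulate a loop at pc=10 whose back-edge is taken three times
for _ in range(3):
    on_backward_jump(10)
```

The `== HOT_THRESHOLD` comparison (rather than `>=`) ensures tracing is triggered exactly once per header, even as the counter keeps climbing.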
In the interpreter loop example, the top of the loop will be marked as the loop head. It will start tracing, following the switch down to whatever bytecode happened to appear that time. The bytecode body would be added to the trace, and the interpreter loop would jump back to the loop head. At that point tracing would stop. Native code would be generated that implemented a loop with that one bytecode. When it was executed, the guard would fail on other bytecodes, and the trace would be extended (if using trace trees) or the JIT would give up on that loop. In the former case, you'd end up with a native-code version of the interpreter loop; but because of the deep branching inside the loop, the JIT would most likely just bail on it rather than try to come up with a trace for it.
As I understand it, those are hints as to domain-specific optimizations that can be made by the trace compiler. The author mentioned that a lot of the low-hanging fruit in the converge vm was down to correctly hinting to the trace compiler what assumptions it could make.
"The second tactic is to tell the JIT when it doesn't need to include a calculation in a trace at all. The basic idea here is that when creating a trace, we often know that certain pieces of information are fairly unlikely to change in that context. We can then tell the JIT that these are constant for that trace: it will insert an appropriate guard to ensure that is true, often allowing subsequent calculations to be optimised away. The use of the word constant here can mislead: it's not a static constant in the sense that it's fixed at compile-time. Rather, it is a dynamic value that, in the context of a particular trace, is unlikely to change. Promoting values and eliding functions are the main tools in this context: Carl Friedrich Bolz described examples in a series of blog posts. The new Converge VM, for example, uses maps (which date back to Self), much as outlined by Carl Friedrich."
I'm sure that you could provide such an interface to LuaJIT but it wouldn't be as trivial as just writing the interpreter in Lua.
For example, there is a decorator `@purefunction` which hints to the translator that a function will always return the same output given the same input, even if it is operating on a data structure that could be considered "opaque".
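Here's a hedged sketch of how that hint is typically combined with the Self-style maps mentioned above. The `purefunction` decorator below is a no-op stand-in (the real hint lives in RPython's jit library, and has since been renamed `elidable`); the class and attribute names are invented for illustration.

```python
def purefunction(func):
    # No-op stand-in for RPython's hint; under translation, it promises
    # the tracer that same inputs always yield the same output, so a
    # traced call can be replaced by its (guarded) cached result.
    return func

class Map(object):
    """Shared layout object: attribute name -> storage index."""
    def __init__(self):
        self.indexes = {}

    @purefunction
    def get_index(self, name):
        # Pure with respect to (self, name): once a trace has promoted
        # the map to a constant, this whole lookup folds away.
        return self.indexes.get(name, -1)

class Obj(object):
    def __init__(self, map_):
        self.map = map_
        self.storage = []

    def getattr(self, name):
        idx = self.map.get_index(name)
        return self.storage[idx] if idx >= 0 else None

# all "points" share one map, so traces through getattr see a constant map
point_map = Map()
point_map.indexes["x"] = 0
p = Obj(point_map)
p.storage.append(42)
```

The payoff is that in a hot trace, `p.getattr("x")` reduces to a guard on the map plus a direct read of `storage[0]`, with the dictionary lookup gone.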
"One thing the trace optimiser knows is that because program_counter is passed to jit_merge_point as a way of identifying the current position in the user's end program, any calculations based on it must be constant. These are thus easily optimised away, leaving the trace looking as in Figure 3."
So, you're probably right that a fair amount of optimization can come from doing a good job of defining what identifies an execution frame. I'm curious though, if these are the sorts of "low hanging fruit" the author was referring to, or if they're included in the straightforward port of the original C interpreter.
From the section on "Optimizing an RPython JIT" it seems that he's doing a lot more than just defining his execution frame.
> The first tactic is to remove as many instances of arbitrarily resizable lists as possible. The JIT can never be sure when appending an item to such a list might require a resize, and is thus forced to add (opaque) calls to internal list operations to deal with this possibility.
Someone correct me if I'm wrong, but I think that by inserting hints, you're simply defining a point at which you want to compare your execution frame. I don't think there's much guessing involved. You can place jit_merge_point at the top of your interpreter's main loop; this is where you want to compare execution frames. Then can_enter_jit simply defines what is considered a loop in your language.
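That placement can be sketched concretely. The JitDriver stub below stands in for the real one from RPython's jit library (not assumed importable here), and the opcodes and `greens`/`reds` split are illustrative: greens identify a position in the user program, reds are everything else.

```python
class JitDriver(object):
    """No-op stub for RPython's JitDriver, so this runs as plain Python
    (RPython programs are also valid Python programs)."""
    def __init__(self, greens, reds):
        pass
    def jit_merge_point(self, **kwargs):
        pass
    def can_enter_jit(self, **kwargs):
        pass

driver = JitDriver(greens=['pc', 'bytecode'], reds=['acc'])

DECREMENT, JUMP_IF_NONZERO, HALT = 0, 1, 2   # hypothetical opcodes

def interpret(bytecode, acc):
    pc = 0
    while True:
        # top of the dispatch loop: where execution frames are compared
        driver.jit_merge_point(pc=pc, bytecode=bytecode, acc=acc)
        op = bytecode[pc]
        if op == DECREMENT:
            acc -= 1
            pc += 1
        elif op == JUMP_IF_NONZERO:
            target = bytecode[pc + 1]
            if acc != 0:
                if target < pc:  # a backward jump: "a loop in your language"
                    driver.can_enter_jit(pc=target, bytecode=bytecode, acc=acc)
                pc = target
            else:
                pc += 2
        elif op == HALT:
            return acc

# counts acc down to zero: DECREMENT; JUMP_IF_NONZERO -> 0; HALT
program = [DECREMENT, JUMP_IF_NONZERO, 0, HALT]
```

Note there's no guessing about hot spots here: jit_merge_point goes at the top of the one dispatch loop, and can_enter_jit marks backward jumps in the *interpreted* program; the run-time hotness decision still belongs to the JIT.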