> RPython badges itself as a meta-tracing system, meaning that the user's end program isn't traced directly (which is what Figure 1 suggests), but rather the interpreter itself is traced.
Isn't that exactly what would happen if you wrote an interpreter in any language with a tracing VM (ie. LuaJIT)? How is writing an interpreter in RPython better than writing one with LuaJIT? RPython makes you insert these hints to the trace compiler (can_enter_jit, jit_merge_point) about when to start/stop running a trace, does this buy you anything? If I had to guess, I'd suspect that this is actually a net loss because you have to guess ahead-of-time where it would make sense to start tracing. This sort of guessing is notoriously hard to do. An implementation like LuaJIT automatically decides when to start tracing based on run-time behavior, which seems like a more robust approach.
The one thing I do find very interesting about RPython is how it subsets a dynamic language such that types can be statically analyzed. I always wondered whether this was possible and what kind of restrictions you'd have to enforce. It's great to see an actual example of this -- it will be very instructive to anyone trying to do a similar thing.
But as far as using RPython as a "meta-tracing system," I'm not seeing what's ground-breaking here. I'd bet 50 cents that writing an interpreter in LuaJIT will be faster than writing it in RPython. And if I'm wrong about that, I'd bet 50 more cents that the reason RPython wins is because it's statically analyzable, not because of anything that's unique about its "meta-tracing system." I'm not sure that term really means anything.
> Isn't that exactly what would happen if you wrote an interpreter in any language with a tracing VM (ie. LuaJIT)?
No. Say you have an interpreter loop written as a switch:
bc = next_bytecode()
case OP_ADD: ...
(Assume for the sake of argument you've got a Lua version and a Python version of the above). In both cases, the loop gets compiled to bytecodes. Before the JIT, you've got two levels of interpretation (call them L0 and L1). The first level of interpretation executes the second level of interpretation, written in Lua or Python bytecode, which interprets your custom bytecode.
When LuaJIT JIT's the process, it will generate native code to replace L0 and directly implement L1. The end result is the same as if you had written your interpreter in C and directly compiled it to native code.
When PyPy JIT's the process, it will use your hints to collapse both levels of interpretation. It will replace both L0 and L1 with native code that implements the bytecode being interpreted.
> The end result is the same as if you had written your interpreter in C and directly compiled it to native code.
I don't think that follows; in the LuaJIT case wouldn't traces of the interpreter span the execution of several (custom) byte-code instructions? Guards would be inserted that effectively ensure that the next byte-code instruction is as expected; the net result would be a trace that indeed corresponds to a region of the interpreted program.
> When PyPy JIT's the process, it will use your hints to collapse both levels of interpretation. It will replace both L0 and L1 with native code that implements the bytecode being interpreted.
It's just not clear to me how PyPy is going to do better when the only extra information is has is some hints. They must be powerful hints that help in a way I don't understand if the PyPy approach truly outperforms a run-of-the-mill tracing VM.
> I don't think that follows; in the LuaJIT case wouldn't traces of the interpreter span the execution of several (custom) byte-code instructions? Guards would be inserted that effectively ensure that the next byte-code instruction is as expected; the net result would be a trace that indeed corresponds to a region of the interpreted program.
No, it wouldn't do that. If the tracer worked like that, a simple loop from 0...10,000 would generate a huge trace. A trace doesn't (normally) unroll an inner loop like that. Instead it flattens all the function calls and linearizes all the control flow for a single iteration of the loop.
What happens depends on the interpreter, but the basic gist of the algorithm is:
1) Every time you do a backwards jump, you increment a hotness counter associated with the target of the jump.
2) If a target becomes hot, it's considered a loop header and tracing starts from the header.
3) You keep accumulating the trace until you jump back to the original target; if the trace gets too long before that happens, you abort.
4) You compile the trace to native code, inserting guards to ensure that branches go in the same direction as expected.
5) In the native code, if a guard fails, you can either extend the trace (if you're doing something like trace trees) or abort and fall back to the interpreter.
In the interpreter loop example, the top of the loop will be marked as the loop head. It will start tracing, following the switch down to whatever bytecode happened to appear that time. The bytecode body would be added to the trace, and the interpreter loop would jump back to the loop head. At that point tracing would stop. Native code would be generated that implemented a loop with that one bytecode. When it was executed the guard would fail on other bytecodes, and would be extended (if using trace trees) or the JIT would give up on that loop. In the former case, you'd end up with a native-code version of the interpreter loop. Possibly--because of the deep branching inside the loop, the JIT would most likely just bail on the loop rather than try to come up with a trace for it.
The difference is that LuaJIT would be tracing your interpreter, while PyPy tries to trace the user's program while it is being traced by your interpreter. Interpreter's loop is usually quite unpredictable and is therefore not the best candidate for tracing JIT optimization. PyPy, on the other hand, (IIRC) allows you to denote which variables belong to the interpreter and which belong to the user program, and so the traces would only guard on the user variables.
> RPython makes you insert these hints to the trace compiler (can_enter_jit, jit_merge_point) about when to start/stop running a trace, does this buy you anything?
As I understand it, those are hints as to domain-specific optimizations that can be made by the trace compiler. The author mentioned that a lot of the low-hanging fruit in the converge vm was down to correctly hinting to the trace compiler what assumptions it could make.
"The second tactic is to tell the JIT when it doesn't need to include a calculation in a trace at all. The basic idea here is that when creating a trace, we often know that certain pieces of information are fairly unlikely to change in that context. We can then tell the JIT that these are constant for that trace: it will insert an appropriate guard to ensure that is true, often allowing subsequent calculations to be optimised away. The use of the word constant here can mislead: it's not a static constant in the sense that it's fixed at compile-time. Rather, it is a dynamic value that, in the context of a particular trace, is unlikely to change. Promoting values and eliding functions are the main tools in this context: Carl Friedrich Bolz described examples in a series of blog posts. The new Converge VM, for examples, uses maps (which date back to Self), much as outlined by Carl Friedrich."
I'm sure that you could provide such an interface to LuaJIT but it wouldn't be as trivial as just writing the interpreter in Lua.
I don't think that can_enter_jit and jit_merge_point are hints for optimization. I'm not particularly familiar with JIT compilation, or RPython, but it looks like RPython provides other methods for doing trace optimization.
For example, there is a decorator `@purefunction` which hints to the translator that a function will always return the same output given the same input, even if it is operating on a data structure that could be considered "opaque".
I'm not familiar either, I'm just going off the article.
"One thing the trace optimiser knows is that because program_counter is passed to jit_merge_point as a way of identifying the current position in the user's end program, any calculations based on it must be constant. These are thus easily optimised away, leaving the trace looking as in Figure 3."
I think you're right that jit_merge_point affects performance. My point was that this isn't an optimization of the JIT compilation, so much as it is a requirement for a JIT compiler to be able to operate. The compiler has to know what defines an execution frame. Once you've done that, there are definitely other optimizations that go in to reducing the opacity of your interpreter to the JIT compiler.
So, you're probably right that a fair amount of optimization can come from doing a good job of defining what identifies an execution frame. I'm curious though, if these are the sorts of "low hanging fruit" the author was referring to, or if they're included in the straightforward port of the original C interpreter.
From the section on "Optimizing an RPython JIT" it seems that the he's doing a lot more than just defining his execution frame.
> The first tactic is to remove as many instances of arbitrarily resizable lists as possible. The JIT can never be sure when appending an item to such a list might require a resize, and is thus forced to add (opaque) calls to internal list operations to deal with this possibility.
I guess my orginal point was that RPython has lots of hooks (eg the decorators that you mentioned) which allow the JIT to effectively trace the interpreted program. You would probably have to extend LuaJIT with similar hooks in order to use it in the same way. As tomp pointed out, the interpreter loop itself is not a good candidate for tracing. The JIT needs to at least know about the interpreters program counter to be able to identify loops in the interpreted program. You're right that referring to these as optimizations is inaccurate.
jit_merge_point isn't an optimization. It tells the PyPy JIT how to separate the parts of the trace that are part of your interpreter loop and the parts of the trace that are implementation of each bytecode.
Regarding the hints: without those hints the compiler would also trace the interpreter dispatch loop. If you trace them, too, then your generated code would contain unnecessary branching code. Hence, in a sense, these mechanisms allow the trace recorder to record the sequence of interpreter instructions executed without the interpreter dispatch interfering.
> RPython makes you insert these hints to the trace compiler (can_enter_jit, jit_merge_point) about when to start/stop running a trace, does this buy you anything? If I had to guess, I'd suspect that this is actually a net loss because you have to guess ahead-of-time where it would make sense to start tracing. This sort of guessing is notoriously hard to do.
Someone correct me if I'm wrong, but I think that by inserting hints, you're simply defining a point at which you want to compare your execution frame. I don't think there's much guessing involved. You can place jit_merge_point at the top your interpreter's main loop, this is where you want to compare execution frames. Then can_enter_jit simply defines what is considered a loop in your language.