I don't think that follows; in the LuaJIT case wouldn't traces of the interpreter span the execution of several (custom) byte-code instructions? Guards would be inserted that effectively ensure that the next byte-code instruction is as expected; the net result would be a trace that indeed corresponds to a region of the interpreted program.
> When PyPy JIT's the process, it will use your hints to collapse both levels of interpretation. It will replace both L0 and L1 with native code that implements the bytecode being interpreted.
It's just not clear to me how PyPy is going to do better when the only extra information it has is some hints. They must be powerful hints that help in a way I don't understand if the PyPy approach truly outperforms a run-of-the-mill tracing VM.
No, it wouldn't do that. If the tracer worked like that, a simple loop from 0...10,000 would generate a huge trace. A trace doesn't (normally) unroll an inner loop like that. Instead it flattens all the function calls and linearizes all the control flow for a single iteration of the loop.
What happens depends on the interpreter, but the basic gist of the algorithm is:
1) Every time you do a backwards jump, you increment a hotness counter associated with the target of the jump.
2) If a target becomes hot, it's considered a loop header and tracing starts from the header.
3) You keep accumulating the trace until you jump back to the original target; if the trace gets too long before that happens, you abort.
4) You compile the trace to native code, inserting guards to ensure that branches go in the same direction as expected.
5) In the native code, if a guard fails, you can either extend the trace (if you're doing something like trace trees) or abort and fall back to the interpreter.
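To make the steps above concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the three-instruction bytecode (add, jump_if_lt, halt), the names interpret and run_trace, and the HOT_THRESHOLD and MAX_TRACE_LEN values are assumptions, not anything from LuaJIT or PyPy. "Compiling" is faked by replaying the recorded steps in a Python loop, and a failed guard always falls back to the interpreter (no trace trees).

```python
from collections import defaultdict

HOT_THRESHOLD = 3    # hypothetical tuning knobs, not taken from any real JIT
MAX_TRACE_LEN = 50

def value(env, x):
    """Operands are either variable names (str) or integer constants."""
    return env[x] if isinstance(x, str) else x

def run_trace(trace, env):
    """Stand-in for native code: loop over the trace until a guard fails."""
    while True:
        for step in trace:
            if step[0] == "add":
                _, dst, src = step
                env[dst] += value(env, src)
            else:  # ("guard_lt", a, b, expected, exit_pc)
                _, a, b, expected, exit_pc = step
                if (value(env, a) < value(env, b)) != expected:
                    return exit_pc  # step 5: guard failed, back to the interpreter

def interpret(program, env):
    hotness = defaultdict(int)  # step 1: one counter per backward-jump target
    traces = {}                 # loop header pc -> recorded trace
    blacklist = set()           # headers whose trace got too long
    recording = None            # (header_pc, steps) while tracing
    pc = 0
    while True:
        if recording is None and pc in traces:
            pc = run_trace(traces[pc], env)
            continue
        op, *args = program[pc]
        next_pc = pc + 1
        if op == "halt":
            return traces
        if op == "add":
            dst, src = args
            env[dst] += value(env, src)
            step = ("add", dst, src)
        else:  # jump_if_lt a, b, target
            a, b, target = args
            taken = value(env, a) < value(env, b)
            exit_pc = pc + 1 if taken else target  # where the other branch goes
            if taken:
                next_pc = target
            step = ("guard_lt", a, b, taken, exit_pc)  # step 4: branch -> guard
        if recording is not None:
            header, steps = recording
            steps.append(step)
            if next_pc == header:              # step 3: back at the header
                traces[header] = steps
                recording = None
            elif len(steps) >= MAX_TRACE_LEN:  # step 3: too long, abort
                blacklist.add(header)
                recording = None
        elif next_pc <= pc and next_pc not in traces and next_pc not in blacklist:
            hotness[next_pc] += 1              # step 1: backward jump
            if hotness[next_pc] >= HOT_THRESHOLD:
                recording = (next_pc, [])      # step 2: start tracing at header
        pc = next_pc

# total = 0 + 1 + ... + (n - 1), written as a loop with one backward jump
program = [
    ("add", "total", "i"),
    ("add", "i", 1),
    ("jump_if_lt", "i", "n", 0),
    ("halt",),
]
env = {"i": 0, "n": 10, "total": 0}
traces = interpret(program, env)
print(env["total"], traces[0])
```

The recorded trace for the loop header at pc 0 has three steps (two adds and one guard) whether n is 10 or 10,000: it covers a single iteration and loops back on itself rather than unrolling.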
In the interpreter loop example, the top of the loop will be marked as the loop head. Tracing starts there and follows the switch down to whatever bytecode happened to appear that time. That bytecode's body is added to the trace, the interpreter loop jumps back to the loop head, and at that point tracing stops. The native code generated implements a loop containing that one bytecode. When it runs, the guard fails on every other bytecode, and the trace is either extended (if using trace trees) or the JIT gives up on that loop. In the former case, you might end up with a native-code version of the interpreter loop. More likely, because of the deep branching inside the loop, the JIT would just bail on the loop rather than try to come up with a trace for it.
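A rough illustration of why that first trace is so narrow, again in Python with made-up names (record_one_iteration, guard_holds, and the PUSH/ADD opcodes are all hypothetical). It does not execute anything; it only records which operations one pass through a toy dispatch loop would put in the trace, and then checks where the opcode guard at the top would fail.

```python
def record_one_iteration(code, pc):
    """Trace one pass through a toy dispatch loop, starting at the loop head."""
    op, _arg = code[pc]
    trace = [("guard_opcode", op)]   # the switch collapses into a single guard
    if op == "PUSH":
        trace += [("load_arg",), ("stack_push",)]
    elif op == "ADD":
        trace += [("stack_pop",), ("stack_pop",), ("int_add",), ("stack_push",)]
    trace.append(("inc_pc",))        # then jump back to the loop head
    return trace

def guard_holds(trace, code, pc):
    """The trace is only valid where the opcode matches the one it was recorded on."""
    return code[pc][0] == trace[0][1]

code = [("PUSH", 1), ("PUSH", 2), ("ADD", None), ("PUSH", 3), ("ADD", None)]
trace = record_one_iteration(code, 0)   # recorded while dispatching a PUSH
failures = [pc for pc in range(len(code)) if not guard_holds(trace, code, pc)]
print(trace[0], failures)
```

The trace recorded on the first PUSH fails its guard at every ADD (pcs 2 and 4 here), so each new opcode either grows the trace tree by another branch or pushes the JIT toward giving up on the loop.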