

Fast enough VMs in fast enough time - ltratt
http://tratt.net/laurie/tech_articles/articles/fast_enough_vms_in_fast_enough_time

======
sjwright
This is the most interesting thing I've read on hacker news for months. And
that's despite its subject matter being way beyond my skill level. I actually
understand how a tracing JIT works now.

~~~
Wilduck
I agree, I learned a lot from this article. I also really appreciated the
linked tutorials on writing a Brainfuck interpreter and JIT compiler in
PyPy/RPython: Part 1 [1] and Part 2 [2]. The simplicity of the example and the
good writing cleared the topic up for me immensely.

[1] <http://morepypy.blogspot.com/2011/04/tutorial-writing-interpreter-with-pypy.html>

[2] <http://morepypy.blogspot.com/2011/04/tutorial-part-2-adding-jit.html>

------
haberman
> RPython badges itself as a meta-tracing system, meaning that the user's end
> program isn't traced directly (which is what Figure 1 suggests), but rather
> the interpreter itself is traced.

Isn't that exactly what would happen if you wrote an interpreter in _any_
language with a tracing VM (e.g. LuaJIT)? How is writing an interpreter in
RPython better than writing one in Lua and running it on LuaJIT? RPython makes
you insert hints to the trace compiler (can_enter_jit, jit_merge_point) about
when to start/stop running a trace; does this buy you anything? If I had to
guess, I'd suspect that this is actually a net loss, because you have to guess
ahead of time where it would make sense to start tracing, and that sort of
guessing is notoriously hard to do. An implementation like LuaJIT
automatically decides when to start tracing based on run-time behavior, which
seems like a more robust approach.
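
For reference, those hints sit in the interpreter's dispatch loop. A
simplified sketch in the shape of the Brainfuck tutorial linked elsewhere in
this thread (illustrative only, not the exact API usage):

    from pypy.rlib.jit import JitDriver

    # "greens" identify a position in the interpreted program; "reds" are the
    # rest of the interpreter state the trace has to carry along.
    jitdriver = JitDriver(greens=['pc', 'program', 'bracket_map'],
                          reds=['pos', 'tape'])

    def mainloop(program, bracket_map, tape):
        pc = 0
        pos = 0
        while pc < len(program):
            jitdriver.jit_merge_point(pc=pc, program=program,
                                      bracket_map=bracket_map,
                                      pos=pos, tape=tape)
            op = program[pc]
            if op == '+':
                tape[pos] += 1
            elif op == '>':
                pos += 1
            elif op == ']' and tape[pos] != 0:
                pc = bracket_map[pc]  # jump back to the matching '['
                # a backward jump in the user's program: a natural point to
                # let the JIT consider starting a trace
                jitdriver.can_enter_jit(pc=pc, program=program,
                                        bracket_map=bracket_map,
                                        pos=pos, tape=tape)
            pc += 1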

The one thing I _do_ find very interesting about RPython is how it subsets a
dynamic language such that types can be statically analyzed. I always wondered
whether this was possible and what kind of restrictions you'd have to enforce.
It's great to see an actual example of this -- it will be very instructive to
anyone trying to do a similar thing.

But as far as using RPython as a "meta-tracing system," I'm not seeing what's
ground-breaking here. I'd bet 50 cents that writing an interpreter in LuaJIT
will be faster than writing it in RPython. And if I'm wrong about that, I'd
bet 50 more cents that the reason RPython wins is because it's statically
analyzable, _not_ because of anything that's unique about its "meta-tracing
system." I'm not sure that term really means anything.

~~~
rayiner
> Isn't that exactly what would happen if you wrote an interpreter in any
> language with a tracing VM (e.g. LuaJIT)?

No. Say you have an interpreter loop written as a switch:

    
    
        while (has_bytecodes) {
            bc = next_bytecode();
            switch (bc) {
                case OP_ADD: /* ... */ break;
                /* ... other opcodes ... */
            }
        }
    

(Assume for the sake of argument you've got a Lua version and a Python
version of the above.) In both cases, your interpreter loop gets compiled to
Lua or Python bytecode. Before the JIT kicks in, you've got two levels of
interpretation (call them L0 and L1): L0, the Lua or Python VM, executes L1,
your interpreter loop in bytecode form, which in turn interprets your custom
bytecode.
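
For concreteness, a Python flavour of that loop might look roughly like this
(opcode values and names are invented for illustration):

    OP_ADD, OP_PUSH_CONST = 0, 1

    def interpret(bytecodes, stack):
        pc = 0
        while pc < len(bytecodes):   # this loop is itself executed as Python
            bc = bytecodes[pc]       # bytecode by the L0 VM
            if bc == OP_ADD:
                b = stack.pop()
                a = stack.pop()
                stack.append(a + b)
            elif bc == OP_PUSH_CONST:
                pc += 1
                stack.append(bytecodes[pc])
            # ... more opcodes ...
            pc += 1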

When LuaJIT JITs the process, it will generate native code to replace L0 and
directly implement L1. The end result is the same as if you had written your
interpreter in C and directly compiled it to native code.

When PyPy JITs the process, it will use your hints to collapse both levels of
interpretation. It will replace both L0 and L1 with native code that
implements the bytecode being interpreted.

~~~
haberman
> The end result is the same as if you had written your interpreter in C and
> directly compiled it to native code.

I don't think that follows; in the LuaJIT case wouldn't traces of the
interpreter span the execution of several (custom) byte-code instructions?
Guards would be inserted that effectively ensure that the next byte-code
instruction is as expected; the net result would be a trace that indeed
corresponds to a region of the interpreted program.

> When PyPy JIT's the process, it will use your hints to collapse both levels
> of interpretation. It will replace both L0 and L1 with native code that
> implements the bytecode being interpreted.

It's just not clear to me how PyPy is going to do better when the only extra
information it has is some hints. They must be powerful hints that help in a
way I don't understand if the PyPy approach truly outperforms a run-of-the-
mill tracing VM.

~~~
rayiner
> I don't think that follows; in the LuaJIT case wouldn't traces of the
> interpreter span the execution of several (custom) byte-code instructions?
> Guards would be inserted that effectively ensure that the next byte-code
> instruction is as expected; the net result would be a trace that indeed
> corresponds to a region of the interpreted program.

No, it wouldn't do that. If the tracer worked like that, a simple loop from
0...10,000 would generate a huge trace. A trace doesn't (normally) unroll an
inner loop like that. Instead it flattens all the function calls and
linearizes all the control flow for a single iteration of the loop.

What happens depends on the interpreter, but the basic gist of the algorithm
is:

1) Every time you do a backwards jump, you increment a hotness counter
associated with the target of the jump.

2) If a target becomes hot, it's considered a loop header and tracing starts
from the header.

3) You keep accumulating the trace until you jump back to the original target;
if the trace gets too long before that happens, you abort.

4) You compile the trace to native code, inserting guards to ensure that
branches go in the same direction as expected.

5) In the native code, if a guard fails, you can either extend the trace (if
you're doing something like trace trees) or abort and fall back to the
interpreter.
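
A toy sketch of steps 1-3, with the threshold, data structures and helper
functions (record_next_op, compile_trace) invented purely for illustration:

    HOT_THRESHOLD = 50       # how many backward jumps make a target "hot"
    MAX_TRACE_LENGTH = 2000  # give up if the trace grows past this

    hotness = {}  # backward-jump target -> counter
    traces = {}   # loop header -> compiled trace

    def on_backward_jump(target):
        # step 1: bump the hotness counter for the jump target
        hotness[target] = hotness.get(target, 0) + 1
        # step 2: once hot, treat the target as a loop header and trace it
        if hotness[target] >= HOT_THRESHOLD and target not in traces:
            trace_loop(target)

    def trace_loop(header):
        trace = []
        pc = header
        while True:
            # interpret one operation, appending it to the trace
            pc = record_next_op(pc, trace)
            if pc == header:
                # step 3: back at the original target, so the trace is closed
                # (step 4, compilation, would happen here)
                traces[header] = compile_trace(trace)
                return
            if len(trace) > MAX_TRACE_LENGTH:
                return  # trace got too long: abort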

In the interpreter loop example, the top of the loop will be marked as the
loop head. Tracing will start there, following the switch down to whatever
bytecode happened to appear that time. The body of that bytecode would be
added to the trace, and the interpreter loop would jump back to the loop
head, at which point tracing would stop. Native code would be generated that
implemented a loop with that one bytecode. When it was executed, the guard
would fail on other bytecodes, and the trace would either be extended (if
using trace trees) or the JIT would give up on that loop. In the former case,
you'd end up with a native-code version of the interpreter loop. In practice,
though, because of the deep branching inside the loop, the JIT would most
likely just bail on the loop rather than try to come up with a trace for it.
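
In other words, if OP_ADD happened to be the bytecode seen while tracing, the
compiled trace would behave roughly like the following (written as Python for
readability; the guard-failure path and all names are invented):

    def compiled_trace(state):
        while True:
            bc = state.bytecodes[state.pc]
            if bc != OP_ADD:
                # guard failure: side-exit back to the interpreter (or, with
                # trace trees, start recording a new branch from here)
                return fall_back_to_interpreter(state)
            b = state.stack.pop()
            a = state.stack.pop()
            state.stack.append(a + b)
            state.pc += 1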

~~~
haberman
Thanks for the helpful clarification/explanation.

------
sb
While I find it good that the article explicitly addresses issues with trace-
based compilation (usually this is not the case), a completely fair account
also needs to present the additional memory requirements of the PyPy tool
chain. Quite recently, somebody here made exactly this point: he does not
really care about all the performance speedup he gets if the memory
requirements become outlandish at the same time.

It would also be _very_ informative to know what the differences in automatic
memory management techniques are (i.e., what did the previous implementation
do?). Personally, I am also interested in interpreter optimization
techniques, so it would be interesting to know whether the previous VM used,
for example, threaded code or something along those lines.

~~~
zellyn
Yeah: I wish speed.pypy.org also compared memory use. Admittedly, that's more
complicated.

~~~
fijal
It does not, for a very trivial reason: we lack benchmarks that have some
reasonable memory impact. Right now most benchmarks would mostly measure
interpreter size, which is slightly boring.

~~~
sb
Hm, I agree that having benchmarks with memory impact would be more
compelling, but wouldn't it be at least interesting to show the memory impact
as it is right now (i.e., how much more memory does PyPy need)?

------
j_s
"What RPython allows one to do is profoundly different to the traditional
route. In essence, one writes an interpreter and gets a JIT for free. I
suggest rereading that sentence again: it fundamentally changes the economics
of language implementation for many of us. To give a rough analogy, it is like
moving from manual memory management to automatic garbage collection."

(RPython being the foundation of PyPy.)

------
wisty
I'm surprised there's not a PyPy version of Ruby. _why's unholy showed it's
trivial to compile Ruby into Python (in many cases). It shouldn't be too hard
(famous last words) to make a Ruby VM with PyPy. If they don't like the idea,
they could make an RPython <-> RRuby translator and port the whole thing to
RRuby.

Yes, I've heard about Rubinius (the Ruby equivalent to PyPy), but it doesn't
have the resources.

~~~
judofyr
> _why's unholy showed it's trivial to compile Ruby into Python (in many
> cases). It shouldn't be too hard (famous last words) to make a Ruby VM with
> PyPy.

Basic structures like if, else, while, etc. are easy. The hard parts are the
object model (including constant lookup) and long jumps (exceptions,
next/break). Let's not forget the big core library which you need to implement
if you want it to be usable at all…

(Related: I started working on a Ruby version in JavaScript:
<https://github.com/judofyr/rubyscript>)

~~~
mdaniel
> Let's not forget the big core library which you need to implement if you
> want it to be usable at all

Wouldn't one just point the translator/compiler to whatever Ruby's equivalent
is of /usr/lib/python2.7 and let it be "turtles all the way down"?

~~~
judofyr
Core != Stdlib. Core is implemented in C. There are 111 methods just on the
String class in core…

------
codebaobab
Squeak Smalltalk bootstrapped itself (back in 1996) by implementing its
virtual machine in Slang, a subset of Smalltalk that can essentially be
pretty-printed into C.

Back to the future: the story of Squeak, a practical Smalltalk written in
itself. <http://www.vpri.org/pdf/tr1997001_backto.pdf>

It's a neat twist that RPython takes the same engine that optimizes the VM
and uses it to JIT the code that runs on the VM.

------
iskander
Very cool article. I have to admit I'm a little frightened of the bulk and
complexity of the RPython translation pipeline. I'm not happy about the
prospect of waiting an hour to learn my code runs afoul of poorly documented
type inference logic. Perhaps when PyPy stabilizes the team can trim back some
of the abandoned paths and speed up translate.py?

~~~
fijal
That has two sources: one is the complexity of the Python interpreter, the
other is the complexity of the toolchain itself. Compiling Converge takes ~5
minutes, while compiling the Python interpreter takes about 30 minutes. There
is definitely an ongoing effort to make it faster, for example by using STM
to parallelize it across multiple cores.

PS. Type inference errors show up after roughly 1/5th of the compilation
time. Not ideal, but better than waiting the full 30 minutes.

