

Have Tracing JIT Compilers Won? - nex3
http://lambda-the-ultimate.org/node/3851

======
illumen
Ahead-of-time compilers do work quite well for a subset of dynamic languages,
just as tracing JITs work well for a subset. This has been proven by projects
like shedskin, cython, and tinypyC++: given typing information, either
explicit or inferred, massive speed-ups are possible.

Also, tracing JITs currently don't do the multi-CPU optimizations that static
compilers have been doing, nor the SIMD or MIMD optimizations that compilers
like gcc and icc perform.

However, runtime assemblers - like Orc and corepy, and older ones like
softwire - are proving to be much faster in multimedia applications than
traditional compilers or interpreters. These runtime assemblers give program
authors direct access to the assembler. Just as letting people specify types
can give massive speed-ups, so can letting them construct programs at
runtime.

C has even had fast interpreters for a while - see tinycc as an example. It
doesn't use a tracing compiler, but a very straightforward direct-translation
compiler which happens to be faster than many interpreters.

It's quite easy to argue 'No' to this question, I think.

~~~
beagle3
> C has even had fast interpreters for a while - see tinycc as an example. It
> doesn't use a tracing compiler, but a very straightforward direct-
> translation compiler which happens to be faster than many interpreters.

How is tinycc an interpreter? It's a full-blown compiler, albeit a very simple
one with very little optimization. If it weren't faster than an interpreter,
I'd be shocked. For well-written code, it reaches 50%-70% of modern gcc's
speed in tests I did a couple of years ago. I think that's on par with, or
even slightly better than, gcc 1.x.

~~~
barrkel
It helps that C is designed to have an almost one-to-one mapping between its
native operators and machine operations and addressing modes.

For example, the pre- and post-increment operators are expressions with side
effects; they can be confusing and can easily provoke undefined behaviour.
Having two variants also seems redundant. But the two variants map to
different underlying CPU addressing modes, and leaving the pre- or post-
modified value in a register often falls out of that translation for free. If
the language only permitted ++ as a statement, the usefulness of the side
effect would be lost.
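
To make that concrete, here's the classic copy idiom. On ISAs with post-
increment addressing modes (the PDP-11, m68k, ARM's post-indexed loads), each
*s++ can become a single instruction that both moves a byte and bumps the
pointer as a side effect:

    /* Copy a NUL-terminated string. On a PDP-11 or m68k, or with ARM's
     * post-indexed "ldrb r2, [r1], #1", each *s++ maps onto one
     * instruction: load or store through the pointer, incrementing it
     * as a side effect. */
    void str_copy(char *dst, const char *src)
    {
        while ((*dst++ = *src++) != '\0')
            ;
    }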

It's a cliche that C is little more than portable assembler, but at a design
level there's a lot of truth to it. The length of the assembly translation of
C code is usually pretty directly proportional to the source code token
length. In more expressive languages with more powerful abstractions, that's
frequently not the case.

~~~
andrewcooke
not arguing with your general point, but the difference between pre and post
increment is very useful when writing things like stacks and queues with
pointers. these days maybe people use libraries or templates, but i suspect
they were introduced for ease of programming, not ease of translation.

[edit: really, it's two sides of the same coin. it's useful, so both c and
cisc chips have it, so the two match up...]

~~~
eru
> not arguing with your general point, but the difference between pre and post
> increment is very useful when writing things like stacks and queues with
> pointers.

Of course this only applies to ephemeral variants of those data structures.

------
amix
"Will a more static style JIT like Google's V8 be able to keep up?"

V8 compiles everything to assembler and does not do JIT. V8 seems to beat
TraceMonkey on both CPU and memory benchmarks (
[http://shootout.alioth.debian.org/u64/benchmark.php?test=all...](http://shootout.alioth.debian.org/u64/benchmark.php?test=all&lang=v8&lang2=tracemonkey)
), so I doubt their approach has "lost". They probably do this because most
JavaScript programs are very small and it's relatively cheap to compile
everything to assembler from the start. They do their optimizations at the
assembler level as well, instead of at the bytecode level as most other VMs
do. At least, that's the information I gathered when I last looked at V8 -
about a year ago, so things may have changed.

~~~
iskander
Just because V8 doesn't have an IR doesn't mean it's not a JIT. V8 still has
to generate assembly throughout the execution of a JavaScript program (to
create new hidden classes, patch inline caches, etc.). See:
<http://code.google.com/apis/v8/design.html>
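
Roughly how an inline cache works, as a hypothetical C sketch (not V8's
actual code; all names invented): a property-access site remembers the last
hidden class it saw and the slot offset it resolved, so repeated accesses on
same-shaped objects skip the lookup entirely.

    #include <stdio.h>
    #include <string.h>

    /* An object's hidden class ("shape") maps property names to slot
     * offsets; objects with the same layout share one shape. */
    typedef struct Shape { const char *names[4]; } Shape;
    typedef struct Object { const Shape *shape; double slots[4]; } Object;

    /* Slow path: search the shape's table (assumes the property exists). */
    static int lookup_offset(const Shape *s, const char *name)
    {
        int i = 0;
        while (strcmp(s->names[i], name) != 0)
            i++;
        return i;
    }

    /* One property-access site's cache: the last shape seen and the
     * offset it resolved to. */
    typedef struct { const Shape *shape; int offset; } InlineCache;

    static double get_prop(InlineCache *ic, const Object *o, const char *name)
    {
        if (o->shape == ic->shape)          /* guard: same hidden class? */
            return o->slots[ic->offset];    /* hit: one compare, one load */
        ic->shape = o->shape;               /* miss: "patch" the cache */
        ic->offset = lookup_offset(o->shape, name);
        return o->slots[ic->offset];
    }

    int main(void)
    {
        const Shape point = { { "x", "y" } };
        Object p = { &point, { 3.0, 4.0 } };
        InlineCache ic = { NULL, 0 };
        /* First call misses and fills the cache; the second hits. */
        printf("%g %g\n", get_prop(&ic, &p, "y"), get_prop(&ic, &p, "y"));
        return 0;
    }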

------
gizmo
I'd say that for languages such as Python or JavaScript, tracing JIT
compilers are the only viable approach. Because the languages allow anything,
you can't make any sensible predictions about types and call graphs at
compile time (people tried; it didn't work well). The incremental-JIT
approach (LuaJIT) has less information to work with, and it's not aware of
which parts of the program are bottlenecks, so it can't intelligently decide
which parts need optimization most.

The tracing JIT is such an obvious tactic that it should be part of any
dynamic language: measure which parts are slow, then optimize those parts
using the parameter type information gathered at runtime, falling back to the
original code on a type mismatch. The implementation is hairy, but
conceptually it's a no-brainer.
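
A minimal C sketch of the "measure which parts are slow" half (every name
here is invented for illustration; a real VM would record the trace and emit
machine code where this just swaps in a function pointer):

    #include <stdio.h>

    #define HOT_THRESHOLD 1000

    typedef struct Loop {
        unsigned hits;             /* how often this loop has run */
        int (*native)(int);        /* NULL until traced and compiled */
    } Loop;

    /* Stand-ins: in a real VM, native_body would be machine code emitted
     * from the recorded trace, and interp_body the bytecode interpreter. */
    static int native_body(int acc) { return acc + 1; }
    static int interp_body(int acc) { return acc + 1; }

    static int run_body(Loop *loop, int acc)
    {
        if (loop->native)                   /* hot: run the compiled trace */
            return loop->native(acc);
        if (++loop->hits == HOT_THRESHOLD)  /* loop just proved itself hot: */
            loop->native = native_body;     /* record + compile the trace */
        return interp_body(acc);            /* cold: keep interpreting */
    }

    int main(void)
    {
        Loop loop = { 0, NULL };
        int acc = 0;
        for (int i = 0; i < 5000; i++)
            acc = run_body(&loop, acc);
        printf("%d\n", acc);                /* 5000 either way */
        return 0;
    }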

I suspect that dynamic languages will evolve over the next 10 years to allow
for really efficient compilers, but for the current languages such as
Javascript I doubt we're ever going to see massive performance improvements.

~~~
beagle3
> The incremental-JIT approach (LuaJIT) has less information to work with, and
> it's not aware of which parts of the program are bottlenecks, so it can't
> intelligently decide which parts need optimization most.

You are probably thinking of LuaJIT 1.x; the recent LuaJIT 2 is a tracing
JIT, and it beats the hell out of every other dynamic language compiler, bar
none.

Mike Pall for president!

------
stcredzero
With modern IDEs, tracing JITs, and type inference, I'm surprised that someone
hasn't made the same lateral-thinking sidestep that Hennessy & Patterson made
for RISC. They offloaded much of the decoding work from the chip to the
compiler, yielding faster chips on a smaller die.

Perhaps the Go language is the start of such a sidestep. We should be able to
offload a lot of the work of type annotation to the IDE and compiler, yielding
a compiled language with the dynamic feel of an interpreted one. In fact, for
langs with clean runtime models like Lisp, we should be able to allow
intermediate states of partial annotation, and allow some interpreted
execution for fast prototyping. (But demand full annotation before
deployment.)

(EDIT: Yes, I was aware of Haskell when I posted this. Haskell. There I said
it!)

Is there a tool that uses a Bayesian technique for suggesting type annotation
that cannot be inferred? This would be tremendously useful. (And very
dangerous in the hands of the incompetent.)

~~~
jrockway
You have just described Haskell. Annotate types if you want, let the compiler
infer otherwise. Very dynamic feeling, but much safer. And you're right, it
can make for some very fast code.

~~~
cynicalkane
I've always gotten the impression that Haskell's laziness makes it hard to
reason about code performance. You can force high-performance code, but that
requires circumventing the beauty of the language. Has this problem been
solved?

~~~
jrockway
The "reasoning about performance" is more of a trap for the unwary than a
fundamental problem.

Example: you have a producer/consumer setup. The first thread puts "g x" into
a queue. The second thread removes it, applies "f", and puts the result into
a result queue. Then a result-processing thread prints those results to a
file.

You run this, and notice that it doesn't really take advantage of all cores.
It seems to run slower on 2 or 3 cores than it does on 1 core, even though "f"
and "g" are of roughly equal complexity and the application of each happens in
a different thread! WTF?!?

The problem is, of course, that lazy evaluation "helped you". You never needed
"g x" in the first thread, so it wasn't evaluated; the thunk was put in the
queue. When you applied "f (g x)" in the second thread, you still didn't need
the value of either computation, so it was never evaluated. You just put a
thunk in the result queue. Finally, in the result-printing thread, you did
need the result -- so all the work to evaluate it was done there, in a single
thread.

You realize this, switch to strict queues, and now your app is 2x faster.

(BTW, I simplified the producer/consumer model a bit. Normally you have one
thread feeding work into a queue, n threads reading from that queue, doing
work, and putting results into a result queue, and then a final result-
aggregator thread. The end result would be the same; all the work would be
done in that last thread. Slow!)
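
You can model the trap explicitly even in a strict language by making the
thunks visible as deferred calls. A toy single-threaded C sketch of the same
failure mode (names invented):

    #include <stdio.h>

    /* A lazy value modeled explicitly: a deferred call ("thunk"). */
    typedef struct Thunk {
        int (*fn)(int);
        int arg;
    } Thunk;

    static int g(int x) { return x * x; }         /* pretend: expensive */
    static int f(int x) { return x + 1; }         /* pretend: expensive */
    static int f_of_g(int x) { return f(g(x)); }

    int main(void)
    {
        Thunk queue[3];

        /* "Producer": enqueue f (g x) -- but no work happens here. */
        for (int i = 0; i < 3; i++)
            queue[i] = (Thunk){ f_of_g, i };

        /* "Result printer": forcing the thunks does ALL the work here,
         * in this one place -- the single-threaded analogue of one
         * thread ending up evaluating everything. */
        for (int i = 0; i < 3; i++)
            printf("%d\n", queue[i].fn(queue[i].arg));
        return 0;
    }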

------
mikek
Can somebody please explain what a tracing JIT is?

~~~
jrockway
The runtime watches the code run, and if it notices that a specific type is in
use in a certain code section, it rewrites the generated code to use that type
when possible.

For example, if you want to add all the integers from 1 to 2^31, the JIT will
eventually realize that the accumulator can be a register (with a machine int
in it) instead of a boxed JavaScript number. (If that overflows the machine
int, it will switch back.) Obviously addition is faster when it's one CPU
instruction with no memory loads, so this is a big speed win.

JIT in general means you generate native code at runtime instead of at compile
time.
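
A sketch of what the specialized code conceptually looks like, written as C
for illustration (real engines emit machine code directly, and
__builtin_add_overflow is gcc/clang-specific):

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { TAG_INT, TAG_DOUBLE } Tag;
    typedef struct { Tag tag; union { int32_t i; double d; } u; } Value;

    /* Original, generic path: what the interpreter would do anyway. */
    static Value generic_add(Value a, Value b)
    {
        double x = a.tag == TAG_INT ? a.u.i : a.u.d;
        double y = b.tag == TAG_INT ? b.u.i : b.u.d;
        return (Value){ TAG_DOUBLE, { .d = x + y } };
    }

    /* Compiled trace for "a + b", specialized on the assumption that
     * both operands were machine ints when the trace was recorded. The
     * guards check that assumption; if either fails, take a "side
     * exit" back to the generic path. */
    static Value traced_add(Value a, Value b)
    {
        int32_t r;
        if (a.tag == TAG_INT && b.tag == TAG_INT &&     /* type guard */
            !__builtin_add_overflow(a.u.i, b.u.i, &r))  /* overflow guard */
            return (Value){ TAG_INT, { .i = r } };
        return generic_add(a, b);                       /* side exit */
    }

    int main(void)
    {
        Value one = { TAG_INT, { .i = 1 } };
        Value big = { TAG_INT, { .i = INT32_MAX } };
        Value fast = traced_add(one, one);   /* stays on the fast path */
        Value slow = traced_add(big, one);   /* overflows: side exit */
        printf("%d %g\n", fast.u.i, slow.u.d);
        return 0;
    }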

~~~
stcredzero
Often, there's more information at runtime than compile time. This may even be
true at times for statically typed languages.

I think the static/dynamic type debate is just a vestige of the state of
interpreter/compiler technology in the 80's. I wish everyone would get over it
and introspect and debug their knee-jerk reactions so we can get on with the
_next_ stage of language implementation.

~~~
scott_s
I think you're only considering static typing's efficiency benefits. I'm
equally interested in its program-correctness benefits.

~~~
silentbicycle
Static typing and JIT compiling aren't mutually exclusive. You can do all the
static analysis, checking, whole-program optimization, etc. that's feasible at
compile time, compile to bytecode, and then do JIT compilation at runtime.

Edit: We appear to be in fierce agreement.

~~~
stcredzero
_Static typing and JIT compiling aren't mutually exclusive, though_

Static/Dynamic typing are no longer mutually exclusive either. We need to get
past this idea already.

(I mean we as a field, not present company necessarily.)

------
KirinDave
I'm confused. In the JavaScript space, it seems like V8 and SquirrelFish (or
is it Nitro? I prefer SF) have soundly trounced TraceMonkey. Last time I
checked, SquirrelFish isn't a tracing-JIT-centric implementation (nor is V8).
I've yet to see a trace-tree JIT JavaScript engine that hasn't been soundly
trounced by the more conventional approaches.

Perhaps it's different in the Lua world, but it seems to me like this post is
claiming the exact opposite of what the evidence suggests?

Don't get me wrong, here. I think trace trees are a very cool optimization
technique and surely have a place in the future of JIT compilers. I just don't
think they've “Won” at this point.

~~~
daeken
Not to mention that I don't think any pure approach will ever "win". To date,
we've used a lot of different compilation techniques, both for wholly
different use cases and together for different parts of compilation, wherever
one makes more sense than another. I don't see any reason why that won't
happen here; there's no reason this is an all-or-nothing proposition.

~~~
stcredzero
Smalltalk images are a clear demonstration that it doesn't have to be "all or
none." Given a clean runtime model, one should be able to switch the execution
engine on the fly, just as Smalltalk lets a Smalltalk app (the debugger) act
as a bytecode interpreter for the process being debugged, then switches back
to JITted machine code when you hit "run".

------
iskander
1) No. See V8.

2) Tracing JITs _might_ win for loop-centric languages. They miss some
serious opportunities for parallelization in languages with rich collection-
oriented operators (functional and array languages).
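
For example, a collection-level operator exposes every iteration as
independent work up front, so a compiler can parallelize or vectorize it
wholesale, no tracing needed. A C sketch, using OpenMP purely for
illustration:

    #include <stddef.h>

    /* "map square over xs" as one bulk operation: every iteration is
     * independent by construction, so it can be split across cores or
     * vectorized without any profiling or trace recording. */
    void map_square(const double *in, double *out, size_t n)
    {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            out[i] = in[i] * in[i];
    }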

