
Tracing JITs and modern CPUs - jsnell
https://github.com/lukego/blog/issues/5
======
krig
Anyone interested in learning more about tracing JITs and the problems
involved in implementing them should read the paper "Trace-based Just-in-time
Compilation for Haskell" by Thomas Schilling [1].

In it, he explains the details of implementing a basic VM and tracing JIT
based on LuaJIT, and deals with a lot of the issues involved. For example, the
choice of where to begin a trace and for how long to trace is crucial for
performance. Traces that are too long will rarely complete, and selecting
poorly and tracing something that won't actually be on the hot path has
significant cost. With poor trace selection, a tracing JIT can even be slower
than an optimized interpreter. Interestingly, the language itself also
influences the viability of tracing: a language with explicit loop
instructions is easier to trace, since any loop instruction is an intuitive
starting point, whereas a language that relies on recursion and tail-call
optimization (TCO) is less cooperative in this regard. One possibility is
rewriting recursive constructions to imperative loops in a pass prior to
trace selection.
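
To make the contrast concrete, here is a hedged C sketch of my own (not from
the paper): the loop form hands the tracer an obvious anchor, the backward
branch at the end of each iteration, while the tail-recursive form has no
such marker.

    /* Tail-recursive sum: no explicit loop instruction, so a tracer
       has no obvious place to anchor a trace. */
    int sum_rec(const int *a, int n, int acc) {
        if (n == 0) return acc;
        return sum_rec(a + 1, n - 1, acc + a[0]); /* tail call */
    }

    /* Loop form: the backward branch closing each iteration is an
       intuitive trace-start candidate. */
    int sum_loop(const int *a, int n) {
        int acc = 0;
        for (int i = 0; i < n; i++)
            acc += a[i];
        return acc;
    }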

Personally, after reading the paper I think there is the possibility for
amazing performance from tracing JITs, but the unpredictability and reliance
on heuristics make their practical value over method JITs or static
compilation questionable. It is similar to the complexities of GC
implementation: as performance gains are made, complexity shoots through the
roof while predictability suffers. There's no easy answer to that problem.

[1]:
[http://src.acm.org/2013/ThomasSchilling.pdf](http://src.acm.org/2013/ThomasSchilling.pdf)

~~~
mafribe
The Schilling paper is mostly a re-rendering of the original paper

    Dynamo: A Transparent Dynamic Optimization System [1]

by Bala et al., which invented (or rather, reinvented and popularised)
tracing JIT compilation. It might be worth reading the original. In addition,
Haskell is not such a great target for JITing, as it's statically typed,
leaving less scope for optimisation at run-time.

[1]
[https://www.cs.virginia.edu/kim/courses/cs771/papers/bala00d...](https://www.cs.virginia.edu/kim/courses/cs771/papers/bala00dynamo.pdf)

~~~
chrisseaton
> Haskell is not such a great target for JITing, as it's statically typed,
> leaving less scope for optimisation at run-time

I'd dispute this. Types are just one thing that you can speculatively
optimise for at runtime when it may not be practical to do so at compile
time. Other things include value ranges, tighter types than appear in the
source code, branches taken/not-taken, contended/non-contended shared
resources such as MVars and TVars, whether an Integer fits into a word or
not, etc.

In my group we're looking at using a JIT for C, where we can do things such as
inline through a function pointer by speculating that it is stable.
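
For illustration, here's a hedged C sketch of my own of what that speculation
could compile down to (the names are made up): guard on the observed target,
inline its body, and deoptimise to the plain indirect call if the guard
fails.

    typedef int (*op_fn)(int);

    int double_it(int x) { return x * 2; } /* the only target observed so far */

    int apply(op_fn f, int x) {
        if (f == double_it)   /* guard: the speculation still holds */
            return x * 2;     /* inlined body of double_it */
        return f(x);          /* slow path: fall back to the indirect call */
    }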

~~~
mafribe
I agree that there is scope for JITing in Haskell. But JITing is not without
cost, and in statically typed languages some of the big gains that make JITs
for JavaScript or Python so powerful go away.

If you are using C in a way that requires frequent invocation through a
function pointer in hot code, you are probably using an OO idiom, which means
casting, and thus circumventing C's typing.

~~~
sitkack
Dynamo traced actual machine instructions, not the source.

~~~
mafribe
Yes, Dynamo traces machine code. What language to trace is orthogonal to the
idea of tracing. Any language can be traced.

~~~
sitkack
My comment should have added something about the trace not having access to
type information in the language. During LuaJIT and V8 traces, the VM has
access to a higher-level form than what Dynamo was dealing with.

You were specifically talking about tracing a static vs. a dynamic language.
I agree: the opportunity for a speedup is larger in dynamic languages because
there is more code bloat for dispatching on object type, whereas in Haskell
that should have been elided during compilation, so the Haskell code is much
tighter to begin with. The speedups seen for tracing Haskell should therefore
be similar to the gains the Dynamo team saw when they traced native
executables (most were generated from C, afaik). Dynamo didn't have type
information other than what it could extract from the assembly.

I'll try and make a more complete post next time.

~~~
mafribe
I don't know the innards of the Haskell compiler well enough, but, as
chrisseaton writes above, tracing Haskell could lead to speedups better than
Dynamo's. For example, data for generic functions in hot code could be
unboxed, and laziness could perhaps be 'switched off' locally, etc. But the
easy gains that you see for dynamically typed languages are, as we agree,
unlikely to be achievable. Nevertheless, SPJ has suggested that the
PyPy/RPython people write a meta-tracer for Haskell, just to be able to
measure what's possible.

As to Dynamo, I think some of the key gains came from the ability to shortcut
jumps to jumps to jumps to ... that could arise from linking separately
compiled code. However, modern CPUs diminish the penalty such frivolous
jumping incurs, so Dynamo became less and less competitive with tracing of
higher-level languages. (BTW, Dynamo can trace anything, not just C
executables.)

------
pjc50
We have known for a long time that the worst performance comes from iterating
a "fetch a value from memory -> make a decision based on the result about
what to fetch next" pattern. Unfortunately, almost everything that can be
described as business logic looks like that. It's also common in
object-orientated systems.
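
In miniature (a hypothetical C example of mine, not from the post): each load
both decides the branch and supplies the next address, so the CPU can neither
predict nor prefetch very far ahead.

    struct node { int key; struct node *next; };

    /* Classic serial pointer chase: the branch and the next load both
       depend on the value just fetched from memory. */
    struct node *find(struct node *head, int key) {
        while (head != NULL) {
            if (head->key == key)
                return head;
            head = head->next; /* this fetch decides what to fetch next */
        }
        return NULL;
    }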

~~~
geocar
Only if you program with atoms. If you program with arrays, then in my
experience very few problems look that way.

Array programming also keeps absolute program size small, which means your
program isn't running out of main memory but out of cache memory, which by
itself produces anywhere from 10-100x speedups for business software!
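
For instance, a made-up C sketch of the contrast (the names are
hypothetical): instead of visiting records one at a time and branching on
each, operate on whole columns in a single pass.

    struct rec { double amount; int overdue; };

    /* Atom style: one record at a time, a data-dependent branch each. */
    double total_due_atoms(const struct rec *r, int n) {
        double t = 0;
        for (int i = 0; i < n; i++)
            if (r[i].overdue)          /* unpredictable per-record branch */
                t += r[i].amount;
        return t;
    }

    /* Array style: whole-column arithmetic, no per-record branching. */
    double total_due_arrays(const double *amount, const char *overdue, int n) {
        double t = 0;
        for (int i = 0; i < n; i++)
            t += amount[i] * overdue[i]; /* overdue[i] is 0 or 1 */
        return t;
    }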

I don't know if I fully understand what you mean by object-orientated.

~~~
pjc50
I don't suppose you could elaborate on what you mean by atoms vs arrays here?
I've not heard anyone say "atom" outside of a LISP context.

A fairly random example of OO "business logic":
[https://git.eclipse.org/c/b3/b3.git/tree/org.eclipse.b3.aggr...](https://git.eclipse.org/c/b3/b3.git/tree/org.eclipse.b3.aggregator.engine.maven/src/org/eclipse/b3/aggregator/engine/maven/InstallableUnitMapping.java)

(I went clicking through Eclipse because I knew I could count on it to have
lots and lots of this kind of abstraction-heavy bland Java. I'm not even
saying this particular bit of code is slow, it's just that there's a _lot_ of
OO code that looks vaguely like this.)

~~~
vvanders
I believe what he's referring to is also called AoS (Array of Structures) vs.
SoA (Structure of Arrays).

If you're iterating over one or two values in a struct via AoS, it's very
painful from a memory-caching standpoint, since you're only getting
sizeof(member) / sizeof(struct) efficiency.

In an SoA situation all the data for a member is tightly packed, so you
usually get much better cache utilization.
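
A minimal C illustration of the two layouts (hypothetical names, same
sizeof(member) / sizeof(struct) ratio as above):

    /* AoS: summing just `x` drags the whole 16-byte struct through
       the cache, so only 4 of every 16 fetched bytes are useful. */
    struct particle { float x, y, z, mass; };

    float sum_x_aos(const struct particle *p, int n) {
        float s = 0;
        for (int i = 0; i < n; i++)
            s += p[i].x;
        return s;
    }

    /* SoA: each member lives in its own packed array, so a pass over
       `x` streams contiguous memory and every fetched byte is used. */
    struct particles { float *x, *y, *z, *mass; };

    float sum_x_soa(const struct particles *p, int n) {
        float s = 0;
        for (int i = 0; i < n; i++)
            s += p->x[i];
        return s;
    }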

That said, the best approach is actually looking at your data access patterns
and packing your data appropriately. Unfortunately, some languages don't have
value types (most managed languages except C#). This is why C/C++ is usually
faster than managed languages: they can get close on generated code but can
be off by a factor of 50x in memory access performance.

~~~
pron
> This is why C/C++ is usually faster than managed languages, they can get
> close on generated code but can be off by a factor of 50x in memory access
> performance.

This is a common misconception that is a result of Java's current state.
Control over memory layout and garbage collection are two _completely
orthogonal_ issues. At the moment, Java just happens to be both GCed _and_ to
afford little control over memory layout. Currently, the HotSpot team
(HotSpot is the name of OpenJDK's JVM) is working on Project Valhalla and
Project Panama, two efforts scheduled for Java 10 and intended to give the
JVM all the memory-layout control you need.

~~~
vvanders
Not completely.

Most GCs want to be able to compact the heap, which means they lean in favor
of reference-based models by default.

The two aren't mutually exclusive, as you mention, but to my knowledge C# is
the only managed language with explicit guarantees. I was aware of the effort
around Java's value types, but how long until we can reasonably see this in
production?

You can do this in Java today, but it involves lots of nasty ByteBuffer
manipulation (which FlatBuffers is fantastic for). You'll still pay for the
conversions from bytes to the actual types you want.

If you're dealing with these kinds of performance problems, it's best to
treat them in a language that's well suited to them. There's nothing wrong
with using Java for higher-level business logic and delegating the heavy
lifting to a stack that's designed for it.

~~~
pron
> Most GCs want to be able to compact the heap, that means they lean in favor
> of reference based models by default.

I don't see how this follows. Copying GCs do compact the heap, but when you
use values you're basically saying "these two things go together". You're only
making life easier for the GC (well, you're also creating larger objects, but
that might just require a small adjustment of the GC strategy).

> how long till we can reasonably see this in production?

Four years probably... Still, that doesn't change the fact that layout and GCs
are orthogonal.

------
bluetomcat
He mentions that the demands of tracing JITs and current Intel CPUs are well
aligned. I wonder, though: how well do they play with the instruction cache?
Generating little chunks of code on demand and then executing them seems like
a recipe for instruction-cache misses.

------
talles
This is the first time I've seen someone blogging with GitHub issues and,
surprisingly, it looks like it works well.

You get Markdown, tags, and comments, plus an index (README.md) if you need
one.

~~~
qznc
Also "follow" and "kudos", named "watch" and "star". Getting a feed of the
blog requires filtering, though.

------
haberman
I think the article is making out trace compilers to be far more limited than
they actually are.

> Consider these optimization tips: Avoid nested loops.

I don't think this is true at all. The article is arguing that because a
trace only contains one loop, nested loops can't be trace-compiled.

As I understand it, nested loops are totally possible to trace-compile by
having multiple traces that link to each other. See what Mike Pall wrote
about LuaJIT
([http://www.freelists.org/post/luajit/How-does-LuaJITs-trace-compiler-work,1](http://www.freelists.org/post/luajit/How-does-LuaJITs-trace-compiler-work,1)):

"If the inner loop has a low iteration count, it'll be unrolled and inlined.
For most loop nestings there'll be a single trace for the outer loop that goes
'around' the inner loop and joins its exit with its entry. Otherwise traces
link to each other in linear pieces."

I would take what this article says with a heavy grain of salt. I don't think
that tracing JITs are at all like "programming in a straitjacket."

Look at the LuaJIT performance optimization guide. There is nothing in there
about avoiding nested loops, avoiding tiny loops, or avoiding loops with
unpredictable iteration counts. It does, however, recommend reducing
unpredictable branches:
[http://wiki.luajit.org/Numerical-Computing-Performance-Guide](http://wiki.luajit.org/Numerical-Computing-Performance-Guide)

~~~
mrottenkolber
> I think the article is making out trace compilers to be far more limited
> than they actually are.

Luke (the post's author) and others are building a high-performance software
Ethernet switch [1] using LuaJIT. Let me assure you that these considerations
actually come from experience rather than speculation.

> As I understand it, nested loops are totally possible to trace-compile by
> having multiple traces that link to each other.

Sometimes it works, sometimes it doesn't. We can't really expect everybody
using Snabb Switch to understand the internals of LuaJIT, so restrictive
guidelines are useful for us.

Disclaimer: Consulting for Snabb GmbH

[1]:
[https://github.com/snabbco/snabbswitch](https://github.com/snabbco/snabbswitch)

~~~
losername
>
> [https://github.com/snabbco/snabbswitch](https://github.com/snabbco/snabbswitch)

fixed

------
TheLoneWolfling
Two semi-tangential ideas:

I wish it were possible to do things along the lines of "fetch this data from
RAM and at the same time start recomputing it, whichever is faster".

And also, I wish there were a "branch upcoming / branch execute" split. In
other words, instead of just "here's a branch", it's more along the lines of
"you will have to branch on <x>. <other instructions> now branch on <x>."
Effectively, a variable number of branch-delay slots. It wouldn't help for
straight pointer-chasing, but in other situations it could.
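
The closest existing analogue I know of is a software prefetch, which at
least lets you announce an upcoming memory access early so the fetch overlaps
other work. A hedged C sketch of mine using GCC/Clang's __builtin_prefetch
(it hints the data side only; there's no equivalent for announcing the branch
itself):

    struct entry { long key; };

    long sum_keys(struct entry **table, int n) {
        long s = 0;
        for (int i = 0; i < n; i++) {
            if (i + 1 < n)
                __builtin_prefetch(table[i + 1]); /* announce the next access early */
            s += table[i]->key;                   /* work on the current entry */
        }
        return s;
    }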

