
Branch Prediction and the Performance of Interpreters – Don’t Trust Folklore - nkurz
https://hal.inria.fr/hal-01100647/document
======
rurban
This research misses a few important points.

First, the fastest interpreters are never switch or "jump threading" (i.e.
computed goto) based. They are either based on passing the next pointer from
one opcode to the next (e.g. perl5) or on an opcode array (lua). Those two
cache much better and don't need the dispatch overhead of a jump table at
all. That's not even counting asm-tuned interpreter dispatch, which is the best
of the best.
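
Roughly, the next-pointer style looks like this (a toy sketch of my own, not
perl5's actual data structures): each op carries a function pointer and a link
to its successor, so the run loop has no bytecode fetch and no dispatch table
at all.

    /* toy "next pointer" dispatch: each op returns its successor */
    typedef struct op op_t;
    struct op {
        op_t *(*fn)(op_t *self);   /* executes the op, returns the next one */
        op_t *next;                /* pre-linked successor */
        long  arg;                 /* immediate operand, if any */
    };
    
    long acc;                      /* toy accumulator */
    
    op_t *op_add(op_t *self) { acc += self->arg; return self->next; }
    op_t *op_end(op_t *self) { (void)self; return 0; }
    
    void run(op_t *start) {
        for (op_t *pc = start; pc; pc = pc->fn(pc))
            ;                      /* one indirect call per op, nothing else */
    }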

Second, the famous Ertl paper they cite mostly suggested using smaller opcodes
to reduce icache pressure. Lua and LuaJIT do this with one-word ops and blow
all other interpreters away.
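
As a rough illustration of a one-word encoding (field widths here are only
approximate, in the spirit of Lua's iABC layout but not its exact format), the
opcode and its operands are packed into a single 32-bit word, so one fetch
pulls in the whole instruction:

    /* illustrative one-word instruction encoding (approximate field widths) */
    #include <stdint.h>
    
    typedef uint32_t inst_t;
    
    #define OPCODE(i)  ((i) & 0x3f)           /* low 6 bits: opcode  */
    #define ARG_A(i)   (((i) >> 6)  & 0xff)   /* 8 bits: destination */
    #define ARG_B(i)   (((i) >> 14) & 0x1ff)  /* 9 bits: operand B   */
    #define ARG_C(i)   (((i) >> 23) & 0x1ff)  /* 9 bits: operand C   */
    
    static inline inst_t make_inst(uint32_t op, uint32_t a, uint32_t b, uint32_t c) {
        return op | (a << 6) | (b << 14) | (c << 23);
    }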

Using the computed goto extension only helps by bypassing the out-of-bounds
check at the start of the switch dispatch, which gains the famous 2-3%, and
less on more modern CPUs. If you are going to talk about the differences, you
should explain where they come from.

Summary: This research result is not useful for state-of-the-art interpreters,
only for conventional slow interpreters. And for those the result should be
ignored anyway, as much better interpreter speedup techniques exist that are
not noted in this paper.

~~~
loup-vaillant
> _passing the next pointer from one opcode to the next (e.g. perl5)_

> _using an opcode array (lua)_

Would you perchance have a couple links explaining what those are? A quick
search on my end didn't turn up anything useful.

------
umanwizard
tl;dr for those who don't want to read the paper in detail:

* Interpreters are conceptually a big switch(bytecode) statement in an infinite loop. They use various tricks to make it easier for the processor to predict the target of the next indirect branch

* This previously had a dramatic impact on performance, because branch mispredictions are very expensive and it was difficult for the processor to predict them accurately.

* Recent Intel architectures have improved branch prediction, so the aforementioned tricks no longer have a very big impact (on Haswell x86 processors specifically)

~~~
acqq
> various tricks

They actually present (approximately) a very simple construct that is used to
speed the interpreters up:

    
    
       void *labels[] = { &&ADD, &&SUB . . . };
       ...
       goto *labels[ *vpc++ ];
       ADD:
          ...
          goto *labels[ *vpc++ ];
    

instead of

    
    
        ...
        switch ( *vpc++ )
        {
        case ADD:
             ....
             break;
        case SUB:
    

It's not really hard. The only problem is that the former is not standard C
and is not supported by MSVC. See

[http://stackoverflow.com/questions/6421433/address-of-labels...](http://stackoverflow.com/questions/6421433/address-of-labels-msvc)

"Erlang does (that) for building on Windows. They use MSVC for most of the
build, and then GCC for one file to make use of the labels-as-values
extension. The resulting object code is then hacked to be made compatible with
the MSVC linker."

It obviously mattered enough for the Erlang developers to use GCC even for
just that critical piece of C code. That doesn't mean it has to be used
everywhere (not every switch is executed the way the main interpreter loop
is).

------
bch
Would an editor kindly put a note in the OP link that this is a pdf?

------
acqq
If my coworker came to me and showed me results where he measured that the
"Labels as Values" implementation is not needed because "on his Haswell CPU
the speedup is only 3%", I'd just ask "can you guarantee that the code you
propose to revert to a plain switch will run _only_ on that CPU?"

If not, and the switch is in a performance-sensitive place, I'd keep the
"Labels as Values" implementation.

There are many more CPUs in the world than "the latest Intel x86 CPU." Kudos
to the developers of Haswell, but don't blindly trust the Inria researchers
who tell you "Don’t Trust Folklore."

I also don't like that this research was "partially supported" by a European
grant described as follows: "ERC Advanced Grants allow exceptional established
research leaders of any nationality and any age to pursue ground-breaking,
high-risk projects that open new directions in their respective research
fields or other domains." (1)

1) [http://erc.europa.eu/advanced-grants](http://erc.europa.eu/advanced-grants)

Is it ground-breaking, high-risk research to discover that the algorithm in
the latest Intel processor matches the one other researchers (whom I respect
much more) discovered in 2006, a variant of which won several "branch
prediction championships"? I'd say no.

At least they honestly acknowledge that the given algorithm is already known
to be highly efficient.

~~~
vardump
While I somewhat agree with you, the real answer is _assume nothing and
profile_. Always. But there's a catch: things are getting a bit out of hand;
there are just too many architectures and configurations.

I think modern consumer-oriented code should run well on at least Intel
Nehalem - Skylake, AMD K10 - K12, ARM Cortex A7/A8/A9/A15/A53/A57/A72, and
Qualcomm Krait/Kryo.

Focus on 64-bit, but some attention should still be paid to 32-bit
performance.

The number of actual targets is even worse, because cache and memory
architectures vary so much. At least all of those have 64-byte cache lines.

In high-performance code, there's often a choice between memory and compute
load balancing. Which memory access patterns and layouts are you going to use;
where do you want your bottlenecks? You can often reduce bandwidth
requirements by computing more, and vice versa.

Take Sandy Bridge for example:

While L1 and L2 cache sizes are fixed on Sandy Bridge, L3 cache varies
between 1 and 20 MB.

Clock speeds vary between 1.0 and 3.6 GHz.

There are 1-4 memory channels; I believe they're usually interleaved round
robin at 64-byte granularity. Say your access pattern is to touch every second
64-byte cache line vs. every line. On 1- and 3-channel systems, both perform
about the same. When skipping every other 64-byte line on a 2-channel system,
effective memory bandwidth is equal to that of a 1-channel system, and a
4-channel system will behave like a 2-channel one. The same code might be
bandwidth-bound on a 1-2 channel system and compute-bound on a 3-4 channel
system.
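
To make that concrete, here's a toy sketch (my own example) of the two access
patterns, touching every 64-byte line vs. every second line; which one wins
depends on how the controller interleaves lines across channels:

    /* toy sketch: every 64-byte line vs. every second line */
    #include <stddef.h>
    #include <stdint.h>
    
    enum { LINE = 64 };
    
    uint64_t touch_every_line(const uint8_t *buf, size_t bytes) {
        uint64_t sum = 0;
        for (size_t i = 0; i < bytes; i += LINE)       /* every line */
            sum += buf[i];
        return sum;
    }
    
    uint64_t touch_every_other_line(const uint8_t *buf, size_t bytes) {
        uint64_t sum = 0;
        for (size_t i = 0; i < bytes; i += 2 * LINE)   /* skip alternate lines */
            sum += buf[i];
        return sum;
    }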

Then you're going to hit a DRAM page miss every... 1-8 kB? System dependent,
of course.

To make the best choice for Sandy Bridge based systems, among other things
you need to know the number of memory channels and their arrangement, the
amount of L3 cache, and the clock speed. On some Sandy Bridge CPUs you have a
lot of time to do complicated data packing and unpacking just to keep memory
bandwidth usage down.

That was just one architecture. Now add all the other platforms into the mix.
There are quite a few different instruction set extensions; if you don't use
them properly, you can lose an order of magnitude of performance.

I think we'll need runtime code generation in the future: on-system JIT/AOT,
also for traditionally compiled languages such as C and C++.

~~~
astrobe_
The fact that the performance of an implementation varies significantly
depending on the actual CPU could already be deduced from some of the Ertl
papers.

Moreover, the fact that one uses a bytecode interpreter rather than a JIT is
generally a sign that the interpreter is meant to be used on multiple
platforms. "Optimizing" in this context means (almost) nothing.

The obvious conclusion is to not care too much about raw performance and focus
instead on ease of interfacing with libraries, and to make it easy to add
primitives/instructions to your VM. With the idea that if your interpreter is
too slow, you just rewrite the critical parts in native code.

~~~
vardump
> you just rewrite the critical parts in native code.

Rewrite critical parts of the interpreter in native code? My point was that
there's no optimal native code anymore.

Pure interpreters are falling out of fashion anyway. For JITted systems,
there's a significant cost to calling native code, at least until JITs
actually inline natively called code, possibly even across a dynamic library
(.so, .dylib, .dll, etc.) call.

We need a lightweight, thin profile guided JIT/AOT engine. No standard
library, memory management agnostic. Something that can target different
architectures, memory & cache configurations and instruction set extensions
that may have been unknown at design phase. A compilation target for some
C/C++/Rust/etc. and a runtime that takes care of the architecture details.

~~~
astrobe_
> Rewrite critical parts of interpreter in native code? My point was there's
> no optimal native code anymore.

No, I mean rewrite the critical parts of the "interpreted" program as
interpreter primitives/instructions: the best way to eliminate interpreter
overhead is to merge N primitives into one.
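
For example, a minimal sketch with made-up opcodes: a fused ADD_CONST
instruction does the work of PUSH_CONST followed by ADD under a single
dispatch, halving the interpreter overhead for that pair.

    /* toy stack VM: ADD_CONST is a merged PUSH_CONST + ADD */
    #include <stdio.h>
    
    enum { PUSH_CONST, ADD, ADD_CONST, HALT };
    
    int main(void) {
        int code[] = { PUSH_CONST, 1, PUSH_CONST, 2, ADD,  /* 1 + 2, two dispatches */
                       ADD_CONST, 4,                       /* + 4, one dispatch     */
                       HALT };
        int stack[16], *sp = stack, *vpc = code;
    
        for (;;) {
            switch (*vpc++) {
            case PUSH_CONST: *sp++ = *vpc++;            break;
            case ADD:        sp -= 1; sp[-1] += sp[0];  break;
            case ADD_CONST:  sp[-1] += *vpc++;          break;  /* fused op */
            case HALT:       printf("%d\n", sp[-1]);    return 0;
            }
        }
    }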

But I realise we have different perspectives. You seem to consider interpreter
VMs as off-the-shelf components, whereas I consider VMs custom components
because I'm working mainly with embedded systems. In the embedded world, JIT
is often simply not an option, either because it's not available for your
particular target or because the resource constraints don't allow it. But you
still sometimes want to trade resources for ease of development and/or
flexibility.

~~~
vardump
Right, I also work in embedded (without an OS), so I'm familiar with the
limitations. However, there are situations where it's tempting to write a very
primitive JIT, or at least glue pieces of code together in a buffer and
execute it, to get good inner-loop performance by using as few instructions as
possible.
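
Something along these lines (a very rough sketch; x86-64 and POSIX mmap
assumed, no error handling or W^X care): copy a machine-code template into an
executable buffer and call it.

    /* rough sketch: execute a machine-code template from a buffer (x86-64) */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    
    int main(void) {
        /* template: mov eax, 42 ; ret */
        unsigned char tmpl[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };
    
        void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;
        memcpy(buf, tmpl, sizeof tmpl);
    
        int (*fn)(void) = (int (*)(void))buf;
        printf("%d\n", fn());          /* prints 42 */
        return 0;
    }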

So far C has been enough. Achieving a low (preferably zero) defect rate in C
is challenging enough.

Usually FPGAs can take care of the performance-critical parts. They do
increase total cost, though.

------
chrisdevereux
Is there a reason that machine languages don't allow the
programmer/compiler/runtime to explicitly control the CPU's cache &
instruction pipeline?

Presumably they have access to much better information about the code's
intent and future behaviour than the CPU does.

~~~
cfallin
In addition to what others have said (mainly, it would be specific to a
particular version of the CPU, which is probably the biggest thing -- imagine
if no x86 binary older than ~1 year could run on your chip), one more point:

The programmer doesn't _necessarily_ have better knowledge than the CPU,
because the programmer can't see or respond to runtime behavior. There are a
lot of cases where behavior is input-data-dependent, or simply too complicated
to reason about analytically a-priori. The big wins in computer architecture
over the past two decades have all been in mechanisms that adapt dynamically
in some way: dynamic instruction scheduling (out-of-order), branch prediction
based on history/context, all sorts of fancy cache eviction/replacement
heuristics, etc.

Itanium tried "VLIW + compiler smarts" and the takeaway, I think, was that it
was just too hard to build good enough static analysis to beat a conventional
out-of-order CPU.

~~~
sklogic
The problem with OoO is that it's way too expensive (in terms of power and
area) in many cases. You won't ever see OoO in GPUs or DSPs, and it's unlikely
in microcontrollers. So VLIW and the other "stupid core, smart compiler"
approaches are still legitimate and will always remain valuable.

~~~
cfallin
Yup, in some domains it definitely still makes sense. GPUs work well for
highly-data-parallel applications (they're essentially vector machines, modulo
branch divergence) and VLIW-style DSPs work because the code is numerical and
easy to schedule at compile time.

I've worked mostly in the "need performance as high as possible for general-
purpose code" domain, so I may be biased!

------
caf
Is it really true that "Jump threading, though, cannot be implemented in
standard C."? Surely, each computed goto:

    
    
      goto *labels[*vpc++];
    

can be replaced with a full copy of the switch:

    
    
      switch (*vpc++) { case OP_ADD: goto ADD; case OP_SUB: goto SUB; }
    

...and hopefully the compiler will replace the entries for the cases in the
switch's jump table with the goto targets.
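
For instance, a tiny complete sketch of that idea (opcode names made up),
where each handler ends with its own copy of the switch, so each opcode still
gets its own dispatch point in standard C:

    /* standard-C jump threading via a replicated switch */
    #include <stdio.h>
    
    enum { OP_ADD, OP_SUB, OP_HALT };
    
    #define DISPATCH() switch (*vpc++) { \
            case OP_ADD:  goto ADD;      \
            case OP_SUB:  goto SUB;      \
            case OP_HALT: goto HALT; }
    
    int main(void) {
        int code[] = { OP_ADD, OP_ADD, OP_SUB, OP_HALT };
        int *vpc = code, acc = 0;
    
        DISPATCH();
    ADD:  acc += 2; DISPATCH();
    SUB:  acc -= 1; DISPATCH();
    HALT: printf("%d\n", acc);          /* prints 3 */
        return 0;
    }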

------
nly
I wonder if indirect calls have also improved in the latest generations?
Anything vtable-based would benefit.

~~~
loup-vaillant
Considering that vtables are basically the same thing as a jump table in a
switch, they have most probably improved in the same way.

------
sklogic
The article is missing one important difference between a primitive Python
use of jump threading and the OCaml bytecode interpreter, which _pre-compiles_
the bytecode into threaded code. The difference in performance between these
two is huge.

~~~
loup-vaillant
> _the OCaml bytecode interpreter […] pre-compiles the bytecode into a
> threaded code_

Wait a minute, how is that even possible? Or rather, what do you mean?

Here's my layman's view of things: the bytecode is basically a string of
opcodes. The interpreter has to look up the opcode before executing it
somehow. And it can be threaded, or use an ordinary switch, or perform some
other bizarre optimization I don't understand.

But what does it mean for the bytecode _itself_ to be compiled into a threaded
form?

~~~
sklogic
> Wait a minute, how is that even possible? Or rather, what do you mean?

It means that the stream of bytecodes (32-bit numbers in OCaml) is replaced
once with the corresponding label addresses (on 64-bit hosts, addresses minus
a fixed offset). Then the execution is trivial. This is called "indirect
threaded code" [1] and is used a lot in Forth implementations, for example.

See the definition of the "Next" macro in interp.c [2] for details.

[1]
[https://en.wikipedia.org/wiki/Threaded_code#Indirect_threadi...](https://en.wikipedia.org/wiki/Threaded_code#Indirect_threading)

[2]
[https://github.com/ocaml/ocaml/blob/trunk/byterun/interp.c](https://github.com/ocaml/ocaml/blob/trunk/byterun/interp.c)

EDIT: if you wonder where the actual threading is done, see
[https://github.com/ocaml/ocaml/blob/trunk/byterun/fix_code.c](https://github.com/ocaml/ocaml/blob/trunk/byterun/fix_code.c)

~~~
loup-vaillant
OK, I'm rephrasing to make sure I understand.

In the ordinary, C compliant switch code, you would do this:

    
    
      int instructions[] = { /* bytecode */ };
    
      // main loop
      int* ip = instructions;
      while(1) {
        switch (*ip) {
        case 1: /* instruction 1 */ break;
        case 2: /* instruction 2 */ break;
        case 3: /* instruction 3 */ break;
        /* etc */
        }
        ip++; /* goto next instruction. beware jumps */
      }
    

With a jump threaded implementation, you would do this instead:

    
    
      int instructions[] = { /* bytecode */ };
    
      void* jump_table[] = {
        &&lbl1,
        &&lbl2,
        &&lbl3,
        /* etc */
      };
    
      // main loop
      int* ip = instructions;
      goto *jump_table[*ip];
      lbl1: /* instruction1 */ ip++; goto *jump_table[*ip];
      lbl2: /* instruction2 */ ip++; goto *jump_table[*ip];
      lbl3: /* instruction3 */ ip++; goto *jump_table[*ip];
      /* etc */
    

Which means, instead of having the compiler constructing a jump table under
the hood, I do this myself, and get the benefit of jumping from several
locations instead of just one. But I still look up that table. Now the
indirect threading you speak of:

    
    
      int instructions[] = { /* bytecode */ };
    
      void* jump_table[] = {
        &&lbl1,
        &&lbl2,
        &&lbl3,
        /* etc */
      };
    
      // translating bytecode into addresses
      void** labels = malloc(sizeof(void*) * nb_instructions);
      for (size_t i = 0; i < nb_instructions; i++)
        labels[i] = jump_table[instructions[i]];
    
      // main loop
      void** lp = labels;  
      goto **lp;
      lbl1: /* instruction1 */ lp++; goto **lp;
      lbl2: /* instruction2 */ lp++; goto **lp;
      lbl3: /* instruction3 */ lp++; goto **lp;
      /* etc */
    

If I got that correctly, instead of accessing the jump table at some random
place, I only access the label table in a much more linear fashion, saving one
indirection and some memory accesses in the process, which should relieve some
pressure on the L1 cache.

Did I get that right?

~~~
sklogic
Exactly, you got it right. One indirection less, and no jump table in the
cache. And CPUs are very well optimised for this kind of indirect jump,
because of vtables.

There is also one step further: emit the actual jump instructions, making
direct threaded code. It is a bit more complicated and not very portable, but
if you want to squeeze out all the cycles you can, it's still an easy and fast
option before resorting to a full-scale compiler.

~~~
ehaliewicz2
I think direct threading is usually implemented just by reducing the
indirection another level,

e.g.

    
    
      goto *lp
    

rather than

    
    
      goto **lp
    

You could implement an interpreter with generated JMP instructions, which
might be called direct threading, but at least in the Forth world, where most
jumps are to nested user-defined subroutines, subroutine threading using JSR
instructions is typically used.

