
Are Jump Tables Always Fastest? - luu
http://www.cipht.net/2017/10/03/are-jump-tables-always-fastest.html
======
ufo
> Threaded code should have better branch prediction behavior than a jump
> table with a single dispatch point

This is not the case anymore, at least for modern Intel processors. Starting
with the Haswell micro-architecture, the indirect branch predictor got much
better and a plain switch statement is just as fast as the "computed goto"
equivalent. Be wary of any references about this that are from before 2013.

For more info, I would recommend "Branch Prediction and the Performance of
Interpreters - Don't Trust Folklore", by Rohou, Swamy and Seznec. PDF link:
[https://hal.inria.fr/hal-01100647/document](https://hal.inria.fr/hal-01100647/document)

> although indirect branch prediction should help even the field.

Indeed :)

-----

Fun story: I experimented with adding indirect threading to the Lua
interpreter and was excited to find an improvement of up to 30% on some
selected microbenchmarks (running on my Ivy Bridge workstation). But the
improvement dropped to 0% when I tested it on a machine with a Haswell
processor. Measuring with perf indicated that the improved branch predictor
was indeed responsible for this.

~~~
haberman
That's really good to know. It's still annoying though that C semantics
apparently require there to be a bounds-check for every iteration of the
switch(): [https://eli.thegreenplace.net/2012/07/12/computed-goto-
for-e...](https://eli.thegreenplace.net/2012/07/12/computed-goto-for-
efficient-dispatch-tables) (see "Doing less per iteration")

I've tried using "default: __builtin_unreachable()" but this doesn't seem to
help.
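For reference, here is a minimal sketch (my own toy example, not code from the article) of the two dispatch styles under discussion. The labels-as-values form is a GCC/Clang extension:

```cpp
// With a plain switch, the compiler must assume the operand can fall
// outside the case range and may emit a bounds check before indexing
// the jump table; the computed-goto form indexes a label table
// directly and skips that check (GCC/Clang extension).
enum Op { OP_INC, OP_DEC, OP_HALT };

int run_switch(const unsigned char *code) {
    int acc = 0;
    for (;;) {
        switch (*code++) {
            case OP_INC:  ++acc; break;
            case OP_DEC:  --acc; break;
            case OP_HALT: return acc;
            default: __builtin_unreachable();  // hint; may not remove the check
        }
    }
}

int run_goto(const unsigned char *code) {
    // GCC extension: static table of label addresses.
    static const void *table[] = {&&inc, &&dec, &&halt};
    int acc = 0;
    goto *table[*code++];
inc:  ++acc; goto *table[*code++];
dec:  --acc; goto *table[*code++];
halt: return acc;
}
```

Both functions interpret the same bytecode; the difference is only in how each opcode is dispatched.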

~~~
phkahler
IIRC GCC is capable of dropping the check if it knows the possible range of
the switch variable. An example would be using a single byte and having all
256 possibilities exist in the switch. Another would be switch (x & 7) with 8
cases. I have not checked in a while, but I spent a lot of time on code with
this issue. It's at the core of my ray tracer of all things.
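A minimal sketch of the masking trick described above (my example, not phkahler's code): because `v & 7` provably lies in 0..7 and all eight cases are present, a compiler that tracks value ranges can see the switch is exhaustive and index the jump table without a preceding bounds check.

```cpp
// `v & 7` can only be 0..7, and all eight cases exist, so the range
// check before the jump table becomes provably dead.
int octant(unsigned v) {
    switch (v & 7) {
        case 0: return 10;
        case 1: return 11;
        case 2: return 12;
        case 3: return 13;
        case 4: return 14;
        case 5: return 15;
        case 6: return 16;
        case 7: return 17;
    }
    __builtin_unreachable();  // all cases covered above
}
```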

------
jcdavis
Ah yes, the classic "I can outsmart the compiler".

I've been down this exact rabbit hole before, but with Java. There are two
different bytecodes for representing a switch statement: tableswitch, which is
dense (has a case for every key from X to Y), and lookupswitch, which is
sparse. Of course the dense one must be better, I thought: O(1) vs. O(log n)!
Maybe if I added a few more cases manually to my switch statement to cover
missing holes, my lookupswitch would become a tableswitch and my hot loop
would be faster.

Turns out, of course, that not only is O(1) not necessarily any faster than
O(log n) when n is small and the constant factor is large (see this article),
but in fact it's irrelevant, since HotSpot uses the same function to
generate the IR for both bytecodes
([http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/b756e7a2ec...](http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/b756e7a2ec33/src/share/vm/opto/parse2.cpp#l512)),
and thus the decision about whether to use a jump table or binary search is
entirely unrelated to the bytecode that represents the switch statement :)

~~~
naasking
Not only that, but the O(log n) sparse table has multiple branch points that
are more predictable, so the branch predictor works better. It can be faster
overall than a single highly unpredictable branch point.
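As a rough illustration (hypothetical case values, not from the thread), a sparse switch is often lowered to a comparison tree like this, where each conditional branch gets its own predictor entry:

```cpp
// Hand-written equivalent of the comparison tree a compiler might emit
// for a sparse switch over {10, 25, 40, 99}. Each `if` is a separate
// conditional branch the predictor can learn independently, unlike a
// single indirect jump that must predict the full target address.
int classify(int tag) {
    if (tag < 40) {
        if (tag == 10) return 1;
        if (tag == 25) return 2;
    } else {
        if (tag == 40) return 3;
        if (tag == 99) return 4;
    }
    return 0;  // default case
}
```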

------
notacoward
One nice thing about a classic jump table is that you can change the function
pointers dynamically. That might seem like a questionable thing to do, but
it's pretty handy to intercept or wrap functions. Calling different functions
depending on state/context can be more efficient than calling one function
that has to check that same state/context on every single call, and it's
slightly easier to set breakpoints that way too. Forget about doing any of
that with the switch-statement or threaded versions. Unless you're writing
something that has to run billions of times per second, like an interpreter's
inner loop, the extra flexibility is worth it.
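A minimal sketch of that pattern (all names are illustrative): because the handlers live in a mutable table of function pointers, a wrapper can be patched in at runtime to intercept or trace a call, which a switch statement can't offer.

```cpp
#include <cstdio>

// Dispatch through a mutable table of function pointers. The table is
// data, so an entry can be swapped at runtime to wrap a handler.
static int handle_ping(int x) { return x + 1; }
static int handle_echo(int x) { return x; }

static int (*handlers[2])(int) = {handle_ping, handle_echo};

// A wrapper installed later, e.g. for tracing; also a convenient
// place to set a breakpoint.
static int traced_ping(int x) {
    std::printf("ping(%d)\n", x);
    return handle_ping(x);
}

int dispatch(int op, int x) { return handlers[op](x); }
```

Installing the wrapper is a single assignment: `handlers[0] = traced_ping;` and every subsequent `dispatch(0, ...)` goes through it.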

~~~
terminalcommand
On my first attempt of writing an IRC parser by following a specification at
an RFC, I wrote the whole tokenizer in a single switch statement. But as I
needed further branches I resorted to using nested switch statements. It
quickly became a mess.

Then I switched to calling separate functions within the switch statement for
tokens. If the char is a colon and state is prefix, call parsePrefix. If the
char is a whitespace, advance state etc.

I didn't know about jump tables back then, but relying on enums for states and
branching proved to be a good idea, e.g. if state is < 3, parseParameters(); if
state is < 2, parseMessage(), etc.

Using nested conditionals and loops in the same function on the other hand
proved to be a terrible idea.
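A rough sketch of that shape (hypothetical states and handlers, not the original parser): an enum tracks the state, and the switch stays flat, only routing each character to a small per-token function.

```cpp
// Flat switch over the input character; state lives in an enum and
// each token kind gets its own handler function.
enum State { PREFIX, COMMAND, PARAMS };

struct Parser {
    State state = PREFIX;
    int prefixes = 0, params = 0;
};

static void parsePrefix(Parser &p) { ++p.prefixes; }
static void parseParams(Parser &p) { ++p.params; }

void feed(Parser &p, char c) {
    switch (c) {
        case ':':
            // colon in the prefix state starts a prefix
            if (p.state == PREFIX) parsePrefix(p);
            break;
        case ' ':
            // whitespace advances the state machine
            if (p.state == PREFIX) p.state = COMMAND;
            else if (p.state == COMMAND) p.state = PARAMS;
            break;
        default:
            if (p.state == PARAMS) parseParams(p);
            break;
    }
}
```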

------
Lerc
Am I reading this right? The performance difference seems to be unaccounted
for in the data:

    
    
        Performance counter stats for './x86_64-binary 5000000' (5 runs):
    
            6,883,819,114      cycles                    #    2.090 GHz                      ( +-  0.43% )
              232,004,486      instructions              #    0.03  insns per cycle          ( +-  0.06% )
               56,828,213      branches                  #   17.257 M/sec                    ( +-  0.04% )
                1,262,892      branch-misses             #    2.22% of all branches          ( +-  0.05% )
    
              3.299025345 seconds time elapsed                                          ( +-  0.43% )
    
        Performance counter stats for './x86_64-vtable 5000000' (5 runs):
    
            7,709,225,443      cycles                    #    2.087 GHz                      ( +-  0.95% )
              217,283,422      instructions              #    0.03  insns per cycle          ( +-  0.03% )
               51,631,368      branches                  #   13.976 M/sec                    ( +-  0.03% )
                  957,553      branch-misses             #    1.85% of all branches          ( +-  0.10% )
    
             3.706410106 seconds time elapsed                                          ( +-  1.04% )
    

One would assume with all else being equal that code which ran fewer
instructions with fewer branches and a better branch prediction rate would be
faster.

Given that is not what we get, can we assume that all else was not equal?
Where did the cycles get used? Average cost of a branch miss? Cache misses?
Loading the pointers from the jump table with rep movsb?

~~~
tokenrove
I wouldn't scrutinize it too much as the benchmarking approach is wildly
inaccurate, but I'm curious about that too. I might investigate it later. (I
am the author of the post.) This was also run on a pretty ancient AMD machine,
which isn't terribly representative of modern branch prediction hardware.

------
alain94040
Obviously, a faster way would be to place your functions at predictable
addresses that could be computed with simple bit arithmetic. No double lookup
or conditional branches required.

~~~
firethief
Computed goto is also not always fastest. Someone benched it (I think it was
in the context of interpreter main loops); any of the 3 approaches can win
depending on arch and usage patterns.

------
jankotek
In Java, large jump tables prevent the JIT from compiling methods.

------
pechay
dispatch(-1)

~~~
siberianbear
Yes, I thought the exact same thing. The value being passed in is an int,
which can be negative. So, the statement "if (state > 4) abort();" isn't
enough of a guard.
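A common fix (a sketch of the bug and its repair, not code from the article) is to compare as unsigned, so a negative value wraps to a huge one and fails the same single comparison:

```cpp
// Guarding a table index. The signed comparison lets state = -1 slip
// through; casting to unsigned turns -1 into a huge value that the
// single comparison rejects.
bool valid_signed(int state)   { return !(state > 4); }          // -1 passes!
bool valid_unsigned(int state) { return (unsigned)state <= 4; }  // -1 rejected
```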

I managed a team of C/C++ programmers for many years in Silicon Valley. I
always encouraged my engineers to write clean code without fancy tricks. Fancy
tricks lead to bugs that take a lot of time to debug. If I had an engineer
write that dispatch() function with the vtable, I'd have beaten them with a
wet noodle until they promised never to do that again.

~~~
Const-me
C++ already has virtual tables built-in.

You still need to select the correct table for every incoming packet though.
But again C++ has idiomatic ways for that, e.g. std::unordered_map<uint8_t,
IPacketHandler* > for sparse values, or std::array<IPacketHandler* , n> for
dense zero-based values.

While this approach is slightly more complex (need to register handlers
somehow), debuggers are happy with that kind of dynamic dispatch because it's
standard OO C++. IMO over time it's more maintainable, e.g. it's trivial to
add another handling method, and the compiler will check that you implement
that for each protocol you support.
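A minimal sketch of that idea, reusing the IPacketHandler name from the comment (the remaining names are mine, for the dense zero-based case):

```cpp
#include <array>
#include <memory>

// Idiomatic C++ dispatch: the compiler builds and indexes the vtable,
// and adding a pure virtual method to IPacketHandler forces every
// handler class to implement it.
struct IPacketHandler {
    virtual ~IPacketHandler() = default;
    virtual int handle(int payload) = 0;
};

struct PingHandler : IPacketHandler {
    int handle(int payload) override { return payload + 1; }
};
struct EchoHandler : IPacketHandler {
    int handle(int payload) override { return payload; }
};

// Dense zero-based packet types: a flat array of registered handlers.
std::array<std::unique_ptr<IPacketHandler>, 2> handlers = {
    std::make_unique<PingHandler>(),
    std::make_unique<EchoHandler>(),
};

int dispatch(unsigned type, int payload) {
    return handlers.at(type)->handle(payload);  // .at() bounds-checks
}
```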

------
exabrial
Site is awful on mobile Chrome.

~~~
majewsky
Same on mobile Firefox. The text only uses 50% of the width of the narrow
screen, and quotes go down to 25%, so it's no more than two words (or a single
long one) per line.

