
Fast and slow if-statements: branch prediction in modern processors - adg001
http://igoro.com/archive/fast-and-slow-if-statements-branch-prediction-in-modern-processors/
======
InclinedPlane
This particular bit of information is worse than useless for the vast majority
of developers. For the parts-per-million development efforts where this sort
of optimization is necessary (a tiny subset even of kernel developers) there
will be many orders of magnitude more developers for whom this information is
actively harmful. Devs will spend their time trying to second-guess the
compiler's optimizations of their if statements in order to eke out
microseconds of performance improvement. Meanwhile, by focusing their efforts on
the wrong thing they will take attention away from quality of design and
execution as well as macro-optimizations. Their code will be lower quality
and, ironically, slower.

Micro-optimization is a silly diversion the vast majority of the time. Wait to
optimize at that level until you have the tooling and the measurements that
indicate you truly need it (more often than not you won't).

~~~
scott_s
Developers should know how the layers beneath them work. Sometimes that means
they have to be exposed to information that could be "dangerous" if they use
it in the wrong context. Such is the risk of education.

~~~
InclinedPlane
Sure. I'm not objecting to the information itself, I'm objecting to paying
more attention to it than is warranted (the amount that is warranted being so
close to 0.00 for the average developer that it's hardly worth mentioning).

The best-case scenario for this information is for the vast majority of
developers to ignore it, or to read about it and then do absolutely nothing
about "the problem".

Even in the rare case where you happen to be that one guy who's writing
kernel, library, or game code which needs to be heavily optimized at every
level this particular issue is extremely unlikely to be relevant. And even in
the super rare cases where it is relevant the given article doesn't provide
very good advice on improving performance (hint: for a lot of cases it'd be
better to rewrite your loop so you don't even have a conditional inside it).
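
A toy sketch of the kind of rewrite hinted at above: hoist the condition into
the loop structure itself, so there is nothing left to predict. Function names
here are mine, not from the article.

```c
#include <stddef.h>

/* Branchy version: tests a condition on every single iteration,
   handing the predictor an alternating taken/not-taken pattern. */
long sum_even_indices_branchy(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if ((i & 1) == 0)
            sum += a[i];
    }
    return sum;
}

/* Restructured version: the loop visits only the elements it needs,
   so the conditional disappears entirely. */
long sum_even_indices_strided(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i += 2)
        sum += a[i];
    return sum;
}
```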

In short, this article is silly trivia, nothing more, and roughly as useful to
the practice of programming as looking at pictures of cats with funny
captions, maybe less so.

P.S. The real risk here is that devs become overly focused on a "problem",
lose their situational awareness and ignore more serious problems they should
be concerned with. It's the sort of thing that causes pilots to crash planes
while they are distracted by a warning light they can't figure out:
<http://en.wikipedia.org/wiki/Eastern_Air_Lines_Flight_401>

~~~
nkurz
Maybe you'd like the message better if it was inverted: "Contrary to popular
wisdom, with modern processors there's very little penalty for conditionals
where the same branch is taken each time. It's only when the branches are
random that slowdowns occur, and even so it's usually not a big deal."
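
A minimal illustration of that inverted message (names are mine): the function
below does identical work regardless of how the input is ordered, but the
inner branch is near-perfectly predicted on sorted data and roughly random on
shuffled data, which is where the slowdown comes from.

```c
#include <stddef.h>

/* Counts elements >= threshold. The instruction stream and the result are
   the same for any ordering of the data; only the predictability of the
   inner branch changes. */
size_t count_at_least(const int *a, size_t n, int threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] >= threshold)  /* predictable on sorted input,
                                   ~random on shuffled input */
            count++;
    }
    return count;
}
```

On a large array, timing this on sorted versus shuffled copies of the same
data is an easy way to observe the effect directly.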

I disagree about the "one guy" theory. While it may be true that very few Ruby
programmers would benefit from this knowledge, if you are bothering to use C
or C++ you might as well take full advantage of the speed it has to offer. If
not, why not use a higher level language? I'd certainly want the author of my
interpreter to be fully aware of these issues:
<http://bugs.python.org/issue4753>

------
codesink
The Linux kernel exploits that by using two macros (likely() and unlikely())
that give hints to the compiler using GCC's __builtin_expect:

if (likely(condition)) dosomething();
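
For reference, the definitions are a thin wrapper over the builtin (simplified
from the kernel's include/linux/compiler.h; the `process` function below is a
made-up example, not kernel code):

```c
/* __builtin_expect is a GCC/Clang extension. It changes nothing
   semantically -- it only tells the compiler which outcome to lay out
   as the straight-line (hot) path in the generated machine code. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical usage: the error check is marked cold. */
int process(int fd) {
    if (unlikely(fd < 0))  /* error path moved off the hot path */
        return -1;
    return fd * 2;
}
```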

~~~
matthavener
even without likely/unlikely, branch prediction still occurs on processors
that support it. the kernel macros just optimize the first pass. once the
branch-prediction state machine has aligned itself toward the more likely
branch, accurate branch prediction will occur for the subsequent executions
(typically loop iterations or something like that)

------
jwegan
What the article fails to mention is that branch predictors in modern CPUs
are generally correct >95% of the time on real-world benchmarks (usually SPEC).

Furthermore, each CPU uses a different approach to branch prediction, so a
pattern that is bad for one will not be bad for another. Spending time
optimizing your code this way will only optimize it for a particular
processor, which won't help if you're trying to make a distributable binary.
(I.e., different x86 processors use different branch prediction algorithms;
you would basically be optimizing specifically for the Intel i7 or
specifically for the AMD Athlon 3.)

------
Zot95
Interesting article. It has been a long time since I have worked this "close
to the metal." Back when I coded at the machine level, the processor performed
no branch prediction to help keep the pipeline full. What you had were
different opcodes for your conditional jumps: one for when the jump was
likely, one for when it was not. It had amazed me (a bit) that this never got
reflected in higher-level languages. Now that I see that the processor makes
(educated?) guesses, I suppose I am no longer surprised.

------
warfangle
Would systems that use trace trees (e.g., TraceMonkey) have the same sort of
issues?

I have a feeling that they would notice repeated calls to the same if-
statement and optimize accordingly. They may still have the same sorts of
issues, but would the benchmark used (looping over the if-statement _n_
times) produce completely different results in a JIT system?

------
TrevorBurnham
I'm astonished at the magnitude of the difference correct prediction can make
in a scenario where the effect of a misprediction is to either increment a sum
or not. How is that possible? How many clock cycles does an accurate
prediction save in this context, and how many are used in moving through the
loop and evaluating the condition?
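
For scale: a mispredicted branch typically flushes the pipeline, costing
roughly on the order of 10-20 cycles, while a correctly predicted iteration of
a loop this trivial takes only a few cycles, which is why the swing is so
large. One common escape is to make the update branchless so there is nothing
to predict. A sketch (function names are mine):

```c
#include <stddef.h>

/* Branchy: the predictor must guess the outcome of the if for every
   element, and pays a pipeline flush on each miss. */
long sum_positive_branchy(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > 0)
            sum += a[i];
    }
    return sum;
}

/* Branchless: the comparison yields 0 or 1 and masks the addend;
   compilers typically emit straight-line code (conditional move or
   setcc) with no branch left to mispredict. */
long sum_positive_branchless(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * (a[i] > 0);
    return sum;
}
```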

------
malkia
Results from LuaJIT - <http://gist.github.com/413482>

D:\p\badif>..\luajit\src\luajit badif.lua

(i & 0x80000000) == 0 time: 1.057

(i & 0xFFFFFFFF) == 0 time: 0.696

(i & 0x00000001) == 0 time: 3.682

(i & 0x00000003) == 0 time: 5.339

(i & 0x00000002) == 0 time: 3.686

(i & 0x00000004) == 0 time: 3.683

(i & 0x00000008) == 0 time: 3.686

(i & 0x00000010) == 0 time: 3.856

------
j_baker
Just out of curiosity, how applicable is this to optimizing a higher-level
language like python or ruby?

~~~
jwegan
It is not applicable unless you plan on running your code on only a specific
CPU (and by specific CPU I don't mean that you plan on running only on x86, I
mean you only plan on running on the Intel i7).

GCC has macros that let you specify whether a branch is likely or unlikely so
the compiler will optimize the code for you based on which CPU you are
compiling for, but in higher-level languages you would have to do this by
hand. The performance gain from this optimization will be negligible compared
to what you'd get by spending that time optimizing something else.

~~~
nkurz
I think you are misinterpreting this. All branch-predicting CPUs will deal
better with basic and regular patterns: 100 trues in a row followed by 100
falses. It's only in how well they handle different ABAB-type patterns that
they differ. Modern and more expensive processors are better at detecting
more complex patterns.

The macros you refer to are an even more minor effect --- they only affect
what happens the first time through the loop. It's a tiny optimization of an
optimization. Branch prediction is a hardware feature, and is going to happen
whether you use these macros or not. It's the logic of your code that's going
to make the big difference, not the macros.

~~~
prodigal_erik
It's the first time through the loop after each time the loop is evicted from
the instruction cache. If the loop is short or has been unrolled heavily
(causing each branch to be predicted separately) it could matter.

~~~
nkurz
Is the Branch Target Buffer synonymous with the instruction cache[1]? But yes,
it certainly can matter. I thought the links that 'wendroid' offered above
were interesting: a Scheme interpreter, presumably already fairly well
optimized, saw about a 10% improvement through the hinting macros.

But I'd argue that if this level of optimization is required, the bigger
gains are to be had by changing your algorithms to become more regular: for
example, using fixed-length blocks in a compression scheme rather than
checking for null termination [2]. But if this is for some reason impossible,
then use hinting!
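
A minimal sketch of the regularity point (names are mine, not from the cited
paper): a sentinel-driven loop's exit branch depends on the data itself, while
a fixed-length loop's exit branch is just a counter the predictor handles
almost perfectly.

```c
#include <stddef.h>

/* Sentinel-driven scan (strlen-style): every byte is followed by a
   data-dependent "are we done?" branch the predictor must guess. */
size_t scan_until_zero(const char *src) {
    size_t i = 0;
    while (src[i] != '\0')
        i++;
    return i;
}

/* Fixed-length variant: the trip count is known up front, so the only
   branch is the loop counter, which is trivially predictable (and the
   loop is easier to unroll or vectorize). */
long checksum_block(const unsigned char *block, size_t len) {
    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += block[i];
    return sum;
}
```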

[1] <http://www.agner.org/optimize/microarchitecture.pdf>

[2] <http://www2008.org/papers/pdf/p387-zhangA.pdf>

~~~
prodigal_erik
Thanks, I didn't know that branch predictions are now cached separately from
instruction decodings. Hopefully evictions are a rarer problem.

