
A history of branch prediction - darwhy
https://danluu.com/branch-prediction/
======
userbinator
The use of previous branch history and branch address as a "context" for
prediction reminds me of the very similar technique used for prediction in
arithmetic coding, as used in e.g. JBIG2, JPEG2000, etc. --- the idea being
that, if an event X has happened several times in context C, then whenever
context C occurs again, X is predicted to be more likely.
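
The shared idea, as a toy sketch in C (function names and sizes are made up
for illustration, and neither the coder nor the CPU works exactly like this):
keep a counter per context and predict whichever outcome has dominated that
context so far.

    
    
      /* toy per-context predictor: one signed counter per context */
      #define CONTEXTS 4096
      static int counter[CONTEXTS];   /* > 0 means "X has dominated here" */
      
      int predict(unsigned ctx) { return counter[ctx % CONTEXTS] > 0; }
      
      void update(unsigned ctx, int saw_x) {
        counter[ctx % CONTEXTS] += saw_x ? 1 : -1;
      }
    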

Also, since modern CPUs internally have many functional units to which
operations can be dispatched, I wonder whether, in the case that the
"confidence" of a branch prediction is not high, "splitting" the execution
stream and executing both branches in parallel until the outcome is known (or
one of the branches encounters another branch...) would yield much benefit
over predicting and then having to re-execute the other path if the
prediction is wrong. I.e. does it take longer to flush the pipeline and
restart on the other path at full rate, or to run both paths in parallel at
effectively 1/2 the rate until the outcome is known?

~~~
mjevans
You could do it, but 'work' produces heat.

From that point of view a branch predictor /saves/ you the heat of processing
the cases you /didn't/ need to execute.

An educated back-of-the-napkin guess at the performance per watt of such a
design would probably be enough to leave it on that napkin.

~~~
Simon_says
Pie-in-the-sky idea here, but only irreversible computations produce heat.
Maybe in the distant future we can make chips that do many parallel
computations of all branches reversibly, and only make the results
irreversible once the correct branch is known?

~~~
dfox
In theory, irreversible computations have to produce heat, while reversible
ones do not. In practice this is mostly irrelevant, because the heat involved
is so minuscule that there will always be other, significantly larger sources
of inefficiency involved. Also, it is somewhat questionable whether one could
actually construct a physical realization of a useful logic primitive that is
truly reversible.

------
ramshorns
Very informative. I missed the part about 1500000 BC though – a time when our
ancestors lived in the branches of trees?

Another beginner-friendly explanation of the effects of branch prediction is
this Stack Overflow post which compares a processor to a train:
[https://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array](https://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array)

~~~
eberkund
Maybe it is just an arbitrarily long time in the past. Since there were no
computers then, there is no history related to branch prediction, so the joke
is that it makes the history sound much longer-spanning than it actually is.

------
ufo
One surprising thing that I discovered recently is that after Haswell, Intel
processors got much, much better at predicting "interpreter loops", which are
basically a while-true loop around a very large, seemingly unpredictable
switch statement. It led to a dramatic improvement in microbenchmarks and
made some traditional optimizations involving computed goto and "indirect
threading" obsolete.

Does anyone know how it achieved this?

~~~
fulafel
It doesn't look like a series of comparisons from the CPU's point of view.
Normally switch statements are compiled like a series of "if" statements, but
the interpreter-loop-style switch gets compiled into a table of jump targets
that is indexed by the bytecode. The same kind of indirect branch prediction
features that were originally designed to help C++ "virtual" functions help
here - a branch target buffer, etc.

The VM interpreter loop tends to be the main bottleneck in languages that
have rather low-level VM instructions and data types. In higher-level VMs,
the dispatch on operand type is the main bottleneck instead; that too
benefits from indirect branch prediction.
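
For illustration, here's a minimal sketch of such an interpreter loop in C
(the opcodes and handlers are made up): with a dense set of case values, the
compiler emits a jump table, so each iteration dispatches through a single
indirect branch.

    
    
      /* hypothetical bytecodes; dense values 0..3 let the compiler build a jump table */
      enum { OP_PUSH, OP_ADD, OP_JMP, OP_HALT };
      
      void run(const unsigned char *code) {
        int pc = 0;
        for (;;) {
          switch (code[pc]) {          /* one indirect branch through the jump table */
            case OP_PUSH: /* ... */ pc += 2; break;
            case OP_ADD:  /* ... */ pc += 1; break;
            case OP_JMP:  pc = code[pc + 1]; break;
            case OP_HALT: return;
          }
        }
      }
    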

~~~
deepnotderp
Longer switches are usually compiled into binary search trees.

~~~
fulafel
It depends on whether the switched-on values are sparse. Contiguous ranges of
bytecodes are more efficiently compiled to straight jump tables (and enable
indirect branch prediction mechanisms in CPUs to work).

For another boost to leveraging indirect branch prediction, threaded code is
still a little better since each VM instruction has a unique jump call site:
[http://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables](http://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables)
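
A rough sketch of that style in C, using GCC/Clang's labels-as-values
extension (opcodes and handlers are again made up): every NEXT() expansion is
a separate indirect jump, so the predictor can track each dispatch site's
history independently.

    
    
      /* requires the GCC/Clang "labels as values" extension */
      int run(const unsigned char *code) {
        static void *dispatch[] = { &&op_push, &&op_add, &&op_halt };
        int pc = 0, acc = 0;
      #define NEXT() goto *dispatch[code[pc++]]
      
        NEXT();                             /* first dispatch */
      op_push: acc  = code[pc++]; NEXT();   /* each NEXT() is its own indirect jump site */
      op_add:  acc += code[pc++]; NEXT();
      op_halt: return acc;
      }
    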

~~~
ufo
That blog post is from 2012. What I observed is that with more recent
processors, threaded code doesn't make much of a difference anymore.

------
Sniffnoy
> PA 8000 (1996): actually implemented as a 3-bit shift register with majority
> vote

This actually seems interestingly different from the two-bit saturating
counter. Like, it's not just a different way of implementing it; you can't
realize the saturating counter as a "quotient" of the shift/vote scheme.

~~~
oshepherd
It's the same, just with redundant representations (the 2-bit repr is the
count of ones in the 3-bit repr):

    
    
      '00' -> '000'
      '01' -> '001', '010', '100'
      '10' -> '011', '101', '110'
      '11' -> '111'
    

This kind of transformation is truly bread & butter in hardware; we regularly
convert numbers between binary counts, masks/number-of-set-bits and one-hot
representations for optimisation purposes.

~~~
Sniffnoy
That's the obvious attempt at an equivalence one would come up with on being
told that they're the same, but, as I stated, it doesn't work. As an example:
in the 2-bit saturating counter, if you start at 00 and see a 1 and then a 0,
you're back to where you started. Whereas in the shift register, if you start
at 000 and see a 1 and then a 0, you're now at 010, which would correspond to
01 rather than 00.
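
A tiny simulation of both state machines makes this concrete (a sketch;
states are written in the same encoding as the table above):

    
    
      #include <stdio.h>
      
      int main(void) {
        int ctr = 0;                  /* 2-bit saturating counter, 0 = strongly not-taken */
        int sr  = 0;                  /* 3-bit shift register, 000 */
        int outcomes[] = { 1, 0 };    /* a taken branch, then a not-taken one */
      
        for (int i = 0; i < 2; i++) {
          int t = outcomes[i];
          ctr = t ? (ctr < 3 ? ctr + 1 : 3) : (ctr > 0 ? ctr - 1 : 0);
          sr  = ((sr << 1) | t) & 7;  /* shift the new outcome in, keep 3 bits */
        }
        /* prints: counter = 0, shift register = 010 */
        printf("counter = %d, shift register = %d%d%d\n",
               ctr, (sr >> 2) & 1, (sr >> 1) & 1, sr & 1);
        return 0;
      }
    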

------
irishsultan
I seem to be missing something. When the two-bit scheme is introduced, it's
said to be the same as the one-bit scheme except for storing two bits (seems
logical), but then the index into the lookup table seems to involve both the
branch address (already the case in the one-bit scheme) and the branch
history (which, as far as I can see, is never introduced).

~~~
Sniffnoy
Looks like he just used the wrong picture -- used the picture from "two-level
adaptive, global" instead.

------
ajkjk
Is there any system out there that supports branch 'annotations', of a sort,
so that the programmer or the compiler can just _tell_ the CPU what the branch
behavior is going to be?

Like -- it seems kinda silly for the CPU to do so much work to figure out if a
loop is going to be repeated frequently, when the code could just explicitly
say "fyi, this branch is going to be taken 99 times out of 100".

Or, if there's a loop that is always taken 3 times and then passed once, that
could be expressed explicitly, with a "predict this branch if i%4 != 0"
annotation.

~~~
ryangittins
Yes and no. I say _yes_ because C and C++ code commonly uses likely() and
unlikely() macros (well, technically they're wrappers around a compiler
extension rather than part of the language), which you wrap around the
condition inside your if() statement like so:

    
    
      /* not standard C: typically defined as a wrapper around GCC/Clang's builtin */
      #define unlikely(x) __builtin_expect(!!(x), 0)
      
      for (int i = 10000; i < 100000; i++) {
        if (unlikely(isPrime(i))) {    // hint: the prime case is rare
          // do something special
        } else {
          // do something boring
        }
      }
    

I say _no_ because most modern compilers simply ignore these hints.
Compilers have become so sophisticated over the years that developers trying
to help them along or optimize often make things worse, whether that's by
getting in the compiler's way or just writing code that's harder to read and
debug.

I say _no_ (or at least _probably not_ ) to your second question as well. No
language I know of implements anything as sophisticated as a specification
for regular intervals of branch switching. On the other hand, the branch
predictors in many modern CPUs can already detect simple regular patterns
like the one you describe.

To develop such a specification would optimize the branch prediction by a tiny
margin which would be absolutely _dwarfed_ by the overhead of learning the
syntax for the specification, not messing it up, debugging it if you do mess
it up, communicating the decision to other team members, and all of the other
real-world stuff that gets in the way.

Computers are faster than ever and branch prediction algorithms are smarter
than ever. Yes, you could help it along in theory but the portion of
applications which _really_ require you to do so is dwindling all the time.

------
legulere
I really have problems reading this website. You don't have to make a website
bloated to make it readable:
[http://bettermotherfuckingwebsite.com](http://bettermotherfuckingwebsite.com)

~~~
PyComfy
Have a look at the Chrome/Chromium extension Just Read.

before / after
[https://i.imgur.com/Ihy5wQh.png](https://i.imgur.com/Ihy5wQh.png)

~~~
jmkni
I've tried a couple of these types of extensions; Just Read is the best one
I've found so far.

------
filereaper
Ryzen has rolled out a neural-net-based branch predictor; I'd be curious to
see how its accuracy compares to the listed approaches.

~~~
glenneroo
It's even mentioned in the article:

> Some modern CPUs have completely different branch predictors; AMD Zen (2017)
> and AMD Bulldozer (2011) chips appear to use perceptron based branch
> predictors. Perceptrons are single-layer neural nets[0].

[0]:
[https://www.cs.utexas.edu/~lin/papers/hpca01.pdf](https://www.cs.utexas.edu/~lin/papers/hpca01.pdf)
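
For a rough idea of the scheme described in that paper (a simplified sketch;
the history length and threshold here are illustrative, and this is not AMD's
actual implementation): each branch gets a small vector of weights, the
prediction is the sign of the dot product of those weights with the recent
global history, and the weights are trained on mispredictions or
low-confidence predictions.

    
    
      #include <stdlib.h>
      
      #define HIST  16    /* global history length (illustrative) */
      #define THETA 45    /* training threshold; the paper suggests ~1.93*HIST + 14 */
      
      static int history[HIST];   /* recent outcomes as +1 (taken) / -1 (not taken) */
      
      int perceptron_output(const int w[HIST + 1]) {
        int y = w[0];                      /* bias weight */
        for (int i = 0; i < HIST; i++)
          y += w[i + 1] * history[i];
        return y;                          /* predict taken iff y >= 0 */
      }
      
      /* y is the output that produced the prediction; taken is 1 or 0 */
      void train(int w[HIST + 1], int y, int taken) {
        int t = taken ? 1 : -1;
        if ((y >= 0) != taken || abs(y) <= THETA) {   /* mispredicted or not confident */
          w[0] += t;
          for (int i = 0; i < HIST; i++)
            w[i + 1] += t * history[i];
        }
        for (int i = HIST - 1; i > 0; i--)   /* shift the new outcome into the history */
          history[i] = history[i - 1];
        history[0] = t;
      }
    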

------
lordnacho
Top quality article. Now we need one with specifics on how to write code
that's aware of this - for instance, when to use which compiler hints. Anyone
have links or books?

~~~
SolarNet
The compiler is (probably) smarter than you. Generally speaking, it will
automatically decide which branches are most likely and arrange them
accordingly (e.g. for things like for loops and while loops especially, where
the biggest gains are). You'd likely gain more performance out of algorithmic
changes, and then a number of other processor optimizations (like
vectorization and pre-fetch hints), first. Also, CPU manufacturers ignore the
classic branch hints, and don't really provide information on how to tune for
their branch predictors. So while GCC/Clang does have a special intrinsic
(`__builtin_expect`) for it in C/C++ - most other languages are too high
level for it to matter - it probably won't do much and is an insanely early
optimization to consider making.
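
For reference, the raw intrinsic looks like this (a minimal sketch; the
helper functions are hypothetical, and whether it changes the generated code
at all depends on the compiler and target):

    
    
      #include <stddef.h>
      
      void handle_error(void);   /* hypothetical slow path */
      void do_work(int *p);      /* hypothetical hot path */
      
      void f(int *ptr) {
        if (__builtin_expect(ptr == NULL, 0)) {   /* hint: usually false */
          handle_error();    /* compiler may move this out of the hot path */
        } else {
          do_work(ptr);
        }
      }
    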

~~~
adrianN
Whether or not it's an early optimization depends on when you add the
__builtin_expect, no? Perhaps you already have identified a very hot branch in
your code that the compiler fails to treat correctly even though you provided
it with profile data.

~~~
SolarNet
Except that anything newer than 1995 probably ignores the hint provided by
__builtin_expect, and you'd probably have more luck changing the algorithm or
vectorizing the code.

------
seedragons
Is this correct? "Without branch prediction, we then expect the “average”
instruction to take branch_pct * 1 + non_branch_pct * 20 = 0.8 * 1 + 0.2 * 20
= 0.8 + 4 = 4.8 cycles"

other than branch_pct and non_branch_pct being reversed, this seems to be
assuming that 100% of branches are guessed incorrectly. Shouldn't something
like 50% be used, to assume a random guess? I.e. 0.8 * 1 + 0.2 * (0.5 * 20 +
0.5 * 1) = 2.9

~~~
0xffff2
It's correct if you take "without branch prediction" to mean that there is no
pipelining of instructions past a branch at all.

The very first branch prediction algorithm ("predict taken") simply enables
pipelining past the branch by assuming the generally more likely outcome.

>...this seems to be assuming that 100% of branches are guessed incorrectly...

Rather, it's assuming that 100% of branches are not guessed at all.

~~~
seedragons
Ah, I see. I assumed the 20 cycles was a penalty only for mis-guessed
branches, not the number of cycles required to evaluate the branch itself.
Thanks!

------
zaptheimpaler
I love your posts, Dan. High-quality writing, no fluff and bullshit, every
time :)

------
unkown-unknowns
Figures 12 and 14 are the same, but I think that figure is only supposed to
look like that for figure 14, not for figure 12.

The "two-bit" scheme that fig 12 illustrates does not use branch history,
whereas "two-level adaptive, global", which fig 14 belongs to, fits the bill.

------
agumonkey
Beautiful article. The kind that makes you want to dig deeper in the whole
field.

------
deepnotderp
TAGE and perceptron combined are the SOTA right now, right?

~~~
redraga
Yes. TAGE still does better than perceptron, but combining the two is likely
to give you the best performance currently. Of course, all this is dependent
on area allocated to the predictor and the workloads being run.

