
How and why CPUs do “branch prediction” (2017) - majke
https://danluu.com/branch-prediction/
======
saagarjha
One fun thing to note is that the author doesn’t really talk a lot about
anything from this decade, which makes sense considering how much complexity
was being added for the tiny improvements towards the end. Modern branch
predictors are quite complicated so they can eke out improvements in accuracy
by fractions of a percent.

~~~
Jach
Improving CPUs since ~2003 has just been incredibly hard. My favorite satirical
history of that era remains The Slow Winter:
[https://scholar.harvard.edu/files/mickens/files/theslowwinte...](https://scholar.harvard.edu/files/mickens/files/theslowwinter.pdf)

> "I wonder if we can make branch predictors even more accurate,” and the next
> day you’d start XOR’ing the branch’s PC address with a shift register
> containing the branch’s recent branching history, because in those days, you
> could XOR anything with anything and get something useful, and you test the
> new branch predictor, and now you’re up to 96% accuracy, ...

> When you retire in 2003, your face is wrinkled from all of the smiles, and
> even though you’ve been sued by several pedestrians who suddenly acquired
> rare paintings as hats, you go out on top, the master of your domain. You
> look at your son John, who just joined Intel, and you rest well at night,
> knowing that he can look forward to a pliant universe and an easy life.
> Unfortunately for John, the branches made a pact with Satan and quantum
> mechanics during a midnight screening of “Weekend at Bernie’s II.”
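
The XOR trick described there is essentially what became known as "gshare":
XOR the branch's PC with a global history shift register to index a table of
2-bit saturating counters. A minimal sketch in C, in case anyone wants to play
with it (the table size and history length are made up for illustration):

      #include <stdbool.h>
      #include <stdint.h>
      
      #define TABLE_BITS 12
      #define TABLE_SIZE (1u << TABLE_BITS)
      
      static uint8_t counters[TABLE_SIZE]; /* 0..3; >= 2 means "predict taken" */
      static uint32_t history;             /* recent branch outcomes, one bit each */
      
      static bool predict(uint32_t pc) {
          return counters[(pc ^ history) & (TABLE_SIZE - 1)] >= 2;
      }
      
      static void update(uint32_t pc, bool taken) {
          uint32_t i = (pc ^ history) & (TABLE_SIZE - 1);
          if (taken && counters[i] < 3) counters[i]++;        /* saturate upward */
          if (!taken && counters[i] > 0) counters[i]--;       /* saturate downward */
          history = ((history << 1) | (taken ? 1 : 0)) & (TABLE_SIZE - 1);
      }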

------
dang
Discussed at the time:
[https://news.ycombinator.com/item?id=15078574](https://news.ycombinator.com/item?id=15078574)

------
alphaBetaGamma
In some/many cases there is code before the branch that does not influence the
branch condition. In that case, would it be possible to architect the CPU such
that it could execute the branch 'early'? I'm thinking of a special asm
instruction like 'execute the next 4 instructions, and only then do the
conditional jump'.

E.g. instead of:

      i=0
      StartLoop:
      i+=1
      do_stuff_0
      do_stuff_1
      do_stuff_2
      do_stuff_3
      if i<N goto StartLoop

we would have:

      i=0
      StartLoop:
      i+=1
      do_stuff_0
      do_stuff_1
      if i<N goto StartLoop in 2 instructions
      do_stuff_2
      do_stuff_3

~~~
projektfu
Yes, that was made explicit in the MIPS instruction set with branch (and load)
delay slots, and it's implicit in out-of-order processors. As I understand it,
the branch delay slot paradigm did not pan out as well as was hoped, and it has
not found its way into many other architectures.

~~~
gpderetta
The issue with the branch delay slot is that it assumes there is exactly one
clock cycle (i.e. one stage) between fetch and execute. This was true on some
of the early five-stage in-order RISCs, but it hasn't been true for a while. In
a high-frequency in-order design there may be a dozen stages/clock cycles
between fetch and execute, which would be hard to fill with delay slots. OoO is
even more complicated.

------
vectorEQ
Really great article, love it: super clear and concise. These kinds of
mechanisms are super interesting to learn more about. I'm curious about the new
AMD CPUs which use a neural network to predict branches: how that affects
performance, and what kind of accuracy it peaks at.

~~~
ovi256
A brief introduction can be found in this post:

[https://chasethedevil.github.io/post/the_neural_network_in_y...](https://chasethedevil.github.io/post/the_neural_network_in_your_cpu/)

It talks about “Dynamic Branch Prediction with Perceptrons” (Jimenez 2001), the
paper which sparked the whole thing.

The AMD engineers read this paper (and probably many more!), put together a
development program, and shipped it.
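
The core idea is small enough to sketch: each branch hashes to a vector of
little integer weights, the prediction is the sign of a dot product between the
weights and the recent branch history, and training nudges the weights when the
predictor is wrong or unsure. Roughly, in C (sizes and the threshold follow the
paper's general recipe; the details of AMD's actual hardware aren't public):

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdlib.h>
      
      #define HIST_LEN   16
      #define NUM_PERCEP 1024
      #define THETA      ((int)(1.93 * HIST_LEN + 14)) /* training threshold from the paper */
      
      static int8_t weights[NUM_PERCEP][HIST_LEN + 1]; /* slot 0 is the bias weight */
      static int8_t history[HIST_LEN];                 /* recent outcomes as +1 / -1 */
      
      static int8_t clamp8(int v) { return v > 127 ? 127 : v < -127 ? -127 : v; }
      
      static int output(uint32_t pc) {
          int8_t *w = weights[pc % NUM_PERCEP];
          int y = w[0];                        /* bias term */
          for (int i = 0; i < HIST_LEN; i++)
              y += w[i + 1] * history[i];      /* dot product with the history */
          return y;
      }
      
      static bool predict(uint32_t pc) { return output(pc) >= 0; }
      
      static void train(uint32_t pc, bool taken) {
          int y = output(pc);
          int t = taken ? 1 : -1;
          if ((y >= 0) != taken || abs(y) <= THETA) { /* mispredicted, or not confident */
              int8_t *w = weights[pc % NUM_PERCEP];
              w[0] = clamp8(w[0] + t);
              for (int i = 0; i < HIST_LEN; i++)
                  w[i + 1] = clamp8(w[i + 1] + t * history[i]);
          }
          for (int i = HIST_LEN - 1; i > 0; i--)  /* shift in the new outcome */
              history[i] = history[i - 1];
          history[0] = (int8_t)t;
      }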

------
tuananh
reading these makes me feel like I don’t know much about programming :D

~~~
rramadass
Ha, ha. :-) A lot of us are in the same boat, except that few are willing to
acknowledge it openly. That said, I can suggest the following for further
reading (from simple to advanced):

a) Eben Upton's write-up on "Raspberry Pi and Spectre/Meltdown" gives a very
nice overview of the main features of modern processors -
[https://www.raspberrypi.org/blog/why-raspberry-pi-isnt-vulne...](https://www.raspberrypi.org/blog/why-raspberry-pi-isnt-vulnerable-to-spectre-or-meltdown/)

b) Inside the Machine: An Illustrated Introduction to Microprocessors and
Computer Architecture by Jon Stokes.

c) Computer Systems: A Programmer's Perspective by Bryant and O'Hallaron.

d) Modern Processor Design: Fundamentals of Superscalar Processors by Shen and
Lipasti.

------
myNewName1001
Regarding fixed-precision neural networks, it looks like Intel had some fairly
mature work on this at ISCA (fast inference, but training has to happen
elsewhere in the machine):

[https://arxiv.org/abs/1906.09889](https://arxiv.org/abs/1906.09889)

------
jedberg
This is a good primer for understanding the Spectre[0] issue. In a nutshell,
Spectre is caused by the CPU keeping the branches in an unencrypted cache that
is accessible from programs other than the one that generated the branch.

Their fix was to add an extra kernel check for unauthorized access, which is
why it causes a huge performance hit. The real fix will be in the next
generation of chips, which will either encrypt or at the least protect the
branch cache from unauthorized access from other programs.

[0]
[https://en.wikipedia.org/wiki/Spectre_(security_vulnerabilit...](https://en.wikipedia.org/wiki/Spectre_\(security_vulnerability\))

~~~
saagarjha
> In a nutshell, Spectre is caused by the CPU keeping the branches in an
> unencrypted cache that is accessible from programs other than the one that
> generated the branch.

I would instead say that branch prediction has observable side effects even
when its work is thrown away. Say, for example, the predicted branch performs a
memory access: it may leave data in the cache even if the processor ultimately
doesn’t take the branch, and hence it may be possible to recover information by
running a timing attack on how long it takes to access certain things.
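
The canonical "variant 1" gadget from the Spectre paper makes this concrete
(simplified; array1/array2 are the paper's placeholders for victim data):

      #include <stddef.h>
      #include <stdint.h>
      
      extern uint8_t array1[16];
      extern size_t  array1_size;
      extern uint8_t array2[256 * 512];
      
      void victim(size_t x) {
          /* Train the predictor with in-bounds x, then call with an
           * out-of-bounds x: the CPU may speculatively run the body anyway. */
          if (x < array1_size) {
              /* The speculative load is rolled back architecturally, but which
               * cache line of array2 it pulled in encodes the byte array1[x]
               * (the 512-byte stride maps each byte value to its own line), and
               * a timing probe over array2 can recover it. */
              volatile uint8_t tmp = array2[array1[x] * 512];
              (void)tmp;
          }
      }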

------
londons_explore
I feel like fixed-precision neural nets are probably the future of branch
prediction: both 'hardcoding' a neural net trained on a large corpus of
software, with the weights baked into a ROM on the CPU, and dynamic approaches
that decide which prediction strategy to use for a specific branch based on
state recorded by another neural net.

Approaches like this have been tried as a fun exercise, but I think we're now
reaching the point where single-thread performance is so important, and
mispredicts are so expensive, that it's worth taking quite a big power and area
hit to get a few more percentage points of performance...

~~~
jcranmer
> but now I think we're reaching the point where single thread performance is
> so important

Ever since Dennard scaling stopped, the biggest boosts to performance have been
in increasing the parallelization opportunity, both at the SIMD level and at
the multi-core level. Admittedly, working in HPC gives me a biased view, but
everyone I see has resigned themselves to processors ceasing to give
meaningful boosts to single-threaded performance.

Moreover, the ceiling you could get in boosting single-threaded performance
with a perfect branch predictor (for conditional branches) over current state-
of-the-art is around 4%. There's just not a lot of upside to be had, and even
just using the extra space for another core looks preferable given such a low
ceiling. Extra area for the branch predictor is likely to go to better
indirect branch prediction (including virtual function calls), which are
increasingly important in modern workloads, and where there is a much larger
headroom for performance improvement.

I'll also add that the effectiveness of neural nets in modern machine learning
contexts has come from making them big and deep, and deep means adding latency
to the branch predictor, which is not what you want (especially because you
want to know ASAP if you need to prefetch a different I-cache line).

~~~
gpderetta
The thing is, not everything is amenable to SIMD or multithreaded optimization.
As you said yourself, a lot of HPC code tends to parallelize more easily than
average. For a lot of code, out-of-order execution is the only way to extract
some amount of parallelism, and one of the biggest limiting factors to
increasing the reorder window is the branch prediction rate. Even an
improvement of 0.1% in the prediction rate can lead to non-trivial improvements
in performance.
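
(Back-of-the-envelope to make that concrete: at roughly one branch per five
instructions, a 200-instruction reorder window holds about 40 branches, all of
which must be predicted correctly for the window's speculative work to survive.
That's p^40: about 67% at p = 99.0% but about 70% at p = 99.1%, so small
accuracy gains compound across the window. The exact numbers are illustrative.)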

~~~
dragontamer
> For a lot of code out of order execution is the only way to extract some
> amount of parallelism

But that's the thing. A CPU is simply a parallel machine forced to accelerate
"sequential" code.

A "truly sequential" code sequence, like linkedList->head->next->next->next
cannot be parallelized on a modern CPU. The only stuff that can be
parallelized are if-statements / loops (aka: branch prediction: try to do the
future speculatively), and anything Tomasulo's algorithm happens to pick up.

Even then, the modern CPU will attempt to parallelize that linked-list access
through cache prefetching. That's actually why arrays work so well on today's
architectures: the L1, L2, L3, and DRAM components work in parallel through the
cache prefetchers (and arrays are so simple that the speculation will almost
certainly be correct).
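
In (hypothetical) C, the difference is just the dependency chain:

      #include <stddef.h>
      
      struct node { int value; struct node *next; };
      
      /* Serialized: the address of node N+1 isn't known until the load of
       * node N completes, so every load waits for the previous one. */
      int sum_list(const struct node *n) {
          int sum = 0;
          for (; n != NULL; n = n->next)
              sum += n->value;
          return sum;
      }
      
      /* Parallel: the address of a[i+1] doesn't depend on the value of a[i],
       * so the prefetchers and the OoO core can keep many loads in flight. */
      int sum_array(const int *a, size_t len) {
          int sum = 0;
          for (size_t i = 0; i < len; i++)
              sum += a[i];
          return sum;
      }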

In effect: people are writing parallel programs. They are just leaving it to
the CPU to figure out the parallelism details.

X = Y + Z; A = B + C can be made parallel (independent variables).

X = Y + Z; X = D + E. Also parallel: it's a write-after-write hazard, so X=D+E
can execute first and X=Y+Z can be thrown away entirely. Etc. etc.

CPUs simply find these patterns in your code and execute them in parallel.

AMD Zen has 4 integer pipelines + 4 vector pipelines + 2 load/store units per
core.

Intel has 8 pipelines of varying capabilities per core. Ports 0, 1, and 5 are
vector units, but port 0 is also the division pipeline. It's a bit more
complicated, but it's still 8x parallelism that is being fed by the front-end
(and reordered into the correct order by the back-end).

--------------

Basically: the parallelism does exist in the code that was written. The
programmer just hasn't explicitly acknowledged it yet.

But with every "speculative" branch taken, the CPU does work and then (maybe)
throws it away. The GPU philosophy of coding is to not do any speculative work
at all. Instead, GPUs throw 10 threads per shader (kinda like a core) and SMT
the heck out of them.

If thread#0 accesses memory, it will be stuck doing that for 300+ cycles. So
thread#1 will take over the core. When thread#1 waits on memory, thread#2
takes over. Etc. etc. By the time thread#9 and #10 roll around, thread#1
probably has its memory ready.

Instead of speculatively executing thread#0, the GPU is provided with
alternative work it can do. In any HPC context, there's probably "more work to
do" somewhere, so the GPU can remain saturated. It's naturally a more
power-efficient methodology: there's less wasted work.

It just requires the programmer to figure out a lot of tasks that the GPU can
do to remain saturated, while a CPU will dedicate more and more resources, all
the way up to speculative execution, to accelerate the same task, even if it's
highly wasteful.

------
danharaj
What's the state of the art of indirect jump prediction?

~~~
pjmgut
[https://hal.inria.fr/hal-00639041/document](https://hal.inria.fr/hal-00639041/document)

Might be of interest
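
For a flavor of the basic mechanism predictors like that build on: instead of
indexing a table of targets by PC alone (a plain BTB), you index by a hash of
the PC and recent path history, so the same indirect branch can predict
different targets in different contexts. A toy sketch in C (sizes and hash are
made up; ITTAGE adds multiple tagged tables with geometric history lengths):

      #include <stdint.h>
      
      #define TC_BITS 10
      #define TC_SIZE (1u << TC_BITS)
      
      static uint64_t targets[TC_SIZE]; /* last target seen for this (PC, history) */
      static uint64_t path_history;
      
      static uint64_t predict_target(uint64_t pc) {
          return targets[(pc ^ path_history) & (TC_SIZE - 1)];
      }
      
      static void update_target(uint64_t pc, uint64_t actual_target) {
          targets[(pc ^ path_history) & (TC_SIZE - 1)] = actual_target;
          path_history = (path_history << 3) ^ (actual_target >> 2); /* fold in the path */
      }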

~~~
danharaj
Thank you!

~~~
pjmgut
If you want to see other fun uses of (IT)TAGE, I'd recommend reading Arthur
Perais & Seznec's work on value prediction:
[https://hal.inria.fr/hal-01193175](https://hal.inria.fr/hal-01193175)

