
As a novice in this area, it's not clear to me after reading this what exactly the 2-ahead branch predictor is.



It's from around 30 years ago; my guess is it's referring to this[0] paper from 1996. It's above my head, but it seems to help with branch prediction issues arising with both many instruction units and high clock speeds, which were sort of an either/or in the '90s, but I think most modern processors are both.

Multiple-block ahead branch predictors

Abstract:

A basic rule in computer architecture is that a processor cannot execute an application faster than it fetches its instructions. This paper presents a novel cost-effective mechanism called the two-block ahead branch predictor. Information from the current instruction block is not used for predicting the address of the next instruction block, but rather for predicting the block following the next instruction block. This approach overcomes the instruction fetch bottle-neck exhibited by wide-dispatch "brainiac" processors by enabling them to efficiently predict addresses of two instruction blocks in a single cycle. Furthermore, pipelining the branch prediction process can also be done by means of our predictor for "speed demon" processors to achieve higher clock rate or to improve the prediction accuracy by means of bigger prediction structures. Moreover, and unlike the previously-proposed multiple predictor schemes, multiple-block ahead branch predictors can use any of the branch prediction schemes to perform the very accurate predictions required to achieve high-performance on superscalar processors.

[0] https://dl.acm.org/doi/10.1145/237090.237169

EDIT: oops. Looks like eyegor posted the link earlier. Oh well, enjoy the abstract.


My understanding is that they do not predict the target of the next branch but of the one after the next (2-ahead). This is probably much harder than next-branch prediction, but it does allow initiating code fetch much earlier to feed even deeper pipelines.


Surely you must also predict the next branch to predict the one after. Otherwise you wouldn’t know which is the one after.

Given that, I still don’t understand how predicting the next two branches is different from predicting the next branch and then the next after that, i.e. two times the same thing.


Interestingly you can build a branch predictor that predicts the second branch without predicting the first.

A branch predictor result is just a tuple of ("branch instruction address", "branch target address") that hints to the processor that when it encounters the branch instruction at "branch instruction address" in the future, it will likely branch to the branch target, so it makes sense to start fetching from that address and filling the instruction pipeline with whatever steps are safe to perform before the jump is actually taken.

Now, commonly this branch happens to be at the end of the current basic block and I assume some branch predictors may also leverage this fact in order to encode only offsets from the current instruction pointer.

But there is no reason the predicted branch can't lie beyond some other branches that may be taken first. As long as the CPU eventually gets to that branch location, the prediction will be useful. If the IP never reaches that location, it's as if the branch was never actually taken.
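To make that concrete, here's a minimal sketch of what such an entry could look like (the struct and field names are made up for illustration, not any real design):

  #include <stdint.h>

  /* Hypothetical branch target buffer (BTB) entry: the tuple described above. */
  struct btb_entry {
      uint64_t branch_pc;      /* address of the branch instruction itself   */
      uint64_t target_pc;      /* address we expect it to jump to            */
      uint8_t  taken_counter;  /* saturating counter: how often it was taken */
  };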


> Given that, I still don’t understand how predicting the next two branches is different from predicting the next branch and then the next after that, i.e. two times the same thing.

I'm not involved in CPU design, I just read a lot, but...

I think you need to do something special to have a second prediction, because you have to track three windows of out of order execution:

Window 0: code you're definitely running but is still being completed.

Window 1: code from the branch you think will be taken.

Window 2: code from the 2-ahead branch you think will be taken.

If you figure out that the window 1 branch isn't taken, you have to drop the whole pipeline (pipeline bubble). But if you figure out that window 1 is taken, then window 1 becomes window 0 and window 2 becomes window 1.

With a 1 ahead predictor, the pipeline stalls if you get to a conditional branch while speculating in window 1, because the processor can't manage three instruction windows.
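A rough sketch of that promotion logic in C (every name here is a placeholder of mine, not anything from AMD's actual design):

  /* win[0] = committed path, win[1] = 1-ahead guess, win[2] = 2-ahead guess. */
  typedef struct { unsigned long fetch_pc; } window_t;

  window_t predict_next_block(window_t after);  /* assumed predictor hook */
  void     flush_window(window_t *w);           /* assumed flush hook     */

  void on_window1_resolved(window_t win[3], int prediction_correct) {
      if (prediction_correct) {
          win[0] = win[1];                      /* 1-ahead becomes committed   */
          win[1] = win[2];                      /* 2-ahead becomes 1-ahead     */
          win[2] = predict_next_block(win[1]);  /* open a fresh 2-ahead window */
      } else {
          flush_window(&win[1]);                /* mispredict: drop both       */
          flush_window(&win[2]);                /* speculative windows         */
      }
  }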

IMHO, it sounds like if the core is doing SMT and both threads are active, each thread only gets 1-ahead prediction because the two windows are statically divided between the CPU threads. This may mean a) a significant boost for some loads when SMT is not in use, and b) SMT can branch predict in both threads in the same cycle; I don't think that was possible on AMD before (no idea for other vendors).


With a typical OoO CPU you have reorder buffers in the hundreds of instructions, so you are not speculating two or three branches but potentially dozens ahead of non-speculative execution (and that's why prediction accuracy is so important).

So the question remains: what's the innovation here? I'm sure there is something, but it is not simply speculating two branches ahead.

I need to read further, but this might be an optimization in the fetch stage to avoid a bubble when fetching consecutive taken branches.


Building on the sibling comment:

  if (a) { ... }

  if (b) { return x; } else { return y; }
The two branches can be wholly independent, but predicting the second is still a two-ahead prediction.


> Surely you must also predict the next branch to predict the one after. Otherwise you wouldn’t know which is the one after.

I'd think if you are at PC N and there are branches at N+1 and N+2, predicting just branch N+2 is fine because you predicted the N+1 branch previously, at PC N-1.


I wonder what they did before this change. Maybe the branch predictor hardware didn't account for depth beyond a single conditional branch, but the pipeline was probably always kept filled, just unpredicted.


Ah, that makes sense in the context of the article - thanks!


You're not alone. Non-novice, same here. Article spends a lot of time explaining the absolute basics of branch prediction but then when it gets to 2-ahead it just skips over explaining it...


I think it just predicts 2 branches per cycle instead of 1. So it can evaluate the result of n+2 ahead of time instead of only n+1 (typical branch prediction). How this works without wrecking the L1 cache, I'm not sure. It seems like the lookahead past n+1 would make cache evictions much more likely, so maybe I'm missing something here.

> Zen 5 can look farther forward in the instruction stream beyond the 2nd taken branch and as a result Zen 5 can have 3 prediction windows where all 3 windows are useful in producing instructions for decoding.

The original paper is open access but I haven't read far into it: https://dl.acm.org/doi/10.1145/237090.237169


> It seems like the lookahead past n+1 would make cache evictions much more likely, so maybe I'm missing something here.

The frontend is already predicting dozens of branches ahead of what the backend can actually confirm. Looking ahead by one extra branch doesn't really hurt.

Also, modern TAGE branch predictors are scary accurate, well above 99% on most code (including unpredictable indirect jumps). Besides, the majority of branch prediction targets are already in the L1 cache; it only predicts them because it saw them recently.

The branch predictor in Apple's M1 actually takes advantage of the latter fact. It doesn't predict what address to fetch next, it predicts which cacheline in L1 holds the target. So you only actually get branch predictions for targets in L1.
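If that's right, each entry can be tiny; a speculative sketch of such a cacheline-relative entry (field widths are my guesses, not Apple's real layout):

  #include <stdint.h>

  /* Speculative sketch: instead of a full 64-bit target address, the entry
     names a location inside the L1 instruction cache.                      */
  struct l1_relative_btb_entry {
      uint16_t l1_set   : 8;  /* which L1I set holds the target line        */
      uint16_t l1_way   : 3;  /* which way within that set                  */
      uint16_t line_off : 4;  /* instruction offset within the cacheline    */
  };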


That seems like a good idea against speculation-based attacks too; predict within what is at hand, and do not cause side effects.


I didn't even notice that advantage. I was just thinking about how it minimised each branch target entry to about 16 bits.

I suspect Apple are also using it as a way predictor. If the BTB points directly to the correct cacheline and each cacheline points to the next way, then the only time you need to do a search is on a branch mispredict.


A regular branch predictor tries to guess which way a branch (e.g. if-else) will go before it's executed. That way, the CPU can fetch and decode instructions in advance.

Each side of the branch leads to the start of a new block of instructions. The last instruction of such a block is usually another branch.

In other words, a branch predictor guesses the address of the next block. A 2-ahead branch predictor does the same thing, but for the two subsequent blocks.

As stated in the paper: "information from the current instruction block is used for predicting the address of the block following the next instruction block".

Unlike a regular branch predictor, it can do that without having to wait for instructions in the next block to be decoded. That way, you can feed multiple instruction decoders at once.

This is particularly useful in modern CPUs where the instruction decoder has become a bottleneck. With only one decoder and one instruction decoded per cycle (at best), it cannot keep up with a wide backend that can execute many instructions (e.g. 4-6) per cycle.
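A toy example of what "blocks" means here (do_x/do_y are placeholders; the comments mark where each block ends):

  void do_x(void);
  void do_y(void);

  void f(int a, int b) {
      if (a) {         /* block 0 ends with this conditional branch         */
          do_x();      /* block 1                                           */
      }
      if (b) {         /* block 2 ends with this conditional branch         */
          do_y();      /* block 3                                           */
      }
  }
  /* While fetching block 0, a 1-ahead predictor guesses the address of the
     next block (1 or 2). A 2-ahead predictor additionally guesses the block
     after that, so two decode clusters can be handed addresses in the same
     cycle.                                                                 */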


You can check out the seminal paper linked in the article. Or start by summarizing the paper with Gemini, Claude, ChatGPT, etc. to get a high level overview (and then confirm the answer by reading the paper).




