
Using up scarce branch-prediction slots is a good way to make your program unoptimizable. The time wasted once you run out will not show up anywhere localized in your profile. (Likewise for blowing any other cache.)



Using up BTB slots is an interesting problem, but in practice it doesn't seem to be a big issue. If it were, ISAs would add things like hinted branches, but instead they've been taking them away. Code size is more important, but hot/cold splitting can help there.

A problem with using exceptions instead is that they defeat return-address prediction by unwinding the stack.


That hinted branches are not useful tells us nothing about the importance of branch predictor footprint. When hinting branches gives you a bigger L1 cache footprint, it has a high cost. Compilers nowadays use code motion to implement branch hinting, which does not burn L1 cache. (Maybe code motion is what you mean by "hot/cold splitting"?)

Anyway, the hint we really need, no ISA has: "do not predict this branch". (We approximate it with constructs that generate a "cmov" instruction, which in any case is not going away.)
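
For example (my sketch, guaranteed by nothing): GCC and Clang at -O2 on x86-64 typically lower a simple data-dependent select to cmov, so no predictor slot is consumed. Worth checking the disassembly either way.

    // Usually compiles to test + cmov rather than a branch on x86-64.
    // The standard promises nothing here; verify in the generated asm.
    int clamp_nonneg(int x) {
        return x < 0 ? 0 : x;
    }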

How does using exceptions defeat return address prediction? You are explicitly not returning, so any prediction would be wrong anyway. In the common case, you do return, and the predictor works fine.


> When hinting branches gives you a bigger L1 cache footprint, it has a high cost.

It was the same size on PPC, and on x86 using recommended branch directions (but not prefixes).

> Compilers nowadays use code motion to implement branch hinting, which does not burn L1 cache. (Maybe code motion is what you mean by "hot/cold splitting"?)

Hot/cold splitting is not just sinking unlikely basic blocks; it's moving them to the end of the program entirely.

That doesn't hint branches anymore, though; Intel hasn't recommended any particular branch layout since 2006.
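
A concrete sketch of the mechanism (my example, using GCC/Clang attribute names): a cold annotation pushes the function into the .text.unlikely section, and the expect hint lays the hot path out as fall-through.

    #include <cstdio>

    // With GCC (and typically Clang), placed in .text.unlikely,
    // away from the cache lines the hot code occupies.
    __attribute__((cold, noinline))
    static void report_error(int x) {
        std::fprintf(stderr, "bad input: %d\n", x);
    }

    int process(int x) {
        if (__builtin_expect(x < 0, 0)) {  // hinted as rarely taken
            report_error(x);
            return -1;
        }
        return x * 2;                      // hot path, fall-through
    }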

> How does using exceptions defeat return address prediction? You are explicitly not returning, so any prediction would be wrong anyway.

Anything that never returns is a mispredict there, but most things do return. What a throw does instead (read the DWARF tables, find the catch block, indirect jump) is harder to predict too, since it involves a lot of dependent memory reads.
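
A rough way to see the two failure paths side by side (my sketch, not from any guide; absolute numbers depend heavily on the unwinder and standard library):

    #include <chrono>
    #include <cstdio>
    #include <stdexcept>

    __attribute__((noinline)) static bool fail_code() { return false; }
    __attribute__((noinline)) static void fail_throw() {
        throw std::runtime_error("e");
    }

    int main() {
        using clock = std::chrono::steady_clock;
        constexpr long N = 100000;
        long sink = 0;

        auto t0 = clock::now();
        for (long i = 0; i < N; ++i)
            if (!fail_code()) ++sink;          // error-code failure path
        auto t1 = clock::now();

        for (long i = 0; i < N; ++i)
            try { fail_throw(); }              // unwinder-driven failure path
            catch (const std::exception&) { ++sink; }
        auto t2 = clock::now();

        auto ns = [](auto d) {
            return std::chrono::duration_cast<std::chrono::nanoseconds>(d).count();
        };
        std::printf("code: %lld ns/op, throw: %lld ns/op (sink=%ld)\n",
                    ns(t1 - t0) / N, ns(t2 - t1) / N, sink);
    }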


It suffices, for cache footprint, for cold code to be on a different cache line, maybe 64 bytes away. For virtual memory footprint, being on another page suffices, ~4k away. Nothing benefits from being at the "end of the program".

Machines do still charge an extra cycle for branches taken vs. not, so it matters whether you expect to take it.

Negligibly few things never return; most of those abort. Performance of those absolutely does not matter.

Why should anyone care about predicting the catch block a throw will land in after all the right destructor calls have finished? We have already established that throwing costs multiple L3 cache misses, if not actual page faults.


> Nothing benefits from being at the "end of the program".

It's not about the benefit; that's just the easiest way to implement it: put the cold code in a different TEXT section and let the linker move it.

That said, there is a popular desktop ARM CPU with 16KB pages.

> Machines do still charge an extra cycle for branches taken vs. not, so it matters whether you expect to take it.

Current-generation CPUs can issue one taken branch or two not-taken branches per cycle (although, strangely, Zen 2 couldn't), but yes: a correctly predicted not-taken branch is cheaper than a correctly predicted taken one. (https://www.agner.org/optimize/instruction_tables.pdf)

> Negligibly few things never return; most of those abort. Performance of those absolutely does not matter.

Throwing an exception isn't a return, nor is longjmp, or switching green threads, or whatever. Sometimes these are called abnormal or non-local returns, but as far as your C++ compiler is concerned, a function that always throws can be marked `noreturn`.
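
Concretely (standard C++, nothing compiler-specific):

    #include <stdexcept>

    // Throwing is not "returning", so a function that always
    // throws may legitimately be declared [[noreturn]].
    [[noreturn]] void fail(const char* what) {
        throw std::runtime_error(what);
    }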

Error-path performance is important, since there are situations, like network I/O, where errors aren't at all unexpected. If you're writing a program you can just special-case your hotter error paths, but if you're designing the language/OS/CPU underneath, you have to make harder decisions.
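
One sketch of that special-casing (all names hypothetical): the expected failure comes back as a plain value, while the rare one can stay on the exception path.

    #include <cstddef>
    #include <optional>
    #include <stdexcept>
    #include <string_view>

    // EOF is common, so it is just a value: no unwinding on the hot path.
    // Corruption is rare, so it can afford the exception machinery.
    std::optional<char> next_byte(std::string_view buf, std::size_t& pos) {
        if (pos >= buf.size())
            return std::nullopt;                       // hot "error"
        char c = buf[pos++];
        if (static_cast<unsigned char>(c) == 0xFF)     // hypothetical corruption marker
            throw std::runtime_error("corrupt stream"); // cold error
        return c;
    }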

> Why should anyone care about predicting the catch block a throw will land in after all the right destructor calls have finished? We have already established that throwing costs multiple L3 cache misses, if not actual page faults.

More prediction is always better. The earlier you can issue a cache miss, the earlier you get the data back.
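
The same idea in miniature, done by hand with the GCC/Clang prefetch builtin (a sketch; speculation past a predicted branch gets you this automatically):

    // Start the load for iteration i+8 while working on iteration i,
    // so the miss is already in flight by the time we need the data.
    long sum_indirect(const int* idx, const long* data, int n) {
        long s = 0;
        for (int i = 0; i < n; ++i) {
            if (i + 8 < n)
                __builtin_prefetch(&data[idx[i + 8]]);
            s += data[idx[i]];
        }
        return s;
    }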

For instance, that popular desktop ARM CPU can keep 600+ instructions in flight at once (according to AnandTech). That's a lot of unnecessary stalls if you mispredict.

And so its vendor has their own language, presumably compatible with it, which doesn't support exceptions.


More prediction is not, in fact, always better.

Specifically, a prediction is not better when it makes no difference. Then it is worse, because it consumes a resource that would be better applied somewhere it could make a difference.

Exceptions are the perfect example of a case where any prediction expenditure would be wasted, except for predicting that no exception will be thrown. It is always better to predict that no exception is thrown, because only the non-throwing case can benefit.

Running ahead and pre-computing results that would be thrown away if an exception is thrown is a pure win: with no exception, you are far ahead; with an exception, there is no useful work to be sped up, so it is all just overhead anyway.

This is similar to a busy-wait: you are better off having the exit from the wait predicted, because that reduces your response latency, even though history predicts that you will keep waiting. This is why there is now a special busy-wait instruction that does not consume a branch-prediction slot. (It also consumes no power, because it just sleeps until the cache line being watched shows an update.)
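
Assuming that's x86's WAITPKG umonitor/umwait pair (ARM's wfe is the rough analogue), a heavily hedged sketch; it needs -mwaitpkg and hardware support, and the flag variable is made up:

    #include <immintrin.h>    // _umonitor / _umwait (x86 WAITPKG)
    #include <x86intrin.h>    // __rdtsc
    #include <atomic>

    std::atomic<int> ready{0};    // hypothetical flag another thread will set

    void wait_for_ready() {
        while (ready.load(std::memory_order_acquire) == 0) {
            _umonitor((void*)&ready);                 // watch this cache line
            if (ready.load(std::memory_order_acquire) != 0)
                break;                                // re-check before sleeping
            _umwait(0, __rdtsc() + 1000000);          // doze until a write or the deadline
        }
    }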



