The premise of the paper is not that branch prediction isn't important for performance, but that it is possible to design an entirely different ISA where it matters much less.
The idea here is to essentially create a branch delay slot instruction, which then can be used to set the latency of a branch such that it doesn't require prediction to not stall the pipeline. Like so:
dec r1
basicblock 5 # next five instructions WILL execute, branches after that, if any
jz exit_loop
this
will
always
run
-- branch takes effect here
Then the compiler may commonly reorganize the code such that the variable branch delay instructions perform useful work.
The problem with techniques like this is that they're microarchitecture dependent. The optimal amount of delay slots to have will depend how how long the pipeline of the CPU is, for instance. This means that you'll always be leaving performance on the table, because you have to cater to the lowest common denominator, or change the isa and recompile everything
Yes; the Mill architecture's (I'm beginning to think it'll never be released) answer to that is to make compiling the code from an intermediate representation either just before runtime or at install time a normal part of the process.
Maybe desktop applications will see something like what Android does, wrt shipping applications as bytecode and doing the last stages of compilation on the device, to get uarch-specific optimizations. CPU upgrades potentially being traumatic would be a downside, of course...
We already have that, at least on windows with .net. Only problem is that there's still native applications around for whatever reason. Even android apks can contain native libraries.
Does that do AoT-compiling-on-install, or is it still just JIT? I thought the JVM ecosystem was ahead on purely static binaries with Graal, but I might be mistaken.
I think the idea here is that the INT/FP pipeline length of CPU designs has stayed relatively fixed over ~15 years, so if you compile something today for "BBRISC-V", the branch delay lengths chosen by the compiler will likely be adequate for a pretty long time. I suspect the variable-length delay makes it much easier for a compiler to actually perform useful work here, so if a uarch would need a bubble, the performance loss is fractional (e.g. delay length in code is 3 insns, but the core would rather have 4, then only one bubble has to be inserted).
MIPS was. It had a single branch delay slot, which was often unused and also not enough for long pipelines. DJB’s variable-length system could work better.
the measurements show that they can generate bb of about 10~20 instructions (optimistic numbers as they measure an average of 5) which allows them to move up the branches of about 10~20 instructions. As with this ISA the bb determines the bound of the instruction window, then the instruction window of this ISA is limited around 20~40 instructions (current bb plus next bb).
But modern superscalar processors have a instruction window > 100 instructions to provide high performance.
The low performance loss of their model compared to the branch prediction model may perhaps be explained by the fact that they use an in-order CPU that makes very little use of the ILP (Instruction Level Parallelism).
Moreover, it only addresses the problem of speculative execution, but there are other types of transient execution.