
DJB wrote a paper about how "Speculative execution is much less important for performance than commonly believed." https://arxiv.org/abs/2007.15919



The premise of the paper is not that branch prediction isn't important for performance, but that it is possible to design an entirely different ISA where it matters much less.

The idea here is essentially to create a branch delay slot instruction, which can then be used to set the latency of a branch so that it doesn't need prediction to avoid stalling the pipeline. Like so:

    dec r1
    basicblock 5  # next five instructions WILL execute, branches after that, if any
    jz exit_loop
    this
    will
    always
    run
    # branch takes effect here
The compiler can then commonly reorganize the code so that the variable branch-delay instructions perform useful work, as in the sketch below.
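
A rough sketch of what that might look like for a loop that sums an array, reusing the made-up basicblock syntax from above (the register names, load/add mnemonics, and the delay length of 4 are all invented for illustration, not taken from the paper): the loads and adds that have to run anyway fill the delay region instead of nops.

    loop:
    dec r1                # decrement the trip counter
    basicblock 4          # next four instructions WILL execute
    jz exit_loop          # exit test issues now, but takes effect only after the block
    load r3, [r2]         # delay region: load the next element
    add r4, r4, r3        # accumulate the running sum
    add r2, r2, 8         # advance the pointer
    # jz takes effect here; if not taken, fall through to the back edge
    jump loop
    exit_loop: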


The problem with techniques like this is that they're microarchitecture dependent. The optimal number of delay slots depends on how long the CPU's pipeline is, for instance. This means you'll always be leaving performance on the table, because you either have to cater to the lowest common denominator or change the ISA and recompile everything.


Yes; the Mill architecture's answer to that (though I'm beginning to think it'll never be released) is to make compiling the code from an intermediate representation, either just before runtime or at install time, a normal part of the process.

https://en.wikipedia.org/wiki/Mill_architecture


Maybe desktop applications will see something like what Android does, wrt shipping applications as bytecode and doing the last stages of compilation on the device, to get uarch-specific optimizations. CPU upgrades potentially being traumatic would be a downside, of course...


We already have that, at least on Windows with .NET. The only problem is that there are still native applications around for whatever reason. Even Android APKs can contain native libraries.


Does that do AoT-compiling-on-install, or is it still just JIT? I thought the JVM ecosystem was ahead on purely static binaries with Graal, but I might be mistaken.



I think the idea here is that the INT/FP pipeline length of CPU designs has stayed relatively fixed over ~15 years, so if you compile something today for "BBRISC-V", the branch delay lengths chosen by the compiler will likely be adequate for a pretty long time. I suspect the variable-length delay also makes it much easier for a compiler to actually perform useful work here, so if a uarch would need a bubble, the performance loss is fractional (e.g. if the delay length in the code is 3 insns but the core would rather have 4, only one bubble has to be inserted).


Note the important cleverness: the delay is set by the compiler, so the delay is only as big as is useful, and there's no need to change the ISA.

Architectures with an ISA-specified delay have significant issues, as you mention.


Wasn't SPARC like this? Lots of nop instructions had to be inserted by the compiler.


MIPS was. It had a single branch delay slot, which was often unused and also not enough for long pipelines. DJB’s variable-length system could work better.
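
For contrast, a minimal MIPS-style sketch (hand-written for illustration, not real compiler output; register choices are arbitrary): the single instruction after a branch always executes, so the compiler either hoists something useful into it or burns it on a nop.

    loop:
    addiu $t0, $t0, -1      # decrement the loop counter
    bne   $t0, $zero, loop  # branch back if the counter is not zero...
    addiu $a0, $a0, 4       # ...delay slot doing useful work: advance the pointer
    # versus, when nothing can be hoisted:
    beq   $t1, $zero, skip  # conditional branch with nothing movable behind it
    nop                     # delay slot wasted on a nop
    skip:

With only one architectural slot, deeper pipelines still have to predict or stall, which is the limitation the variable-length scheme discussed above is meant to address.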


I'm very skeptical about the paper's conclusions.

The measurements show that they can generate basic blocks of about 10~20 instructions (optimistic numbers, as they measure an average of 5), which allows them to move branches up by about 10~20 instructions. Since in this ISA the basic block bounds the instruction window, the instruction window is limited to around 20~40 instructions (the current basic block plus the next one). But modern superscalar processors have an instruction window of more than 100 instructions to provide high performance.

The low performance loss of their model compared to the branch-prediction model may perhaps be explained by the fact that they use an in-order CPU that makes very little use of ILP (instruction-level parallelism).

Moreover, it only addresses the problem of speculative execution; there are other types of transient execution.


Thanks for sharing, very interesting paper!



