
I'm not saying you're wrong — I'm completely ignorant at the microcode level — but it seems to me like between

    cmp x, y
    je z
and

    cmp x, y
    sete z
the actual speculative part is the same: speculating as to the result of cmp x, y

If that's true, why would it not simply pipeline sete and the following instructions and simply execute (or not execute) sete according to its prediction, and then double check itself and reverse (or apply) the operation if the prediction was wrong?

I probably just have a bad mental model of what's going on under the hood, so whatever patience you have to deal with my stupid questions would be greatly appreciated.



The two sequences look very similar, and could be implemented the same way, but the actual implementation could not be more different.

> If that's true, why would it not simply pipeline sete and the following instructions and simply execute (or not execute) sete according to its prediction, and then double check itself and reverse (or apply) the operation if the prediction was wrong?

You cannot just reverse or apply one operation. The way speculation works, when the frontend encounters a conditional jump, the entire architectural state of the current thread is stored, and all future memory writes are held in the store buffer and not written out. Then, potentially dozens of cycles later, once the je is executed in the backend, either the old state is restored and the pending writes are discarded, or the saved state is discarded and the pending writes are released.
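As a rough software analogy (all names, sizes, and structure here are invented for illustration; real hardware does this in silicon with register checkpoints and a hardware store buffer), the checkpoint-and-commit dance looks something like:

```c
#include <string.h>

#define MEM_SIZE 16
#define SB_SIZE  8

static int memory[MEM_SIZE];
static int regs[4];

/* Pending speculative writes: held back until the branch resolves. */
static struct { int addr, value; } store_buffer[SB_SIZE];
static int sb_count = 0;

/* Saved architectural state, snapshotted when speculation begins. */
static int checkpoint_regs[4];

void begin_speculation(void) {
    memcpy(checkpoint_regs, regs, sizeof regs);
    sb_count = 0;
}

void speculative_store(int addr, int value) {
    store_buffer[sb_count].addr  = addr;
    store_buffer[sb_count].value = value;
    sb_count++;
}

/* Branch resolved. Prediction correct: release the held writes.
 * Prediction wrong: restore the registers, discard the writes.  */
void resolve(int prediction_was_correct) {
    if (prediction_was_correct) {
        for (int i = 0; i < sb_count; i++)
            memory[store_buffer[i].addr] = store_buffer[i].value;
    } else {
        memcpy(regs, checkpoint_regs, sizeof regs);
    }
    sb_count = 0;
}
```

The key point the sketch illustrates: rollback is wholesale, not per instruction. There is no machinery to "undo one op"; the whole checkpoint is restored or the whole store buffer is drained.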

In contrast, in ALUs, the inputs for instructions are always available before the instructions are scheduled to execute. It would be possible to implement sete like je, but this would imply significant changes to how and where it is executed. ALU ops cannot trigger speculation because there is no machinery for storing state at that part of the pipeline.

And no-one is ever going to implement cmov or sete like a jump: moving the op from being an ALU op to one that is speculatively executed in the frontend like jmp would be a significant pessimization of existing software, because for decades cmovs have been used precisely for unpredictable values, where sequencing and waiting for the real value is a better idea than speculating and failing half the time. A cmov serializes execution of any following operations that use its value, but independent work after it can always execute successfully. Speculating at an unpredictable cmov would cause that work to be thrown away uselessly half the time.
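The typical source pattern behind a cmov is an unpredictable select (a sketch; whether a given compiler actually emits cmov here depends on the target and optimization flags):

```c
/* On unpredictable data a branchy version mispredicts roughly half
 * the time, so compilers often emit cmp + cmov for this: the result
 * becomes a plain data dependency instead of a control dependency. */
int select_max(int a, int b) {
    return a > b ? a : b;
}

/* Work that does not depend on the selected value can still issue
 * alongside it; only the uses of the result have to wait.          */
int sum_and_max(int a, int b, int *max_out) {
    *max_out = select_max(a, b);  /* waits on the compare            */
    return a + b;                 /* independent, can run in parallel */
}
```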


Taking the example:

      cmpb $115, %cl
      sete %dl
      addl %edx, %eax
vs

      cmpb $115, %cl
      jne _run_switches_jmptgt1
      mov $1,   %dl
     _run_switches_jmptgt1:  
      addl %edx, %eax
The argument about why `jne` might be faster is that in the former case, the CPU always executes a dependency chain of length 3: `cmpb` -> `sete` -> `addl`. Each of these instructions has to be computed one after the other, as `sete` depends on the result of `cmpb`, and `addl` depends on the result of `sete`.

With `jne`, the CPU might predict the branch is not taken, in which case, the dependency chain is `mov` -> `addl` (the `mov` of an immediate might be handled by register renaming?).

Or that it is taken, in which case the dependency chain is just `addl`.
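In C terms, the two sequences above roughly correspond to (a sketch; register allocation and the exact instructions emitted are up to the compiler):

```c
/* sete version: eax += (cl == 115). A pure data dependency chain,
 * cmpb -> sete -> addl, three back-to-back dependent operations.  */
int count_sete(int acc, unsigned char c) {
    return acc + (c == 115);
}

/* jne version: the 1 is only written on the taken path, so a correct
 * branch prediction lets the add start before the compare resolves. */
int count_jne(int acc, unsigned char c) {
    int d = 0;
    if (c == 115)
        d = 1;
    return acc + d;
}
```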

I guess you're arguing that the CPU should handle `sete` the same way? That is, instead of treating `addl` as dependent on the result, predict what `sete` does and start executing `addl` before `sete` finishes, rewinding if that went wrong?


Yeah, or at least I don't understand why that wouldn't be possible.

Microcode can set the EIP register based on its prediction of what the result of cmpb $115, %cl will be.

Why can't it set the EDX register based on its prediction of what the result of cmpb $115, %cl will be?


In principle it is perfectly possible to speculatively execute cmov (and vice versa, to convert a jump over one instruction into conditional execution).

But Intel historically didn't do it, as programs tend to use cmov when the condition is unpredictable, so there was little reason to optimize it.

After Spectre, I believe Intel has given an architectural guarantee that cmov is never speculated, so it can be used as part of speculation attack prevention.


The purpose of control flow speculation is to avoid stalling the pipeline.

If each instruction was executed in one single clock cycle, the cost of executing a branch would be one cycle and that's it.

However, since there is a maximum speed at which operations can happen in hardware, the period of a clock cycle long enough to execute a whole instruction would be very long, and so the number of "instructions per second" the CPU could execute would be low.

Now, if you can break up each instruction into smaller steps and execute the smaller steps in an overlapping manner, such that while you're executing the second step of the first instruction you're executing the first step of the next instruction and so on (like on an assembly line in a factory), you can have a much shorter clock period for each of these steps, and at the end of each clock tick an instruction will complete execution. The CPU will still be running one instruction per clock cycle, but since each clock period is shorter the overall instructions-per-second rate will be higher.
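With made-up numbers, the arithmetic works out like this: if the logic for one whole instruction takes 5 ns, an unpipelined CPU can clock at 200 MHz. Split it into five 1 ns stages and the clock can run at 1 GHz, and (ignoring hazards) one instruction still completes per cycle:

```c
/* Peak throughput in MIPS for a given total logic delay and pipeline
 * depth. Idealized: perfectly balanced stages, no hazards or stalls. */
double throughput_mips(double total_logic_ns, int stages) {
    double clock_period_ns = total_logic_ns / stages; /* one stage's delay  */
    return 1000.0 / clock_period_ns; /* one instruction retires per cycle  */
}
```

So splitting the same 5 ns of logic into five stages raises peak throughput from 200 to 1000 MIPS, purely by shortening the clock period.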

But for this to work the next instruction you want to execute must be known in advance so that at each clock cycle the CPU can start step 1 of a new instruction.

That's easy when the program is executing sequentially but when there are branches involved it's more tricky.

And it's tricky even if the branch is not conditional! If the instruction execution is broken into many small steps, it may take one or more steps before figuring out that you have a branch in the first place, let alone decoding where you need to branch to. In the meantime the CPU will have happily started to execute the first "steps" of the next instruction.

This is called a "branch hazard".

Early CPU implementations handled branch hazards by just throwing away the intermediate state of the few instructions that were halfway through the pipeline and calling it a day (stalling the pipeline).

Early RISC CPUs attempted to be clever and use a trick called "delay slots": the instruction(s) already in the pipeline will continue to execute as if they were logically before the branch. This puts the onus on the programmer (or the compiler) to make sure that only instructions that are safe to execute regardless of whether the branch is taken are placed after the branch instruction (otherwise you just write nops).

But branch delay slots are not a panacea. As pipelines got deeper it became impractical to have a large number of delay slots, and even a small number of delay slots was often just filled with nops anyway.

Improving on UNconditional branches was done by "looking ahead" in the instruction stream for branch instructions. When the instructions are all of the same size it's easy to quickly look a few instructions ahead and tell when you found a branch. You also need an instruction encoding scheme that is relatively fast to decode, at the very least it should be fast to decode branches (the more complicated the logic to decode a branch is, the farther ahead you'd have to look in the instruction stream, which in turn would limit the size of the sequence of instructions you can fill your pipeline with between subsequent branches).

To further complicate the matter, even if you found the branch instruction and you decoded it, it doesn't mean you yet know where it will branch to!

Indirect jumps (where the address is in a register) are similar to conditional jumps in that you don't know the address you're jumping to by merely looking ahead in the instruction stream and noticing the branch instruction. You need to either wait until you execute the branch and stall the pipeline in the meantime, or keep them in the pipeline and flush the pipeline once you know the target of the branch.

The next trick that CPU designers came up with, well before speculative execution, is "branch target prediction".

The CPU keeps a little associative memory that maps addresses of branch instructions to branch targets. When the lookahead logic spots a branch instruction it looks in this map, gets a guess of the branch target, and uses that immediately as the next instruction so that the pipeline is kept fed with something.

If by the time the branch instruction is executed the guess turned out to be wrong, the pipeline is flushed in the same way it would have to be flushed anyway if we had no clever branch lookahead in the first place. But if the guess was right we paid only one cycle to execute the branch.
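A toy version of that associative memory, as a direct-mapped branch target buffer (the size, hash, and field names are invented for illustration; real BTBs are set-associative hardware structures):

```c
#include <stdint.h>

#define BTB_ENTRIES 64

/* Maps the address of a branch instruction to its last seen target. */
static struct { uint64_t branch_pc; uint64_t target; int valid; }
    btb[BTB_ENTRIES];

/* Fetch-time lookup: returns a predicted target, or 0 on a miss
 * (the frontend then just falls through to the next instruction). */
uint64_t btb_predict(uint64_t branch_pc) {
    unsigned i = (unsigned)(branch_pc >> 2) % BTB_ENTRIES; /* trivial hash */
    if (btb[i].valid && btb[i].branch_pc == branch_pc)
        return btb[i].target;
    return 0;
}

/* Execute-time update: record where this branch actually went, so
 * the next lookup guesses right (a wrong guess costs a pipeline flush). */
void btb_update(uint64_t branch_pc, uint64_t actual_target) {
    unsigned i = (unsigned)(branch_pc >> 2) % BTB_ENTRIES;
    btb[i].branch_pc = branch_pc;
    btb[i].target    = actual_target;
    btb[i].valid     = 1;
}
```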

This works for indirect unconditional branches and also for conditional branches! The prediction logic can be more subtle and complicated, and many, many things have been attempted, but this is the general idea.



