When Linus made that comment, cmov had something like a 6-cycle latency. For the last decade it has been 1 cycle, and I don't think there is any scenario where cmov is now slower than a branch.
The problem is not the 1 cycle latency, but the data dependency on both values. A correctly-predicted branch cuts the dependency on the value that is not used.
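To make the dependency point concrete, here is a minimal C sketch of the two forms. The names `pick_branch` and `pick_select` are made up for illustration, and whether a compiler emits a branch, a cmov, or plain bitwise ops for either function depends on the compiler, flags, and target, so read this as a picture of the data flow rather than of the generated code.

```c
#include <stdint.h>

/* Branch form: with a correct prediction, dependent work can start as
   soon as the chosen operand is ready; the other operand is never
   waited on. (Hypothetical helper, not from any library.) */
int64_t pick_branch(int cond, int64_t a, int64_t b) {
    if (cond)
        return a;
    return b;
}

/* Select form, written with an explicit mask so the data flow is
   visible: the result depends on cond, a, AND b, even though one of
   a/b is thrown away. A cmov has the same shape of dependency. */
int64_t pick_select(int cond, int64_t a, int64_t b) {
    int64_t mask = -(int64_t)(cond != 0);   /* all ones if cond, else 0 */
    return (a & mask) | (b & ~mask);
}
```

Both return the same value for the same inputs; the difference is only in what the result has to wait for. If `b` comes from a slow load and `cond` is almost always true, the select form still waits on that load every time.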
I've definitely measured scenarios where cmov/branchless is slower than a branch for a given algorithm. Especially if the branchless version is doing a bit more work to avoid the branch.
It is both though. At 6+ cycles there are only a few places where CMOV is a win. At 1 cycle you can be more liberal with its use and the lack of dependency breaking is a tradeoff.
cmov turns a control dependency into a data dependency. The dependency is still there, just in a different form. (It's like the Fourier Transform of instructions - lolol).
I haven't seen cmov lose to a conditional in a few years (on very recent Intel hardware). Maybe there are some circumstances where it creates a loop-carried dependency that wrecks the optimizer, but I haven't seen one in a minute.
To be fair, cmov versions often involve some extra computation to get things into a form where cmov can be used, and I often just opt for the well-predicted conditional, since those couple of extra instructions to set up the value to move can be avoided.
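A small C sketch of that "extra setup" cost, assuming a simple conditional-sum loop (the function names and the loop itself are hypothetical, picked only to illustrate the point). The compiler is free to turn either version into the other, so this shows the source-level shape, not guaranteed codegen.

```c
#include <stddef.h>
#include <stdint.h>

/* Branch form: when the test fails, the add is skipped entirely. */
int64_t sum_above_branch(const int32_t *v, size_t n, int32_t limit) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > limit)
            sum += v[i];
    }
    return sum;
}

/* Select form: every iteration first materializes "v[i] or 0" (the
   extra setup work), then does an unconditional add. */
int64_t sum_above_select(const int32_t *v, size_t n, int32_t limit) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t add = (v[i] > limit) ? v[i] : 0;
        sum += add;
    }
    return sum;
}
```

With well-predicted data the branch form can win simply because it executes fewer instructions per element; with unpredictable data the select form avoids mispredictions. Measuring on the actual workload is the only way to tell.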
Good point. It makes me wonder if modern out-of-order processors can skip performing unused computations altogether, if their result registers are overwritten by other data later (in program order).
CPUs could in theory speculate CMOV, reintroducing prediction when predictable. But after Spectre, IIRC Intel now guarantees that CMOV is never speculated.
A cmov-based approach is necessarily slower than a branch-based approach that was correctly predicted, since cmov requires computing both branches and then selecting the result at the end.
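The "computes both branches" part can be sketched in C with call counters. Everything here (`left_side`, `right_side`, the counters, the arithmetic) is made up for illustration; it only shows how much work each form does, not how fast either one runs.

```c
/* Counters that record how often each side is evaluated. */
static int left_calls = 0;
static int right_calls = 0;

static int left_side(int x)  { left_calls++;  return x * 3 + 1; }
static int right_side(int x) { right_calls++; return x / 2; }

/* Branch form: exactly one side is evaluated per call. */
int step_branch(int x) {
    if (x & 1)
        return left_side(x);
    return right_side(x);
}

/* Select form: both sides are evaluated, then one result is kept.
   This is the shape a cmov-style sequence needs. */
int step_select(int x) {
    int l = left_side(x);
    int r = right_side(x);
    return (x & 1) ? l : r;
}
```

After one call to each with the same odd input, `left_calls` is 2 and `right_calls` is 1: the select form paid for a `right_side` result it then discarded. Whether that extra work costs more than the occasional mispredict is the tradeoff being argued about in this thread.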
Per Agner Fog's instruction tables, Ice Lake shows a reg-to-reg cmov decoding to 1 uop with a latency of 1 and a reciprocal throughput of 0.5. The *mont cores all have latency 2. P4 had a latency of 6, and AMD's K10 a latency of 4.