When Linus made that comment cmov was like a 6 cycle latency. For the last decad...

haberman · 2024-08-22T05:53:39 1724306019

The problem is not the 1 cycle latency, but the data dependency on both values. A correctly-predicted branch cuts the dependency on the value that is not used.

I've definitely measured scenarios where cmov/branchless is slower than a branch for a given algorithm. Especially if the branchless version is doing a bit more work to avoid the branch.
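
To make the dependency point concrete, here is a minimal C sketch of the two shapes being compared. The function names `pick_branch` and `pick_select` are made up for illustration, and whether a compiler actually emits cmov for the ternary is an assumption that depends on the compiler, flags, and target.

```c
#include <stdint.h>

/* Branchy form: only one side is loaded. If the branch is predicted
   correctly, the result does not wait on the operand that is not used. */
int64_t pick_branch(int cond, const int64_t *a, const int64_t *b) {
    if (cond)
        return *a;  /* *b is never loaded on this path */
    return *b;      /* *a is never loaded on this path */
}

/* Select form: both values are loaded up front, so the result has a data
   dependency on cond, x, and y, whichever one is picked. On x86-64 a
   compiler may lower the ternary to cmov, but that is not guaranteed.
   Both pointers must be valid here, since both are dereferenced. */
int64_t pick_select(int cond, const int64_t *a, const int64_t *b) {
    int64_t x = *a;
    int64_t y = *b;
    return cond ? x : y;
}
```

If one of the two loads misses in cache, `pick_select` waits for it even when the other value is the one selected, which is the cost described above.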

gpderetta · 2024-08-22T11:45:59 1724327159

It is both though. At 6+ cycles there are only a few places where CMOV is a win. At 1 cycle you can be more liberal with its use and the lack of dependency breaking is a tradeoff.

jnordwick · 2024-08-23T18:06:48 1724436408

cmov turns a control dependency into a data dependency. The dependency is still there, just in a different form. (It's like the Fourier transform of instructions - lolol.)

I haven't seen cmov lose to a conditional in a few years (on very recent Intel hardware). Maybe there are some circumstances where it creates a loop-carried dependency that wrecks the optimizer, but I haven't seen one in a minute.

To be fair - cmov versions often involve some extra computation to get things into a form where cmov can be used, and I often just opt for the well-predicted conditional, since those couple of extra instructions to set up the value to move can be avoided.
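
A branchless binary search is one commonly cited place where the loop-carried dependency mentioned above shows up. The sketch below is only an illustration: `lower_bound_branchless` is a hypothetical name, and the assumption is that the compiler turns the ternary into cmov rather than a branch.

```c
#include <stddef.h>

/* Returns the index of the first element >= key in the sorted array
   a[0..n), or n if every element is smaller than key. */
size_t lower_bound_branchless(const int *a, size_t n, int key) {
    const int *base = a;
    size_t len = n;
    while (len > 1) {
        size_t half = len / 2;
        /* No control flow here: the new base is selected from two
           candidates. The next iteration's load address depends on this
           result, so each load has to wait for the previous compare. */
        base = (base[half - 1] < key) ? base + half : base;
        len -= half;
    }
    /* One element may remain; step past it if it is still too small. */
    return (size_t)(base - a) + (size_t)(len == 1 && *base < key);
}
```

The loop always runs roughly log2(n) iterations with no early exit, which is an example of the extra work a cmov-friendly form can require. A branchy version can speculate ahead down the predicted side, whereas here every step is serialized on the previous select.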

unnah · 2024-08-22T07:35:44 1724312144

Good point. It makes me wonder if modern out-of-order processors can skip performing unused computations altogether, if their result registers are overwritten by other data later (in program order).

gpderetta · 2024-08-22T08:59:57 1724317197

CPUs could in theory speculate CMOV, reintroducing prediction when predictable. But after Spectre, IIRC Intel now guarantees that CMOV is never speculated.

ants_a · 2024-08-22T08:50:06 1724316606

No, and it feels unlikely that they will either.

mgaunard · 2024-08-22T08:06:34 1724313994

A cmov-based approach is necessarily slower than a correctly predicted branch-based approach, since cmov requires computing both branches and then selecting the result at the end.

clausecker · 2024-08-22T07:55:49 1724313349

Are you sure? I recall cmov always having single cycle latency.

BoardsOfCanada · 2024-08-22T09:10:34 1724317834

I know that it was decoded into 2 micro-ops and thus had to be decoded by the wide decoder, so perhaps 2 cycles?

clausecker · 2024-08-22T14:14:55 1724336095

That could be. Agner's tables seem to confirm that.

jnordwick · 2024-08-25T13:58:12 1724594292

Per Fog's tables, Ice Lake shows a reg-to-reg cmov decoding to 1 uop with a latency of 1 and throughput of 0.5. The *mont cores all have a latency of 2. The P4 had a latency of 6, and AMD's K10 a latency of 4.