
Yeah, the need for the "-mllvm -x86-cmov-converter=false" hack is stupid; I forgot to check with it. In my mind I guess I assumed that had been fixed by https://reviews.llvm.org/D118118 (and indeed __builtin_unpredictable does work with ?:), but, no, that flag still improves things today for the rest of the cases.





It just occurred to me that Clang and GCC did not necessarily fail to use conditional moves in your examples. While they failed to use explicit cmov instructions, the sequence of a cmp, a jmp over a single instruction, and a mov is actually an idiom for an implicit cmov. Some CPU instruction decoders are able to turn it into a cmov despite the absence of an explicit cmov instruction. In the case of RISC-V, the designers are philosophically opposed to explicit cmov instructions and expect compilers to generate this idiom for CPUs that support cmov. I asked them to implement cmov virtual instructions to be nice to people reading RISC-V assembly, but I am not sure if anything will come of it:

https://github.com/riscv/riscv-bitmanip/issues/185

I do not know if any x86 CPUs recognize the implicit cmov idiom offhand, but if any do, then while an extra instruction was used, the conditional move would still be done on those that recognize the idiom.

By the way, I just noticed a case where you really don’t want the compiler to generate a cmov, explicit or otherwise, since it would risk division by zero:

https://github.com/openzfs/zfs/commit/f47f6a055d0c282593fe70...

Here is a godbolt link showing some output:

https://www.godbolt.org/z/4daKTKqfr

Interestingly, Clang correctly does not generate a cmov (implicit or explicit) for the outer ternary operation, while it does generate an explicit cmov for the inner ternary operator in MIN() without -mllvm -x86-cmov-converter=false. Passing -mllvm -x86-cmov-converter=false to Clang does not change the output, which makes Clang’s behavior correct.

GCC will not generate a cmov for either ternary operator, which, while also technically correct, is slow. This could still have been an implicit conditional move had GCC not avoided the implicit cmov idiom.

Using GCC’s __builtin_expect_with_probability() in MIN() does not change GCC’s output. If we remove the outer ternary, GCC will happily generate a cmov instruction. Given that GCC generally assumes undefined behavior is never invoked in order to make code faster, and will happily generate the cmov when the division by 0 bug is present, it is odd that upon seeing a check that verifies the very assumption it made, GCC stops generating a cmov. I am sure what GCC actually does is much more complicated than my interpretation of the output suggests, but the behavior is odd enough to merit a comment.


I haven't heard of anything outside RISC-V treating jump-over-mov as an idiom (though I've heard that some CPUs may be able to unwind only the necessary parts on mispredictions over small bits of code; I believe there's still some misprediction penalty; and even with -march=haswell the behavior doesn't change).

I find the RISC-V solution (which fwiw I mentioned in a sibling thread[0]) rather sad; there's no way to check whether it's implemented, and even where it is I could imagine it being problematic (e.g. if the instructions cross a fetch block or cacheline and the pair gets run as a branch, or some instructions around it break the fusion pattern matching), and where it's unsupported or otherwise doesn't work properly it'll "work" but be horrifically slow.

fwiw I haven't ever seen __builtin_expect_with_probability actually do anything for unpredictable branches; I just included it in my compiler explorer link for completeness.

Using a version of MIN that caches the X/Y computations gets gcc to produce a cmov, but makes clang's output longer: https://www.godbolt.org/z/6h8obxKG8

[0]: https://news.ycombinator.com/item?id=42992533


You might want to give that feedback to the RISC-V developers (although it might be too late at this point). What is the way to check whether the implicit cmov idiom is implemented in a CPU's instruction decoder?

If AMD did not implement this in Zen 5, maybe we could ask them to add it in Zen 7 or 8. I assume it would be too late to ask them to add this in Zen 6.

Thanks for the caching tip.


There's of course no "way" to check, as it's a microarchitectural property. Your best bet is comparing performance of the same code on predictable vs unpredictable branches.

I don't think there's any need for x86 cores to try to handle this; it's just a waste of silicon for something doable in one instruction anyway (I'd imagine that additionally instruction fusion is a pretty hot path, especially with jumps involved; and you'll get into situations of conflicting fusions as currently cmp+jcc is fused, so there's the question of whether cmp+jcc+mov becomes (cmp+jcc)+mov or cmp+(jcc+mov), or if you have a massive three-instruction four-input(?) fusion).

Oh, another thing I don't like about fusing condjump+mv: it makes it much harder to intentionally use branches on known-predictable conditions to avoid a data dependency on both arms.


> There's of course no "way" to check, as it's a microarchitectural property. Your best bet is comparing performance of the same code on predictable vs unpredictable branches.

I was afraid the answer to my question would be that, but since my reading of your previous comment (“there's way to check whether it's implemented”) seemed to suggest you knew a way I did not, I had my fingers crossed. It had to be either that you knew a trick I did not, or that a typo had deleted the word “no”.

> I don't think there's any need for x86 cores to try to handle this; it's just a waste of silicon for something doable in one instruction anyway (I'd imagine that additionally instruction fusion is a pretty hot path, especially with jumps involved; and you'll get into situations of conflicting fusions as currently cmp+jcc is fused, so there's the question of whether cmp+jcc+mov becomes (cmp+jcc)+mov or cmp+(jcc+mov), or if you have a massive three-instruction four-input(?) fusion).

Interestingly, the RISC-V guys seem to think that adding an explicit instruction is a waste of silicon while adding logic to detect the idiom in the instruction decoder is the way to go. x86 cores spend enormous amounts of silicon on situational tricks to make code run faster. I doubt spending silicon on one more trick would be terrible, especially since a number of the other tricks used to extract more performance likely apply to even more obscure situations. As for what happens in the x86 core, the instruction decoder would presumably emit for the implicit version whatever it emits for the explicit one. I have no idea what that is inside an x86 core. I suspect there are some corner cases to handle where the mov instruction faults (you would want the CPU to report that the mov triggered the fault, not the jmp), but it seems doable given that they already had to handle instruction faults in other cases of fusion.

Also, if either of us were sufficiently motivated, we might be able to get GCC to generate better code through a plugin that will detect the implicit cmov idiom and replace it with an explicit cmov:

https://gcc.gnu.org/onlinedocs/gccint/Plugins.html

A similar plugin likely could be written for LLVM:

https://llvm.org/docs/WritingAnLLVMNewPMPass.html#registerin...

Note that I have not confirmed whether their plugins are able to hook the compiler backend where they would need to hook to do this.

Of course, such plugins won’t do anything for all of the existing binaries that have the implicit idiom or any new binaries built without the plugins, but they could at least raise awareness of the issue. It is not a full solution since compilers don’t emit the implicit cmov idiom in all cases where a cmov would be beneficial, but it would at least address the cases where they do.


> since my read of your previous comment seemed to suggest you knew a way I did not, I had my fingers crossed.

Whoops, typo! edited.

> Interestingly, the RISC-V guys seem to think that adding an explicit instruction is a waste of silicon while adding this to the instruction decoder is the way to go

From what I've read, the thing they're against (or at least a major blocker) is having a standard GPR instruction that takes three operands, as all current GPR instructions take a max of two. I cannot imagine fusing instructions ever taking less silicon than a new instruction; if anything, it'd be about not wanting to waste opcode space, or being fine with the branchy version (which I'm not).

Zen 4, at least as per Agner's microarchitecture optimization guide, only fuses nops and cmp/test/basic_arith+jcc; not that many tricks, only quite necessary ones (nops being present in code alignment, and branches, well, being basically mandatory every couple instructions).

No need for a plugin; it is possible to achieve branchless moves on both as-is: https://www.godbolt.org/z/eojqMseqs. A plugin wouldn't be any more stable than that mess. (also, huh, __builtin_expect_with_probability actually helped there!)

I'd imagine a major problem for the basic impls is that the compiler may early on lose the info that the load can be run in both cases, at which point doing it unconditionally would be an incorrect transformation.


I had suggested the virtual instructions to the RISC-V developers to eliminate branchy cmov assembly, as I am not happy with it either. It is surprising that x86 cores are not making more use of macro-op fusion, but I guess it makes sense now that I think about it. Their designers have plenty of other knobs for tuning performance, and the better their branch predictors become, the less this matters outside of the cases where developers go out of their way to use cmov.

A plugin would handle cases where the implicit idiom is emitted without the developer needing to explicitly force this. As far as I know, most people never touch conditional moves on the CPU, and the few who do (myself included) only bother for extremely hot code paths, which leaves some low-hanging fruit on the table, particularly when the compiler emits the implicit version by coincidence. The safety of the transformation as a last pass in the compiler backend is not an issue, since the output would be no buggier than it previously was (both branches are already calculated). Trying to handle all cases (the higher-hanging fruit) is where you have to worry about incorrect transformations.


Ah, your gcc example does have a branch that could be done branchlessly; I was thinking about my original example with a load, which can't be transformed back.

On fusion, https://dougallj.github.io/applecpu/firestorm.html mentions ones that Apple's M1 does - arith+branch, and very specialized stuff.



