Eh, it takes ~3-4 instrs to do a branchless "x ? y : z" on baseline rv64i (depen...

IshKebab · 2025-02-09T16:35:10 1739118910

It's more about removing branches than instruction counts or latency.

dzaima · 2025-02-09T16:43:55 1739119435

The "z^((y^z)&x)" method (with `x` as an all-1s or all-0s mask) is already branchless and close in performance to the Zicond variant, which is my point: Zicond doesn't actually add much.
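For reference, a minimal C sketch of that XOR/AND select. `select_mask` is a made-up name, and the sketch assumes the condition has already been turned into a full-width mask (all 1s or all 0s). It is written as `z ^ ((y ^ z) & mask)` so that an all-1s mask yields `y`; on rv64i this should map to roughly `xor`, `and`, `xor`.

```c
#include <stdint.h>

/* Branchless select (hypothetical helper name).
   mask == all ones -> returns y
   mask == 0        -> returns z
   Any other mask value mixes bits of y and z, so the caller
   must supply a proper all-ones/all-zeros mask. */
static inline uint64_t select_mask(uint64_t mask, uint64_t y, uint64_t z) {
    return z ^ ((y ^ z) & mask);
}
```

With an all-1s mask, `(y ^ z) & mask` is just `y ^ z`, and `z ^ (y ^ z)` collapses to `y`; with a zero mask the AND clears everything and `z` passes through.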

IshKebab · 2025-02-09T18:04:40 1739124280

Are you sure? As soon as you add actual computations, you're heading through the whole execution pipeline & forwarding network, tying up ALUs, etc. Zicond can probably be handled without all that.

Also that isn't actually equivalent since `x` needs to be all 1s or all 0s surely? Neither GCC nor Clang use that method, but they do use Zicond.

dzaima · 2025-02-09T18:40:04 1739126404

Zicond's `czero.eqz` & `czero.nez` (& the `or` to merge those together for the 3-instr impl of the general `x?y:z`) still have to go through the execution pipeline, forwarding network, an ALU, etc., just as much as an `xor` or an `and` needs to. It's just that there's a shorter dependency chain and maybe one less instr.
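A rough C model of that 3-instr Zicond sequence, to make the data flow concrete. The function names are made up for illustration; the semantics follow the Zicond definitions (`czero.eqz` zeroes the result when the condition register is zero, `czero.nez` when it is nonzero). This only models what the instructions compute and says nothing about what a compiler will emit for the C itself.

```c
#include <stdint.h>

/* Model of: czero.eqz rd, rs1, rs2  ->  rd = (rs2 == 0) ? 0 : rs1 */
static inline uint64_t czero_eqz(uint64_t rs1, uint64_t rs2) {
    return rs2 == 0 ? 0 : rs1;
}

/* Model of: czero.nez rd, rs1, rs2  ->  rd = (rs2 != 0) ? 0 : rs1 */
static inline uint64_t czero_nez(uint64_t rs1, uint64_t rs2) {
    return rs2 != 0 ? 0 : rs1;
}

/* General x ? y : z as two czero ops plus an or (hypothetical helper).
   The two czero results do not depend on each other, so the
   dependency chain is two deep: czero, then or. */
static inline uint64_t select_zicond(uint64_t x, uint64_t y, uint64_t z) {
    uint64_t a = czero_eqz(y, x); /* y if x != 0, else 0 */
    uint64_t b = czero_nez(z, x); /* z if x == 0, else 0 */
    return a | b;
}
```

Here `x` can be any nonzero value for "true", which is the practical difference from the mask-based version: no normalization step is needed.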

Indeed you may need to negate `x` if you have only the LSB set in it; hence "3-4 instrs ... depending on the format you have the condition in" in my original message.
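A sketch of that 4-instr case, assuming the condition arrives as 0 or 1 (e.g. from `slt`/`sltu`); `select_bit` is a hypothetical name. Negating the bit turns 1 into all 1s and leaves 0 as 0, after which the usual XOR/AND select applies.

```c
#include <stdint.h>

/* Branchless select for a condition that is exactly 0 or 1
   (hypothetical helper; other values of bit are not supported). */
static inline uint64_t select_bit(uint64_t bit, uint64_t y, uint64_t z) {
    uint64_t mask = (uint64_t)0 - bit; /* neg: 0 -> 0, 1 -> all ones */
    return z ^ ((y ^ z) & mask);       /* xor, and, xor */
}
```

The `y ^ z` does not depend on the condition, so it can be computed in parallel with the negate; the dependent chain through the condition is negate, `and`, `xor`.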

I assume gcc & clang just haven't bothered considering the branchless baseline impl, rather than it being particularly bad.

Note that there's another way some RISC-V hardware supports doing branchless conditional stores - a jump over a move instr (or in some cases, even some arithmetic instructions), which they internally convert to a branchless update.