Eh, it takes ~3-4 instrs to do a branchless "x ? y : z" on baseline rv64i (depending on the format you have the condition in) via "z^((y^z)&x)", and with Zicond that only goes down to 3 instrs (they really don't want to standardize GPR instrs with 3 source operands, so what Zicond adds is "x ? y : 0" and "x ? 0 : y" ¯\_(ツ)_/¯; might bring the latency down by an instr or two though).
Are you sure? As soon as you add actual computations in you're heading through the whole execution pipeline & forwarding network, tying up ALUs, etc. Zicond can probably be handled without all that.
Also that isn't actually equivalent, since `x` needs to be all 1s or all 0s, surely? Neither GCC nor Clang uses that method, but they do use Zicond.
Zicond's `czero.eqz` & `czero.nez` (& the `or` to merge those together for the 3-instr impl of the general `x?y:z`) still have to go through the execution pipeline, forwarding network, an ALU, etc. just as much as an `xor` or an `and` does. It's just that there's a shorter dependency chain and maybe one less instr.
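For reference, a rough C model of that 3-instr Zicond sequence. The helper names `czero_eqz`, `czero_nez` and `select_zicond` are made up for illustration; they only mimic the instruction semantics (result is `rs1`, or 0 when the condition register `rs2` is zero / nonzero respectively) and say nothing about what a compiler would actually emit.

```c
#include <stdint.h>

/* Models czero.eqz rd, rs1, rs2: rd = (rs2 == 0) ? 0 : rs1 */
static inline uint64_t czero_eqz(uint64_t rs1, uint64_t rs2) {
    return rs2 == 0 ? 0 : rs1;
}

/* Models czero.nez rd, rs1, rs2: rd = (rs2 != 0) ? 0 : rs1 */
static inline uint64_t czero_nez(uint64_t rs1, uint64_t rs2) {
    return rs2 != 0 ? 0 : rs1;
}

/* x ? y : z as two independent czero ops merged by an `or`.
 * x may be any value; only zero vs. nonzero matters. */
static inline uint64_t select_zicond(uint64_t x, uint64_t y, uint64_t z) {
    uint64_t keep_y = czero_eqz(y, x); /* y if x != 0, else 0 */
    uint64_t keep_z = czero_nez(z, x); /* z if x == 0, else 0 */
    return keep_y | keep_z;
}
```

The two `czero` ops don't depend on each other, which is where the shorter dependency chain comes from: only the final `or` has to wait.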
Indeed you may need to negate `x` if you have only the LSB set in it; hence "3-4 instrs ... depending on the format you have the condition in" in my original message.
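A minimal C sketch of the baseline version, assuming 64-bit registers. The function names are hypothetical, and the select is written as `z ^ ((y ^ z) & mask)` so that an all-ones mask picks `y`. It shows why the count depends on the condition format: an all-0s/all-1s mask needs just xor/and/xor, a 0/1 value needs an extra negate, and an arbitrary nonzero value needs a normalize step (roughly `snez`) before that.

```c
#include <stdint.h>

/* Condition already an all-0s / all-1s mask: xor, and, xor (3 ops). */
static inline uint64_t select_mask(uint64_t mask, uint64_t y, uint64_t z) {
    return z ^ ((y ^ z) & mask);
}

/* Condition is 0 or 1 (only the LSB set): negate to get a mask first (4 ops). */
static inline uint64_t select_bool(uint64_t cond01, uint64_t y, uint64_t z) {
    uint64_t mask = (uint64_t)0 - cond01; /* 0 -> 0, 1 -> all ones */
    return select_mask(mask, y, z);
}

/* Condition is any value, nonzero meaning true: normalize to 0/1, then negate. */
static inline uint64_t select_any(uint64_t x, uint64_t y, uint64_t z) {
    uint64_t cond01 = (uint64_t)(x != 0);
    return select_bool(cond01, y, z);
}
```

Passing a condition like 2 straight into `select_mask` gives garbage, which is the "isn't actually equivalent" objection; the normalize/negate steps are what fix that.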
I assume gcc & clang just haven't bothered considering the branchless baseline impl, rather than it being particularly bad.
Note that there's another way some RISC-V hardware supports doing branchless conditional moves - a jump over a move instr (or in some cases, even some arithmetic instructions), which they internally convert to a branchless update.