Hacker News new | past | comments | ask | show | jobs | submit login

Eh, it takes ~3-4 instrs to do a branchless "x ? y : z" on baseline rv64i (depending on the format you have the condition in) via "y^((y^z)&x)", and with Zicond that only goes down to 3 instrs (they really don't want to standardize GPR instrs with 3 operands so what Zicond adds is "x ? y : 0" and "x ? 0 : y" ¯\_(ツ)_/¯; might bring the latency down by an instr or two though).





It's more about removing branches than instruction counts or latency.

The "y^((y^z)&x)" method is already branchless, and close in performance to the Zicond variant, is my point; i.e. Zicond doesn't actually add much.

Are you sure? As soon as you add actual computations in you're heading through the whole execution pipeline & forwarding network, tying up ALUs, etc. Zicond can probably be handled without all that.

Also that isn't actually equivalent since `x` needs to be all 1s or all 0s surely? Neither GCC nor Clang use that method, but they do use Zicond.


Zicond's czero.eqz & czero.nez (& the `or` to merge those together for the 3-instr impl of the general `x?y:z`) still have to go through the execution pipeline, forwarding network, an ALU, etc just as much as an xor or and need to. It's just that there's a shorter dependency chain and maybe one less instr.

Indeed you may need to negate `x` if you have only the LSB set in it; hence "3-4 instrs ... depending on the format you have the condition in" in my original message.

I assume gcc & clang just haven't bothered considering the branchless baseline impl, rather than it being particularly bad.

Note that there's another way some RISC-V hardware supports doing branchless conditional stores - a jump over a move instr (or in some cases, even some arithmetic instructions), which they internally convert to a branchless update.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: