I'm sure TFA's conclusion is right; but its argument would be strengthened by providing the codegen for both versions, instead of just the better version. Quote:
"The second wrong thing with the supposedly optimizer [sic] version is that it actually runs much slower than the original version [...] wasting two multiplications and one or two additions. [...] But don't take my word for it, let's look at the generated machine code for the relevant part of the shader"
—then proceeds to show only one codegen: the one containing no multiplications or additions. That proves the good version is fine; it doesn't yet prove the bad version is worse.
The main point is that the conditional didn't actually introduce a branch.
Showing the other generated version would only show that it's longer; it is not expected to have a branch either. So I don't think it would have added much value.
But it's possible that the compiler is smart enough to optimize the step() version down to the same code as the conditional version. If true, that still wouldn't justify using step(), but it would mean that the step() version isn't "wasting two multiplications and one or two additions" as the post says.
(I don't know enough about GPU compilers to say whether they implement such an optimization, but if step() abuse is as popular as the post says, then they probably should.)
Okay, but how does this help the reader? If the worse code happens to optimize to the same thing, it's still awful and you get no benefits. It's likely not to optimize down unless you have fast-math enabled, because the extra float ops have to be preserved to be IEEE 754 compliant.
Fragment and vertex shaders generally don't target strict IEEE754 compliance by default. Transforming a * (b ? 1.0 : 0.0) into b ? a : 0.0 is absolutely something you can expect a shader compiler to do - that only requires assuming a is not NaN.
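A minimal sketch of that rewrite (my own example, not from the article; a and b stand for whatever the shader actually computes):

float stepStyle(float a, bool b)   { return a * (b ? 1.0 : 0.0); }  // what the source spells out
float selectStyle(float a, bool b) { return b ? a : 0.0; }          // what the compiler can emit instead

The two only differ when b is false and a happens to be NaN, infinity or -0.0, which shader compilers generally don't promise to preserve.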
> Unless you’re writing an essay on why you’re right…
He's writing an essay on why they are wrong.
"But here's the problem - when seeing code like this, somebody somewhere will invariably propose the following "optimization", which replaces what they believe (erroneously) are "conditional branches" by arithmetical operations."
Hence his branchless codegen samples are sufficient.
Further, regarding the side-issue "The second wrong thing with the supposedly optimizer [sic] version is that it actually runs much slower", no amount of codegen is going to show lower /speed/.
You missed the second part, where the article says that it "actually runs much slower than the original version", "wasting two multiplications and one or two additions", based on the idea that the compiler is unable to do a very basic optimization, implying that the compiler will actually multiply by one. No benchmarks, no checking assembly, just straightforward misinformation.
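For reference, spelling the two forms out by hand (my sketch, not the article's exact code) shows what the extra arithmetic would be if the compiler really did emit it literally:

float viaStepAndMix(float b, float c, float x, float y) {
    float t = step(y, x);             // 0.0 if x < y, else 1.0 -- the comparison is still there
    return b * (1.0 - t) + c * t;     // what mix(b, c, t) means: two multiplications plus adds
}

float viaTernary(float b, float c, float x, float y) {
    return (x < y) ? b : c;           // the original form: one compare + one select
}

Whether the first form actually survives to the machine code is exactly what checking the assembly (or a benchmark) would settle.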
I wish there was a good way of knowing when an if forces an actual branch rather than when it doesn't. The reason people do potentially more expensive mix/lerps is because while it might cost a tiny overhead, they are scared of making it a branch.
I do like that the most obvious v = x > y ? a : b; actually works, but it's also concerning that we have syntax where an if is sometimes a branch and sometimes not. In a context where you really can't branch, you'd almost like branch-if and non-branching-if to be different keywords. The non-branching one would fail compilation if the compiler couldn't do it without branching. The branching one would warn if it did end up compiled with a branch.
>The reason people do potentially more expensive mix/lerps is because while it might cost a tiny overhead, they are scared of making it a branch.
And the reason for that is the confusing documentation from NVIDIA and its Cg/CUDA compilers. I believe they did not want to scare programmers at first and hid the execution model, talking about "threads", and then they kept using that abstraction to hype up their GPUs ("it has 100500 CUDA threads!"). The result is people coding for GPUs with some bizarre superstitions though.
You actually want branches in the code. Those are quick. The problem is that you cannot have a branch off a SIMD way, so instead of a branch the compiler will emit code for both branches and the results will be masked out based on the branch's condition.
So, to answer your question - any computation based on shader inputs (vertices, compute shader indices and what not) cannot and won't branch. It will all be executed sequentially with masking. Even in the TFA example, both values of the ? operator are computed; the same happens with any conditional on an SIMD value. There can be shortcut branches emitted by the compiler to quickly bypass computations when all ways are the same value but in general case everything will be computed for every condition being true as well as being false.
Only conditionals based on scalar registers (shader constants/uniform values) will generate branches, and those are super quick.
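A sketch of the distinction in GLSL (hypothetical names; the uniform vs. varying inputs are the point):

#version 450
layout(location = 0) in float vShade;       // varying: different per pixel/lane
layout(location = 0) out vec4 fragColor;
uniform int uQuality;                       // scalar/uniform: same for the whole draw

void main() {
    float x;
    if (uQuality > 1) {                     // uniform condition: can become a real, cheap branch
        x = sqrt(vShade);
    } else {
        x = vShade;
    }
    float y = (vShade > 0.5) ? 1.0 : 0.0;   // varying condition: both values evaluated, result selected per lane
    fragColor = vec4(x * y);
}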
> So, to answer your question - any computation based on shader inputs (vertices, compute shader indices and what not) cannot and won't branch.
It can do an actual branch if the condition ends up the same for the entire workgroup - or to be even more pedantic, for the part of the workgroup that is still alive.
You can also check that explicitly, to e.g. take a faster special-case branch if possible for the entire workgroup, and otherwise a slower general-case branch (but also for the entire workgroup), instead of doing both and then selecting.
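A sketch of that explicit check using the GLSL subgroup vote extension (assuming GL_KHR_shader_subgroup_vote is available; the inputs and paths are made up):

#version 450
#extension GL_KHR_shader_subgroup_vote : require
layout(location = 0) in float vDist;
layout(location = 0) out vec4 fragColor;

void main() {
    vec3 color;
    if (subgroupAll(vDist > 10.0)) {
        // every active invocation agrees: take the cheap special case for the whole subgroup
        color = vec3(0.5);
    } else {
        // otherwise everyone takes the general path; we never run both sides and select
        color = vec3(exp(-vDist), 0.0, 0.0);
    }
    fragColor = vec4(color, 1.0);
}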
And this is why I wrote: "There can be shortcut branches emitted by the compiler to quickly bypass computations when all ways are the same value but in general case everything will be computed for every condition being true as well as being false."
Does this also apply for shaders? And is it even useful given the enormous variation in hardware capabilities out there? My impression was that it's all JIT compiled unless you know which hardware you're targeting, e.g. Valve precompiling highly optimized shaders for the Steam Deck.
(I'm not a graphics programmer, mind you, so please correct any misunderstandings on my end)
It's all JIT'd based on the specific driver/GPU, but the intermediate assembly language is sufficient to inspect things like branches and loop unrolling.
You will have to check for the different GPUs you are targeting. But GPU vendors don't start from scratch for each hardware generation, so you will often see similar results.
I'll comment this here as I got downvoted when I made the point in a standalone comment - this is mostly an academic issue, since you don't want to use step or pixel-level if statements in your shader code, as it will lead to ugly aliasing artifacts as the pixel color transitions from a to b.
What you want is to use smoothstep which blends a bit between these two values and for that you need to compute both paths anyway.
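Something like this (my sketch; d, colorA and colorB are placeholders for whatever distance and colors the shader actually uses):

vec3 shade(float d, vec3 colorA, vec3 colorB) {
    // hard cut would be: (d > 0.0) ? colorB : colorA -- aliases along the edge
    float w = fwidth(d);                 // roughly how much d changes across one pixel
    float t = smoothstep(-w, w, d);      // soft cut: blends over about one pixel
    return mix(colorA, colorB, t);       // both colors are needed near the edge anyway
}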
You got this the wrong way around: For GPUs conditional moves are the default and real branches are a performance optimization possible only if the branch is uniform (=same side taken for the entire workgroup).
Assuming computing the two function calls f() and g() is relatively expensive, it becomes a trade-off whether to emit conditional code or to compute both followed by a select. So it's not a simple choice, and the decision is made by the compiler.
The GPU will almost always execute f and g due to GPU differences vs CPU.
You can avoid the f vs g if you can ensure a scalar Boolean / if statement that is consistent across the warp. So it's not 'always' but requires incredibly specific coding patterns to 'force' the optimizer + GPU compiler into making the branch.
It depends. If the code flow is uniform for the warp, only one side of the branch needs to be evaluated. But you could still end up with pessimistic register allocation because the compiler can’t know it is uniform. It’s sometimes weirdly hard to reason about how exactly code will end up executing on the GPU.
f or g may have side effects too. Like writing to memory.
Now a conditional has a different meaning.
You could also have some fun stuff, where f and g return a boolean, because thanks to short circuit evaluation && || are actually also conditionals in disguise.
> it's also concerning that we have syntax where an if is some times a branch and some times not.
That's true on scalar CPUs too though. The CMOV instruction arrived with the P6 core in 1995, for example. Branches are expensive everywhere, even in scalar architectures, and compilers do their best to figure out when they should use an alternative strategy. And sometimes get it wrong, but not very often.
For scalar CPUs, historically CMOV used to be relatively slow on x86, and notably for reliable branching patterns (>75% reliable) branches could be a lot faster.
cmov also has dependencies on all three inputs, so if the branch is heavily biased and the unlikely input has a much higher latency than the likely one, a cmov can cost a fair amount of waiting.
Finally cmov were absolutely terrible on P4 (10-ish cycles), and it's likely that a lot of their lore dates back to that.
But you don't generally need to care if the shader code contains a few branches; modern GPUs handle those reasonably well, and the compiler will probably make a reasonable guess about what is fastest.
A non-branching version of the same algorithm will also run code equivalent to both branches. The branching version may sometimes skip one of the branches, the non-branching version can't. So if the functionality you want is best described by a branch, then use a branch.
Godbolt has the RGA compiler now; you can always paste in HLSL and look at the actual RDNA instructions that are generated (what the GPU actually runs, not SPIR-V).
I think that capability in the shader language would be interesting to have. One might even want it to two-color all functions in the code. Anything annotated nonbranching must have if statements compile down to conditional moves and must only call nonbranching functions.
A lot of the myth that "branches are slow on GPUs" is because, way back on the PlayStation 3, they were quite slow. NVIDIA's RSX GPU was on the PS3; it was documented that it was six cycles IIRC, but it always measured slower than that to me.
That was for even a completely coherent branch, where all threads in the warp took the same path. Incoherent branches were slower because the IFEH instruction took six cycles, and the GPU would have to execute both sides of the branch.
I believe that was the origin of the "branches are slow on GPUs" myth that continues to this day. Nowadays GPU branching is quite cheap especially coherent branches.
If someone says branching without qualification, I have to assume it’s incoherent. The branching mechanics might have lower overhead today, but the basic physics of the situation is that throughput on each side of the branch is reduced to the percentage of active threads. If both sides of a branch are taken, and both sides are the same instruction length, the average perf over both sides is at least cut in half. This is why the belief that branches are slow on GPUs is both persistent and true. And this is why it’s worth trying harder to reformulate the problem without branching, if possible.
Coherent branches are "free", but the extra instructions increase register pressure. That's the main reason why dynamic branches are avoided, not that they are inherently "slow".
These sort of avoid-branches optimizations were effective once upon a time as I profiled them on the XBox 360 and some ancient Intel iGPUs, but yeah - don't do this anymore.
Same story for bit extraction and other integer ops - we used to emulate them with float math because it was faster, but now every GPU has fast integer ops.
Is that true and to what extent? Looking at the ISA for RDNA2[0] for instance - which is the architecture of both PS5 and Xbox Series S|X - all I can find is 32-bit scalar instructions for integers.
You're likely going to have a rough time with 64-bit arithmetic in any GPU. (At least on Nvidia GPUs, the instruction set doesn't give you anything but a 32-bit add-with-carry to help.) But my understanding is that a lot of the arithmetic hardware used for 53-bit double-precision ops can also be used for 32-bit integer ops, which hasn't always been the case.
It needs to support 64-bit integer arithmetic for handling 64-bit address calculations efficiently. The SASS ISA since Volta has explicit 32I suffixed integer instructions alongside the regular integer instructions, so I would expect the regular instructions to be 64-bit, although the documentation leaves something to be desired:
Hmm, it looks like the underlying SASS does have 64-bit integer instructions now, but only with the 12.0 capability level in the recent Blackwell processors. Older versions emulate it via chained 32-bit instructions. Take this example kernel:
So if you want real 64-bit support, have fun getting your hands on a 5070! But even on sm_120, things like 64-bit × immediate 32-bit take a UIMAD.WIDE.U32 + UIMAD + UIADD3 sequence, so the support isn't all that complete.
(I've been looking into the specifics of CUDA integer arithmetic for some time now, since I've had the mad idea of doing 'horizontal' 448-bit integer arithmetic by storing one word in each thread and using the warp-shuffle instructions to send carries up and down. Given that the underlying arithmetic is all 32-bit, it doesn't make any sense to store more than 31 bits per thread. Then again, I don't know whether this mad idea makes any sense in the first place, until I implement and profile it.)
I'm less concerned about it being 32-bit and more about them being exclusively scalar instructions, no vector instructions. Meaning only useful for uniforms, not thread-specific data.
[Update: I remembered and double checked. While there are only scalar 32-bit integer instructions you can use 24-bit integer vector instructions. Essentially ignoring the exponent part of the floats.]
The programming model is that all threads in the warp / thread block run the same instruction (barring masking for branch divergence). Having SIMD instructions at the thread level is a rarity given that the way SIMD is implemented is across warps / thread blocks (groups of warps). It does exist, but only within 32-bit words and really only for limited use cases, since the proper way to do SIMD on the GPU is by having all of the threads execute the same instruction:
Note that I am using the Nvidia PTX documentation here. I have barely looked at the AMD RDNA documentation, so I cannot cite it without doing a bunch of reading.
That does sound like it would be a pretty big limitation. But there appear to be plenty of vector instructions for 32-bit integers in RDNA2 and RDNA3 [0] [1]. They're named V_*_U32 or V_*_I32 (e.g., V_ADD3_U32), even including things like a widening multiply V_MAD_U64_U32. The only thing missing is integer division, which is apparently emulated using floating-point instructions.
It's always been less of a big deal than it used to be -- at least on "big" GPUs -- but the article isn't really about avoiding branches.
The code presented is already branchless.
The people giving out the advice seem to think they are avoiding branches as an optimisation, but their understanding of what counts as branching code is apparently based on whether they can see some sort of conditional construct.
“If you consult the internet about writing a branch on a GPU, you might think they open the gates of hell and let demons in. They will say you should avoid them at all costs, and that you can avoid them by using the ternary operator or step() and other silly math tricks. Most of this advice is outdated at best, or just plain wrong.”
Processors change, compilers change. If you care about such details, best to ship multiple variants and pick the fastest one at runtime.
As I've mentioned here several times before, I've made code significantly faster by removing the hand-rolled assembly and replacing it with plain C or similar. While the assembly might have been faster a decade or two ago, things have changed...
I think figuring out the fastest version of a shader at runtime is very non-trivial, I'm not aware of any game or engine that can do this.
I think it'd be possible in principle, because most APIs (D3D, GL, Vulkan etc) expose performance counters (which may or may not be reliable depending on the vendor), and you could in principle construct a representative test scene that you replay a couple times to measure different optimizations. But a lot of games are quite dynamic, having dynamically generated scenes and also dynamically generated shaders, so the number of combinations you might have to test seems like an obstacle. Also you might have to ask the user to spend time waiting on the benchmark to finish.
You could probably just do this ahead of time with a bunch of different GPU generations from each vendor if you have the hardware, and then hard-code the most important decision. So not saying it'd be impossible, but yeah I'm not aware of any existing infrastructure for this.
The last time I did anything like this (it was for CPU linear algebra code designed to run in very heterogeneous clusters), I first came up with a parameterization that approximated how I'd expect an algorithm to perform. Then, once for each hardware combination, you sweep through the possible parameterization space. I used log-scaled quantization to make it cheap to index into an array of function pointers based on input specifics.
The important thing to note is that you can do that computation just once, like when you install the game, and it isn't that slow. Your parameterization won't be perfect, but it's not bad to create routines that are much faster than any one implementation on nearly every architecture.
This would be acceptable if it meant adding one more shader, but with "modern" graphics APIs forcing us to sometimes have thousands of permutations for the same shader, every variant you add multiplies that count by 2x.
We also don't have an infinite amount of time to work on each shader. You profile on the hardware you care about, and if the choice you've made is slower on some imaginary future processor, so be it - hopefully that processor is faster enough that this doesn't matter.
Funnily enough, this is sort of what the NVIDIA drivers do: they intercept game shaders and replace them by custom ones optimized by NVIDIA. Which is why you see stuff like this in NVIDIA drivers changelog: "optimized game X, runs 40% faster"
I don't work on the nvidia side of things but it's likely to be the same. Shader replacement is only one of a whole host of things we can do to make games run faster. It's actually kind of rare for us to do them since it bloats the size of the driver so much. A lot of our options do change how shaders work though, like forcing a shader to use double precision floats instead of the single it was compiled with.
> A lot of our options do change how shaders work though, like forcing a shader to use double precision floats instead of the single it was compiled with.
That will break code sufficiently reliant on the behaviour of single precision, though.
In the case that does happen, then we don't apply that setting. Most of the changes applied are extensively tested and toggles like that are more often used for already broken shaders.
I will note, half of the customer facing bugs I get are "works on nvidia." Only to find out that it is a problem with the game and not the driver. Nvidia allows you to ignore a lot of the spec and it causes game devs to miss a lot of obvious bugs. A few examples:
1) Nvidia allows you to write to read-only textures; game devs will forget to transition them to writable, and it will appear as corruption on other cards.
2) Nvidia automatically works with diverging texture reads, so devs will forget to mark them as a nonuniform resource index, which shows up as corruption on other cards.
3) Floating point calculations aren't IEEE compliant. One bug I fixed was x/width*width != x: on Nvidia this ends up a little higher and on our cards a little lower. The game this happened on ended up flooring that value and doing a texture read, which, as you can guess, showed up as corruption on our cards.
1 and 2 are specifically required by the Microsoft DirectX 12 spec, but most game devs aren't reading that and bugs creep in. 3 is a difference in how the ALU is designed, our cards being a little closer to IEEE compliant. A lot of these issues are related to how the hardware works, so they stay pretty consistent between the different GPUs of a manufacturer.
Side note: I don't blame the devs for #3, the corruption was super minor and the full calculation was spread across multiple functions (assumed by reading the dxil). The only reason it sticks out in my brain though is because the game devs were legally unable to ever update the game again, so I had to fix it driver side. That game was also Nvidia sponsored, so it's likely our cards weren't tested till very late into the development. (I got the ticket a week before the game was to release.) That is all I'm willing to say on that, I don't want to get myself in trouble.
> A lot of our options do change how shaders work though, like forcing a shader to use double precision floats instead of the single it was compiled with.
What benefit would that give? Is double precision faster than single on modern hardware?
That's specifically because GPUs aren't IEEE compliant, and calculations will drift differently on different GPUs. Double precision can help avoid divide-by-zero errors in some shaders, because most don't guard against that, and NaNs propagate easily and show up as visual corruption.
Only more precision. But no, doubles are not faster. At best they’re the same instruction latency & throughput as singles, and that’s only on a few expensive pro/datacenter GPUs. Even if they are technically the same instruction speed, they’re still 2x the memory & register usage, which can compromise perf in other ways. Doubles on consumer GPUs are typically anywhere from 16 to 64 times slower than singles.
FWIW, I’ve never heard of shader replacement to force doubles. It’d be interesting to hear when that’s been used and why, and surprising to me if it was ever done for a popular game.
Does the studio pay them to do it? Because Nvidia wouldn't care otherwise?
Does Nvidia do it unasked, for competitive reasons? To maximize how much faster their GPU's perform than competitors' on the same games? And therefore decide purely by game popularity?
Or is it some kind of alliance thing between Nvidia and studios, in exchange for something like the studios optimizing for Nvidia in the first place, to further benefit Nvidia's competitive lead?
I can't give details on how we do our selections (not Nvidia but another GPU manufacturer). But we do have direct contacts into a lot of studios and we do try and help them fix their game if possible before ever putting something in the driver to fix it. Studios don't pay us; it's mutually beneficial for us to improve the performance of the games. It also helps the game run better on our cards by avoiding some of the really slow stuff.
In general, if our logo is in the game, we helped them by actually writing code for them; if it's not, then we might have only given them directions on how to fix issues in their game or put something in the driver to tweak how things execute. From an outside perspective (but still inside the GPU space), Nvidia does give advice to keep their competitive advantage. In my experience so far that includes ignoring barriers that are needed per the spec, defaulting to massive numbers when the GPU isn't known ("Batman and tessellation" should be enough to find that), and then outright weird stuff that doesn't look like something any sane person would write in a shader (I have a thought in my head for that one, but it's not considered public knowledge).
AFAIK NVIDIA and AMD do this unasked for popular game releases because it gives them a competitive advantage if 'popular game X' runs better on NVIDIA than AMD (and vice versa). If you're an AAA studio you typically also have a 'technical liason' at the GPU vendors though.
It's basically an arms race. This is also the reason why graphics drivers for Windows are so frigging big (also AFAIK).
I think this is very accurate. The exception is probably those block buster games. Those probably get direct consultancy from NVIDIA during the development to make them NVIDIA-ready from day 1.
The other commenter makes it sound a bit more crazy than it is. "Intercept shaders" sounds super hacky, but in reality, games simply don't ship with compiled shaders. Instead they are compiled by your driver for your exact hardware. Naturally that allows the compiler to perform more or less aggressive optimisations, similar to how you might be able to optimise CPU programs by shipping C code and only compiling everything on the target machine once you know the exact feature sets.
Graphics drivers on Windows definitely do plenty of 'optimizations' for specific game executables, from replacing entire shaders to 'massaging/fixing' 3D-API calls.
And AFAIK Proton does things like this too, but for different reasons (fixing games that don't adhere to the D3D API documentation and/or obviously ignored D3D validation layer messages).
I don't know -- if that other commenter is correct, it does sound pretty crazy.
Improving your compiler for everybody's code is one thing.
But saying, if the shader that comes in is exactly this code from this specific game, then use this specific precompiled binary, or even just apply these specific hand-tuned optimizations that aren't normally applied, that does seem pretty crazy to me.
Fingerprinting based on shaders is quite rare; really most of the time we detect things like the exe name calling us, or sometimes, very rarely, they will give us a better name through an extension (Unreal Engine does this automatically). From there all the options are simple, but full shader replacements are one. In the API I work on, the shaders have a built-in hash value, so that along with the game being identified means we know exactly what shader it is. Most of the replacements aren't complicated though; it's just replacing slow things with faster things for our specific hardware. In the end we are the final compiler, so us tweaking things to work better should be expected to a degree.
In the case of the Minecraft mod Sodium, which replaces much of Minecraft's rendering internals, Nvidia optimisations caused the game to crash.
So the mod devs had to implement workarounds to stop the driver from detecting that Minecraft is running... (changing the window title among other things)
And plain Minecraft did the opposite by adding -XX:JavaHeapDumpPath=MojangTricksIntelDriversForPerformance_javaw.exe_minecraft.exe.heapdump to the command line, when they changed the way the game started so that it wasn't detected as minecraft.exe any more. (Side effect: if you trigger a heap dump, it gets that name)
Some of the mistakes/confusion being pointed out in the article is being replicated here, it seems.
The article is not claiming that conditional branches are free. In fact, the article is not making any point about the performance cost of branching code, as far as I can tell.
The article is pointing out that conditional logic in the form presented does not get compiled into conditionally branching code.
And that people should not continue to propagate the harmful advice to cover up every conditional thing in sight[0].
Finally, on actually branching code: that branching code is more complicated to execute is self-evident. There are no free branches. Avoiding branches is likely (within reason) to make any code run faster. Luckily[1], the original code was already branchless. As always, there is no universal metric to say whether optimisation is worthwhile.
[0] the "in sight" is important -- there's no interest in the generated code, just in the source code not appearing to include conditional anythings.
[1] No luck involved, of course ... (I assume people wrote to IQ to suggest apparently glaringly obvious (and wrong) improvements to their shader code, lol)
So why isn't the compiler smart enough to see that the 'optimised' version is the same?
Surely it understands "step()" and can optimize the "step()=0.0" and "step()==1.0" cases separately?
This is presumably always worth it, because you would at least remove one multiplication (usually turning it into a conditional load/store/something else)
The other part of the optimization issue is that you can't take too long to try anything and everything. Most of the optimizations happen on the driver side, and anything that takes too long will show up as shader compilation stutter. I can't say currently if this is or isn't done; it's just always something you have to think about.
It may very well be, it is the type of optimisation where it is quite possible that some compilers may do it some of the time, but it is definitely also possible to write a version that the compiler can't grok.
Vulkan's opcode shader language is not executed. It’s a platform-neutral intermediate language, so it won’t have the special-purpose optional instructions most GPUs do, since GPUs aren’t required to.
It likely compiles down on the relevant platforms as the original article did.
The second wrong thing with the supposedly optimizer version is that it actually runs much slower than the original version. The reason is that the step() function is actually implemented like this:
float step( float x, float y )
{
return x < y ? 1.0 : 0.0;
}
How are we supposed to know what OpenGL functions are emulated rather than calling GPU primitives?
The only way is do what OP did – compile your shader, disassemble, and read the assembly.
I do that quite often with my HLSL shaders, learned a lot about that virtual instruction set. For example, it’s interesting that GPUs have a sincos instruction, but inverse trigonometry is emulated while compiling.
Because you care about performance?
step being implemented as a library function on top of a conditional doesn't really say anything about its performance vs being a dedicated instruction. Don't worry about the implementation.
Because you are curious about GPU architectures?
Look at disassembly, (open source) driver code (including LLVM) and/or ISA documentation.
I’ve never seen a GPU with special primitives for any functions other than what you’d see in PC-style assembly. Every time I’ve looked at a decompiled shader, it’s always been pretty much what you think of in C.
Also, specs like OpenGL specify the behavior of many intrinsics, which is then implemented per the spec using standard assembly instructions.
Find an online site that decompiles to various architectures.
EDIT: I see my source of confusion must be that "branch" must have a well-understood hardware-specific meaning that goes beyond the meaning I grew up with, which is that a conditional is a branch, because the path control takes (at the machine code level) is chosen at runtime. This makes a conditional jump a branch by definition.
> How are we supposed to know what OpenGL functions are emulated rather than calling GPU primitives?
To me the problem was obvious, but then again I'm having trouble with both your and author's statements about it.
The problem I saw was: obviously, by going for a step() function, people aren't turning logic into arithmetic, they're just hiding logic in a library function call. Just because step() is a built-in or something you'd find used in a mathematical paper doesn't mean anything; the definition of step() in mathematics is literally a conditional too.
Now, the way to optimize it properly to have no conditionals, is you have to take a continuous function that resembles your desired outcome (which in the problem in question isn't step() but the thing it was used for!), and tune its parameters to get as close as it can to your target. I.e. typically you'd pick some polynomial and run the standard iterative approximation on it. Then you'd just have an f(x) that has no branching, just a bunch of extra additions and multiplications and some "weirdly specific" constants.
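One possible shape of such a replacement (my sketch; the specific curve and constants are whatever your fitting process produces):

// a smooth, branch-free stand-in for a hard cutoff around 'edge';
// larger k makes it approach step(edge, x)
float softStep(float edge, float k, float x) {
    return 0.5 + 0.5 * tanh(k * (x - edge));
}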
Where I don't get the author is in insisting that conditional move isn't "branching". I don't see how that would be except in some special cases, where lack of branching is well-known but very special implementation detail - like where the author says:
> also note that the abs() call does not become a GPU instruction and instead becomes an instruction modifier, which is free.
That's because for floats the representation (IEEE-754) is sign-magnitude, designed so the sign is isolated in the most significant bit; abs() then boils down to unconditionally setting that bit to 0 - or, equivalently, masking it off for the instruction that's reading it. (Two's complement ints also keep the sign in the most significant bit, but integer abs() takes a little more than a mask; the free modifier here is about floats.)
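Spelled out in GLSL (a sketch; in practice you'd just call abs() and let the compiler fold it into the modifier):

float absViaSignBit(float x) {
    // IEEE-754 floats keep the sign in the top bit, so clearing it is abs()
    return uintBitsToFloat(floatBitsToUint(x) & 0x7FFFFFFFu);
}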
step() isn't like that, nor any other arbitrary ternary operation construct, and nor is - as far as I know - a conditional move instruction.
As for where I don't get 'ttoinou:
> How are we supposed to know what OpenGL functions are emulated rather than calling GPU primitives
The basics like abs() and sqrt() and basic trigonometry are standard knowledge, the rest... does it even matter? step() obviously has to branch somewhere; whether you do it yourself, let a library do it, or let the hardware do it, shouldn't change the fundamental nature.
> the meaning I grew up with, which is that a conditional is a branch
A conditional jump is a branch. But a branch has always had a different meaning than a generic “conditional”. There are conditional instructions that don’t jump, e.g. CMOV, and the distinction is very important. Branch or conditional jump means the PC can be set to something other than ‘next instruction’. A conditional, such as a conditional select or conditional move, one that doesn’t change the PC, is not a branch.
> take a continuous function […] Then you’d just have an f(x) that has no branching
One can easily implement conditional functions without branching. You can use a compare instruction followed by a Heaviside function on the result, evaluate both sides of the result, and sum it up with a 2D dot product (against the compare result and its negation). That is occasionally (but certainly not always) faster on a GPU than using if/else, but only if the compiler is otherwise going to produce real branch instructions.
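A sketch of that construction in GLSL (my names; note both candidate values must already be computed, which is exactly the trade-off mentioned):

float selectByCompare(float x, float y, float a, float b) {
    float s = float(x > y);                    // Heaviside of the comparison: 1.0 or 0.0
    return dot(vec2(a, b), vec2(s, 1.0 - s));  // a when x > y, otherwise b
}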
Maybe I’m misunderstanding why branching is slow on a GPU. My understanding was that it’s because both sides of the branch are always executed, just one is masked out (I know the exact mechanics of this have changed), so that the different cores in the group can use the same program counter. Something to that effect, at least.
But in this case, would calculating both sides and then using a way to conditionally set the result not perform the same amount of work? Whether you’re calculating the result or the core masks the instructions out, it’s executing instructions for both sides of the branch in both cases, right?
On a CPU, the performance killer is often branch prediction and caches, but the GPU itself is executing a mostly linear set of instructions, or is my understanding completely off? I guess I don’t really understand what it’s doing, especially for loops.
The primary concern is usually over the masking you’re talking about, the issue being simply that you’re proportionally cutting down the number of threads doing useful work. Using Nvidia terminology, if only one thread in a warp is active during a branch, the GPU throughput is 32x slower than it could be with a full warp.
Not all GPU branches are compiled in a straight line without jumps, so branching on a GPU does sometimes share the same instruction cache churn that the CPU has. That might be less of a big deal than thread masking, but GPU stalls still take dozens of cycles. And GPUs waiting on memory loads, whether it’s to fill the icache or anything else, are up to 32x more costly than CPU stalls, since all threads in the warp stall.
Loops are just normal branches, if they’re not unrolled. The biggest question with a loop is will all threads repeat the same number of times, because if not, the threads that exit the loop have to wait until the last thread is done. You can imagine what that might do for perf if there’s a small number of long-tail threads.
A "branch" here is a conditional jump. This has the issues the article mentions, which branchless programming avoids:
Again, there is no branching - the instruction pointer isn't manipulated, there's no branch prediction involved, no instruction cache to invalidate, no nothing.
This has nothing to do with whether the behaviour of some instruction depends on its arguments. Looking at the Microsoft compiler output from the article, the iadd (signed add) instruction will get different results depending on its arguments, and the movc (conditional move) will store different values depending on its arguments, but after each the instruction pointer will just move onto the next instruction, so there are no branches.
> a conditional is a branch, because the path control takes (at the machine code level) is chosen at runtime. This makes a conditional jump a branch by definition.
s/conditional is a/conditional jump is a/
Problem solved.
Non-jump conditionals have been a thing for decades.
Branching is different instruction paths, so it requires reading the instructions from different memory that causes a delay jumping to those new instructions rather than plowing ahead on the current stream of instructions. So a conditional jump is a branch but a conditional move is just an instruction that moves one of two values into a register but doesn’t affect what code is executed next.
It kinda does when you’re wondering what’s going on backstage and working with shaders across multiple OSes, drivers and hardware.
> Now, the way to optimize it properly to have no conditionals, is you have to take a continuous
I suspect that we shader authors really like Clean Math, and that’s also why we like to think such “optimizations” with the step function are a nice modification :-)
This is a great question that I see everywhere in programming and I think it is core to why you measure first when optimizing.
You generally shouldn’t know or care how a built-in is implemented. If you do care, you’re probably thinking about optimization. At that point the answer is “measure and find out what works better.”
I've been caught by this. Even Claude/ChatGPT will suggest it as an optimisation. Every time I've measured a performance drop doing this. Sometimes significant.
Is that weird? LLMs will just repeat what is in their training corpus. If most of the internet is recommending something wrong (like this conditional move "optimization") then that is what they will recommend too.
I think the core problem is that when writing code like this you need experience to be sure that it won’t have a conditional branch. How many operations past the conditional cause a branch? Which operations can the compiler elide to bring the total below this count? I’m all for writing direct code and relying on smart compilers, but it’s often hard to know if and where I’m going to get bitten. Do I always have to inspect the assembly? Do I need a performance testing suite to check for accidental regressions? I find it much easier if I can give the compiler a hint on what I expect it to do; this would be similar to a @tailcall annotation. That way I can explore the design space without worrying that I’ll accidentally overstep some hard-to-reason-about boundary that will tank the performance.
Yes, that would be great. It's not always beneficial, but in some (rare, but for me important) cases it's better. Currently, the only way to ensure a conditional move is used is to use inline assembly. This is not portable and also less maintainable than a "proper" solution.
Eh, it takes ~3-4 instrs to do a branchless "x ? y : z" on baseline rv64i (depending on the format you have the condition in) via "y^((y^z)&x)", and with Zicond that only goes down to 3 instrs (they really don't want to standardize GPR instrs with 3 operands so what Zicond adds is "x ? y : 0" and "x ? 0 : y" ¯\_(ツ)_/¯; might bring the latency down by an instr or two though).
Are you sure? As soon as you add actual computations in you're heading through the whole execution pipeline & forwarding network, tying up ALUs, etc. Zicond can probably be handled without all that.
Also that isn't actually equivalent since `x` needs to be all 1s or all 0s surely? Neither GCC nor Clang use that method, but they do use Zicond.
Zicond's czero.eqz & czero.nez (& the `or` to merge those together for the 3-instr impl of the general `x?y:z`) still have to go through the execution pipeline, forwarding network, an ALU, etc just as much as an xor or and need to. It's just that there's a shorter dependency chain and maybe one less instr.
Indeed you may need to negate `x` if you have only the LSB set in it; hence "3-4 instrs ... depending on the format you have the condition in" in my original message.
I assume gcc & clang just haven't bothered considering the branchless baseline impl, rather than it being particularly bad.
Note that there's another way some RISC-V hardware supports doing branchless conditional stores - a jump over a move instr (or in some cases, even some arithmetic instructions), which they internally convert to a branchless update.
Not always - https://www.godbolt.org/z/zYxeahf3T. And for any modern (as in, made in the last two decades) x86 processor the branchless version will be hilariously better if the condition is unpredictable (which is a thing the compiler can't know by itself, hence wanting to have an explicit way to request a conditional move instr) and the per-branch code takes less than like multiple dozens of cycles.
Worse, doing one of the idioms for a conditional move ends up getting gcc to actually produce a conditional move, but clang doesn't, even with its __builtin_unpredictable: https://www.godbolt.org/z/bq9axzvjG
You want to pass -mllvm -x86-cmov-converter=false. I assume LLVM has a pass to undo conditional moves on x86 whenever a heuristic determines skipping a calculation by branching is cheaper than doing the calculation and using a conditional move.
Unfortunately, the heuristic that calculates the expense often gets things wrong. That is why OpenZFS passes -mllvm -x86-cmov-converter=false to Clang for certain files where the LLVM heuristic was found to do the wrong thing:
The issue explains why __builtin_unpredictable() does not address the problem. In short, the metadata is dropped when an intermediate representation is generated inside clang since the IR does not have a way to preserve the information.
Yeah, the need for the "-mllvm -x86-cmov-converter=false" hack is stupid; forgot to check with it. In my mind I guess I conflated that being fixed with https://reviews.llvm.org/D118118 (and indeed __builtin_unpredictable does work with ?:), but, no, that flag still improves things today for the rest of the cases.
It just occurred to me that Clang and GCC did not necessarily fail to use conditional moves in your examples. While they failed to use explicit cmov instructions, cmp / jmp over 1 instruction / mov is actually an idiom for an implicit cmov. Some CPU instruction decoders are able to turn it into a cmov without an explicit cmov instruction. In the case of RISC-V, the designers are philosophically opposed to explicit cmov instructions and expect compilers to generate this idiom for CPUs that support cmov. I asked them to implement cmov virtual instructions to be nice to people reading RISC-V assembly, but I am not sure if anything will come of it:
I do not know if any x86 CPUs recognize the implicit cmov idiom offhand, but if any do, then while an extra instruction was used, the conditional move would still be done on those that recognize the idiom.
By the way, I just noticed a case where you really don’t want the compiler to generate a cmov, explicit or otherwise since it would risk division by zero:
Interestingly, Clang correctly does not generate a cmov (implicit or explicit) for the outer ternary operation, while it does generate an explicit cmov for the inner ternary operator in MIN() without -mllvm -x86-cmov-converter=false. Passing -mllvm -x86-cmov-converter=false to Clang does not change the output, which makes Clang’s behavior correct.
GCC will not generate cmov for either ternary operator, which while also technically correct, is slow. This could still have been an implicit conditional move had GCC not avoided the implicit cmov idiom.
Using GCC’s __builtin_expect_with_probability() in MIN() does not cause GCC to change its output. If we remove the outer ternary, GCC will happily generate a cmov instruction. Given that GCC generally assumes that undefined behavior is not invoked to make code faster and will happily generate the cmov when there is a division by 0 bug, it is odd that upon seeing a check that verifies the assumption GCC made is true, GCC decides to stop generating a cmov. I am sure the way GCC does things is much more complicated than my interpretation of the output, but the behavior is odd enough to merit a comment.
I haven't heard of anything outside RISC-V having jump-over-mov as an idiom (though I've heard of potentially some CPUs having the ability to unwind only necessary parts on mispredictions over small bits of code or something; still some misprediction penalty though I believe; and even with -march=haswell behavior doesn't change).
I find the RISC-V solution (which fwiw I mentioned in a sibling thread[0]) rather sad; there's no way to check whether it's implemented, and even where it is I could imagine it being problematic (i.e. if the instructions cross a fetch block or cacheline or something and it gets run as a branch, or some instrs around it break the fusion pattern checking), and where it's unsupported or otherwise doesn't work properly it'll "work" but be horrifically slow.
fwiw I haven't ever seen __builtin_expect_with_probability actually do anything for unpredictable branches; I just included it in my compiler explorer link for completeness.
Using a version of MIN that caches the X/Y computations gets gcc to produce a cmov, but makes clang's output longer: https://www.godbolt.org/z/6h8obxKG8
You might want to give feedback to the risc-v developers (although it might be too late at this point). What is the way to check if implicit cmov instructions are implemented in the CPU instruction decoder?
If AMD did not implement this in Zen 5, maybe we could ask them to add it in Zen 7 or 8. I assume it would be too late to ask them to add this in Zen 6.
There's of course no "way" to check, as it's a microarchitectural property. Your best bet is comparing performance of the same code on predictable vs unpredictable branches.
I don't think there's any need for x86 cores to try to handle this; it's just a waste of silicon for something doable in one instruction anyway (I'd imagine that additionally instruction fusion is a pretty hot path, especially with jumps involved; and you'll get into situations of conflicting fusions as currently cmp+jcc is fused, so there's the question of whether cmp+jcc+mov becomes (cmp+jcc)+mov or cmp+(jcc+mov), or if you have a massive three-instruction four-input(?) fusion).
Oh, another thing I don't like about fusing condjump+mv - it makes it stupidly more non-trivial to intentionally use branches on known-predictable conditions for avoiding the dependency on both branches.
> There's of course no "way" to check, as it's a microarchitectural property. Your best bet is comparing performance of the same code on predictable vs unpredictable branches.
I was afraid the answer to my question would be that, but since my read of your previous comment “there's way to check whether it's implemented” seemed to suggest you knew a way I did not, I had my fingers crossed. At least, it had been either that you knew a trick I did not, or that a typo had deleted the word “no”.
> I don't think there's any need for x86 cores to try to handle this; it's just a waste of silicon for something doable in one instruction anyway (I'd imagine that additionally instruction fusion is a pretty hot path, especially with jumps involved; and you'll get into situations of conflicting fusions as currently cmp+jcc is fused, so there's the question of whether cmp+jcc+mov becomes (cmp+jcc)+mov or cmp+(jcc+mov), or if you have a massive three-instruction four-input(?) fusion).
Interestingly, the RISC-V guys seem to think that adding an explicit instruction is a waste of silicon while adding logic to detect the idiom to the instruction decoder is the way to go. x86 cores spend enormous amounts of silicon on situational tricks to make code run faster. I doubt spending silicon on one more trick would be terrible, especially since the a number of other tricks to extract more performance from things likely apply to even more obscure situations. As for what happens in the x86 core, the instruction decoder would presumably emit what it emits for the explicit version when it sees the implicit version. I have no idea what that is inside a x86 core. I suspect that there are some corner cases involving the mov instruction causing a fault to handle (as you would want the cpu to report that the mov triggered the fault, not the jmp), but it seems doable given that they already had to handle instruction faults in other cases of fusion.
Also, if either of us were sufficiently motivated, we might be able to get GCC to generate better code through a plugin that will detect the implicit cmov idiom and replace it with an explicit cmov:
Note that I have not confirmed whether their plugins are able to hook the compiler backend where they would need to hook to do this.
Of course, such plugins won’t do anything for all of the existing binaries that have the implicit idiom or any new binaries built without the plugins, but they could at least raise awareness of the issue. It is not a full solution since compilers don’t emit the implicit cmov idiom in all cases where a cmov would be beneficial, but it would at least address the cases where they do.
> since my read of your previous comment seemed to suggest you knew a way I did not, I had my fingers crossed.
Whoops, typo! edited.
> Interestingly, the RISC-V guys seem to think that adding an explicit instruction is a waste of silicon while adding this to the instruction decoder is the way to go
From what I've read, the thing they're against (or at least is a major blocker) is having a standard GPR instruction that takes 3 operands, as all current GPR instrs take a max of two. I cannot imagine there being any way that fusing instructions is less silicon than a new instruction whatsoever; if anything else, it'd be not wanting to waste opcode space, or being fine with the branchy version (which I'm not).
Zen 4, at least as per Agner's microarchitecture optimization guide, only fuses nops and cmp/test/basic_arith+jcc; not that many tricks, only quite necessary ones (nops being present in code alignment, and branches, well, being basically mandatory every couple instructions).
No need for a plugin; it is possible to achieve branchess moves on both as-is: https://www.godbolt.org/z/eojqMseqs. A plugin wouldn't be any more stable than that mess. (also, huh, __builtin_expect_with_probability actually helped there!)
I'd imagine a major problem for the basic impls is that the compiler may early on lose the info that the load can be ran in both cases, at which point doing it unconditionally would be an incorrect transformation.
I had suggested the virtual instructions to the RISC-V developers to eliminate branchy cmov assembly, as I am not happy with it either. It is surprising to realize that x86 cores are not making more use of macro-ops fusion, contrary to my expectation, but I guess it makes sense now that I think about it. Their designers have plenty of other knobs for tuning performance and the better their branch predictor becomes, the less this actually matters outside of the cases where developers go out of their way to use cmov.
A plugin would handle cases where the implicit idiom is emitted without needing the developer to explicitly try to force this. As far as I know, most people don’t ever touch conditional moves on the CPU and the few that do (myself included) only bother with it for extremely hot code paths, which leaves some low-hanging fruit on the table, particularly when the compiler is emitting the implicit version by coincidence. The safety of the transformation as a last pass in the compiler backend is not an issue since the output would be no more buggy than it previously was (as both branches are already calculated). Trying to handle all cases (the fruit that isn’t low-hanging) is where you have to worry about incorrect transformations.
Ah, your gcc example does have a branch that could be done branchlessly; I was thinking about my original example with a load, which can't be transformed back.
It's funny because I've rarely seen this (wrong) approach done anywhere else, but I picked it up by myself (like a lot of people did, I presume) and am still the first to do it every time I see the occasion; not so much for optimization (while I admit I thought it wouldn't hurt) but for the flow and natural look of it. It feels somehow more right to me to compose effects by signal interpolations rather than clear ternary branch instructions.
Now I'll have to change my ways in fear of being rejected socially for this newly approved bad practice.
At least in WebGPU's WGSL we have the `select` built-in that does that ternary operation hidden as a function call, so there is that.
I've been doing this from day 1 because I just assumed you are not supposed to have loops or if/else blocks in your shader code. Now I know better, thanks iq ur g.
Using `mix()` isn't necessarily bad. Using boolean operations like `lessThan()` is probably better than `step()`. I just tested two ways of converting linear RGB to sRGB. On AMD, they compile to the same assembly.
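For reference, the two variants looked roughly like this (my reconstruction, not the exact shaders tested; the boolean-vector overload of mix needs a GLSL version that has it):

vec3 linearToSrgbStep(vec3 c) {
    vec3 lo = c * 12.92;
    vec3 hi = 1.055 * pow(c, vec3(1.0 / 2.4)) - 0.055;
    return mix(lo, hi, step(vec3(0.0031308), c));              // 1.0 where c >= cutoff
}

vec3 linearToSrgbBool(vec3 c) {
    vec3 lo = c * 12.92;
    vec3 hi = 1.055 * pow(c, vec3(1.0 / 2.4)) - 0.055;
    return mix(lo, hi, greaterThanEqual(c, vec3(0.0031308)));  // per-component select
}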
I believe the conclusion is correct in 2025, but the article in a way just perpetuates the 'misinformation', making it seem like finding if your code will compile to a dynamic branch or not is easier than it is.
The unfortunate truth with shaders is that they are compiled by the user's machine at the point of use. So compiling it on just your machine isn't nearly good enough. NVIDIA pricing means large numbers of customers are running 10 year old hardware. Depending on the target market you might even want the code to run on 10 year old integrated graphics.
Does 10 year old integrated graphics, across the range of drivers people actually have running, prefer conditional moves over more arithmetic ops? Probably, but I would want to keep both versions around and test on real user hardware if this shader was used a lot.
So, if you ever see somebody proposing this
float a = mix( b, c, step( y, x ) );
The author seems unaware of
float a = mix( b, c, y > x );
which encodes the desired behavior and also works for vectors:
The variants of mix where a is genBType select which vector each returned component comes from. For a component of a that is false, the corresponding component of x is returned. For a component of a that is true, the corresponding component of y is returned.
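So the vector equivalent of the scalar one-liner above would be something like this (my sketch, assuming b, c, x, y are vec3 and the bool-selector overload of mix is available):

vec3 a = mix( b, c, greaterThan( y, x ) );   // per component: c where y > x, b elsewhere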
The author doesn't seem to say that mix should be avoided. Just that you shouldn't replace a ternary with step+mix. In your quote, you left out the 2nd half of the sentence: "as an optimization to [ternary]".
please correct them for me. The misinformation has been around for 20 years
But his education will fail as soon as you're operating on more than scalars. It might in fact do more harm than good since it leads the uneducated to believe that mix is not the right tool to choose between two values.
If you only pass a boolean 0 or 1 for the “a” mix parameter, when is using mix better than a ternary? Can you give an example? I’m not sure mix is ever the right tool to choose between two values. It’s a great tool for blending two values, for linear interpolation when “a” is between 0 and 1. But if “a” is only 0 or 1, I don’t think mix will help you, and it could potentially hurt if the two values you mix are expensive function calls.
Does ternary work for that? Inigo is saying "if ternary is working, leave it as ternary". I think you're talking about a situation where ternary wouldn't work, so Inigo isn't saying anything about that situation.
So what? That’s a language convenience, not a performance feature. Calling mix with a vec4 is no faster than using 4 scalar mix calls, and possibly much slower than using 4 ternaries. If your x & y params are expensive functions, and you use a vector of booleans, then you might be really shooting yourself in the foot.
I love seeing the codegen output, makes it easy to understand the issue, but claiming that it's faster or slower without actual benchmarks is a bit disappointing.
why waste brain cells on theory when you should simply bench both versions and validate without buying into any kind of micro-optimization advice at face value.
I don’t know enough about these implementations to know if this can be interpreted as a blanket ‘conditionals are fine’ or, rather, ‘ternary operations which select between two themselves non-branching expressions are fine’.
Like does this apply if one of the two branches of a conditional is computationally much more expensive? My (very shallow) understanding was that having, eg, a return statement on one branch and a bunch of work on the other would hamstring the GPU’s ability to optimize execution.
A GPU/SIMT branch works by running both sides, unless all threads in the thread group (warp/wavefront) make the same branch decision. As long as both paths have at least one thread, the GPU will run both paths sequentially and simply set the active mask of threads for each side of the branch. In other words, the threads that don’t take a given branch sit idle while the active threads do their work. (Note “sit idle” might involve doing all the work and throwing away the result.)
If you have two branches, and one is trivial while the other is expensive, and if the compiler doesn’t optimize away the branch already, it may be better for performance to write the code to take both branches unconditionally, and use a conditional assignment at the end.
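A sketch of that pattern (placeholder function bodies; whether it actually wins depends on the compiler and the cost of each side):

float cheapPath(vec3 p)     { return p.x; }
float expensivePath(vec3 p) { return length(p) * exp(-p.y); }

float shadePoint(vec3 p, bool useCheap) {
    float a = cheapPath(p);          // both sides computed unconditionally...
    float b = expensivePath(p);
    return useCheap ? a : b;         // ...then a select, which is not a jump
}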
It’s worth knowing that often there are clever techniques to completely avoid branching. Sometimes these techniques are simple, and sometimes they’re invasive and difficult to implement. It’s easy (for me, anyway) to get stuck thinking in a single-threaded CPU way and not see how to avoid branching until you’ve bumped into and seen some of the ways smart people solve these problems.
A real branch is useful if you can realistically skip a bunch of work, but this requires all the lanes to agree, on a GPU that means 32 to 64 lanes need to all agree, also for something basic like a few arithmetic ops there is no point.
My understanding was that they don't. All executions inside a "branch" always get executed, they're simply predicated to do nothing if the condition to enter is not true.
That's only if execution is incoherent. If all threads in a warp follow the branch the same way, then all of the instructions in the not taken branch are skipped.
I’d expect most vendors do, at least in their closed source drivers.
You could also check in the mesa project if this is implemented but it’s definitely possible to do
Shader compilers mostly use LLVM even though runtime is a constraint, if the pattern is common enough it’s definitely easy to match (it’s just two intrinsics after all) meaning you can do it for cheap in instcombine which you’re going to be running anyway
For some reason, I feel like this is harder to implement than you expect. The way to find out would be to get a bunch of examples of people doing this "optimization" in shader code, look at the IR generated compared to the optimal version, and figure out a set of rules to detect the bad versions and transform them into good versions. Keep in mind that in the example, the addition operators could be replaced with logical OR operators, so there are definitely multiple variations that need to be detected and corrected.
It is weird how long misinformation like this sticks around, the conditional move/select approach has been superior for decades on both CPU & GPU, but somehow some people still write the other approach as an "optimization".
"Conditional move is superior on CPU" is an oversimplification, in reality it was explained in 2007 by Linus[1] and nothing changed since then. Or in fact, branch predictors are constantly improving[2], while cmov data dependency problem can't be solved. Yes, cmov is better in unpredictable branches, but the universal truth is "programmers are notoriously bad at predicting how their programs actually perform"[3]
I really only care about performance in a SIMD sense, as not using SIMD means you aren't targeting performance to begin with (GPU == SIMD, CPU SSE/AVX == SIMD)
Branches fall off vs cmov/select as your lane count increases, they are still useful but only when you are certain the probability all lanes agreeing is reasonably high.
> For the record, of course real branches do happen in GPU code
Well, for some definition of "real". There are hardware features (on some architectures) that implement semantics that evaluate the same way that "branched" scalar code would. There is no branching at the instruction level, and can't be on SIMD (because the other parallel shaders being evaluated by the same instructions might not have taken the same branch!)
There is real branching on the hardware level, but yes it needs to take the same branch for the whole workgroup and anything else needs to be "faked" in some form.
Yeah, exactly: I argue that a global backwards-only branch used to implement loops by checking all lanes for an "end" state is not actually a "real" branch.
It's a semantic argument, but IMHO an important one. Way, way too many users of GPUs don't understand how the code generation actually works, leading to articles like this one. Ambiguous use of terms like "branch" are the problem.
I'm not going to second guess IQ, who is one of the greatest modern authorities on shaders, but I do have some counterarguments.
- Due to how SIMD works, it's quite likely both paths of the conditional statement get executed, so it's a wash
- Most importantly, if statements look nasty on the screen. Having an if statement means a discontinuity in visuals, which means jagged and ugly pixels in the output. Of course having a step function doesn't change this, but that means the code is already in the correct form to replace it with smoothstep, which means you can interpolate between the two variations, which does look good.
Why does this keep getting downvoted? This is fundamentally true, and good advice borne of experience. Would at least somebody care to weigh in as to why they disagree?
"The second wrong thing with the supposedly optimizer [sic] version is that it actually runs much slower than the original version [...] wasting two multiplications and one or two additions. [...] But don't take my word for it, let's look at the generated machine code for the relevant part of the shader"
—then proceeds to show only one codegen: the one containing no multiplications or additions. That proves the good version is fine; it doesn't yet prove the bad version is worse.
reply