One of my favourite things in programming is getting rid of the conditional branches. They're great to keep the brain fresh, because sometimes you have to dig ten layers outward to rewrite your code just to get rid of that fucking condition that slows down the code.
Or a good one I recall, was my 2D-box-check. I've made the mistake believing that not writing the output to RAM, when the result yields "not in box" would be better. So naturally I checked for every output if the respective bit was 1 or 0 and a 0 restarted the loop early.
Nope.
It was faster to just always write to memory, but linking the result to a multiplication which increments the pointer based on the above mentioned bit. "Not in box" increased the pointer by 0 (n * 0 = 0).
Boom, one billion 2d-box-checks per second on an i5 7500HQ on a single core without using SIMD. Good enough.
I ... guess I should stop now.
(PS: this was used for a sprite engine for a super-high resolution text mode, which compiled sprites into the code that wrote them to the screen. That's a LOT faster than checking if a character/pixel needs not-to-be-written AND it allowed for potentially eight pixels/characters to be drawn at once, too, using pop!
Getting rid of conditional branches is also a major brain-teaser when you're writing safe constant-time cryptography code. In assembly code, a common trick is exploiting the side-effect of the carry/borrow flag. You write the code in such a way that the condition will set the carry flag. Perhaps it's an overflow during an addition, or perhaps it comes from a bitwise shift. The carry flag can then be expanded into an integer by subtracting a register from itself via "subtract with borrow". If the carry flag was not set, the register becomes 0x00000000, if it was set, the register underflows to 0xFFFFFFFF. Then one can use it as a "yes/no" bitmask in some bitwise expressions.
>Getting rid of conditional branches is also a major brain-teaser when you're writing safe constant-time cryptography code.
It is not possible to write constant time code on a modern machine and OS. The important thing is that the time taken should not depend on any of the inputs.
It's absolutely possible (though not necessarily easy) to write constant-time code on a modern machine and OS. In cryptography, "constant-time" does not mean the code literally takes the same amount of time, it's a technically term which a clear technical definition, "code that does not leak secret information through timing analysis." I didn't invent the terms, argue with cryptographers instead.
Furthermore, many cryptography applications run on simple embedded hardware. In these cases, it can literally be "constant-time" code, which is likely where the term originated from. But it's no longer the definition.
Yes. The main problems are dynamic frequency scaling, because it changes the rate instructions are processed, preemptive multitasking, which can potentially destroy various caches in the middle of your code, and hyperthreading, where the processor can switch between executing two different things.
The use of pipelining, multi-issue, out-of-order execution, branch prediction, and caching on modern processors also means that, timing is already somewhat unpredictable at the machine code level - unlike simple 8-bit or 16-bit microcontrollers (which are often used in for real-time control of hardware, there can be a 1:1 relation between an assembly instruction in firmware and a signal pulse output that you can measure microsecond by microsecond).
On a modern CPU, even if you're running bare-metal with no operating system, and you're writing plain assembly, it's still impossible to calculate the wall-clock time it takes solely by adding the CPU cycles up from each instruction. Different instructions can be executed simultaneously if there are no data dependency. A if/else branch in a loop can get faster over time when the CPU "learns" the pattern after a few iterations. Exactly how depends on the implementation detail at micro-architectural level - how many instructions can the CPU decode, how many ALUs and execution ports that can be scheduled for execution at the same time, the branch prediction algorithm and hardware, etc. This changes from CPU model to CPU model.
Repeatably accessing the same data in memory or data in the same cache line causes the code to run faster, sometimes by orders of magnitude, because the data is already in L1/L2/L3 cache. Conversely, randomly accessing memory causes pipeline stalls for hundreds of clock cycles. This is why array indexing, lookup table, and hash table cannot simply be understood as O(1) on practical hardware. This is also a major target of timing side-channel attacks against cryptography code - depending on how the encryption algorithm is designed, it can be sometimes extremely difficult or nearly impossible to defend from - AES is a notorious example because it uses table lookup (hardware-accelerated AES instruction set is both a performance boost and also a security fix to this problem). Thus, many newer crypto algorithms are intentionally designed to be inherently immune - designed in a way that there's no secret-dependent memory accesses (like table lookup) during the encryption process.
Code that uses min, max, clip, clamp, saturate, etc. when that is straightforwardly what operation is needed is clearer, more explicit, more concise and potentially faster. When used this way they are just straightforwardly better than equivalent if then statements.
Occasionally you see code that uses combinations of these in less obvious ways to do some more complex operation and I can see how one could argue that in rare cases that ends up less clear than if then statements but I've never really encountered that in many years of reading shader code.
There are other reasons to prefer functions over if then in shader code not discussed here as well, like replacing step with smoothstep for antialiasing.
Or a good one I recall, was my 2D-box-check. I've made the mistake believing that not writing the output to RAM, when the result yields "not in box" would be better. So naturally I checked for every output if the respective bit was 1 or 0 and a 0 restarted the loop early.
Nope.
It was faster to just always write to memory, but linking the result to a multiplication which increments the pointer based on the above mentioned bit. "Not in box" increased the pointer by 0 (n * 0 = 0).
Boom, one billion 2d-box-checks per second on an i5 7500HQ on a single core without using SIMD. Good enough.
I ... guess I should stop now.
(PS: this was used for a sprite engine for a super-high resolution text mode, which compiled sprites into the code that wrote them to the screen. That's a LOT faster than checking if a character/pixel needs not-to-be-written AND it allowed for potentially eight pixels/characters to be drawn at once, too, using pop!
... I really should stop now. ^_^)