

Branchless Conditionals (2011) - djulius
http://www.blueraja.com/blog/285/branchless-conditionals-compiler-optimization-technique

======
zwieback
It's interesting that ARM32 has conditional execution, and I like it a lot
for writing readable assembly code. The short jumps that result from a simple
if can instead be encoded as three successive instructions, no branches.

However, it's now falling out of favor (mostly gone from ARM64), apparently
because of the relative cost of putting conditional execution on the die vs.
relying on smarter compilers.

~~~
strictfp
Nice! How do you do this on ARM exactly?

~~~
nezza-_-
The first 4 bits (bits 31-28) of each 32-bit instruction can be used to check
for conditions:
[http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0068b/Chdehgih.html)
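
To make the "first 4 bits" concrete, here is a small C sketch that pulls the condition field out of a classic 32-bit ARM instruction word. The helper names `arm_cond_field` and `arm_cond_name` are made up for illustration; only the bit position and the mnemonic table come from the ARM encoding.

```c
#include <stdint.h>

/* Condition mnemonics indexed by the 4-bit field in bits 31..28 of a
   classic 32-bit ARM instruction word. Index 15 was "never" (NV) on old
   cores and is used for unconditional encodings on later ones. */
static const char *const arm_cond_names[16] = {
    "EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
    "HI", "LS", "GE", "LT", "GT", "LE", "AL", "NV"
};

/* Hypothetical helper: extract the condition field (top 4 bits). */
static unsigned arm_cond_field(uint32_t insn)
{
    return (unsigned)(insn >> 28);
}

/* Hypothetical helper: map an instruction word to its condition mnemonic. */
static const char *arm_cond_name(uint32_t insn)
{
    return arm_cond_names[arm_cond_field(insn)];
}
```

For example, `0xE3A00001` (`mov r0, #1`) carries the AL ("always") condition, while `0x03A00001` is the same move with the EQ condition, i.e. `moveq r0, #1`.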

------
chipsy
One simple branchless optimization I've used is for collision detection
across an array of values: instead of testing each one, I add its value to a
counter (perhaps with some mapping of data to collision value). After iterating
over a lot of them, I can do just one test. This is very CPU-friendly, as the
pipeline gets to crunch all the numbers in one go.

~~~
1_player
Sounds interesting, can you provide a code example?

~~~
chipsy
It turns out the example I was going to provide [0] had been reduced to a
memory comparison.

I wrote a little C program for you instead: [1]

[0] [https://github.com/triplefox/three-packer/blob/master/packer...](https://github.com/triplefox/three-packer/blob/master/packer/packer/pack.nim#L375)

[1] [https://gist.github.com/triplefox/47d620fc556e3f7da9bb](https://gist.github.com/triplefox/47d620fc556e3f7da9bb)
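
For readers who don't want to follow the links, the accumulate-then-test-once idea can be sketched roughly as below. This is not the code from the gist; the function `hits_any` and the range-overlap check are assumptions picked only to illustrate the pattern.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: returns 1 if x lies inside any of the half-open
   ranges [lo[i], hi[i]), else 0. */
static int hits_any(const int32_t *lo, const int32_t *hi, size_t n, int32_t x)
{
    unsigned hits = 0;
    for (size_t i = 0; i < n; i++) {
        /* Each comparison yields 0 or 1. Using & rather than && avoids the
           short-circuit, so nothing here depends on a per-element test. */
        hits += (unsigned)((x >= lo[i]) & (x < hi[i]));
    }
    return hits != 0; /* the single test, after the whole array is scanned */
}
```

The trade-off is that the loop never exits early, so this only pays off when hits are rare or unpredictable. A compiler is also free to reintroduce branches, so it is worth checking the generated assembly.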

------
vardump
One example seems a bit odd.

    
    
      if(LocalVariable & 0x00001000)
          return 1;
      else
          return 0;
    
      mov eax, [ebp - 10]
      and eax, 0x00001000
      neg eax
      sbb eax, eax
      neg eax
      ret
    

Hmm... wouldn't this be faster? Two fewer instructions:

    
    
      mov eax, [ebp - 10]
      and eax, 0x00001000
      shr eax, 12
      ret
    

Well, who knows. Didn't bother to analyze this case. Maybe the article's
example is faster somehow?

~~~
astrange
The neg/sbb/neg operation is 'x = x != 0' or 'x = !!x', I think.

You're right that yours should work too, because after the AND the value can
have at most one bit set. But it only works for this particular AND mask (a
single bit, with the shift count matching the bit position).
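
A quick C sketch of the difference between the two forms. The function names are made up for illustration; `test_bool` corresponds to the `x != 0` reading of neg/sbb/neg, and `test_shift12` to the and/shr version.

```c
#include <stdint.h>

/* General form: 1 if any masked bit is set, else 0. */
static uint32_t test_bool(uint32_t x, uint32_t mask)
{
    return (x & mask) != 0;
}

/* Shift form: only yields 0 or 1 when mask is exactly bit 12 (0x1000). */
static uint32_t test_shift12(uint32_t x, uint32_t mask)
{
    return (x & mask) >> 12;
}
```

With the mask `0x1000` the two agree for every input. With a two-bit mask such as `0x3000` and `x = 0x2000`, `test_bool` still returns 1 but `test_shift12` returns 2.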

------
strictfp
Nice article. It inspired me to look around for some more straightforward way
of optimizing, and I found the setcc class of instructions:
[http://www.nynaeve.net/?p=178](http://www.nynaeve.net/?p=178)

I'm thinking that this combined with some CAS (CMPXCHG8B) could achieve the
same, right?

Something like (pseudo):

    Comparewith(4)
    Ifequalstore(54)
    Ifnotequalstore(2)
    Return
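
One way to read that pseudocode in plain C, without any compare-and-swap. The function names are hypothetical. The second version leans on the fact that a C comparison evaluates to 0 or 1, which is the same value setcc would produce; whether a compiler actually emits setcc or cmov for either form is not guaranteed.

```c
/* Straightforward form: compilers often turn this into setcc/cmov,
   but that is up to the compiler. */
static int pick(int x)
{
    return (x == 4) ? 54 : 2;
}

/* Arithmetic form: (x == 4) is 0 or 1, so the result is 2 or 2 + 52. */
static int pick_arith(int x)
{
    return 2 + 52 * (x == 4);
}
```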

~~~
1_player
Aren't setcc/cmov* instructions effectively similar to a branch? To compute
the result you need to execute the previous instruction.

I suppose that these instructions do not cause the instruction pipeline to be
flushed, compared to an incorrectly predicted jump, but they still stall until
the previous instruction has been executed.

jmp < setcc/cmov* < branchless conditionals

~~~
Scaevolus
Conditional moves have data dependencies on their input arguments, but so do
the "branchless" versions presented in the article.
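
To see the data dependency in a hand-written branchless select, here is a minimal C sketch. The name `select_u32` is made up for illustration. The result cannot be computed until `cond` is known, just as with a conditional move.

```c
#include <stdint.h>

/* Returns a if cond is nonzero, otherwise b, without a branch in the source. */
static uint32_t select_u32(int cond, uint32_t a, uint32_t b)
{
    /* mask is all ones when cond is nonzero, all zeros otherwise */
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}
```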

------
kstenerud
I know that gcc and clang both have __builtin_expect(). If you tell the
compiler the more likely path, wouldn't that make the branching version
faster?

Actually, I've always wondered how __builtin_expect translates to something
the CPU's branch prediction engine can use...

~~~
CHY872
I think in general, most CPU architectures pick branch not taken for forward
branches and branch taken for backwards branches on the first try. So I feel
like __builtin_expect gives weighting to let the compiler shuffle code around
to make it fit that pattern.

There are ways of doing it in hardware; I remember a supervisor discussing it
with respect to MIPS. I also remember them saying they went through the entire
code generation stage of GCC and found that every single point at which GCC
would try to use it was somewhere where it would be actively unhelpful.

~~~
astrange
__builtin_expect is good for error handling code, because gcc can avoid
size-increasing optimizations in that path, and it can even move all the
unlikely code out into another section so it won't take up space in your
caches.

But its code generation is better for a 99%/1% case than a 60%/40% case,
because Intel CPUs no longer listen to branch hints, and Intel doesn't really
give advice on how to tune for them.
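
A minimal sketch of the error-handling use, assuming gcc or clang. The `UNLIKELY` macro and the `sum_bytes` function are made up for illustration; only `__builtin_expect` itself is the real builtin.

```c
#include <stddef.h>

/* Wrap the builtin so the code still compiles elsewhere (assumption:
   other compilers simply get no hint). */
#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

/* Hypothetical example: sum a buffer, treating NULL arguments as a rare error. */
static int sum_bytes(const unsigned char *buf, size_t len, unsigned long *out)
{
    if (UNLIKELY(buf == NULL || out == NULL)) {
        return -1; /* cold path: the compiler may lay this out away from the hot code */
    }
    unsigned long total = 0;
    for (size_t i = 0; i < len; i++) {
        total += buf[i];
    }
    *out = total;
    return 0;
}
```

The hint affects code layout and optimization choices only. It does not change what the function returns.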

------
mlindner
x86 branch predictors are not 60% correct... Any decent branch predictor is
over 90% correct and I believe modern ones are over 96% correct.

~~~
marcosdumay
That's completely dependent on your algorithms.

Branch prediction for some lookup in a hash table will fail about 50% of the
time, however it's done. Branch prediction on a long for loop was already more
than 99% correct by the end of the '90s.

Intel claims their prediction algorithms are over 96% correct on an average
program, whatever that beast is. (To be fair, you'll find a definition for it
in their papers. That's a perfectly legit claim; it just does not mean what
you think it means.)
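
As a rough illustration of "dependent on your algorithms", the sketch below shows the same computation written with a data-dependent branch and without one. The function names are made up. How well the branchy version predicts depends entirely on the input (for instance, sorted versus random data), and a compiler may rewrite either form, so no timing is claimed here.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy: the if is taken or not depending on each element's value. */
static int64_t sum_ge_branchy(const int32_t *v, size_t n, int32_t t)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] >= t) {
            sum += v[i];
        }
    }
    return sum;
}

/* Branchless: the comparison becomes a mask of all ones or all zeros. */
static int64_t sum_ge_branchless(const int32_t *v, size_t n, int32_t t)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t mask = -(int64_t)(v[i] >= t); /* -1 if v[i] >= t, else 0 */
        sum += (int64_t)v[i] & mask;
    }
    return sum;
}
```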

