
SIMD / GPU Friendly Branchless Binary Search - Atrix256
https://blog.demofox.org/2017/06/20/simd-gpu-friendly-branchless-binary-search/
======
0xfaded
I was stumped as to how this was possible, until I saw that it relied on
memory access relative to a computed offset, for which as far as I know there
are no SIMD instructions. I guess it could be argued that a texture in a
shader would provide similar functionality.

Also of interest are sorting networks
[https://en.wikipedia.org/wiki/Sorting_network](https://en.wikipedia.org/wiki/Sorting_network)
which can be used to perform branchless sort on parallel hardware.

~~~
nhaehnle
Since the title mentioned GPU-friendliness: GPUs are built all around
scatter/gather memory accesses. The actual performance of memory accesses of
course still depends on locality. The first load in a parallel binary search
will be fast, since all threads will load the same element; later loads can
get progressively worse.

That said, the code in the article doesn't actually improve anything
specifically for GPUs, because standard binary searches are already
"branchless" for practical purposes. Really, the only difference between the
article and a standard binary search is that the code in the article has
manually unrolled the search loop. That, plus some (not all) binary search
implementations can do an early-out if the comparison happens to hit equality.

None of these changes make a SIMD- or GPU-specific difference, because loop
unrolling doesn't make a difference in terms of control-flow divergence.

------
flgr
I think in order to really reap the benefits of this you'll want to actually
reorganize the in-memory layout of the trees in order to make sure all
elements needed for the comparison end up in the same cache line.

I haven't been following this closely, but the last time I checked scatter-
gather loads were really really slow.

Chapter 3.3, from page 27 (PDF page 43) onward, of this might be interesting:
[https://www.researchgate.net/profile/Florian_Gross/publicati...](https://www.researchgate.net/profile/Florian_Gross/publication/275971053_Index_Search_Algorithms_for_Databases_and_Modern_CPUs/links/554cffca0cf29f836c9cd539.pdf)

Also contains a survey of some other related data structures and algorithms.

~~~
0xfaded
Thank you, scatter-gather was the google term I was looking for, though I
couldn't immediately find any architectures that implement a vector scatter-
gather. Are you aware of any?

~~~
pandaman
Not sure what a scalar scatter/gather would be, but NV and AMD GPUs do the
vector kind (you can read/write n values from n offsets with a single
instruction).

------
lukego
Is it reasonable to estimate that the performance of this code will be bounded
by log2(n) memory round-trips per lookup?

If so then could you double performance to log2(n)/2 lookups by issuing
parallel memory requests for location n and also both candidates for location
n+1?

~~~
jsnell
Have a look at
[https://arxiv.org/abs/1509.05053](https://arxiv.org/abs/1509.05053) from
pkhuong. That paper discusses this issue and several variants: e.g.
branchless binary search, branchy binary search plus speculative execution as
a form of prefetching, and branchless binary search with explicit prefetching.

~~~
lukego
Seriously awesome paper!

The answer seems to be "yes" judging by their dramatic conclusion:

> Our conclusion is contrary to our expectations and to previous findings and
> goes against conventional wisdom, which states that accesses to RAM are
> slow, and should be minimized. A more accurate statement, that accounts for
> our findings, is that accesses to RAM have high latency and this latency
> needs to be mitigated.

In other words: to optimize memory access you should think like a network
engineer - treat RAM as a "long fat pipe" and do the usual optimizations like
pipelining that you would in TCP or HTTP or ....

Related: "Computer architecture for network engineers."
[https://github.com/lukego/blog/issues/18](https://github.com/lukego/blog/issues/18).

------
roel_v
Does anyone know of any comprehensive resources that cover various techniques
to do things branchless or otherwise optimize for gpu? I learned about masking
the other day, but it seems this sort of stuff is only covered in dispersed
blog posts and institutional knowledge.

~~~
ctchocula
Professional CUDA C Programming by Cheng, Grossman and McKercher would be my
recommendation. They show you how to do many optimizations, motivating them
using real-world examples like: matrix addition, reduction of n elements, and
matrix transpose among others. They give the code, then show you what the code
looks like after optimization 1, 2, 3, etc. along with what kind of speed-up
they were able to obtain over the naive implementation using each
optimization.

~~~
roel_v
Oh, that's great - I actually have it lying here next to me, but I haven't
opened it yet because I went with OpenCL and haven't even tried all the
options in OpenCL yet. But then I guess I need to reprioritize. Thanks.

------
maxk42
This is not branchless.

It's if-less. But the ternary operator (?:) is a branching instruction.

~~~
colejohnson66
Not necessarily. A compiler can optimize ?x:0 into a multiply. For example,

    f(x) ? y : 0

can also be written as

    (int)(bool)f(x) * y

~~~
Atrix256
Here's a godbolt to play with as well, showing how it does actually compile
down to being branchless in clang (you can switch it to use other compilers
too) [https://goo.gl/bywp8b](https://goo.gl/bywp8b)

~~~
pjscott
In case anybody was wondering, the x86 assembly is doing the multiply trick.
The rather confusing "lea" instruction -- Load Effective Address -- is used as
a space-efficient way of saying "take one register, multiply by 1, 2, 4, or 8,
add another register, and put it in a third register". The thing in the
register that gets multiplied by that constant is either a 1 or a 0.

If the constants had been less convenient numbers, like 42, the compiler could
have instead used a conditional move instruction -- more general-purpose, but
I believe slightly larger and often ever-so-slightly slower.

------
monocasa
Huh, neat. This is really similar to the underlying mechanism used in the
compiler that only used x86 mov instructions.

[https://github.com/xoreaxeaxeax/movfuscator](https://github.com/xoreaxeaxeax/movfuscator)

~~~
partycoder
Not really:

1) Doesn't obfuscate

2) Comparison operators compile very similarly to a conditional statement
(e.g. if, ternary operator, etc.)

[https://godbolt.org/g/xXecW2](https://godbolt.org/g/xXecW2)

I rewrote the OP code using ifs, and it compiles to exactly the same
instructions. You can then add -O3 -msse -msse2 as compiler flags; the code
will be optimized in about the same way.

~~~
monocasa
Yes really.

movfuscator doesn't really obfuscate all that much, other than just being
weird.

Its whole shtick is to compute offsets into its memory space differently
based on the input data, but not to change the instruction stream based on
the data. That way, computations that shouldn't contribute to the final
result are simply given a scratch offset that isn't read back, but the
instruction stream never deviates.

This SIMD code is also ultimately computing different offsets into memory
space (contributing 0 to that lane's offset on each iteration if that SIMD
lane has already found the node it's looking for), but keeping the instruction
stream consistent across the 'separate' lanes.

------
falcolas
If I'm doing my mental compilation correctly, these can't actually be
parallelized, due to 'ret' being referenced and set by each line. I'm curious
if this is actually the case, and if an actual parallel implementation is
possible (and how fast).

~~~
alexlarsson
Yeah, the data dependencies mean this will never be fast.

Additionally, "(haystack[4] <= needle) ? 4 : 0" is basically a branch. I mean,
a smart compiler _could_ make it a conditional move, but that is in no way
guaranteed.

~~~
MereInterest
I think the branching could be avoided by using a cast from boolean to
integer, "(haystack[4] <= needle)*4", instead of a ternary operator. It
doesn't help at all with the data lookups, though, which is the bigger
problem.

~~~
pacman128
It depends on the optimization level. Using no optimization, gcc 5.4 will use
branches for this case, but using -O2, it avoids the branches (on x86_64).
Verified by using gcc -S to show the assembly generated.

------
gwern
Reminds me a little of
[https://en.wikipedia.org/wiki/Sorting_network](https://en.wikipedia.org/wiki/Sorting_network)

------
FrozenVoid
Try GCC vector intrinsics which allows comparing scalar to vector.

------
partycoder
Technically a comparison operator would be a form of branching, because you
have comparisons and jumps based off flags, same as what happens when you
compile an if statement.

It becomes more evident when you see the generated code.
[https://godbolt.org/g/Mu9KBY](https://godbolt.org/g/Mu9KBY)

e.g:

    cmp     rax, QWORD PTR [rbp-24] ; compare a and b
    ja      .L2                     ; is a above b? if so, jump to .L2, otherwise proceed

In fact this statement:

    size_t ret = (haystack[4] <= needle) * 4;

...is 100% equivalent to this once compiled:

    register size_t ret;
    if (haystack[4] <= needle) {
        ret = 4;
    } else {
        ret = 0;
    }

So the code is not really branchless, just if-less.

~~~
partycoder
To the trolls that downvoted my post without explaining why, here is the
proof: [https://godbolt.org/g/xXecW2](https://godbolt.org/g/xXecW2)

I rewrote the function using if statements making it compile to exactly the
same instructions.

If you optimize the code with compiler flags (e.g: -msse, -msse2 for SSE/SSE2,
-O3 for optimization) the generated instructions are exactly the same.

My point being: the code in the blog post is not branchless. The absence of if
statements doesn't mean it's branchless.

If you care to disagree by downvoting, I dare you to reply explaining why
you think this is wrong.

~~~
grenoire
You didn't read the post properly. There is a subsection for a version without
ternary operators.

~~~
partycoder
The version I compared against was the one using numeric comparison ( <= ) not
ternary operators. It's in the link I shared.

So I hope you revert your downvote or provide proof that what I am saying is
incorrect.

~~~
Atrix256
You are completely right, I'll mention this in the post (In that case do a
branchless comparison based on subtraction or similar?). Fwiw when doing simd,
you'd use intrinsics which wouldn't have this issue, but it is a valid point
and completely true. Thanks!

~~~
partycoder
"You are right!" + no upvote = bad

~~~
Atrix256
I'm not the one who downvoted you, but I am the one who wrote the article :P

