
Binary Search Eliminates Branch Mispredictions (2012) - luu
http://www.pvk.ca/Blog/2012/07/03/binary-search-star-eliminates-star-branch-mispredictions/
======
userbinator
It would be so _very_ awesome if Intel/AMD decided to make REPNZ/REPZ
SCAS(B/W/D/Q) search, in parallel, an entire cache line at once (and in the
case of multiple matches, a priority encoder would select the first one) -
they're already doing this with associative memory for the cache tags. It would
be much like the vectorised SSE version, but smaller (you can't get better than
a single instruction), and it could even be put inside a binary search over
cache-line-sized blocks for more speed - basically a vectorised binary search.
It would certainly be interesting to compare with the other approaches here.
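
In software, the closest analogue today is probably something along these lines
(just a sketch, assuming 16 32-bit keys per 64-byte line): compare the probe
against every element of the line at once and let a bit scan play the part of
the priority encoder.

    #include <emmintrin.h>   /* SSE2 */
    #include <stdint.h>

    /* Sketch: find the first element of a 16-byte-aligned, 16-entry line
       equal to key, or return -1 if there is no match. */
    static int find_in_line(const uint32_t *line, uint32_t key)
    {
        __m128i k = _mm_set1_epi32((int32_t)key);
        int mask = 0;
        for (int i = 0; i < 4; i++) {                  /* 4 vectors of 4 lanes */
            __m128i v  = _mm_load_si128((const __m128i *)(line + 4 * i));
            __m128i eq = _mm_cmpeq_epi32(v, k);
            mask |= _mm_movemask_ps(_mm_castsi128_ps(eq)) << (4 * i);
        }
        return mask ? __builtin_ctz(mask) : -1;        /* first match wins */
    }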

------
wfunction
In my experiments, conditional moves have actually been slower than predicted
branches.

~~~
nkurz
That's not a terrible rule of thumb, but one can often do better. If you (the
programmer) know that consistently correct prediction is impossible (as it
often is for binary search), then it's often beneficial to convince the
compiler to generate conditional moves rather than branches. That's what the
article's test results show.

Conditional moves (CMP/CMOVcc) will always be slower than correctly predicted
branches (CMP/Jcc). If the prediction accuracy is 50%, the conditional move
will always be faster. The crossover threshold differs between processors, but
about 85% prediction accuracy is a reasonable approximation. This necessary
threshold is moving higher with modern processors (CMOV is getting relatively
faster).
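
In practice, "convincing the compiler" usually just means writing the update as
a select (a data dependence) instead of a guarded assignment. A minimal sketch
of that, assuming a sorted array and gcc/clang doing the expected if-conversion:

    #include <stddef.h>

    /* Returns the index of the last element <= key (assuming v[0] <= key).
       The ternary typically compiles to CMP/CMOVcc rather than CMP/Jcc. */
    static size_t last_le(const unsigned *v, size_t n, unsigned key)
    {
        size_t low = 0;
        while (n > 1) {
            size_t half = n / 2;
            low = (v[low + half] <= key) ? low + half : low;  /* no branch */
            n -= half;
        }
        return low;
    }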

------
corysama
Someone tell me if there's an error in my takeaway:

For short sequences, linear searches can be faster than binary searches
because a short sequence of mostly predictable branches (no,no,no,no,yes) is
usually faster than an even shorter sequence of unpredictable branches
(yes,no,yes). To work around this, the author restricts his searches to
compile-time constant sequence lengths so that GCC can unroll the fixed-size
binary searches into short, fixed, linear series of branchless conditional
move instructions. That setup executes reliably faster than optimized linear
searches for in-cache sequences of size 8, 16, 32 and 64.
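
If I've read that right, the unrolled result has roughly this shape (my guess
for a size-8 table, not the article's actual code): a fixed, straight-line run
of compare-and-select steps with no branches at all.

    /* Sketch: size fixed at 8, loop fully unrolled by hand.  Returns the
       index of the last element <= key, assuming v[0] <= key. */
    static unsigned search8(const unsigned *v, unsigned key)
    {
        unsigned low = 0;
        low = (v[low + 4] <= key) ? low + 4 : low;
        low = (v[low + 2] <= key) ? low + 2 : low;
        low = (v[low + 1] <= key) ? low + 1 : low;
        return low;
    }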

However, if the data is uncached, then memory latency dominates ALU execution
time, and the memory prefetcher's preference for forward, linear scans leads to
linear searches being faster than branchless binary searches because of the
difference in memory access patterns. Interestingly, the prefetcher on the
author's machine appears to prefer forward scans over backward scans even at a
sub-cache-line level.

~~~
nkurz
Yes, I think your first paragraph is the correct takeaway, but I'm not sure
about the second.

Using Haswell as the example (because it's current and I'm more familiar with
it right now), the linear search comparisons are really fast: the processor
can handle 2 correctly-predicted branches-not-taken per cycle. But the single
branch misprediction at the end costs about 15 cycles, plus the latency of
reloading from L1 any registers where the load had already been speculatively
executed (about another 5 cycles).

The conditional moves are slower than simple comparisons, but still very fast:
the CMP/CMOVcc pair can have a throughput of 1 per cycle if unrolled, so half
as fast as a CMP/Jcc that is correctly predicted as not taken. But there is
no penalty at the end of the loop, so the overall speed is better for small
sizes out to a 64-byte cacheline or so. I haven't tested, but I think the
branchless approach has extended its lead in recent years. He's using
Westmere, which is pre-Sandy Bridge, which is pre-Ivy Bridge, which is pre-
Haswell.
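
Rough back-of-envelope with those numbers, for a 32-entry in-cache table and a
uniformly distributed hit position (my arithmetic, not measured):

    linear:     ~16 not-taken compares at 2 per cycle  ~  8 cycles,
                plus ~15 cycle mispredict + ~5 cycle reload  ~ 28 cycles
    branchless: log2(32) = 5 CMP/CMOVcc steps, ~1 per cycle in throughput
                terms (the dependent loads set the latency of a single search)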

But the question of how this changes for data that needs to be brought into a
cacheline from memory is probably more complicated than just the memory
latency. I think the issue is more that the memory latency swamps the reading,
and it isn't (or shouldn't be) that linear searches can take advantage of
prefetching but binary can not. It just might take some 'trickery'. For the
small sizes he's dealing with, you should be able to just load it all without
waiting to see what you need. And I don't think that there is currently any
advantage of forward versus reverse scans regarding prefetch, although one or
the other might allow the loop to be structured differently.
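
One hedged illustration of the "just load it all" idea (my sketch, nothing from
the article): touch every cache line of the block up front so the misses
overlap, then run the branchless search once the data is on its way.

    #include <stddef.h>

    /* Sketch: one prefetch per 64-byte line of a small sorted block, so the
       memory latency is paid in parallel rather than one dependent load at
       a time.  Whether it helps depends on how early the key is known. */
    static void prefetch_block(const unsigned *v, size_t n)
    {
        size_t step = 64 / sizeof *v;               /* elements per line */
        for (size_t i = 0; i < n; i += step)
            __builtin_prefetch(&v[i]);
    }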

------
jhallenworld
Cool! I wonder if you can improve things even more by running several binary
searches at once - interleaving them and then combining the results. That way
there is work for the CPU to do during the data load latencies (or the write-
back if it's not fully bypassed).

~~~
jhallenworld
Well I tried it on an Ivy Bridge Xeon E5-2680 v2, gcc version 4.4.3 with -O3,
table size = 32. Single table at a time ends up being faster:

    155 M searches / sec for one table at a time.
    108 M searches / sec for two tables at a time.
     88 M searches / sec for four tables at a time.

The interleaved version of the code looks like this (for two at a time):

    #define LOGSIZE 5

    /* Search a sorted table of (1 << LOGSIZE) entries by running a
       branchless binary search over each half in lock-step, then taking
       the upper-half result if it found an exact match. */
    unsigned dsearch (unsigned key, unsigned * vector)
    {
        unsigned size = (1 << LOGSIZE) / 2;
        unsigned * low1 = vector;                        /* lower half */
        unsigned * low2 = vector + ((1 << LOGSIZE) / 2); /* upper half */
        unsigned i;

        for (i = LOGSIZE - 1; i != 0; i--) {
            size /= 2;
            unsigned mid1 = low1[size];
            unsigned mid2 = low2[size];
            if (mid1 <= key)        /* gcc turns these into CMOVs */
                low1 += size;
            if (mid2 <= key)
                low2 += size;
        }

        if (*low2 == key)
            low1 = low2;

        return low1 - vector;
    }

The compiled code ends up branch free (gcc does a good job), but is 30
instructions instead of 18.

Oh well, was worth a shot..

~~~
tjpoutanen
I think the reason it's slower for you is that the CPU pipeline is starved for
data... even when it is in the same cache line.

Since we are geeking out on binary search on a Saturday night, here's an
approach I developed for traversing multiple binary trees in parallel with
SIMD instructions, no branches, and data prefetching. It's used at Bing to
calculate the document scores that rank Web search results, traversing
thousands of machine-learning-derived decision trees. It resulted in a 4x
speedup over the naive approach.

http://web.cse.ohio-state.edu/~ren/papers/CGO13_BinRen.pdf

~~~
vtuulos
This is very cool. Thanks for sharing!

