
Trying to speed up binary search - jrbn
http://databasearchitects.blogspot.com/2015/09/trying-to-speed-up-binary-search.html
======
geophile
What about stopping when the ends of the range are "close enough" and
switching to a linear search? All the data should be in the cache, and it
should be possible to avoid branches.
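
A minimal sketch of that hybrid (the cutoff of 32 elements and all names are my own assumptions, not from the comment or the article):

    #include <stddef.h>

    /* Binary-search until the remaining range is small enough to sit in a
       cache line or two, then finish with a branch-free linear count. */
    size_t hybrid_lower_bound(const int *a, size_t n, int key)
    {
        size_t lo = 0, hi = n;                 /* search window [lo, hi) */
        while (hi - lo > 32) {                 /* assumed "close enough" cutoff */
            size_t mid = lo + (hi - lo) / 2;
            if (a[mid] < key)
                lo = mid + 1;
            else
                hi = mid;
        }
        /* Count how many remaining elements are < key; the comparison is
           used arithmetically, so the loop body has no data-dependent branch. */
        size_t count = 0;
        for (size_t i = lo; i < hi; i++)
            count += (size_t)(a[i] < key);
        return lo + count;                     /* index of first element >= key */
    }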

~~~
trhway
This is why in a real DB the indexes aren't pure binary trees; they're
variations of the B-tree instead. So a billion-row table will have an index
only 4 levels deep (i.e. 4 disk reads worst case).
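
A quick sanity check of the depth arithmetic (the fanout of 256 keys per node
is an assumed round number; real fanouts depend on page and key size):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long rows   = 1000000000ULL; /* one billion */
        unsigned long long fanout = 256;           /* assumed keys per node */
        unsigned long long reach  = 1;
        int levels = 0;
        while (reach < rows) {        /* each level multiplies the reach */
            reach *= fanout;
            levels++;
        }
        printf("levels = %d\n", levels);  /* 256^4 > 1e9, so this prints 4 */
        return 0;
    }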

------
acqq
"Apparently the conditional move is good to avoid branch mispredictions, but
cannot hide memory latency as much as the regular implementation."

Does anybody know what actually happens there? For a real analysis I'd like
to see the generated assembly in the classic case and in the conditional-move
case, and also an example of the indexes accessed by one algorithm and the
other.
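
The article doesn't show its code, but the two variants under discussion
usually look something like this sketch (whether the ternary actually compiles
to a CMOV depends on compiler and flags, so the generated asm should be
checked, e.g. with objdump -d):

    #include <stddef.h>

    /* Classic version: the comparison becomes a conditional branch that
       the CPU predicts and speculates past. */
    size_t lower_bound_branchy(const int *a, size_t n, int key)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (a[mid] < key)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }

    /* "Branchless" version: the ternary is where compilers typically emit
       a CMOV, making each load depend on the previous comparison instead
       of on a predicted branch. */
    size_t lower_bound_branchless(const int *a, size_t n, int key)
    {
        if (n == 0) return 0;
        const int *base = a;
        while (n > 1) {
            size_t half = n / 2;
            base = (base[half] < key) ? base + half : base; /* hoped-for CMOV */
            n -= half;
        }
        return (size_t)(base - a) + (size_t)(base[0] < key);
    }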

~~~
nkurz
I did some preliminary testing on this a few months ago, which we might pick
up again someday and try to publish as a short paper. I haven't looked closely
yet at what the author did, and I've forgotten some details of what I did, but
I can speak generally to our comparison of branching and branchless
implementations.

For repeated lookups, the first couple levels will all be hit in L1 cache
regardless of which way the comparison goes. The branch prediction penalty is
about 15 cycles, and the L1 access is about 5. The next level might be in L2
at 12 cycles, then the next few levels are in L3, with a ~40 cycle access
time, then RAM with 100+ cycles of access. The last few accesses will be
within the same 64B cacheline (128B with buddy prefetch), and thus will be in
L1 after the first access.

The branchless conditional-move approach has a data dependency, while the
branching "if" approach has a control dependency. Modern processors "run
ahead" with speculative execution past control dependencies (executing the
instructions for one branch but not retiring them), while the conditional
moves are issued but cannot execute until the corresponding comparison has
been made.

Because of speculative execution, the branching approach effectively has a 50%
accurate automatic prefetcher that runs several iterations ahead. The math
works out so that for some access patterns this can be a significant
advantage. The speed gap can be closed (and if I remember correctly, reversed)
by adding explicit prefetch instructions to the branchless approach. The
branching approach can also benefit from judicious use of prefetch, so that
each branch's fetch acts as a prefetch for the opposite branch, which makes
for faster recovery after a branch prediction error.
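
For what it's worth, the explicit-prefetch idea on the branchless loop can
look like this sketch (GCC/Clang __builtin_prefetch; fetching both of the next
iteration's possible midpoints is the idea described above, the exact offsets
are approximations):

    #include <stddef.h>

    size_t lower_bound_prefetch(const int *a, size_t n, int key)
    {
        if (n == 0) return 0;
        const int *base = a;
        while (n > 1) {
            size_t half = n / 2;
            /* Issue loads for both midpoints the next iteration could
               touch, so one of them is in flight no matter which way the
               comparison below resolves. */
            __builtin_prefetch(&base[half / 2]);
            __builtin_prefetch(&base[half + half / 2]);
            base = (base[half] < key) ? base + half : base;
            n -= half;
        }
        return (size_t)(base - a) + (size_t)(base[0] < key);
    }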

As the author concluded, we also found that a batch approach could be
beneficial. You can mitigate latency from RAM (and even from L3) if you can
arrange to have about 10 outstanding requests at a time. For a single core,
batch and prefetch approaches had similar top speeds. For multicore
(untested), presumably the excess memory bandwidth consumed by the "wrong"
prefetches would give the advantage to batching. Similar to the author's
experience on Broadwell, we
found that on Haswell the SIMD gather had minimal advantage over repeated
scalar loads. We have a Skylake machine coming soon, and are hoping the
hardware gather approach might finally take the lead.
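
A sketch of the batching idea (the batch width of 10 follows the "about 10
outstanding requests" rule of thumb above; everything else is my own shape,
not the author's code):

    #include <stddef.h>

    #define BATCH 10

    /* Advance BATCH independent searches in lockstep: the BATCH loads in
       each round don't depend on each other, so they can all be in flight
       at once. */
    void batch_lower_bound(const int *a, size_t n, const int keys[BATCH],
                           size_t out[BATCH])
    {
        const int *base[BATCH];
        for (int i = 0; i < BATCH; i++)
            base[i] = a;
        size_t len = n;                       /* identical across lanes */
        while (len > 1) {
            size_t half = len / 2;
            for (int i = 0; i < BATCH; i++)   /* BATCH independent loads */
                base[i] = (base[i][half] < keys[i]) ? base[i] + half : base[i];
            len -= half;
        }
        for (int i = 0; i < BATCH; i++)
            out[i] = (n == 0) ? 0
                   : (size_t)(base[i] - a) + (size_t)(base[i][0] < keys[i]);
    }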

~~~
acqq
OK, but binary search shouldn't have predictable patterns, right? So in a
classic example, if the search isn't "obvious", 50% of the time we'd have an
"unconditional" MOV, removing dependencies, but the other 50% of the time we'd
have a wrongly predicted branch. Maybe the testing was on too obvious an
example?

~~~
nkurz
Sorry, my phrasing was poor. The searched-for elements are indeed random, but
the first few (and last few) accesses will hit cache. Depending on the size of
the total array, the relative importance of the different access times varies.
We were actually comparing some different memory layouts besides the standard
"sorted" one (http://cglab.ca/~morin/misc/arraylayout/). Our conclusion was
essentially that with a proper implementation, there wasn't much if any
advantage to the alternative layouts.

But there certainly were some subtle things about the testing. In particular,
there is a significant difference between preparing a list of random searches
in advance (and loading them from memory) and calculating a random number on
the fly. Since the performance advantage of the branching approach depended on
how far ahead the speculative execution could get, the additional µops for
generating the number on the fly sometimes slowed things down enough to remove
the speculative advantage.

But to clarify my earlier answer: I think the author is seeing a real effect,
and the difference has to do with prefetching due to speculative execution of
the branching approach. This doesn't mean that branching is the best approach,
though, only that a "naive" approach can indeed beat branchless at certain
sizes. Both the branching and branchless approaches can be improved
significantly with either batching or explicit prefetching.

~~~
gpderetta
Another way to see this is that the 'branchy' version can have multiple loads
pending at the same time. This is a big win for non-L1-cached loads.

Cmov forces each load to depend on the previous one, so you will have only one
load pending at any time.

With a 50% misprediction rate, the Nth load has only a 0.5^N probability of
actually committing, but it is still better than just having a single load in
flight.

------
gfody
You can improve the expected case to O(log log n) by choosing optimal cut
points instead of always cutting in half.

~~~
ryan-c
This is called an interpolation search [0]. It works well on data with a known
distribution and is especially useful when reads are expensive.

[0] https://en.wikipedia.org/wiki/Interpolation_search
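
A minimal sketch of the idea (assumes roughly uniformly distributed keys; the
code is mine, not from the linked page):

    #include <stddef.h>

    /* Returns an index of key in sorted a[0..n-1], or (size_t)-1 if absent.
       Instead of the middle, probe where a uniform distribution predicts
       the key should sit. */
    size_t interpolation_search(const int *a, size_t n, int key)
    {
        if (n == 0) return (size_t)-1;
        size_t lo = 0, hi = n - 1;
        while (lo <= hi && key >= a[lo] && key <= a[hi]) {
            if (a[lo] == a[hi])               /* avoid dividing by zero */
                return (a[lo] == key) ? lo : (size_t)-1;
            double frac = ((double)key - a[lo]) / ((double)a[hi] - a[lo]);
            size_t pos = lo + (size_t)(frac * (double)(hi - lo));
            if (a[pos] == key) return pos;
            if (a[pos] < key)  lo = pos + 1;
            else               hi = pos - 1;  /* pos > lo here, no underflow */
        }
        return (size_t)-1;
    }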

------
yorhel
I'm surprised that even the fastest implementation needs 20+ms to search a
1000-element array. I would expect even a linear search to finish within 1 or
2ms with such a small data set. How large are the array elements? How were the
times measured?

EDIT: Oh, the time measured is the total of 1,000,000 random lookups? Never
mind my confusion then; that would certainly explain it.

~~~
acqq
Yes, it's a very strange article. No asm code, and no presentation of what's
actually being accessed.

------
imaginenore
If you're looking up 4-byte integers, you can simply create a hash table, or
even a dumb array as an "index", which gives you a one-operation lookup. Can't
get much faster than that.
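
A sketch of the "dumb array" version (the 16-bit key width is my assumption to
keep the table small; a dense array over the full 4-byte key space would need
gigabytes):

    #include <stdint.h>
    #include <stdbool.h>

    /* Direct-indexed "index": the key itself is the array position. */
    static uint32_t value_of[1u << 16];
    static bool     present[1u << 16];

    void index_put(uint16_t key, uint32_t value)
    {
        value_of[key] = value;
        present[key]  = true;
    }

    /* Lookup is a single array access, no searching at all. */
    bool index_get(uint16_t key, uint32_t *value)
    {
        *value = value_of[key];
        return present[key];
    }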

