
Binary Search: A new implementation that is up to 25% faster - scandum
https://github.com/scandum/binary_search
======
ghj
The readme doesn't describe why it's faster. Looking at the code of the
variant that is the fastest:
[https://github.com/scandum/binary_search/blob/master/binary-...](https://github.com/scandum/binary_search/blob/master/binary-
search.c#L201)

It seems to be a quaternary search, where the number of interior points looks
like an arbitrary "magic number". It's easy to understand what it does if you
already know other variants like ternary search (cut search space into 3 to
pick 2 interior points) or golden section search (same thing as ternary except
in golden ratio). Here, quaternary search is just picking 3 interior points
after dividing into 4 parts.

So the speedup is the same as how B-trees get their speedup: increase the
branching factor, which costs more comparisons per level but reduces the depth.
I might be wrong, but instead of quaternary it could also be 5-ary or 8-ary or
any B-ary, and any of these variants could potentially perform better.

So it's just a tradeoff in the total cost: cost_of_divide * log_B(N) +
cost_of_compare * (B - 1) * log_B(N).

EDIT: Thinking about it more, the division doesn't seem like it should be the
most expensive operation (especially relative to compares/branching). Anyone
have any better ideas on why you would prefer more compares? Is it some
pipelining thing?

~~~
BeeOnRope
Wider branching factors (3-ary, 4-ary, etc) are less efficient in the total
number of comparisons, but give you more memory level parallelism, and large
searches are dominated by memory access behavior and critical paths, not total
comparisons or instruction throughput. So better MLP can make up for the
inefficiency of higher arity searches... to a point.

E.g., with 4-ary search, your search tree is half the depth of binary search
(effectively, you are doing two levels of binary search in one level), but you
do 3x the comparisons per level, so 1.5x more in total.

However, the comparison (1 cycle) is very fast compared to the time to fetch
the element from memory (5, ~12, ~40, ~200+ cycles for an L1, L2, L3 or DRAM
hit, respectively). The 4-ary search can issue the 3 probes in parallel, so if
the total time is dominated by the memory access, you might expect it to be
~2x faster. In practice, things like branch mispredictions (if the search is
branchy) complicate the picture.
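
To make the probe independence concrete, here is a rough sketch of one level
of a 4-ary step (an illustration of the idea, not the repo's actual code; the
function and variable names are made up):

    #include <stddef.h>

    /* One level of a 4-ary search over a sorted int array, narrowing the
       half-open range [*lo, *hi). The three probe addresses depend only on
       lo/hi, not on each other, so all three loads can be in flight at once;
       the comparisons only happen afterwards. Assumes *hi - *lo >= 4. */
    static void quaternary_step(const int *a, size_t *lo, size_t *hi, int key)
    {
        size_t q  = (*hi - *lo) / 4;
        size_t p1 = *lo + q, p2 = *lo + 2 * q, p3 = *lo + 3 * q;

        int v1 = a[p1], v2 = a[p2], v3 = a[p3];   /* three independent loads */

        if      (key < v1) { *hi = p1; }
        else if (key < v2) { *lo = p1; *hi = p2; }
        else if (key < v3) { *lo = p2; *hi = p3; }
        else               { *lo = p3; }
    }

With a branchy binary search, by contrast, the second probe's address is only
known after the first comparison resolves (or is speculated).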

~~~
viraptor
Sounds right. I think the fastest solution (of this approach) would do
something like: fetch a cache-line-sized batch; check the upper/lower ends;
depending on the result, either choose the next batch by binary division or do
a linear walk to find the match. Not sure if doing it precisely would be an
improvement over the magic numbers in the example though.

Then again, I'd like to experiment with prefetch here as well. It may be
possible to squeeze out even more performance.

~~~
BeeOnRope
I don't think the trick of checking each end of the cache line helps much,
except perhaps at the very end of the search (where there are say low 100s of
bytes left in the region).

When the region is large it barely cuts the search space down at all: it's a
rounding error.

Now you might think it's basically free, but clogging up the OOOE structures
with instructions can hurt MLP because fewer memory accesses fit in the
window, even if the instructions always completely execute in the shadow of
existing misses.

There is a similar trick with pages: you might try to favor not touching more
pages than necessary, so it might be worth moving probe points slightly to
avoid touching a new page. For example, if you just probed at bytes 0 and
8200, the middle is 4100, but that's a new 4k page (bytes 4096 thru 8191), so
maybe you adjust it slightly to probe at 4090 since you already hit that page
with your 0 probe.

Making all of these calculations dynamically is a bit annoying and maybe slow,
so it's best if the whole search is unrolled, so that the probe points (by
whatever rules) can be calculated at compile time and embedded in the code.

Much more important than either of these things is avoiding overloading the
cache sets. Some implementations have this habit of choosing their probe
points such that only a very small number of sets in the L1, L2 are used so
the caching behavior is terrible. Paul Khuong talks about this in detail on
pvk.ca.

Doing a linear walk at the end definitely helps a bit though: SIMD is fastest
for this part.
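
For what it's worth, the final linear walk might look something like this with
SSE2 intrinsics (a sketch under the assumption of 32-bit keys and a tail
length that is a multiple of 4; the names are illustrative):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Scan a small sorted tail 4 elements at a time and return the index of
       the first element >= key, or n if there is none. Assumes n % 4 == 0. */
    static size_t simd_tail_search(const int32_t *a, size_t n, int32_t key)
    {
        __m128i vkey = _mm_set1_epi32(key);
        for (size_t i = 0; i < n; i += 4) {
            __m128i v  = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i lt = _mm_cmplt_epi32(v, vkey);           /* lanes < key */
            int mask   = _mm_movemask_ps(_mm_castsi128_ps(lt));
            if (mask != 0xF)                      /* some lane is >= key */
                return i + (size_t)__builtin_ctz(~mask & 0xF);
        }
        return n;
    }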

~~~
londons_explore
If you want to optimize a data structure for binary search, it sounds like it
might be best to reorder the data itself in memory to make caching more
effective.

For example, the first access in a binary search will be the middle element,
followed by the lower-quartile or upper-quartile element. If you store all of
those together in memory, a single cache line fetch can serve all of those
requests.

~~~
aktau
See user `fluffything`'s comment on this same thread:
[https://news.ycombinator.com/item?id=23895629](https://news.ycombinator.com/item?id=23895629),
describing this approach (Eytzinger layout).

------
rav
I checked the plot against an old school project where I investigated binary
search, and I got similar numbers for the "standard" binary search
(std::lower_bound in my project): For N=1000,10000,100000 it took 713,925,1264
us (compare to the author's 750,960,1256 us).

In my project I didn't test "quaternary search", but instead the winner was
the Eytzinger layout that others have discussed, with 570,808,1044 us (compare
to author's quaternary search: 557,764,1009 us).

It would be interesting to try a 4-way Eytzinger layout.

My project report, if anyone's interested: [https://dl.strova.dk/ksb-manuel-
rav-algorithm-engineering-20...](https://dl.strova.dk/ksb-manuel-rav-
algorithm-engineering-20200317.pdf)

EDIT: I reuploaded the linked PDF since the first one I uploaded was missing
the source code.

------
eloff
Binary search can be implemented using conditional move instructions instead
of branching. Given that the branches are unpredictable, this yields a huge
speedup. It can be implemented in pure C [1], if written in a way the compiler
can recognize. In my experience this is the fastest binary search.

[1] [https://pvk.ca/Blog/2012/07/03/binary-search-star-
eliminates...](https://pvk.ca/Blog/2012/07/03/binary-search-star-eliminates-
star-branch-mispredictions/)
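
For reference, the cmov pattern comes out to roughly this kind of loop (a
minimal sketch, not the article's exact code; GCC and Clang will usually
compile the conditional assignment of `base` to a cmov):

    #include <stddef.h>

    /* Branch-free lower_bound over a sorted int array: the loop body has no
       data-dependent branch, only a conditional move of the base pointer.
       Returns the index of the first element >= key. */
    static size_t branchless_lower_bound(const int *a, size_t n, int key)
    {
        if (n == 0)
            return 0;
        const int *base = a;
        while (n > 1) {
            size_t half = n / 2;
            base = (base[half] < key) ? base + half : base;  /* cmov, not jcc */
            n -= half;
        }
        return (size_t)(base - a) + (*base < key);
    }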

~~~
BeeOnRope
Actually cmov-based searches often (usually?) result in a slowdown unless non-
obvious measures are taken.

A cmov-based search has the problem that the probes for each level are data-
dependent on the prior level. So you cannot get any memory-level parallelism,
because the address to probe at level n+1 isn't known until the comparison at
level n completes.

Branchy search guesses the next direction right at least half the time (in the
worst case), 25% of the time for 2 levels down, etc. So it gets about 1
additional useful access in parallel, on average in the worst case, compared
to cmov search.

Except for very small searches which fit in the L1 cache (where misprediction
cost dominates overall time), branchy is competitive and often faster.

Now you can take countermeasures to this problem with the cmov search. One is
to increase the arity of the search as in the OP, another is explicit software
prefetching, etc, which might put cmov back on top.

This has all been assuming the worst case of totally unpredictable branches.
Real world searches might be expected to be less-than-random as some elements
are hotter than others, etc. Any predictability of the branches helps out for
branchy search.

~~~
eloff
As usual, it depends. I was doing binary search within btree nodes, which
means small arrays that fit into the cache. Branchless search performed best
there. Actually even the btree search itself was branchless (if nodes were in
memory) until the end, as the pointer to the next node was just loaded via an
index calculation and the loop continued with a binary search of the next
node. I did a little software prefetching as well once I had the pointer to
the next node.

~~~
BeeOnRope
Yes, tightly packed nodes or nodes which are all in L1 is one place where
plain cmov can win.

The sweet spot is pretty small though, because it is not only squeezed from
above by branchy which wins for less cached data, but also from below by
linear vectorized search. You can't always apply vectorized search (e.g. if
the keys are not contiguous), but when you can it can win up to dozens of
elements.

Throw the possibility of predictability on top of that (where branchy wins),
and that's why I usually reject unconditional statements that "cmov gives a
huge speedup".

------
pbiggar
This is really cool!

About a decade ago I did research on sorting and searching, which was the same
type of work going on here ([https://github.com/pbiggar/sorting-branches-
caching](https://github.com/pbiggar/sorting-branches-caching)). I found it was
extremely difficult to accurately say whether one algorithm is faster than
another in the general case. I found a bunch of speed improvements that
probably only apply in processors with very long pipelines (like the P4).

Execution speed is probably the right metric here, but it makes it hard to
understand _why_ it's faster.

PS. Looking at "boundless interpolated search"
([https://github.com/scandum/binary_search/blob/master/binary-...](https://github.com/scandum/binary_search/blob/master/binary-
search.c#L234)), it seems it's missing a `++checks`. I initially thought this
could be the cause of it being misreported as faster, but I see the benchmark
is run-time based, so that wouldn't be the cause unless the incrementing itself
is the bottleneck.

------
utopcell
It is nice to see benchmarks of elementary algorithms, but none of these is
new. The benchmarking would be more convincing if more distributions were used.
For example, we know that interpolated search is O(lg lg n) for the uniform
distributions used here, but can be linear in pathological cases.

~~~
klyrs
Exponentially distributed data is a great example of that. It's fun that the
author picks 32-bit ints; I suspect that you just don't have enough dynamic
range to really hobble interpolated search.

------
diehunde
Most real-world applications I've seen that use in-memory search use balanced
binary search trees. Are any of these improvements applicable to those data
structures? AVL and red-black trees come to mind.

~~~
adrianN
AVL and red-black trees are pretty terrible and can easily be improved upon by
cache-friendlier data structures such as B-trees.

------
dehrmann
On the topic of binary searches, I was wondering if they can be done without
any branching, so you can avoid branch misses and possibly parallelize them
with SIMD instructions. I think I was able to get it with some optimizer flags
and GCC built-ins.

[https://github.com/ehrmann/branchless-binary-
search](https://github.com/ehrmann/branchless-binary-search)

~~~
BeeOnRope
Yes, you can do it branch free, with only a single branch at the end (so your
search actually terminates), but the benefits are not as obvious as you might
think, per my other reply:
[https://news.ycombinator.com/item?id=23894709](https://news.ycombinator.com/item?id=23894709)

~~~
dehrmann
You can avoid the branch at the end by computing your iteration count
beforehand and using a switch-statement jump table where each case is a step in
the search, and you iterate by falling through.
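
Roughly, the idea looks like this (an illustrative sketch that only unrolls a
few levels, so it handles n up to 16; a real version would have one case per
possible level and is best generated):

    #include <stddef.h>

    /* Compute ceil(log2(n)) steps up front, then jump into an unrolled chain
       of halving steps that fall through to the end. Returns the index of the
       first element >= key in a sorted array. Only 4 levels are unrolled
       here, so this sketch assumes n <= 16. */
    static size_t switch_lower_bound(const int *a, unsigned n, int key)
    {
        const int *base = a;
        unsigned steps = n > 1 ? 32 - (unsigned)__builtin_clz(n - 1) : 0;
        unsigned half;
        switch (steps) {
        case 4: half = n / 2; base = base[half] < key ? base + half : base; n -= half;
                /* fall through */
        case 3: half = n / 2; base = base[half] < key ? base + half : base; n -= half;
                /* fall through */
        case 2: half = n / 2; base = base[half] < key ? base + half : base; n -= half;
                /* fall through */
        case 1: half = n / 2; base = base[half] < key ? base + half : base; n -= half;
                /* fall through */
        default: break;
        }
        return (size_t)(base - a) + (n != 0 && base[0] < key);
    }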

~~~
BeeOnRope
Yes, but this just swaps the terminating conditional branch for an indirect
branch at the start, no?

In general my guideline is: for an algorithm which has variable input size, do
you want to do a variable amount of work? If yes, you will have at least one
branch, and this branch will be unpredictable with at least some distributions
of unpredictable input sizes.

In this case I think you definitely want "variable work", because a search over
four elements should be shorter than a search over one million.

~~~
dehrmann
> Yes, but this just swaps the terminating conditional branch for an indirect
> branch at the start, no?

Depends how you write it. The number of iterations is ceil(log2(n)). GCC's
__builtin_clz essentially computes this, and it has support on most major
architectures.

~~~
BeeOnRope
Yes, you can calculate the number of levels, but with a dynamic size that
isn't going to let you avoid a branch. Say you calculate there are 10 levels
in the search. What do you do now?

~~~
dehrmann
Do you consider a jump a branch? My impression was always that the point of
avoiding branches is to avoid mispredictions, but that shouldn't be a problem
for unconditional branches.

The other option is running all 32 (or 64) steps.

~~~
BeeOnRope
Perhaps I've been a bit loose in my language (and not everyone is consistent
with branch vs jump). I really should have said something like "non-constant
jump". That is, in my theorem (haha) above, I mean you cannot have non-
constant work without at least one non-constant jump.

A non-constant jump is anything that can jump to at least 2 different
locations in a data-dependent way. On x86, those would be something like [2]:

- conditional jumps (jcc), aka branches

- indirect calls

- indirect jumps

Basically anything that either goes one of 2 ways based on a condition, or an
indirect jump/call that can go to any of N locations.

Without one of those, you will execute the same series of instructions every
time, so you can't do variable work [1].

So when you say "unconditional branches" I'm not sure if you are talking about
_direct_ branches, which always go to the same place (these are usually called
jumps), or _indirect branches_, which (on x86) don't have an associated
condition but can go anywhere since their target is taken from a register or
memory location.

If you are talking about the former, I don't think you can implement your
proposed strategy: you can't jump to the correct switch case with a fixed,
direct branch. If you are talking about the latter (which I had assumed), you
can – but it is subject to prediction and mispredicts in basically the same
way as a conditional branch.

---

[1] Here, "work" has a very specific meaning: instructions executed. There are
meaningful ways you can do less work while executing the same instructions:
e.g., you might have many fewer cache misses for smaller inputs. However, at
some point the instructions executed will dominate.

[2] There are more obscure ways to get non-constant work, such as self-
modifying code, but these cover 99% of the practical cases.

------
peter_d_sherman
If the data to be binary searched is static, that is, if it doesn't change,
and if it fits entirely in RAM, then what I would do is as follows:

1) Pre-compute the first mid/center element, C1.

2) Move the data for this mid/center element to the first item in a new array.

3) Pre-compute the next two mid/center elements, that is, the ones between the
start of the data and C1 (C2), and the one between C1 and the end of data
(C3).

4) Move the data from C2 and C3 positions to the next positions in our array,
the 2nd and 3rd position.

5) Keep repeating this pattern. Each iteration/split produces twice as many
midpoints as the previous iteration. Order these linearly as the next items in
our new array (a sketch of this construction appears further down).

When you're done, two things will occur.

1) You will use a slightly different binary search algorithm.

That is because you _no longer need to compute a mid-point at every iteration_;
those are now _pre-computed_ in the ordered array.

2) Because the data is now ordered, it becomes possible to load the tip of
that data into the CPU's L1, L2, and L3 caches. If, say, your binary search
takes 16 iterations to complete, then you might get a good head start of 5-8
iterations (or more, depending on cache size and data size) of that data being
in cache, which will make those iterations MUCH faster.
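
A minimal sketch of that construction and the matching search, assuming a
sorted int array (this is essentially the Eytzinger/BFS layout others in the
thread mention; the names here are made up):

    #include <stddef.h>

    /* Fill out[1..n] (1-based) midpoint-first: out[1] is the overall midpoint
       C1, and the children of out[k] live at out[2k] and out[2k+1], so each
       level of the search sits together in memory. Call with next=0, k=1. */
    static size_t layout_build(const int *sorted, int *out, size_t n,
                               size_t next, size_t k)
    {
        if (k <= n) {
            next = layout_build(sorted, out, n, next, 2 * k);     /* left half  */
            out[k] = sorted[next++];                              /* midpoint   */
            next = layout_build(sorted, out, n, next, 2 * k + 1); /* right half */
        }
        return next;
    }

    /* Search without computing midpoints: each step just goes to 2k or 2k+1.
       Returns the 1-based slot of the first element >= key, or 0 if none. */
    static size_t layout_search(const int *out, size_t n, int key)
    {
        size_t k = 1, best = 0;
        while (k <= n) {
            if (out[k] < key) k = 2 * k + 1;      /* go right */
            else { best = k; k = 2 * k; }         /* candidate, go left */
        }
        return best;
    }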

Also, (and this is just me), but if your program has appropriate permissions
to temporarily shut off interrupts (x86 cli sti -- or OS API equivalent), then
this search can be that much faster (well, depends on what the overhead for
cli/sti and API calls are, but test, test, test! (also, always shut off the
network and other threads when you're testing, as they can skew the
results!) <g>)

[https://en.wikipedia.org/wiki/Memory_hierarchy](https://en.wikipedia.org/wiki/Memory_hierarchy)

"Almost all programming can be viewed as an exercise in caching." \--Terje
Mathisen

"Assume Nothing" \--Mike Abrash

Also, there is no such thing as the fastest binary search algorithm... there's
always a better way to do them...

To paraphrase Bruce Lee:

 _" There are no mountain tops, there is only an endless series of plateaus,
and you must ever seek to go beyond them..."_

~~~
peter_d_sherman
Also, before I forget, if, say, English words were being stored, you could
have an array of 26 pointers (representing letters 'A'..'Z') where each
pointer points to one of 26 other similar arrays, representing the second
character, etc. This pattern could repeat in memory. Would it be time/space
efficient? Depends on the data. Also, this could be combined with the above.
Maybe the first few letters of words are stored this way, and the rest are
ordered such that a binary search can be performed, as above. Again, it depends
on data, storage, and speed requirements. Sure, you could use one or more
hashing techniques, but then what's the speed/memory penalty, and what's the
penalty for collisions? So, there are a lot of considerations to be made when
selecting a technique for storing/searching data... there is no one-size-fits-
all technique; as I said above, there's always a better way to do things...
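
As a sketch, the letter table described here is essentially a trie node
combined with a sorted suffix list at the cutoff point (illustrative only;
assumes uppercase ASCII words, and all the names are made up):

    #include <stddef.h>

    /* One 26-way node: next[c - 'A'] points at the table for the following
       character (NULL if no word continues that way). Past the first few
       letters, the remaining suffixes are kept sorted so they can be binary
       searched, as described above. */
    struct letter_node {
        struct letter_node *next[26];   /* one slot per letter 'A'..'Z' */
        const char **suffixes;          /* sorted remainders, binary-searchable */
        size_t n_suffixes;
        int is_word;                    /* a word ends exactly at this node */
    };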

~~~
rurban
You search the strings wordwide then, by 8 not by 1. You just need to
represent the strings little or big endian, and construct the nested switches
offline. About 20x faster than linear search via memcmp.
[http://blogs.perl.org/users/rurban/2014/08/perfect-hashes-
an...](http://blogs.perl.org/users/rurban/2014/08/perfect-hashes-and-faster-
than-memcmp.html)
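
The word-at-a-time comparison boils down to something like this (a sketch
assuming GCC/Clang builtins, a little-endian machine, and keys padded with
NULs to a multiple of 8 bytes; not the code from the linked post):

    #include <stdint.h>
    #include <string.h>

    /* Load 8 bytes and byte-swap so that a plain integer comparison matches
       lexicographic (memcmp) order on a little-endian machine. */
    static uint64_t load_be64(const char *s)
    {
        uint64_t w;
        memcpy(&w, s, sizeof w);        /* safe unaligned load */
        return __builtin_bswap64(w);
    }

    /* Compare two padded keys one 64-bit word at a time instead of one byte
       at a time. */
    static int cmp8(const char *a, const char *b, size_t len)
    {
        for (size_t i = 0; i < len; i += 8) {
            uint64_t wa = load_be64(a + i), wb = load_be64(b + i);
            if (wa != wb)
                return wa < wb ? -1 : 1;
        }
        return 0;
    }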

