
Speeding up independent binary searches by interleaving them - ingve
https://lemire.me/blog/2019/09/14/speeding-up-independent-binary-searches-by-interleaving-them/
======
zawerf
I was expecting this to be about fractional cascading which is a really clever
trick: [http://blog.ezyang.com/2012/03/you-could-have-invented-
fract...](http://blog.ezyang.com/2012/03/you-could-have-invented-fractional-
cascading/)

------
js8
I am not sure why he is talking about branch prediction when this is really
about caching. The next integer to look at is unlikely to be in the CPU data
cache, and so each time we have to go to main memory. If we interleave the
searches, we interleave the memory requests, and get higher throughput.
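
Roughly what I have in mind, as a hand-wavy sketch (not the article's actual
code): two lower-bound searches advanced in lockstep, so that the two cache
misses per round can be outstanding at the same time.

    #include <stddef.h>
    #include <stdint.h>

    /* Two independent searches advanced together.  The two loads in each
       iteration don't depend on each other, so the processor can have both
       memory requests in flight at once. */
    void search2(const int32_t *a, const int32_t *b, size_t n,
                 int32_t ka, int32_t kb, size_t *pa, size_t *pb) {
        size_t basea = 0, baseb = 0;
        for (size_t len = n; len > 1;) {
            size_t half = len / 2;
            basea += (size_t)(a[basea + half - 1] < ka) * half;
            baseb += (size_t)(b[baseb + half - 1] < kb) * half;
            len -= half;
        }
        *pa = basea;  /* candidate position; caller checks a[basea] vs ka */
        *pb = baseb;
    }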

Or am I wrong?

~~~
cshenton
If I had to hazard a guess, I’d say the conditional move means that both next
values are prefetched into cache, whereas a branch means that fetch doesn’t
happen until the comparison is made.

~~~
nkurz
Unfortunately I think that's backwards on both sides.

Conditional moves create a data dependency, and thus do not involve
speculative execution. While the program can explicitly try to preload both
possible values (and can benefit from doing so) the processor doesn't ever do
this automatically. Other instructions can still be executed (out-of-order)
but the path of execution does not change.

With a branch, rather than waiting for the results of the comparison to be
known, the processor guesses at the result and chooses a path of execution. It
thus executes the fetch immediately, but (about) half the time it's the wrong
fetch. If after the original load completes it realizes that it guessed the
wrong side of the branch, an 'exception' occurs, the speculative work is
thrown away, and the execution begins again with the correct branch.
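
For concreteness, here is roughly the pair of inner loops I have in mind (a
sketch of the general idea, not Daniel's exact code):

    #include <stddef.h>
    #include <stdint.h>

    /* Branchy: the load sits on a predicted path, so it can be issued
       speculatively before the comparison resolves -- but about half the
       time it is the load from the wrong half. */
    size_t search_branchy(const int32_t *arr, size_t n, int32_t key) {
        size_t base = 0, len = n;
        while (len > 1) {
            size_t half = len / 2;
            if (arr[base + half - 1] < key)
                base += half;
            len -= half;
        }
        return base;  /* candidate slot; caller compares arr[base] with key */
    }

    /* Branchless: compilers typically turn this into a conditional move (or
       setcc plus a multiply).  The next index is now a data dependency on
       the loaded value, so nothing is fetched speculatively past it. */
    size_t search_branchless(const int32_t *arr, size_t n, int32_t key) {
        size_t base = 0, len = n;
        while (len > 1) {
            size_t half = len / 2;
            base += (size_t)(arr[base + half - 1] < key) * half;
            len -= half;
        }
        return base;
    }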

------
boulos
I think I’m missing something. Under what conditions do you want to search
through several _identically sized_ lists independently? (I had assumed from
the title that this would be about performing many independent queries against
the same list, optionally with SIMD, instead of one query against many lists).

It wouldn’t be a huge change to also make the N-value “per query”, but then
you get a lot more branches in the inner loop. Even then though, I’m
struggling to remember situations where I thought “I have a dozen lists where
I need to do binary search for each of them”.

~~~
kjeetgill
It definitely comes up in search-index style posting lists. For dense postings
you'll usually deal with bitmaps, but you'll have sorted integers for sparse
ones and you usually have at least one list per search term.

There are all sorts of database indexing problems that can use this too, I'm
sure.

------
dehrmann
You could also use SIMD instructions, and there are bitwise operator tricks
that might be able to mostly avoid branching.

~~~
nkurz
I'll ask you an open question: do you need to use SIMD to get maximum speed on
this problem? I think the answer is yes, but only because with a certain level
of MLP (memory-level parallelism) you run out of architectural registers and
your compiler starts spilling things to the stack. Daniel thinks (hopes?) the
answer is no, and that
with sufficient finesse you can get the same speed without SIMD.

The problem (well, at least very similar problems that we've thought about
more) ends up running at very low IPC, with most of the time being spent
waiting for data from RAM or L3. You can fit a lot of extra instructions in
without seeing a slowdown. If there is a benefit to SIMD, both of us think it
will be a fairly small effect, and only at very high levels of MLP. But we'd
be excited to be proven wrong!
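
To make the SIMD side concrete, this is the sort of thing I'd reach for (just
a sketch under some assumptions: AVX2 gathers, eight equal-length sorted
arrays stored back to back, 8*n small enough to fit in 32-bit offsets; the
function name and layout are made up for the example):

    #include <immintrin.h>
    #include <stdint.h>

    /* Eight lower-bound searches at once.  data holds 8 sorted arrays of
       length n laid out contiguously, so array i is data[i*n .. i*n+n-1].
       The per-search state (the base offsets) lives in one vector register,
       which is where the register-pressure win over scalar code comes from. */
    void search8(const int32_t *data, int32_t n,
                 const int32_t *keys, int32_t *pos) {
        __m256i base = _mm256_setr_epi32(0, n, 2 * n, 3 * n,
                                         4 * n, 5 * n, 6 * n, 7 * n);
        __m256i key = _mm256_loadu_si256((const __m256i *)keys);
        for (int32_t len = n; len > 1;) {
            int32_t half = len / 2;
            /* one gather = 8 independent probes, all in flight together */
            __m256i idx = _mm256_add_epi32(base, _mm256_set1_epi32(half - 1));
            __m256i mid = _mm256_i32gather_epi32((const int *)data, idx, 4);
            /* lanes where mid < key advance by half, branch-free */
            __m256i lt = _mm256_cmpgt_epi32(key, mid);
            base = _mm256_add_epi32(base,
                       _mm256_and_si256(lt, _mm256_set1_epi32(half)));
            len -= half;
        }
        /* absolute offsets into data; subtract i*n for per-array positions */
        _mm256_storeu_si256((__m256i *)pos, base);
    }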

------
amelius
Shouldn't the CPU just switch to a different thread while it is waiting for a
memory request to be fulfilled? So you could just run binary searches in
different threads.

~~~
nkurz
It's a good question, but I think that the problem is that the overhead of
switching between threads is much greater than the number of cycles wasted by
waiting for the memory request. I think the fastest context switches on Linux
are around 1 µs (1/1000000 of a second, or about 4,000 cycles)[1], whereas a
request from RAM is around 25 ns (1/40 of a µs, or about 100 cycles)[2].

It's possible my numbers are off by a bit, but I think this means that on a
standard CPU, you really need to parallelize the search more explicitly rather
than letting the OS handle it for you. GPUs, on the other hand, do take a
similar "brute force" approach to context switching, and clearly have good
results on some similar problems.

[1] [https://eli.thegreenplace.net/2018/measuring-context-
switchi...](https://eli.thegreenplace.net/2018/measuring-context-switching-
and-memory-overheads-for-linux-threads/)

[2] [https://www.anandtech.com/show/9483/intel-skylake-
review-670...](https://www.anandtech.com/show/9483/intel-skylake-
review-6700k-6600k-ddr4-ddr3-ipc-6th-generation/7)

------
wmf
This looks similar to a technique used in Cisco VPP to process batches of
packets very efficiently; it's a cool way to cheat the pipeline.

------
yxhuvud
What I don't understand, though, is why it wouldn't make sense to first sort
the keys that are being looked up, and then look them up in order while
traversing the tree. Seems much better from a caching perspective.

EDIT: Ah, each look-up was in a different tree. That takes away that
optimization.

------
bhouston
Sounds on the surface like the ray bundles that are used in ray tracing
acceleration.

------
WhitneyLand
The article doesn't seem to consider real-life practicality enough. For
example, what speedup do you get with this over caching the most frequently
accessed million data items using a constant-time lookup?

The answer, of course, is that it depends heavily on your problem space and
your data's size and characteristics.

Caching is somewhat discussed, and I get that the topic is not data structures
and algorithms; however, it seems impossible to say when this approach would
be useful in general without more practical considerations.

How easy would it be to maintain or enhance this kind of code? Would you get
emails like "...we have a few bugs in the 16-way interleaved multithreaded
binary search code, who wants to quickly get those cleaned up..."?

~~~
asdfasgasdgasdg
> The article doesn't seem to consider real life practicality enough.

It considers it plenty for an article in the category "neat trick/wrinkle of
processor caches and pipelining you may not have known/thought about to its
limit." If you're actually doing binary searches on sets large enough that
this would matter, most often some kind of hash based solution is going to be
preferable anyway, because of its much higher degree of cache efficiency and
better asymptotic performance. That's not this article's purpose, though. Not
every article is a tutorial.

~~~
WhitneyLand
Who said it should be a tutorial? I agree on that point; it's not a how-to,
and a tutorial would probably be a contradiction for this topic.

That doesn't mean it can't discuss practicalities. Plenty of advanced papers,
blog posts, documentation, experiments, whatever, discuss engineering
realities because they feel like it.

The reality is that some things are less intuitive and require more
outside-the-problem thinking than others.

If someone is writing about how binary trees work alone, not about concurrent
programming as well as is done here, there's little point in discussing
applications, because the concept is useful to know and gets applied so often.

However, this one is an edge case, and even when you do need it, I suspect
most seasoned engineers would still benefit from some contextual
considerations from the author.

