

Cache-friendly binary search - joaquintides
http://bannalia.blogspot.com/2015/06/cache-friendly-binary-search.html

======
kwantam
A "simple" memory layout for cache friendliness (assuming you don't have to do
updates on your tree; there exists a more complicated cache-friendly dynamic
tree structure, too) is a recursive van Emde Boas layout. This memory layout
incurs at most a (small) constant factor more cache misses than an omniscient
cache would, and it requires almost no tuning to work in any caching
hierarchy.

[http://supertech.csail.mit.edu/cacheObliviousBTree.html](http://supertech.csail.mit.edu/cacheObliviousBTree.html)
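
A rough Python sketch of the recursive ordering (helper names are my own; it
assumes a perfect binary tree with 2^h - 1 nodes and one of the two common
split conventions, shorter tree on top):

```python
# Sketch of the van Emde Boas layout: split the tree at half its height,
# lay out the top tree contiguously, then each bottom subtree contiguously,
# recursing into each piece. Nodes are identified by 1-based BFS index.

def rebase(root, local):
    """Map a subtree-local BFS index (1 = subtree root) to a global BFS index."""
    depth = local.bit_length() - 1
    offset = local - (1 << depth)
    return root * (1 << depth) + offset

def veb_order(height):
    """Return the 1-based BFS node indices in van Emde Boas memory order."""
    if height == 1:
        return [1]
    top_h = height // 2           # height of the top recursive tree
    bot_h = height - top_h        # height of each bottom subtree
    order = list(veb_order(top_h))
    # Children of the top tree's leaves are the roots of the bottom subtrees;
    # each bottom subtree is laid out contiguously, recursively.
    for leaf in range(1 << (top_h - 1), 1 << top_h):
        for child in (2 * leaf, 2 * leaf + 1):
            order.extend(rebase(child, i) for i in veb_order(bot_h))
    return order

# Height 3: root, then each child subtree contiguous -> [1, 2, 4, 5, 3, 6, 7]
```

Storing the keys in this order means a cache block of any size B covers a
subtree of roughly log B consecutive levels, at every level of the memory
hierarchy at once, which is where the tuning-free behavior comes from.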

More generally: in the last 10-15 years there has been a lot of work on
developing algorithms that are compatible with caching, but which do not need
to be tuned to the details of a particular cache hierarchy on which they run.
For more info, see

Demaine, Erik D. "Cache-Oblivious Algorithms and Data Structures."
[http://erikdemaine.org/papers/BRICS2002/paper.pdf](http://erikdemaine.org/papers/BRICS2002/paper.pdf)

Erik Demaine's 6.046 lectures on this topic are available on YouTube:
[https://www.youtube.com/watch?v=cJOHERGcGm4](https://www.youtube.com/watch?v=cJOHERGcGm4)
and
[https://www.youtube.com/watch?v=zjUDy6a5vx4](https://www.youtube.com/watch?v=zjUDy6a5vx4)

One cache-oblivious dynamic tree structure is also due to Demaine:
[http://erikdemaine.org/papers/CacheObliviousBTrees_SICOMP/](http://erikdemaine.org/papers/CacheObliviousBTrees_SICOMP/)

~~~
lorenzhs
van Emde Boas trees need quite a few tricks in order for an implementation to
be truly competitive, but then they can be really quick: "Engineering a Sorted
List Data Structure for 32 Bit Keys", Roman Dementiev, Lutz Kettner, Jens
Mehnert, Peter Sanders.
[http://algo2.iti.kit.edu/dementiev/files/veb.pdf](http://algo2.iti.kit.edu/dementiev/files/veb.pdf)
\- code: [http://people.mpi-inf.mpg.de/~kettner/proj/veb/](http://people.mpi-inf.mpg.de/~kettner/proj/veb/)

------
enigmo
Taken to its logical extreme: "k-Ary Search on Modern Processors"

[https://people.mpi-inf.mpg.de/~rgemulla/publications/schlege...](https://people.mpi-inf.mpg.de/~rgemulla/publications/schlegel09search.pdf)

As SSE/AVX registers get wider and wider you might as well compare an entire
cache line (or two) at a time. But the overhead of building the level-order
k-ary tree means you need to do a whole lot of lookups for each insert... so
it doesn't apply to that many problems. Unless you're building a search
engine. Then it applies a lot.
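
A scalar sketch of the idea (Python for brevity; the pivot-probing loop and
the choice k=17 stand in for one wide SIMD comparison over a cache line of
keys):

```python
def kary_search(a, x, k=17):
    """k-ary search over a sorted list: each round splits the current range
    into ~k parts by probing evenly spaced pivots; in the SIMD version the
    pivot probes collapse into a single wide compare plus a movemask."""
    lo, hi = 0, len(a)
    while hi - lo > k:
        step = (hi - lo) // k
        i = lo
        # Walk the pivots until one exceeds x.
        while i + step < hi and a[i + step] <= x:
            i += step
        lo, hi = i, min(i + step, hi)
    # Finish with a linear scan of the last small range.
    for i in range(lo, hi):
        if a[i] == x:
            return i
    return -1
```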

------
deathy
Somewhat related maybe? "Memory Layouts for Binary Search"
[https://news.ycombinator.com/item?id=9511939](https://news.ycombinator.com/item?id=9511939)

I also really liked the package provided there for testing it out yourself,
generating a results bundle, and the clear email instructions for
contributing your results.

Is there any good way of running these kinds of experiments on a variety of
hardware possibly with other people's help?

------
serialx
Or you can use slightly off-center binary or quaternary (four-way) searches
to get more consistent lookup times:

[http://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-patholo...](http://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathological-case-for-caches/)
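
A minimal sketch of the binary variant (the 1/3 split point is just an
illustrative choice, not the one the post derives):

```python
def offcenter_search(a, x, num=1, den=3):
    """Binary search that splits at num/den of the range instead of 1/2, so
    successive probes don't keep landing on the same few cache sets when the
    array size is a power of two."""
    lo, hi = 0, len(a)
    while lo < hi:
        mid = lo + (hi - lo) * num // den   # off-center split point
        if a[mid] == x:
            return mid
        if a[mid] < x:
            lo = mid + 1
        else:
            hi = mid
    return -1
```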

------
TheLoneWolfling
Binary search doesn't make much sense on modern processors anyways.

We tend to end up comparing keys that are much smaller than the cache lines -
and the memory access takes so long compared to the actual calculation that
you may as well check everything in the cacheline "while you're at it".

Which ends up meaning that in practice, you may as well do a k-ary search.

I wonder how much, if any, this can be improved by prefetching the indexes
you might access in the next step?
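
One concrete way to set that up (sketched in Python; the prefetch itself
would be a single `__builtin_prefetch` in C, since both candidates sit next
to each other): lay the tree out in BFS/Eytzinger order, so the two possible
next nodes, 2i and 2i+1, are adjacent in memory and known one step ahead.

```python
def eytzinger(sorted_keys):
    """Lay a sorted array out in BFS (Eytzinger) order, 1-based: node i's
    children live at indices 2i and 2i+1, adjacent in memory."""
    out = [None] * (len(sorted_keys) + 1)
    it = iter(sorted_keys)

    def fill(i):  # in-order walk of the heap-shaped tree
        if i < len(out):
            fill(2 * i)
            out[i] = next(it)
            fill(2 * i + 1)

    fill(1)
    return out

def eytzinger_search(t, x):
    """Membership test; before t[i] is even compared, the next step's two
    candidate nodes (2i and 2i+1) are known, which makes them prefetchable."""
    i = 1
    while i < len(t):
        if t[i] == x:
            return True
        i = 2 * i + (t[i] < x)
    return False
```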

~~~
rakoo
> Which ends up meaning that in practice, you may as well do a k-ary search.

So, basically, use b-trees ?
([https://en.wikipedia.org/wiki/B-tree](https://en.wikipedia.org/wiki/B-tree))

Poul-Henning Kamp already wrote about it
([https://web.archive.org/web/20150523191845/http://queue.acm....](https://web.archive.org/web/20150523191845/http://queue.acm.org/detail.cfm?id=1814327)).
Long story short: with the correct data structure, and assuming you are going
to be limited by your storage, you can expect anything from small wins while
your data still fits in RAM to a 10x win when your RAM is full (and you have
to fault pages in)

~~~
TheLoneWolfling
Not _quite_ the same thing, because you want to pull in a fixed number of
_bytes_ at a time, which can be a variable number of _keys_. But yes, pretty
close.

And yes, there is nothing new in the field of CS, ever, it seems. Although
note that that is talking about a B-heap (with insertion / etc), whereas this
is a b-tree on straight lookups.
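
To make the fixed-bytes point concrete (64-byte lines are typical today, but
that's an assumption here): the fanout of a node sized to one cache line
varies with the key width.

```python
CACHE_LINE = 64  # bytes; typical on current x86/ARM, but an assumption

def keys_per_line(key_bytes):
    """Fanout of a node sized to exactly one cache line."""
    return CACHE_LINE // key_bytes

# 4-byte keys give a 16-way node, 8-byte keys 8-way, 16-byte keys 4-way:
# fixed bytes per step, variable number of keys.
```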

------
opcvx
Sure, this is marginally better if you have a static structure; in all other
cases you lose.

~~~
the8472
static or read-mostly.

