

Memory Layouts for Binary Search - rcfox
http://cglab.ca/~morin/misc/arraylayout/

======
flgr
Sorry for the blatant plug, but it might be relevant: in my M.Sc. thesis I
surveyed main-memory-optimized index techniques and also provided some
background on why traditional binary search is a poor fit for modern
hardware. There are lots of illustrations and I cite some great material. :)

See here:
[https://www.researchgate.net/profile/Florian_Gross/publicati...](https://www.researchgate.net/profile/Florian_Gross/publication/275971053_Index_Search_Algorithms_for_Databases_and_Modern_CPUs/links/554cffca0cf29f836c9cd539.pdf)

Along those lines:

* CSS Trees: pointerless B-trees with a layout optimized for cache lines ([http://www.vldb.org/conf/1999/P7.pdf](http://www.vldb.org/conf/1999/P7.pdf)) (see the sketch after this list)

* Intel & Oracle's fast architecture-sensitive tree search (combines huge pages, cache line blocking, and SIMD in an optimal layout): [http://www.researchgate.net/profile/Jatin_Chhugani/publicati...](http://www.researchgate.net/profile/Jatin_Chhugani/publication/221213860_FAST_fast_architecture_sensitive_tree_search_on_modern_CPUs_and_GPUs/links/0c96051f5d2990770d000000.pdf)

* Adaptive radix trees ([http://codematch.muehe.org/~leis/papers/ART.pdf](http://codematch.muehe.org/~leis/papers/ART.pdf))
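
To give a flavor of the CSS tree item: children are located by arithmetic on
the node index rather than by chasing pointers, and each node is sized to a
cache line. A minimal sketch (the parameters and layout here are simplified
and illustrative, not the paper's exact scheme):

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // Implicit (KEYS_PER_NODE+1)-ary search tree stored breadth-first in a
    // flat array; no child pointers anywhere. 15 x 4-byte keys keep each
    // node within a 64-byte cache line.
    constexpr std::size_t KEYS_PER_NODE = 15;

    struct CssNode {
        int32_t keys[KEYS_PER_NODE]; // sorted; pad unused slots with INT32_MAX
    };

    // The j-th child of node i is found by index arithmetic alone.
    inline std::size_t child(std::size_t i, std::size_t j) {
        return i * (KEYS_PER_NODE + 1) + j + 1;
    }

    bool search(const std::vector<CssNode>& nodes, int32_t key) {
        std::size_t i = 0;
        while (i < nodes.size()) {
            const CssNode& n = nodes[i];
            std::size_t j = 0;
            while (j < KEYS_PER_NODE && key > n.keys[j]) ++j;
            if (j < KEYS_PER_NODE && key == n.keys[j]) return true;
            i = child(i, j); // descend with no pointer load
        }
        return false; // assumes real keys are < INT32_MAX
    }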

~~~
misframer
> _ResearchGate is currently down for maintenance, but we'll be back online
> very soon._

Do you have a mirror?

~~~
flgr
Since I haven't set up any other domain to put it on right now (should get to
that), I put it here:

[http://webclonk.flgr.me/index-search-modern-cpus.pdf](http://webclonk.flgr.me/index-search-modern-cpus.pdf)

------
danbruc
Here is a really good article on the topic [1], »Binary Search Is a
Pathological Case for Caches«.

[1] [http://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-patholo...](http://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathological-case-for-caches/)

------
vvanders
Data locality matters _so_ much.

This talk from Herb Sutter shows it wonderfully, starting at 29:00:
[http://channel9.msdn.com/Events/Build/2014/2-661](http://channel9.msdn.com/Events/Build/2014/2-661)

~~~
melling
The table in this Coding Horror post helps to show the point:

[http://blog.codinghorror.com/the-infinite-space-between-word...](http://blog.codinghorror.com/the-infinite-space-between-words)

    Operation               Latency     Scaled (1 cycle = 1 s)
    1 CPU cycle               0.3 ns    1 s
    Level 1 cache access      0.9 ns    3 s
    Level 2 cache access      2.8 ns    9 s
    Level 3 cache access     12.9 ns    43 s
    Main memory access       120 ns     6 min

~~~
vvanders
That's not even the half of it, though.

One of the main points from Herb Sutter's talk is that the prefetcher acts
like a cache of _infinite_ size. This is why radix sort destroys other
in-memory sorting algorithms, and why how you access data is the most
critical performance factor to focus on.

It's also incredibly hard to retrofit when you find out you really need that
performance back.

~~~
wfunction
> One of the main points from Herb Sutter's talk is your prefetcher is a cache
> of infinite size. This is why Radix destroys other in-memory sorting
> algorithms

Can you elaborate? Radix sort seems like the most unpredictable sorting
algorithm out there last time I checked... it jumps all over memory.

~~~
vvanders
Radix sort on a fixed-width key is actually one of the most predictable
sorting algorithms there is. A traditional radix sort runs in O(kn), where k
is the number of passes, i.e. how many digits you split the key into. Bucket
tables have 2^b entries for a b-bit digit, so the trade-off is: wider digits
mean fewer passes but a bigger table; narrower digits mean a smaller table
but more passes.

Say you're sorting 32-bit values: you might split the key into four
byte-aligned digits, which gives you four passes over a 256-entry bucket
table.

The sort is then a linear walk of the data doing the standard radix-sort
passes. This is where the prefetcher and packed arrays really start working
to your advantage: the memory access pattern is trivially predictable, and
the prefetcher gets to work eliminating the DRAM cache misses. Some
implementations hint at the fetches explicitly, but it usually takes only a
few loop iterations for the prefetcher to do just as well on its own. If you
want, parts of the passes can be batched (e.g. building all the histograms in
a single read pass) to do better than the straightforward O(kn) sweeps.
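
In code, the byte-wise version sketches out roughly like this (a minimal
illustration, not a tuned implementation):

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // LSD radix sort of 32-bit keys: 4 passes, one per byte, each pass
    // using a 256-entry bucket table.
    void radix_sort_u32(std::vector<uint32_t>& a) {
        std::vector<uint32_t> tmp(a.size());
        for (int pass = 0; pass < 4; ++pass) {
            int shift = pass * 8;
            std::size_t count[256] = {0};
            // Histogram pass: a linear read; the only "random" writes
            // land in this tiny 256-entry table.
            for (uint32_t x : a) ++count[(x >> shift) & 0xFF];
            // Prefix-sum the counts into output offsets.
            std::size_t offset = 0;
            for (std::size_t i = 0; i < 256; ++i) {
                std::size_t c = count[i];
                count[i] = offset;
                offset += c;
            }
            // Scatter pass: another linear read; writes stream into at
            // most 256 runs, and the pass is stable.
            for (uint32_t x : a) tmp[count[(x >> shift) & 0xFF]++] = x;
            a.swap(tmp);
        }
    }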

Now, if you want to get fancy, you can take your key, work out which digit
width makes the bucket table line up nicely with your cache line size, and
watch radix sort really scream.

This is used a lot in computer graphics for determining draw order,
intersection tests, etc. (In fact, the book "Real-Time Collision Detection"
is some of the best data-structure literature I've read; it just happens to
be about 2D/3D problem spaces.)

This is why understanding the implications of memory access is so important.
Paraphrasing the talk: Big O notation is really a blunt tool, and sometimes
those constant factors can be incredibly painful.

[edit] Radix sorts also have the nice property of being stable, which can be
quite useful.

~~~
wfunction
> on a fixed width key

Okay well that changes things a bit. I was thinking on general data, like
strings.

> The sort then becomes a linear walk of the data doing the standard Radix
> sort.

Does it? What about all the buckets which you write randomly to?

~~~
vvanders
Strings are going to be bad anyway, because there's a high chance you're
hitting a cache miss just to get at the heap-allocated string data.

Yes, only the histogram pass has random writes, and if you pick your bucket
size appropriately (like I mentioned above), the surface area of those
random writes lands within a cache line or two.

Summing the histograms is a very nice linear read/write pass.

------
codepie
Deciding on a memory layout is a sub-task of designing cache-oblivious
algorithms, which optimize the number of memory transfers without knowing the
cache parameters. I found this survey very interesting:
[http://erikdemaine.org/papers/BRICS2002/paper.pdf](http://erikdemaine.org/papers/BRICS2002/paper.pdf)
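
For a taste, the core of the van Emde Boas layout the paper describes is a
short recursion. A minimal sketch, assuming a complete binary tree with BFS
numbering (node i's children are 2i and 2i+1, root = 1):

    #include <cstddef>
    #include <vector>

    // Append the van Emde Boas order of a complete subtree: cut the tree
    // at half height, emit the top piece first, then each bottom piece,
    // recursing into all of them. Some level of the recursion matches any
    // cache's block size, which is what makes the layout cache-oblivious.
    void veb_order(std::size_t root, int h, std::vector<std::size_t>& out) {
        if (h <= 0) return;
        if (h == 1) { out.push_back(root); return; }
        int top_h = h / 2;      // height of the top piece
        int bot_h = h - top_h;  // height of each bottom piece
        veb_order(root, top_h, out);
        // The bottom subtrees hang off the top piece's leaves; in BFS
        // numbering their roots are root << top_h ... (root << top_h) + 2^top_h - 1.
        std::size_t first = root << top_h;
        for (std::size_t i = 0; i < (std::size_t(1) << top_h); ++i)
            veb_order(first + i, bot_h, out);
    }
    // veb_order(1, h, out) yields, for each memory slot, the BFS index of
    // the node stored there; invert the permutation to place sorted data.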

~~~
agumonkey
Everything Demaine works on looks so new, simple, and interesting. Succinct
data structures, origami/folding proofs, now this.

------
gsg
Interesting.

A while back I did some rough tests on searching arrays laid out in
depth-first order. The hope was that the better contiguity (the next element
in the array is the next element to test 50% of the time) would lead to
better search performance, but I wasn't able to observe much of a difference
in practice.

I also found that while writing the search operation was very easy for arrays
in this order, insertion was much more difficult.
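
For reference, the search can be as simple as this (a minimal sketch;
storing each left subtree's size is one way to locate the right child, and
the node format is illustrative):

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // Pre-order (depth-first) layout: a node's left child is the very next
    // array element (hence the ~50% contiguity); the right subtree starts
    // right after the left subtree, whose size is stored per node.
    struct Node {
        int32_t  key;
        uint32_t left_size; // number of nodes in the left subtree
    };

    bool search_preorder(const std::vector<Node>& t, int32_t key) {
        std::size_t i = 0, end = t.size(); // current subtree is [i, end)
        while (i < end) {
            if (key == t[i].key) return true;
            std::size_t left_end = i + 1 + t[i].left_size;
            if (key < t[i].key) {
                i += 1;         // left child is adjacent in memory
                end = left_end; // stay within the left subtree
            } else {
                i = left_end;   // right subtree follows the left one
            }
        }
        return false;
    }

Insertion is the hard part for exactly this reason: adding a node shifts the
tail of the array and invalidates the left_size fields along the whole path.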

