
Memory Layouts for Binary Search – V2 - nkurz
http://cglab.ca/~morin/misc/arraylayout-v2/
======
derf_
For those who want a bit more than "we trade bandwidth for latency" and a bit
less than a 43-page paper (not to discourage anyone, it's very well-written):

In a binary-heap ("Eytzinger") layout, the next node to consider is at either
index (2i + 1) or (2i + 2). That means you can perfectly predict the next
cache line to load. This remains true up to four iterations in advance: with
4-byte values and 64-byte cache lines (16 values per line), the 16 descendants
four levels down fit in a single, properly aligned cache line. Speculative
execution will begin these loads early,
regardless of whether the branch prediction is ultimately successful. A
branch-free algorithm can do the same with explicit prefetch instructions to
avoid the pipeline stalls of mis-prediction. However, in each cache line, you
only look at one value (1/16th of the data transferred).
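A minimal sketch of the idea (not the paper's exact code; function names are mine): a branch-free lower-bound search over an Eytzinger-ordered array, issuing an explicit prefetch four levels ahead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fill `out` with the Eytzinger ordering of `sorted`: node i has its
// children at indices 2i+1 and 2i+2 (0-based). An in-order walk of the
// implicit tree visits the sorted values in order.
std::size_t eytzinger_build(const std::vector<int>& sorted,
                            std::vector<int>& out,
                            std::size_t i = 0, std::size_t src = 0) {
    if (i < out.size()) {
        src = eytzinger_build(sorted, out, 2 * i + 1, src); // left = smaller
        out[i] = sorted[src++];                             // in-order fill
        src = eytzinger_build(sorted, out, 2 * i + 2, src); // right = larger
    }
    return src;
}

// Branch-free lower bound: Eytzinger index of the smallest element >= x,
// or a.size() if every element is smaller.
std::size_t eytzinger_lower_bound(const std::vector<int>& a, int x) {
    std::size_t i = 0;
    while (i < a.size()) {
        // Prefetch 4 levels ahead: descendants 16i+15 .. 16i+30 share one
        // 64-byte line (4-byte values, suitably aligned). Overshooting the
        // array end is normally harmless -- but see the "broken CPUs"
        // caveat elsewhere in this thread.
        __builtin_prefetch(a.data() + (16 * i + 15));
        i = 2 * i + 1 + (a[i] < x); // no branch: the comparison picks the child
    }
    // The loop runs past a leaf; stripping the trailing 1-bits of i+1
    // recovers the last left turn, which is the answer.
    std::size_t j = (i + 1) >> __builtin_ffsll(~(long long)(i + 1));
    return (j == 0) ? a.size() : j - 1;
}
```

`__builtin_prefetch` and `__builtin_ffsll` are GCC/Clang extensions; the recursive builder is for clarity, not speed.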

You can attempt to pack the data for several iterations into a single cache
line to increase utilization. However, because a complete binary tree has 2^n
- 1 values (i.e., because it is odd), there are some small inefficiencies. For
example, four iterations of a binary search require (1 + 2 + 4 + 8) = 15
possible values, of which you will look at four (1/4 of the data transferred,
given the same 16-element cache line size). But either you include an extra
element and thus sometimes need 5 comparisons instead of 4 (B-tree algorithm),
or you mis-align the subtrees with cache line boundaries (van Emde Boas
algorithm), or you waste space by only putting 15 values in a cache line (not
tested). And you can't issue the load for the next cache line until the
current subtree search completes.
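A rough sketch of the B-tree variant's inner step, under my own naming (the paper's code differs): 16 keys fill a 64-byte line exactly, giving a 17-way node, and since 17 > 2^4 a binary search among the children needs up to 5 comparisons. This sketch sidesteps that with a branch-free rank pass instead.

```cpp
#include <cstddef>

// One cache line's worth of keys: 16 x 4-byte ints, line-aligned.
struct alignas(64) Node16 { int keys[16]; };

// Rank of x in the node: how many keys are < x (0..16). Real code
// might use SIMD comparisons; a plain loop shows the idea.
int rank16(const Node16& n, int x) {
    int r = 0;
    for (int k : n.keys) r += (k < x); // branch-free accumulate
    return r;
}

// Implicit 17-ary layout, nodes stored level by level:
// node b's children are nodes 17*b + 1 .. 17*b + 17.
std::size_t child16(std::size_t b, int r) { return 17 * b + 1 + r; }
```

Note the point made above still bites: the load for `child16(b, r)` can't be issued until `rank16` has finished with the current line.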

Even though the latter approaches have up to four times better utilization of
the data they load, as long as the bandwidth-delay product is at least 4 cache
lines (i.e., you can issue 4 requests in parallel without slowing any of them
down), the Eytzinger approach will win. They measured the bandwidth-delay
product at 4.7 on their machine, and overall algorithm comparisons seem to be
consistent across over 100 different CPUs.

If the data fits in L2 cache (2^16 values), a branchless prefetching algorithm
on a simple sorted array is still slightly faster than using the Eytzinger
ordering, just because of constant factors. The difference is small (15%ish on
average, estimating from the plots).

Some broken CPUs (Phenom II, Celeron J1900, E3-1230) seem to suffer large
slowdowns when prefetching invalid addresses. Masking the address fixes the
problem with no measurable performance impact on non-broken CPUs.
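The masking fix might look something like this (a sketch; the paper's actual code may differ in detail). The assumption here is that `mask` is one less than the array size rounded up to a power of two.

```cpp
#include <cstddef>

// Clamp the prefetch index into bounds with a mask instead of a branch,
// so the prefetched address is always valid even past the last level.
inline void masked_prefetch(const int* base, std::size_t i, std::size_t mask) {
    __builtin_prefetch(base + (i & mask)); // address is always in-array
}
```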

------
beagle3
The graphs are extremely interesting - there's a significant qualitative
difference in what works best: in-cache, "least branches, least work" wins;
out-of-cache, "least fetches" wins. Not at all surprising, of course, but a
reminder that there's no point discussing these issues independent of the
cache size, and no point optimizing without knowing the approximate cache
size and memory footprint of your real data.

------
corysama
> Spoiler: It has to do with hiding latency by overusing bandwidth

Not a perfect win, but probably a good trade-off, given that improving
hardware bandwidth has been much cheaper than improving hardware latency (and
will likely remain so for the foreseeable future).

------
jbapple
See also "Optimal Hierarchical Layouts for Cache-Oblivious Search Trees", by
Peter Lindstrom and Deepak Rajan:
[http://arxiv.org/abs/1307.5899](http://arxiv.org/abs/1307.5899).

~~~
pkhuong
The layout is definitely interesting, but the experiments in that paper are of
questionable usefulness. The numbers are for simulated caches (cachegrind),
and don't take TLBs into account. They also consider only complete binary
trees, which are a worst case for binary search. I pointed that issue out to
the authors: using ternary search cuts L1 and L2 misses by about 40-60% (vs.
binary search), without any change in layout.

If I want to evaluate the Min-WEP layout for a real application, I have to run
my own experiment to weigh the impact of reducing cache misses versus constant
factors _knowing that even the cache miss numbers are way off for the simplest
layout (sorted array)_. The performance information in the paper can't help me
make even an educated guess.

That's why I really liked Pat Morin's approach of measuring real (wall-clock)
time on a variety of microarchitectures… which helped us realise that cache
misses aren't that good of a metric these days.

------
FullyFunctional
The OP asks for people to try the code, but the provided C++ code requires a
fairly recent g++. I could have tried a couple of targets, but I don't want to
waste my time porting it. What about this requires C++?

~~~
pkhuong
Convenience stuff in the test harness.

------
uxcn
Does anyone know if there are any similar studies for hash tables (e.g.
algorithm, bucket layouts, probe sequences)?

~~~
nkurz
Not quite what you are asking for, but I think this is a similarly smart paper
about Cuckoo Hashes that takes the right approach for modern processors, and
includes some background on how other approaches fare:
[http://www.cs.columbia.edu/~orestis/sigmod15.pdf](http://www.cs.columbia.edu/~orestis/sigmod15.pdf)

~~~
uxcn
Interesting results. Thanks, I'll have to take a deeper look through the
paper.

Cuckoo hashing seems to be the standard, but I still haven't seen concrete
numbers comparing different variations or other schemes like hopscotch,
particularly for latency and throughput. Hopscotch seems like it might give
better latency. For simple stuff, I've found that a good hash with linear
probing generally outperforms other simple schemes up to roughly 50% load,
but I honestly haven't had a need to dig further into other schemes so far.
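The linear-probing scheme mentioned above can be sketched in a few lines (a toy, with my own names: power-of-two table, a strong mixing hash, no resizing, and key 0 reserved as the empty marker):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct ProbeTable {
    // 0 marks an empty slot, so this toy table can't store key 0.
    std::vector<uint64_t> slots;
    explicit ProbeTable(std::size_t pow2_size) : slots(pow2_size, 0) {}

    // splitmix64 finalizer: a cheap, well-mixing hash for integer keys.
    static uint64_t mix(uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    // No resize: the caller is expected to keep the load under ~50%.
    bool insert(uint64_t key) {
        std::size_t mask = slots.size() - 1;
        for (std::size_t i = mix(key) & mask;; i = (i + 1) & mask) {
            if (slots[i] == key) return false;            // already present
            if (slots[i] == 0) { slots[i] = key; return true; }
        }
    }

    bool contains(uint64_t key) const {
        std::size_t mask = slots.size() - 1;
        for (std::size_t i = mix(key) & mask;; i = (i + 1) & mask) {
            if (slots[i] == key) return true;
            if (slots[i] == 0) return false; // empty slot ends the probe run
        }
    }
};
```

The cache behaviour is the appeal: at low load, a probe run rarely crosses more than one or two cache lines.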

