Hacker News

For those who want a bit more than "we trade bandwidth for latency" and a bit less than a 43-page paper (not to discourage anyone, it's very well-written):

In a binary-heap ("Eytzinger") layout, the next node to consider is at index 2i + 1 or 2i + 2. That means you can perfectly predict the next cache line to load, and this stays true up to four iterations ahead: all 16 possible descendants four levels down occupy consecutive indices, which (for 4-byte values, with proper alignment) fit in a single 64-byte cache line. Speculative execution will begin these loads early whether or not the branch prediction turns out to be correct. A branch-free algorithm can do the same with explicit prefetch instructions, avoiding the pipeline stalls of misprediction. The downside: in each cache line you only look at one value, so only 1/16th of the data transferred is used.
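A minimal sketch of such a branch-free Eytzinger search in C (my own naming and details, not the paper's code; the array is 1-indexed so that node i has children 2i and 2i + 1, which makes the index arithmetic and the four-levels-ahead prefetch clean):

```c
#include <stddef.h>

/* Branch-free lower-bound search over an array in Eytzinger (BFS) order.
 * a[1..n] holds the values, 1-indexed: node i has children 2i and 2i + 1.
 * Returns the Eytzinger index of the smallest element >= x, or 0 if none. */
size_t eytzinger_search(const int *a, size_t n, int x)
{
    size_t i = 1;
    while (i <= n) {
        /* The 16 possible positions four levels down are indices
         * 16i .. 16i + 15 -- one 64-byte line of 4-byte values --
         * so start loading that line now. Prefetching past the end
         * of the array is harmless on most (non-broken) CPUs. */
        __builtin_prefetch(&a[16 * i]);
        i = 2 * i + (a[i] < x);   /* branch-free descent */
    }
    /* The loop runs past the leaves; strip the trailing 1-bits (right
     * turns) plus one 0-bit to recover the last left turn taken. */
    i >>= __builtin_ffsll(~(long long)i);
    return i;
}
```

The descent compiles to a compare and an add per level, with no conditional branch for the predictor to get wrong.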

You can try to pack the data for several iterations into a single cache line to increase utilization. However, because a complete binary tree has 2^n - 1 values (an odd number), there are some small inefficiencies. For example, four iterations of a binary search cover 1 + 2 + 4 + 8 = 15 possible values, of which you inspect four (1/4 of the data transferred, with the same 16-value cache lines). So either you include an extra element and thus sometimes need 5 comparisons instead of 4 (the B-tree algorithm), or you misalign the subtrees with cache-line boundaries (the van Emde Boas algorithm), or you waste space by putting only 15 values in a cache line (not tested). And you can't issue the load for the next cache line until the current subtree search completes.
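To make the B-tree variant concrete, here is a hypothetical sketch of the inner-node step (my own simplification, not the paper's source): 16 keys exactly fill one cache line, and a branch-free scan computes which of the node's 17 children to descend into — the "extra element" that can push one search path to 5 comparisons' worth of levels instead of 4:

```c
/* Count the keys in a 16-key B-tree node that are < x. The count is
 * the rank, i.e. which of the node's 17 children to descend into.
 * With 4-byte keys, the whole node is exactly one 64-byte cache line. */
int btree_node_rank(const int *node, int x)
{
    int r = 0;
    for (int k = 0; k < 16; k++)
        r += (node[k] < x);   /* comparison sum; no unpredictable branch */
    return r;
}
```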

Even though the latter approaches use up to four times more of the data they load, the Eytzinger approach wins as long as the bandwidth-delay product is at least 4 cache lines (i.e., you can keep 4 requests in flight without slowing any of them down). The authors measured a bandwidth-delay product of about 4.7 on their machine, and the relative ranking of the algorithms appears consistent across more than 100 different CPUs.

If the data fits in L2 cache (2^16 values), a branchless prefetching algorithm on a simple sorted array is still slightly faster than using the Eytzinger ordering, just because of constant factors. The difference is small (15%ish on average, estimating from the plots).
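For reference, the sorted-array competitor looks something like this (a sketch under my own naming, not the paper's source): a branchless lower_bound that halves the range with a conditional pointer bump and prefetches both possible next midpoints, since it can't know which half it will take until the comparison resolves:

```c
#include <stddef.h>

/* Branch-free lower-bound search over a plain sorted array.
 * Returns the number of elements < x (the insertion point for x). */
size_t branchless_lower_bound(const int *a, size_t n, int x)
{
    const int *base = a;
    while (n > 1) {
        size_t half = n / 2;
        /* Prefetch both candidate next midpoints so one of the loads
         * is already in flight before the comparison result is known. */
        __builtin_prefetch(&base[half / 2]);
        __builtin_prefetch(&base[half + half / 2]);
        base += (base[half - 1] < x) * half;   /* branch-free bump */
        n -= half;
    }
    return (size_t)(base - a) + (*base < x);
}
```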

On some broken CPUs (Phenom II, Celeron J1900, E3-1230), prefetching invalid addresses seems to cause large slowdowns. Masking the prefetch address to keep it in range fixes the problem, with no measurable performance impact on non-broken CPUs.
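The masking fix can be sketched like this (hypothetical helper, my own naming; it assumes the array's capacity is padded up to a power of two, so capacity − 1 is an all-ones bitmask):

```c
#include <stddef.h>

/* Prefetch a[i], but clamp the index with a bitmask so a speculative
 * out-of-range i can never produce an invalid address. mask is
 * capacity - 1, with capacity rounded up to a power of two. Returns
 * the clamped index purely so the wrap-around is observable. */
size_t prefetch_masked(const int *a, size_t i, size_t mask)
{
    size_t j = i & mask;       /* e.g. i = 100, mask = 15  ->  j = 4 */
    __builtin_prefetch(&a[j]); /* always within the padded array */
    return j;
}
```

A single AND on the index is cheap enough to vanish in the noise, which matches the "no measurable impact" observation.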


