Binary Search: A new implementation that is up to 25% faster (github.com)
171 points by scandum 19 days ago | 54 comments



The readme doesn't describe why it's faster. Looking at the code of the variant that is the fastest: https://github.com/scandum/binary_search/blob/master/binary-...

It seems to be a quaternary search, which looks like an arbitrary "magic number" of interior points. It's easy to understand what it does if you already know other variants like ternary search (cut the search space into 3 parts to pick 2 interior points) or golden section search (the same thing as ternary, except using the golden ratio). Here, quaternary search just picks 3 interior points after dividing into 4 parts.

So the speedup comes from the same place B-trees get theirs: increasing the branching factor costs more comparisons but reduces the depth. I might be wrong, but instead of quaternary it could just as well be 5-ary, 8-ary, or any B-ary, and any of those variants could potentially perform better too.

It's just a tradeoff in the total cost: cost_of_divide * log_B(N) + cost_of_compare * (B - 1) * log_B(N)

EDIT: Thinking about it more, the division doesn't seem like it should be the most expensive operation (especially relative to compares/branching). Anyone have any better ideas on why you would prefer more compares? Is it some pipelining thing?


Wider branching factors (3-ary, 4-ary, etc) are less efficient in the total number of comparisons, but give you more memory level parallelism, and large searches are dominated by memory access behavior and critical paths, not total comparisons or instruction throughput. So better MLP can make up for the inefficiency of higher arity searches... to a point.

E.g., with 4-ary search, your search tree is half the depth of binary search (effectively, you are doing two levels of binary search in one level), but you do 3x the number of comparisons, so 1.5x more in total.

However, the comparison (1 cycle) is very fast compared to the time to fetch the element from memory (5, ~12, ~40, ~200+ cycles for L1, L2, L3 and DRAM hit respectively). The 4-ary search can issue the 3 probes in parallel, and so if the total time is dominated by the memory access, you might expect it to be ~2x faster. In practice, things like branch mispredictions (if the search is branchy) complicate the picture.
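
To make that concrete, here's a rough sketch of a 4-ary step (my own illustration, not the repo's code; plain find semantics over a sorted int array, made-up names). The three probe loads don't depend on each other, so all three can miss the cache at once:

    /* Hedged sketch, not the repo's code: one 4-ary step issues three
     * probe loads with no dependency on each other, so the out-of-order
     * core can have all three cache misses in flight at once. */
    #include <stddef.h>

    /* a[0..n) sorted ascending; returns an i with a[i] == key, else (size_t)-1 */
    size_t quaternary_find(const int *a, size_t n, int key)
    {
        size_t lo = 0;
        while (n > 3) {
            size_t q = n / 4;
            int p1 = a[lo + q];              /* three independent loads:    */
            int p2 = a[lo + 2 * q];          /* none of them waits on a     */
            int p3 = a[lo + 3 * q];          /* comparison from this level  */
            if      (key < p1) { n = q; }
            else if (key < p2) { lo += q;     n = q; }
            else if (key < p3) { lo += 2 * q; n = q; }
            else               { lo += 3 * q; n -= 3 * q; }
        }
        for (size_t i = lo; i < lo + n; i++)  /* tiny tail: linear scan */
            if (a[i] == key) return i;
        return (size_t)-1;
    }

The comparisons themselves are still branchy here; the point is only that the three loads don't depend on each other.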


Sounds right. I think the fastest solution (of this approach) would do something like: fetch a cache-line-sized batch; check its upper/lower ends; depending on the result, either choose the next batch by binary division or do a linear walk to find the match. Not sure whether doing it precisely would be an improvement over the magic numbers in the example, though.

Then again, I'd like to experiment with prefetch here as well. It may be possible to squeeze out even more performance.


I don't think the trick of checking each end of the cache line helps much, except perhaps at the very end of the search (where there are say low 100s of bytes left in the region).

When the region is large it barely cuts the search space down at all: it's a rounding error.

Now you might think it's basically free, but clogging up the OOOE structures with instructions can hurt MLP because fewer memory accesses fit in the window, even if the instructions always completely execute in the shadow of existing misses.

There is a similar trick with pages: you might try to favor not touching more pages than necessary, so it might be worth moving probe points slightly to avoid touching a new page. For example, if you just probed at bytes 0 and 8200, the middle is 4100, but that's a new 4k page (bytes 4096 thru 8191), so maybe you adjust it slightly to probe at 4090 since you already hit that page with your 0 probe.

Making all of these calculations dynamically is a bit annoying and maybe slow, so it's best if the whole search is unrolled, so that the probe points (according to whatever rules) can be calculated at compile time and embedded into the code.

Much more important than either of these things is avoiding overloading the cache sets. Some implementations have this habit of choosing their probe points such that only a very small number of sets in the L1, L2 are used so the caching behavior is terrible. Paul Khuong talks about this in detail on pvk.ca.

Doing a linear walk at the end definitely helps a bit though: SIMD is fastest for this part.
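
For that last part, something along these lines works (an SSE2 sketch of my own, assuming 32-bit signed keys): scan the leftover region 4 elements at a time for the first element >= key.

    #include <emmintrin.h>
    #include <stddef.h>

    /* Scan the small leftover region a[0..n) for the first element >= key,
     * 4 elements per step.  Assumes 32-bit signed keys and SSE2 (x86). */
    size_t simd_tail_lower_bound(const int *a, size_t n, int key)
    {
        const __m128i vkey = _mm_set1_epi32(key);
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128i v  = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i lt = _mm_cmplt_epi32(v, vkey);  /* 0xFFFFFFFF where a[j] < key */
            int mask   = _mm_movemask_epi8(lt);     /* 4 mask bits per lane        */
            if (mask != 0xFFFF)                     /* some lane is >= key         */
                return i + (size_t)__builtin_ctz(~mask) / 4;
        }
        for (; i < n; i++)                          /* scalar remainder */
            if (a[i] >= key) return i;
        return n;                                   /* everything is < key */
    }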


If you want to optimize a data structure for binary search, it sounds like it might be best to reorder the data itself in memory to make caching more effective.

For example the first access in a binary search will be the middle element, followed by the lower quartile or upper quartile. If you store all of those together in memory, a single cache line fetch can serve all those requests.


See user `fluffything`'s comment on this same thread: https://news.ycombinator.com/item?id=23895629, describing this approach (Eytzinger layout).


A levelwise traversal usually performs well without much effort if you want to optimize for cache friendliness. It's not optimal iirc, but it's dead-simple and much easier on the prefetcher than an inorder traversal.

Edit: Oops, TIL the eytzinger layout other comments have mentioned _is_ a levelwise traversal. Anywho, it's usually a good starting point for this kind of thing.


Yes, although the design space explodes once you can reorder, so it's also interesting to consider the restricted problem where you can't reorder. This is still realistic since in many scenarios the layout may be fixed by an external constraint.


Binary search is used when you can't optimize the data structure. You have a sorted array, and have to work with that.

If you store the middle element together with its first two quartiles, you're making a B-tree. You will need indirection (pointers) to locate the other chunks that are not stored together and so it goes.

"Binary search" usually refers to the sorted array algorithm, though of course in the abstract sense of subdivision, any subdivision-by-two search is binary and in that sense we have the term "binary search tree".


Not really.

On a GPU, for example, if you have an array with 100,000 elements and want to find the index of one value, you are probably best off using 100,000 threads to find that value.

So that's 100,000-ary search.

On a CPU, using an Eytzinger layout will probably also give you better performance than any of the N-ary approaches.


I've never heard of the Eytzinger layout, but it seems like it's just a static binary tree on an array (children are at 2i and 2i+1), commonly used for heaps and segment trees.

Very readable blog post on how it helps with binary search: https://algorithmica.org/en/eytzinger

With follow up post on N-ary approaches: https://algorithmica.org/en/b-tree
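
For reference, the core of those posts boils down to roughly this (my paraphrase in C, 1-indexed like the algorithmica code, so b needs n+1 slots and b[0] is unused; names are mine):

    #include <stddef.h>

    /* Fill b[1..n] with the BFS (Eytzinger) order of sorted array a[0..n). */
    size_t eytzinger_build(const int *a, int *b, size_t n, size_t i, size_t k)
    {
        if (k <= n) {
            i = eytzinger_build(a, b, n, i, 2 * k);      /* left subtree  */
            b[k] = a[i++];                               /* this node     */
            i = eytzinger_build(a, b, n, i, 2 * k + 1);  /* right subtree */
        }
        return i;
    }

    /* Returns the 1-based index in b of the first element >= x,
     * or 0 if every element is < x. */
    size_t eytzinger_search(const int *b, size_t n, int x)
    {
        size_t k = 1;
        while (k <= n)
            k = 2 * k + (b[k] < x);             /* go left (2k) or right (2k+1) */
        return k >> (__builtin_ctzll(~k) + 1);  /* undo the trailing right turns */
    }

You'd call eytzinger_build(a, b, n, 0, 1) once up front; after that every search touches the same few cache lines near the top of the tree.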


Yes, it's essentially a heap. These just happen to be cache-oblivious for performing binary searches.


Random idea: a replacement function library that accepts hardware tuning parameters like this and a corresponding tool which sets them once during OS installation/upgrade and provides a config file or environment variable setup, so apps can share the setup...

Bonus if this library also detected specialized hardware (GPU, AVX, etc.) and then leveraged it appropriately.

Obviously, these parameters would need optional per-app-invocation overrides, e.g. so random apps didn't hog shared resources, e.g. for benchmarking, etc.

(I'm guessing this exists but not widely deployed...?)


This was somewhat the approach used in the ATLAS[1] linear algebra library. In the ATLAS case, systems were characterized through testing and a set of parameters generated. Then at installation time code was generated to match the system type using the given parameters.

[1] https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Alg...


So why is it called binary if it’s not binary?


to grab headlines.


If you have some sort of chip that can do B comparisons in parallel, you can search in log(N)/log(B) time.


Any modern chip with SIMD can do comparisons in parallel. For example, almost any recent Intel chip can do 16x 32-bit comparisons in a single cycle.

The bottleneck is usually not the comparison itself, however, but the data movement needed to gather all the elements to be compared. That part can't be vectorized well, since SIMD gathers tend to be inefficient.

The story is probably different on GPUs.


I checked the plot against an old school project where I investigated binary search, and I got similar numbers for the "standard" binary search (std::lower_bound in my project): for N = 1000 / 10000 / 100000 it took 713 / 925 / 1264 us (compare to the author's 750 / 960 / 1256 us).

In my project I didn't test "quaternary search", but instead the winner was the Eytzinger layout that others have discussed, with 570 / 808 / 1044 us (compare to the author's quaternary search: 557 / 764 / 1009 us).

It would be interesting to try a 4-way Eytzinger layout.

My project report, if anyone's interested: https://dl.strova.dk/ksb-manuel-rav-algorithm-engineering-20...

EDIT: I reuploaded the linked PDF since the first one I uploaded was missing the source code.


Binary search can be implemented using conditional move instructions instead of branching. Given that the branches are unpredictable, this yields a huge speedup. It can be implemented in pure C, if it's written in a way the compiler can recognize [1]. In my experience this is the fastest binary search.

[1] https://pvk.ca/Blog/2012/07/03/binary-search-star-eliminates...
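
The gist is roughly this (a sketch, not the exact code from the post): written this way, GCC and Clang will typically turn the base update into a cmov, leaving only the loop-exit branch.

    #include <stddef.h>

    /* Returns the index of the first element >= key in sorted a[0..n), or n. */
    size_t branchfree_lower_bound(const int *a, size_t n, int key)
    {
        if (n == 0) return 0;
        const int *base = a;
        while (n > 1) {
            size_t half = n / 2;
            base = (base[half] < key) ? base + half : base;  /* cmov, no jump */
            n -= half;
        }
        return (size_t)(base - a) + (*base < key);
    }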


Actually cmov-based searches often (usually?) result in a slowdown unless non-obvious measures are taken.

A cmov-based search has the problem that the probes for each level are data-dependent on the prior level. So you cannot get any memory-level parallelism, because the address to probe at level n+1 isn't known until the comparison at level n completes.

A branchy search is right about the next direction at least half the time (in the worst case), 25% of the time for 2 levels down, etc. So, on average in the worst case, it gets about 1 additional useful access in parallel compared to a cmov search.

Except for very small searches that fit in the L1 cache, where misprediction costs dominate the overall time, branchy is competitive and often faster.

Now you can take countermeasures to this problem with the cmov search. One is to increase the arity of the search as in the OP, another is explicit software prefetching, etc, which might put cmov back on top.
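
The prefetching variant looks something like this (a sketch built on the usual branch-free loop; for odd lengths the prefetch addresses are off by at most one element, which is almost always still the same cache line):

    #include <stddef.h>

    /* Branch-free lower bound, plus prefetches of both candidate probes for
     * the *next* level, issued before the current comparison resolves. */
    size_t branchfree_prefetch_lower_bound(const int *a, size_t n, int key)
    {
        if (n == 0) return 0;
        const int *base = a;
        while (n > 1) {
            size_t half = n / 2;
            __builtin_prefetch(&base[half / 2]);         /* next probe if we stay */
            __builtin_prefetch(&base[half + half / 2]);  /* next probe if we move */
            base = (base[half] < key) ? base + half : base;
            n -= half;
        }
        return (size_t)(base - a) + (*base < key);
    }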

This has all been assuming the worst case of totally unpredictable branches. Real world searches might be expected to be less-than-random as some elements are hotter than others, etc. Any predictability of the branches helps out for branchy search.


As usual, it depends. I was doing binary search within btree nodes, which means small arrays that fit into the cache. Branchless search performed best there. Actually even the btree search itself was branchless (if nodes were in memory) until the end, as the pointer to the next node was just loaded via an index calculation and the loop continued with a binary search of the next node. I did a little software prefetching as well once I had the pointer to the next node.


Yes, tightly packed nodes or nodes which are all in L1 is one place where plain cmov can win.

The sweet spot is pretty small though, because it is not only squeezed from above by branchy search, which wins for less-cached data, but also from below by linear vectorized search. You can't always apply vectorized search (e.g. if the keys are not contiguous), but when you can, it can win for anything up to dozens of elements.

When you throw in the possibility of predictability, where branchy wins, you can see why I usually reject unconditional statements like "cmov gives a huge speedup".


That's why the fastest is branchless + eager prefetch + Eytzinger layout.


The fastest depends largely on predictability. If it is totally unpredictable I find that prefetch roughly ties with increasing the arity (indeed, the effect is similar when you consider how it all plays out).

If there is a moderate amount of predictability I don't think you can beat branchy.

If you can change the layout, it's a whole different game. You could write a paper or two on it, and people have.


The more recent result (by the same author) is this paper: https://arxiv.org/pdf/1509.05053.pdf

The conclusion is still that CMOV is best, but once the data stops fitting in the cache there are alternatives worth considering.


I just stumbled on this one independently while researching the topic some more. It's an interesting read, thanks for sharing.

> For readers only interested in an executive summary, our results are the following: For arrays small enough to be kept in L2 cache, the branch-free binary search code is the fastest algorithm. For larger arrays, the Eytzinger layout, combined with the branch-free prefetching search algorithm is the fastest general-purpose layout/search algorithm combination.


This is really cool!

About a decade ago I did research on sorting and searching, which was the same type of work as what's going on here (https://github.com/pbiggar/sorting-branches-caching). I found it was extremely difficult to accurately say whether one algorithm is faster than another in the general case. I found a bunch of speed improvements that probably only apply in processors with very long pipelines (like the P4).

Execution speed is probably the right metric here, but it makes it hard to understand _why_ it's faster.

PS. Looking at "boundless interpolated search" (https://github.com/scandum/binary_search/blob/master/binary-...), it seems to be missing a `++checks`. I initially thought this could cause it to be misreported as faster, but I see the benchmark is run-time based, so that wouldn't matter unless the incrementing itself were the bottleneck.


It is nice to see benchmarks of elementary algorithms, but none of these is new. The benchmarking would be more convincing if more distributions were used. For example, we know that interpolated search is O(lg lg n) for the uniform distributions used here, but it can be linear in pathological cases.


Exponentially distributed data is a great example of that. It's fun that the author picks 32-bit ints; I suspect that you just don't have enough dynamic range to really hobble interpolated search.


It would be interesting to see an analysis of interpolated search with other known but non-uniform distributions.


Most real-world applications I've seen that use in-memory search use balanced binary search trees. Are any of these improvements applicable to those data structures? AVL and Red-black trees come to mind.


AVL and Red-black trees are pretty terrible and can easily be improved upon by cache-friendlier data structures such as B-trees.


On the topic of binary searches, I was wondering if they can be done without any branching so you can avoid branch misses and possibly parallelize it with SIMD instructions. I think I was able to get it with some optimizer flags and GCC built-ins.

https://github.com/ehrmann/branchless-binary-search


Yes, you can do it branch free, with only a single branch at the end (so your search actually terminates), but the benefits are not as obvious as you might think, per my other reply: https://news.ycombinator.com/item?id=23894709


You can avoid the branch at the end by computing your iteration count beforehand and using a switch-statement jump table where each case is a step in the search, and you iterate by falling through.
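
Something like this, I imagine (a rough sketch with made-up names, truncated to n <= 16 for brevity; a real version would carry the cases up to 64):

    #include <stddef.h>

    size_t switch_lower_bound(const int *a, size_t n, int key)
    {
        if (n == 0) return 0;
        const int *base = a;
        size_t half;
        /* number of halving steps needed: ceil(log2(n)) */
        int steps = (n > 1) ? 64 - __builtin_clzll((unsigned long long)(n - 1)) : 0;

        switch (steps) {   /* one indirect jump, then straight-line fall-through */
        case 4: half = n / 2; base = (base[half] < key) ? base + half : base; n -= half;
                /* fall through */
        case 3: half = n / 2; base = (base[half] < key) ? base + half : base; n -= half;
                /* fall through */
        case 2: half = n / 2; base = (base[half] < key) ? base + half : base; n -= half;
                /* fall through */
        case 1: half = n / 2; base = (base[half] < key) ? base + half : base; n -= half;
                /* fall through */
        case 0: break;
        }
        return (size_t)(base - a) + (*base < key);
    }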


Yes, but this just swaps the terminating conditional branch for an indirect branch at the start, no?

In general my guideline is: for an algorithm which has variable input size, do you want to do a variable amount of work? If yes, you will have at least one branch, and this branch will be unpredictable with at least some distributions of unpredictable input sizes.

In this case I think you definitely want "variable work", because a search over four elements should be shorter than a search over one million.


> Yes, but this just swaps the terminating conditional branch for an indirect branch at the start, no?

Depends how you write it. The number of iterations is ceil(log2(n)). GCC's __builtin_clz essentially computes this, and it has support on most major architectures.


Yes, you can calculate the number of levels, but with a dynamic size that isn't going to let you avoid a branch. Say you calculate there are 10 levels in the search. What do you do now?


Do you consider a jump a branch? My impression was always the point of avoiding branches is so you don't have mispredictions, but that shouldn't be a problem for unconditional branches.

The other option is running all 32 (or 64) steps.


Perhaps I've been a bit loose in my language (and not everyone is consistent with branch vs jump). I really should have said something like "non-constant jump". That is, in my theorem (haha) above, I mean you cannot have non-constant work without at least one non-constant jump.

A non-constant jump is anything that can jump to at least 2 different locations in a data-dependent way. On x86, those would be something like [2]:

- conditional jumps (jcc) aka branches

- indirect calls

- indirect jumps

Basically anything that either goes one of 2 ways based on a condition, or an indirect jump/call that can go to any of N locations.

Without one of those, you will execute the same series of instructions every time, so you can't do variable work [1].

So when you say "unconditional branches" I'm not sure if you are talking about direct branches, which always go to the same place (these are usually called jumps), or indirect branches, which (on x86) don't have an associated condition but can go anywhere since their target is taken from a register or memory location.

If you are talking about the former, I don't think you can implement your proposed strategy: you can't jump to the correct switch case with a fixed, direct branch. If you are talking about the latter (which I had assumed), you can – but it is subject to prediction and mispredicts in basically the same way as a conditional branch.

---

[1] Here, "work" has a very specific meaning: instructions executed. There are meaningful ways you can do less work while executing the same instructions: e.g., you might have many fewer cache misses for smaller inputs. However, at some point the instructions executed will dominate.

[2] There are more obscure ways to get non-constant work, such as self-modifying code, but these cover 99% of the practical cases.


If the data to be binary searched is static, that is, if it doesn't change, and if it fits entirely in RAM, then what I would do is as follows:

1) Pre-compute the first mid/center element, C1.

2) Move the data for this mid/center element to the first item in a new array.

3) Pre-compute the next two mid/center elements, that is, the one between the start of the data and C1 (call it C2), and the one between C1 and the end of the data (C3).

4) Move the data from C2 and C3 positions to the next positions in our array, the 2nd and 3rd position.

5) Keep repeating this pattern. Every iteration/split produces twice as many midpoints as the previous one. Order these linearly as the next items in our new array.

When you're done, two things will occur.

1) You will use a slightly different binary search algorithm.

That is because you no longer need to compute a mid-point at every iteration, those are now pre-computed in the ordered array.

2) Because the data is now laid out in probe order, it becomes possible to keep the tip of that data in the CPU's L1, L2, and L3 caches. If, let's say, your binary search takes 16 iterations to complete, then you might get a good headstart of 5-8 iterations (or more, depending on cache size and data size) of that data being in cache, which will make those iterations MUCH faster.

Also, (and this is just me), but if your program has appropriate permissions to temporarily shut off interrupts (x86 cli/sti -- or the OS API equivalent), then this search can be that much faster (well, it depends on what the overhead for cli/sti and API calls is, but test, test, test! (also, always shut off the network and other threads when you're testing, as they can skew the results!) <g>)

https://en.wikipedia.org/wiki/Memory_hierarchy

"Almost all programming can be viewed as an exercise in caching." --Terje Mathisen

"Assume Nothing" --Mike Abrash

Also, there is no such thing as the fastest binary search algorithm... there's always a better way to do them...

To paraphrase Bruce Lee:

"There are no mountain tops, there is only an endless series of plateaus, and you must ever seek to go beyond them..."


> That is because you no longer need to compute a mid-point at every iteration, those are now pre-computed in the ordered array.

This is called the Eytzinger layout [1].

[1] https://arxiv.org/abs/1509.05053


Did not know that!

Interesting!

Upvoted!


Also, before I forget, if, let's say, English words were being stored, you could have an array of 1..26 pointers (representing letters 'A'..'Z') where each pointer points to one of 26 other similar arrays, representing the second character, etc. This pattern could repeat in memory. Would it be time/space efficient? Depends on the data. Also, this could be combined with the above. Maybe the first few letters of words are stored this way, and the rest are ordered such that a binary search can be performed, as above. Again, it depends on the data, storage, and speed requirements. Sure, you could use one or more hashing techniques, but then what's the speed/memory penalty, and what's the penalty for collisions? So, there are a lot of considerations to be made when selecting a technique for storing/searching data... there is no one-size-fits-all technique; as I said above, there's always a better way to do things...
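
A sketch of that node shape (names are made up; assumes lowercase ASCII input), which also makes the space cost visible: 26 pointers per node whether or not they're used.

    /* Hedged sketch of the 26-way node described above. */
    struct node26 {
        struct node26 *child[26];   /* one slot per letter 'a'..'z'        */
        int            is_word;     /* nonzero if a word ends at this node */
    };

    /* Lookup is one pointer chase per character, no comparisons. */
    static int contains(const struct node26 *root, const char *word)
    {
        for (const struct node26 *n = root; ; word++) {
            if (*word == '\0') return n->is_word;
            n = n->child[*word - 'a'];          /* assumes lowercase a..z */
            if (!n) return 0;
        }
    }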


You'd then search the strings word-wise, by 8 bytes rather than by 1. You just need to represent the strings little- or big-endian, and construct the nested switches offline. About 20x faster than a linear search via memcmp. http://blogs.perl.org/users/rurban/2014/08/perfect-hashes-an...


I think you're describing a very inefficient trie


I think that you've invented a heap array.


indeed, that's how you'd embed a balanced binary search tree in an array.


You aren't really binary searching anymore. If it all fits into RAM, you might as well just put it into a hashset.


You are correct, hashes are tremendously useful for some types of data...

But hashes are not necessarily without problems for other data, most notably if the data has the possibility of collisions:

https://en.wikipedia.org/wiki/Pigeonhole_principle

https://en.wikipedia.org/wiki/Collision_(computer_science)

https://danlark.org/2020/06/14/128-bit-division/

>"When it comes to hashing, sometimes 64 bit is not enough, for example, because of birthday paradox — the hacker can iterate through random 2^{32} entities and it can be proven that with some constant probability they will find a collision, i.e. two different objects will have the same hash. 2^{32} is around 4 billion objects and with the current power capacity in each computer it is certainly achievable. That’s why we need sometimes to advance the bitness of hashing to at least 128 bits. Unfortunately, it comes with a cost because platforms and CPUs do not support 128 bit operations natively."

So, it depends on the data, what's being done with it, whether the possibility of collisions exists, and what the implications of those possibilities are (perhaps hashes are great for speed, and harmless, if not used with respect to passwords; that is, used in some other place in the software infrastructure, they might be the right choice for that place...)


Are binary searches that much better than hashmaps if a malicious attacker gets to interact with the system?


Hashmaps are resistant to timing attacks, while binary searches are not.


If timing attacks are a concern, you usually do a full linear search without an early return. That search always takes constant time, but is twice as slow. And when the attacker somehow gets to know the seed, even hashmaps can be exploited.
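
I.e., something in this spirit (a sketch): touch every element and accumulate, so the time depends only on the length, not on where or whether the key matches.

    #include <stddef.h>

    /* Constant-time membership test: no early return, no data-dependent branch. */
    int contains_constant_time(const int *a, size_t n, int key)
    {
        int found = 0;
        for (size_t i = 0; i < n; i++)
            found |= (a[i] == key);     /* accumulate, never break out */
        return found;
    }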



