
Binary search doesn't make much sense on modern processors anyways.

We tend to end up comparing keys that are much smaller than a cache line - and the memory access takes so long compared to the actual calculation that you may as well check everything in the cache line "while you're at it".

Which ends up meaning that in practice, you may as well do a k-ary search.
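
A minimal sketch of that idea (my own illustration, assuming 64-byte lines and sorted 32-bit keys): halve the range as usual, but stop once it fits inside a line and finish with a linear scan, since those extra compares cost almost nothing once the line has been pulled in.

    // Sketch only: assumes 64-byte cache lines and a sorted int32_t array.
    #include <cstddef>
    #include <cstdint>

    constexpr size_t KEYS_PER_LINE = 64 / sizeof(int32_t);  // 16 keys per line

    // Index of the first element >= key (lower bound), or n if there is none.
    size_t cacheline_search(const int32_t* a, size_t n, int32_t key) {
        size_t lo = 0, hi = n;                 // search window is [lo, hi)
        while (hi - lo > KEYS_PER_LINE) {      // halve until it fits in one line
            size_t mid = lo + (hi - lo) / 2;
            if (a[mid] < key) lo = mid + 1; else hi = mid;
        }
        while (lo < hi && a[lo] < key) ++lo;   // scan the remaining line
        return lo;
    }

(Aligning the array to 64 bytes helps make sure that final scan really does stay inside a single line.)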

I wonder how much, if any, this can be improved by prefetching the indexes you might access next step?



> Which ends up meaning that in practice, you may as well do a k-ary search.

So, basically, use b-trees ? (https://en.wikipedia.org/wiki/B-tree)

> https://queue.acm.org/detail.cfm?id=1814327

Poul-Henning Kamp already wrote about it (https://web.archive.org/web/20150523191845/http://queue.acm....). Long story short: with the correct data structure, and assuming you are going to be limited by your storage, you can expect anything from small wins when your RAM is empty to a 10x win when your RAM is full (and you have to fault pages).


Not quite the same thing, because you want to pull in a fixed number of bytes at a time, which can be a variable number of keys. But yes, pretty close.

And yes, there is nothing new in the field of CS, ever, it seems. Although note that that article is talking about a B-heap (with insertion, etc.), whereas this is a B-tree doing straight lookups.
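
For the "fixed number of bytes at a time" flavor, the usual trick is an implicit static B-tree layout (sometimes called an S-tree): pack 16 32-bit keys into each 64-byte node, compute children by index arithmetic, and touch one cache line per level. A hedged sketch of my own, with illustrative names, assuming keys stay below INT32_MAX (which is used as padding):

    #include <cstdint>
    #include <vector>

    constexpr int B = 16;  // 16 x int32_t = one 64-byte node

    struct STree {
        std::vector<int32_t> tree;   // node k occupies tree[k*B .. k*B + B)
        int nblocks = 0;

        static int child(int k, int i) { return k * (B + 1) + i + 1; }

        // In-order fill of the implicit (B+1)-ary tree, so that the rank of a
        // key inside a node tells you which child to descend into.
        void fill(const std::vector<int32_t>& a, int& t, int k) {
            if (k >= nblocks) return;
            for (int i = 0; i < B; ++i) {
                fill(a, t, child(k, i));
                tree[(size_t)k * B + i] = (t < (int)a.size()) ? a[t++] : INT32_MAX;
            }
            fill(a, t, child(k, B));
        }

        explicit STree(const std::vector<int32_t>& sorted_keys) {
            nblocks = ((int)sorted_keys.size() + B - 1) / B;
            tree.assign((size_t)nblocks * B, INT32_MAX);
            int t = 0;
            fill(sorted_keys, t, 0);
        }

        // Smallest key >= x, or INT32_MAX if there is none.
        int32_t lower_bound(int32_t x) const {
            int k = 0;
            int32_t res = INT32_MAX;
            while (k < nblocks) {
                int i = 0;
                while (i < B && tree[(size_t)k * B + i] < x) ++i;  // rank in node
                if (i < B) res = tree[(size_t)k * B + i];          // candidate
                k = child(k, i);                                   // descend
            }
            return res;
        }
    };

Usage is just STree t(sorted_keys); t.lower_bound(x); - the point being that each level of the descent costs exactly one cache-line fetch instead of one fetch per comparison.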


> I wonder how much, if any, this can be improved by prefetching the indexes you might access next step?

I haven't tested it directly on binary search, but I've done some synthetic tests of the same approach. The answer seems to be: the results can be really good!

There are several issues, though.

First, your gains will usually come as increased throughput rather than reduced latency. And it's usually a tradeoff that costs extra memory bandwidth, since you're loading data you don't need. If you are running on a single core, this is usually a good tradeoff. If you are already parallel on all cores, it may not be.

The second is that you need to be able to "batch" your requests. That is, you need to be prefetching for the next request while you are handling the next level for the current request. The shorter the final processing time once you have the answer, the larger your batches need to be.

The last is that each CPU has a limit on the number of outstanding requests for cachelines from RAM: usually 10. This limits the size of your batches in accordance with Little's Law. The question then becomes whether that batch size is possible, and how much benefit it provides.
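
To put rough numbers on that (purely illustrative; the ~100 ns DRAM latency is my assumption, the ~10 outstanding line fills is the limit above): Little's Law says concurrency = throughput x latency, so a single core tops out around

    10 outstanding misses / 100 ns  ~=  100 million cache-line fetches per second

and to keep those miss slots busy your batch needs roughly latency divided by per-item compute, e.g. 100 ns / 10 ns = about 10 lookups in flight.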

But there is a sweet spot where you can get large gains (2-3x single processor throughput) if you have enough requests available, if you can tolerate a slightly slower response time, and if the processing required once you have a result is significant (comparable to memory latency).

Oh, and you also need to be willing to write the critical sections in assembly. Even if you figure out the combination of fragile syntax and compiler options that produces the branch-free logic you need, convincing an optimizing compiler to issue your apparently unnecessary prefetches at exactly the point in time you need them is next to impossible.
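
For anyone who wants the shape of the batching scheme without going all the way to assembly, here is a rough C-level sketch using GCC/Clang's __builtin_prefetch (which, as noted above, the compiler and CPU are free to treat as a mere hint, so this illustrates the structure rather than guaranteeing the win). It advances a whole batch of binary searches one level at a time: first issue every query's prefetch, then come back and do the comparisons while the lines are (hopefully) in flight.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Lower-bound indices for a batch of keys over the sorted array a[0..n).
    // Batch sizes near the ~10 outstanding misses per core are the sweet spot.
    void batched_search(const int32_t* a, size_t n,
                        const std::vector<int32_t>& keys,
                        std::vector<size_t>& out) {
        const size_t m = keys.size();
        std::vector<size_t> lo(m, 0), hi(m, n);
        out.assign(m, 0);

        bool active = true;
        while (active) {
            active = false;
            // Pass 1: kick off the load of every query's next midpoint.
            for (size_t q = 0; q < m; ++q) {
                if (lo[q] < hi[q]) {
                    size_t mid = lo[q] + (hi[q] - lo[q]) / 2;
                    __builtin_prefetch(&a[mid]);
                    active = true;
                }
            }
            // Pass 2: the comparisons, by which time the lines should be near.
            for (size_t q = 0; q < m; ++q) {
                if (lo[q] < hi[q]) {
                    size_t mid = lo[q] + (hi[q] - lo[q]) / 2;
                    if (a[mid] < keys[q]) lo[q] = mid + 1; else hi[q] = mid;
                }
            }
        }
        for (size_t q = 0; q < m; ++q) out[q] = lo[q];
    }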

This paper isn't quite on this exact topic, but leans heavily in the same direction:

Rethinking SIMD Vectorization for In-Memory Databases. Orestis Polychroniou, Arun Raghavan, Kenneth A. Ross. http://www.cs.columbia.edu/~orestis/sigmod15.pdf



