Efficiently Searching In-Memory Sorted Arrays: Revenge of Interpolation Search? [pdf] (wisc.edu)
100 points by greghn on June 3, 2023 | 37 comments



Broadly related: https://news.ycombinator.com/item?id=25899286 PGM Indexes: Learned indexes that match B-tree performance with 83x less space

I might suggest, slightly pushing back on this article's PDF, that binary search is not the de-facto in-memory search algorithm. Cache-conscious btrees, various optimizations of tries, or radix trees are better. Eytzinger arrays are a good BST option if data is pre-sorted and/or requires only rare resort passes. Which brings us back to btrees: keeping data sorted is usually 51% of the problem. Search is the other 49%.
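For anyone who hasn't run into the Eytzinger layout: it's the sorted array re-stored in BFS order so the hot path stays cache friendly. A minimal sketch of a lower-bound search over it (names and code mine, just to show the shape):

  #include <cstddef>
  #include <vector>

  // Eytzinger (BFS) layout: a[0] is unused padding, a[1] is the root,
  // and the children of a[k] are a[2k] and a[2k+1].
  // Returns the Eytzinger index of the first element >= x, or 0 if none exists.
  size_t eytzinger_lower_bound(const std::vector<int>& a, int x) {
      size_t k = 1, result = 0;
      while (k < a.size()) {
          if (a[k] >= x) { result = k; k = 2 * k; }  // candidate found, go left
          else           { k = 2 * k + 1; }          // go right
      }
      return result;
  }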

OK, if you don't need range search, which needs sorted keys, then hashmaps are darn hard to beat.

The PDF author however makes a strong point: if FLOPS is going up way faster than memory BW, the obvious thing to do is spend a little more code looking into memory smartly. And the PDF even gives algos to do it. So that's why my push back on binary search as the de-facto algorithm, while I don't think that's true, is only a small one. The paper goes way further.


I (author) agree that certainly at the higher end of element counts, it would be a very strange decision to not use an index. One of my favorite related papers is RadixSpline by Kipf, which shows an index based approach.

However, the inner loop of indexes usually is some other kind of search algorithm. Graefe mentions in BTree techniques that interpolation search is sometimes applicable to inner nodes in btrees. One of the SIP evaluations was using SIP as part of the inner loop in LevelDB, which doesn't displace the use of an LSM for managing the larger data set.

If you don't need a lot of range queries, eytzinger or cache-line optimized btrees do get a lot more interesting, but the iteration logic gets a lot worse. SIP returns an element if it's present and a sentinel if absent, which is an odd choice since the index is usually what's needed. No clue why, but hopefully you'll have some sympathy for revisiting your old code and wondering who wrote it. :)

I do think that optimizing search algorithms is still worthwhile in an indexed setting, for two reasons:

1. The size of a node within an index like a btree depends on the efficiency with which that node can be searched, so if you are able to decrease the cost of searching a given node in the index, that might motivate larger node sizes. I know that a common heuristic is to use a multiple of the page size, but that is based on the assumption that the whole node will be used as part of the search.

2. Pushing algorithmic decisions to just-in-time can be more costly than the time saved from choosing a better algorithm. You lose icache efficiency, stall on the decision, etc. I think it's still good to steel-man, as best you can, whether your index is actually buying you improved performance for its additional complexity or update overhead.


Your paper is a valid add, and I'm gonna read/digest it. My point above, while true, is more in the margins. Better to keep the focus on the engineering know-how in the PDF. Thank you for submitting it.


What's stopping eytzinger from being used for ranges? You can still iterate sequentially on an eytzinger tree, just like a binary search tree.


Because it's slower


Wow, this sounds like a really fascinating study! Binary Search has long been the go-to method for sorted, in-memory data search due to its effectiveness and simplicity. However, given the limitations of Binary Search in certain scenarios, it's exciting to see new algorithms being proposed that could potentially perform better under specific circumstances.

It's intriguing that Interpolation Search, despite its historically lackluster performance, is being given a second life through SIP and TIP. I love the fact that they're leveraging hardware trends and advanced calculations to optimize search performance.

The fact that SIP shows up to 4x faster performance on uniformly distributed data and TIP is 2-3x faster on non-uniformly distributed data is quite promising. It seems these approaches could be really useful in practice, given that real-world data often follows non-uniform distributions.

The introduction of a meta-algorithm to dynamically select the optimal search algorithm based on the data distribution is a clever touch. It helps to enhance the overall search performance by not sticking to a one-size-fits-all approach.

I'd love to read more about these techniques and their practical implementations. Do they have any benchmarks or plans to test these new methods in large-scale, real-world systems?


Serious question: did you use an LLM tool to craft some or all of this comment?


The five paragraph format will forever be suspicious.


I’d be willing to bet money at 5-1 that they didn’t. Since that’s impossible to prove, I guess I’d also bet against running an LLM a lot of times and getting anything close to it.


uh, HN doesn't do a great job surfacing responses to comments. Had no clue about this whole conversation!


Nah, that's just what academic correspondence looks like.


What suggests to you it may have been?


It’s repeating trivialities that would be well-understood and are somewhat hyperbolic for both the audience and the real meat of the content. C.f. “Binary search has long been…”


I don't really see it. My experience with other LLMs is limited, but ChatGPT tends to use the rhetorical tricks you're gesturing at to speak in generalities and avoid expressing a definite opinion - it's the shallow even-handedness of a bullshitted essay, where you avoid saying something definite because you don't actually know what you're talking about.

But the parts of the comment you're highlighting are specific and opinionated. And I don't find them particularly hyperbolic.


This was an interesting journey. My original target was L1-sized arrays with constant-sized error, but that limits its applicability to the target audience. My initial exploration with spline-based indexes was at this target data size, so they weren't competitive.

SIP is very sensitive to quality of implementation. I did my best to steel man binary search, but if either is not implemented carefully, then you can easily get inverted results. I tried several different axes of variation for implementation of binary search and probably spent more time optimizing binary search than SIP or TIP, but binary search is very sensitive to cache misses and how that relates to branch prediction and conflict misses. Quality of linear search is also critical and relies heavily on array padding. SIMD was actually a major pessimization since for SIP, it's only worth using if linear search is < 100 elements or so.
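For readers following along, the baseline I'm referring to is a branchless-style lower bound, roughly like this (a sketch only, names mine, not the exact code in the repo; whether the ternary actually becomes a cmov has to be verified in the disassembly, as comes up below):

  #include <cstddef>

  // First index i with a[i] >= x, or n if no such element. The hope is that
  // the conditional add compiles to a conditional move rather than a branch.
  size_t lower_bound_branchless(const int* a, size_t n, int x) {
      const int* base = a;
      size_t len = n;
      while (len > 1) {
          size_t half = len / 2;
          base += (base[half - 1] < x) ? half : 0;  // intended as a cmov
          len -= half;
      }
      return (base - a) + (len == 1 && *base < x);
  }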

I spent a lot of effort trying to model the performance of SIP, to predict from statistics of the data how it would perform. You can consider the cost of each interpolation as a random access cache miss, and the cost of a linear search as a function of its distance, but a linear search has decreasing marginal cost with distance before you reach steady state, which is exactly the regime where you are using it in SIP. So, that basically led nowhere, but it does point out that an L2 norm is the wrong metric for identifying whether interpolation would be useful on your data.

I think the bigger insight of TIP is around its use of efficient approximations for hyperbolic interpolations. There's a lot of research around polynomial interpolation, but I think that hyperbolas are also worth examining. I don't expect TIP to be used in a database engine, and while data might be commonly zipf distributed, I'm not sure that people are putting that kind of data into an array and doing lookups on them where the values at the bottom of the array have a lot of duplicate values.

SIP, I think, is more about increasing amounts of spatial cache locality, and making the most of out-of-order processors. It is more a variation on interpolation-sequential search, so it might be more practically used in a SIP-2 or SIP-3 variation with a hard-coded, unrolled number of iterations. (SIP-1 would just be interpolation-sequential.) The downside of an interpolation method is that pairs of sequential interpolations don't surround the target, which leaves the search unbounded. I don't find hybrid approaches very interesting because I don't expect an implementation to be able to pipeline useful work which has a time per iteration that is sufficiently fast. For perspective, on billion element arrays, I think the typical number of interpolations was around 3. We would expect the number of interpolations to not increase to 4 until you hit a quintillion elements, which suggests that the gap in efficiency between log N and log log N grows slowly enough that algorithmic gains have a nearby, practical upper bound to potential improvement.
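To give a flavor of the interpolation-sequential shape I mean (a toy sketch, not SIP itself, which adds guards, padding and a fixed, unrolled number of interpolations; names mine; this one returns an index or a sentinel rather than an element):

  #include <cstddef>
  #include <cstdint>

  // One linear interpolation between the end points, then a sequential scan
  // toward the target. Assumes a sorted, non-empty array. Returns an index
  // of x, or SIZE_MAX if x is absent.
  size_t interpolation_sequential(const int* a, size_t n, int x) {
      if (x < a[0] || x > a[n - 1]) return SIZE_MAX;
      if (a[n - 1] == a[0]) return (a[0] == x) ? 0 : SIZE_MAX;
      double frac = ((double)x - (double)a[0]) / ((double)a[n - 1] - (double)a[0]);
      size_t i = (size_t)(frac * (double)(n - 1));   // interpolated guess
      while (i + 1 < n && a[i] < x) ++i;             // scan up toward the target
      while (i > 0 && a[i] > x) --i;                 // or scan down
      return (a[i] == x) ? i : SIZE_MAX;
  }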

One of my favorite related papers is RadixSpline by Kipf et al. The amount of additional memory across a number of scenarios would be tiny, you could write a very efficient inner loop, and it would have basically been tailor fit to work perfectly on the FB dataset (one of the real world distributions).


The paper draws the analogy to continuous root-finding ("Interpolation search is the regula falsi method"), nice. One thing I've been wondering lately is whether the ITP method[0], which is actually more recent than this paper, is a good fit for sorted searching. It's an elegant method that guarantees a maximum number of queries, for example one more than binary search, while taking advantage of interpolation for better-behaved distributions. I wrote a short section about this at [1], which is likely to link to the OP soon too.

Since the author's here: the "Optimized binary search" algorithm looks good, in that it has an obvious branchless implementation. But it'll depend on how the code's written and compiled, and clang in particular will turn it into a branch without -mllvm --x86-cmov-converter=0. Making it branchless was the intention, right? Was the output code checked to make sure it uses cmov?

[0] https://en.wikipedia.org/wiki/ITP_Method

[1] https://mlochbaum.github.io/BQN/implementation/primitive/sor...


Big fan of your presentation on binary search that amortised multiple concurrent searches.

I did several variations on binary search and always compared against the best one. [0] From what I remember, some variations were compiled to a branchless implementation and some were compiled to implementations that used a branch. I had many passes of analysis by pasting into godbolt with identical compilers and flags. Power-of-2 binary search did better on small arrays iirc, but is the first to hit conflict misses. For larger arrays, I believe the branchy search algorithms start to do better since half their branches get predicted correctly.

I haven't previously looked at ITP, and I would need to study it further to get a clearer idea of what it's aiming for. Some hybrid methods, unlike ITP, require additional memory accesses, but I don't think this one does, just compute. On the other hand, since at a billion elements you're looking at roughly 3 interpolations or 30 bisections, there's a pretty narrow window of opportunity here, and in the batch case, there are indices to contend with. I'm on the BQN discord if you wanted a different form of communication.

[0] https://github.com/UWHustle/Efficiently-Searching-In-Memory-...


Sounds good on the basic binary search. I jumped through the paper a little too quickly and missed the source code link, although thanks for the godbolt confirmation as well.

Yeah, the big deal about ITP is not really the interpolation but the graceful fallback to binary search. With of course that nasty division overhead. I'll have to study your paper to see which ideas there could be used.

Glad you liked the talk! And a small world, huh? I've been wanting to get an implementation of the vector binary search into CBQN for a while, so that an open source version of it exists at least. Always some other thing to do. Multiple searches are really a different world. Elijah Stone pointed out recently[0] that with searches that all have the same iteration count (sadly, not interpolation search!), several of them can be run at once just by copying the loop body. That was new to me at least. And for searches that don't fit in cache it's possible to partition the sought values, which does the accesses in a perfectly cache-friendly order for some extra compute. That's described in the page I linked before.

[0] https://news.ycombinator.com/item?id=33648924
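Concretely, the "copy the loop body" trick might look something like this (my sketch, reusing the fixed-iteration branchless loop from upthread; the useful K presumably depends on how many outstanding misses the core can sustain):

  #include <array>
  #include <cstddef>

  // K lower-bound searches over the same sorted array, run in lockstep.
  // All searches do the same number of halving steps, so the loop body can
  // simply be replicated and the misses of the K probes overlap.
  template <size_t K>
  std::array<size_t, K> lower_bound_batch(const int* a, size_t n,
                                          const std::array<int, K>& xs) {
      std::array<const int*, K> base;
      base.fill(a);
      size_t len = n;
      while (len > 1) {
          size_t half = len / 2;
          for (size_t j = 0; j < K; ++j)   // K independent probes per step
              base[j] += (base[j][half - 1] < xs[j]) ? half : 0;
          len -= half;
      }
      std::array<size_t, K> out;
      for (size_t j = 0; j < K; ++j)
          out[j] = (base[j] - a) + (len == 1 && *base[j] < xs[j]);
      return out;
  }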


> Multiple searches are really a different world. Elijah Stone pointed out recently[0] that with searches that all have the same iteration count (sadly, not interpolation search!), several of them can be run at once just by copying the loop body

They don't need to have the same iteration count; it suffices to mask out searches as they finish (like gpus) or ensure that repeated search iterations are idempotent, and just wait until all finish. My first implementation actually did this, and the iteration count could vary by as much as 1 when the search space size was not a power of two. I changed that after henry pointed out that it can cause mispredicts.

In the case of interpolation search, a naive application might be a bad idea (since a single pathological lane slows down all the other concurrent searches). But an alternative might be to detect and 'schedule out' pathological lanes; use binary search instead of interpolation search for them, and now you have nice worst-case asymptotics again.

(Fun anecdote: I asked an nvidia employee why they don't use this strategy for gpu branches in general. The response: mumble mumble not sure, probably tradeoffs—just isn't worth it. Fair enough. Not six months later, they release new gpus which do that for sparse threadgroups. I maintain they stole the idea from me :)

Alternately, it might be possible to swap out completed lanes as soon as they finish—superscalar cpu is a mimd, after all—getting this branch free might be too much overhead, but maybe if you do the check every k iterations or something...
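In scalar pseudocode the masking amounts to something like this (a sketch only, names mine; a real simd version would use vector masks and gathers rather than a loop over lanes):

  #include <array>
  #include <cstddef>

  // K lower-bound searches run in lockstep; lanes that have finished keep
  // executing the loop body but their updates are masked out, gpu-style.
  template <size_t K>
  std::array<size_t, K> lower_bound_masked(const int* a,
                                           const std::array<size_t, K>& ns,
                                           const std::array<int, K>& xs) {
      std::array<const int*, K> base;
      std::array<size_t, K> len = ns;
      base.fill(a);
      bool any = true;
      while (any) {
          any = false;
          for (size_t j = 0; j < K; ++j) {
              size_t active = (len[j] > 1);              // lane mask, 0 or 1
              size_t half = (len[j] / 2) * active;       // 0 when masked out
              base[j] += (half && base[j][half - 1] < xs[j]) ? half : 0;
              len[j] -= half;
              any |= (len[j] > 1);
          }
      }
      std::array<size_t, K> out;
      for (size_t j = 0; j < K; ++j)
          out[j] = (base[j] - a) + (len[j] == 1 && *base[j] < xs[j]);
      return out;
  }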


There are some possibilities with uneven searches for sure. The major issue is that a search is so fast it doesn't leave a lot of room for overhead. For interpolation search, scheduling out bad cases sounds fine until you have a lot of them: then you have to search them one at a time, and the length's unpredictable, right? Grouping by search length doesn't really sound feasible: you'd have to store element index and value, as well as current search index, for each element that goes the long way.


> length's unpredictable, right? Grouping by search length doesn't really sound feasible

You just have to group by ceillog2, so there aren't many buckets. Can even discretise further (say, one bucket for every two exponents) with likely not too much penalty. Alternately, perhaps restart the search entirely for pathological elements (so don't try to keep around the information gained by the cursory iterations of interpolation search), but if there are too many pathological elements, just do binary search up front for all elements. I expect this would work pretty well in practice. (Maybe with exponential backoff or something, so you don't lose to changes in the distribution of elements being searched for.)

(This reminds me, I have a similar problem with adaptive sorting. Having identified a number of presorted runs, it's necessary to merge them in the right order, because if I merge them in the wrong order I get quadratic. Plan is similarly to sort them by lzcnt or some such, which is a curious recursive problem.)
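(For what it's worth, the ceil-log2 bucket itself is a couple of bit ops; a sketch with the gcc/clang builtin, where 'remaining' stands in for whatever length estimate a lane has left:)

  #include <cstdint>

  // Bucket index = ceil(log2(remaining)), so lanes with similar amounts of
  // work left land in the same bucket. 'remaining' is a hypothetical count.
  static inline unsigned bucket_of(uint64_t remaining) {
      return (remaining <= 1) ? 0 : 64 - __builtin_clzll(remaining - 1);
  }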


> For larger arrays, I believe the branchy search algorithms start to do better since half their branches get predicted correctly.

(not sure if this is what you meant by "conflict misses") I believe the reason branchy binary search algorithms do better on large arrays is that the CPU's speculative execution will cause it to prefetch data to the caches. That means that, for around half of the steps, the data needed for the comparisons will not need to be fetched from memory.

You can get the best of both worlds by using a branchless implementation and manually prefetching data as in [0].

[0]: https://en.algorithmica.org/hpc/data-structures/binary-searc...
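As I understand that page, the idea is roughly the branchless loop plus explicit prefetches of both possible next probes, something like this (sketch only, names mine, not the code from that page; __builtin_prefetch is the gcc/clang builtin):

  #include <cstddef>

  // Branchless lower bound with the next two candidate probe locations
  // prefetched, so their miss latency overlaps the current comparison.
  size_t lower_bound_prefetch(const int* a, size_t n, int x) {
      const int* base = a;
      size_t len = n;
      while (len > 1) {
          size_t half = len / 2;
          size_t next_half = (len - half) / 2;
          if (next_half > 0) {
              __builtin_prefetch(&base[next_half - 1]);         // next probe if we keep the left half
              __builtin_prefetch(&base[half + next_half - 1]);  // next probe if we jump right
          }
          base += (base[half - 1] < x) ? half : 0;
          len -= half;
      }
      return (base - a) + (len == 1 && *base < x);
  }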


That's not what's meant by 'conflict misses', no. The issue is that, with pot (power of two) search, the hottest elements (logically the ones which represent the highest nodes in the tree) will be spaced at pot intervals. If the array is large, that means their addresses will differ only in the high bits; in the case of the higher-level caches, then, they'll map to the same cache set and kick each other out, even though the cache as a whole has enough capacity to fit them all.


It can still be made much faster, esp. for constant data and data which fits well into ranges, if uniformly distributed or varying. You just calculate the best curve for the data and its ranges beforehand, and then at run-time find the pivot, the starting point of your branchless search, and binary search from there.

Much faster than the SIP and TIP nonsense in the tight loop. Also faster than Eytzinger in most cases. They use only a linear equation to model the data; a quadratic or logarithmic fit would be much better. And it's much smaller than a full db index, just 3-4 numbers: the type of the curve (linear, quadratic, logarithmic) and the curve parameters.

Without ranges a hashtable is still the best, esp. if constant then use a CMPH. I use that in unicode table searches, where the data varies every year.
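Roughly, the run-time side looks like this (a simplified sketch with a linear fit only and a galloping bracket around the pivot; names mine, the real thing picks among linear/quadratic/logarithmic fits computed offline):

  #include <algorithm>
  #include <cstddef>

  struct Fit { double slope, intercept; };  // fitted offline to (value -> index)

  // Predict a pivot from the precomputed fit, widen an interval around it
  // until it brackets x, then binary search only that interval.
  size_t fitted_lower_bound(const int* a, size_t n, int x, Fit f) {
      if (n == 0) return 0;
      double guess = f.slope * (double)x + f.intercept;
      size_t pivot = (guess <= 0) ? 0
                   : (guess >= (double)(n - 1)) ? n - 1 : (size_t)guess;
      size_t lo = pivot, hi = pivot, step = 1;
      while (lo > 0 && a[lo] >= x) { lo = (lo > step) ? lo - step : 0; step *= 2; }
      step = 1;
      while (hi < n && a[hi] < x)  { hi = std::min(n, hi + step); step *= 2; }
      return std::lower_bound(a + lo, a + hi, x) - a;
  }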


Is this something you've benchmarked? If you're looking at Unicode, note that with just over a million possible code points, this is at the smaller end of speedups shown in the paper, which really gets going around 10 million. I don't know for sure but wouldn't be surprised if SIP and TIP typically only use one interpolation at such sizes.

How do you start a binary search at a single point exactly, search in the right direction doubling the interval size until you bracket it? The paper mentions what they call Interpolation-Sequential which does a sequential search instead and they claim this is faster if you get close enough. In any case, it doesn't seem right to call this "branchless" given that the number of steps is variable.

TIP uses rational functions, not linear. Still, I'm a bit skeptical that a global quadratic or logarithmic fit solves all problems: isn't piecewise data with one or two big gaps fairly common? With a distribution that's uniform only after taking a log, I see how a logarithmic fit would have some theoretical motivation; quadratic seems unmotivated.


If your data is uniformly distributed, you wouldn't expect higher order approximations to reduce error. You would have to weigh the cost of that approximation against using an interpolation to get more local data. So, a linear equation is appropriate here.

If your data is non-linear, then polynomials will require a lot of terms to have low enough error, eg. on zipf. You also need to be able to fit the curve efficiently, so hyperbolic interpolation was a viable option.

With respect to finding the pivot beforehand, by choosing the end points, you can have confidence that you won't need to saturate at the end points in the first step. You might get better error by finding a better slope, but that is accepting a much higher cost to choosing the parameters.

I'd be curious to learn about related approaches which you've measured to be a big improvement, although I didn't understand all the terms you used.


As I said, a non-linear distribution can easily be fit into a quadratic curve, if logarithmic.

Multidimensional curves, i.e. lots of holes in the data, make no sense to model for the initial pivot calc. Binary search finds them much better then. The cost is similar to a hash table, almost constant with 2-4 branches, and mostly into cached data.


I get that it's faster, but binary search is already so quick that it is almost free when you look at the big picture, even with large datasets. If standard library implementors want to switch to this method behind the scenes then that's fine, but I'm not re-writing my apps.


This wouldn't go into a standard library since it's distribution dependent, but you might be interested to note a non-trivial (though not game-changing) improvement to leveldb query time even with all the other work that it's doing besides search. Other examples would include things like pandas joins, which use primitives like searching for lower bounds and partition points.

I agree, because for datasets that fit into memory, the scope for improvement is pretty limited. From a billion elements to 128 billion, that's only around 23% (37/30) more comparisons for binary search, and keeping that constant at the expense of more linear search isn't necessarily all that interesting for cheap comparisons. Log N grows so slowly that log log N is frequently not that practically interesting.


Wait, I thought piecewise-linear interpolation search was well-known, and known to be similarly performant to the Google "learnt indices" paper?

This paper seems to imply that previous approaches didn't use piecewise-linear approximations??

Can someone clear up if I got my history wrong? Did I just assume piecewise-linear is well-known when it isn't?


Haven't heard of it, and I couldn't find the phrase "piecewise-linear interpolation search" online at all. Care to share? In particular, how are the pieces determined?

The Google paper is "The Case for Learned Index Structures" I think? This paper discusses index structures to be stored along with the data, so it's not directly comparable to the OP, which starts with the sorted data alone. Google's use case is so dominated by lookups that as far as I can tell they don't even report on the time to construct the index. I'm not sure the same priorities hold in every domain.


Piecewise linear was one of the first things that I experimented with, but it is getting near to index-based approaches, which is a different topic. It is also inefficient for non-polynomial distributions. On things that approach linear, the extra steps to find the right spline can outweigh the advantage of reduced error, so I think both search optimizations and building indices can coexist, shifting around where search should be the base case for an indexed search.

Building and maintaining indices is also not free, so it's interesting to try to do your best with much less index maintenance.


Weird coincidence, I interpolated my pivot in a binary search 3 days ago. (Didn’t improve things, my data is compressed, so only irregular stuff is in the array)


New leetcode problem just dropped


Commenting so I find this later.


Tip: you can favorite the submission then un-favorite it after you've read it.


You can also upvote it, though you won't be able to undo it for long if that's a concern.



