It can still be made much faster, especially for constant data and for data that fits well into ranges, whether uniformly distributed or varying. You calculate the best-fitting curve for the data and its ranges beforehand, then at run time use it to find the pivot, the starting point of your branchless search, and binary search from there.

This is much faster than the SIP and TIP nonsense in the tight loop, and also faster than Eytzinger in most cases. They use only a linear equation to model the data; a quadratic or logarithmic fit would be much better. And it's much smaller than a full DB index: just 3-4 numbers, the type of the curve (linear, quadratic, logarithmic) plus the curve parameters.
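
Roughly what I mean, as a sketch (assuming sorted 64-bit keys; the struct, the function names and the stored max error are illustrative, not taken from any of the papers):

    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    /* The whole "index": the curve type plus 3-4 fitted parameters. */
    enum curve_type { CURVE_LINEAR, CURVE_QUADRATIC, CURVE_LOG };

    struct curve_fit {
        enum curve_type type;
        double a, b, c;     /* fitted offline, e.g. by least squares */
        size_t max_err;     /* worst-case prediction error, also measured offline */
    };

    /* Predict the position of key in a sorted array of length n (n >= 1). */
    static size_t predict_pivot(const struct curve_fit *f, uint64_t key, size_t n)
    {
        double x = (double)key, pos = 0.0;
        switch (f->type) {
        case CURVE_LINEAR:    pos = f->a * x + f->b;              break;
        case CURVE_QUADRATIC: pos = (f->a * x + f->b) * x + f->c; break;
        case CURVE_LOG:       pos = f->a * log(x + 1.0) + f->b;   break;
        }
        if (pos < 0.0) pos = 0.0;
        if (pos > (double)(n - 1)) pos = (double)(n - 1);
        return (size_t)pos;
    }

    /* Binary search only inside pivot +/- max_err; the window width is fixed
     * by max_err, so the comparison in the loop body can compile to cmovs. */
    static size_t search_window(const uint64_t *a, size_t n,
                                const struct curve_fit *f, uint64_t key)
    {
        size_t pivot = predict_pivot(f, key, n);
        size_t lo = pivot > f->max_err ? pivot - f->max_err : 0;
        size_t hi = pivot + f->max_err + 1 < n ? pivot + f->max_err + 1 : n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (a[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;          /* index of first element >= key */
    }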

Without ranges a hash table is still the best, especially if the data is constant; then use a CMPH (minimal perfect hash). I use that in Unicode table searches, where the data changes every year.
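
For the constant-key case, the CMPH flow is roughly this (a sketch following the library's documented vector-adapter example; the keys are just placeholders):

    #include <cmph.h>
    #include <string.h>

    /* Build a minimal perfect hash over a fixed key set and look one key up. */
    int main(void)
    {
        const char *keys[] = { "LATIN SMALL LETTER A", "CJK UNIFIED IDEOGRAPH-4E00" };
        unsigned int nkeys = 2;

        cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)keys, nkeys);
        cmph_config_t *config = cmph_config_new(source);
        cmph_config_set_algo(config, CMPH_BDZ);   /* one of several algorithms */
        cmph_t *mph = cmph_new(config);
        cmph_config_destroy(config);

        const char *k = "LATIN SMALL LETTER A";
        unsigned int id = cmph_search(mph, k, (cmph_uint32)strlen(k));
        /* id is a unique slot in [0, nkeys); use it to index a value table */

        cmph_destroy(mph);
        cmph_io_vector_adapter_destroy(source);
        return (int)id;
    }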




Is this something you've benchmarked? If you're looking at Unicode, note that with just over a million possible code points, this is at the smaller end of speedups shown in the paper, which really gets going around 10 million. I don't know for sure but wouldn't be surprised if SIP and TIP typically only use one interpolation at such sizes.

How exactly do you start a binary search from a single point: search in the right direction, doubling the interval size until you bracket the key? The paper mentions what it calls Interpolation-Sequential, which does a sequential search instead, and claims this is faster if you get close enough. In any case, it doesn't seem right to call this "branchless" given that the number of steps is variable.
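
For concreteness, that bracketing idea is usually called galloping or exponential search; something like this (my own sketch, not code from the paper):

    #include <stddef.h>
    #include <stdint.h>

    /* Gallop outward from pivot until key is bracketed, then binary search.
     * Requires n >= 1 and pivot < n. The number of doublings depends on how
     * far off the pivot was, which is why the step count is variable. */
    static size_t gallop_search(const uint64_t *a, size_t n, size_t pivot, uint64_t key)
    {
        size_t lo, hi, step = 1;

        if (a[pivot] < key) {                     /* answer is to the right */
            lo = pivot;
            hi = pivot + 1;
            while (hi < n && a[hi] < key) {
                lo = hi;
                hi = hi + step < n ? hi + step : n;
                step *= 2;
            }
        } else {                                  /* answer is at pivot or to the left */
            lo = hi = pivot;
            while (lo > 0 && a[lo - 1] >= key) {
                hi = lo;
                lo = lo > step ? lo - step : 0;
                step *= 2;
            }
        }
        while (lo < hi) {                         /* binary search inside the bracket */
            size_t mid = lo + (hi - lo) / 2;
            if (a[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;                                /* index of first element >= key */
    }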

TIP uses rational functions, not linear. Still, I'm a bit skeptical that a global quadratic or logarithmic fit solves all problems: isn't piecewise data with one or two big gaps fairly common? With a distribution that's uniform only after taking a log, I see how a logarithmic fit would have some theoretical motivation; quadratic seems unmotivated.


If your data is uniformly distributed, you wouldn't expect higher-order approximations to reduce the error. You would have to weigh the cost of that approximation against simply doing another interpolation on more local data. So a linear equation is appropriate here.

If your data is non-linear, then polynomials will require a lot of terms to get the error low enough, e.g. on a Zipf distribution. You also need to be able to fit the curve efficiently, which is why hyperbolic interpolation was a viable option.

With respect to finding the pivot beforehand: by choosing the endpoints, you can be confident that you won't need to saturate at the endpoints in the first step. You might get lower error by finding a better slope, but that means accepting a much higher cost for choosing the parameters.
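
For context, the endpoint-anchored first step looks something like this (a rough sketch, not the paper's code):

    #include <stddef.h>
    #include <stdint.h>

    /* First interpolation guess using the array's endpoints as anchors.
     * Clamping key into [a[0], a[n-1]] first means the predicted index always
     * lands in [0, n-1], so no saturation is needed on this step. */
    static size_t first_guess(const uint64_t *a, size_t n, uint64_t key)
    {
        uint64_t lo = a[0], hi = a[n - 1];
        if (key <= lo) return 0;
        if (key >= hi) return n - 1;
        /* slope = (n - 1) / (hi - lo); guess = (key - lo) * slope */
        return (size_t)((double)(key - lo) / (double)(hi - lo) * (double)(n - 1));
    }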

I'd be curious to learn about related approaches which you've measured to be a big improvement, although I didn't understand all the terms you used.


As I said, a non-linear distribution can easily be fit with a quadratic curve if it's logarithmic.

Multidimensional curves, i.e. lots of holes in the data, make no sense to model for the initial pivot calculation; binary search handles those much better. The cost is similar to a hash table: almost constant, with 2-4 branches, and mostly into cached data.



