It can still be made much faster, especially for constant data and for data that fits well into ranges, whether uniformly distributed or varying. You calculate the best-fitting curve for the data and its ranges beforehand, then at run time use it to find the pivot, the starting point of your branchless search, and binary search from there.

This is much faster than the SIP and TIP nonsense in the tight loop, and also faster than Eytzinger in most cases. They use only a linear equation to model the data; a quadratic or logarithmic fit would be much better. And it's much smaller than a full DB index: just 3-4 numbers, the type of the curve (linear, quadratic, logarithmic) plus the curve parameters.
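
Roughly what I mean, as a sketch (assuming sorted 64-bit keys; the struct, the function names and the stored max error are illustrative, not taken from any of the papers):

    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    /* The whole "index": the curve type plus 3-4 fitted parameters. */
    enum curve_type { CURVE_LINEAR, CURVE_QUADRATIC, CURVE_LOG };

    struct curve_fit {
        enum curve_type type;
        double a, b, c;     /* fitted offline, e.g. by least squares */
        size_t max_err;     /* worst-case prediction error, also measured offline */
    };

    /* Predict the position of key in a sorted array of length n (n >= 1). */
    static size_t predict_pivot(const struct curve_fit *f, uint64_t key, size_t n)
    {
        double x = (double)key, pos = 0.0;
        switch (f->type) {
        case CURVE_LINEAR:    pos = f->a * x + f->b;              break;
        case CURVE_QUADRATIC: pos = (f->a * x + f->b) * x + f->c; break;
        case CURVE_LOG:       pos = f->a * log(x + 1.0) + f->b;   break;
        }
        if (pos < 0.0) pos = 0.0;
        if (pos > (double)(n - 1)) pos = (double)(n - 1);
        return (size_t)pos;
    }

    /* Binary search only inside pivot +/- max_err; the window width is fixed
     * by max_err, so the comparison in the loop body can compile to cmovs. */
    static size_t search_window(const uint64_t *a, size_t n,
                                const struct curve_fit *f, uint64_t key)
    {
        size_t pivot = predict_pivot(f, key, n);
        size_t lo = pivot > f->max_err ? pivot - f->max_err : 0;
        size_t hi = pivot + f->max_err + 1 < n ? pivot + f->max_err + 1 : n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (a[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;          /* index of first element >= key */
    }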

Without ranges a hash table is still the best, especially if the data is constant; then use a CMPH (minimal perfect hash). I use that in Unicode table searches, where the data changes every year.
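
For the constant-key case, the CMPH flow is roughly this (a sketch following the library's documented vector-adapter example; the keys are just placeholders):

    #include <cmph.h>
    #include <string.h>

    /* Build a minimal perfect hash over a fixed key set and look one key up. */
    int main(void)
    {
        const char *keys[] = { "LATIN SMALL LETTER A", "CJK UNIFIED IDEOGRAPH-4E00" };
        unsigned int nkeys = 2;

        cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)keys, nkeys);
        cmph_config_t *config = cmph_config_new(source);
        cmph_config_set_algo(config, CMPH_BDZ);   /* one of several algorithms */
        cmph_t *mph = cmph_new(config);
        cmph_config_destroy(config);

        const char *k = "LATIN SMALL LETTER A";
        unsigned int id = cmph_search(mph, k, (cmph_uint32)strlen(k));
        /* id is a unique slot in [0, nkeys); use it to index a value table */

        cmph_destroy(mph);
        cmph_io_vector_adapter_destroy(source);
        return (int)id;
    }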




Is this something you've benchmarked? If you're looking at Unicode, note that with just over a million possible code points, this is at the smaller end of speedups shown in the paper, which really gets going around 10 million. I don't know for sure but wouldn't be surprised if SIP and TIP typically only use one interpolation at such sizes.

How exactly do you start a binary search from a single point: search in the right direction, doubling the interval size until you bracket the key? The paper mentions what it calls Interpolation-Sequential, which does a sequential search instead, and claims this is faster if you get close enough. In any case, it doesn't seem right to call this "branchless" given that the number of steps is variable.
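
For concreteness, that bracketing idea is usually called galloping or exponential search; something like this (my own sketch, not code from the paper):

    #include <stddef.h>
    #include <stdint.h>

    /* Gallop outward from pivot until key is bracketed, then binary search.
     * Requires n >= 1 and pivot < n. The number of doublings depends on how
     * far off the pivot was, which is why the step count is variable. */
    static size_t gallop_search(const uint64_t *a, size_t n, size_t pivot, uint64_t key)
    {
        size_t lo, hi, step = 1;

        if (a[pivot] < key) {                     /* answer is to the right */
            lo = pivot;
            hi = pivot + 1;
            while (hi < n && a[hi] < key) {
                lo = hi;
                hi = hi + step < n ? hi + step : n;
                step *= 2;
            }
        } else {                                  /* answer is at pivot or to the left */
            lo = hi = pivot;
            while (lo > 0 && a[lo - 1] >= key) {
                hi = lo;
                lo = lo > step ? lo - step : 0;
                step *= 2;
            }
        }
        while (lo < hi) {                         /* binary search inside the bracket */
            size_t mid = lo + (hi - lo) / 2;
            if (a[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;                                /* index of first element >= key */
    }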

TIP uses rational functions, not linear. Still, I'm a bit skeptical that a global quadratic or logarithmic fit solves all problems: isn't piecewise data with one or two big gaps fairly common? With a distribution that's uniform only after taking a log, I see how a logarithmic fit would have some theoretical motivation; quadratic seems unmotivated.


If your data is uniformly distributed, you wouldn't expect higher-order approximations to reduce the error. You would have to weigh the cost of that approximation against simply doing another interpolation on more local data. So a linear equation is appropriate here.

If your data is non-linear, then polynomials will require a lot of terms to get the error low enough, e.g. on a Zipf distribution. You also need to be able to fit the curve efficiently, which is why hyperbolic interpolation was a viable option.

With respect to finding the pivot beforehand: by choosing the endpoints, you can be confident that you won't need to saturate at the endpoints in the first step. You might get lower error by finding a better slope, but that means accepting a much higher cost for choosing the parameters.
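
For context, the endpoint-anchored first step looks something like this (a rough sketch, not the paper's code):

    #include <stddef.h>
    #include <stdint.h>

    /* First interpolation guess using the array's endpoints as anchors.
     * Clamping key into [a[0], a[n-1]] first means the predicted index always
     * lands in [0, n-1], so no saturation is needed on this step. */
    static size_t first_guess(const uint64_t *a, size_t n, uint64_t key)
    {
        uint64_t lo = a[0], hi = a[n - 1];
        if (key <= lo) return 0;
        if (key >= hi) return n - 1;
        /* slope = (n - 1) / (hi - lo); guess = (key - lo) * slope */
        return (size_t)((double)(key - lo) / (double)(hi - lo) * (double)(n - 1));
    }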

I'd be curious to learn about related approaches which you've measured to be a big improvement, although I didn't understand all the terms you used.


As I said, a non-linear distribution can easily be fit with a quadratic curve if it's logarithmic.

Multidimensional curves, i.e. lots of holes in the data, make no sense to model for the initial pivot calculation; binary search handles those much better. The cost is similar to a hash table: almost constant, with 2-4 branches, and mostly into cached data.



