

Beating Binary Search - boredandroid
http://sna-projects.com/blog/2010/06/beating-binary-search

======
RiderOfGiraffes
This depends entirely on being able to do the interpolation on your keys. He
computes:

    
    
        offset = (key - min) / (max - min)
    

but not all data can have arithmetic done on it like that. What if the data in
your array are, for example, names?

It's pretty trivial to beat binary search if your data are integers, or other
keys on which you can do arithmetic. You can make the uniform distribution
assumption, check it as you go, degrade to binary search if the assumption
seems flawed, and then proceed from there.

However, if you're given an array of unknown objects, a comparison function,
and no way to do arithmetic on your objects, you can't beat binary search.
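
For what it's worth, a minimal sketch of what I mean by degrading to binary
search, assuming a sorted int[] (the names and the "slow step" heuristic here
are mine, not the article's):

    // Sketch only: interpolate under the uniform-distribution assumption, and
    // fall back to plain binary search when the interval stops shrinking fast.
    static int search(int[] a, int key) {
        int lo = 0, hi = a.length - 1;
        int slowSteps = 0;                    // crude check of the assumption
        while (lo <= hi && key >= a[lo] && key <= a[hi]) {
            int pos;
            if (slowSteps > 2 || a[hi] == a[lo]) {
                pos = lo + (hi - lo) / 2;     // assumption looks bad: bisect
            } else {
                long num = (long) (hi - lo) * ((long) key - a[lo]);
                pos = lo + (int) (num / ((long) a[hi] - a[lo]));
            }
            if (a[pos] == key) return pos;
            int before = hi - lo;
            if (a[pos] < key) lo = pos + 1; else hi = pos - 1;
            if (hi - lo > before - before / 8) slowSteps++;  // barely shrank
        }
        return -1;                            // not present
    }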

~~~
cperciva
_However, if you're given an array of unknown objects, a comparison function,
and no way to do arithmetic on your objects, you can't beat binary search._

... if all you care about is operation count. If you're fetching pages from
disk, a B+Tree will perform dramatically better than a binary search.

And if you're fetching pages from disk _and have a large cache_ , a B+Tree can
do far better than interpolation search, too. Interpolation search will thrash
your cache, resulting in O(log log n) seeks, whereas a cache of size O(n/B)
will allow a B+Tree to find data in O(1) seeks.

~~~
RiderOfGiraffes
But now you're talking about different data structures - his assumption was
that you had a simple sorted array. We both know that there's a lot that can
be done, and when performance _really_ matters you don't use a sorted array,
you don't use binary search, and you do take into account things like access
pattern, access speed, caches, _etc._

But people don't, because they read simplistic articles like this and then
believe they know everything. At least the author could _hint_ that there's a
lot more that can be done. As it is, it just leaves the naive reader with a
sense that they've seen the light, and the experienced reader irritated.

~~~
cperciva
True. Although I'd be interested in seeing actual benchmarks for interpolation
vs. binary search too, since binary search has better cache locality...

------
colomon
I'm fascinated that so many of the commenters here seem to be focusing on the
fact that the algorithm mentioned doesn't work in every case. My reaction was
different -- I thought of a case where it would be less than ideal, and
allowed it to inspire me to find an algorithm that worked better than naive
binary search for that case, too.

In particular, I thought of the case of searching arrays of strings. If you
are doing repeated searches against the same array, you can easily build a
table breaking the array down into subarrays all sharing the same first
character. (I'm assuming ASCII there!) Using that can quickly eliminate 5 or
so comparisons that would be required with a naive binary search.
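
Something like this, say (a hypothetical sketch; the names are mine, and it
assumes ASCII keys as noted):

    import java.util.Arrays;

    // start[c] holds the index of the first string whose first character is >= c.
    static int[] buildFirstCharIndex(String[] sorted) {
        int[] start = new int[129];                   // ASCII only, as assumed above
        Arrays.fill(start, sorted.length);
        for (int i = sorted.length - 1; i >= 0; i--) {
            int c = sorted[i].isEmpty() ? 0 : sorted[i].charAt(0);
            if (c < 128) start[c] = i;
        }
        for (int c = 127; c >= 0; c--)                // characters that never occur
            start[c] = Math.min(start[c], start[c + 1]);
        return start;
    }

    static int find(String[] sorted, int[] start, String key) {
        int c = key.isEmpty() ? 0 : key.charAt(0);
        if (c >= 128) {                               // non-ASCII key: full search
            int j = Arrays.binarySearch(sorted, key);
            return j >= 0 ? j : -1;
        }
        int idx = Arrays.binarySearch(sorted, start[c], start[c + 1], key);
        return idx >= 0 ? idx : -1;
    }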

Okay, that's not brilliant, but might be useful in some circumstances. My
bigger point is I thought this article was a very useful reminder that even a
terrific general purpose algorithm like binary search can be beaten if you can
apply specific domain knowledge to the search.

~~~
tbrownaw
That sounds a bit like Burst Tries / burstsort.

~~~
huherto
You mean the Burst-colomon sort.

------
bnoordhuis
Caveat emptor: The article fails to mention that interpolation search's worst-
case performance is O(n), considerably worse than binary search's O(log n).

~~~
wheaties
And? Quicksort in general runs faster than Heapsort, yet the worst-case
performance of the former is O(n^2) while the latter's is only O(n log n). If
they implemented it, I'm sure they found that there's a performance boost for
large datasets.

~~~
lliiffee
Note that properly implemented quicksort has expected O(n log n) time
independent of the data (dependent only on the random choices of pivots),
whereas interpolation search's performance is data dependent. That's a pretty
important distinction.
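
For concreteness, a sketch of the randomized version I mean (mine, not from
the article); the expectation is over the pivot choices, not the input:

    import java.util.Random;

    // Quicksort with uniformly random pivots: expected O(n log n) comparisons
    // for any input, because the expectation is taken over the pivot choices.
    static final Random RNG = new Random();

    static void quicksort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        int p = lo + RNG.nextInt(hi - lo + 1);        // random pivot index
        int pivot = a[p];
        swap(a, p, hi);                               // move pivot out of the way
        int store = lo;
        for (int i = lo; i < hi; i++)
            if (a[i] < pivot) swap(a, i, store++);
        swap(a, store, hi);                           // pivot into its final slot
        quicksort(a, lo, store - 1);
        quicksort(a, store + 1, hi);
    }

    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }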

~~~
pbiggar
Interestingly, that's not quite true: see
<http://www.scss.tcd.ie/~pbiggar/jea-2008.pdf>, section 8.3.

I'll point out too that no quicksort implementation uses a truly random choice
of pivot. Typically they use a median-of-three pivot, which is pseudo-random
at best.
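
Something along these lines, typically (illustrative only, not any particular
library's code):

    // Returns the index of the median of a[lo], a[mid], a[hi]; a common
    // "pseudo-random" pivot choice in real quicksort implementations.
    static int medianOfThree(int[] a, int lo, int hi) {
        int mid = lo + (hi - lo) / 2;
        int x = a[lo], y = a[mid], z = a[hi];
        if ((x <= y && y <= z) || (z <= y && y <= x)) return mid;
        if ((y <= x && x <= z) || (z <= x && x <= y)) return lo;
        return hi;
    }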

~~~
lliiffee
Your second point may be valid, though not literally. I've implemented
quicksort, and I used (pseudo) random pivots, damnit! :)

The article you point to doesn't seem to address the complexity of quicksort
with random pivots. However, that is proven to be expected O(n log n) time for
any data in Motwani's Randomized Algorithms book, and I don't think the
correctness of that has ever been questioned.

~~~
pbiggar
> The article you point to doesn't seem to address the complexity of quicksort
> with random pivots.

That paper is about experimental algorithmics, and avoids discussions of
things which aren't practical (such as using truly random pivots).

> However, that is proven to be expected O(n log n) time for any data in
> Motwani's Randomized Algorithms book, and I don't think the correctness of
> that has ever been questioned.

Right, but since there are no truly random quicksorts, it's hardly a concern.

~~~
lliiffee
Please clarify: Why are random pivots not practical? And why do you say there
are no truly random quicksorts? Because pseudorandom numbers aren't really
random, or for some other reason?

~~~
pbiggar
> Why are random pivots not practical?

The cost of true randomness is really high (in execution time).

> And why do you say there are no truly random quicksorts?

Because pseudo-random numbers aren't really random.

The only reason this is important is that the pseudo-randomness means that it's
possible to construct an adversarial input, which could drive the quicksort to
take N^2 time.

------
Tichy
I thought about this years ago, simply because it is also kind of the way
humans search in telephone books. (Advantage of being born in a time when
telephone books were still in use.) I think humans would also do repeated
interpolation, as some letters would be "thicker" (more pages in the book)
than others.

Bionics for the win...

------
aplari
You can beat pretty much any algorithm if you make convenient extra
assumptions. In this case they have suitable data for the interpolation search
(good for them!), but the title promises too much.

You can't beat binary search without new assumptions.

------
oinopion
It works well only on uniformly distributed keys. Not everyone has the
pleasure of working with such data.

~~~
seanos
Actually it will work with any distribution of keys, provided that you know
what that distribution is.

~~~
alextp
As long as they are integer keys, that is.

~~~
DrJokepu
It doesn't need to be an integer but you're right in that it requires an
additional property of the key type that most sorting and searching algorithms
do not need: being able to measure the distance between two keys. Most
algorithms only need to be able to decide whether a key is less, equal or
greater than another key.

~~~
sesqu
This is implicit in having access to the (cumulative) key distribution. Still,
the interpolation function can turn out to be pretty expensive, if your data
isn't uniform 0-100.
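
Roughly this, in other words (a sketch; `cdf` stands in for whatever
cumulative distribution you actually know about your keys):

    // Interpolate via a known CDF instead of assuming uniformity. cdfKey,
    // cdfLo and cdfHi are the cumulative probabilities of key, a[lo] and a[hi].
    static int probe(int lo, int hi, double cdfKey, double cdfLo, double cdfHi) {
        if (cdfHi == cdfLo) return lo + (hi - lo) / 2;   // degenerate: just bisect
        double frac = (cdfKey - cdfLo) / (cdfHi - cdfLo);
        return lo + (int) (frac * (hi - lo));
    }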

------
pascal_cuoq
I do not know this language (Java?), but doesn't the line

    
    
       int offset = (int) (((size - 1) * ((long) key - min)) / (max - min));
    

compute (max - min) as an int (potentially overflowing), and then convert it
to long, defeating the precautions obviously taken against this kind of event
with the cast to long elsewhere?

~~~
huherto
Let me try to break it down:

    
    
      (max - min) is int
      ((long) key - min) is long
      long / int is long since a long is involved.
      (size - 1) is int
      int * long is long
    

I think the purpose of the long cast is that the last multiplication produces
a long and doesn't overflow. But then it is cast to an int anyway, defeating
the purpose.

~~~
feijai
> (max - min) is int

Incorrect. These are signed values. If you subtract a large negative integer
from a large positive integer, the result can exceed the range of an int and
overflow.
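
For reference, one overflow-safer way to write the line under discussion (my
rewrite, not the author's code; it assumes max > min and min <= key <= max):

    // Keep every intermediate value in long so neither (key - min) nor
    // (max - min) can wrap around before the division.
    long num    = (long) (size - 1) * ((long) key - min);
    long denom  = (long) max - min;
    int  offset = (int) (num / denom);   // fits in an int: 0 <= offset <= size - 1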

------
panic
Interesting article, but without actual numbers it's hard to tell whether the
change made any difference in performance.

~~~
bad_user
That's what mathematics is for ... you just need the growth rate.

O(lg lg N) is a smaller growth rate than O(lg N).

Of course, for smaller datasets it doesn't matter.

In fact if you do sorting, for small datasets (that are mostly sorted) a
bubble-sort is better than quick-sort, since it doesn't involve random disk
accesses and it might even require fewer steps.

I would be curious if this method for searching would apply to sorting (e.g.
sorting done in O(N lg lg N)) ... since I remember that sorting done with
comparisons can't be better than O(N lg N) (a comparison sort's decision tree
has at least N! leaves, so its height is at least lg(N!) ≈ N lg N ... I think
the proof was based on that).

So to get back on-topic ... it really depends on the size of your data.

