

Beating Interpolation Search & Binary Search - DrJosiah
http://dr-josiah.blogspot.com/2010/06/beating-interpolation-search-binary.html

======
16s
Try boost::unordered_set. You'll average O(1) lookup time. We just converted an
app that was doing binary_search on a sorted std::vector to
boost::unordered_set, and the difference was dramatic on larger containers,
though slightly slower on small ones:

1 string: binary_search (28 seconds) unordered_set (36 seconds)

1,000,000 strings: binary_search (127 seconds) unordered_set (60 seconds)
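The two strategies being benchmarked above can be sketched as follows. This is an illustrative sketch, not the poster's actual code; std::unordered_set is used here as the standard equivalent of boost::unordered_set, and the function names are made up for the example.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// O(log n) lookup: binary search over a vector kept in sorted order.
bool contains_sorted(const std::vector<std::string>& sorted_keys,
                     const std::string& key) {
    return std::binary_search(sorted_keys.begin(), sorted_keys.end(), key);
}

// Average O(1) lookup: hash the key and probe the table directly.
bool contains_hashed(const std::unordered_set<std::string>& keys,
                     const std::string& key) {
    return keys.count(key) != 0;
}
```

The crossover seen in the numbers above is the usual trade-off: hashing pays a constant per-lookup cost (hash computation, probing) that dominates on tiny containers, while binary search's log-n comparisons dominate on large ones.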

~~~
DrJosiah
boost::unordered_set is an in-memory hash table, not unlike the non-standard
hash_set that ships with many compilers (using that may save you some
download/link time). Note that the blog post discusses your options when your
data is too large to fit in memory.

Most hash tables operate under the assumption (backed by good hashing and
probing algorithms) that hashed key collisions are somewhat common. If one
were to apply hashing to this problem (data on disk), the data file would
necessarily grow (hash tables rely on empty space to reduce the probability of
collisions), and every lookup would still require at least one disk seek+read,
and likely more depending on the characteristics of the particular hashing
algorithm used.
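The "empty space" point can be made concrete with the classic linear-probing estimates (due to Knuth) for the expected number of slots examined at load factor a: roughly ½(1 + 1/(1−a)) for a hit and ½(1 + 1/(1−a)²) for a miss. A small sketch of those formulas (the function names are mine):

```cpp
// Classic linear-probing cost estimates as a function of load factor
// (fraction of slots occupied). As the table fills, probes per lookup
// blow up -- which is why on-disk hashing must leave the file sparse.

// Expected slots examined for a successful lookup.
double probes_hit(double load) {
    return 0.5 * (1.0 + 1.0 / (1.0 - load));
}

// Expected slots examined for an unsuccessful lookup.
double probes_miss(double load) {
    double inv = 1.0 / (1.0 - load);
    return 0.5 * (1.0 + inv * inv);
}
```

At 50% full a hit costs about 1.5 probes; at 90% full it is about 5.5, and a miss about 50 -- each probe potentially a disk seek when the table lives on disk.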

Actually, since the problem as originally specified uses md5s as the key,
arguably the best hash probe sequence and chaining would behave much like
interpolation search, which the index still beats.
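For readers unfamiliar with it, interpolation search exploits exactly the property md5 keys have: values spread uniformly over their range. Instead of bisecting, each step guesses the key's position from its value, giving O(log log n) probes on average for uniform data. A hypothetical sketch over 64-bit keys:

```cpp
#include <cstdint>
#include <vector>

// Interpolation search over a sorted vector of uniformly distributed keys.
// Returns the index of `key`, or -1 if it is absent.
long interpolation_search(const std::vector<uint64_t>& keys, uint64_t key) {
    long lo = 0, hi = static_cast<long>(keys.size()) - 1;
    while (lo <= hi && key >= keys[lo] && key <= keys[hi]) {
        if (keys[hi] == keys[lo]) {  // avoid division by zero below
            return keys[lo] == key ? lo : -1;
        }
        // Estimate the position, assuming keys are spread evenly in [lo, hi].
        long mid = lo + static_cast<long>(
            static_cast<double>(key - keys[lo]) /
            static_cast<double>(keys[hi] - keys[lo]) * (hi - lo));
        if (keys[mid] < key)      lo = mid + 1;
        else if (keys[mid] > key) hi = mid - 1;
        else                      return mid;
    }
    return -1;
}
```

On non-uniform data the position estimate degrades and the search can fall back to O(n) behavior, which is why it is rarely a general-purpose replacement for binary search.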

------
kragen
This is more or less the approach used by Lucene and a few toy search engines
I've written over the years, including, most recently, dumbfts.

------
z92
tl;dr: use B-Tree indexes.

~~~
DrJosiah
I mention that in the post, get specific about the characteristics of the
particular B+Tree we just constructed, and even link to the relevant
Wikipedia entry. :)
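A back-of-the-envelope calculation shows why a wide B+Tree is the right shape for on-disk lookup: with branching factor f, a lookup touches about log base f of n nodes. This sketch and its numbers are my own illustration, not from the post:

```cpp
#include <cmath>

// Node reads (root to leaf) needed to find one key in a B+Tree holding
// n keys with the given branching factor.
int btree_reads(double n, double fanout) {
    return static_cast<int>(std::ceil(std::log(n) / std::log(fanout)));
}
```

For a million keys, binary search needs about 20 probes (branching factor 2), each a potential disk seek, while a B+Tree with fanout 50 needs only 4 node reads, and the top levels typically stay cached in memory.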

~~~
z92
Then the tl;dr is right. Not sure why the downvotes, though.

