
Finding the top K items in a list efficiently - rcfox
http://stevehanov.ca/blog/?id=122
======
shazow
Ten million items is a lot for real-time on-demand code. I rarely deal with
more than a few hundred items. Let's compare:

    
    
        import heapq
        import random

        # heapSearch is the top-k function from the article
        dataset_1mil = list(range(1000000))
        dataset_1k = list(range(1000))
        dataset_100 = list(range(100))

        for d in (dataset_1mil, dataset_1k, dataset_100):
            random.shuffle(d)
    
    
        %timeit sorted(dataset_1mil, reverse=True)[:10]
        # 1 loops, best of 3: 1.12 s per loop
    
        %timeit heapq.nlargest(10, dataset_1mil)
        # 1 loops, best of 3: 330 ms per loop
    
        %timeit heapSearch(dataset_1mil, 10)
        # 1 loops, best of 3: 268 ms per loop
    
    
        %timeit sorted(dataset_1k, reverse=True)[:10]
        # 1000 loops, best of 3: 302 us per loop
    
        %timeit heapq.nlargest(10, dataset_1k)
        # 1000 loops, best of 3: 248 us per loop
    
        %timeit heapSearch(dataset_1k, 10)
        # 1000 loops, best of 3: 346 us per loop
    
    
        %timeit sorted(dataset_100, reverse=True)[:10]
        # 10000 loops, best of 3: 22 us per loop
    
        %timeit heapq.nlargest(10, dataset_100)
        # 10000 loops, best of 3: 46.9 us per loop
    
        %timeit heapSearch(dataset_100, 10)
        # 10000 loops, best of 3: 128 us per loop
    
    

There you have it: sorted is best for small datasets in the hundreds, the
break-even point is in the thousands, and heapSearch wins in the millions.

Data structures are great, but be aware of the strengths and weaknesses of
their implementations. Most recently I discovered that checking collisions in
many small sets is faster with tuples (creating sets is rather slow). Test
things with your data, not just "big data".
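
A minimal sketch of the kind of measurement I mean (the data and function
names are made up for illustration): probe pairs of small collections for
overlap, once with plain tuples and a linear scan, once by building sets
first.

        # 100 pairs of small, partially overlapping collections
        pairs = [(tuple(range(i, i + 5)), tuple(range(i + 3, i + 8)))
                 for i in range(100)]

        def collide_tuples(a, b):
            # linear scan over a small tuple; no per-call set construction
            return any(x in b for x in a)

        def collide_sets(a, b):
            # pays to build two sets on every call just for O(1) lookups
            return bool(set(a) & set(b))

        %timeit [collide_tuples(a, b) for a, b in pairs]
        %timeit [collide_sets(a, b) for a, b in pairs]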

Bonus tip: Amortize when you can. If you're sorting your data anyway, then
even the cleverest heap won't undo work already done.
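
For instance (a sketch with made-up data): if you keep the list sorted as
items arrive, the top k afterwards is just a slice.

        import bisect

        incoming = [5, 1, 9, 3, 7, 2, 8]   # stand-in for a data stream
        sorted_data = []
        for x in incoming:
            bisect.insort(sorted_data, x)  # keep the list sorted as we go

        top3 = sorted_data[-3:]  # top-k of already-sorted data is a slice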

------
whackedspinach
Earlier this year I visited U of Illinois and sat in with a group of seniors
who were working on a homework problem. I printed off the homework problem for
myself but was not able to solve it since I had just learned about Big O
notation in my high school Calculus class. I think this algorithm, if modified
to select by rank instead of just top k items, might be the solution.

The problem in question is 2B:

    
    
        2. (30 pts.) You are given an array A with n distinct numbers
        in it, and another array B of ranks i1 < i2 < ... < ik. An
        element x of A has rank u if there are exactly u - 1 numbers
        in A smaller than it. Design an algorithm that outputs the
        k elements in A that have the ranks i1, i2, ..., ik.

        (B) (20 pts.) Describe an O(n log k) recursive algorithm for
        this problem. Prove the bound on the running time of the
        algorithm.
    

Source: <http://www.cs.illinois.edu/class/sp11/cs473/hw/hw_03.pdf>

Is that possible?

~~~
davidtgoldblatt
If you don't mind sharing your email I can send you a solution sketch. I don't
want to post one online since homework problems tend to get reused from year
to year.

~~~
whackedspinach
Sure, cg@colegleason.com.

------
Adrock
This can be further optimized with a custom heap implementation. Once there
are k elements in the heap, we know that whenever we remove an element from
the heap, we will immediately be adding a new one.

In a traditional heap implementation, popping is performed by replacing the
root with the bottom-right-most element and heapifying down. This is
potentially 2 * log(k) comparisons. Pushing is performed by adding the new
element to the bottom-right and heapifying up, which is up to log(k)
comparisons. We can combine the pop and push into a single operation and
reduce the 3 * log(k) comparisons to just the 2 * log(k) comparisons. Instead
of replacing the root with the bottom-right-most element, just replace it with
the new element and heapify down. This not only reduces the worst-case number
of comparisons; it also improves the expected number, since the bottom-right
element is likely to still belong at the bottom and would be pushed back down
the entire tree, while the new element is more likely to belong somewhere
shallower.
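
Python's heapq module ships exactly this combined operation as heapreplace
(replace the root with the new element, then sift down once). A minimal
top-k sketch using it:

        import heapq
        import itertools

        def top_k(iterable, k):
            it = iter(iterable)
            heap = list(itertools.islice(it, k))  # first k items, any order
            heapq.heapify(heap)  # min-heap: root is the smallest of the k
            for x in it:
                if x > heap[0]:
                    # one O(log k) replace-and-sift-down instead of a
                    # separate pop plus push
                    heapq.heapreplace(heap, x)
            return sorted(heap, reverse=True)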

------
riffraff
fun fact: MySQL did not do this up to v5.6, afaict. This means that for a
query like

    
    
        select * from ... order by non-index-column-or-calculated limit k
    

MySQL would dump all the results into a temp table (possibly on disk) and
then sort them, instead of keeping a running top-k buffer.

------
veyron
if K is fixed, as in the example from the article, this can be done in O(N)
...

~~~
Kroem3r
I don't know what the protocol is on HN, but I would be interested to see it.

It seems to me that if you know the K'th largest, it's O(N); but otherwise,
it's harder.

~~~
davidtgoldblatt
I suspect your parent poster is being facetious (if we fix k, then just keep a
buffer of the k largest elements we've seen so far. For any new element we
see, check if it's larger than the smallest element in the buffer and if so
remove the smallest and insert the new one. This takes O(k) time, so the
algorithm overall takes O(kn), which, since k is constant, is O(n)).
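
A sketch of that buffer approach, for concreteness:

        def top_k_fixed(xs, k):
            buf = []  # the k largest elements seen so far, unordered
            for x in xs:
                if len(buf) < k:
                    buf.append(x)
                else:
                    smallest = min(buf)  # O(k) scan per new element
                    if x > smallest:
                        buf.remove(smallest)
                        buf.append(x)
            return buf  # O(k*n) overall, so O(n) for any fixed k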

In fact, this can be done in O(n) time even if k isn't constant: use a
selection algorithm to find the k'th largest element (in O(n) time) and then
go through the list again and output every number at least that large.

~~~
axman6
Doesn't this only work if the elements are unique? Say you wanted the top
three from [1,2,3,2,3,3,4,3,3,3,3,4,3]; then the k'th largest element would
be 3, and we'd end up outputting [3,3,3,4,3,3,3,3,4,3], which is obviously
not k elements.

~~~
davidtgoldblatt
For these types of problems you usually assume that every element is unique.
The fix is just making two passes through the array instead of one, so it's
still O(n).

It's typically easy to reduce such a problem with duplicates allowed to a
very similar one without duplicates: just replace every element x by the
ordered pair (x, the position in the array where x came from), and use a
modified comparison function that falls back to comparing the second elements
whenever two pairs are equal in the first.
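
Putting the two posts together, a sketch of the selection-based version
(using a randomized quickselect, so the O(n) bound holds in expectation
rather than in the worst case) with the two-pass fix so duplicates come out
right:

        import random

        def kth_largest(xs, k):
            # randomized quickselect: expected O(n); rank k is 1-indexed
            pivot = random.choice(xs)
            greater = [x for x in xs if x > pivot]
            if k <= len(greater):
                return kth_largest(greater, k)
            equal_count = xs.count(pivot)
            if k <= len(greater) + equal_count:
                return pivot
            smaller = [x for x in xs if x < pivot]
            return kth_largest(smaller, k - len(greater) - equal_count)

        def top_k(xs, k):
            threshold = kth_largest(xs, k)
            # pass 1: everything strictly above the threshold...
            out = [x for x in xs if x > threshold]
            # pass 2: ...then pad with copies of the threshold itself,
            # so duplicates can't inflate the result past k elements
            out += [threshold] * (k - len(out))
            return out

On axman6's example, top_k([1,2,3,2,3,3,4,3,3,3,3,4,3], 3) returns [4, 4, 3].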

------
rhdoenges
I love discovering all these new Python data structures. Last week I realized
how handy namedtuples are, and I will definitely use this.

~~~
axman6
I just love learning about new data structures, and interesting ways to use
them.

------
thesz
When you have lazy evaluation in your tools, finding the K extremal elements
of a list is a one-liner:

    
    
       getKSmallest k list = take k (sort list)
    

or even shorter in point-free notation:

    
    
       getKSmallest k = take k . sort
    

<http://en.wikipedia.org/wiki/Selection_algorithm#Language_support>

~~~
CJefferson
Does this really achieve O(n·k) for small k? I'm not sure I believe sort can
be that lazy without being deficient when used in a non-lazy context.

Some benchmarks might convince me :)

~~~
thesz
Let us convince ourselves that sort sorts:

        GHCi, version 6.12.1: http://www.haskell.org/ghc/  :? for help
        Loading package ghc-prim ... linking ... done.
        Loading package integer-gmp ... linking ... done.
        Loading package base ... linking ... done.
        Loading package ffi-1.0 ... linking ... done.
        Prelude> :m Data.List        -- for sort function
        Prelude Data.List> :se +s    -- to display timings

        Prelude Data.List> last $ sort [1..2^20]
        1048576
        (1.34 secs, 544634924 bytes)
        Prelude Data.List> last $ sort [1..2^21]
        2097152
        (2.84 secs, 1125744400 bytes)
        Prelude Data.List> last $ sort [1..2^22]
        4194304
        (5.51 secs, 2325062504 bytes)
        Prelude Data.List> 1.34/(2^20)/20
        6.389617919921875e-8
        (0.02 secs, 1576728 bytes)
        Prelude Data.List> 2.84/(2^21)/21
        6.44865490141369e-8
        (0.00 secs, 1578108 bytes)
        Prelude Data.List> 5.51/(2^22)/22
        5.971301685680042e-8
        (0.00 secs, 1572220 bytes)

Looks like it really sorts, even on a strictly ascending sequence (the timing
anomaly at 2^22 could be attributed to GC or something).

Let us take the first N elements from the sorted list:

        Prelude Data.List> sum $ take (2^1) $ sort [1..2^22]
        3
        (2.25 secs, 760208692 bytes)
        Prelude Data.List> sum $ take (2^2) $ sort [1..2^22]
        10
        (2.50 secs, 760292356 bytes)
        Prelude Data.List> sum $ take (2^3) $ sort [1..2^22]
        36
        (2.25 secs, 759678188 bytes)
        Prelude Data.List> sum $ take (2^4) $ sort [1..2^22]
        136
        (2.51 secs, 760209208 bytes)
        Prelude Data.List> sum $ take (2^5) $ sort [1..2^22]
        528
        (2.36 secs, 759678236 bytes)
        Prelude Data.List> sum $ take (2^6) $ sort [1..2^22]
        2080
        (2.31 secs, 759682904 bytes)

I don't know why this looks almost constant. But it is!

I expected to be able to show O(n log k), but it performed even better. ;)

------
antirez
for more flexibility, like taking arbitrary ranges of elements, it is
possible to use a partial quicksort, which is pretty easy to implement (just
don't recurse into subranges that fall entirely outside the range you want
sorted).
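
A minimal in-place sketch of that idea (assuming a Lomuto partition with a
random pivot; after the call, positions left..right of the array hold exactly
the elements of those ranks, in order):

        import random

        def partial_qsort(a, left, right, lo=0, hi=None):
            # quicksort that only guarantees positions left..right end
            # up holding their correctly ranked elements, in order
            if hi is None:
                hi = len(a) - 1
            if lo >= hi:
                return
            # Lomuto partition around a randomly chosen pivot
            p = random.randint(lo, hi)
            a[p], a[hi] = a[hi], a[p]
            pivot, store = a[hi], lo
            for i in range(lo, hi):
                if a[i] < pivot:
                    a[i], a[store] = a[store], a[i]
                    store += 1
            a[store], a[hi] = a[hi], a[store]
            # the whole trick: don't recurse into a half that lies
            # entirely outside the rank window we care about
            if left < store:
                partial_qsort(a, left, right, lo, store - 1)
            if right > store:
                partial_qsort(a, left, right, store + 1, hi)

For example, partial_qsort(data, 10, 19) leaves data[10:20] holding the
elements of ranks 10..19 in sorted order, without paying to fully sort the
rest of the array.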

