Note that you shouldn't use the algorithm in practice (unless you need certain guarantees), since the average performance isn't that great. This is just like Mergesort vs Quicksort, where Mergesort guarantees O(N log N) yet Quicksort is usually much faster. The constant on the number of comparisons is quite high. Compare this to the average complexity of Quickselect with these pivot strategies:
- Random pivot (i.e. median of 1) has 3.39N comparisons
- Median of 3 has 2.75N
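For concreteness, here's roughly what that Quickselect looks like with a median-of-3 pivot (a Haskell sketch, 0-indexed; the names and the small-list cutoff are mine):

    import Data.List (partition, sort)

    -- Minimal quickselect: k-th smallest, 0-indexed. The median-of-3
    -- pivot (first, middle, last element) is the strategy behind the
    -- 2.75N expected comparisons quoted above.
    quickselect :: Ord a => Int -> [a] -> a
    quickselect k xs
      | length xs <= 3 = sort xs !! k
      | otherwise =
          let p          = medianOf3 xs
              (ls, rest) = partition (< p) xs
              (es, gs)   = partition (== p) rest
              nl         = length ls
              ne         = length es
          in if k < nl
               then quickselect k ls
               else if k < nl + ne
                      then p
                      else quickselect (k - nl - ne) gs

    medianOf3 :: Ord a => [a] -> a
    medianOf3 xs = sort [head xs, xs !! (length xs `div` 2), last xs] !! 1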
If you want the fastest possible algorithm you should follow
http://dx.doi.org/10.1109/TSP.2012.2197394
which proposes choosing the median of N^{2/3}/(4pi)^{1/3} elements as the pivot, and then also uses a different strategy to choose the second pivot (close to the median yet "safe").
Another fast algorithm is Floyd & Rivest's SELECT (yes, the same Rivest as in RSA), which is also asymptotically optimal, and implementations exist. (Paper title: "Expected time bounds for selection")
> "Note that you shouldn't use the algorithm in practice"
You certainly can:
1. Run the randomized algorithm
2. If the randomized run takes too long (more than c*n work, for whatever constant c makes sense to you), just stop doing more work and switch to the deterministic algorithm
Now you get the best of both speeds, within a small constant factor (like 2).
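Concretely, the wrapper is only a few lines (a Haskell sketch; selectDet is a placeholder name I'm introducing here - a real version would be median-of-medians, and the head pivot stands in for a random one):

    import Data.List (partition, sort)

    -- Run quickselect with a work budget of c*n; if unlucky pivots
    -- exhaust the budget, abandon the run and finish with the
    -- deterministic algorithm. Worst case stays O(n), average stays
    -- close to plain quickselect.
    selectHybrid :: Ord a => Int -> [a] -> a
    selectHybrid k xs = go (2 * length xs) k xs   -- budget with c = 2
      where
        go budget i ys
          | budget <= 0 = selectDet i ys          -- too slow: bail out
          | [y] <- ys   = y
          | otherwise =
              let p          = head ys            -- stand-in for a random pivot
                  (ls, rest) = partition (< p) ys
                  (es, gs)   = partition (== p) rest
                  nl         = length ls
                  ne         = length es
                  budget'    = budget - length ys -- charge this partition
              in if i < nl
                   then go budget' i ls
                   else if i < nl + ne
                          then p
                          else go budget' (i - nl - ne) gs

    -- Placeholder fallback, kept self-contained via a full sort;
    -- substitute median-of-medians for the real O(n) guarantee.
    selectDet :: Ord a => Int -> [a] -> a
    selectDet i ys = sort ys !! i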
This is one of my favorite algorithms. I use it frequently in my code (example: filtering noise in a real time system) and get giddy when I talk to others about it. One of the contributors to this algorithm was Manuel Blum [0], whose contributions (and his students') touch pretty much every corner of Computer Science.
His lecture notes from his course at Carnegie Mellon describe this algorithm quite clearly [1].
According to the man himself, Manuel Blum found this algorithm while trying to prove that it was impossible to do k-selection in linear time deterministically.
Pretty sure this is wrong. For example, what about the case where k=n? Then the lazy evaluation will have to fully sort the list before it can determine the kth element, resulting in an O(n log n) algorithm.
In the standard idiom,
smallest = head . sort
we're only looking for the case of k=1, so we can do it in O(n) time.
Note that if you used a lazy BST, then you could probably still get the last element in O(n) run time, but your example code doesn't do that. AFAIK, no standard library routine exists for this.
If your sort is lazy enough, the `(sort xs) !! k` approach will work fine. Haskell's default sort isn't lazy enough, but you could certainly make it so.
You don't have to fully sort the list before determining the kth element (again assuming a sufficiently lazy implementation). Once you know the element is in the second half of the list, the sort simply will never sort the first half.
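For concreteness, a sufficiently lazy sort is short to write (a sketch; this is not what Data.List.sort does):

    -- Lazy quicksort: the output is produced incrementally, so
    -- forcing `lazySort xs !! k` only evaluates the partitions needed
    -- to produce the first k+1 elements. Everything past index k is
    -- never sorted; the expected cost is around O(n + k log k) rather
    -- than a full O(n log n) sort.
    lazySort :: Ord a => [a] -> [a]
    lazySort []     = []
    lazySort (p:xs) = lazySort [x | x <- xs, x < p]
                      ++ p : lazySort [x | x <- xs, x >= p]

    kthSmallest :: Ord a => Int -> [a] -> a
    kthSmallest k xs = lazySort xs !! k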
It should be noted that the extra overhead of a lazy quicksort (all those additional thunks) makes it an unlikely choice for the standard sort function; you'd suffer a large constant-factor overhead in the typical case of processing the entire list in sorted order.
I don't see how that follows from lazy evaluation. Here's my logic (corrections welcome):
- That 'sort' will produce the items of the list in some order ('third item is 2nd largest', '17th item is 34th largest', ...).
- Lazy evaluation cannot stop before the sort produces 'x is i-th'.
- 'sort' does not know what is done with its output (i.e. it does not know the value of i)
If these are true, Evie can pick i in such a way that it is produced last, that is: such that sort must sort the complete list. That requires O(n log n) time.
The only way around this I can see is if the language recognizes the idiom and special-cases it to pick a sort algorithm, based on both the sequence xs and the requested index i. Does Haskell contain such a special case?
Repeated requests for the i-th element will not become 'faster and faster'; they'd simply be Θ(i) - they'd traverse i cons cells, then read the memoized data component.
Sorting the sub-lists to find the medians of the groups of 5 is by itself at best O(n log n), as per this example. That would make the algorithm non-O(n). Or am I missing something?
Yes, you are. It's O(n log n) where n is 5 - i.e. it's constant time. (You can easily find the median with 4 comparisons to find the smallest of the 5, followed by 3 comparisons for the 2nd smallest, followed by 2 comparisons for the median - a total of 9 comparisons.)
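In code, that scheme is tiny (a Haskell sketch; `delete` spends a few extra equality checks, but the ordering comparisons are the 4 + 3 + 2 described above):

    import Data.List (delete)

    -- Median of exactly 5 elements, selection-sort style: drop the
    -- smallest twice (4 + 3 comparisons), then the minimum of the
    -- remaining 3 is the median (2 comparisons). Constant work, since
    -- the "n" here is always 5.
    median5 :: Ord a => [a] -> a
    median5 = minimum . dropMin . dropMin
      where dropMin ys = delete (minimum ys) ys

E.g. median5 [3,1,4,1,5] is 3.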
So embarrassing when a candidate busted this out on me in an interview - but couldn't quite explain it - so I was convinced he was wrong / out of his mind. Later I looked it up in the CLR and sure enough...