Hacker News new | comments | show | ask | jobs | submit login
The Median-of-Medians Algorithm (austinrochford.com)
115 points by MidsizeBlowfish 1423 days ago | hide | past | web | 31 comments | favorite

Note that you shouldn't use the algorithm in practice (unless you need certain guarantees) since the average performance isn't that great. This is just like Mergesort vs Quicksort where Mergesort guarantees O(NlogN) yet Quicksort is usually much faster. The constant for the number of comparisons is quite high. Compare this to the average complexity for Quickselect with these pivot strategies:

- Random pivot (ie Median of 1) has 3.39N comparisons

- Median of 3 has 2.75N

If you want the fastest possible algorithm you should follow


which proposes to choose the median of N^{2/3}1/(4pi)^{1/3} elements as the pivot and then also uses a different strategy to choose the second pivot (close to the median yet "safe").

Another fast algorithm is Floyd & Rivest's SELECT (yes the same crypto Rivest) which is also asymptotically optimal and implementations exist. (Paper title: "Expected time bounds for selection")

> "Note that you shouldn't use the algorithm in practice"

You certainly can:

1. Run the randomized algorithm

2. If if the usual way takes too long (more than O(n) for whatever constant factor makes sense to you), just stop doing more work and switch to the deterministic algorithm

Now you get the best of both speeds, within a small constant factor (like 2).

Or, since we have enough cores, run both in two parallel threads, keeping the original list read-only.

Or distribute out the median_5() work to many cores at once since each call is independent of the others.

This is one of my favorite algorithms. I use it frequently in my code (example: filtering noise in a real time system) and get giddy when I talk to others about it. One of the contributors to this algorithm was Manuel Blum [0], whose contributions (and his students') touch pretty much every corner of Computer Science.

His lecture notes from his course at Carnegie Mellon describe this algorithm quite clearly [1].

[0] http://en.wikipedia.org/wiki/Manuel_Blum

[1] http://www.cs.cmu.edu/afs/cs/academic/class/15451-s07/www/le...

Minsky -> Blum -> Adleman, Minsky -> Blum -> Goldwasser are 2 hat-tricks of Turing awards involving Blum.

According to the man himself, this algorithm was found by Manuel Blum when trying to prove that it was impossible to do k selection in linear time deterministically

Do you know if that's been proven, or is it still an open question?

The algorithm described in the article is deterministic and linear time -- exactly what Blum was trying to disprove as possible

This is so trivial!

Actually, if you use lazy data structures, then sorting the list lazily and taking i-th element will give you the best possible complexity: http://en.wikipedia.org/wiki/Selection_algorithm#Language_su...

In Haskell, it is nothing more than "ithSmallest xs i = sort xs !! i".

Pretty sure this is wrong. For example, what about the case where k=n? Then the lazy evaluation will have to fully sort the list before it can determine the kth element, resulting in an O(n log n) algorithm.

In the standard idiom,

   smallest = head . sort
we're only looking for the case of k=1, so we can do it in O(n) time.

Note that if you used a lazy BST, then you could probably still get the last element in O(n) run time, but your example code doesn't do that. AFAIK, no standard library routine exists for this.

If your sort is lazy enough, the `(sort xs) !! k` approach will work fine. Haskell's default sort isn't lazy enough, but you could certainly make it so.

You don't have to fully sort the list before determining the kth element (again assuming a sufficiently lazy implementation). Once you know the element is in the second half of the list, the sort simply will never sort the first half.

It should be noted that the additional overhead from performing lazy quicksort (all those additional thunks) make it an unlikely choice for the standard sort function; you'd suffer a large constant factor overhead in the typical case of processing the entire list in sorted order.

Yep, definitely should have noted that. It's an interesting mental exercise though.

I don't see how that follows from lazy evaluation. Here's my logic (corrections welcome):

- That 'sort' will produce the items of the list in some order ('third item is 2nd largest, 17th item is 34th largest,...).

- Lazy evaluation cannot stop before the sort produces 'x is i-th'.

- 'sort' does not know what is done with its output (i.e. it does not know the value of i)

If these are true, Evie can pick i in such a way that it is produced last, that is: such that sort must sort the complete list. That requires O(n log n) time.

The only way around this I can see is if the language recognizes the idiom and special-cases it to pick a sort algorithm, based on both the sequence xs and the requested index i. Does Haskell contain such a special case?

By lazy, thesz is referring to a property of the data structure, not lazy evaluation.

`sort xs` evaluates to a data structure which contains the original unsorted sequence.

When asked for the ith element, the datastructure uses quickselect to produce the ith smallest element, while also partially sorting the sequence.

Repeated requests for the ith element will become faster and faster, as the sequence becomes more and more sorted.

Repeated requests for the i-th element will not become 'faster and faster' but instead would simply be Θ(i) - they'd traverse i list pairs, then read the memoized data component.

If you like this, go look into the suffix array construction algorithm, called DC3 ("difference cover, modulo 3"). It's even more mindblowing.

Pang Ko's is cooler IMHO but then we are old lab mates.

The C++ standard library method std::nth_element states complexity; linear time on average. ( http://en.cppreference.com/w/cpp/algorithm/nth_element )

I wonder if there are other possible implementations?

I'm coding it in Ruby. Check back tomorrow and I should have it finished. https://gist.github.com/chadbrewbaker/7202412#file-median-rb

Sorting the sub-lists to find the 5 medians by itself is at best a O(nlogn) as per this example. That would make the algorithm non O(n). Or am I missing something?

Yes you are. It's O(nlogn) where n is 5 - i.e. it's constant time (you can easily find the median by 4 comparisons to find the smallest of the 5, followed by 3 comparisons for the 2nd smallest, followed by 2 comparisons for the median - a total of 9 comparisons)

Put otherwise: to sort them all it's O(n/5*(5 log 5)) = O(n).

Finished. Enjoy.

Nice writeup. There's a good source already though: http://www.cs.cmu.edu/~avrim/451f11/lectures/lect0908.pdf

You can use the same idea in quicksort while choosing the pivot (and make it O(n log n)) but you don't because it is expensive.

There is a very nice and simple probabilistic algorithm instead.

So embarrassing when a candidate busted this out on me in an interview - but couldn't quite explain it - so I was convinced he was wrong / out of his mind. Later I looked it up in the CLR and sure enough...

Well written. This was the first algo we were taught in our algorithms design course from CLRS. Blew my mind at that time.

Applications are open for YC Winter 2018

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact