

Fast median search: an ANSI C implementation
http://ndevilla.free.fr/median/median/index.html

======
jallmann
I've been using a variation of quickselect to find median rows of a matrix at
certain columns.

There's a problem with quickselect: while it will find the median, it doesn't
gather duplicates of the median together when the median value occurs multiple
times. So you could end up with copies of the median sprinkled on both sides
of the pivot, which may or may not be a problem (it was for me).

One way to solve this is to take a second pass over the data to pivot --
essentially the core partition step of quickselect, where you nibble from both
ends and swap low/high values.

Another nice property of quickselect is that it leaves the array approximately
sorted, with values generally growing closer to the median towards the middle.
This helps if you need to do repeated sub-selections on the partitions.

~~~
susi22
There is actually a paper about exactly this:

"Optimal Pivot Selection in Fast Weighted Median Search"

ieeexplore.ieee.org/iel5/78/6236322/06193457.pdf

It gives a formula for the optimal subset size to use to find the pivot and
also does some optimization after the first partitioning step.

------
andrewcooke
maybe this is included and i missed it, but it's asymptotically faster to
maintain a sorted tree of values plus a list of pointers to nodes in order
added. if you store "number of values to right" in each tree node then finding
the next median (moving the window one position) is O(log w) and total cost
for whole array is O(n log w) iirc.

this is used to median filter images in the IRAF package. i don't know if the
approach is published anywhere (it's pretty obvious once the idea of keeping
points within the window in a sorted tree "clicks"), but frank valdes did test
it against other approaches.

the main drawbacks are that the overhead/constant is pretty high, so you need
fairly large datasets (more exactly, large windows) for it to be a win, and
implementation in old fortran is a pain...

~~~
jallmann
If the window is the length of the data, that becomes O(n log n) --
essentially no better than a sort?

For images and other kernel-based filtering, there is an O(1) median algorithm
available: <http://nomis80.org/ctmf.html>

~~~
sesqu
Thanks for the link. Note, though, that the median algorithm used there uses a
different parameter for n than the OP. It is also bucketsort-based and O(n);
the O(1) bound holds only amortized (in both parameters).

Additional knowledge of your data is often useful, like using the fact that
all data are 8-bit integers above. Similarly, if you know your data is
uniformly distributed, you could try Torben (in TFA) for pivot selection in
quickselect.

------
vowelless
I would have liked to see a benchmark of the deterministic worst-case O(n)
BFPRT [1] algorithm as well. It holds up well in worst-case scenarios if you
have a very large dataset.

[1]
<http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm>

------
cschmidt
That's funny. I just looked back at my old email, and I used this very
quickselect.c implementation back in April 2008. It worked well and was wicked
fast.

The algorithm by Blum, Floyd, Pratt, Rivest, and Tarjan takes 24n comparisons
in the worst case. A description here:

<http://www.ics.uci.edu/~eppstein/161/960130.html>

However, this algorithm has an average case performance of 4n comparisons:

<http://www.ics.uci.edu/~eppstein/161/960125.html>

------
twiceaday
Torbens method is very interesting. At each step it takes O(n) to half the
range of inputs. This doesnt say much about how many numbers in the array it
removes, but it does say that on average its about half, leading to O(nlogn)
expected time. In the worst case it is Quadratic, with one example being [1,
2, 4, 8, 16, 32, 64, ...]

