Here is a clever algorithm to find the median of a stream of integers by using j...

yummyfajitas · on May 22, 2012

It is easy to prove that for a large enough stream, assuming the stream is drawn from iid samples, this would converge to the median, by central limit theorem.

I'm not sure what proof you have in mind, but I don't think using the word "converge" is correct.

Near as I can tell, your algorithm is x[n+1]= x[n] + sign(sample[n]-x[n]).

If you are applying the c.l.t. to sign(sample[n]-x[n]), all this says is that your algorithm will output a gaussian random variable with mean=median(samples), but sigma=sqrt(n). But maybe that isn't the proof you were thinking of?

[edit: also, the clt doesn't apply to sign(sample[n]-x[n]), since that isn't an i.i.d. sequence of random variables (it's not identical).

One possible proof I thought of: consider probability vectors of the form p[n] = [P(x[n] = 1), P(x[n] = 2), ...]. You can write the evolution of the probability vectors as matrix problem, p[n+1]=A p[n]. I'll bet you can break A = Projection onto median + remainder and show the remainder has norm less than 1. Just a guess, but I think it might work.]

tel · on May 22, 2012

Yeah, this is more Martingale territory than clt. I wrote a proof below that gives a limited Martingale property of the process, but I don't know enough further theory to continue. (slash I'm lazy and have better things to do)

sesqu · on May 22, 2012

It's worth stressing that while the c.l.t. gives a reasonable justification to using that algorithm for the mean, that justification does not extend to the median, which can diverge from the mean by as much as range/2. You'd need to be able to assume something like a connected set.

bjornsing · on May 21, 2012

> It is easy to prove that...

Reminds me of a similar comment in the margin of a old book: http://en.wikipedia.org/wiki/Fermats_Last_Theorem. :)

But it does sound both plausible and brilliant. How do you prove it (in broad terms)?

tel · on May 22, 2012

I can't think of an easy proof but it is clear that such a sequence of estimators, {V_n}, has the median as a fixed point in the following sense.

Assume V_(n-1) = median(X) for the data stream X_n. Now we have

    E[V_n | V_(n-1)] = V_(n-1) + E[signum(V_(n-1) - X_n)]

but by the definition of the median, the right hand side expectation is 0, so the random process is fixed in expectation.

Proving it'll converge to that value would be harder, though. It'll probably depend on the data being much larger than the step size (which is 1).

njs12345 · on May 22, 2012

Slightly off topic, but looks like toemetoch has tripped one of the hellban filters, isn't a spammer, and has no contact info in his profile. Might want to sort that out if you're reading this toemetoch!

pg · on May 22, 2012

No, he just accidentally posted the same comment twice. The new one got autokilled. Then he deleted the first one, leaving a single dead comment.

toemetoch · on May 22, 2012

Thanks for the info. Account seems to be normal again, mental state is not.

geebee · on May 21, 2012

as we said in college, "the proof is trivial and is left for the grader"

derleth · on May 22, 2012

toemetoch, who is dead, said:

"In a normal distribution the probability of a value higher or lower than mean is 50%. So if you find mean, the average of your +1/-1 noise is zero -> if you're on mean you'll remain there.

If you're not on mean you have a (50 + error)% chance of moving towards mean and (50 - error)% chance of moving away from mean on the next random value -> the trend is towards mean."

toemetoch · on May 22, 2012

toemetoch, who is dead

... I'm not following. I can see my reply a few posts down. I accidentally hit submit twice, deleted one but the second one is still there.

Edit: The same technique can be found in some 1-bit ADCs, I believe it's delta-sigma ADC.

derleth · on May 22, 2012

Odd. I still only see one response from you with that content, and it is dead.

mturmon · on May 22, 2012

OTOH, if you know they're integers (as you assume), you might use a simple histogram. Not the same, because you'd have to range the histogram -- but we seem to not mind throwing out a lot of data to get in the ballpark anyway.

And if you don't know they're integers, you have to find a scaling rule for your increment that assures convergence. This is the same problem as bucket size selection for the histogram.