Hacker News new | past | comments | ask | show | jobs | submit login

Here is a clever algorithm to find the median of a stream of integers by using just one variable:

The idea is to maintain a floating median. Start with an arbitrary number, say 0. If the incoming number is greater than the estimate, num += 1, else num -= 1.

It is easy to prove that for a large enough stream, assuming the stream is drawn from iid samples, this would converge to the median, by central limit theorem.

Using two variables instead of one, you could converge faster. So instead of incrementing/decrementing by one, you store how big your leap is, and have a schedule to change it.




It is easy to prove that for a large enough stream, assuming the stream is drawn from iid samples, this would converge to the median, by central limit theorem.

I'm not sure what proof you have in mind, but I don't think using the word "converge" is correct.

Near as I can tell, your algorithm is x[n+1]= x[n] + sign(sample[n]-x[n]).

If you are applying the c.l.t. to sign(sample[n]-x[n]), all this says is that your algorithm will output a gaussian random variable with mean=median(samples), but sigma=sqrt(n). But maybe that isn't the proof you were thinking of?

[edit: also, the clt doesn't apply to sign(sample[n]-x[n]), since that isn't an i.i.d. sequence of random variables (it's not identical).

One possible proof I thought of: consider probability vectors of the form p[n] = [P(x[n] = 1), P(x[n] = 2), ...]. You can write the evolution of the probability vectors as matrix problem, p[n+1]=A p[n]. I'll bet you can break A = Projection onto median + remainder and show the remainder has norm less than 1. Just a guess, but I think it might work.]


Yeah, this is more Martingale territory than clt. I wrote a proof below that gives a limited Martingale property of the process, but I don't know enough further theory to continue. (slash I'm lazy and have better things to do)


It's worth stressing that while the c.l.t. gives a reasonable justification to using that algorithm for the mean, that justification does not extend to the median, which can diverge from the mean by as much as range/2. You'd need to be able to assume something like a connected set.


> It is easy to prove that...

Reminds me of a similar comment in the margin of a old book: http://en.wikipedia.org/wiki/Fermats_Last_Theorem. :)

But it does sound both plausible and brilliant. How do you prove it (in broad terms)?


I can't think of an easy proof but it is clear that such a sequence of estimators, {V_n}, has the median as a fixed point in the following sense.

Assume V_(n-1) = median(X) for the data stream X_n. Now we have

    E[V_n | V_(n-1)] = V_(n-1) + E[signum(V_(n-1) - X_n)]
but by the definition of the median, the right hand side expectation is 0, so the random process is fixed in expectation.

Proving it'll converge to that value would be harder, though. It'll probably depend on the data being much larger than the step size (which is 1).


Slightly off topic, but looks like toemetoch has tripped one of the hellban filters, isn't a spammer, and has no contact info in his profile. Might want to sort that out if you're reading this toemetoch!


No, he just accidentally posted the same comment twice. The new one got autokilled. Then he deleted the first one, leaving a single dead comment.


Thanks for the info. Account seems to be normal again, mental state is not.


as we said in college, "the proof is trivial and is left for the grader"


toemetoch, who is dead, said:

"In a normal distribution the probability of a value higher or lower than mean is 50%. So if you find mean, the average of your +1/-1 noise is zero -> if you're on mean you'll remain there.

If you're not on mean you have a (50 + error)% chance of moving towards mean and (50 - error)% chance of moving away from mean on the next random value -> the trend is towards mean."


toemetoch, who is dead

... I'm not following. I can see my reply a few posts down. I accidentally hit submit twice, deleted one but the second one is still there.

Edit: The same technique can be found in some 1-bit ADCs, I believe it's delta-sigma ADC.


Odd. I still only see one response from you with that content, and it is dead.


OTOH, if you know they're integers (as you assume), you might use a simple histogram. Not the same, because you'd have to range the histogram -- but we seem to not mind throwing out a lot of data to get in the ballpark anyway.

And if you don't know they're integers, you have to find a scaling rule for your increment that assures convergence. This is the same problem as bucket size selection for the histogram.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: