Find running median from a stream of integers

dude_abides · on May 21, 2012

Here is a clever algorithm to find the median of a stream of integers by using just one variable:

The idea is to maintain a floating median. Start with an arbitrary number, say 0. If the incoming number is greater than the estimate, num += 1, else num -= 1.

It is easy to prove that for a large enough stream, assuming the stream is drawn from iid samples, this would converge to the median, by central limit theorem.

Using two variables instead of one, you could converge faster. So instead of incrementing/decrementing by one, you store how big your leap is, and have a schedule to change it.

yummyfajitas · on May 22, 2012

It is easy to prove that for a large enough stream, assuming the stream is drawn from iid samples, this would converge to the median, by central limit theorem.

I'm not sure what proof you have in mind, but I don't think using the word "converge" is correct.

Near as I can tell, your algorithm is x[n+1]= x[n] + sign(sample[n]-x[n]).

If you are applying the c.l.t. to sign(sample[n]-x[n]), all this says is that your algorithm will output a gaussian random variable with mean=median(samples), but sigma=sqrt(n). But maybe that isn't the proof you were thinking of?

[edit: also, the clt doesn't apply to sign(sample[n]-x[n]), since that isn't an i.i.d. sequence of random variables (it's not identical).

One possible proof I thought of: consider probability vectors of the form p[n] = [P(x[n] = 1), P(x[n] = 2), ...]. You can write the evolution of the probability vectors as matrix problem, p[n+1]=A p[n]. I'll bet you can break A = Projection onto median + remainder and show the remainder has norm less than 1. Just a guess, but I think it might work.]

tel · on May 22, 2012

Yeah, this is more Martingale territory than clt. I wrote a proof below that gives a limited Martingale property of the process, but I don't know enough further theory to continue. (slash I'm lazy and have better things to do)

sesqu · on May 22, 2012

It's worth stressing that while the c.l.t. gives a reasonable justification to using that algorithm for the mean, that justification does not extend to the median, which can diverge from the mean by as much as range/2. You'd need to be able to assume something like a connected set.

bjornsing · on May 21, 2012

> It is easy to prove that...

Reminds me of a similar comment in the margin of a old book: http://en.wikipedia.org/wiki/Fermats_Last_Theorem. :)

But it does sound both plausible and brilliant. How do you prove it (in broad terms)?

tel · on May 22, 2012

I can't think of an easy proof but it is clear that such a sequence of estimators, {V_n}, has the median as a fixed point in the following sense.

Assume V_(n-1) = median(X) for the data stream X_n. Now we have

    E[V_n | V_(n-1)] = V_(n-1) + E[signum(V_(n-1) - X_n)]

but by the definition of the median, the right hand side expectation is 0, so the random process is fixed in expectation.

Proving it'll converge to that value would be harder, though. It'll probably depend on the data being much larger than the step size (which is 1).

njs12345 · on May 22, 2012

Slightly off topic, but looks like toemetoch has tripped one of the hellban filters, isn't a spammer, and has no contact info in his profile. Might want to sort that out if you're reading this toemetoch!

pg · on May 22, 2012

No, he just accidentally posted the same comment twice. The new one got autokilled. Then he deleted the first one, leaving a single dead comment.

toemetoch · on May 22, 2012

Thanks for the info. Account seems to be normal again, mental state is not.

geebee · on May 21, 2012

as we said in college, "the proof is trivial and is left for the grader"

derleth · on May 22, 2012

toemetoch, who is dead, said:

"In a normal distribution the probability of a value higher or lower than mean is 50%. So if you find mean, the average of your +1/-1 noise is zero -> if you're on mean you'll remain there.

If you're not on mean you have a (50 + error)% chance of moving towards mean and (50 - error)% chance of moving away from mean on the next random value -> the trend is towards mean."

toemetoch · on May 22, 2012

toemetoch, who is dead

... I'm not following. I can see my reply a few posts down. I accidentally hit submit twice, deleted one but the second one is still there.

Edit: The same technique can be found in some 1-bit ADCs, I believe it's delta-sigma ADC.

derleth · on May 22, 2012

Odd. I still only see one response from you with that content, and it is dead.

mturmon · on May 22, 2012

OTOH, if you know they're integers (as you assume), you might use a simple histogram. Not the same, because you'd have to range the histogram -- but we seem to not mind throwing out a lot of data to get in the ballpark anyway.

And if you don't know they're integers, you have to find a scaling rule for your increment that assures convergence. This is the same problem as bucket size selection for the histogram.

notaddicted · on May 21, 2012

This is kind of a weird problem, because it requires you to keep all the numbers in memory, so the "streaming" aspect is a lot less useful. And then once you have the list in memory there are linear median finding algorithms. The only way out that I can see is if you expect the incoming data to follow a normal distribution you can short-circuit the entire problem and just calculate the mean.

EDIT: One could construct a "Counted B-Tree" which is a pretty straightforward augmentation of the B-Tree (quick google result: http://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtre...). The insertion cost would still be of complexity log(n), the median could also be retrieved cheaply, and the data is kept sorted, but it takes more space.

mturmon · on May 22, 2012

You're right, it's kind of asking the wrong question.

Because of the robustness properties of the median, it would be hard to argue that the first 1E6 samples weren't enough, and that you really need to hang on for the next 9E6 to get an accurate median.

In real problems, if you have that much data, you would more likely start to question the iid assumption anyway, and you'd start looking for drifts in the medians within sliding windows. Real problems being real problems, you'd probably find some drift, and this would point out the unhelpfulness of an "all-data" median. ("Since the data is drifting, what is this the median of?")

mbell697 · on May 21, 2012

Depending on the application, you could keep a limited number of samples in memory and "below" or "above" those samples would simply be a counter of how many samples fall into either range. This would only work for applications where you expected the median to only vary in a limited fashion, and only after the system had been "primed".

For example, at any point in time, only store 100-1000 samples in memory, tree'd into "greater than median" and "less than median", additionally in each side of the tree store the number of samples that are in that tree but were purged do to the samples in memory requirement. When each new sample comes in, balance the tree to find the median.

I think this would work as long as a the median didn't have drastic movements up or down e.g. 1000 samples in a row < median, if that happened, it would exhaust your in memory sample set and you'd run out of samples to use as the median.

Even if you did expect wide variations you could use a similar approach but instead of purging samples into a counter you could serialize them to disk, or even have 3 layers, memory -> disk -> purge to counter.

EDIT: now that I think about this for more than 10 seconds, you may need to track the first derivative of median movement to determine how many samples to keep on a given side also.

raymondh · on May 22, 2012

Too bad that a horrible answer was selected as the best. The minheap/maxheap solution requires that the entire stream of integers be stored in memory.

There is a fast and memory efficient solution using Indexable Skiplists: http://code.activestate.com/recipes/576930-efficient-running...

In a n-sized sliding window, only the most recent n elements need to be stored (not the entire data stream). The skiplist insertions, deletions, and indexed lookups are all O(log n).

sixothree · on May 21, 2012

Heap sounds like overkill to find the middle value in a running list. What's wrong with simple stack and keeping track of how many integers you've received?

xyzzyz · on May 21, 2012

What's wrong with simple stack and keeping track of how many integers you've received?

I'm sorry, but I have absolutely no idea what you are talking about. Perhaps you could describe your solution a bit more.

jbritto · on May 21, 2012

I'm not sure what you have in mind (perhaps finding the mean? but why the stack?) but that would not work for median.

Drbble · on May 22, 2012

Your solution is for median position, not median value, and it would require a queue, not a stack, so unneeded old values could be dropped.

sixothree · on May 29, 2012

Of course, a queue is correct.