
The Median of a Trillion Numbers - fogus
http://matpalm.com/median/index.html
======
kuya_noel
what about using histograms: a histogram is N bins, where N is the number of
values an integer could assume, and each bin stores the count of the number of
times that value is seen. assume an integer is 32 bits: 2^32 ~= 4 billion
bins. to store counts of up to a trillion, we need a data type that goes up to
at least a trillion; a 64-bit uint works for that. so 2^32 bins * 2^3 bytes
per count = 2^35 bytes, or ~32GB. my hard drive is bigger than that, so we can
potentially just break the bins down into a bunch of files, maybe even 1 file
per bin if the OS lets us. after we've stored all the numbers in our bins, we
just iterate from the first bin, adding up the counts, until we hit half a
trillion. the bin that point falls in is our median.

if we have more than 1 computer, we could split the range across all the
computers (so if we had 2 computers, the 1st computer would take the first
2^31 bins, the 2nd computer would only care about the second 2^31 bins, etc).
then you could iterate through all the computers in order, passing along the
running count, stopping when you hit half a trillion.
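The single-machine version of this idea can be sketched in a few lines of Python. This is only illustrative: it uses an in-memory Counter over a small list, whereas the comment above describes a full 2^32-bin count array kept on disk.

```python
from collections import Counter

def histogram_median(numbers):
    """Count occurrences of each value, then scan the bins in
    ascending order until the running count reaches n/2. The bin
    where that happens holds the median."""
    counts = Counter(numbers)
    n = len(numbers)
    target = (n + 1) // 2  # position of the (lower) median, 1-based
    running = 0
    for value in sorted(counts):
        running += counts[value]
        if running >= target:
            return value

print(histogram_median([5, 1, 3, 3, 2]))  # -> 3
```

For a trillion inputs the Counter would be replaced by a fixed array of 2^32 64-bit counters (the ~32GB described above), but the final scan is the same.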

~~~
chmike
Note that the range of integer values to consider is not specified. We don't
know if they are signed, or whether they are 32- or 64-bit values.

The histogram is indeed the best algorithm and the one I would use. I call it
the hash select. Note that it may be applied recursively, by narrowing the
bins. Quick select is then in fact the hash select with only two bins.

Note also that since histogram filling is distributed, one needs to add up the
histograms to find the one containing the median value. This is a good reason
for not using histograms with 2^32 or 2^64 bins.

A 1024-bin histogram would allow finding the median of 32-bit integers in at
most 4 iterations, and of 64-bit integers in at most 7 iterations.

The algorithm to identify the bin containing the median value is then very
similar: one adds up the counts of all lower bins until the bin containing
the n/2-th value is found. That bin contains the median value.

Instead of adding up all the histograms, which may be done pairwise in
O(log N) rounds, one could do this bin by bin, progressively, until the bin
containing the median value is found. I'd guess there is a nice optimal
distributed algorithm for this out there, but we get the general picture.

Another optimization would be to adjust the histogram boundaries to fit the
largest and smallest values in the value set.
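The recursive narrowing could be sketched roughly as below. This is a single-machine Python sketch under stated assumptions (unsigned 32-bit values, everything in memory, 0 <= k < len(values)); a real distributed version would sum the per-bin counts across machines each round instead of scanning one list.

```python
def hash_select(values, k, lo=0, hi=2**32, bins=1024):
    """Find the k-th smallest value (0-based) in [lo, hi) by
    bucketing into equal-width bins and recursing into the bin
    that contains the k-th value ("hash select"; quickselect is
    the same idea with only two bins)."""
    while hi - lo > 1:
        width = max(1, (hi - lo + bins - 1) // bins)  # ceil division
        counts = [0] * bins
        for v in values:
            counts[(v - lo) // width] += 1
        # Walk the bins, accumulating counts, until we pass k.
        for i, c in enumerate(counts):
            if k < c:
                # Narrow the range to this bin and keep only its values.
                lo, hi = lo + i * width, min(hi, lo + (i + 1) * width)
                values = [v for v in values if lo <= v < hi]
                break
            k -= c
    return lo

print(hash_select([5, 1, 3, 3, 2], k=2))  # median of 5 values -> 3
```

With 1024 bins each iteration shrinks the candidate range by a factor of 2^10, which is where the "at most 4 iterations for 32-bit values" bound above comes from.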

------
rjprins
I liked the question and after thinking hard about potential algorithms, I
came to a different conclusion:

\- In general, for data sets of this size, you don't want anything worse than
a linear algorithm (and preferably less).

\- Also, because the median is a statistical measure, its use will be of a
predictive nature. (Why would anyone need the exact median of a trillion
numbers?)

\- Because of the uncertainty involved with predicting and the size of the
data set, I think a sample approach would be appropriate.

\- A sample size of 1000 - 10,000 numbers per machine is enough to recognize
any reasonable distribution with statistical significance.

\- We could take the median of those 1 - 10 million numbers directly or we
could apply curve fitting techniques to gain a lot more insight.

Of course if someone could provide a reason for needing the exact median I am
completely wrong.
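One way to realize the sampling idea is reservoir sampling, which draws a uniform sample from a stream without holding the stream in memory. A minimal sketch (the sample size and seed are illustrative choices, not from the comment above; the result is an estimate, not the exact median):

```python
import random

def sample_median_estimate(stream, sample_size=10_000, seed=0):
    """Estimate the median from a uniform random sample of the
    stream (Algorithm R reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(stream):
        if i < sample_size:
            reservoir.append(x)
        else:
            # Replace a random slot with probability sample_size/(i+1),
            # keeping every element equally likely to be in the sample.
            j = rng.randrange(i + 1)
            if j < sample_size:
                reservoir[j] = x
    reservoir.sort()
    return reservoir[len(reservoir) // 2]
```

Each machine could run this locally and ship its sample to one node, which then takes the median (or fits a curve) over the combined 1 - 10 million sampled numbers.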

~~~
roundsquare
That's a good point. Two things come to mind though. One is that since it's an
interview question, they are probably just using this to test your knowledge
of (distributed) algorithms.

The second is that you might want the exact answer if you are using the
statistic to convince someone of something and you don't want to leave any
room for doubt.

~~~
eru
And getting the exact number is already a linear problem.

~~~
rjprins
In the ideal case you find the correct pivot immediately, but in the worst
case you will have to sort the whole list. Thus it's O(n log n).

~~~
eru
No. You can find every order statistic in O(n) [1]. And even if someone handed
you the pivot, you would still have to verify that it is correct.

[1] <http://en.wikipedia.org/wiki/Selection_algorithm>
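The selection algorithm eru links to can be sketched as quickselect. For brevity this sketch uses a random pivot, which gives expected (not worst-case) O(n); the deterministic median-of-medians pivot rule from the linked article makes it worst-case linear.

```python
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-based). Each round
    partitions around a pivot and recurses into the one side
    that contains position k, discarding the rest."""
    values = list(values)
    while True:
        pivot = random.choice(values)
        lower = [v for v in values if v < pivot]
        equal = [v for v in values if v == pivot]
        if k < len(lower):
            values = lower                      # k-th is below the pivot
        elif k < len(lower) + len(equal):
            return pivot                        # k-th is the pivot itself
        else:
            k -= len(lower) + len(equal)        # k-th is above the pivot
            values = [v for v in values if v > pivot]

print(quickselect([5, 1, 3, 3, 2], 2))  # -> 3
```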

------
plucas
Interesting article, but I kind of wish the author would use a few capital
letters.

------
patio11
This is the perfect answer to an interview question, with the proviso that it
is nearly impossible to demonstrate over the course of an interview. You can
sketch out the general nature of the approach but actually doing something
productive with it takes a wee bit longer.

For questions like this, I think between the phone screen and the serious
interview you could send someone an email saying "Here's homework: we have a
torrent here of a substantial data set. Find the median number, document your
approach, and we'll talk about it at the interview."

~~~
telegraph
That all depends on what one is actually looking to find in an applicant. If
the goal of the question is to actually determine if the interviewee has the
ability to research a problem, think it through, and implement the solution,
then the "homework" approach is definitely superior.

However, if the thrust of the question is to see the applicant think on
his/her feet and possibly apply some knowledge from a college class on
algorithms (I never covered a problem in this depth in my undergrad comp sci
education, but it's definitely standard to discuss computing order
statistics), presenting it in the interview is best. The idea of the solution
as presented here could easily be sketched out in the interview, and a skilled
interviewer could lead the applicant through fleshing out some details if that
was so desired.

That said, I think the latter is more often the goal of a final round
interview.

------
sweis
If you're familiar with the linear time deterministic selection algorithm,
this is straightforward to solve. It's usually taught in intro to algorithms
classes. See CLRS chapter 9.

Incidentally, the algorithm is from 1973 and four of the authors went on to
win the Turing award:
ftp://db.stanford.edu/pub/cstr.old/reports/cs/tr/73/349/CS-TR-73-349.pdf

------
jfoutz
Since 10^12 is way, way bigger than 2^32, it's either the average of the
numbers or a strange degenerate case (10^11 1s followed by 10^11 3s).

Just calculate the average, and remember the item closest to the running
average.

~~~
roundsquare
Eh? That's not true.

The only time this would work is if the distribution of the numbers was
symmetric (or close to symmetric).

If you have

1 --> 599,999,999,999 times

4 --> 1 time

10 --> 400,000,000,000 times

Your average will be about 4.6 (a little higher) and your algorithm will
return 4 as the answer when it should be 1.
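The arithmetic checks out without materializing a trillion numbers (a quick Python check of the counts given above):

```python
# Counts from the counterexample above (they sum to one trillion).
ones, fours, tens = 599_999_999_999, 1, 400_000_000_000
n = ones + fours + tens

mean = (1 * ones + 4 * fours + 10 * tens) / n
print(round(mean, 2))  # -> 4.6 (a little higher, as stated)

# The value at position n//2 is a 1, since the first ~600 billion
# sorted entries are all 1s, so the true median is 1 while the
# closest-to-average item is 4.
median = 1 if n // 2 < ones else 10
```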

~~~
jfoutz
Yeah, I commented right before bed. I thought of a bunch of problems with the
average while trying to get to sleep.

I was thinking since the size of an outlier is constrained by fitting in 4
bytes, there must be a way to show the median is in some band of numbers
around the average. I didn't think of a particularly useful algorithm though.

Thanks.

