
Algorithms Every Data Scientist Should Know: Reservoir Sampling   - Irishsteve
http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/
======
raymondh
The core problem with reservoir sampling is the huge number of calls to the
random number generator, which quickly drains the entropy from PRNGs.

Also, for some sampling applications, the calls to the PRNG are the slowest
part of the algorithm. So, you should prefer an algorithm that makes the
fewest possible calls.

Lastly, you should take care in how the algorithm is implemented. For example,
if the PRNG yields consecutive 32-bit values, computing randbits modulo n
leads to small selection biases whenever n doesn't evenly divide 2^32. There
are a lot of ways around this problem, but naive implementations won't give
equidistributed results.
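
For illustration, one standard way around the modulo bias is rejection
sampling; a minimal Python sketch, assuming a 32-bit PRNG (the function name
is mine):

    
    
        import random
    
        def unbiased_randrange(n, randbits=lambda: random.getrandbits(32)):
            """Draw a uniform integer in [0, n) from a 32-bit PRNG without modulo bias.
    
            Draws above the largest multiple of n that fits in 32 bits are
            rejected and redrawn, so every residue class is equally likely.
            """
            limit = (1 << 32) - ((1 << 32) % n)  # largest multiple of n <= 2**32
            while True:
                r = randbits()
                if r < limit:
                    return r % n
    

The expected number of redraws is tiny (below 2 per call) for any n well
under 2^32, so the fix is cheap.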

For the most part, you should choose some other sampling algorithm unless you
have exotic needs: not knowing the population size until you loop over it, AND
not caring about the huge number of calls to the PRNG, AND having a limitless
supply of entropy for your PRNG, AND having an easy means of getting
equidistributed values in the range 0 to n-1; OR the quality of your results
isn't important.

~~~
joe_the_user
Another thing about the algorithm and the article: this is _supposed to be_ a
"real world problem" that the interviewer asks because they're struggling with
it themselves. But what would a single randomly chosen item from a huge set
_mean_ to "real world" operations? That you would very occasionally get a very
unusual value? You'd have to run the scan many times for that, or for whatever
else you had in mind.

Basically, the algorithm is a "cool exercise in probability" that any deeper
look will discover has just about zero use or practicality in the actual "real
world". Which kind of throws it into the zone of obnoxious interview brain
teasers, despite claims to the contrary.

~~~
gizmo686
The problem was supposedly based on a real-world problem. Almost always, when
you simplify a problem to its core, you end up with a very simple question
that appears to have no practical applications.

Normally, this simple question is also easy, in which case simplifying is the
only skill you need. In other cases the question is hard, in which case you
need to be able to solve it. Both of these skills are valuable, but they
should not be tested in the same question.

Also, if you are trying to pick a candidate's brain, there is no reason to
make them redo your work of simplification when you are stuck on the actual
problem.

------
noelwelsh
I love reservoir sampling. When I first heard about the algorithm I had been
preparing a lab session for a course I was a TA on. (I had a lot of freedom to
create labs in this course -- a good and bad thing!) I made dinner in a bit of
a daze while I proved in my head that reservoir sampling worked. It's a simple
inductive proof, which is a good thing because if it was more complicated I
might not have been able to do it without paper, and I might not have eaten
that night.
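
For anyone who hasn't seen it, the algorithm (for a sample of size one) is
tiny; here's a minimal Python sketch, with the inductive step noted in the
docstring:

    
    
        import random
    
        def sample_one(stream):
            """Keep one uniformly random item from a stream of unknown length.
    
            Inductive sketch: after item i, each item seen so far is held with
            probability 1/i. Item i+1 replaces the choice with probability
            1/(i+1), so each earlier item survives with (1/i) * (i/(i+1)) = 1/(i+1).
            """
            choice = None
            for i, item in enumerate(stream, start=1):
                if random.randrange(i) == 0:  # true with probability 1/i
                    choice = item
            return choice
    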

I've since learned there are many other interesting sampling algorithms that
apply in a streaming setting. A few are given here:
[http://people.cs.umass.edu/~mcgregor/slides/10-jhu1.pdf](http://people.cs.umass.edu/~mcgregor/slides/10-jhu1.pdf)

~~~
gtani
Good link, thanks. Here's a couple people doing basic algo research

[http://www.cs.princeton.edu/~arora/publist.html](http://www.cs.princeton.edu/~arora/publist.html)

[http://www.cs.rutgers.edu/~muthu/streams.html](http://www.cs.rutgers.edu/~muthu/streams.html)

------
decklin
I've always loved the one-liner of this in Perl:
[http://learn.perl.org/faq/perlfaq5.html#How-do-I-select-a-
ra...](http://learn.perl.org/faq/perlfaq5.html#How-do-I-select-a-random-line-
from-a-file-)

~~~
tantalor
Let's unpack that,

    
    
        rand($.) < 1 && ($line = $_) while <>;
    

We all know _rand_ and _while_ , but if you don't know perl the rest is hard.

<> is a common way to read stdin, and the value is assigned to the $_ special
variable.

The one I didn't know was $., the current input line number. In other words,
it's your loop index, incremented automatically.

[http://www.kichwa.com/quik_ref/spec_variables.html](http://www.kichwa.com/quik_ref/spec_variables.html)
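
For non-Perl folks, roughly the same logic in Python (the `random_line` name
is mine, just for illustration):

    
    
        import random
    
        def random_line(lines):
            """Pick one line uniformly at random in a single pass.
    
            rand($.) < 1 is true with probability 1/line_number, which is
            exactly the reservoir-sampling replacement rule for a sample
            of size one.
            """
            line = None
            for n, current in enumerate(lines, start=1):  # n plays the role of $.
                if random.random() * n < 1:               # rand($.) < 1
                    line = current                        # $line = $_
            return line
    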

~~~
ceautery
That's good until your input stream exceeds 32768 lines [1]. There are better
generators available from CPAN, like Math::TrulyRandom.

[1] -
[http://www.perl.com/doc/FMTEYEWTK/random](http://www.perl.com/doc/FMTEYEWTK/random)

------
joe_the_user
Hmm,

Given the article's windup, I'd be a bit skeptical of the naive solution - _at
input n, with probability 1/n replace the current choice_.

Once you've reached input number 10^8 or whatever, you've got all sorts of
chances to have arithmetic overflow and/or the weirdness of that many pseudo-
random operations screw you.

I'd rather keep a list of x's: let x1 be randomly chosen from the last 10, x2
randomly chosen from the last 100, x3 randomly chosen from the last 1000, etc.,
and when termination time comes, do a little fixup. That'd take O(log(n))
memory instead of O(1), but if this matters, shouldn't you want some sample of
what's happening?

~~~
brendano
_Once you've reached input number 10^8 or whatever, you've got all sorts of
chances to have arithmetic overflow and/or the weirdness of that many pseudo-
random operations screw you._

Why? Double floats don't overflow until around 10^308, and RNGs give uniform
numbers from 0 to 1 just fine...

~~~
dbaupp
Most RNGs that generate numbers from 0 to 1 actually just generate a random
64-bit value and divide by 2^64, which means they do underflow relatively
easily. (This is relatively easy to overcome with a better (& more expensive)
random-bits-to-double conversion.)
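
To make the contrast concrete, here's a small Python sketch of both
conversions (the function names are mine; as far as I know, CPython's own
random.random() uses the 53-bit style):

    
    
        import random
    
        def naive_uniform(bits64):
            """Divide a 64-bit integer by 2**64: cheap, but the low-order
            bits are lost when rounding to a 53-bit double mantissa."""
            return bits64 / 2.0**64
    
        def uniform53(randbits=random.getrandbits):
            """Use exactly 53 random bits, matching the double mantissa, so
            every multiple of 2**-53 in [0, 1) is equally likely."""
            return randbits(53) * 2.0**-53
    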

------
ericbb
I first saw that sampling technique in The Practice of Programming by
Kernighan and Pike. It's used in their Markov Chain example.

------
stewbrew
Just yesterday this blog post was on r-bloggers that shows how to use it with
R: [http://things-about-r.tumblr.com/post/53690776834/time-is-
on...](http://things-about-r.tumblr.com/post/53690776834/time-is-on-my-side-a-
small-example-for-text-analytics)

------
geebee
This is a great algorithm, entertaining to read about. I can say pretty safely
that I wouldn't have come up with it on the spot.

Maybe that's why I'm not so enthusiastic about it as an interview question? ;)

More seriously - I do think that questions with specific optimal answers (or
at least known answers that are considerably better than simple greedy
solutions) _can_ be good interview questions, as long as they contain a big
middle, with plenty of opportunity to think about the problem and show some
good problem solving skills.

Questions where you either know the answer or don't are probably the worst.
Questions where you either get the right answer or get nowhere aren't great
either. Questions where you can get a good answer by thinking about it, even
if you don't find the optimal answer, are probably best.

Another good test would be this: how much would a candidate's performance
change if he or she were allowed to type "how to randomly sample from a list
of unknown length" into Google? If that would make a huge difference, it
probably isn't a great question, unless you're really testing whether someone
is already aware of a particular algorithm.

------
altrego99
I deduced this myself when I wrote an AI for Ludo, so that it could choose one
of the possible moves, each equally likely, as the moves were being generated :)

A similar trick: selecting n distinct items out of N, with all C(N,n)
selections equally likely, while passing through the data only once :)

And the next step: if the data can be accessed randomly, the above can be done
in O(n) time instead of O(N) through a modification of the algorithm, still
giving a statistically equivalent sampling scheme.
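
The one-pass, n-out-of-N version described above is the classic size-n
reservoir (Algorithm R); a minimal Python sketch, assuming the data arrives as
an ordinary iterable:

    
    
        import random
    
        def reservoir_sample(stream, n):
            """One-pass selection of n distinct items, each C(N, n) subset
            equally likely.
    
            Fill the reservoir with the first n items; item i (1-based, i > n)
            then replaces a uniformly random slot with probability n/i.
            """
            reservoir = []
            for i, item in enumerate(stream, start=1):
                if i <= n:
                    reservoir.append(item)
                else:
                    j = random.randrange(i)  # uniform in [0, i)
                    if j < n:                # true with probability n/i
                        reservoir[j] = item
            return reservoir
    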

------
mililani
Any data scientists on HN? I just ask because I wonder if it would be a good
career move for me. Spent 10 years as a programmer and systems analyst. Got
real tired of that.

~~~
noelwelsh
"Data scientist" is a really vague term. It's a matter of looking at your
current skill set and interests, and seeing where they fit.

For example, many CS grads don't have strong stats/maths backgrounds. However
there are still many "data science" jobs open to them. For example, you can
work on the tooling that goes into storing and processing data (e.g. Hadoop),
or you can build visualisations. If you want to work on, say, data analysis
you need to have a stronger background in stats.

~~~
UK-AL
Since machine learning and A.I. are pretty much applied stats, there have got
to be quite a lot of CS grads with good stats skills.

~~~
noelwelsh
Some CS departments require undergrads to do the typical engineering maths
sequence (Calculus, Linear Algebra, etc.) but many don't. Lots of PhD students
in machine learning have undergrad degrees in physics, maths, or engineering.

Where I did my PhD (the top-ranked CS department in the UK by a recent survey)
the undergrad students did one maths course in their entire degree, with most
of it being discrete maths (IIRC). This really limited the ability of students
to undertake research in machine learning or theoretical CS.

~~~
UK-AL
Since you're based in Birmingham I can take a good guess at where that is,
because I studied there and live in Birmingham. And I agree, but the machine
learning/A.I. modules themselves had to have the maths built into them.

Nice to see Brummies on Hacker News. Since you're a startup guy, I guess you
hang around Faraday Wharf?

~~~
noelwelsh
I was at Faraday Wharf a few months ago, but I mostly work from home now. Drop
me an email (address is in my profile) if you want to chat more.

