
Algorithms Every Programmer Should Know: Reservoir Sampling - timr
https://www.omniref.com/ruby/gems/stream_sampler/0.0.1/symbols/StreamSampler?#annotation=4094626&line=2
======
eridius
This algorithm was introduced with the premise

> _" I have a database table with #{17.kajillion} records, and I want to take
> a random sample of #{X} of them. How can I do this efficently?"_

And goes on to say that something like `YourBigTable.all.sample(X)` is bad in
time and space. And the link that explains why doing `ORDER BY RANDOM()` is
bad lists, as one reason, that generating random numbers is relatively
expensive.

But these problems aren't solved by reservoir sampling. Reservoir sampling is
still O(N), and it requires generating a random number for every element. It
still beats `ORDER BY RANDOM()` because it doesn't have to sort the numbers
(or even keep them all in memory at the same time). But that doesn't make it a
good choice.

More generally, reservoir sampling is appropriate when you _don't know the size
of the set beforehand_. But in the premise, we do know the size of the set.
set. And if you know the size of the set (and have some way to address
arbitrary elements), then randomly generating a set of X indices is going to
be much simpler and faster. Yeah, you have to deal with duplicate indices, but
that's an easy problem to solve.
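
For example, a minimal Ruby sketch (the helper name is made up; it assumes
X <= N and works best when X is much smaller than N, so duplicate retries are
cheap):

    require 'set'

    # Draw X distinct random indices from 0...N, retrying on collisions.
    def random_indices(x, n, rng = Random.new)
      picked = Set.new
      picked << rng.rand(n) while picked.size < x
      picked.to_a
    end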

~~~
timr
Author here.

Yes, reservoir sampling is O(N), but my point there was that _it doesn't
require instantiating every record at once_. Also, the cost of the random
number generation isn't the big deal in an ORDER BY RANDOM() implementation --
it's that even the smart implementations end up doing multiple table scans. So
even at O(N) this probably beats most implementations of that approach, in
practice.

Reservoir sampling is appropriate with more than just a set of unknown size --
you very frequently know the _size_ of a set, but it's still too big to sample
directly. But yes, if your sets are small, you have a lot of options.

~~~
eridius
If you know the size of the set, then you can generate a random set of (non-
duplicate) indices, and use those to fetch data.

Offhand I'm not sure how to do that in SQL, since data isn't usually fetched
by index (unless the row has an explicit index key). Worst-case you could
iterate over the rows just as you do with reservoir sampling, and it would
still be faster (since you aren't generating tons of random numbers and doing
a lot of array manipulation). Or you could do one query fetching just the
primary key, record the keys of the indices you're interested in, and then do
a second query to fetch that set. Or maybe SQL does have some way to fetch by
index after all.
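
In Rails terms, a rough sketch of that two-query idea (assuming an
ActiveRecord model like the article's `YourBigTable`, an `id` primary key, and
a sample size `x`):

    # First query: pull only the primary keys.
    ids = YourBigTable.pluck(:id)

    # Pick x of them at random; Array#sample never returns duplicates.
    sample_ids = ids.sample(x)

    # Second query: fetch just those rows.
    rows = YourBigTable.where(id: sample_ids)

This still pulls every key into memory, but keys are far smaller than full
rows.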

~~~
dlubarov
Yeah - unfortunately there isn't a great way to select n random rows in
standard SQL.

If you have an integral primary key, you could select its max and generate
random numbers up to that. Continuity isn't guaranteed so you would need to
check for nonexistent primary keys in addition to checking for duplicates.

If there's no integral primary key, then there's no standard SQL solution
faster than O(n).
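
A rough Ruby/ActiveRecord sketch of that max-id approach (the names are
placeholders; it assumes an integral `id` primary key starting at 1, and
retries past gaps and duplicates):

    max_id = YourBigTable.maximum(:id)
    picked = {}

    # Keep guessing ids until we have k distinct rows that actually exist.
    while picked.size < k
      candidate = rand(max_id) + 1               # 1..max_id
      next if picked.key?(candidate)
      row = YourBigTable.find_by(id: candidate)  # nil if the id doesn't exist
      picked[candidate] = row if row
    end

    sample = picked.values

If the id space is sparse, most guesses miss and this degrades quickly, which
is the continuity caveat above.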

~~~
eridius
If N is sufficiently larger than k, it might actually be faster to do
something like

    
    
      let count = query("SELECT COUNT(*) FROM table")
      let indices = computeIndices(count, k)
      for index in indices {
        results.append(query("SELECT * FROM table LIMIT 1 OFFSET ?", index))
      }
    

Yeah, you're doing a bunch of queries, but you avoid copying all the data
from all the rows.

~~~
dbenhur
OFFSET j is O(j) in most RDBMSes. You've made an O(N/2 * k) algorithm, given
that j is a random number in [0, N).

~~~
eridius
Depending on the size of k compared to N, and how much data is in each row,
it's still probably better than iterating over the entire contents of the
table.

Also, do you have any citation for OFFSET j being O(j)? I can believe that's
the case, but it also seems fairly obvious to optimize OFFSET for the case of
a non-filtered unordered SELECT (or one ordered in a way that matches an
index).

~~~
detrino
Assuming that the DB is backed by some kind of tree, it would need to maintain
extra metadata in the branches in order to achieve even O(log n) offsetting.
For this reason I could understand it being a linear operation.

------
thaumaturgy
Wow, OmniRef is cool. StackOverflow-style Q & A embedded right into source
code.

Reservoir sampling: I'm not sure that applying this algorithm to database
sampling is the right thing to do. By its nature, the algorithm has to touch
every single row in a database, and it does that because it's designed for
data streams where you don't know in advance the size of the stream -- which
isn't the case with database tables.

Assuming a uniform distribution in your database, you could instead do
something like:

    
    
        SELECT *
        FROM (
            SELECT
                @row := @row + 1 AS rownum, [column, column, columns...]
            FROM (
                SELECT @row := 0) r, [table name]
            ) ranked
        WHERE rownum % [n] = 1
    

to get every nth record of your table, calculating n ahead of time for n ~=
(table size)/(sample size). That should be a little bit faster and in most
cases still provide acceptable results.

~~~
screwedup
Clever, but that gives you every nth row - you might want a truly uniform
sample.

------
willvarfar
I used to work on a hospital database, back before the internet.

It happened a few times that we were asked for a random sample of records. It
was quite common to want to recalculate the rehabilitation rates and other
stats.

This was a problem, as patient records filled several tapes and, being the
youngster, I had to sit loading tapes for hours.

Because our data set was bigger than memory, we had trouble deciding what to
sample. We'd read everything in, filtering the records, and recording just
their offsets. Then we'd pick some number, then re-read all the tapes to fetch
the actual records. Naive and tedious.

Luckily the IT manager knew all about reservoir sampling, and explained it to
us.

We made a little Turbo Pascal program that ran over the data files each night
before they were archived to tape and kept a large (but small enough to fit on
one machine) reservoir!

Every time we were asked for a random sample after that, we just handed out
the reservoir.

The care managers, who thought they were still causing us to sit for hours
shuffling tapes, were terribly grateful at the quick turnarounds. We never did
let them in on our secret.

These days of course hospital records fit in RAM. Not the same kind of
problems.

~~~
jsprogrammer
So the hospital was expecting random samples each time, but kept getting the
same sample (with some appended records)?

~~~
eru
They were getting a new random sample each night. Same (or probably better!)
distribution as before.

------
practicalpants
I think in any real life scenario I would have an idea of the number of
records in my table, or could figure that out very cheaply. So to follow the
OP's example with Rails, I would just do Table.where(id:
[array_of_random_ids]), which is going to be far more efficient than his
batch-processing Table.find_all.

Something should really only be filed under "Every Programmer Should Know" if
it will be encountered in the real world more than once or twice. So in this
case, the title feels like click-bait.

~~~
syntern
Reservoir sampling does have a legitimate use in monitoring: tracking
real-time percentiles (e.g. over the last 15 minutes) without logging
everything and recalculating on each new data point. I do think it is
important and useful to know about, but I probably would not use it for
database sampling (as in the article).
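
For example, a tiny sketch of reading a percentile off such a sample (assuming
`reservoir` is an array holding the sampled values for the current window):

    # Approximate p-th percentile (p in 0.0..1.0) of the values in the reservoir.
    def approx_percentile(reservoir, p)
      return nil if reservoir.empty?
      sorted = reservoir.sort
      sorted[((sorted.length - 1) * p).round]
    end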

------
jandrewrogers
Just looking at the comments, I think there is a tacit assumption on the part
of some commenters that a stream of unordered records from a database has a
random distribution relative to the value you care about.

That is almost never true. In fact, I have seen clever performance
optimizations in database applications that exploit I/O scheduler and layout
induced bias in index-less record order. In practice, sampling 25% of very
large database tables will often not be representative of sampling 100% of
that database table even if the value is not explicitly indexed. Database
engines automatically do a lot of low-level optimizations that introduce bias
in a nominally random dump of records as a stream.

It is actually quite difficult to randomly sample a large database table.

~~~
radicality
Does this matter for this example? It scans every single record, and adds it
to the output with a certain probability. The ordering of the original stream
doesn't matter I think.

------
dude_abides
FWIW, the pseudo-code in the Wikipedia article is easier to grok:
[http://en.wikipedia.org/wiki/Reservoir_sampling](http://en.wikipedia.org/wiki/Reservoir_sampling)

~~~
tedunangst
I don't know... Can you really trust a wikipedia page that has needed
attention from an expert for five years?

------
paragraft
Had a great opportunity to use this in anger a few months ago. Working on a
mapping application, where the map could only handle a few hundred items at a
time, but the customer account had tens or hundreds of thousands. And the user
could set a realtime filter which would select an unpredictable subset of the
larger collection.

QA raised a ticket complaining that the subset we were using (just taking the
first n items from the collection that matched the filter; the collection was
large enough that we didn't want to double-enumerate or store the whole thing
in memory) was unbalanced, e.g. we were giving the impression that areas of
the map that had plenty of items were actually empty.

Switched to reservoir sampling to make sure the subset was a fair
representation of the collection, with only one pass and not blowing our
memory budget. And everyone you explain it to gets to admire the algorithm.

------
acqq
I find the paper mentioned much easier to follow:

[http://www.cs.umd.edu/~samir/498/vitter.pdf](http://www.cs.umd.edu/~samir/498/vitter.pdf)

------
phunge
Indeed I've always thought this was a very cool algorithm.

I once had to extend it for a distributed setting (think mapreduce) where you
can cheaply iterate all records but not sequentially. Instead you have
multiple reservoirs, and you can't resample those uniformly since each saw a
different number of records.

The unbiased way to aggregate 2 reservoirs is to draw a random number from a
hypergeometric distribution. You aggregate a larger number by combining 2 at a
time.
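
For the curious, a rough Ruby sketch of that merge (names are made up; `r1`
and `r2` are full reservoirs of size k drawn from streams of n1 and n2 items,
and the hypergeometric draw is simulated as k urn draws without replacement):

    # How many of the k merged picks should come from stream 1?
    def hypergeometric_draw(n1, n2, k, rng = Random.new)
      from_first = 0
      k.times do
        if rng.rand(n1 + n2) < n1
          from_first += 1
          n1 -= 1
        else
          n2 -= 1
        end
      end
      from_first
    end

    def merge_reservoirs(r1, n1, r2, n2, k, rng = Random.new)
      m = hypergeometric_draw(n1, n2, k, rng)
      # A uniform subsample of a uniform sample is still uniform.
      r1.sample(m, random: rng) + r2.sample(k - m, random: rng)
    end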

Fun story: an interviewer at Google once asked me to derive reservoir
sampling a long time ago, which I completely whiffed.

~~~
mark-r
I was able to derive it on the fly fast enough to make a StackOverflow answer
once. I had never heard of or used the technique before, and certainly didn't
know what it was called.

I'd probably whiff it in an interview too. For some reason my brain doesn't
work as well in an interview setting. Thankfully I haven't had to worry about
that too many times.

------
javajosh
I really like this algorithm, and I think it's instructive to think about a
naive solution, and how it differs.

Looping n times, calling rand(N) and putting the random element in the output.
This mostly works, and is actually far better, in space _and_ time, especially
in the case where "putting the element in the output" is expensive (as is
implied by the "kajillions of records" premise).

The problem with this solution is that you might pick the same random number
twice or more in a row. This means you have to keep track of your random
numbers if you absolutely can't tolerate a slightly smaller sample.

The reservoir avoids reusing random numbers in a clever way: the random
number is used only to pick the target slot for replacement while iterating
over all N elements. So the same slot in the sample might be replaced twice,
but no element of N will be added to the sample twice.

One interesting thing is that when N is slightly larger than n, those extra
elements are very likely to replace something in the base sample. In fact, the
probability is n/N.
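
A minimal Ruby sketch of that reservoir mechanism (the function name is made
up; `stream` is any Enumerable and `n` the sample size):

    def reservoir_sample(stream, n, rng = Random.new)
      sample = []
      stream.each_with_index do |item, idx|
        if idx < n
          sample << item               # fill the reservoir first
        else
          j = rng.rand(idx + 1)        # 0..idx inclusive
          sample[j] = item if j < n    # maybe overwrite a slot, never append
        end
      end
      sample
    end

Each slot may be overwritten many times, but any given stream element is
considered exactly once.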

~~~
danieldk
 _Looping n times, calling rand(N) and putting the random element in the
output. This mostly works, and is actually far better, in space and time,_

Reservoir sampling is normally used on large streams where you do not want to
or cannot keep the data you are sampling from in memory.

~~~
javajosh
Then I'm a little bit worried about this algorithm, because if the probability
of picking is n/idx then the odds of picking a long tail item get
asymptotically close to 0. I'm not sure if it really qualifies as a sample of
the entire thing.

~~~
loqi
Intuitively, the reason why it balances out is that while earlier items are
more likely to be picked, they're also more likely to get kicked out after
being picked.

------
orbitur
I didn't want to use a public account (Omniref only accepts Github, Google,
and FB) to ask this question, so can someone here help me with the probability
explanation?

 _Consider the simplest non-trivial example: a sample of 1 element, from a
two-element reservoir. At iteration one, we accept the first element. At
iteration two, if the sample is to be random, we have to decide to keep the
new element with a probability of 1/2:_

 _First element: (1/1) * (1/2) = 1/2_

Why are we multiplying the acceptance of the first element with the
probability of keeping it? And then...

 _But what if we're sampling from a pool of three elements instead? We know
that we therefore have to keep the new element with a probability of 1/3…which
means that we need to keep the element we already have with probability 2/3:_

 _First element: (1/2) * (2/3) = 1/3_

Where did the 1/2 come from? Why is it not 1/1 here?

~~~
defen
The left side is the cumulative probability that it's been kept "so far". The
right side is the probability it will be kept, rather than replaced by the new
element (probability that the new element will _not_ replace this one).
Multiply them together and you get the overall probability that the element
remains after the new element has been processed. In the three-element
example, the 1/2 is the first element's chance of having survived iteration
two; multiplying by the 2/3 chance it also survives iteration three gives
(1/2) * (2/3) = 1/3, the same probability as every other element.

------
earino
The timing of this post is pretty great! I work a lot with "largish files" in
R, and while R has gotten much better at reading big files (thanks to
data.table's fread), oftentimes you really don't want to work with the whole
file. I was surprised that there wasn't a tool to sample from a stream. I
wrote a simple perl script
[https://github.com/earino/fast_sample](https://github.com/earino/fast_sample)
to do just that. I was "getting around" the issue of needing reservoir
sampling by allowing the user to pass in a "proportion of lines" (e.g. .01 for
1%).

I realized that I needed to add the ability to sample an exact number of
records, so I went ahead and implemented reservoir sampling. It's a neat
algorithm and the performance is pretty stellar. Now I just need to expand it
to allow stratified sampling on a column! :)

------
pacaro
IIRC this is how fortune[1] selects a message fairly given the crappy
database format it uses (possible fortunes are separated by lines containing
just '%').

[1]
[http://en.wikipedia.org/wiki/Fortune_(Unix)](http://en.wikipedia.org/wiki/Fortune_\(Unix\))

------
dangayle
This is awesome. I asked about this on Stack Overflow back in September, but
my question got shut down.

[http://stackoverflow.com/questions/25942333/how-do-
reservoir...](http://stackoverflow.com/questions/25942333/how-do-reservoir-
sampling-algorithms-work)

I use this algorithm in one of my tweet bots that tweets out a random tweet
from a txt file containing 10k different quotes. It works fantastically and
has been for several months with no duplicates or malfunctions.

------
damian2000
This reminded me of the card shuffling algorithm from Sedgewick

[http://algs4.cs.princeton.edu/lectures/21ElementarySorts.pdf](http://algs4.cs.princeton.edu/lectures/21ElementarySorts.pdf)

In it, he assigns a random number to each element, sorts the elements by that
random number, then takes the first N elements from the beginning.
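
A minimal Ruby sketch of that idea (assuming the whole collection fits in
memory; `elements` and `n` are placeholders):

    # Tag each element with a random key, sort by it, keep a prefix.
    sample = elements.sort_by { rand }.first(n)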

------
new299
This seems similar to selecting a single random line from a file in a single
pass:

[http://41j.com/blog/2015/02/select-random-line-file-
single-p...](http://41j.com/blog/2015/02/select-random-line-file-single-pass/)

Which is a question I've had come up in interviews. It's a neat trick.

------
taltman1
Reservoir sampling in a few lines of awk:

[https://github.com/taltman/scripts/blob/master/EDA/samp](https://github.com/taltman/scripts/blob/master/EDA/samp)

It's shorter than this ruby implementation, and includes copious
documentation, file handling, and corner-case support.

------
danieldk
Java implementation:

[https://github.com/weblicht/conll-
utils/blob/master/src/main...](https://github.com/weblicht/conll-
utils/blob/master/src/main/java/de/tuebingen/uni/sfs/clarind/conllutils/sample/ReservoirSampler.java)

(Also returns items that were replaced.)

------
jkestelyn
Also of interest:

"Algorithms Every Data Scientist Should Know: Reservoir Sampling"
[http://blog.cloudera.com/blog/2013/04/hadoop-stratified-
rand...](http://blog.cloudera.com/blog/2013/04/hadoop-stratified-
randosampling-algorithm/)

------
throwaway6497
Imagine cracking this question in a 45 min phone interview if you have never
heard of it.

------
scotty79
Interesting, although I'd need to do the math on this to believe it.

Another interesting problem is how to pick 3 million rows out of 4 million at
random, without duplicates, in O(N) time, with the picks streaming out of the
algorithm as it runs.
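
That one has a classic answer too: Knuth's selection sampling ("Algorithm S").
A rough Ruby sketch with made-up names, which streams out k of n indices in
order, in O(N) time, with no duplicates:

    def select_k_of_n(k, n, rng = Random.new)
      needed = k
      n.times do |i|
        remaining = n - i
        # Include element i with probability needed/remaining.
        if rng.rand(remaining) < needed
          yield i            # picks stream out as soon as they are decided
          needed -= 1
        end
      end
    end

    # e.g. select_k_of_n(3_000_000, 4_000_000) { |i| emit_row(i) }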

------
everyone
I dont need to know this. I've never used a database in my programming career
ever (I'm a game programmer btw) I'm not saying its not interesting and I
wouldnt want to familiarise myself with it for purposes of general interest,
but not only do I dont need to know this but it would waste valuable headspace
at the moment. Theres lots of kinds of programming. We dont all do webapp /
internety stuff. Essentially the title of this article is not just hyperbolic,
its wrong. There are so many article with '10 things programmer must know'
(they are usually all related to internet) If I actually went and learned them
all I wouldnt be able to remember how to write a shader or rotate stuff.

------
d23
This discussion should actually be about the site omniref, since that's why it
was posted. Not to say that's a bad thing, since the site seems cool. Just
wanted to point that out.

~~~
touristtam
yeah I never came across that website before (probably because I don't touch
ruby) and I was pretty impressed. Thanks for sharing. :)

------
namuol
It would be really great to see some plain-english explanation of what the
algorithm does, for those of us who aren't already knee-deep in
ruby/ActiveRecord.

------
kghose
Is there a pointer to the other algorithms?

------
puthre
1. Count N

2. Compute p = n/N

3. select * from table where random() < p * (some lambda > 1) order by
random() limit n

4. Repeat 3 until you have n rows

------
kowdermeister
Does it work with a database of dogs? :)

------
fsk
That isn't a uniform distribution. It's biased in favor of records with a
bigger index. (I.e., rand() returns the same value twice.)

~~~
optimiz3
Yes - the code is wrong. A correct implementation would adjust the probability
of sampling based on how many items have been trialed.

EDIT: I misread it as rand(n), rand(idx) is correct.

~~~
danieldk
But that's happening here as well, right? The random number is an index
picked between 0 and the current index.

Though, I think there is an off-by-one in this implementation (assuming that
Ruby indexing starts at 0):

    
    
        j = rand(idx)
        out[j] = i if j < n
    

Say that the index is n; it will call rand(n), which gives a random number in
[0..n). However, the index should be picked from [0..n].

~~~
timr
You may be right...I spent a lot of time thinking about that, and I convinced
myself that this was correct, but I could be wrong. My thinking is that
rand(N) in ruby returns a value from [0, N-1] inclusive, which is the complete
index range of the sampled stream so far.

That _sounded_ right to me. I could be convinced otherwise.

~~~
danieldk
You are not taking the index of the element that you are currently sampling
into consideration.

Suppose that the sample size is 1 and you are getting the second item (index
1). You will call rand(1), which has 0 as the only possible outcome. So, you
will always replace the first item (index 0). Whereas if you called rand(2)
(possible outcomes: 0 and 1), you would replace the item in the sample with
probability .5 (assuming that the random number generator is uniform).
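
In code, a minimal sketch of the fix (same names as the snippet above):

    j = rand(idx + 1)      # 0..idx inclusive, so the current element is counted
    out[j] = i if j < n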

~~~
timr
D'oh. Yes, I think you're right. That'll teach me to try to mentally debug
code at 2AM!

I'll fix the code and publish a new gem a bit later today. Thanks!

------
hackaflocka
Probability is something else. I consider myself fairly good at understanding
it. I like to think (to myself) I'm in the 95th percentile in the world in
understanding probability. Yet these 4 concepts will regularly throw me off:
[http://tempr.org/54f9f429588f8.html](http://tempr.org/54f9f429588f8.html)

