
Vitter's reservoir sampling algorithm D: randomly selecting unique items - colinprince
https://getkerf.wordpress.com/2016/03/30/the-best-algorithm-no-one-knows-about/
======
ridiculous_fish
> The first obstacle that makes the “dealing” (without replacement) hard is
> that the “draw” method doesn’t work. If you draw (with replacement) over and
> over again, ignoring cards you’ve already picked, you can simulate a deal,
> but you run into problems. The main one is that eventually you’re ignoring
> too many cards and the algorithm doesn’t finish on time (this is the coupon
> collector’s problem).

I don't think this is true!

First note that if you want to pick 98 values from 100, it is much simpler to
instead decide which 2 to skip. Similarly when picking 75 out of 100 - better
to just cross out 25. The hardest case is picking half: N/2 out of N, without
replacement.

This is the Coupon Collector's problem, except we're stopping at half the
coupons. We can use the analysis given in the Wikipedia page [1], except we
only take the second half of the series.

This gives an expected time of

    
    
        N * (H(N) - H(N/2))
    

where H(X) is the Xth Harmonic number.

Asymptotically, this yields:

    
    
        N * (γ + ln(N) - γ - ln(N/2))
        = N * (ln 2)
    

so the expected time of the "retry on duplicates" approach is linear in N. (Of
course, the worst case is infinite, if you keep redrawing the same card.)

edit: To complete the analysis, the requirement is that the answer be linear
in k, the number of items selected. The hardest case is k = N/2 (for larger k
we flip to choosing the N - k cards to skip), and there N ln 2 = (2 ln 2) k
draws is linear in k; for smaller k, collisions are only rarer. So I think
the argument works.

[1]
[https://en.wikipedia.org/wiki/Coupon_collector%27s_problem](https://en.wikipedia.org/wiki/Coupon_collector%27s_problem)
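
A quick simulation sketch of that estimate (the set here only stands in for
"ignoring cards you've already picked"; the point is just to count draws):

    import math, random
    
    def draws_needed(N):
        # draw with replacement until half the deck has been dealt
        seen = set()
        draws = 0
        while len(seen) < N // 2:
            seen.add(random.randrange(N))
            draws += 1
        return draws
    
    N = 100000
    avg = sum(draws_needed(N) for _ in range(20)) / 20
    print(avg / N, math.log(2))   # both come out near 0.693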

~~~
kcl
I think it is true. In the article we head off this discussion by referring
to the naive approach, but what you're doing goes past the naive coupon
collector's setup and past the technical scope of the piece.

There are several steps where your method could (will) fail. Since you don't
introduce any machinery for the hand, you imply an O(k^2) checking algorithm.
You also imply an unsorted hand. If you generate the hand in random order,
then you fall to an O(k*lg(k)) sort. "Sorted" is an implicit requirement that
we initially omit, but revisit later, because introducing it early makes the
discussion harder to follow.

If you try to introduce machinery like a hashtable, the next thing you might
suggest, you violate an implicit O(1) additional space requirement, which I've
grudgingly reintroduced to the more-formal description ("why is this here?").
This conversation keeps going, but other failure paths we leave unexamined.

~~~
anonymoushn
In addition to not being sorted it's also not "online." You won't get the
first value out of it in constant time if k is too big, and you need to access
elements in a random order.

------
fra
A much cleaner, GPL-ed implementation is provided here:
[https://github.com/WebDrake/GrSL/blob/master/sampling/vitter...](https://github.com/WebDrake/GrSL/blob/master/sampling/vitter.c)

(I personally find Kevin's code very hard to follow)

------
pmiller2
IMO the Wikipedia article on this is better than this article:
[https://en.wikipedia.org/wiki/Reservoir_sampling](https://en.wikipedia.org/wiki/Reservoir_sampling)

~~~
sika_grr
I only see an O(n) algorithm there; there is no O(k) one. Am I missing something?

~~~
pps43
The wiki shows Algorithm R, while this article is about Algorithm D; both are
by the same author.

The former is well known, while the latter is more obscure. This is probably
because the extra complexity of Algorithm D is rarely necessary. If k and n
are on the same order of magnitude, there is not much difference, and if k <<
n then simple random sampling would work nearly as well.

------
kutkloon7
This may be a stupid question, but what exactly is wrong with (using the
dealing random cards analogy):

    
    
      for i = 1 to k
          pickedCardIndex = randomInt(n - i)
          hand[i - 1] = deck[pickedCardIndex]
          swap(deck[pickedCardIndex], deck[n - i])
    

As far as I can see this satisfies all conditions stated by the author
(assuming that randomInt(x) produces a uniformly random integer in the range
[0, x], there are n cards in the deck array, and the arrays are 0-indexed).
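
For what it's worth, a direct Python rendering of the pseudocode (a partial
Fisher-Yates working from the back of the deck), assuming I've read it right:

    import random
    
    def deal(deck, k):
        n = len(deck)
        hand = []
        for i in range(1, k + 1):
            idx = random.randint(0, n - i)   # inclusive, matching randomInt(x)
            hand.append(deck[idx])
            # move the picked card out of the undealt prefix
            deck[idx], deck[n - i] = deck[n - i], deck[idx]
        return hand
    
    print(deal(list(range(10)), 5))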

~~~
codehotter
How do you implement swap if deck is too large to hold in memory?

~~~
philsnow
mmap the deck

------
vrfcodf
Here is an alternative:

Use a linear congruential generator with period 2^n, where n = ceil(log2(m))
and m is the size of the list.

Seed the generator, start generating values, and drop any values that fall
outside the list. In the worst case you will have to drop half of the values;
in the best case you will drop none. This depends on how close m is to a
power of 2.

The generator is very short, one line of code, and very fast: one
multiplication, one addition, and one bitshift. Space complexity is O(1).

If each tweet is generated from an integer value, then you can use these
values to generate them. The tweets will be unique (assuming the tweet
generator produces unique tweets from unique values), and you will have m of
them.
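
A minimal sketch of this, assuming example constants a = 5, c = 3 (any
a % 4 == 1 with odd c gives a full period modulo a power of two, per
Hull-Dobell):

    import math
    
    def lcg_indices(m, seed=1):
        b = max(1, math.ceil(math.log2(m)))   # period 2^b >= m
        mask = (1 << b) - 1
        a, c = 5, 3
        x = seed & mask
        for _ in range(1 << b):
            x = (a * x + c) & mask   # one multiply, one add, one mask
            if x < m:                # drop values outside the list
                yield x
    
    print(list(lcg_indices(10)))     # a permutation of 0..9

The first k values of this stream are k distinct indices, produced in O(1)
space.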

~~~
ridiculous_fish
This is very practical, but the result is not uniform; in fact it only returns
a tiny fraction of the possible combinations.

There are "N choose k" combinations, which in the worst case (k = N/2) grows
as ~ 2^N. The number of combinations quickly outstrips the number of possible
states of the LCG, and most combinations will never be found. For example, we
have a list of length 16 and wish to choose 8. We have a choice of 16 distinct
seeds, so there are only 16 different combinations (out of 12870 possible)
that we can get for any given LCG.
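
One can check this concretely (again with assumed full-period constants
a = 5, c = 3 modulo 16): every seed is just a starting point on the same
16-long cycle, so the first 8 outputs form at most 16 distinct sets:

    from math import comb
    a, c, m = 5, 3, 16
    N, k = 16, 8
    
    def first_k(seed):
        x, out = seed, []
        while len(out) < k:
            x = (a * x + c) % m
            out.append(x)        # N == m here, so nothing is dropped
        return frozenset(out)
    
    print(len({first_k(s) for s in range(m)}), "of", comb(N, k))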

~~~
vrfcodf
Good point. You would get different sequences by choosing different constants
for the generator, though you still couldn't cover all of them by a long shot.

~~~
mcherm
And this inability to cover all possible sequences has real-world
implications.

[https://www.cigital.com/papers/download/developer_gambling.p...](https://www.cigital.com/papers/download/developer_gambling.php)

[http://superuser.com/questions/712551/how-are-pseudorandom-a...](http://superuser.com/questions/712551/how-are-pseudorandom-and-truly-random-numbers-different-and-why-does-it-matter/712583#712583)

------
jdfr
Vitter's algorithm is for when you need to generate a random list of unique
items with sequential reads. If you drop that requirement, you can just use a
substitution-permutation network: [https://en.wikipedia.org/wiki/Substitution-permutation_netwo...](https://en.wikipedia.org/wiki/Substitution-permutation_network)

An SPN will generate a unique random permutation, from which you can draw your
random list of unique items. With a little post-processing, you can generate
random permutations of arbitrary size.

I have implemented it in matlab and python:

[http://nl.mathworks.com/matlabcentral/fileexchange/36626-alg...](http://nl.mathworks.com/matlabcentral/fileexchange/36626-algorithmic-pseudo-random-permutations-for-large-sequences)

[https://gist.github.com/jdfr/46633c630471494d67ae](https://gist.github.com/jdfr/46633c630471494d67ae)
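
For a flavor of the construction, here is a toy sketch (nothing about it is
crypto-grade, and the S-box, bit permutation, and round keys are arbitrary
assumed values): a 16-bit SPN is a keyed bijection on [0, 2^16), and "cycle
walking" shrinks it to an arbitrary range [0, m):

    # a 4-bit S-box (a bijection on 0..15) and a bit permutation
    SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
            0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
    PBIT = [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
    
    def spn16(x, keys):
        for k in keys:           # each round: key-mix, substitute, permute
            x ^= k
            x = sum(SBOX[(x >> (4*i)) & 0xF] << (4*i) for i in range(4))
            x = sum(((x >> i) & 1) << PBIT[i] for i in range(16))
        return x
    
    def permuted_index(i, m, keys=(0x1F3A, 0x9B41, 0x6C72)):
        # maps i in [0, m) to a unique index in [0, m), for m <= 2**16
        x = spn16(i, keys)
        while x >= m:            # cycle-walk back into range
            x = spn16(x, keys)
        return x
    
    print([permuted_index(i, 1000) for i in range(5)])

Since spn16 is a bijection, distinct inputs always map to distinct outputs,
so no bookkeeping of previously seen items is needed.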

------
dfabulich
The C code in this article is a mirror of the code from Appendix 2 in Vitter's
paper, which I guess explains/excuses the abbreviated variable names. The
paper says things like, "use an exponentially distributed random variate Y,"
and uses variable names like n, N, U, S, X, y1, and y2 in the appendix.

Nonetheless, I find this coding style unreadable. "Y" is not a very good name
for an exponentially distributed random variate, especially in a function that
defines 24 variables with mostly meaningless names. :-(

~~~
chipsambos
var naming aside, the whitespacing is perplexing:

    
    
          U = random_double(); negSreal=-S;
          y1=exp(log(U*Nreal/qu1real)*nmin1inv);
          Vprime = y1 * (-X/Nreal+1.0)*(qu1real/(negSreal+qu1real));
          if(Vprime <= 1.0) break;       y2=1.0; top = -1.0+Nreal;       if(-1+n > S){bottom=-nreal+Nreal;limit=-S+N;}
          else{bottom=-1.0+negSreal+Nreal; limit=qu1;}
    

Sometimes two statements per line, sometimes one. Sometimes spaces before and
after equals, sometimes none. Sometimes multiple ifs, breaks and statements on
one line.

It looks like it was copy-pasted from a pdf. Maybe it was.

~~~
retbull
I feel like two seconds in an IDE would have improved readability quite a
bit. A simple automated cleanup would have helped worlds.

------
forrestthewoods
I can't believe this article doesn't explain how it works. Especially after
stating "after reading Vitter’s papers it again takes a concentrated effort
to figure out what you need".

~~~
yelnatz
Yeah, the article was pretty dense, so I just slogged through it hoping I
would get to the explanation part.

Nope, it just gave me ugly C code at the end.

------
Mithaldu
I read very often (from people who don't even use it) that Perl is a
write-only language. Then I see code like this roll along that demonstrates
perfectly how the readability of any given piece of code is 99% on the coder,
not the language.

------
franciscop
Why can't you just generate random numbers and discard duplicates?

For practical purposes, when list.length << total.length, shouldn't that be
good enough?

~~~
dantiberian
I had the same thought. However thinking about it more, the requirement is
that this algorithm is O(k). As k gets closer and closer to the size of the
list, you will spend more and more time looking for unique items. You also
need to pay the costs of checking for uniqueness which also grows with k.

~~~
franciscop
Well yeah, that is exactly why I said it's for the situation where the length
of the list is much smaller than the total number of tweets. Which should
cover most practical usages.

------
keypusher

       >>> random.sample(xrange(0, 1000000000), 5)
       [5258132, 23096612, 43529214, 91062733, 4912658]
    

This ran in a few milliseconds on my old MacBook in Python; xrange is a lazy
sequence, so the billion-element list is never actually materialized. Looking
at the source, they seem to track previous selections for large populations
in a set, for which "x in y" lookup is O(1) on average.

[https://svn.python.org/projects/python/tags/r32/Lib/random.p...](https://svn.python.org/projects/python/tags/r32/Lib/random.py)

~~~
3pt14159
Actually, due to seeding, no matter how many times you run this you will never
generate all combinations of tweet orders. Of course that doesn't _really_
matter since for large values of N there isn't enough time in the universe to
do so.

------
CephalopodMD
Could someone explain how this works? I feel like unique shuffling given a
random seed is very useful algorithm, but I can't follow this code because of
the variable naming and stuff.

~~~
throwaway_yy2Di
The general idea is to draw n uniform samples from [1..N] and return them in
sorted order. You can generate these incrementally, in order (no sort needed).
The mean difference between successive samples is ~N/n. If the samples are
uniformly distributed, their differences should have roughly an exponential
[0] distribution (if N, n are large).

So the inductive step is: given k'th sample a[k], let a[k+1] = a[k] + S, where
S is drawn from a discrete exponential distribution with mean 1/λ = N/n.
That's the `exp(log(rand()) * stuff)' part of the code.

This _almost_ works; there's more details [1] if you want it to actually work,
and have accurate statistics, not overflow N, etc.

[0]
[https://en.wikipedia.org/wiki/Exponential_distribution](https://en.wikipedia.org/wiki/Exponential_distribution)

[1] [https://hal.archives-
ouvertes.fr/file/index/docid/75929/file...](https://hal.archives-
ouvertes.fr/file/index/docid/75929/filename/RR-0624.pdf)

------
antics
Huh, is it really true that Reservoir Sampling is not well-known? It was part
of the standard curriculum in my top-30-but-not-top-10 CS school, the
University of Utah. I think that even a significant subset of my cohort could
explain the correspondence between this and the Fisher-Yates shuffling
algorithm.

~~~
coredog64
I don't even have a CS degree and I know about it. I wrote some PowerShell
(don't ask) to sample a large collection of documents from file shares looking
for embedded links to other documents. We needed to break up the existing file
share into smaller pieces and wanted to estimate how many documents would have
to be touched due to broken links. (The surprising answer was "very few",
since most links were relative to the containing directory).

------
Jabbles
The key to why the obvious algorithm of "half a Fisher-Yates shuffle" isn't
enough is that it uses O(k) memory, whereas Vitter's algorithm is constant.

[http://play.golang.org/p/UqC6XWrPLB](http://play.golang.org/p/UqC6XWrPLB)
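
For reference, a sketch of that obvious algorithm with the deck left virtual
(a dict stands in for the array, so the extra memory is exactly the O(k) in
question):

    import random
    
    def partial_fisher_yates(k, n):
        displaced = {}          # only the slots we've touched
        hand = []
        for i in range(k):
            j = random.randrange(i, n)
            hand.append(displaced.get(j, j))
            displaced[j] = displaced.get(i, i)   # slot i's value moves to j
        return hand
    
    print(partial_fisher_yates(5, 10**9))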

------
yeukhon
> Reservoir sampling is a family of randomized algorithms for randomly
> choosing a sample of k items from a list S containing n items, where n is
> either a very large or unknown number. Typically n is large enough that the
> list doesn't fit into main memory.

So if N is really large and doesn't fit into main memory, how does one
iterate over such a large list in a reasonable amount of time? When I was
taught algorithms, my sample inputs were relatively small, from 10,000
elements to 1M integers. Big deal. But I am not sure how to think about it
when N is really huge, or whether we'd ever really have to deal with 10B
integers in one pass. In that case, it's more likely I am paging through my
database cursor... someone please shed some light.

~~~
lsiebert
Basically you do a reverse Fisher-Yates shuffle.

You are selecting n of N items in one pass, where N is unknown in advance,
every item is treated as unique, and n is your sample size.

First fill an array of n candidates, which we'll call C, with the first n
items.

Then, for each subsequent item, you know how many items you've seen so far,
so select a random integer up to that count (as if indexing an array of that
length). If it falls between 0 and n - 1 (assuming a zero-indexed array),
replace that slot of C with the current item.

This means that each item has an equal chance of being a final candidate when
you reach the end. If you care about the order of the random sample, you can
always shuffle C at the end.
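
In code, that is Vitter's Algorithm R (the one on the Wikipedia page), more
or less:

    import random
    
    def reservoir_sample(stream, n):
        C = []                                # the reservoir
        for i, item in enumerate(stream):
            if i < n:
                C.append(item)                # fill with the first n items
            else:
                j = random.randrange(i + 1)   # 0 .. i inclusive
                if j < n:
                    C[j] = item               # kept with probability n/(i+1)
        return C
    
    print(reservoir_sample(iter(range(10**6)), 5))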

------
pixelbeat
GNU coreutils' shuf since v8.22 (Dec 2013) implements this to minimize memory
usage when sampling large files or unknown numbers of items read from a pipe.

For example, this command never finishes but consumes a constant (small)
amount of memory.

    
    
      seq inf | shuf -n1

------
kutkloon7
Again, I am confused. Why doesn't this snippet solve the problem?

    
    
      void increasingRandomSequence(int *arrayptr, int base, int k, int n)
      {
          if (k == 0) return;
          int i = randInt(n - k);
          *arrayptr = base + i;
          increasingRandomSequence(arrayptr + 1, base + i + 1, k - 1, n - (i + 1));
      }
    

increasingRandomSequence(hand, 0, k, n) fills the array hand with a sequence
which is picked with uniform distribution over all increasing sequences of
length k with numbers in [0, n - 1].

This is O(k), and we can shuffle this in O(k). Why doesn't this solve the
problem?

~~~
dalke
The method is on-line, that is, it doesn't need to know "n".

Assuming I translated your above code correctly into Python, as:

    
    
        import random
    
        def increasingRandomSequence(base, k, n):
          while k > 0:
            i = random.randrange(n - k + 1)
            yield base + i
            base += i+1
            k -= 1
            n -= i+1
    

then I checked the above using:

    
    
        N = 6
        counters = [0] * N
        for i in range(100000):
            for value in increasingRandomSequence(0, 5, N):
                counters[value] += 1
        print(counters)
    

and found a strong bias towards larger numbers. The counts are:

    
    
        [50033, 75037, 87438, 93686, 96900, 96906]
    

which means 5 is in the hand much more often than 0. (A uniform draw would
put each value in the hand with probability k/N = 5/6, i.e. roughly 83,333
counts apiece.)

------
pklausler
Everywhere in this article, and in all of the current comments, where the word
"unique" appears, it really should be the word "distinct" instead.

------
xerophyte12932
This may be a stupid question, but why not store cards/tweets in a hashmap and
add a .contains() check before adding?

~~~
sfifs
Say you need to draw 100 billion samples without replacement from a dataset
with a trillion-plus rows; the resource requirements for the hashmap would
become enormous.
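
Back-of-envelope, assuming 8-byte keys and a typical ~4x hash-table overhead:

    entries = 100e9
    print(entries * 8 * 4 / 1e12, "TB")   # ~3.2 TB just for the bookkeeping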

------
buremba
Bloom filters might be a good alternative for this use case if you don't mind
false positives.

------
amgin3
> generate a list of random tweets, without duplication.

Considering there are somewhere over 1 trillion tweets (over 200
billion/year), this is a very easy problem: you do not even have to check for
duplicates, because the chance of getting one is so small that it is
statistically unlikely to ever happen.

~~~
kardos
You're making the assumption that the requested list is quite short. It should
be straightforward to show that the probability of a duplicate grows with the
requested list length.
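
Concretely, the birthday bound P(duplicate) ≈ 1 - exp(-k^2 / 2N) makes the
growth explicit (taking N = 10^12 tweets from the parent comment):

    import math
    N = 1e12
    for k in (1e3, 1e5, 1e6, 1e7):
        print(int(k), 1 - math.exp(-k * k / (2 * N)))

By k = 10^6 a duplicate is already ~39% likely, and by 10^7 it is near
certain.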

------
johnm1019
Based on your list size, try a reasonably designed hash function here.

