
Algorithm No One Knows About (2016) - ColinWright
https://getkerf.wordpress.com/2016/03/30/the-best-algorithm-no-one-knows-about/
======
drfuchs
A bunch of comments here are missing the main point: Unless you have an
Exabyte of memory, you can't use any kind of data structure to remember which
cards you've already picked among 2^64 of them. The goal here is an algorithm
that generates the selected cards, one by one, in order, as if from a
coroutine that only uses a tiny, constant amount of memory. So, no arrays,
lists, sets, hash tables, bitmaps, etc., even if they're hidden in some cute
construct built into your favorite language.

~~~
OscarCunningham
Is there an easy way to adapt the algorithm to get the cards drawn in a random
order? The article says "it’s easy to randomize the order after the fact" but
if we can't store them in memory then that's a no-go.

~~~
web007
You select N from M, in order, where M is "too big" and N fits in memory /
swap / disk. You then shuffle your N selections to get random ordering.
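
A sketch of that two-step recipe in Python. (`sample_in_order` is a stand-in
for whatever in-order sampler you use; here it's faked with `random.sample`
just to keep the example self-contained.)

        import random

        def sample_in_order(m, n):
            # stand-in for a constant-memory in-order sampler (e.g. Vitter's)
            return sorted(random.sample(range(m), n))

        picked = sample_in_order(10**6, 100)  # the N selections fit in memory
        random.shuffle(picked)                # Fisher-Yates: random order after the fact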

~~~
supergarfield
I understand "you can't use any kind of data structure to remember which cards
you've already picked among 2^64 of them" as "N cards don't fit in memory". Or
is the relevant distinction memory vs disk, as it's feasible to randomize on-
disk data?

~~~
svat
The problem asks to find k cards (integers) out of n, in time O(k) and O(1)
additional space — this means additional to the k required to store the
output. So while holding n items wouldn't fit in memory, it is implicit that k
would. But we still want to avoid Ω(k log k) time, or more than a constant
amount of additional space.

------
ghusbands
The last time I wanted an unpredictable, non-repeating sampling of 64-bit
numbers, I used a symmetric cipher in CTR mode. Essentially, start with a
random key and a random value, return the encrypted version of it, then the
encrypted version of its increment, and so on. Returns every number in the
range, with no predictability.
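
A minimal sketch of that construction in Python. The toy Feistel network
below stands in for a real block cipher (AES, Speck, ...), so treat it as
illustrative rather than secure:

        import hashlib, os

        def feistel64(x, key, rounds=4):
            # Toy 64-bit block cipher: a balanced Feistel network keyed by
            # `key`. Any round function yields a bijection on range(2**64),
            # which is what guarantees no repeats.
            L, R = x >> 32, x & 0xFFFFFFFF
            for r in range(rounds):
                f = hashlib.sha256(key + bytes([r]) + R.to_bytes(4, "big")).digest()
                L, R = R, L ^ int.from_bytes(f[:4], "big")
            return (L << 32) | R

        def ctr_stream(count, key=None, start=None):
            # CTR mode: encrypt successive counter values. Outputs are
            # distinct because encryption is a bijection.
            key = key if key is not None else os.urandom(16)
            ctr = start if start is not None else int.from_bytes(os.urandom(8), "big")
            for i in range(count):
                yield feistel64((ctr + i) % 2**64, key)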

~~~
nemo1618
This works, but only if you're sampling exactly the same space as the block
size. For example, if your cipher outputs 64-bit blocks, but you want to
sample a space of size 2^32, you're back to the drawing board, because
there's no way to truncate the cipher output without risking duplicates.

~~~
twic
You can use the "hasty pudding trick" to truncate the cipher output without
risking duplicates. It is inefficient if the block space of the cipher is much
larger than the sampling space, though.

However, you can also use a Feistel network to create a block cipher for an
arbitrary bit length. So, you can always bound the block space to be no more
than twice the sampling space, which is okay.
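
A sketch of the trick (assuming `permute` is any bijection on the full block
space, e.g. a keyed Feistel permutation as above):

        def cycle_walk(x, n, permute):
            # "Hasty pudding trick" (cycle-walking): re-apply the block
            # permutation until the value lands in [0, n). For inputs drawn
            # from [0, n) this is itself a bijection on [0, n), so no
            # duplicates are possible.
            x = permute(x)
            while x >= n:
                x = permute(x)
            return x

The inefficiency is visible here: if the block space is much larger than n,
the expected number of re-applications per value is (block size)/n.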

~~~
CobrastanJorji
What's the hasty pudding trick? Is it related to the Hasty Pudding cipher?

~~~
twic
It is. Perhaps it's not officially called that, but that's how it was
described to me:

[https://groups.google.com/d/msg/sci.crypt/gjYclOVJDrc/tiv4ic...](https://groups.google.com/d/msg/sci.crypt/gjYclOVJDrc/tiv4icJyJZMJ)

------
OscarCunningham
>The misleading part in using the language of cards is that you don’t often
consider a deck of size 2^64. What’s nice about the card formulation though is
that it conveys how simple the problem statement really is. This is a
fundamental problem that was open for a long time. Nobody knew how to deal
cards.

>The first obstacle that makes the “dealing” (without replacement) hard is
that the “draw” method doesn’t work. If you draw (with replacement) over and
over again, ignoring cards you’ve already picked, you can simulate a deal, but
you run into problems. The main one is that eventually you’re ignoring too
many cards and the algorithm doesn’t finish on time (this is the coupon
collector’s problem).

How is this true? If you're drawing from a deck of 2^64 cards then unless
you're drawing almost all of them there's not going to be any problem with
rejection. Even if you're drawing 2^63 cards you're going to be rejecting only
half of your draws, so the average slowdown is just a factor of 2.

If you really want to draw more than half the deck then just randomly pick
the cards that you _don't_ want.
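
That argument as a sketch in Python. (Note the `seen` set is O(k) memory,
which is exactly what drfuchs points out the article's algorithm avoids.)

        import random

        def draw_distinct(n, k):
            # Rejection sampling: redraw on duplicates. With k <= n/2, every
            # draw hits a fresh value with probability >= 1/2, so the
            # expected total number of draws is at most 2k. For k > n/2,
            # draw the n-k values to exclude instead.
            seen = set()
            while len(seen) < k:
                seen.add(random.randrange(n))
            return list(seen)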

~~~
AllegedAlec
> How is this true? If you're drawing from a deck of 2^64 cards then unless
> you're drawing almost all of them there's not going to be any problem with
> rejection. Even if you're drawing 2^63 cards you're going to be rejecting
> only half of your draws, so the average slowdown is just a factor of 2.

Is it though? I'm a bit shoddy on my mathematics, but I'd wager from the
birthday problem that you'd probably have to reject much more than half of
the 'cards', and as you get to the end, you'd have to redraw several times
for each card.

~~~
andreareina
Birthday problem says that once you've gone through ~2^8 cards you've got a
~50% chance of having had to reject at least once. It takes 2^63 until you
have a 50% chance of rejecting on _each_ draw.

~~~
petters
You're right, but it's not 2^8. It is √(2^64) = 2^32.

~~~
andreareina
Yes, thanks. Took the root of the wrong number.

------
natch
Some of the variable names used:

i, j, t, qu1, S, n, N, U, X, y1, y2, V

I know others have already commented on this but I think it's worth laying out
all of the cryptic variables and asking: are some of these shorthand for well
known math concepts that make it unnecessary to have a more descriptive name?
I understand the use of temporary value holders like `i`, but have no clue
what some of the others might be used for or what the single letter names
indicate.

Another way to ask the question: if the author had instead used a, b, c, d,
e, f, g, h, etc., what information would be lost that is gained by the
particular letter choices that were used? This isn't rhetorical; although
there is a point to be made for sure about clarity in naming, I'm really
wondering whether anyone more versed in the math than I am can explain what
these letters might mean.

BTW here's another Vitter paper I found in case anyone's interested:

[https://www.cs.umd.edu/~samir/498/vitter.pdf](https://www.cs.umd.edu/~samir/498/vitter.pdf)

~~~
dzdt
The reason for naming the variables as they are seems to be to keep
consistency with Vitter's paper
[[http://www.ittc.ku.edu/~jsv/Papers/Vit87.RandomSampling.pdf](http://www.ittc.ku.edu/~jsv/Papers/Vit87.RandomSampling.pdf)]
Since the paper is really the primary documentation of the algorithm, this
makes perfect sense. Renaming the variables to be more meaningful to you would
make it harder to compare to the paper, reducing the effectiveness of the
paper as documentation and making it hard to compare the code to its
specification.

~~~
lonelappde
There's no reason to preserve bad notation from a 30-year-old paper.

~~~
nemetroid
Are you saying that single-letter variable names in maths are "bad notation"?

------
dahart
It’s called Reservoir Sampling
[https://en.wikipedia.org/wiki/Reservoir_sampling](https://en.wikipedia.org/wiki/Reservoir_sampling)

I was just re-reading this yesterday! I think a lot of the Monte Carlo
rendering people know about this family of algorithms, and the idea is pretty
simple. I’m not sure why it’s described from the start as picking k samples,
though; I find it a lot easier to understand and see how it works when you
start with k=1.

Btw, it’s fun to work out the probabilities of picking a sample to prove how
this algorithm works, and that it’s uniform. Big piles of factorials that all
cancel out.

PPS: the other super rad algorithms that aren’t well known are the alias
method for simulating a discrete distribution
[https://en.wikipedia.org/wiki/Alias_method](https://en.wikipedia.org/wiki/Alias_method)

...and the suffix array, which is crazy surprisingly useful
[https://en.wikipedia.org/wiki/Suffix_array](https://en.wikipedia.org/wiki/Suffix_array)

~~~
yati
Reservoir sampling requires you to maintain a reservoir of k items that will
be your sample. This can be prohibitive if the sample size itself is huge. The
algorithm described here decides (commits) whether an item is to be sampled as
it sees it (i.e., online), while consuming a constant amount of memory. It
differs from the usual reservoir sampling approach in that it cannot go back
and "unsample" an item it already selected (whereas the crux of a reservoir
sampling approach is evicting current candidate items and replacing them with
new items).
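
For contrast, a sketch of the classic eviction-based scheme (Algorithm R);
note the in-place replacement, which the article's algorithm never does:

        import random

        def reservoir_sample(stream, k):
            # Keep a k-item reservoir; item i (0-indexed, i >= k) replaces a
            # random occupant with probability k/(i+1), which keeps every
            # item seen so far equally likely to be in the sample.
            reservoir = []
            for i, item in enumerate(stream):
                if i < k:
                    reservoir.append(item)
                else:
                    j = random.randrange(i + 1)
                    if j < k:
                        reservoir[j] = item
            return reservoir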

------
vardump
While reading, I was thinking about an LFSR [0] with suitable properties and
period. Not strictly random, although LFSRs are commonly used for generating
"random" numbers.

[0]: [https://en.wikipedia.org/wiki/Linear-feedback_shift_register](https://en.wikipedia.org/wiki/Linear-feedback_shift_register)
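
A sketch of the idea in Python. (The tap constant is the maximal-length
64-bit polynomial x^64 + x^63 + x^61 + x^60 + 1 as commonly tabulated; worth
double-checking before relying on it.)

        def lfsr64(seed, taps=0xD800000000000000):
            # Galois LFSR over GF(2). With a maximal-length polynomial, the
            # state visits every nonzero 64-bit value exactly once per
            # period. Zero is never emitted, and the order is fixed by the
            # seed.
            state = seed or 1  # state must be nonzero
            while True:
                yield state
                lsb = state & 1
                state >>= 1
                if lsb:
                    state ^= taps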

~~~
altmind
Exactly my thoughts: it's Wolfenstein's fizzlefade, or the LFSR you learn in
crypto 101, that can pick all the numbers in the range randomly.

"Algorithm No One Knows About" is such a condescending title...

~~~
tripzilch
Especially if you don't actually explain the algorithm, and seemingly fail
to understand that its single most important requirement is that the output
is _in order_. Not something to shuffle later, because the only advantage of
this algorithm is that the output remains in order.

Otherwise you'd use a permutation function, such as an LFSR, or preferably a
family of such functions.

------
epeus
Getting this wrong can tank your product; see what happened to Draw Something:
[http://epeus.blogspot.com/2012/04/draw-something-ceo-grace-a...](http://epeus.blogspot.com/2012/04/draw-something-ceo-grace-and-high.html)

~~~
lucb1e
Summary without the drama: by not using a "discard pile" but picking randomly
from the same set of words every time, you get repeat words surprisingly
often (the birthday paradox). That's what the game did, and people
complained.

------
GistNoesis
Here is how I would do it. Pick a prime p bigger than N. Pick a random
permutation of Z/pZ. Pick an initial state. Advance to the next state, to
cycle through all the values exactly once.

You use rejection sampling if the state is greater than N; as long as you
pick p close to N this should happen very rarely and be amortized O(1).

If N is big, this advance to the next state can be encoded easily as
multiplication by an element of Z/pZ (multiplication followed by modulo p),
and you will cycle through all the values, because Maths. If N is not so
big, you pick a few (m) elements of Z/pZ to represent your permutation, and
you use the first number as a multiplier mod p the first time, then the
second number as a multiplier mod p the second time, and so on... and you
will also cycle through all the values, because Maths, as long as you don't
get a 1 while multiplying your m elements sequentially.

Don't forget to stamp it good enough(TM), if not use Dual-ECC.
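
A sketch of the multiplicative version in Python. (Here p and g are small,
verified values; for a real N you'd pick a prime just above N and find a
primitive root by testing.)

        def cycle_mod_p(n, p, g):
            # Walk x -> g*x (mod p). If g is a primitive root mod the prime
            # p, the orbit covers 1..p-1 exactly once; states above n are
            # rejected, which is rare and amortized O(1) when p is close to n.
            x = 1
            for _ in range(p - 1):
                x = (x * g) % p
                if x <= n:
                    yield x

        # Worked example: 101 is prime and 2 is a primitive root mod 101, so
        # this yields each of 1..90 exactly once, in a scrambled order:
        # sorted(cycle_mod_p(90, 101, 2)) == list(range(1, 91))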

------
Dylan16807
Okay, so this is a way to generate a sampling in order, so you don't have to
keep a sorted list and your run time can be O(k) instead of O(k log k).
Mostly.

The body of the post, with its talk of strict timing and sample bias, had me
thinking it had some clever way to pick a random number from 1 to n in finite
time. But that's not actually possible to do with a random bit source and a
non-power-of-two n. If this algorithm hits certain rare values it will re-
sample the RNG.

~~~
caf
This appears to be the state of the art in minimising the resampling cost:

[https://arxiv.org/pdf/1805.10941.pdf](https://arxiv.org/pdf/1805.10941.pdf)

------
floor_
I may be misinterpreting the question here, but can't the author just use a
large prime to pull from the array, like you would when hashing? It's how
you would shuffle a music playlist in place.

------
klingonopera
Oh dear, I find those single-letter variable names quite counter-productive
in making the code easy to understand for someone unfamiliar with the
project...

~~~
airstrike
Yeah, I'm way out of my comfort zone here but I found the choice of n and N
pretty off-putting

~~~
klingonopera
A typo is only a (sticky) shift away, and not even obvious!

EDIT: ...and compiles, too!

------
js8
Can't you use
[https://en.wikipedia.org/wiki/Lehmer_random_number_generator](https://en.wikipedia.org/wiki/Lehmer_random_number_generator)
for this? I thought that was a common method of generating a random sequence
without repetition. True, the result is not very random, but in most
applications it doesn't make much of a difference.
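
For reference, the whole generator is a couple of lines (a sketch with the
classic Park-Miller/MINSTD parameters):

        def lehmer(seed=1):
            # x -> 16807*x mod (2^31 - 1). The modulus is prime and 16807 is
            # a primitive root, so the sequence visits all of 1..2^31-2
            # exactly once before repeating: no repetition within one period.
            m = 2**31 - 1
            x = seed % m or 1
            while True:
                x = (16807 * x) % m
                yield x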

~~~
mankyd
Not if you want it to be actually random.

But you might be correct if you simply want the appearance of randomness.

------
tripzilch
> When I came across a need for the algorithm in my work, it took a lot of
> detective work to even find it. There was no source code online. The
> algorithm is not referenced often, and when it is, it is in a highly obscure
> fashion.

Well, the author certainly isn't helping that problem with a clickbait title
like that, and being all mysterious for the first few paragraphs, teasing "you
almost certainly don't know this".

"Dealing cards from an extremely large deck: Vitter's Algorithm"

Except he doesn't even explain the algorithm, just posts some source code
(which is super unclear: no comments, short variable names). And the
unhelpful comment that it was hard to figure out from the paper. All the
more reason to use a more descriptive title.

How does he expect to help people find this, figure this out, or was this
entire post just to brag that "his" solution is better than anyone else's
because he found the right algorithm almost nobody else knows about?

------
IIAOPSW
Ok I'll take a shot at it.

Pick a number between 0 and N. Let's call it p for pivot. Each draw after this
initial draw is either less than p with probability p/(N-1) or greater than p
with probability 1-p/(N-1). You need to make k-1 additional draws. The
probability of m of those draws being less than p is given by the binomial
distribution. Choose m from the binomial distribution. Then run the algorithm
recursively on the space 0 to p-1 and p+1 to N. Some pseudo-code:

        def deckpicker(N, k, offset=0):
            # select k values from [0, N] (shifted by offset), printing them
            # in increasing order
            if k == 0:
                return
            p = uniform_dist(0, N)                 # the pivot
            m = binomial_dist(k - 1, p / (N - 1))  # draws landing below the pivot
            deckpicker(p - 1, m, offset)           # recurse on [0, p-1]
            print(p + offset)
            deckpicker(N - (p + 1), (k - 1) - m, offset + p + 1)  # recurse on [p+1, N]

~~~
yorwba
What's the time complexity of sampling from the binomial distribution?

~~~
IIAOPSW
well...shit.

------
TheMerovingian
[https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle)

~~~
dahart
It’s related to the Fisher-Yates shuffle:
[https://en.wikipedia.org/wiki/Reservoir_sampling#Relation_to...](https://en.wikipedia.org/wiki/Reservoir_sampling#Relation_to_Fisher-Yates_shuffle)

But it’s not the same thing. The difference is whether you can keep all
samples in memory at the same time. With Vitter / reservoir sampling, you only
need to have your reservoir in memory plus enough space for 1 new sample at a
time, so you can stream through a large number of samples with a tiny amount
of memory.

------
tiborsaas
Why is a hashmap not good enough for unique ID generation?

        const tweetIDs = {};

        while (TWEET_COUNT) {
            // draw integer IDs at random, skipping any already seen
            const rnd = Math.floor(Math.random() * MAX_TWEET_ID);
            if (!tweetIDs[rnd]) {
                tweetIDs[rnd] = true;
                TWEET_COUNT--;
            }
        }

~~~
coleifer
The algorithm described in the article has other desirable properties:

* IDs are returned in sorted order

* IDs are generated one after another, i.e. they could be implemented as a generator or coroutine

* No additional space required during the computation

Resulting in a single pass.

~~~
klingonopera
Nicely said.

I was thinking of ways to solve the problem by eliminating the overhead of
checking whether a random number was already drawn, when Vitter's solution was
to make the random number generator produce random numbers in a sorted manner,
thus eliminating that overhead entirely.

EDIT: Then again... if the algorithm is to be used to shuffle a list, wouldn't
returning the selection in a sorted manner defeat the entire purpose? If k =
n, output would be input...

~~~
tiborsaas
It would. But the original problem is to draw a subset of the tweets.

If instead you randomize the entire tweet archive, it becomes a shuffling
algorithm.

------
Someone
”Here’s a program roughly 0% of programmers know how to write: generate a list
of random tweets, without duplication.

[…]

Stated more formally: given non-negative integers _k_ and _n_ with _k <= n_,
generate a list of _k_ distinct random numbers, each less than _n_.“

I think that knowing _k_ beforehand makes it a different, easier problem. I
don’t see how, if you don’t know _k_ beforehand, you can do without enough
memory to store a number of magnitude _n!_, and even that would be extremely
slow as more items have to be produced (uniformly pick a permutation of _n_
items, and produce as many of them as requested).

~~~
saagarjha
I don't see why you'd need to generate a permutation?

~~~
Someone
If you don’t know _k_ beforehand, you have to (eventually) handle the case
where _k_ is large, possibly even equal to _n_.

If it turns out to be equal to _n_, you have to generate a permutation.

If it isn’t, you don’t need to, but you need to keep enough information
around, in case it turns out that you do.

For example, by the time you’ve generated 2⁶³ integers in the range 1…2⁶⁴, you
need to know which 2⁶³ integers still are ‘available’.

I think you can’t do that more compactly than by just picking the permutation
to generate at start (whether you can efficiently determine the k-th number in
that permutation is a separate problem)

------
amcsi
So let me get this straight...

This algorithm creates an initially empty array of size K for the DB-internal
IDs of whatever gets randomly selected. Then it goes in order from 0 to the
number of rows minus 1, and for each one asks, e.g., "What are the chances
that #0 would get randomly picked out of N items if K items need to be
picked?" If #0 passes the "random check", it gets added. Then it goes to #1
and asks "What are the chances #1 would be picked out of N-1 items if K-1
items need to be picked?" ...and so on, until all K random items have been
picked?
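
In code, that per-item acceptance scheme is the classic sequential-sampling
scan; a sketch:

        import random

        def select_scan(n, k):
            # Visit 0..n-1 in order; accept item i with probability
            # (selections still needed) / (items still unseen). Output comes
            # out sorted and extra memory is O(1), but the scan touches all
            # n items, so the running time is O(n) rather than O(k).
            out = []
            remaining = n
            for i in range(n):
                if random.random() * remaining < k:
                    out.append(i)
                    k -= 1
                remaining -= 1
            return out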

~~~
eapriv
No, what you described is O(n).

------
fergie
"(Stated more formally: given non-negative integers k and n with k <= n,
generate a list of k distinct random numbers, each less than n. We reasonably
ask that this be done in O(k) time and O(1) additional space. Finally, the
distribution must be uniform.)"

"...it’s also possible to use the language of cards: deal k cards from a deck
of size n, without replacement. So a poker hand like A2345 of clubs is valid
for k=5 and n=52, but a hand containing AAAAA of clubs is not."

These seem to be fairly straightforward problems to solve; can anybody ELI5
why the Vitter algorithm is necessary?

~~~
rini17
Trivial algorithms (like the one implemented in Python) slow down when k is
close to n. And when n is large, it is not possible to use a bitmap to mark
dealt cards.

~~~
avip
The Python algo does not get slow when n is large. The k ~ n case obviously
calls for inverse selection.

[https://github.com/python/cpython/blob/master/Lib/random.py#...](https://github.com/python/cpython/blob/master/Lib/random.py#L386)

~~~
LolWolf
NOTE: This comment I wrote is totally wrong. I haven't had my morning coffee
yet. H_n - H_(cn) ≈ constant, not diverging (for 0 < c ≤ 1). :)

Yeah, but the k ~ n/2 case (in which inverse selection and normal selection
have the same runtime) is still Ω(n log n) (equiv Ω(k log k)), which is still
"slower" than the presented algorithm.

~~~
yorwba
Actually, this comment is correct despite making a similar argument to your
incorrect one elsewhere. Using a set to accumulate intermediate results
requires k inserts at O(log k) each.

~~~
LolWolf
I'm still not sure about this: the amortized time of insertion should be
O(1) on any of the usual hash-table-type structures. We're not inserting a
single item to an already-built hash-table, but rather k of them into an empty
hash-table, with elements being uniformly drawn and bucketed using a good hash
function. All of this should give O(1) average time for insertions.

------
pmiller2
And I still don’t know about it, because the article just has a code dump and
some vaguely relevant historical commentary. Posting the actual paper would
have been better.

------
jchook
> given non-negative integers k and n with k <= n, generate a list of k
> distinct random numbers, each less than n.

> We reasonably ask that this be done in O(k) time and O(1) additional space.
> Finally, the distribution must be uniform.)

Sounds like the Knuth multiplicative hash.

The Art of Computer Programming vol. 3, section 6.4, page 516.

------
ikeboy
Tldr of the algorithm from the paper:

Instead of choosing random numbers to include in the list, they're choosing
the gap between numbers. They derive the distribution of this gap then
approximate it so it can be calculated quickly.
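
A sketch of the exact (unapproximated) form of that idea; the inner loop
inverts the gap's CDF by linear scan, so this version is O(n) overall, and
Vitter's contribution is drawing the same gap in O(1) expected time instead:

        import random

        def sample_gaps(n, k):
            # Yield k distinct indices from range(n) in increasing order by
            # sampling the gap before each selection, using
            #   P(gap >= s) = prod_{i<s} (n - k - i) / (n - i)
            pos = 0
            while k > 0:
                u = random.random()
                gap, prob = 0, 1.0
                while True:
                    prob *= (n - k - gap) / (n - gap)
                    if prob <= u:
                        break
                    gap += 1
                yield pos + gap
                pos += gap + 1
                n -= gap + 1
                k -= 1

        # e.g. list(sample_gaps(52, 5)) -> five sorted card indices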

------
jbotz
If this really was such an obscure but useful algorithm, someone should change
the title of this post to help people find it in the future so it doesn't
remain obscure (add "card drawing"?)

~~~
LolWolf
It's... not that useful, as far as I can tell (unless I'm still under-
caffeinated!)

@avip mentions the following comment
[https://news.ycombinator.com/item?id=20961320](https://news.ycombinator.com/item?id=20961320)
, which means that we only have to consider k ~ cn for 0 ≤ c ≤ 1/2. Drawing
in this fashion, we have an expected runtime of k * (H_n - H_{(1-c)n}) ≤ k *
(H_n - H_{n/2}) ~ O(k).[1]

The latter is true since H_n ~ log n + C + O(1/n) for some[2] constant C.

Of course, depending on what the author's constraints are, there would be an
additional O(k) space requirement for keeping track of indices that have
already been picked. Vitter's algorithm is particularly nice, since it can
report the items without this additional space requirement.

---

[1]
[https://en.wikipedia.org/wiki/Coupon_collector's_problem](https://en.wikipedia.org/wiki/Coupon_collector's_problem)

[2]
[https://en.wikipedia.org/wiki/Euler–Mascheroni_constant](https://en.wikipedia.org/wiki/Euler–Mascheroni_constant)

------
mehrdadn
I haven't read the algorithm code, but I think you can do the following? For
n items, pick a random k < n!, then enumerate the k'th permutation of n
items. It's not trivial, but there are already algorithms for it, e.g.:
[https://code.activestate.com/recipes/126037-getting-nth-perm...](https://code.activestate.com/recipes/126037-getting-nth-permutation-of-a-sequence/)
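
The enumeration step is the standard factorial-number-system (Lehmer code)
decode, something like this sketch:

        import math

        def nth_permutation(seq, k):
            # Decode index k (0 <= k < n!) into the k-th permutation of seq.
            # The pop() from the middle of a list is what makes this naive
            # version quadratic in the number of items.
            items = list(seq)
            out = []
            for i in range(len(items), 0, -1):
                idx, k = divmod(k, math.factorial(i - 1))
                out.append(items.pop(idx))
            return out

        # e.g. nth_permutation(range(4), 0) == [0, 1, 2, 3]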

~~~
LolWolf
Ooh, this feels like it gets close to the solution!

It's not quite there as stated, though, since this algorithm requires O(k^2)
time (where k is the number of elements to be randomly drawn) as you have to
insert items into specific indices in the array.

~~~
mehrdadn
Yeah, but I think you can be more clever than just doing naive array
insertions, by trying to batch them up. At least, we know decreasing sequences
of indices can be done in one pass with no changes, and increasing sequences
can be done in one pass by accumulating shifts. That leaves alternations to
worry about. Now I haven't fully thought this through, but I think something
along these lines might work (there may be errors in some of the details
here): treat the list of indices you're inserting into as the leaves of a
binary tree, and create the intermediate nodes of that tree, initializing them
to zero. The intermediate nodes represent offsets (initialized to 0) that will
be logically added to all earlier nodes. Then sort the nodes, breaking ties so
that the later ones are first. Now perform every insertion at {the given index
plus its ancestors}, and then increment the counters for all ancestors of that
index so that you account for shifts of later elements. This should get you to
O(k log k) or so.

In fact, this makes me wonder why there isn't a multi-insert function in every
language's vector implementation...

------
agumonkey
I remember an article about the space complexity of drawing all colors
randomly... this seems related, doesn't it?

------
eyegor
Does anyone have a mirror for this? No matter what I do, all I'm getting is an
empty page with a sidebar.

~~~
nomadictype
In Firefox: View -> Page Style -> No Style

(For whatever reason the page has a CSS rule #content { display: none; },
which presumably gets removed via JS.)

------
jonplackett
Anyone feel like doing a human-readable breakdown of what the algorithm is
doing?

That would make me happy

~~~
qws
Yeah, really. A lot of filler paragraphs building up to unreadable code. Bad
article.

