Algorithm No One Knows About (2016) 286 points by ColinWright 32 days ago | hide | past | web | favorite | 190 comments

 A bunch of comments here are missing the main point: Unless you have an Exabyte of memory, you can't use any kind of data structure to remember which cards you've already picked among 2^64 of them. The goal here is an algorithm that generates the selected cards, one by one, in order, as if from a coroutine that only uses a tiny, constant amount of memory. So, no arrays, lists, sets, hash tables, bitmaps, etc., even if they're hidden in some cute construct built into your favorite language.
 Is there an easy way to adapt the algorithm to get the cards drawn in a random order? The article says "it’s easy to randomize the order after the fact" but if we can't store them in memory then that's a no-go.
 You can draw the cards in a random order by using the output of an accumulator fed to a block cipher as an index. The fixed output size of a block cipher does entail extra work in filtering numbers outside the range you'd like, just as you would with an LFSR. (as a direct power of 2, you could directly use an LFSR or some ECB-mode block ciphers as if it were format-preserving, but that is "coincidence").You can produce an exact random permutation with a Feistel network for any number, in this case selecting every number between 0 and 2^64 exactly once, with no appreciable space requirements.`````` procedure feistel(input) { result = input; h = 2<<32; for (let i = 0; i < 8; i++) { l = (input / h)|0; r = input % h; temp = r; fn = (r + i) % h; r = (l + fn) % h; l = temp; result = h * l + r; } return result; }``````
 Is that code correct? Why do you overwrite `result` every step of the loop if you only use it at the end?
 Good eye, `input` should be replaced with `result` when building the subkeys within the loop.Here's a working implementation:`````` function feistel(ctr, lr, ll) { let result = ctr; for (var i = 0; i < 8; i++) { let l = (result / ll)|0; let r = result % ll; let temp = r; // alter round fn's multiplier & increment accordingly. let fn = (r * 16807 + i) % lr; r = (l + fn) % lr; l = temp; result = lr * l + r; } return result; } let rsize = 8; let outputs = new Set(); for(let i = 0; i < rsize * rsize; i++) outputs.add(feistel(i, rsize, rsize)) for(let i = 0; i < rsize * rsize; i++) if (!outputs.has(i)) console.log('failure: ', i);``````
 There are kinds of PRNGs that are designed to emit every possible ouput value in the range. There are some that will emit every possible value before a repeat.If you look at his code, the first function seems to be calling a method that claims to do the latter. I think his documentation may be out of sync.
 You select N from M, in order, where M is "too big" and N fits in memory / swap / disk. You then shuffle your N selections to get random ordering.
 I understand "you can't use any kind of data structure to remember which cards you've already picked among 2^64 of them" as "N cards don't fit in memory". Or is the relevant distinction memory vs disk, as it's feasible to randomize on-disk data?
 The problem asks to find k cards (integers) out of n, in time O(k) and O(1) additional space — this means additional to the k required to store the output. So while holding n items wouldn't fit in memory, it is implicit that k would. But we still want to avoid Ω(k log k) time, or more than a constant amount of additional space.
 Not sure that this qualifies as uniform distribution, even if you spread the N across the M. Because the selection must be predictable, so that any next N don't include the previous ones. That's presuming that the ‘user’ might iterate over M items in total, even if you pick chunks of N internally.
 Well the main point of the algorithm is that the output is going to be a sequential, increasing list of int64 "cards".Which may be useful for tape drives, but I'm not sure what else.The big point that the author is missing, is that if you drop that requirement, or rather if you require the output to be shuffled, like you'd expect from dealing cards, then there are super easy algorithms.Just pick a random permutation function over the 64-bit integers, and start counting. It's literally like dealing cards.People in this thread have suggested block encryption as permutation functions, which will work. But there are also good non-cryptographic (much faster) permutation functions out there.If you got a permutation function you don't need memory to store the permutation, and poof the problem just goes away.You need this algorithm only if you absolutely need the output to remain in order. But that's not really part of the problem he sketches, which is about "take X without replacement".
 > Which may be useful for tape drives, but I'm not sure what else.Partitioning large data sets, or accessing any data store where sequential access is more efficient than random access. Tape drives are merely a canonical and previously-common example of such data stores, but such situations do still exist in the modern world, even if they are not as common. I for one was rather happy to see this, as it addresses a problem I have that I did not know that there was already a solution for!Suppose you have a huge dataset that you cannot load into memory all at once, but you can request samples of via a paging API. You want to sample a random subset of the records in that dataset. Just generating the indices won't do--you need to turn those indices into actual records. Random access imposes too much overhead--even if you can request arbitrary pages, you don't want to have to request each page more than once, and you don't want to have hold previously-requested pages in memory Just In Case you decide to back up and select another record from them later. Ergo, a non-ordered permutation function doesn't give optimal performance.With ordered sequential sampling, you can request one page at a time, get all the records you will ever need out of that page, then move on to the next page that would contain an index from your sample.Now, if your sample size is small enough that you can fit the list of card-indices into memory, rather than just streaming the actual selected records off somewhere else for processing (which in my case, it is), then you could achieve equal data-access performance by selecting the head of a random permutation and then sorting it--but as the article mentions, that kills your asymptotic performance for computing the indices, and means you can no longer perform selection on-line, or parallelize the operations of index selection and data retrieval.
 How do you pick a random permutation function over 64-bit integers? On a set of N=2^64 integers, there are N! permutations, so you need lg(N!) ≈ NlgN random bits to generate a uniformly random permutation, and thus at least that much time. This is too much: we want a solution in O(K) time and O(1) additional space (in addition to the space for K integers) — something that's Ω(NlgN) is out of the question. Things like block encryption do not generate a uniform distribution over all possible N! permutations of N=2^64 integers (at typical parameter sizes); obviously they can generate only as many possible outputs as 2^(the number of random bits used).
 https://en.wikipedia.org/wiki/Format-preserving_encryptionIt is possible.
 What “format-preserving encryption” gives you is a hard-to-predict (pseudorandom) permutation on the N elements. The resulting permutation is not guaranteed to be uniformly distributed on the space of all N! permutations; it just comes from some large space that's still much smaller than N! (depending on the number of random bits used), and it's hard to predict which member of that space the permutation will be.It is simply not possible to sample from a uniform distribution on a set S (here, the set of all N! permutations) in time less than lg|S|. Just think about it: whatever your randomized algorithm, if you only flip a coin M times, there are only 2^M possible outputs of your process. If you want every member of S to be achievable, this requires 2^M ≥ |S|, or M ≥ lg|S|.
 As long as N is factorable into primes, it can be done by Feistel encrypting each prime using modular addition.There are ways to do it with greater coverage of the space between 4 and 2^64, but I would defer to better judgement before doing so.
 Factoring into primes doesn't really affect the lower bound on complexity.Note that for N=2^64, the number N! is about 10^(10^20.5), i.e. just the number of digits in N! is itself about 3×10^20. The number of bits in N! is itself over 10^89. Every prime number from 1 to 2^64 (over 10^17 primes) is a prime factor of this number N!; if an algorithm is supposed to work with those many primes, it's clearly no longer O(K) as K could be small relative to N.An algorithm (like one of the proposed encryption schemes) that produces, say, one of 2^1024 permutations may be “good enough for practical purposes” (as 2^1024 is much larger than the number of atoms in the universe), yet as a fraction of the space of all possible permutations, it is dizzyingly insignificant.
 That is correct.To achieve full coverage of possible permutations, one needs independent keys and innumerable rounds.
 I'm feeling like I'm missing something, but why not? 2^64 is the number of cards in the deck, but that's not the number of cards we're picking. That might be just 10.
 As a sibling comment mentions the constraints given simply don't allow sampling with rejection due to the memory constraints. You get a stream of uniform samples, you're not allowed to store them.But even if you were allowed to store them the algorithm still has an advantage. It doesn't require random access, only sequential reading and skipping. This was made for tapes, but is also useful for cases where it's simpler or more efficient to stream your data source instead of doing random access.And of course it also works for m < k < n, where m is your memory size. E.g. when you're performing analysis on a dataset with 10^12 entries, you only want to sample 10^9 of them and don't have the storage (or don't want to pay the IO overhead) for either.
 > even if you were allowed to store them the algorithm still has an advantage. It doesn't require random access, only sequential reading and skipping. This was made for tapes, but is also useful for cases where it's simpler or more efficient to stream your data source instead of doing random access.This is a good point, but its evil twin is the point that this algorithm requires you to read the entire data set, every time, making it undesirable for any application where you can do random access.
 Not necessarily. If you're using it a tape drive, it requires you to scroll through half of the entire tape on average every time, but fast-forwarding a tape is a faster operation than actually reading the entire tape. Other sequential access data store may have similar behavior--i.e., if you are accessing data from a scrolling/paging API that lets you skip ahead and drop pages, but doesn't let you back up without restarting.
 Because we're restricted to O(1) extra space by the problem statement, regardless of the number of cards we're picking. We could be picking all 2^64, but definitely don't want the constant in O(1) to be 2^64!
 O(1) /extra/ space, but we must have working space at least the size of the output.
 The paper describes an online one-pass algorithm for which this is not true: as you get each input value you accept or reject it. So you only need O(1) space to store the number of values you've seen, the number remaining, and the number you still need to accept.
 The number of values you've seen could be "only" an exabyte, which is a problem.
 You're storing the number of values you've seen:`````` So far, I've seen 172 values. `````` You're not storing the values themselves. If the number of values you've processed is too large to fit in memory... you probably had to spend a long, long time processing those values. It's not a relevant objection.
 The algorithm has to deal with picking, say, 2^63 of them. No room in RAM or even on disk to keep track of the chosen cards.
 It doesn't. The original problem statement is "generate a list of random tweets, without duplication". If you can't store the list, you're stuffed no matter what.
 I found the article a bit obtuse, but I take it to mean we are drawing a uniform sample from a very large set. This means we don't need a shuffled sample, we just need one with a uniform chance of each source element being in the sample.Duplication in the context of the article appears to refer to tweet identity rather than content, so taking a stream of all existing tweets and selecting a uniformly sampled subset of them is certainly possible without storing the list of sampled tweets.The algorithm referred to has an additional constrain of sequential access only. If you have random access there are simpler approaches to take. If you have an unbounded input stream you just filter using a prng for a uniformly distributed sample.The article only applies to enormous and sequentially accessed data sources, which is why the algorithm died with tape drives: it's not actually a common problem any more.
 From the article: "Stated more formally: given non-negative integers k and n with k <= n"Focus on k <= n.Also, in the comments section in the article, the author counters a comment in favor of Python's random generator along the same lines.
 He also just throws this out there: "visual inspection was enough to convince me that it [Python's sample function] was broken, without going further" -- with no explanation or in what way it is supposedly broken.
 The original problem is to "generate a list of random tweets, without duplication". So lists are most assuredly allowed, and the implied constraints on the output size are in play. Using those to solve the problem under an assumption that the output size is much smaller than the input is therefore fair game.
 Picking tweets is an example of the problem, not the definition. Further, the article was very explicit about noting the small-size cases and ignoring then as irrelevant and boring. So no, they're not fair game.
 Let's just read the first paragraph again:"Here’s a program roughly 0% of programmers know how to write: generate a list of random tweets, without duplication. You may think you know how to do this, but you almost assuredly don’t."This problem can definitely be changed into harder problems, but this is the premise of the article and it's pretty darn easy.
 I think the premise is ambiguous and my initial reading caught the implication from "tweets" that the datasets are huge, therefore keeping track of state in the naive solution is impractical.
 The problem he states is really not for the algorithm he fails to actually explain. The algorithm only makes sense if you require the output to be in order (and you need a really large sample). The part where he suggests shuffling the output afterwards is absurd; if you wanted random order, there are much simpler algorithms.
 Yes but, I feel like he's still working too hard.I've 'reinvented' an algorithm at least twice to generate/iterate all permutations or combinations of a list by counting from 1 to P(n, k) or C(n, k) and just doing repeated divisions to determine the next element in the sequence. That generates them in order. If n is small enough to fit into memory then you can shuffle the inputs, but that still gives you a partially ordered output (all outputs with 'foo' in the same position are contiguous, for instance).I always left myself a TODO to use a PRNG instead of an increment if I needed the outputs to be more shuffled than that, but it never got that far.
 If you can store 2^64 data points, why cant you store 2^63 indexes? That's a very small increase in memory requirements.
 The issue isn't storage requirements; the issue is that, in the environments where this algorithm is useful, random access to the storage device is orders of magnitude slower than sequential access. Think magnetic tape in 1986, or today's spinning rust hard drives that take relatively forever to "seek", or even RAM being accessed sequentially, allowing the CPU to speculatively pre-fetch cache lines in burst-mode.Using Vitter's algorithm for generating the random numbers in order, one at a time, lets you do a single, serial pass through the 2^64 records on said device, streaming out the selected records sequentially.If instead you generate your 2^63 random numbers out of order, you'll have to do at least 2^63 random writes one way or another. Even if you have an extra exabyte to do these writes into (and don't mind waiting to generate all 2^63 of them before outputting the first one), it will take a prohibitively long time to do, since it's not sequential writing.That's the motivation behind this algorithm; if the characteristics of your exabyte-scale storage device are not "sequential = fast / random = orders-of-magnitude-slower", then it may not be of interest.
 No if you generate them out of order, you can just use a permutation function and have zero memory requirements.This algorithm is only fast given that sequential access is much faster than random access (and you don't need your output shuffled).
 You can't naively store 2^63 indices, else you'll get exponential run time to insert or lookup.Using a random access memory structure like a binary tree gives you logarithmic run time, but for a 64-bit index and 64-bit pointers will cost you 3x the space, so now your index is larger than your source data.And since you probably don't have random access memory, but instead have random block access, you'll need to use something like a b-tree to avoid getting destroyed by seek times, and this has a further space overhead - you're now more like 6-8 times larger than the source data.You may as well merge sort the 64-bit source data (indices) at O(nlogn) time and constant space costs.
 It might be easy to generate and output the 2^64 data points, because you can throw them away. With 2^63 indices you need to keep them around.
 The last time I wanted an unpredictable, non-repeating sampling of 64-bit numbers, I used a symmetric cipher in CTR mode. Essentially, start with a random key and a random value, return the encrypted version of it, then the encrypted version of its increment, and so on. Returns every number in the range, with no predictability.
 The thing the Vitter algorithm does which this doesn't is that it produces the samples in order. This is why it's useful with tape drives, or sampling market data through the day: you can start at the beginning, and take the samples without ever having to seek backwards.
 It is also the only reason to use this algorithm :) It's a pretty important requirement that I don't think the author emphasized enough.If you want them shuffled (or don't care if they're shuffled), better use a permutation function.BTW there are much faster permutation functions than cryptographic ones (you don't need the security aspect for this application). Look to modern PRNGs.
 This is a good approach. Thanks to bijective mapping between original and encrypted spaces the result is guaranteed to be unique. And since the mapping function is a secure encryption algorithm, the output has uniform distribution. Well done.
 Indeed, but this assumes that modern crypto is secure! :)
 When our crypto prof told us we're not sure if one way functions exist, everyone's jaw dropped.
 I know this, as I work in crypto.But I’d have never considered using it for this - it doesn’t register in the brain as anything other then crypto.Awesome solution, thanks for hacking my brain.
 This works, but only if you're sampling the exact same space as the block size. For example, if your cipher outputs 64-bit blocks, but you want to sample in 2^32, you're back to the drawing board, because there's no way to truncate the cipher output without risking duplicates.
 You can use the "hasty pudding trick" to truncate the cipher output without risking duplicates. It is inefficient if the block space of the cipher is much larger than the sampling space, though.However, you can also use a Feistel network to create a block cipher for an arbitrary bit length. So, you can always bound the block space to be no more than twice the sampling space, which is okay.
 What's the hasty pudding trick? Is it related to the Hasty Pudding cipher?
 It is. Perhaps it's not officially called that, but that's how it was described to me:
 A variable-size block construction would help here. It doesn't even have to be cryptographically secure, only bijective and statistically good enough similar to PRNGs sans CS. It doesn't even have to be keyed.
 Another good tool for this is the linear feedback shift register, and most programmers outside of crypto seem to be unaware of it too. A "complete" N-bit LFSR gives you 2^N non-repeating pseudorandom values. You start with some value (the seed), and feed it into your LFSR operation along with your chosen bitmask (the "taps"), and you get a difficult-to-predict value back out."Complete" LFSRs are created by carefully selecting your taps, and there's plenty of literature available for a broad range of taps for many different numbers of output bits.Your cipher may have an LFSR buried somewhere in it, I guess they're a common component in crypto. (I know dick-all about crypto, mostly.)Usual caveats for PRNGs apply: LFSRs are cool and magical and super fast, but the next number in the sequence can be computed by anyone else that can figure out your LFSR and seed, and since a complete LFSR will hit each value in 2^N exactly once per cycle, it doesn't even count as pseudorandom.But, if you need a random-looking non-sequential non-repeating selection widget, an LFSR just might do the trick.
 I’ve used this trick with a 64-bit block cipher (base-32 encoding the result) to generate short, unique public identifiers (to put in URLs) for entries in a database based on an integer primary key.
 There is also Skip32 cipher for 32-bit values. Interestingly enough, it was initially designed to cover the scenarios like yours.
 This is a good way to pick a few 64 bit numbers without duplicates, if 64 bit numbers is what you need.I didn't see any replies to you mention this part, but Vitter's algorithm works on an unknown number of samples. You can pick a 64-bit number uniformly when you know in advance you have exactly 2^64 choices. But how would you pick samples from a stream with equal probability, without knowing how many items are in the stream? Maybe it's 3, maybe it's 2.493e85, you won't know until you run out, and when you do, you want to have picked some items from the stream with uniform probability.
 Incorrect, you do know this number beforehand. The parameter `N` in the code is the upper bound on the number of values, and the parameter `n` is the number of samples.
 I don't know which code you mean, but I was referring to Vitter's basic "Algorithm R", which does not depend on knowing the number of samples or an upper bound in advance. https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R
 Wait, are symmetric ciphers guaranteed to cycle through the full space of block-sized numbers when applied to their output like that? I thought that was just "close to" true?
 Yes, it's bijective. For every input value, it provides a unique output that will decrypt back to that input.You might be misunderstanding the algorithm. We're only encrypting an incrementing number, not ever encrypting something that is already encrypted.
 Ah, okay. Thanks! Maybe I was thinking of hash functions, for which it's not guaranteed that every output has an (n-sized) pre-image?Edit: To clarify, I do now understand that you were proposing:random number X -> output encrypt(X), encrypt(X+1), encrypt(X+2), etcWhich I had misinterpreted as:random number X -> output encrypt(X), encrypt(encrypt(X)), encrypt(encrypt(encrypt(X))) etc.
 Counter mode means encrypting successive (all different) numbers, not the previous output. There is no issue with cycles.
 Yes, the original commenter replied to say that and I agreed to it.
 This is my homespun solution whenever I need a deterministic, non-repaeting mapping of a sequence. Except I rely - shame on me - on a variable blocksize encryption scheme I've hacked together for just this purpose.Output - of the mapping as well as of anything I run through the encryption algo - passes all pseudo-random validity testing I throw at it, but obviously I would never, ever consider using the thing where real security was a concern.
 Choosing an initial random number, then keep modulo-incrementing it by a generator (a number larger than, and relatively prime to, the stack size) wouldn't work as well?
 In a similar vein, I use a PRNG (based on Blake2b cryptographic hash) and mask off bits for the required domain.
 Why not simply increment the CTR IV with a monotonically increasing counter?
 you can also use a Lehmer random number generator of full period.
 Like all LCGs, the output of the Lehmer RNG is very predictable, but it would indeed suffice for the purpose given in the article.
 >The misleading part in using the language of cards is that you don’t often consider a deck of size 2^64. What’s nice about the card formulation though is that it conveys how simple the problem statement really is. This is a fundamental problem that was open for a long time. Nobody knew how to deal cards.>The first obstacle that makes the “dealing” (without replacement) hard is that the “draw” method doesn’t work. If you draw (with replacement) over and over again, ignoring cards you’ve already picked, you can simulate a deal, but you run into problems. The main one is that eventually you’re ignoring too many cards and the algorithm doesn’t finish on time (this is the coupon collector’s problem).How is this true? If you're drawing from a deck of 2^64 cards then unless you're drawing almost all of them there's not going to be any problem with rejection. Even if you're drawing 2^63 cards you're going to be rejecting only half of your draws, so the average slowdown is just a factor of 2.If you really want to draw more than half the deck then just randomly pick the cards that you don't want.
 Yeah, the motivating text from the article isn't very convincing. Seemed really odd that they didn't even describe the algorithm either. Sure, there's code and a link to the paper, but you spent so much time in prose trying to talk it up, needs to have the full story in prose too if you're going to do that. At least give the main idea.I skimmed the paper, and the sort of algorithm used does have some advantages I can see. Mostly, you can start giving output right away and don't need extra storage, and you don't have to be careful that n << N or have separate algorithms with a threshold or anything.
 The algorithm in the paper does have a threshold of n/13, beyond which it switches to "Method A".
 Ah, thank you, I did not see that part.
 The problem the author tries to avoid here is in this way the memory requirement is proportional to 2^63.
 EDIT: This is totally wrong. I haven't had my morning coffee yet. H_n - H_(cn) ≈ constant, not diverging (for 0 < c ≤ 1). :)The problem is that all of these little errors add up to give you a logarithmic slowdown. In particular, the solution proposed in this comment is still Ω(k log k) (where k is the number of elements drawn), which is slower than the algorithm's proposed solution of O(k).
 I see. In particular the log(k) doesn't come from repeat attempts, it comes from the time needed to search through your list of drawn elements to see if you've seen the chosen element before.
 See my edit! Oops.Anyways, if you were to keep it in some hash-based structure, you could check membership! (But I'm sure you already knew that :)
 No structure can check membership in O(1) time if n is big enough.
 In general, assuming that we have constant-time operations on arbitrarily large numbers (which we almost universally assume when doing these analyses, except in very specific fields), then hash structures do check membership and add items in O(1) time (since the time spent increasing the structure size is amortized fairly quickly).Additionally, the only problem with this approach is the extra O(k) space taken up by the structure, which doesn't meet the condition if the result is to be reported in a streaming fashion (but this is not the case in most practical applications)! For comparison, Vitter's is able to do this streaming output in O(1) space.
 > then hash structures do check membership and add items in O(1) time (since the time spent increasing the structure size is amortized fairly quickly).Hashtables can never guarantee constant time operations. It is expected O(1) time, but it is completely possible that everything hashes to the same bucket and you get O(n) operations, no matter how much you rehash.
 The algorithm is a randomized algorithm and we're assuming we're picking a good hash. The probability of that event is exponentially vanishing, so, in expectation (which is the runtime we're looking for, as we are analyzing a randomized algorithm based on rejection sampling) it is O(1).In the worst case analysis of this rejection-sampling algorithm, the time for table look up and insertion doesn't even factor in since you'll have infinite run-time.
 > Hashtables can never guarantee constant time operations.A bit mask is a form of hash table and is guaranteed O(1) lookup for O(lg(Universe)) space.When you have fixed universes, all kinds of options open up. I personally love vEB Trees, which can do set operations in O(lg(lg(U)))
 Yes, in cases where direct addressing is possible you get guaranteed constant time. That's really more of an array than a hashtable IMO.
 > How is this true? If you're drawing from a deck of 2^64 cards then unless you're drawing almost all of them there's not going to be any problem with rejection. Even if you're drawing 2^63 card you're going to be rejecting only half of your draws, so the average slowdown is going a factor of 2.It it though? I'm a but shoddy on my mathematics, but I'd wager from the birthday problem that you'd probably have to reject much more than half of the 'cards', and as you get to the end, you'd have to redraw several times for each card.
 Birthday problem says that once you've gone through ~2^8 cards you've got a ~50% of having had to reject at least once. It takes 2^63 until you have a 50% chance of rejecting on each draw.
 You're right, but it's not 2^8. It is √2^64 = 2^32.
 Yes, thanks. Took the root of the wrong number.
 Ah, right. I wasn't thinking properly. Thanks for the explanation.
 But on each draw you need to check if its a card you've already drawn. You can hold your cards in a set data structure but then you pick up the complexity of the insertion operation into sets. IIRC sets are basically hash tables that map keys to themselves. Insertion is O(1)...until you need to resize your table due to hash collision.
 If you think about it one draw at a time it's pretty clear and there's no room to be missing anything really.2^63 is half of 2^64. For the last draw, each time you try to pick an unused, what's the odds it was taken already? (approx 1/2) You won't have to try too many times. The draws before the last are strictly better, so overall you should be fine.
 That seems correct. On top of that, if you ever need to choose more than half, you choose n-k first then reverse the set. Could it be that the distribution condition is not satisfied? Although I don’t see how. The article could’ve explained the problem better (I mean it keeps reffering to “draw without replacement” - what does that even mean).
 The "choose n-k" trick is quite nice. It has one big downside though, which is that you can't start reporting any results until you're totally done.One algorithm that I haven't yet seen mentioned in this discussion is: shuffle the array of choices, pick the first k.If you're clever about it, this has some nice properties: you can stop shuffling after the first k, and if you want you can start reporting results immediately. (You'd want to use something like Fisher-Yates to shuffle for these properties)This also wouldn't use much extra space, and doesn't require special handling for small or large k.Biggest downside is it will change the order of the original set. And it's only worth doing if you already have an array full of the items to choose, most likely.
 I think the article discusses the fact that shuffling the entire initial set takes up way too much memory.
 A use case may be shuffling the entire list, if I'm not mistaken?
 They specifically say that a nice thing about Vitter's algorithm is that it produces its results in sorted order, and that it's easy to shuffle them after they've been drawn.> The reason it’s nice for an algorithm to produce results in sorted order is because it’s easy to randomize the order after the fact.
 ...but is it actually so? The more I think about it, the entire topic seems to be, how do you say, a dog biting its own tail?You have a sorted list. Now randomize it. Either you pick random from source, or you place randomly into destination. And then, as k approaches n...
 You can shuffle k items in O(k) time using an algorithm that swaps items in place. You can't use a similar algorithm to pick k out of n because that'd require you to materialize all n items in memory simultaneously, which exceeds the space constraints.
 I thought of swapping too, but there's something about it I can't quite fathom, that tells me, it will not be "truly" randomized as with the method I described above......and a way to offset that, if you insist on the method of swapping, would be to remember what you've swapped and cross-checking that, but then you end up with the same problem as when you pick random items...
 Fisher-Yates [1] algorithm is proven to "produce an unbiased permutation: every permutation is equally likely".Additionally, "The modern version of the algorithm is efficient: it takes time proportional to the number of items being shuffled and shuffles them in place."
 According to examples in the English Wikipedia, the method I described above is actually the Fisher-Yates method (and its naive implementation, and thus the O(n²) complexity), and the swapping method on everyone's minds was actually Durstenfeld's (optimization?), with O(n).The initial code snippet intended to be Durstenfeld was erroneously an off-by-one Sattolo implementation, which then further solidifyed my suspicion that it had a bias, which Sattolo's implementation actually indeed has. The Durstenfeld method is however proven to be unbiased.To add further confusion to the names in the issue, though some readers might not be able to verify due to the language barrier, the German Wiki states (claims?) Fisher-Yates is Durstenfeld's method with O(n), and the "naive" Fisher-Yates is labelled as the "direct" approach, with O(n²). The name Durstenfeld is not mentioned anywhere on the German Wiki. Though the names are different, the maths are the same in both English and German. I speak German and often read both versions of an article, in case anyone's wondering why I even brought up the topic.Thank you for posting the link.EDIT: I try not to post usernames in text, in case anyone ever changes their name.EDIT2: I recommend to always positively increment an iterator from 0 to n-1, and use in-loop arithmetics to get the actual desired iterator/index you require. It's less performant, but I'd argue mistakes are less common.
 OscarCunningham's implementation https://news.ycombinator.com/item?id=20961548 remembers what has already been swapped using the counter i.
 An easy way to randomise a list is the following:`````` import random list = [0,1,2,3,4,5,6,7,8,9] length = len(list) for i in range(1,length): j = length - i k = random.randrange(j+1) list[j], list[k] = list[k], list[j] print(list) [3, 9, 6, 4, 7, 8, 5, 1, 0, 2] `````` The randrange(j) can be done quickly by finding the smallest power of 2 above j and then picking randomly below it until you get an answer less than j.
 That code doesn't work as written (there is an off-by-one error in the first iteration of the loop). But the intended algorithm (starting at the end of the list, swap each element with a random predecessor) doesn't produce a uniform distribution anyway; for instance it will never produce an output where the last element is unchanged.
 I had an error which I believe I have now fixed. The intended algorithm is "starting at the end of the list, swap each element with a random predecessor or possibly with itself".
 Wouldn't this have a bias to how the initial input sequence is?EDIT: On the other hand, due to randomness, it could cancel itself out, I don't know, beats me. My brain hurts, at this point, I'd try to figure it out statistically first, to see if it's even a valid assumption before continuing the theoretical thought, but I'm presently too lazy to code that out... it's Friday, cheers.
 How long do you think it will take you to shuffle a list of size 2^64?
 This solution requires storing the 2^63 cards you have picked. That's impossible to do today.
 It would take a lot of storage, but it's not impossible. 2^64 bits equals 2097152 terabytes. A lot, but doable with modern technology. I guess for \$100-200M.
 Some of the variable names used:i, j, t, qu1, S, n, N, U, X, y1, y2, VI know others have already commented on this but I think it's worth laying out all of the cryptic variables and asking: are some of these shorthand for well known math concepts that make it unnecessary to have a more descriptive name? I understand the use of temporary value holders like `i`, but have no clue what some of the others might be used for or what the single letter names indicate.Another way to ask the question: if the author had just used instead a, b, c, d, e, f, g, h, etc. what information would be lost, that is gained by the particular letter choices that were used? This isn't rhetorical, although there is a point to be made for sure about clarity in naming... I'm really wondering if anyone can explain what these letters might mean if you are more versed in the math than I am.BTW here's another Vitter paper I found in case anyone's interested:
 The reason for naming the variables as they are seems to be to keep consistency with Vitter's paper [http://www.ittc.ku.edu/~jsv/Papers/Vit87.RandomSampling.pdf] Since the paper is really the primary documentation of the algorithm this makes perfect sense. Renaming the variables to be more meaningful to you would make it harder to compare to the paper, reducing the effectiveness of the paper as documentation and making it hard to compare the code to its specification.
 There's no reason to preserve bad notation from a 30year old paper.
 Are you saying that single-letter variable names in maths is "bad notation"?
 The variable names are mostly from the paper, since the blog post mostly just transliterated the Pascal code in the appendix. Some of them are explained. n: sample size. N: file size / total record count. U,V: independent uniform variates. X: random variate approximating skip distance. S: skip distance. qu1: N-n+1. y1 and y2 aren't explained, they're just intermediate formulas. t seems to be a local loop variable. i and j are index variables for the array used to return the sample, introduced by the blogpost.I'm not saying it's good code, but BLAS is similar: https://github.com/Reference-LAPACK/lapack/blob/master/BLAS/.... At some point working code beats comprehensibility.
 > working code beats comprehensibility.If you give long names to every variable, by the time you finish reading a line you forgot what they were doing.Tolstoy novel problem.
 >are some of these shorthand for well known math concepts that make it unnecessary to have a more descriptive name?With context, I would agree that it's unnecessary to have a more descriptive name. N is the population, n is a sample in the population. However, without the context and experience which makes these values recognizable, they're exceptionally unhelpful.
 I don't have an answer, but I remember Go programmers on Quora swearing up and down that (the Go convention of) using single-letter variables is ingenius and leads to great code ... although none of them could give a single example.
 This is a bad take and odd place to criticize Go programmers.Go convention is often short variable names, but the only place I know where single-letter variable names is encourage is for receiver methods. https://golang.org/doc/effective_go.html#methodsIn that case the definition of the single-letter variable is literally a line or two of code away.
 So is HN a good place to ask, and if so, are you able to find a perspicuous example highlighting the superior code quality?
 The receiver name is likely to be used within the method, and always defined at the top of the block (and recommended to be consistent across all of the receiver's method definitions). So locality and consistency are great, improving the readability of the code.Names like "self" and "this" proscribed by other languages begin to have issues when you start dealing with nested definitions and closures. Allowing them to be explicitly named in Go avoids these ambiguities.
 The distinction isn't against protected words, but against more descriptive names (like "buffer" or "receiver"), and the issue is more complex functions, not one-liners.Anyway, didn't mean to put you on the spot, since you don't speak for Go programmers and weren't planning to justify the argument, I just wanted to clarify why these wouldn't satisfy the challenge.
 It’s called Reservoir Sampling https://en.wikipedia.org/wiki/Reservoir_samplingI was just re-reading this yesterday! I think a lot of the Monte Carlo rendering people know about this family of algorithms, and the idea is pretty simple. I’m not sure why it’s described from the start as picking k samples though, I find it a lot easier to understand and see how it works when you start with k=1.Btw, it’s fun to work out the probabilities of picking a sample to prove how this algorithm works, and that it’s uniform. Big piles of factorials that all cancel out.Pps the other super rad algorithms that aren’t well knows are the alias method for simulating a discrete distribution https://en.wikipedia.org/wiki/Alias_method...and the suffix array, which is crazy surprisingly useful https://en.wikipedia.org/wiki/Suffix_array
 Reservoir sampling requires you to maintain a reservoir of n items that will be your sample. This can be prohibitive if the sample size itself is huge. The algorithm described here decides (commits) whether an item is to be sampled as it sees it (i.e., online), while consuming a constant amount of memory. It differs from the usual reservoir sampling approach in that it cannot go back and "unsample" an item it already selected (whereas the crux of a reservoir sampling approach is evicting current candidate items and replacing them with new items).
 While reading, I was thinking about LFSR [0] with suitable properties and period. Not strictly random, although LFSRs are commonly used for generating "random" numbers.
 Wolfenstein 3D's FizzleFade uses an LFSR to visit each screen pixel exactly once in a pseudorandom order: http://fabiensanglard.net/fizzlefade/index.phpThere's an alternative FizzleFade implementation using Feistel networks: http://antirez.com/news/113 -- I think the same technique could be used here to randomly permute the index space.
 I was thinking the same. Depends on what "random" actually means. If you want to "pick one of these that you haven't already", a LFSR is great. Less so if you want truly random, or if anyone will start looking closely.
 Exactly my thoughts - its wolfenstein fizzlefade or LSFR that you learn in crypto 101 that can pick all the numbers in the range randomly."Algorithm No One Knows About" is such a condescending title...
 Especially if you don't actually explain the algorithm and seemingly fail to understand its single most important requirement is that the output is in order. Not something to shuffle later, because the only advantage of this algorithm is that the output remains in order.Otherwise you use a permutation function, such as an LSFR or preferably a family of functions.
 Getting this wrong can tank your product - see what happened to Draw Something http://epeus.blogspot.com/2012/04/draw-something-ceo-grace-a...
 Summary without the drama: by not using a "discard pile" but picking randomly from the same set of words every time, you get repeat words paradoxically often (birthday paradox), which is what pictionary did and people complained about.
 Here is how I would do it. Pick a prime p bigger than N. Pick a random permutation of Z/pZ. Pick an initial state. Advance to the next state, to cycle through all the values exactly once.You use rejection sampling if the state is greater than N, so as long as you pick p close to N this should happen very rarely and be amortized O(1).If N is big this advance to the next state can be encoded easily as a multiplication by an element of Z/pZ (multiplication followed by modulo p), and you will cycle through all the values because Maths. If N is not so big you pick a few (m) elements of Z/pZ to represent your permutation, and you use the first number as a multiplier mod p the first time, then the second number as a multiplier mod p the second time, and so on..., and you will also cycle through all the value, because Maths as long as you don't get a 1 while multiplying sequentially your m elements.Don't forget to stamp it good enough(TM), if not use Dual-ECC.
 Okay, so this is a way to generate a sampling in order, so you don't have to keep a sorted list and your run time can be k instead of klogk. Mostly.The body of the post, with its talk of strict timing and sample bias, had me thinking it had some clever way to pick a random number from 1 to n in finite time. But that's not actually possible to do with a random bit source and a non-power-of-two n. If this algorithm hits certain rare values it will re-sample the RNG.
 This appears to be the state of the art in minimising the resampling cost:
 >had me thinking it had some clever way to pick a random number from 1 to n in finite time. But that's not actually possible to do with a random bit source and a non-power-of-two nI can sample 1-6 in finite time by rolling a die. I suppose you mean given a RNG in range 1-2^n, you cannot get a sample range non-power-of-two in fixed time.However, you can sample in finite time with probability as high as you like, since the likelihood of another roll falls off exponentially. So if the odds of your machine quantum tunneling into a star is p, then you can make an algorithm that does your sampling with fixed time T with probability more than 1-p, making it perfectly usable.in practice, resampling rand takes very, very little time. I use it all the time for uniform [0,n) queries. The avg number of rolls is for the most part like 1.1 or less.For example, to sample 0-9 given a 32 bit full period generator, since 2^32 mod 10 is 6, only in 6 cases out of 2^32 cases do you need a second roll. This is common.
 I may be misinterpreting the question here but can't the author just use a large prime to to pull from the array like you would when hashing? Its how you would in place shuffle a music playlist.
 > When I came across a need for the algorithm in my work, it took a lot of detective work to even find it. There was no source code online. The algorithm is not referenced often, and when it is, it is in a highly obscure fashion.Well, the author certainly isn't helping that problem with a clickbait title like that, and being all mysterious for the first few paragraphs, teasing "you almost certainly don't know this"."Dealing cards from an extremely large deck: Vitter's Algorithm"Except he doesn't even explain the algorithm, just some source code (which is super unclear, no comments, short variables). And the unhelpful comment that it was hard to figure out from the paper. All the reason to put a more descriptive title.How does he expect to help people find this, figure this out, or was this entire post just to brag that "his" solution is better than anyone else's because he found the right algorithm almost nobody else knows about?
 Oh dear, I find those single-letter variable names are quite counter-productive in making the code easy for someone unfamiliar to the project to understand...
 Yeah, I'm way out of my comfort zone here but I found the choice of n and N pretty off-putting
 A typo is only a (sticky) shift away, and not even obvious!EDIT: ...and compiles, too!
 They are like that because they essentially transliterated the original Pascal code from the paper. Leaving the variable names as they were in the paper allows for the paper to act as documentation for the code, so to speak.
 Can't you use https://en.wikipedia.org/wiki/Lehmer_random_number_generator for this? I thought that was a common method of generating a random sequence without repetition. True, the result is not very random, but in most applications it doesn't make much of a difference.
 Not if you want it to be actually random.But you might be correct if you simply want the appearance of randomness.
 It’s related to Fisher-Yates shuffle, https://en.wikipedia.org/wiki/Reservoir_sampling#Relation_to...But it’s not the same thing. The difference is whether you can keep all samples in memory at the same time. With Vitter / reservoir sampling, you only need to have your reservoir in memory plus enough space for 1 new sample at a time, so you can stream through a large number of samples with a tiny amount of memory.
 Exactly, just truncate the iteration to the first k elements. Need a hash table to get the modified values, so it breaks the "O(1) extra space" condition, but that condition is bullshit anyway, you're already using O(k) space to store the numbers you find, an extra hash table is also O(k) space. http://ideone.com/5uwqNO
 Ok I'll take a shot at it.Pick a number between 0 and N. Let's call it p for pivot. Each draw after this initial draw is either less than p with probability p/(N-1) or greater than p with probability 1-p/(N-1). You need to make k-1 additional draws. The probability of m of those draws being less than p is given by the binomial distribution. Choose m from the binomial distribution. Then run the algorithm recursively on the space 0 to p-1 and p+1 to N. Some pseudo-code.`````` def deckpicker(N,k,offset=0): if k == 0: return p = uniform_dist(0,N) m = binomial_dist(k-1, p/(N-1)) deckpicker(p-1,m,offset) print(p+offset) deckpicker(N-(p+1),(k-1)-m,offset+p+1)``````
 What's the time complexity of sampling from the binomial distribution?
 well...shit.
 Why is hashmap not good enough for unique ID generation?`````` const tweetIDs = {}; while(TWEET_COUNT){ const rnd = Math.random() * MAX_TWEET_ID; if(!tweetIDs[rnd]){ tweetIDs[rnd] = true; TWEET_COUNT--; } }``````
 The algorithm described in the article has other desirable properties:* IDs are returned in sorted order* IDs are generated one after another, ie could be implemented as a generator or coroutine* No additional space required during the computationResulting in a single pass.
 Nicely said.I was thinking of ways to solve the problem by eliminating the overhead of checking whether a random number was already drawn, when Vitter's solution was to make the random number generator produce random numbers in a sorted manner, thus eliminating that overhead entirely.EDIT: Then again... if the algorithm is to be used to shuffle a list, wouldn't returning the selection in a sorted manner defeat the entire purpose? If k = n, output would be input...
 It would. The original problem is to draw a subset of tweets.If there's a chance to randomize the entire tweet archive, then it's a shuffling algorithm now.
 > * IDs are returned in sorted orderThis is key. The author does not explain this well, or at all. If you require the IDs in sorted order, you need this algorithm.If you don't mind or want them shuffled, use a permutation function.(the idea of shuffling afterwards is nonsense, if you could do that, you can also check for duplicates when drawing)
 >The main one is that eventually you’re ignoring too many cards and the algorithm doesn’t finish on time (this is the coupon collector’s problem).This will take longer time at each iteration and when both MAX_TWEET_ID and and TWEET_COUNT is large and close to (or equal to) each other it may take very long to complete.
 That gets pretty slow and memory intensive once you sample a large amount of the total doesn’t it?
 It was not a requirement :) However it won't get slow, it's O(1) to lookup if a key exists.
 > (Stated more formally: given non-negative integers k and n with k <= n, generate a list of k distinct random numbers, each less than n. We reasonably ask that this be done in O(k) time and O(1) additional space. Finally, the distribution must be uniform.)As k approaches n the expected number of draws needed to get k unique values increases faster than O(k).
 It will get slow towards the end once each loop iteration has a very low chance of being productive (only happens when you're choosing all or almost all elements). You can fix that, though there's potential tradeoffs.
 ”Here’s a program roughly 0% of programmers know how to write: generate a list of random tweets, without duplication.[…]Stated more formally: given non-negative integers k and n with k <= n, generate a list of k distinct random numbers, each less than n.“I think that knowing k beforehand makes it a different, easier problem. I don’t see how, if you don’t know k beforehand, you can do without enough memory to store a number of magnitude n!, and even that would be extremely slow, the more items you have to produce (uniformly pick a permutation of n items, and produce as many of them as requested)
 The author fails to emphasize this, but the key property here is that the output must be returned in order.
 I don't see why you'd need to generate a permutation?
 If you don’t know k beforehand, you have to (eventually) handle the case where k is large, possibly even equal to n.If it turns out to be equal to n, you have to generate a permutation.If it isn’t, you don’t need to, but you need to keep enough information around, in case it turns out that you need.For example, by the time you’ve generated 2⁶³ integers in the range 1…2⁶⁴, you need to know which 2⁶³ integers still are ‘available’.I think you can’t do that more compactly than by just picking the permutation to generate at start (whether you can efficiently determine the k-th number in that permutation is a separate problem)
 So let me get this straight...This algorithm creates an initially empty array of size K with the DB internal ids of what would randomly be selected. Then it goes in order from 0 to the number of rows - 1, and for each one it goes e.g. "What are the chances that number 0 would get randomly picked out of N items if K items need to be picked?" Then, if the #0 passes the "random check", it gets added. Then it would go to 1 and go "What are the chances #1 would be picked out of N-1 items if K-1 items need to be picked?" ...and repeated so on until all K random items have been picked?
 No, what you described is O(n).
 "(Stated more formally: given non-negative integers k and n with k <= n, generate a list of k distinct random numbers, each less than n. We reasonably ask that this be done in O(k) time and O(1) additional space. Finally, the distribution must be uniform.)""...it’s also possible to use the language of cards: deal k cards from a deck of size n, without replacement. So a poker hand like A2345 of clubs is valid for k=5 and n=52, but a hand containing AAAAA of clubs is not."These seem to be fairly straight forward problems to solve- can anybody ELI5 why the Vitter algorithm is necessary?
 Probably the hardest things are the space and time requirements. If you relax e.g. the O(k) time requirement to O(k log k), the problem essentially becomes trivial.[1]More generally, though, what solution are you thinking of? (The problem, as stated, is actually fairly difficult! As far as I can tell...)---[1] See: https://en.wikipedia.org/wiki/Coupon_collector's_problem . The solution is then to draw repeatedly and then check if this card has been drawn.
 I see- in that case I need to read up on O(k) vs O(k log k)
 We can compute a random k-size subset of {0..n-1} as rndsubset(0,k,n), whererndsubset(i,0,n) = {}rndsubset(i,k,n) = {i} union rndsubset(i+1,k-1,n) with probability k/(n-i), rndsubset(i+1,k,n) otherwiseThis however takes on average n/k time per element, which is prohibitive when n >> k.The probability of {h} being the next chosen element in rndsubset(i,k,n) is k/(n-h) * Prod_{i<=j
 Trivial algorithms (like one implemented in Python) slow down when k is close to n. Or when n is large, then it is not possible to use bitmap to mark dealt cards.
 The python algo does not get slow when n is large. The k~n case obviously calls for inverse selection.
 NOTE: This comment I wrote is totally wrong. I haven't had my morning coffee yet. H_n - H_(cn) ≈ constant, not diverging (for 0 < c ≤ 1). :)Yeah, but the k ~ n/2 case (in which inverse selection and normal selection have the same runtime) is still Ω(n log n) (equiv Ω(k log k)), which is still "slower" than the presented algorithm.
 Actually, this comment is correct despite making a similar argument to your incorrect one elsewhere. Using a set to accumulate intermediate results requires k inserts at O(log k) each.
 I'm still not sure about this—the amortized time of insertion should be O(1) on any one of the usual hash table-type-structures. We're not inserting a single item to an already-built hash-table, but rather k of them into an empty hash-table, with elements being uniformly drawn and bucketed using a good hash function. All of this should give O(1) average time for insertions.
 What's preventing someone from implementing a non-trivial algorithm in Python?
 That was not at all what I meant.
 And I still don’t know about it, because the article just has a code dump and some vaguely relevant historical commentary. Posting the actual paper would have been better.
 > given non-negative integers k and n with k <= n, generate a list of k distinct random numbers, each less than n.> We reasonably ask that this be done in O(k) time and O(1) additional space. Finally, the distribution must be uniform.)Sounds like the Knuth multiplicative hash.The Art of Computer Programming vol. 3, section 6.4, page 516.
 Tldr of the algorithm from the paper:Instead of choosing random numbers to include in the list, they're choosing the gap between numbers. They derive the distribution of this gap then approximate it so it can be calculated quickly.
 If this really was such an obscure but useful algorithm, someone should change the title of this post to help people find it in the future so it doesn't remain obscure (add "card drawing"?)
 It's... not that useful, as far as I can tell (unless I'm still under-caffeinated!)@avip mentions the following comment https://news.ycombinator.com/item?id=20961320 , which means that we only have to consider k ~ cn for 0 ≤ c ≤ 1/2. Drawing in this fashion, we have an expected runtime of k * (H_n - H_{(1-c)n}) ≤ k * (H_n - H_{n/2}) ~ O(k).[1]The latter is true since H_n ~ log n + C + O(1/n) for some[2] constant C.Of course, depending on what the author's constraints are, there would be an additional O(k) space requirement for keeping track of indices that have already been picked. Vitter's algorithm is particularly nice, since it can report the items without this additional space requirement.---
 I haven't read the algorithm code, but I think you can do the following? For n items, pick a random k <= log(n!), then enumerate the k'th permutation of n items. It's not trivial, but there are already algorithms for it, e.g.: https://code.activestate.com/recipes/126037-getting-nth-perm...
 Ooh, this feels like it gets close to the solution!It's not quite there as stated, though, since this algorithm requires O(k^2) time (where k is the number of elements to be randomly drawn) as you have to insert items into specific indices in the array.
 Yeah, but I think you can be more clever than just doing naive array insertions, by trying to batch them up. At least, we know decreasing sequences of indices can be done in one pass with no changes, and increasing sequences can be done in one pass by accumulating shifts. That leaves alternations to worry about. Now I haven't fully thought this through, but I think something along these lines might work (there may be errors in some of the details here): treat the list of indices you're inserting into as the leaves of a binary tree, and create the intermediate nodes of that tree, initializing them to zero. The intermediate nodes represent offsets (initialized to 0) that will be logically added to all earlier nodes. Then sort the nodes, breaking ties so that the later ones are first. Now perform every insertion at {the given index plus its ancestors}, and then increment the counters for all ancestors of that index so that you account for shifts of later elements. This should get you to O(k log k) or so.In fact, this makes me wonder why there isn't a multi-insert function in every language's vector implementation...
 I remember an article about the space complexity of drawing all colors randomly.. this seems related, isn't it ?
 Does anyone have a mirror for this? No matter what I do, all I'm getting is an empty page with a sidebar.
 In Firefox: View -> Page Style -> No Style(For whatever reason the page has a CSS rule #content { display: none; }, which presumably gets removed via JS.)
 Anyone feel like doing a human readable breakdown of what the algorithm is going?That would make me happy
 Yeah, really. A lot of filler paragraphs building up to unreadable code. Bad article.

Search: