
If you can store 2^64 data points, why can't you store 2^63 indexes? That's a very small increase in memory requirements.



The issue isn't storage requirements; the issue is that, in the environments where this algorithm is useful, random access to the storage device is orders of magnitude slower than sequential access. Think magnetic tape in 1986, or today's spinning rust hard drives that take relatively forever to "seek", or even RAM being accessed sequentially, allowing the CPU to speculatively pre-fetch cache lines in burst-mode.

Using Vitter's algorithm for generating the random numbers in order, one at a time, lets you do a single, serial pass through the 2^64 records on said device, streaming out the selected records sequentially.
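
For concreteness, here's a minimal sketch of that one-pass pattern in Python. This is the older selection-sampling idea (Knuth's Algorithm S), which draws a random number for every record; Vitter's refinement is to generate the skip length between selections directly, but the streaming access pattern is the same:

    import random

    def sequential_sample(records, n, N, rng=random):
        """Stream exactly n of the N records, in their original order."""
        needed, remaining = n, N
        for record in records:
            # Keep this record with probability needed/remaining.
            if rng.random() * remaining < needed:
                yield record
                needed -= 1
                if needed == 0:
                    return
            remaining -= 1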

If instead you generate your 2^63 random numbers out of order, you'll have to do at least 2^63 random writes one way or another. Even if you have an extra exabyte to do these writes into (and don't mind waiting to generate all 2^63 of them before outputting the first one), it will take a prohibitively long time to do, since it's not sequential writing.

That's the motivation behind this algorithm; if the characteristics of your exabyte-scale storage device are not "sequential = fast / random = orders-of-magnitude-slower", then it may not be of interest.


No, if you generate them out of order, you can just use a permutation function and have zero memory requirements.
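
For what it's worth, a minimal sketch of that permutation idea: a small Feistel network over 64-bit values is a bijection, so walking i = 0..2^63-1 through it yields 2^63 distinct pseudo-random indices with no stored state. The round function and key below are illustrative choices, not anything prescribed by the article:

    import hashlib

    def feistel_permute(i, key, rounds=4):
        """Bijectively map a 64-bit i to another 64-bit value; key is bytes."""
        left, right = i >> 32, i & 0xFFFFFFFF
        for r in range(rounds):
            # Round function: keyed hash of the right half, truncated to 32 bits.
            h = hashlib.blake2b(key + bytes([r]) + right.to_bytes(4, "big"),
                                digest_size=4)
            left, right = right, left ^ int.from_bytes(h.digest(), "big")
        return (left << 32) | right

Enumerating feistel_permute(i, key) for i in range(2**63) visits 2^63 distinct positions in [0, 2^64) without remembering any of them; the catch, as noted below, is that reading the records at those positions is random access.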

This algorithm is only fast given that sequential access is much faster than random access (and you don't need your output shuffled).


You can't naively store 2^63 indices in a flat structure; insertion and lookup would take time linear in the number of indices, which is prohibitive at this scale.

Using a random-access structure like a binary tree gives you logarithmic time per operation, but with a 64-bit index plus two 64-bit pointers per node it costs you 3x the space, so now your index is larger than your source data.

And since you probably don't have random-access memory but rather random block access, you'll need to use something like a B-tree to avoid getting destroyed by seek times, and that has further space overhead: your index is now more like 6-8 times larger than the source data.

You may as well merge sort the 64-bit source data (the indices) in O(n log n) time with constant extra space.
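
If it helps, a toy sketch of that "sort the indices, then stream" pattern (at 2^63 indices the sort would have to be an external merge sort on disk, but the read side of the data is still one sequential pass):

    import random

    def sample_via_sorted_indices(records, n, N, rng=random):
        """Pick n distinct indices up front, sort them, then stream the data."""
        picks = iter(sorted(rng.sample(range(N), n)))
        target = next(picks, None)
        for i, record in enumerate(records):
            if i == target:
                yield record
                target = next(picks, None)
                if target is None:
                    return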


It might be easy to generate and output the 2^64 data points, because you can throw them away. With 2^63 indices you need to keep them around.



