In large-scale ML training, shuffling datasets with over 1 billion data points (e.g., LAION-5B) is common. Typical approaches, such as `np.random.shuffle` on an index array, `df.sample(frac=1)` on the entire dataset, or an online Fisher-Yates shuffle, have significant drawbacks:
1. `np.random.shuffle` is genuinely slow once you have > 1B data points. It eats into the precious time between launching a job and answering "can I enter my training loop and see if things work?"
2. They all take O(n) extra memory to hold the shuffled indices (see the baseline sketch after this list).
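For reference, the baseline pattern being criticized looks something like this (a minimal sketch; the scale is illustrative):

```python
import numpy as np

n = 1_000_000_000                   # ~1B examples, LAION-5B scale
idx = np.arange(n, dtype=np.int64)  # 8 GB just to hold the indices
np.random.shuffle(idx)              # O(n) work before training can start
# idx[i] is the i-th example to visit this epoch; reshuffle every epoch.
```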
I want to show a cool trick that should be more popular among large-scale ML practitioners: use a cheap symmetric cipher (a "pseudo-random permutation") to generate a permutation of [0, 1, 2, ..., n - 1] on the fly. This gives you a permutation of [0, n) in O(1) space, with a seed of your choice: no shuffled copy of the index array is ever stored, any position of the permutation can be computed in O(1) time, and the permutation is available instantly at any data scale. It behaves like `np.random.shuffle`, minus the wait.
The only tradeoff is a slight degradation in shuffling quality, but this is typically irrelevant for ML training purposes.
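To make the idea concrete, here is a minimal pure-Python sketch of the technique (this is *not* `smallperm`'s implementation, and the names `_round_fn`, `_feistel`, and `permute` are mine): a toy 4-round Feistel network gives a keyed bijection on a power-of-two domain, and cycle walking restricts it to [0, n).

```python
import hashlib


def _round_fn(half: int, key: int, rnd: int, bits: int) -> int:
    # Toy round function: hash (key, round, input half) down to `bits` bits.
    digest = hashlib.blake2b(f"{key}:{rnd}:{half}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def _feistel(x: int, key: int, half_bits: int, rounds: int = 4) -> int:
    # Balanced Feistel network: a keyed bijection on [0, 2 ** (2 * half_bits)),
    # invertible regardless of how weak the round function is.
    mask = (1 << half_bits) - 1
    left, right = x >> half_bits, x & mask
    for rnd in range(rounds):
        left, right = right, left ^ _round_fn(right, key, rnd, half_bits)
    return (left << half_bits) | right


def permute(i: int, n: int, key: int) -> int:
    # Element i of a seeded pseudo-random permutation of [0, n), in O(1) space.
    assert 0 <= i < n
    # Smallest even bit-width whose Feistel domain covers [0, n).
    total_bits = max(2, (n - 1).bit_length())
    total_bits += total_bits % 2
    half_bits = total_bits // 2
    # Cycle walking: re-encrypt until the value lands back inside [0, n).
    x = _feistel(i, key, half_bits)
    while x >= n:
        x = _feistel(x, key, half_bits)
    return x


if __name__ == "__main__":
    n, key = 10, 42
    perm = [permute(i, n, key) for i in range(n)]
    assert sorted(perm) == list(range(n))  # it really is a permutation
    print(perm)
```

A real implementation replaces the hash-based round function with something much cheaper; the structure (Feistel + cycle walking) is the same.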
Using a Feistel cipher for fast shuffling is not a new concept, and implementations exist in various languages. This Show HN submission, `smallperm`, is notable because:
1. It is a library for Python, the de facto language for ML
2. It is backed by Rust and efficiently engineered (iterating through the permutation is only 2x-3x slower than iterating through an `np.arange`, which is to say orders of magnitude faster than pure-Python implementations)
3. In testing, this library produced higher-quality shuffles than some other cipher-based libraries in statistical tests
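Usage looks roughly like the following. This is a sketch from memory: the class name, argument order, and supported operations are my assumptions, so check the project's README for the real interface.

```python
from smallperm import PseudoRandomPermutation  # assumed import path

# Assumed constructor signature: (length, seed). Setup is O(1) -- no
# 8 GB index array, no minutes-long shuffle.
perm = PseudoRandomPermutation(1_000_000_000, 0)

x = perm[123_456]  # assumed O(1) random access into the permutation
for i in perm:     # assumed iteration yields each index exactly once
    ...            # fetch example i from the dataset
```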
In the end, I am also here to shout that you do not always need `np.random.shuffle` for your billion-scale data ;). Maybe three minutes of `np.random.shuffle` on each start-up is not that long, but we can often shuffle in 0 seconds instead.