> It needed to handle extreme load (hundreds of millions of participants in some cases).
I can see extreme load being valuable for an A/B test of a pipeline change or something that genuinely needs that load... but for the kinds of A/B testing UX and marketing do, relying on statistical significance seems like the smarter move. Beyond a certain point, a larger sample is only trivially more accurate than a smaller one.
Even if you're only testing 1% of 5 million visitors, you still need to handle the load for all 5 million visitors. Most of the heavy experiments came from AI-driven assignments (as opposed to behavioral ones). In those cases the AI would generate very fine-grained buckets and assign users into them as needed.
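The comment doesn't say how that assignment worked, but a common stateless approach is to hash the user and experiment IDs into a bucket, so no assignment table needs to be stored and the same user always lands in the same bucket. A minimal sketch (the function and parameter names here are illustrative, not from the system described):

```python
import hashlib

def assign_bucket(user_id: str, experiment_id: str, n_buckets: int) -> int:
    """Deterministically map a user to one of n fine-grained buckets.

    Hashing (experiment, user) means every request for the same user in
    the same experiment yields the same bucket, with no stored state.
    """
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).digest()
    # Take 8 bytes of the digest as an integer and reduce modulo n_buckets.
    return int.from_bytes(digest[:8], "big") % n_buckets
```

Because the assignment is pure computation, it scales to "all 5 million visitors" with no per-user storage or coordination.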
https://en.wikipedia.org/wiki/Sample_size_determination
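The diminishing-returns point can be made concrete with a standard two-proportion sample-size calculation, as in the linked article. A sketch using only the standard library (the baseline and target rates are made-up example numbers):

```python
from statistics import NormalDist

def sample_size_per_arm(p1: float, p2: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Participants needed per arm to detect p1 -> p2 with a
    two-proportion z-test at the given significance and power."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(num / (p1 - p2) ** 2) + 1

# Detecting a lift from a 10% to an 11% conversion rate needs
# roughly 15k users per arm, not hundreds of millions.
n = sample_size_per_arm(0.10, 0.11)
```

Note the quadratic behavior: halving the detectable effect roughly quadruples the required sample, which is why sample size matters for small lifts, while past the required n extra traffic buys almost nothing.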