"Let’s imagine we perform 100 tests on a website and, by running each test for 2 months, we have a large enough sample to achieve 80% power. 10 out of our 100 variants will be truly effective and we expect to detect 80%, or 8, of these true effects.
If we use a p-value cutoff of 5% we also expect to see 5 false positives. So, on average, we will see 8+5 = 13 winning results from 100 A/B tests."
If we expect 10 truly effective tests and 5 false positives, we'd have 15 tests that reject the null hypothesis h_0 = h_test. Taking power into account, shouldn't we see 15 × 0.8 = 12 winning results? That is, wouldn't one of the false positives also be lost to insufficient power?
Full disclosure: I work for Qubit, which published this white paper.
The confusion may be between a test's "true" effect and its "observed" effect. If a variant has a true effect, you have some chance of observing it, and that chance is the power. Power therefore applies only to the 10 truly effective variants, which gives 10 × 0.8 = 8 detected effects.
False positives, on the other hand, have by definition already been observed as winners (that is what makes them false positives), so there is no need to apply the factor of 0.8 to them.
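To make the bookkeeping concrete, here is a minimal sketch of the expected counts. The function name `expected_winners` and its parameters are made up for illustration and do not come from the white paper. One assumption differs slightly from the quoted passage: the sketch applies the 5% cutoff only to the 90 variants with no true effect, which gives 4.5 expected false positives where the white paper uses the round figure of 5.

```python
def expected_winners(n_tests, n_true, power, alpha):
    """Expected number of significant results, split by where they come from.

    Hypothetical helper for illustration only.
    """
    n_null = n_tests - n_true          # variants with no true effect
    true_positives = power * n_true    # power applies only to real effects
    false_positives = alpha * n_null   # alpha applies only to null variants
    return true_positives, false_positives


tp, fp = expected_winners(n_tests=100, n_true=10, power=0.8, alpha=0.05)
print(f"true positives:  {tp:.1f}")
print(f"false positives: {fp:.1f}")
print(f"total winners:   {tp + fp:.1f}")
```

The power factor multiplies only the truly effective variants and the p-value cutoff multiplies only the ineffective ones. Neither factor is applied to the sum of the two, which is why 15 × 0.8 is not the right calculation.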