Related but different, for people who haven't done a lot of stats: your significance goes down with the number of hypotheses. If you've got 20 scenarios and are looking for 5% significance, on average one of them will come out significant purely by chance.
You can correct for the "multiple hypotheses" problem by using a significance equal to (1 - (0.95)^(1/n)), substituting one minus whatever significance level you want for the 0.95 and using n = number of hypotheses. http://en.wikipedia.org/wiki/Bonferroni_correction
The table for that correction is equally frightening:
Edit: Following your link, the table you listed is indeed the Bonferroni correction, and the formula is as I stated. The formula you stated is actually the Sidak correction, which "is often confused with the Bonferroni correction", according to your link.
The Dunn-Sidak correction is preferred over the Bonferroni correction -- it's slightly less conservative, so it can find significance in borderline cases that Bonferroni would miss.
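If it helps to see the numbers, here's a quick sketch of both corrections side by side (plain Python; the 20-tests-at-5% setup is just the example from the comment above, and the uncorrected figure assumes the tests are independent):

    alpha = 0.05   # family-wise significance level you actually want
    n = 20         # number of hypotheses / comparisons

    # Chance of at least one false positive if you apply no correction at all:
    uncorrected_fwer = 1 - (1 - alpha) ** n           # ~0.64 for n = 20

    # Bonferroni: just divide the threshold by the number of tests.
    bonferroni = alpha / n                            # 0.0025

    # Dunn-Sidak: the formula quoted above; exact when the tests are independent.
    sidak = 1 - (1 - alpha) ** (1 / n)                # ~0.00256, a hair less strict

    print(uncorrected_fwer, bonferroni, sidak)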
Another thing: depending on how your problem is structured, it can be a bit confusing to think of these as corrections for the number of hypotheses. I prefer to think of them in terms of the number of planned/unplanned comparisons being performed: you can run an experiment with a single stated hypothesis and still need these corrections if you later perform "unplanned" comparisons on the data (a.k.a. "data mining", or "a data fishing expedition").
This is a really great summary of planned vs. unplanned comparisons and why they matter:
The Bonferroni correction is an extremely conservative correction that loses significance very quickly. Depending on the relationships between the hypotheses and the relative magnitudes of the significance levels, much better methods are available, such as bootstrap step-downs.
But in general, it's good to make people aware of how quickly multiple comparisons accumulate and how thoroughly they invalidate the nominal significance levels.
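For anyone curious what a step-down procedure looks like, here's a sketch of the Holm step-down -- the simple, non-bootstrap relative of what the parent mentions. The p-values are made up, just to show the mechanics:

    # Holm step-down: sort the p-values, compare the smallest against
    # alpha/m, the next against alpha/(m-1), and so on, stopping at the
    # first failure. Controls the family-wise error rate but is uniformly
    # less conservative than plain Bonferroni.

    def holm_stepdown(p_values, alpha=0.05):
        m = len(p_values)
        order = sorted(range(m), key=lambda i: p_values[i])
        rejected = [False] * m
        for rank, i in enumerate(order):
            if p_values[i] <= alpha / (m - rank):
                rejected[i] = True
            else:
                break   # every larger p-value fails as well
        return rejected

    print(holm_stepdown([0.001, 0.02, 0.04, 0.30]))   # [True, False, False, False]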
The impact of this article rests on this sentence:
"Try 26.1% – more than five times what you probably thought the significance level was."
That is, if you peek after every observation and stop as soon as you reach 5% significance, there's actually about a 26% chance you'll declare a winner when the result isn't really significant. But that doesn't mean there's a 26% chance you picked the worse option; it just means there's a 26% chance neither variation is actually better than the other.
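If you want to convince yourself of that number, it's easy to simulate: run two identical variations, peek with a z-test after every batch of visitors, and stop the moment it looks significant. The exact rate depends on how often you peek and for how long, so don't expect exactly 26.1%, but it lands far above 5%. Rough sketch:

    import numpy as np

    def peeking_false_positive_rate(p=0.05, peeks=200, batch=50, runs=2000, seed=0):
        # Two arms with the SAME true conversion rate p; any "significant"
        # result is a false positive by construction.
        rng = np.random.default_rng(seed)
        false_positives = 0
        for _ in range(runs):
            conv_a = conv_b = n = 0
            for _ in range(peeks):
                conv_a += rng.binomial(batch, p)
                conv_b += rng.binomial(batch, p)
                n += batch
                # Two-proportion z-test, normal approximation, two-sided 5%.
                pooled = (conv_a + conv_b) / (2 * n)
                se = np.sqrt(2 * pooled * (1 - pooled) / n)
                if se > 0 and abs(conv_a - conv_b) / n > 1.96 * se:
                    false_positives += 1
                    break   # stop as soon as it "reaches significance"
        return false_positives / runs

    print(peeking_false_positive_rate())   # far above the nominal 0.05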
And for most startups, I think that's a fine compromise.
Sometimes I'll launch a new design and test just to make sure it's not terribly worse. If it reaches statistical significance (even if I "peek") then I'm cool with the new design and will make the switch.
And I'll continue to test and tweak the new design immediately after finishing the previous test. The time saved from my lazy statistics means we can move much more quickly.
If we had thousands of "conversions" a day, then it would make sense to be deliberate with our testing methods. But we don't; we have tens of conversions per day. And we can improve much faster using half-assed split-tests and intuition.
There's no need to half-ass the test; you should be able to get the actual significance at any point in the experiment. The software just has to account for the repeated peeks when it calculates significance.
What a fantastic article. Thank you. I thought this quote really summed it up:
If you write A/B testing software: Don’t report significance levels until an experiment is over, and stop using significance levels to decide whether an experiment should stop or continue.
There are two parts: 1) what is an appropriate statistical model for A/B testing and 2) how should we make decisions based on our current beliefs (the Bayesian posterior).
A sensible starting point for the first is a hierarchical beta-binomial model. For instance:
Translating that example, the binomial variable represents the number of conversions given the total number of exposures. So if you show a red button 100 times and 10 people convert, then, using the notation in that PDF, n_i = 100 and y_i=10. We are interested in p(\theta_i|y_i, n_i), the posterior distribution of the conversion rate for experiment variation i (red, blue, green) given our data.
The hierarchical prior is the Bayesian machinery here. We use a Beta prior, since \theta_i lives between 0 and 1. The prior shrinks each estimate towards the overall conversion rate, based on how much variation there is between the experiment variations -- that's what the \alpha and \beta parameters control. You can think of \alpha and \beta as pseudo-observations -- the number of conversions and failures you've "seen" a priori. And because we have multiple variations, we actually have a sense of how the \theta_i are distributed, so we can estimate \alpha and \beta themselves by adding a third layer, p(\alpha, \beta).
There are many ways to make a richer model, but if you haven't seen Bayesian modeling before that's probably enough.
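If you haven't seen the conjugate update before, here's about the smallest possible sketch -- the non-hierarchical version with a fixed Beta prior (the step before you estimate \alpha and \beta from the data). The prior strength and the red/blue counts are made up for illustration:

    from scipy.stats import beta

    a0, b0 = 2, 38   # Beta prior: mean 0.05, worth 40 pseudo-observations

    variations = {
        "red":  {"conversions": 10, "exposures": 100},   # made-up numbers
        "blue": {"conversions":  4, "exposures": 100},
    }

    # With a Beta prior and binomial data, the posterior is again a Beta,
    # so the update is just adding counts.
    for name, d in variations.items():
        y, n = d["conversions"], d["exposures"]
        posterior = beta(a0 + y, b0 + n - y)             # p(theta_i | y_i, n_i)
        print(name, posterior.mean(), posterior.interval(0.95))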
The beauty of the Bayesian approach is the posterior is what you want -- your belief about conversion given the data you observe, the model you assume, and your prior beliefs. As you add data, your posterior beliefs update, but at every point in time it always represents your best guess.
It solves the multiple comparison problem via shrinkage rather than by adjusting p-values. This is intuitive. If you see an outlier and you don't have much data yet, then it's probably just a random fluctuation and your prior shrinks your best guess towards what you think conversion rates should be overall. For instance, if you believe conversion rates are typically .05 and never .2, then if you see something like .2 after just a few observations, you'll probably guess the true \theta_i is more like .08.
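That shrinkage is just a weighted average of the prior mean and the observed rate, so you can put numbers on it directly. With an illustrative prior worth about 40 pseudo-observations centered at .05:

    a0, b0 = 2, 38       # prior mean 0.05, 40 pseudo-observations
    y, n = 2, 10         # observed: 2 conversions in 10 exposures (raw rate 0.2)

    posterior_mean = (a0 + y) / (a0 + b0 + n)
    print(posterior_mean)   # 4 / 50 = 0.08 -- pulled most of the way back towards 0.05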
The second part of the problem, optimal sequential decision-making, is trickier. It's a bandit problem, with a tradeoff between exploration and exploitation. As far as I'm aware, this is still considered very hard to solve optimally in all but the simplest cases. Practically, you could probably get close to the optimal answer via forward simulation. There's a lot written on Bayesian bandit problems.
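One popular approximation -- not necessarily what the paper linked below proposes, just a common practical choice -- is Thompson sampling: each time a visitor arrives, draw a conversion rate from every variation's posterior and serve whichever draw comes out highest. A self-contained sketch with made-up rates and a made-up prior:

    import numpy as np

    rng = np.random.default_rng(0)
    true_rates = {"red": 0.05, "blue": 0.06}       # unknown in real life, of course
    counts = {k: {"conv": 0, "seen": 0} for k in true_rates}
    a0, b0 = 1, 19                                 # weak Beta prior, mean 0.05

    for _ in range(10_000):                        # one simulated visitor per loop
        # Draw a plausible conversion rate from each arm's current posterior.
        draws = {k: rng.beta(a0 + c["conv"], b0 + c["seen"] - c["conv"])
                 for k, c in counts.items()}
        arm = max(draws, key=draws.get)            # serve the best-looking draw
        converted = rng.random() < true_rates[arm] # simulate the visitor
        counts[arm]["conv"] += converted
        counts[arm]["seen"] += 1

    print(counts)   # "blue" should end up getting most of the traffic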
An approximate solution to a very similar problem is proposed here:
Once you see the logic of this approach, it's really shocking that A/B testing companies have not implemented it. It's really the only way to think about optimal decision making under uncertainty.
One way around the problem the article describes is to become much more educated about statistics; another is just to bump your threshold of statistical significance up to 99.9%.
There's nothing magic about 95%; it was a convenient heuristic for science and that's all. With the vast number of data points a high-traffic website generates, reaching p < 0.001 shouldn't be too difficult, and a 99.9% significance threshold will erase a lot of other statistical sins.
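Rough back-of-the-envelope for that claim, treating the extra looks/variations as independent tests (which overstates the damage, since repeated peeks at the same data are highly correlated):

    for alpha, tests in [(0.05, 50), (0.001, 50)]:
        family_wise = 1 - (1 - alpha) ** tests
        print(f"threshold {alpha}: chance of >=1 false positive over {tests} tests = {family_wise:.3f}")

    # threshold 0.05:  ~0.923
    # threshold 0.001: ~0.049 -- roughly back to the 5% people thought they had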
This can't be emphasized enough. For an experiment to be statistically valid, you have to run the experiment. Not part of the experiment. Not most of the experiment. The whole experiment.
The problem is that this advice completely ignores the motivation for experimenting: optimal decision making.
If you run a test that ends inconclusively should you really just throw up your hands? And if you run a test that's quickly conclusive, should you really avoid all the profit that could be gained from immediately exploiting this knowledge?
An inconclusive result is a kind of result. You can test comparable designs all you want and get an inconclusive result for a long time; that means there's no big difference, and that's that.
If you get a significant result at 0.0005, then it's up to you -- you might as well stop. There's even a table in the article saying what significance is appropriate after "correction".