
How Not To Run An A/B Test - llambda
http://www.evanmiller.org/how-not-to-run-an-ab-test.html
======
rfrey
Related but different, for people who haven't done a lot of stats: the chance
of a spurious "significant" result goes up with the number of hypotheses you
test. If you've got 20 scenarios and are looking for 5% significance, you
should expect about one of them to come out significant purely by chance.

You can correct for the "multiple hypotheses" problem by requiring a per-test
significance of (1 - (0.95)^(1/n)), substituting (1 minus whatever overall
significance you want) for the 0.95 and using n = number of hypotheses.
<http://en.wikipedia.org/wiki/Bonferroni_correction>

The table for that correction is equally frightening:

    
    
      #hypoth   req.sig
      =================
         1      0.05
         2      0.025
         3      0.017
         5      0.01
        10      0.005
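
For anyone who wants to reproduce those numbers (to a little more precision),
here's a minimal sketch in plain Python, assuming the overall significance you
want is 0.05:

    # Per-test thresholds under the 1 - (1 - alpha)^(1/n) rule
    alpha = 0.05  # overall significance level across all hypotheses
    for n in (1, 2, 3, 5, 10):
        print(n, round(1 - (1 - alpha) ** (1.0 / n), 4))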

~~~
laughinghan
Is it just me, or does that table look like

    
    
        req.sig = (intended significance)/(#hypoth)
    

? Which is kind of a simpler formula.

Edit: Following your link, the table you listed is indeed the Bonferroni
correction, and the formula is as I stated. The formula you stated is actually
the Sidak correction, which "is often confused with the Bonferroni
correction", according to your link.

~~~
timr
The Dunn-Sidak correction is generally preferred over the Bonferroni
correction -- it's less conservative, and it can find significance in
borderline situations that Bonferroni would miss.
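
To put numbers on the difference, a quick sketch comparing the two per-test
thresholds (plain Python; at an overall alpha of 0.05 the gap is small, but
Sidak is always the slightly more permissive of the two):

    # Bonferroni: alpha / n   vs.   Dunn-Sidak: 1 - (1 - alpha)^(1/n)
    alpha = 0.05
    for n in (2, 5, 10, 20):
        bonferroni = alpha / n
        sidak = 1 - (1 - alpha) ** (1.0 / n)
        print(n, round(bonferroni, 5), round(sidak, 5))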

Another thing: depending on how your problem is structured, it might be a bit
confusing to think of these as corrections for the number of hypotheses. I
like to think of them in terms of the number of planned/unplanned comparisons
that are being performed: you can do an experiment with a single stated
hypothesis, yet still need to use these corrections if you perform "unplanned"
comparisons using the data (a.k.a. "data mining", or "a data fishing
expedition") later on.

This is a really great summary of planned vs. unplanned comparisons and why
they matter:

<http://udel.edu/~mcdonald/statanovaplanned.html>

------
edash
The impact of this article rests with this sentence:

"Try 26.1% – more than five times what you probably thought the significance
level was."

That is, if you peek after every observation and stop as soon as you reach 5%
significance, there's actually about a 26% chance of declaring a winner even
when there's no real difference between the options. But that doesn't mean
there's a 26% chance the other option is significantly better -- just that
there's a 26% chance neither option is actually better.

And for most startups, I think that's a fine compromise.

Sometimes I'll launch a new design and test just to make sure it's not
terribly worse. If it reaches statistical significance (even if I "peek") then
I'm cool with the new design and will make the switch.

And I'll continue to test and tweak the new design immediately after finishing
the previous test. The time saved from my lazy statistics means we can move
much more quickly.

If we had thousands of "conversions" a day, then it would make sense to be
deliberate with our testing methods. But we don't; we have tens of conversions
per day. And we can improve much faster using half-assed split-tests and
intuition.
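
For anyone curious where a number like that comes from, here's a rough
simulation of the "peek after every visitor and stop at 5%" rule on an A/A
test (plain Python; the 10% conversion rate and 1,000 visitors per arm are
just assumptions for illustration, and the exact inflation depends on both):

    # False-positive rate when both variants are identical but you stop
    # the moment a two-sided z-test on proportions dips below p = 0.05.
    import random, math

    def peeking_false_positive(trials=2000, max_n=1000, rate=0.10, alpha=0.05):
        hits = 0
        for _ in range(trials):
            n = conv_a = conv_b = 0
            for _ in range(max_n):
                n += 1
                conv_a += random.random() < rate
                conv_b += random.random() < rate
                pooled = (conv_a + conv_b) / (2 * n)
                if pooled in (0.0, 1.0):
                    continue  # z-test undefined this early; keep collecting
                se = math.sqrt(pooled * (1 - pooled) * (2.0 / n))
                z = abs(conv_a - conv_b) / n / se
                p_value = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
                if p_value < alpha:
                    hits += 1  # declared a "winner" even though A == B
                    break
        return hits / trials

    print(peeking_false_positive())  # lands well above the nominal 0.05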

~~~
extension
There's no need to half-ass the test; you should be able to get the actual
significance at any point in the experiment. The software just has to
correctly calculate the conditional probability of significance.

------
jplewicke
Previously discussed at <http://news.ycombinator.com/item?id=1277004>, with
some good commentary on simulations of various stopping rules.

------
richardburton
What a fantastic article. Thank you. I thought this quote really summed it up:

 _If you write A/B testing software: Don’t report significance levels until an
experiment is over, and stop using significance levels to decide whether an
experiment should stop or continue._

------
pilom
So how do you do Bayesian experiment design? It's presented as "the way
forward", but I have no idea what it is or how to do it.

~~~
equark
There are two parts: 1) what is an appropriate statistical model for A/B
testing and 2) how should we make decisions based on our current beliefs (the
Bayesian posterior).

A sensible starting point for the first is a hierarchical beta-binomial model.
For instance:

<http://www.stat.cmu.edu/~brian/724/week06/lec15-mcmc2.pdf>

Translating that example, the binomial variable represents the number of
conversions given the total number of exposures. So if you show a red button
100 times and 10 people convert, then, using the notation in that PDF, n_i =
100 and y_i=10. We are interested in p(\theta_i|y_i, n_i), the posterior
distribution of the conversion rate for experiment variation i (red, blue,
green) given our data.
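
For a single variation with a fixed Beta prior (dropping the hierarchy for a
moment), the posterior has a closed form, so the red-button example works out
like this (the flat Beta(1, 1) prior here is just a placeholder assumption):

    # Posterior for one variation: Beta(alpha + y, beta + n - y)
    from scipy.stats import beta

    alpha_prior, beta_prior = 1, 1   # flat prior, assumed for illustration
    n, y = 100, 10                   # 100 exposures, 10 conversions (n_i, y_i)

    posterior = beta(alpha_prior + y, beta_prior + n - y)
    print(posterior.mean())          # ~0.108, the current best guess for theta_i
    print(posterior.interval(0.95))  # a 95% credible interval for theta_i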

The hierarchical part of the model is what's Bayesian. Here we use a Beta
prior, since \theta_i is between 0 and 1. This prior shrinks each estimate
towards the overall conversion rate, based on how much variation there is
between experiments -- that's what the \alpha and \beta parameters control.
You can think of \alpha and \beta as pseudo-observations -- the number of
conversions and failures you've "seen" a priori. Given that we have multiple
experiment variations, we actually have a sense for the distribution of the
\theta_i, and we can therefore estimate \alpha and \beta themselves by adding
a third layer, p(\alpha, \beta).

There are many ways to make a richer model, but if you haven't seen Bayesian
modeling before that's probably enough.

The beauty of the Bayesian approach is that the posterior is exactly what you
want -- your belief about the conversion rate given the data you observe, the
model you assume, and your prior beliefs. As you add data, your posterior
beliefs update, but at every point in time the posterior represents your
current best guess.

It solves the multiple comparison problem via shrinkage rather than by
adjusting p-values. This is intuitive. If you see an outlier and you don't
have much data yet, then it's probably just a random fluctuation and your
prior shrinks your best guess towards what you think conversion rates should
be overall. For instance, if you believe conversion rates are typically .05
and never .2, then if you see something like .2 after just a few observations,
you'll probably guess the true \theta_i is more like .08.
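
As a toy version of that last point: with a prior that says "conversion rates
around here are usually about .05" (a Beta(2, 38) prior, i.e. 40
pseudo-observations, chosen here purely for illustration), observing 2
conversions in 10 exposures only moves the estimate to about .08:

    # Shrinkage: a reasonably strong prior pulls a noisy early estimate back
    from scipy.stats import beta

    a0, b0 = 2, 38          # prior mean 0.05, worth 40 pseudo-observations
    n, y = 10, 2            # raw observed rate is 0.2

    posterior = beta(a0 + y, b0 + n - y)
    print(posterior.mean())  # 0.08 -- shrunk well below the raw 0.2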

The second part of the problem, optimal sequential decision-making, is
trickier. It's a bandit problem, where there's a tradeoff between exploration
and exploitation. As far as I'm aware, this is still considered a very hard
problem to solve optimally in all but the simplest cases. In practice you
could probably get close to the optimal answer via forward simulation. There's
a lot written on Bayesian bandit problems.
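
One simple and widely used heuristic from that literature (not the provably
optimal policy, just a practical one) is Thompson sampling: for each visitor,
sample a conversion rate from each variation's current posterior and show
whichever variation sampled highest. A bare-bones sketch:

    # Thompson sampling over Beta posteriors for two variations
    import random

    variations = {"red": [1, 1], "blue": [1, 1]}  # [alpha, beta], flat priors

    def choose():
        # Draw a plausible conversion rate per variation; serve the best draw.
        draws = {name: random.betavariate(a, b)
                 for name, (a, b) in variations.items()}
        return max(draws, key=draws.get)

    def record(name, converted):
        variations[name][0] += converted       # conversions
        variations[name][1] += 1 - converted   # non-conversions

    # e.g. shown = choose(); ...; record(shown, 1 if user_converted else 0)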

An approximate solution to a very similar problem is proposed here:

[http://www.mit.edu/~hauser/Papers/Hauser_Urban_Liberali_Brau...](http://www.mit.edu/~hauser/Papers/Hauser_Urban_Liberali_Braun_Website_Morphing_May_2008.pdf)

Once you see the logic of this approach, it's really shocking that A/B testing
companies have not implemented it. It's really the only way to think about
optimal decision making under uncertainty.

------
shalmanese
One way around the proposed problem is to become much more educated about
statistics; another way is just to bump your threshold of statistical
significance up to 99.9%.

There's nothing magic about 95%; it was a convenient heuristic for science
and that's all. With the vast number of data points that a high-traffic
website will generate, reaching p < 0.001 shouldn't be too difficult, and a
significance threshold of 99.9% will erase a lot of other statistical sins.
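
As a rough sanity check on "not too difficult", here's a standard
two-proportion sample-size approximation (the 5% vs. 6% conversion rates and
80% power are assumptions picked just for illustration):

    # Approximate visitors per arm needed to detect a lift at two-sided alpha
    from scipy.stats import norm

    def n_per_arm(p1, p2, alpha, power=0.8):
        z_a = norm.ppf(1 - alpha / 2)
        z_b = norm.ppf(power)
        return (z_a + z_b) ** 2 * (p1*(1-p1) + p2*(1-p2)) / (p1 - p2) ** 2

    print(round(n_per_arm(0.05, 0.06, alpha=0.05)))   # roughly 8,000 per arm
    print(round(n_per_arm(0.05, 0.06, alpha=0.001)))  # roughly 18,000 per arm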

~~~
Samuel_Michon
_"One way around the proposed problem is to become much more educated about
statistics"_

Sounds like a plan. I feel dumb after reading that article.

Any ideas on where to start? I found some videos at Khan Academy [0]; will
those help me grasp the concepts discussed in this thread?

[0] <http://www.khanacademy.org/video/statistics--the-average?playlist=Statistics>

------
quanticle
This can't be emphasized enough. For an experiment to be statistically valid,
you have to run the experiment. Not part of the experiment. Not most of the
experiment. The whole experiment.

~~~
equark
The problem is that this advice completely ignores the motivation for
experimenting: optimal decision making.

If you run a test that ends inconclusively, should you really just throw up
your hands? And if you run a test that's conclusive early, should you really
forgo all the profit that could be gained from immediately exploiting that
knowledge?

~~~
viraptor
An inconclusive result is still a kind of result. You can test comparable
designs all you want and keep getting an inconclusive result for a long time.
That just means there's no big difference between them, and that's that.

If you get a significant result at 0.0005, then it's up to you -- you might as
well stop. There's even a table in the article saying what significance level
is appropriate after "correction".

