
How Not To Run An A/B Test - TimothyFitz
http://www.evanmiller.org/how-not-to-run-an-ab-test.html
======
btilly
This is an important thing to be aware of, but I wouldn't take the numbers
strictly at face value. Repeated peeks at the same running experiment are _not_
independent of each other. Furthermore, once the underlying difference between
A and B starts asserting itself statistically, it doesn't stop. And finally, a
chance fluctuation in the opposite direction from an underlying difference has
to be much larger to reach statistical significance than one in the same
direction. These are massive complications that make the statistics very hard
to calculate.

I addressed this in my 2008 tutorial on A/B testing at OSCON. What I did was
run Monte Carlo simulations of an A/B test while continuously following the
results, with various sets of parameters and different stopping confidence
levels. In that model I peeked at every single data
different confidence levels. In that model I peeked at every single data
point. You can find the results starting at
<http://elem.com/~btilly/effective-ab-testing/#slide59>. (See
<http://meyerweb.com/eric/tools/s5/features.html#controlchart> for the
keyboard shortcuts to navigate the slides.)
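
For a feel of how badly continuous peeking inflates false positives, here is a
minimal sketch of that kind of simulation in Python with NumPy; the parameters
are invented for illustration, not taken from the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    def peeking_false_positive_rate(n_visitors=5000, n_trials=1000, z_crit=1.96):
        # A/A test: both arms share the same true rate, so any
        # "significant" difference is a false positive.
        false_positives = 0
        for _ in range(n_trials):
            a = rng.random(n_visitors) < 0.10          # conversions in arm A
            b = rng.random(n_visitors) < 0.10          # conversions in arm B
            ca, cb = np.cumsum(a), np.cumsum(b)        # running conversion counts
            n = np.arange(1, n_visitors + 1)
            p = (ca + cb) / (2.0 * n)                  # pooled rate at each peek
            se = np.sqrt(2.0 * p * (1.0 - p) / n)      # std. error of the gap
            with np.errstate(divide="ignore", invalid="ignore"):
                z = (ca - cb) / (n * se)
            if np.any(np.abs(z[100:]) > z_crit):       # peek at every point
                false_positives += 1                   # past the first 100
        return false_positives / n_trials

    print(peeking_false_positive_rate())  # far above the nominal 5%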

My advice? Wait until you have at least a certain minimum sample size before
deciding, and at that point decide only with high certainty. Then, the longer
the experiment runs, the lower the confidence you should be willing to accept.
This procedure lets you stop most tests relatively quickly while still
avoiding significant mistakes.

~~~
3pt14159
Hey btilly,

Just wanted to let you know that that slideshow changed my life. It made the
company I'm with (FreshBooks) truly shine while doing split tests, which made
me look good. Anyways, thanks dude.

~~~
btilly
You are welcome. It is always good to hear that something I worked hard on has
proven to be useful.

------
nkurz
It's a good article, and a good intro to the pitfalls of statistical
interpretation, but I think it reaches the wrong conclusion. Yes, when one has
a very limited data set and needs to draw a conclusion in a hurry, and one has
full confidence that there are no confounding variables in one's experiment,
then paying very close attention to small differences in p-values can make
sense. But how often is this the case when testing a new logo or signup page?

I'm less mathematically sophisticated than the author, and would choose a
simpler approach: ignore weak results. If one determines that there is a 95%
chance that 51% of people prefer Logo A, either stick with what you have,
go with the one you like, or keep searching for a better logo. If you can't
see the effect in the raw data without rigorous mathematical analysis, it's
probably not a change worth spending much time on.

Instead of adjusting your significance test for each 'peek', simply ignore
anything less than 99.9% 'significant'. And while you are at it, ignore
anything that's less than a 10% improvement, on the assumption that structural
errors in your testing are likely to overwhelm any effects smaller than this.
Drug trials and the front page of Google aside, if the effect is so small that
it flips into and out of 'significance' each time you peek, it's probably not
the answer you want.
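
A minimal sketch of that filter, assuming a pooled two-proportion z-test in
Python with SciPy; the 99.9% and 10% cutoffs are the ones above, the counts
are invented:

    from math import sqrt
    from scipy.stats import norm

    def worth_acting_on(conv_a, n_a, conv_b, n_b):
        # The filter: ignore anything under 99.9% "significance"
        # or under a 10% relative improvement.
        pa, pb = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        confidence = norm.cdf(abs(pb - pa) / se)   # one-sided, for simplicity
        lift = (pb - pa) / pa                      # relative improvement over A
        return confidence >= 0.999 and lift >= 0.10

    print(worth_acting_on(500, 10000, 600, 10000))  # True: big, clear effect
    print(worth_acting_on(500, 10000, 520, 10000))  # False: weak result, ignore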

------
patio11
This is important enough of a usage note that I'm probably going to mention it
in my software's documentation. I personally largely ignore this issue and
think I'm probably safe doing so with my usual testing workflow, but it _is_
an easy thing to burn yourself on if you sit and watch your dashboard all day.

~~~
jacquesm
Have you gotten around to this yet?

<http://news.ycombinator.com/item?id=1203295>

Or did I miss your posting?

~~~
patio11
Crikey, how did that fall off my list? I have a half-written analysis sitting
in my home directory. It will probably take me a few days to finish the other
half: I have a couple things ahead of it in the queue and this next
application isn't going to write itself.

~~~
jacquesm
Don't worry, I was just very curious. I know about your recent life changing
move so I figured you had your hands more than full.

------
jacquesm
When I do stuff like this I purposefully ignore the results for the time I've
set for the experiment. It's very easy to fall prey to thinking you have a
result that will not change over the longer term. Daily and weekly cycles, for
instance, can really throw off your analysis.

The only danger is having a 'hidden variable' influence your results while
averaging over the longer term masks that influence. For example, if you are
not geo-targeting your content, you could conclude after a long run of testing
that a certain page performs better than another, having averaged away the
effect of the different pages being up at different times of day, with one of
them performing significantly better for one audience and vice versa.

So you should keep all your data, in order to figure out whether such masking
is happening and giving you results that are good but could be even better.

------
paraschopra
This is an interesting issue, and I have seen users of my app (Visual Website
Optimizer) complain that their results were statistically significant a day
before but now aren't. Understandably, they expect significance to freeze in
time once it has been achieved. However, as you say, significance is itself a
random quantity, not necessarily monotonically increasing or decreasing.

The constraint here is not the math or the technology; rather, it is users'
needs. They want data, reporting, and significance calculations in real time.
And even though we have a test duration calculator, I haven't seen any user
actually make use of it. Plus, many users won't even wait for statistical
significance to be achieved.

Though we would love for VWO to wait until the end of the experiment to
calculate significance, I'm sure the users wouldn't like that at all.

------
ryanjmo
While I understand the merit of what this article is saying, I want to caution
against always requiring a strict, high confidence level when making decisions
at a start-up. Requiring a strict confidence level does make sense for a
company like Zynga, which has a nearly limitless supply of users to run tests
on, but for a start-up the value of being able to make a decision quickly
often outweighs the value of being '95% confident'. And let's not forget the
time wasted worrying about the details of all this math.

In my opinion, peek early and often, and when your gut tells you something is
true, it probably is.

It is actually a mathematical fact that if at any point in your A/B test A is
ahead of B, then based on that data there is at least a 50% probability that A
is asymptotically ahead of B.

~~~
teaspoon
This article advocates neither a high nor a low confidence level. The point is
that your confidence level is meaningless if you don't fix the sample size in
advance.

"...when your gut tells you something is true, it probably is."

If you run a data-driven business with a philosophy like that, you've rewound
management science to about 1700 AD. Human "guts" aren't evolved for
evaluating UX effectiveness from sparse data.

------
carbocation
The first calculation the author sets up is a power calculation, which is a
strong start. Based on your expectations about the effect size of the
treatment (in this case, the difference between A and B) and your desired
probability of correctly identifying a difference (the power, which is 1 -
beta, where beta is the chance of missing a true effect), you can figure out
how large a sample you need to see that effect.
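
A minimal sketch of that power calculation for two proportions, assuming
Python with SciPy; the baseline rate and lift are invented for illustration:

    from math import ceil
    from scipy.stats import norm

    def n_per_arm(p_a, p_b, alpha=0.05, power=0.80):
        # Visitors per arm needed to detect p_a vs. p_b with a two-sided
        # test at the given alpha and power (power = 1 - beta).
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        variance = p_a * (1 - p_a) + p_b * (1 - p_b)
        return ceil(variance * (z_alpha + z_beta) ** 2 / (p_a - p_b) ** 2)

    # e.g. a 5% baseline conversion rate, hoping to detect a lift to 6%
    print(n_per_arm(0.05, 0.06))  # roughly 8,000 visitors per arm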

If you're going to take several peeks as you run your trial and you want to be
particularly rigorous, consider alpha spending functions. In medicine, alpha-
spending functions are often used to take early looks at trial results.
'Alpha' is what you use to determine which p-values you will consider
significant. To oversimplify a bit, early peeks (before you've reached your
full sample size) get very extreme alphas. If your trial ultimately uses an
alpha of 0.05, a prespecified early look may use an alpha of 0.001. (There are
ways of calculating meaningful alpha values; these are just examples drawn
from a hat.)

By setting useful alphas and betas, you can benefit from true, potent
treatment effects (if present) earlier than you might otherwise, without too
much risk of identifying spurious associations.
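
A minimal sketch of one common choice, the Lan-DeMets O'Brien-Fleming-type
spending function, assuming Python with SciPy; turning spent alpha into exact
per-look boundaries requires the joint distribution of the interim test
statistics, which dedicated packages handle:

    from math import sqrt
    from scipy.stats import norm

    def obrien_fleming_spent(t, alpha=0.05):
        # Cumulative alpha "spent" by information fraction t (0 < t <= 1),
        # Lan-DeMets O'Brien-Fleming-type spending function.
        return 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / sqrt(t)))

    for t in (0.25, 0.50, 0.75, 1.00):
        print(f"t = {t:.2f}: alpha spent = {obrien_fleming_spent(t):.5f}")
    # early looks get only a tiny slice of the overall 0.05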

~~~
paraschopra
Great point. Do you have any recommendation for a paper on alpha spending
functions? This looks like an interesting way to compensate for early peeking
at significance.

~~~
carbocation
Argh, having a hard time finding the paper I was thinking of. I thought it was
in JAMA in 2009 but perhaps not. It had a lot of this information nicely
graphed, but alas. I'll keep digging around and reply again if I find it.

------
shalmanese
Why are we still using p < 0.05 for web A/B testing? p < 0.05 made sense when
each individual data point cost real money to generate: grad students
interviewing participants or geologists making individual measurements.
p < 0.05 was a good tradeoff between certainty and cost.

Now, in the world of the web, where measurement has an upfront cost but zero
incremental cost, why not move to p < 0.001 or p < 0.0001? Sure, you need to
increase the amount of data you're gathering by a factor of 2 or 3, but that's
so much easier than delving into the epistemological complexities of p < 0.05.
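
A quick check of that factor, assuming a two-sided z-test at 50% power (at
higher power the multiplier shrinks somewhat): required sample size scales
with the square of the critical z value.

    from scipy.stats import norm

    z_05 = norm.ppf(1 - 0.05 / 2)  # 1.96
    for alpha in (0.001, 0.0001):
        z = norm.ppf(1 - alpha / 2)
        print(alpha, round((z / z_05) ** 2, 1))
    # p < 0.001 needs ~2.8x the data; p < 0.0001 needs ~3.9x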

------
harisenbon
While interesting, I think this is more of a mathematical proof of something
that people doing any sort of testing should remember:

Don't stop before the test is complete, just because you've gotten _an_
answer.

I generally leave my A/B tests up well after I've gotten a significance
report, mostly because I'm lazy but also because I know that given enough time
and enough entries, the significance reports can change.

Especially in the multivariate tests that Evan wrote about, one result coming
up as significant doesn't preclude other possibilities from also being
significant.

------
marciovm123
I had a statistics course at MIT where the professor would bring up one of the
solutions to this problem at least once a week for the entire semester:

<http://en.wikipedia.org/wiki/Bonferroni_correction>

~~~
btilly
That would be a solution for using pairwise statistics to come up with an
answer in an A/B/C/D test. It is not a solution to the challenge of
evaluating an A/B test at multiple points in time.
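
For concreteness, a minimal sketch in Python of what Bonferroni does correct
for: it splits alpha across simultaneous pairwise comparisons, not across
repeated looks over time.

    from itertools import combinations

    variants = ["A", "B", "C", "D"]
    pairs = list(combinations(variants, 2))  # the 6 pairwise comparisons
    per_test_alpha = 0.05 / len(pairs)       # Bonferroni: split alpha across them

    print(len(pairs), per_test_alpha)        # 6 tests at alpha ~= 0.0083 each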

------
khafra
Interesting that frequentist A/B software packages let you essentially break
the test without telling you. Are there Bayesian A/B testers that give you a
likelihood ratio instead?
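
There are tools in this vein. A minimal sketch of the usual Beta-Binomial
approach in Python with NumPy, which reports a posterior probability that B
beats A rather than a likelihood ratio; the counts and the uniform Beta(1, 1)
priors are invented for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000):
        # Posterior P(rate_B > rate_A) under independent uniform
        # Beta(1, 1) priors on each arm's conversion rate.
        samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, draws)
        samples_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, draws)
        return (samples_b > samples_a).mean()

    print(prob_b_beats_a(500, 10000, 560, 10000))  # roughly 0.97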

