
Most Winning A/B Test Results are Illusory [pdf] - ernopp
http://www.qubitproducts.com/sites/default/files/pdf/Plain%20whitepaper%20sans.pdf
======
gkoberger
At my first webdev internship, my only job was to report to the "Head of
Analytics" (a young liberal arts guy). All I did all day was make the tweaks
he told me to do. It was stuff like "make this button red, green or blue", or
"try these three different phrasings".

We got no more than 100 hits a day, with no more than 2-3 conversions a day,
and he would run these tests for, like, 2 days.

I hated it, and the website looked horrible because everything was competing
with each other and just used whatever random color won.

~~~
CoffeeDregs
I've seen that, too. One of my clients redid their marketing site 3x in one
year, each time claiming incredible improvements. The incredible improvements
turned out to be local hill climbing, while the entire site's performance
languished... 3-4 years ago there were a ton of blog posts about how a green
button produced incredible sales when compared to a red button. And so
everyone switched to green buttons...

By contrast, I've evolved multiple websites through incremental, globally
measured, optimizations. It's a lot of fun and it requires you to really
understand your user (I've called AB testing+analytics "a conversation between
you and your users"). But, as you point out, it can be tough to get
statistically significant data on changes to a small site. That's why I
usually focused on big effects (e.g. 25%), rather than on the blog posts about
"OMG! +2.76% change in sales!". That's also why I did a lot of "historical
testing", under the assumption that week-to-week changes in normalized stats
would be swamped by my tests.

~~~
patio11
_under the assumption that week-to-week changes in normalized stats would be
swamped by my tests_

This is an enormously problematic assumption, which you can verify by either
looking at the week-to-week stats for a period prior to you joining the
company, or (for a far more fun demonstration) doing historical testing of the
brand of toothpaste you use for the next 6 weeks. Swap from Colgate to
$NAME_ANOTHER_BRAND, note the improvement, conclude that website visitors pay
an awful lot of attention to the webmaster's toothpaste choices.

------
ronaldx
I love the concept of A/A testing here, illustrating that you get apparent
results even when you compare something to itself.
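
For the curious, a toy simulation of an A/A test (traffic numbers assumed):
both arms are identical, yet a naive significance test still "finds" a winner
about 1 time in 20.

    # Both arms share the same true 5% conversion rate; at a 5% cutoff
    # the two-proportion z-test flags a "winner" ~5% of the time anyway.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p, sims, hits = 10_000, 0.05, 5_000, 0

    for _ in range(sims):
        ca, cb = rng.binomial(n, p), rng.binomial(n, p)
        ra, rb = ca / n, cb / n
        pooled = (ca + cb) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if abs(rb - ra) / se > 1.96:
            hits += 1

    print(f"apparent winners in A/A tests: {hits / sims:.1%}")  # ~5%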

I can't imagine how A/B tests are a productive use of time for any site with
fewer than a million users.

There are so many more useful things you could be doing to create value. If
you're running a startup you should rather have some confidence in your own
decisions.

~~~
Homunculiheaded
confidence in your own decisions can also be referred to as a Bayesian prior
;)

I've treated the A/B tests I've run pretty much as a case of Bayesian
parameter estimation (where the true conversion rates of A and B are your
parameters). You then get nice beta distributions you can sample from, and you
can use the prior to constrain expectations of improvement and reduce the
effects of early flukes in your sampling.
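
A minimal sketch of that setup (the prior and all counts are made up for
illustration):

    # Beta prior encoding a belief that conversion sits near 5%,
    # updated with each arm's observed conversions / visitors.
    import numpy as np
    from scipy.stats import beta

    rng = np.random.default_rng(0)
    prior_a, prior_b = 5, 95                     # Beta(5, 95): mean 5%

    data = {"A": (52, 1000), "B": (61, 1000)}    # conversions, visitors
    posteriors = {arm: beta(prior_a + c, prior_b + n - c)
                  for arm, (c, n) in data.items()}

    # Sample the posteriors to estimate P(B beats A).
    draws = {arm: d.rvs(100_000, random_state=rng)
             for arm, d in posteriors.items()}
    print("P(B > A) =", (draws["B"] > draws["A"]).mean())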

~~~
ronaldx
Sorry, but I don't understand how Bayesian statistics could possibly solve the
problems described here.

Sometimes bad scenarios will get good results, by luck, and sometimes good
scenarios will get bad results, by luck.

Using more advanced statistical methods doesn't change that these cases are
fundamentally indistinguishable.

~~~
darkxanthos
You're right. The one exception, though, is that with Bayesian statistics you
can estimate an effect size from your experiment results using a credible
interval.

If the differences are drastic enough you can still get value from split
testing. Incremental changes are probably just not going to bring you much
luck.
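
For instance, a sketch of such a credible interval (uniform priors, made-up
counts):

    # 95% credible interval on B's relative uplift over A,
    # from Beta posteriors on each arm's conversion rate.
    import numpy as np
    from scipy.stats import beta

    rng = np.random.default_rng(0)
    a = beta(1 + 52, 1 + 948).rvs(100_000, random_state=rng)  # A: 52/1000
    b = beta(1 + 61, 1 + 939).rvs(100_000, random_state=rng)  # B: 61/1000
    lo, hi = np.percentile((b - a) / a, [2.5, 97.5])
    print(f"95% credible interval for uplift: [{lo:+.0%}, {hi:+.0%}]")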

------
darkxanthos
I do this professionally as my sole job. This is one of the very few papers
I've read that seem completely legit to me. I especially love their point on
the sample sizes necessary to reach 90% power.

~~~
ep103
How do you calculate the correct sample size for a test, to achieve the
correct "power"?

~~~
gwern
For binomial scenarios like a stock A/B test, most statistical environments
have built-in power functions. R, for example:
[http://www.gwern.net/AB%20testing#power-analysis](http://www.gwern.net/AB%20testing#power-analysis)
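
A rough Python equivalent (statsmodels; the baseline rate and target uplift
here are assumed):

    # Sample size per arm to detect a 10% relative uplift on a 5%
    # baseline at alpha = 0.05 with 80% power.
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    h = proportion_effectsize(0.055, 0.05)   # Cohen's h for the two rates
    n = NormalIndPower().solve_power(effect_size=h, alpha=0.05, power=0.8)
    print(round(n))                          # roughly 31,000 visitors per arm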

------
pak
This article's title echoes a paper which continues to influence the medical
research and bioinformatics community, "Why Most Published Research Findings
Are False" by JPA Ioannidis.

[http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fj...](http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0020124)

While the OP's article targets some low-hanging fruit, like halting criteria,
multiple hypotheses, etc. which should be familiar to anyone serious about
bioinformatics and statistics, Ioannidis takes these things a little farther
and comes up with a number of corollaries that apply equally well to A/B
testing.

After all, the randomized controlled trials that the FDA uses to approve new
drugs are essentially identical to what would be called an A/B test on Hacker
News.

------
hvass
I strongly recommend using Evan Miller's free A/B testing tools to avoid those
issues!

Use them to really know if conversion rate is significantly different, whether
the mean value of two groups is significantly different and how to calculate
sample size:

[http://www.evanmiller.org/ab-testing/](http://www.evanmiller.org/ab-testing/)

~~~
napoleond
This is awesome, thanks for the link! (And the visualizations help a ton,
especially for the t-test... it's been a while since I took any stats courses
and the terminology always puts me off a bit but the graphs make sense.)

------
tristanz
Putting aside bandits and all that, it seems like the first step should be to
set up a hierarchical prior which performs shrinkage. Multiple comparison and
stopping issues are largely due to using frequentist tests rather than a
simple probabilistic model and inference that conditions on the observed
data.

Gelman et al, "Why we (usually) don't have to worry about multiple
comparisons" [http://arxiv.org/abs/0907.2478](http://arxiv.org/abs/0907.2478)

------
gabemart

      > We know that that, occasionally, a test will generate a
      > false positive due to random chance - we can’t avoid that.
      > By convention we normally fix this probability at 5%. You
      > might have heard this called the significance probability
      > or p-value.
    
      > If we use a p-value cutoff of 5% we also expect to see 5
      > false positives.
    

Am I reading this incorrectly, or is the author describing p-values
incorrectly?

A p-value is the chance a result at least as strong as the observed result
would occur if the null hypothesis is true. You can't "fix" this probability
at 5%. You can say "results with a p-value below 5% are good candidates for
further testing". The fact that p-values of 0.05 and below are often
considered significant in academia tells you nothing about the probability of
a false positive occurring in an arbitrary test.

~~~
martingoodson
Author of the paper here. You're right, this is incorrect. I corrected this in
the final copy but an earlier draft seems to have been put on the website.
There are a few other errors too. I am describing the 'significance level'
here, not the 'p-value', as you say.

~~~
nchlswu
Is the corrected final version uploaded at the same URL? I'd like to
distribute it to some colleagues.

~~~
ernopp
Just to let you know it's been updated.

------
paraschopra
The article is spot on. We at
[http://visualwebsiteoptimizer.com/](http://visualwebsiteoptimizer.com/) know
that there are some biases (particularly related to 'multiple comparisons'
and 'multiple looks at the data') that lead to results that seem better than
they actually are. Though the current results are not wrong. They are
directionally correct, and with most A/B tests, even if 95% confidence is
really a true confidence of 90% or less, the business will still do better
implementing the variation (vs. not doing anything).

Of course, these are very important issues for A/B testing vendors like us to
understand and fix, since users mostly rely on our calculations to base their
decisions. You will see us working towards taking care of such issues.

~~~
martingoodson
I'm afraid that's not quite right. A simple Python simulation will show you
that a variant with -5% (i.e. NEGATIVE) uplift will still give a positive
result around 10% of the time if you perform early stopping of the test.
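
A sketch of that simulation (batch size and peeking schedule assumed; the
exact rate depends on them):

    # Variant B has a true -5% relative uplift, yet "peeking" after
    # every batch and stopping at the first significant result still
    # declares B a winner a surprising fraction of the time.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def peeking_test(p_a=0.05, uplift=-0.05, batch=1000, peeks=100):
        p_b = p_a * (1 + uplift)
        ca = cb = n = 0
        for _ in range(peeks):
            n += batch
            ca += rng.binomial(batch, p_a)
            cb += rng.binomial(batch, p_b)
            pooled = (ca + cb) / (2 * n)
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if (cb - ca) / n / se > norm.ppf(0.975):  # stop on "B wins"
                return True
        return False

    wins = sum(peeking_test() for _ in range(2000))
    print(f"B declared winner: {wins / 2000:.1%}")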

~~~
paraschopra
No matter which method you adopt, you cannot eliminate false positives
entirely. You merely decrease / control the proportion of them.

~~~
martingoodson
To remove all doubt, your interpretation of the statistics is incorrect. In
particular this sentence is demonstrably false: "They are directionally
correct, [...] the business will still do better implementing the variation
(vs. not doing anything)."

------
moapi
Good article in general, I have a small question:

"Let’s imagine we perform 100 tests on a website and, by running each test for
2 months, we have a large enough sample to achieve 80% power. 10 out of our
100 variants will be truly effective and we expect to detect 80%, or 8, of
these true effects. If we use a p-value cutoff of 5% we also expect to see 5
false positives. So, on average, we will see 8+5 = 13 winning results from 100
A/B tests."

If we expect 10 truly effective tests and 5 false positives, we'd have 15
tests that reject the null hypothesis of h_0 = h_test. Taking power into
account, shouldn't we see 15 × 0.8 = 12 winning results? I.e., wouldn't one
of the false positives also go undetected for lack of power?

~~~
ernopp
Full disclosure: I work for Qubit who published this white paper.

Maybe the confusion here is between a test's "true" effect and its "observed"
effect. If an experiment has a true effect, then you have some chance of
observing it; that chance is the power.

But false positives have by definition already been observed as winners
(that's what false positives are), so there's no need to apply the factor of
0.8 to them.
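
For what it's worth, the arithmetic, applying the 5% false positive rate only
to the 90 null tests (the paper applies it to all 100, which gives 5 rather
than 4.5):

    n_tests, n_true, power, alpha = 100, 10, 0.80, 0.05
    true_wins  = n_true * power               # 8 true effects detected
    false_wins = (n_tests - n_true) * alpha   # 4.5, rounded up to ~5
    print(true_wins + false_wins,             # ~12.5 "winners"...
          false_wins / (true_wins + false_wins))  # ...of which ~36% are false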

------
dbroockman
The "regression to the mean" and "novelty" effect is getting at two different
things (both true, both important).

1. Underpowered tests are likely to exaggerate differences, since E(|truth -
result|) increases as the sample size shrinks (see the sketch after point 2).

2. The _much bigger problem_ I've seen a lot: when users see a new layout
they aren't accustomed to, they often respond better, but when they get used
to it, they can begin responding worse than with the old design. Two ways to
deal with this are long-term testing (let people get used to it) and testing
on new users. Or, embrace the novelty effect and just keep changing shit up
to keep users guessing - this seems to be FB's solution.
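
A sketch of point 1 (the "winner's curse"; rates and sample size assumed):
condition on significance at a small sample and the surviving results
overstate the true lift.

    # True relative lift is 10%, but among tests that reach
    # significance at n = 2,000 per arm, the observed lift comes
    # out around three times that.
    import numpy as np

    rng = np.random.default_rng(0)
    p_a, p_b, n = 0.05, 0.055, 2000
    observed = []

    for _ in range(20_000):
        ca, cb = rng.binomial(n, p_a), rng.binomial(n, p_b)
        ra, rb = ca / n, cb / n
        pooled = (ca + cb) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if (rb - ra) / se > 1.96:            # keep only significant winners
            observed.append((rb - ra) / ra)

    print(f"mean observed lift among winners: {np.mean(observed):.0%}")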

------
stevoski
Great read.

What bothers me about A/B tests is when people say, e.g., "there was a 7%
improvement" without telling us the sample size or error margin. I'd rather
hear: on a sample size of 1,000 unique visits, the improvement was 7% +/- 4%.
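
A sketch of that kind of reporting (normal-approximation margin, with the
baseline treated as fixed for simplicity; the counts are made up):

    import math

    def lift_with_margin(conv_a, n_a, conv_b, n_b, z=1.96):
        ra, rb = conv_a / n_a, conv_b / n_b
        se = math.sqrt(ra * (1 - ra) / n_a + rb * (1 - rb) / n_b)
        return (rb - ra) / ra, z * se / ra   # relative lift, margin

    lift, margin = lift_with_margin(50, 1000, 54, 1000)
    print(f"improvement: {lift:+.0%} +/- {margin:.0%}")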

------
ameister14
I really liked this; it's condescending, but in a good-natured sort of way.
It's as if the author was trying to explain really basic statistics to a
marketer, then realized that the marketer had NO idea what he was talking
about.

So you get statements like "This is a well-known phenomenon, called
‘regression to the mean’ by statisticians. Again, this is common knowledge
among statisticians but does not seem to be more widely known."

I thought that was hilarious.

------
IanOzsvald
Martin gave this paper as a talk at our PyData London conference this weekend
(thanks Martin!), videos will be linked once we have them. He shares hard-won
lessons and good advice. Here's my write-up:
[http://ianozsvald.com/2014/02/24/pydatalondon-2014/](http://ianozsvald.com/2014/02/24/pydatalondon-2014/)

------
mildtrepidation
Would be interested to see patio11's feedback on this one.

~~~
patio11
Correct on the math, to the limit of my understanding of it and quick glance.

I am agnostic about whether most A/B testing practitioners administer their
tests correctly -- of the universe of companies I've seen, far and away the
most common error regarding A/B testing is "We don't A/B test.", which remains
an error even after you read this article.

The novelty effect they talk about, which the article says is probably simple
reversion to the mean, is -- in my opinion -- likely a true observation of the
state of the world. You can watch your conversion-rate-over-time for many
offers, many designs, many products, etc, and they _often_ start out quite
high and taper off, both in circumstances where there is obvious alternate
causality and in circumstances where there isn't. By comparison, I have not
often participated in tests where conversion rates started out abnormally low
and reverted to the mean, which we'd expect exactly as often as "started out
high" if that was indeed what we were seeing.

I believe so strongly in the novelty effect that I have written proposals to
profitably exploit it by scalably manufacturing novelty. Sadly, none of them
are public. It's on my to-do list for one of these months but a lot of things
are on my to-do list for one of these months.

If you run many tests, which as time approaches infinity you darn better, your
odds of seeing a false positive approach one. Contra the article, you _gladly
accept this_ as a cost of doing business, because you know to a statistical
certainty that you've seen many, many more true positives.
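
A quick check of that arithmetic at a 5% cutoff, assuming independent tests:

    for n in (1, 10, 50, 100):
        print(n, round(1 - 0.95 ** n, 3))   # 0.05, 0.401, 0.923, 0.994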

That about sums it up. If you have any particular questions, happy to answer
them. My takeaway is "Good article. Please don't use it to justify a decision
to not test."

------
beambot
Related... someone should write a good article about estimating customer
acquisition costs (CAC, or ROI if you prefer) based on conversion rates of
ads.

It drives me batty when people tell me their "average" conversion rate is 1%
after running a $25 ad campaign with so few clicks. It seems like too many
folks are just oblivious to sample size, confidence interval, and power
calculations -- something that could be solved with a quick Wikipedia search
[1].

[1]
[https://en.wikipedia.org/wiki/Sample_size_determination](https://en.wikipedia.org/wiki/Sample_size_determination)

------
gatehouse
Regarding the final bullet point about running a second validation test: the
sample size should be bigger, right? Because winners tend to coincide with
positive random effects, you should choose a larger experiment size and
expect to see a lesser result.

------
27182818284
Visibility on this is set to "Private" - is it really supposed to be linked
publicly on HN? I was about to tweet a link to it and then I felt dirty, like
maybe the author wanted to send the link to just a select group.

------
rubiquity
Coming from a poker background, where sample size trumps everything, I've
LOL'ed at every person that has ever whipped out an A/B test on me.

~~~
StavrosK
This doesn't follow. What if their sample size was 100,000 conversions?

~~~
rubiquity
Did you even read the article? The third point is "regression to the mean."

------
lingben
compare and contrast this whitepaper with arguably one of the most common
optimization apps out there:

[https://help.optimizely.com/hc/en-us/articles/200133789-How-long-to-run-a-test](https://help.optimizely.com/hc/en-us/articles/200133789-How-long-to-run-a-test)

------
coderdude
In my experience it can't be overstated how important it is to wait until you
have a large sample size before deciding whether a variation is the winner.
Nearly all of the A/B tests I run start out looking like a variation is the
clear, landslide winner (sometimes showing 100%+ improvement over the
original) only to eventually regress toward the mean. I can't get a clear
idea of the winner of a test until I've shown the variation(s) to tens of
thousands of visitors and received a few thousand conversions.

I've also learned that it's important to only run tests on new visitors when
possible. That means tests need to run longer to get the appropriate sample
size. If you're testing over a few hundred conversions and including both new
and returning visitors, then you're probably getting skewed results.

Again, that's just my experience so far. YMMV. One thing to consider with a
test is that the variations may be too subtle to have a significant, positive
impact on conversion.

