
Most Winning A/B Test Results Are Illusory [pdf] - maverick_iceman
http://www.qubit.com/sites/default/files/pdf/mostwinningabtestresultsareillusory_0.pdf
======
ted_dunning
This is yet another article that ignores the fact that there is a MUCH better
approach to this problem.

Thompson sampling avoids the problems of multiple testing, power, early
stopping and so on by starting with a proper Bayesian approach. The idea is
that the question we want to answer is more "Which alternative is nearly as
good as the best with pretty high probability?". This is very different from
the question being answered by a classical test of significance. Moreover, it
would be good if we could answer the question partially by decreasing the
number of times we sample options that are clearly worse than the best. What
we want to solve is the multi-armed bandit problem, not the retrospective
analysis of experimental results problem.

The really good news is that Thompson sampling is both much simpler than
hypothesis testing and can be applied in far more complex situations. It is
known to be an asymptotically optimal solution to the multi-armed bandit
problem and often takes only a few lines of very simple code to implement.

See [http://tdunning.blogspot.com/2012/02/bayesian-bandits.html](http://tdunning.blogspot.com/2012/02/bayesian-bandits.html) for an essay and see [https://github.com/tdunning/bandit-ranking](https://github.com/tdunning/bandit-ranking) for an example applied to ranking.
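
To give a flavor of those few lines, here is a minimal sketch for two variants with Bernoulli conversions and flat Beta(1, 1) priors (the variant names and starting counts are made up for illustration):

    import random

    # Hypothetical counts of conversions and non-conversions seen so far.
    successes = {"A": 10, "B": 14}
    failures = {"A": 90, "B": 86}

    def choose_variant():
        # Draw one sample from each variant's Beta posterior; serve the winner.
        draws = {v: random.betavariate(successes[v] + 1, failures[v] + 1)
                 for v in successes}
        return max(draws, key=draws.get)

    def record(variant, converted):
        # Fold the observed outcome back into that variant's posterior.
        if converted:
            successes[variant] += 1
        else:
            failures[variant] += 1

Variants that are clearly worse get sampled less and less often, which is exactly the "answer the question partially while the test runs" behavior described above.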

~~~
yummyfajitas
Thompson sampling is a great tool. I've used it to make reasonably large
amounts of money. But it does not solve the same problem as A/B testing.

Thompson Sampling (at least the standard approach) assumes that conversion
rates do not change. In reality they vary significantly over a week, and this
fundamentally breaks bandit algorithms.

[https://www.chrisstucchio.com/blog/2015/dont_use_bandits.htm...](https://www.chrisstucchio.com/blog/2015/dont_use_bandits.html)

Furthermore, you do not need to use Thompson Sampling to have a proper
Bayesian approach. At VWO we also use a proper Bayesian approach, but we use
A/B testing to avoid the various pitfalls that Thompson Sampling has.
Google Optimize uses an approach very similar to ours (although it may be
flawed [1]), and so does A/B Tasty (probably not flawed).

[https://cdn2.hubspot.net/hubfs/310840/VWO_SmartStats_technic...](https://cdn2.hubspot.net/hubfs/310840/VWO_SmartStats_technical_whitepaper.pdf)

Note: I'm the Director of Data Science at VWO. Obviously I'm biased, etc.
However my post critiquing bandits was published before I took on this role.
It was a followup to a previous post of mine which led people to accidentally
misuse bandits:
[https://www.chrisstucchio.com/blog/2012/bandit_algorithms_vs...](https://www.chrisstucchio.com/blog/2012/bandit_algorithms_vs_ab.html)

[1] The head of data science at A/B Tasty suggests Google Optimize counts _sessions_ rather than _visitors_, which would break the IID assumption.
[https://www.abtasty.com/uk/blog/data-scientist-hubert-google...](https://www.abtasty.com/uk/blog/data-scientist-hubert-google-optimize/)

~~~
paulddraper
No, no, no, no.
[https://www.chrisstucchio.com/blog/2015/dont_use_bandits.htm...](https://www.chrisstucchio.com/blog/2015/dont_use_bandits.html)
needs a rebuttal so very, very badly.

> Depending on what your website is selling, people will have a different
> propensity to purchase on Saturday than they have on Tuesday.

Affects multi-armed bandit and fixed tests. If you do a fixed A/B test on
Tuesday, your results will also be wrong. Either way, you have to decide what
kind of seasonality your data has, and not make any adjustments until the
period is complete.

If anything, multi-armed bandit shines because it can adapt to trends you
don't anticipate.

> Delayed response is a big problem when A/B testing the response to an email
> campaign.

Affects multi-armed bandit and fixed tests. If you include immature data in
your p-test, your results will be wrong. Either way, you have to decide how
long it takes to declare an individual success or failure.

> You don't get samples for free by counting visits instead of users

Affects multi-armed bandit and fixed tests. Focusing on relevant data
increases the power of your experiment.

\---

For every single problem, the author admits "A/B tests have the same problem",
and then somehow concludes that multi-bandit tests are harder because of these
design decisions, despite the fact they affect any experiment process.

~~~
yummyfajitas
_If anything, multi-armed bandit shines because it can adapt to trends you don
't anticipate._

It can, but the time it takes is exp(# of samples already passed).

You can improve this by using a non-stationary Bayesian model (i.e. one that
assumes conversion rates change over time) but this usually involves solving
PDEs or something equally difficult.
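
To make that concrete, one crude non-PDE workaround (just an illustration, not a proper non-stationary model, and with an arbitrary assumed discount factor of 0.999) is to exponentially discount old evidence so the Beta posterior can track a drifting rate:

    import random

    GAMMA = 0.999  # assumed per-observation discount factor

    # Beta posterior parameters for each variant, starting from a flat Beta(1, 1).
    alpha = {"A": 1.0, "B": 1.0}
    beta = {"A": 1.0, "B": 1.0}

    def record(variant, converted):
        # Decay everyone's accumulated evidence toward the prior, then add the
        # new observation, so old data gradually stops dominating the posterior.
        for v in alpha:
            alpha[v] = 1.0 + GAMMA * (alpha[v] - 1.0)
            beta[v] = 1.0 + GAMMA * (beta[v] - 1.0)
        if converted:
            alpha[variant] += 1.0
        else:
            beta[variant] += 1.0

    def choose_variant():
        draws = {v: random.betavariate(alpha[v], beta[v]) for v in alpha}
        return max(draws, key=draws.get)

Even this sketch forces a choice of drift timescale, so it doesn't remove the modeling burden, only moves it.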

 _For every single problem, the author admits "A/B tests have the same
problem", and then somehow concludes that multi-bandit tests are harder
because of these design decisions, despite the fact they affect any experiment
process._

The point the author (me) is trying to make is not that bandits are
fundamentally flawed. The point is that for A/B tests, all these problems have
simple fixes: make sure to run the A/B test for long enough.

For bandits, the fixes are not nearly as simple. It usually involves non-
simple math, or at the very least non-intuitive things (for instance not
actually running a bandit until 1 week has passed).

At VWO we realized that most of our customers are not sophisticated enough to
get all this stuff right, which is why we didn't switch to bandits.

~~~
paulddraper
> The point is that for A/B tests, all these problems have simple fixes: make
> sure to run the A/B test for long enough.

Multi-bandit has the same fix: make sure the test has run for long enough
before adjusting sampling proportions.

~~~
yummyfajitas
So what I'm proposing to do is run A/B with a 50/50 split for a full week,
then when B wins shift to 0/100 in favor of B.

You seem to be proposing to run A/B with a 50/50 split for a full week, then
when B does a lot better shift to 10/90 in favor of B and maybe a few weeks
later shift to 1/99.

What practical benefit do you see to this approach? From my perspective this
just slows down the experimental process and keeps losing variations (and
associated code complexity) around for a lot longer.

~~~
paulddraper
First, Google Analytics (for example) runs content experiments for a _minimum_
of two weeks regardless of results. It's hardly an unrealistic timeframe for
reliable conclusions.

> What practical benefit do you see to this approach?

Statistically rigorous results, with minimal regret.

In your example, you reach the end of the week, and your 50/50 split has one-
sided p=0.10, double the usual p<0.05 criterion. What do you do?

(a) Call it in favor of B, despite being uncertain about the outcome.

(b) Keep running the test. This compromises the statistical rigor of your test.

(c) Keep running the test, but use sequential hypothesis testing, e.g.
[http://elem.com/~btilly/ab-testing-multiple-looks/part1-rigo...](http://elem.com/~btilly/ab-testing-multiple-looks/part1-rigorous.html).
This significantly increases the time to reach a conclusion, and costs you
conversions in the meantime.

(a) and (b) are the most popular choices, despite them being statistically
unjustifiable. [http://www.evanmiller.org/how-not-to-run-an-ab-test.html](http://www.evanmiller.org/how-not-to-run-an-ab-test.html)

\---

The essential difference when choosing the approach is that 50/50 split
optimizes for shortest time to conclusion, and multi-bandit optimizes for
fewest failures.

In web A/B testing, the latter is usually the most applicable, and for that,
you _cannot beat_ Thompson sampling on the average, no matter how diabolically
clever your scheme. [https://www.lucidchart.com/blog/2016/10/20/the-fatal-flaw-of...](https://www.lucidchart.com/blog/2016/10/20/the-fatal-flaw-of-ab-tests-peeking/)

There are times when the former is more important, e.g. marketing wants to
know how to brand a product that is being released next month. These are the
clinical-like experiments that frequentist approaches were formulated for.

~~~
yummyfajitas
_Statistically rigorous results, with minimal regret._

The results are only statistically rigorous provided your bandit obeys
relatively strong assumptions.

As another example, suppose you ran a 2-week test. Suppose that from week 1 to
week 2, both conversion rates changed, but the delta between them remained
roughly the same. A 50/50 A/B split doesn't mind this, and in fact still
returns the right answer. Bandits do mind.

I don't do p-values. I do Bayesian testing, same as you. I just recognize that
in the real world, weaker assumptions are more robust to experimenter or model
error, both of which are generally the dominant error mode.

 _In web A/B testing, the latter is usually the most applicable, and for
that, you cannot beat Thompson sampling on the average, no matter how clever
your scheme._

This is simply not true. The Gittins Index beats Thompson sampling, subject
again to the same strong assumptions.

Look, I know the theoretical advantages of bandits and I advocate their use
under some limited circumstances. I just find that the stronger assumptions
they require (or alternatively the much heavier math requirements) mean they
aren't a great replacement for A/B tests, which are much simpler and easier to
get right.

------
tedsanders
I think the entire approach discussed in this pdf is flawed. (Edit: not saying
the PDF itself is flawed or wrong, just the hypothesis-testing approach to A/B
testing.)

The right question to ask is: What is the difference between A and B, and what
is our uncertainty on that estimate?

The wrong question to ask is: Is A different/better than B, given some
confidence threshold?

The reason this is the wrong question is that it's unnecessarily binary. It is
a non-linear transformation of information that undervalues confidence away
from the arbitrary threshold and overvalues confidence right at the arbitrary
threshold.

A test with only 10 or 100 samples still gives you information. It gives you
weak information, sure, but information nonetheless. If you approach the
problem from a continuous perspective (asking _how_ big the difference is),
you can straightforwardly use the information. But if you approach the problem
from a binary hypothesis-testing perspective (asking _is_ there a difference),
you'll be throwing away lots of weak information when it could be providing
real (yet uncertain) value.

Once you switch away from the binary hypothesis-testing framework, you no
longer have to worry about silly issues like stopping too early or false
positives or false negatives. You simply have a distribution of probabilities
over possible effect sizes.
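
For instance, with Bernoulli conversions and flat Beta priors, a minimal Monte Carlo sketch (the counts below are hypothetical) gives you the whole posterior over the uplift rather than a yes/no answer:

    import random

    def posterior_uplift(conv_a, n_a, conv_b, n_b, draws=100_000):
        # Sample both conversion rates from their Beta posteriors and return
        # the sampled differences B - A.
        diffs = []
        for _ in range(draws):
            p_a = random.betavariate(conv_a + 1, n_a - conv_a + 1)
            p_b = random.betavariate(conv_b + 1, n_b - conv_b + 1)
            diffs.append(p_b - p_a)
        return diffs

    diffs = posterior_uplift(conv_a=40, n_a=1000, conv_b=55, n_b=1000)
    print("P(B > A) =", sum(d > 0 for d in diffs) / len(diffs))
    print("expected uplift =", sum(diffs) / len(diffs))

A small sample just makes that distribution wide; it never forces you to throw the information away.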

~~~
Obi_Juan_Kenobi
That's putting the cart before the horse.

Before you can quantify a difference, you have to determine whether one exists
in the first place. That is the purpose of binary testing; without it, you're
just looking at noise without any means to decide what is real and what is
not.

As a corollary, if you can meaningfully quantify the difference between A and
B, then you should have no trouble establishing that they are different.
Obviously business decisions are not generally going to uphold the rigor of
good science, but what is the purpose of quantifying things when you're as
likely to be wrong as you are right?

~~~
tedsanders
Asking whether A and B are different is not a useful question. There will be a
difference 100% of the time. (Though of course sampling may not reliably
detect the difference at feasible sample sizes.)

The superior question is which is probably better, and by how much. If all you
know is that A is 75% likely to be better than B, then go with A. It's useful
information, even if it doesn't cross an arbitrarily preset threshold of 95%
or whatever you use.

You don't need to wait for certainty to act. In fact, all actions are taken
under uncertainty. So it feels incredibly artificial and counterproductive to
frame these questions in such a binary, nonlinear way. It's clinging to
certainty when certainty does not exist.

~~~
yummyfajitas
Actually even that isn't the best question.

The best question is what's the expected reward from choosing A.

Consider the scenario of 10% chance B > A, 90% chance they are equal. To me
that sounds like a winning bet.

In contrast, 70% chance B is a lot better than A, 30% chance B is a lot
_worse_ than A sounds like a scary bet. I'll wait and gather more info.

Proper decision theory is based on a _loss function_ which incorporates the
magnitude of gains/losses rather than just their existence.

[https://en.wikipedia.org/wiki/Loss_function](https://en.wikipedia.org/wiki/Loss_function)
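
Concretely, reusing the Beta-posterior Monte Carlo idea from above (hypothetical counts again), the expected loss of shipping a variant is the average amount of conversion rate you give up whenever the other variant turns out to be better:

    import random

    def expected_loss(conv_a, n_a, conv_b, n_b, draws=100_000):
        # Expected loss of committing to A or to B, under flat Beta priors.
        loss_a = loss_b = 0.0
        for _ in range(draws):
            p_a = random.betavariate(conv_a + 1, n_a - conv_a + 1)
            p_b = random.betavariate(conv_b + 1, n_b - conv_b + 1)
            loss_a += max(p_b - p_a, 0.0)  # what you lose by shipping A if B is better
            loss_b += max(p_a - p_b, 0.0)  # what you lose by shipping B if A is better
        return loss_a / draws, loss_b / draws

    la, lb = expected_loss(conv_a=40, n_a=1000, conv_b=55, n_b=1000)
    print("expected loss if we ship A:", la)
    print("expected loss if we ship B:", lb)

This captures the asymmetry between the two scenarios above: a small chance of a large loss can outweigh a large chance of a small one.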

------
godDLL
This is very well explained, even if you don't understand statistics.
Apparently not many vendors of A/B testing software do.

~~~
iaw
I suspect the people building it do, the people selling it probably do not.

~~~
CalRobert
Having worked between customers and engineering at a place that did this, I
can confirm that "our tests are Bayesian!!" was a refrain everyone was taught
to repeat, but few if any were taught what it meant.

This video is helpful:
[https://www.youtube.com/watch?v=Dy_LRK2Pkig](https://www.youtube.com/watch?v=Dy_LRK2Pkig)

I still don't know a great way to describe it in the 6 or 7 seconds you have
before the potential customer's attention starts to flag.

~~~
iaw
> I still don't know a great way to describe it in the 6 or 7 seconds you have
> before the potential customer's attention starts to flag.

I don't think there is a 6-7 second way to describe it. I mean, imagine
someone who knows nothing of frequentist statistics: could you explain the
meaning/significance of a confidence interval to them? It's a pretty high bar
to clear; I've found that for CIs it typically takes 1-5 minutes depending on
background.

------
ssharp
For most businesses, focusing on the math and making incremental improvements
to the statistical methodology is a waste of time. There is a "good enough"
approach to using tools like Optimizely and VWO.

Instead, they should be focusing on the quality of the tests they run. Quality
hypotheses, preferably backed by data-driven insights into behavior and
testing very clear, if not dramatic, changes to the user flow, are what lead
to bottom-line improvements, not increased mathematical rigor applied to
poorly conceived tests. Of course, doing both is ideal, but I put more fault
on the testing software than on the companies using it.

Keep track of your historical conversion rate and adjust/account for noise in
it. Conceive a testing program that focuses on quality tests and you'll likely
see an upward trend in that historical conversion rate.

------
jkuria
Hmmh, this is interesting. Most A/B software will let you set a level of
statistical confidence that needs to be attained before a winner can be
declared. For example in Google Analytics two common ones are 95% and 99%. We
stop our tests when they reach at least 95% confidence. Is the author saying
one must wait for 6000 events even if the difference between A/B is large? The
larger the relative difference, the fewer events needed.

~~~
contravariant
The problem is that people often repeatedly check the significance, despite
the fact that this test only guarantees a certain false positive rate if you
use it _once_.

If you're planning to stop as soon as you find a positive result you'll need
to modify the tests to ensure that the _total_ chance that _any_ test results
in a positive is low enough. In general you'll need to keep raising the
required significance level as you do more events (if you only plan to test a
finite number of times you _can_ keep the significance level fixed, but I
think this will lead to more false negatives than an 'adaptive' significance
level).

To illustrate, if you have 6000 events and check for 99% significance after
every one, then you'd expect about 60 false positives on average. Of course
these false positives aren't distributed uniformly, so it's not like you'll
always find 60 false positives; but it's not like you'll only ever find 0 or
6000 either, meaning that (significantly) more than 1% of the time you'll have
at least one false positive.
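
A quick simulation makes the size of the effect clear (all numbers here are assumptions: a 10% base rate for both variants, a one-sided z-test at the 99% level, and a check after every event from the 100th onward):

    import math
    import random

    def peeking_false_positive_rate(n_events=6000, z_crit=2.326, trials=500):
        # Both variants share the same true conversion rate, so any "winner"
        # declared by an interim check is a false positive.
        p, hits = 0.10, 0
        for _ in range(trials):
            conv_a = conv_b = 0
            for i in range(1, n_events + 1):
                conv_a += random.random() < p
                conv_b += random.random() < p
                if i < 100:  # skip tiny samples where the z-test is meaningless
                    continue
                pooled = (conv_a + conv_b) / (2 * i)
                se = math.sqrt(2 * pooled * (1 - pooled) / i)
                if se > 0 and (conv_b / i - conv_a / i) / se > z_crit:
                    hits += 1
                    break
        return hits / trials

    print(peeking_false_positive_rate())  # far above the nominal 1%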

~~~
tedsanders
Checking and stopping is only a problem if you use the inappropriate formula.
It's not generally a problem as far as I'm aware. What do you think?

~~~
apathy
It is a problem. You are "spending" some of your error "budget" whenever you
peek prior to an endpoint. This is why if you're going to do interim analyses
in clinical trials, you have to include it in the design.

People got tired of trialists playing games with statistics and patients dying
or ineffective drugs making it to market. Now if you want to run a trial as
evidence for approval, you need to specify an endpoint, how it will be tested,
when, with what alpha (false positive) threshold, and what's the minimum
effect size required for this.

If you are doing interim analysis, futility, or non-inferiority, you have to
write that into the design, too.

People can jerk around with subgroup analyses in publications but the FDA
won't accept that sort of horse shit for actual approval. And thank heavens
for that.

~~~
tedsanders
In a world with rational actors and free computation, there shouldn't ever be
a penalty for having more information about reality. Therefore, the only
reason not to peek is that actors are irrational and/or computation is
expensive.

Honestly, if the first 100 patients die in a 1,000-patient clinical trial, I
have zero qualms about making the judgment to stop early, even if it wasn't
written into the design. I'm not going to kill 900 people by religiously
following bad statistical principles.

I think we should be open-minded and understand that sometimes peeking is ok
and sometimes it isn't.

When the effect is large, you can end earlier. There's no reason to cling to a
formula and procedure that requires a fixed number of samples when other
methods exist that lack that drawback.

~~~
contravariant
There's no problem with aborting a test early, for whatever reason. However
that doesn't mean you can still draw conclusions from such a test. If you plan
to do a trial with 1,000 patients and you stop midway because you've reached
statistical significance you run a big risk of claiming a treatment works when
it doesn't.

Similarly, every test you do has a small probability of giving a false
positive; the more tests you do, the bigger the total chance that you'll be
jumping to conclusions.

Also, the size of the effect is irrelevant since that should already be
accounted for by whatever test you do.

~~~
tedsanders
Any Bayesian analysis should still be valid, despite stopping conditions. You
can still draw conclusions from an aborted test. You just have to use valid
formulas, and not formulas that assume incorrect counterfactual scenarios. I
think it's dogmatic to say stopping is bad and you can't do analysis. In my
mind, you totally can. It just needs to be the appropriate analysis.

Anyway, I don't even think false positives are the right way to think about
this. The framing should be continuous, not binary. The goal is to maximize
success, not maximize the odds of picking the better of A or B.

~~~
mattkrause
Your last sentence is telling.

A Bayesian analysis tells you Bayesian things: specifically, this is the most
reasonable conclusion one can draw from this data, right now. A frequentist
analysis also tells you things, but they are different things. Specifically,
frequentists are concerned with...frequencies: if we were to run this
procedure many times, here's what we can say about the possible outcomes.

There's this persistent meme that all Bayesian methods protect you from
multiple comparisons/multiple looks problems. That's not really true. Bayesian
methods don't offer any Type I/Type II error control--and why would they, when
the notion of these errors doesn't make much sense in a Bayesian framework?

You can certainly use Bayesian methods to estimate a parameter's value.
However, you cannot repeatedly test your estimates--Bayesian or otherwise--in
an NHST-like framework and expect things to work out correctly.

------
Vinnl
Nice article. One question though:

> Perform a second ‘validation’ test repeating your original test to check
> that the effect is real

Isn't this just the same as taking a larger sample size?

~~~
Obi_Juan_Kenobi
Almost.

The difference is time, so the populations involved are 'the people that used
the site last month' and 'the people that used the site last week'. Usually
your assumption is that these are comparable, but that's not necessarily true.
Furthermore, the effect is only meaningful if that assumption _is_ true, for
most business cases (i.e. you want this effect to hold in the future).

In practice, a lot of effects do disappear when you repeat a test because
there was some unaccounted for factor that varied between them. It's a good
sanity check.

But you're right, much of the purpose is to discover mean regression. This is
something that happens more often than you'd expect because you tend to be
focusing on large effects, many of which will simply be due to chance.

~~~
Vinnl
So basically, it means waiting a while before performing the test again? Since
"larger sample size" mostly just means "keep the test running for longer", so
the difference is that in that case, the "second" test is very close in time
to the "first" test?

~~~
tedsanders
Not asking me, but yes. I think the underlying issue is whether you assume the
process that generated the data yesterday is similar enough to the process
that generated data today. In truth, all data points are unique and all data
is high dimensional data. It only becomes workable in low dimensions once we
make assumptions like "the process that produced this point yesterday is the
same as the process that produced this second point today." Separating the
tests makes it slightly less difficult to test for violations of these
assumptions (e.g., model drift). But otherwise it seems silly and arbitrary to
commit yourself to two medium tests instead of one big test. At least two
medium tests let you filter out likely failures if you allow yourself to quit
midway. (This whole business of tests and halting is silly anyhow.)

~~~
Vinnl
Right, thanks. And am I correct in thinking that your final sentence before
the parentheses means that two medium tests are worse, because you don't want
to filter out likely failures?

------
guico
I don't get this: "But remember: if you limit yourself to detecting uplifts of
over 10% you will also miss negative effects of less than 10%. Can you
afford to reduce your conversion rate by 5% without realizing it?"

You'd potentially lose 5% CR if you ship the variant even when it doesn't show
a detectable uplift. Why would you do that?

------
RA_Fisher
Any other statisticians want to champion this?
[https://news.ycombinator.com/item?id=13434410](https://news.ycombinator.com/item?id=13434410)

------
youngtaff
Pity they didn't add a date to the paper!

