
A/B Testing Rigorously (without losing your job) - btilly
http://elem.com/~btilly/ab-testing-multiple-looks/part1-rigorous.html
======
btilly
Note, I expect this to be the first in a series. It probably won't wind up
laid out exactly as I expect, but here is my tentative plan:

1\. Rigorous (this one) - How to use frequentist techniques to define a very
rigorous statistical procedure that - no matter how small the bias in the test
- can give correct answers with very high probability (and, of course, support
multiple looks at your statistics).

2\. Fixed data - Your business has an upper limit on how much data it
realistically can collect. How to design a statistical procedure that makes
reasonably good decisions while remaining within that limit.

3\. Simple test - How to use Bayes' Theorem to design very simple and
straightforward test criteria, where every A/B test has a known cost up front.

4\. Bayesian statistics - How a Bayesian approach compares to the previous
ones. Where the sticking points are, what its advantages are.

5\. Bayesian optimization - How a Bayesian approach can let you turn the
problem of A/B testing into a pure optimization problem where your decision to
stop is based on whether the cost of continuing the test exceeds your expected
return from continuing it over some time horizon.

We'll see how far I get. As the series goes on, the articles become trickier
to write at the level that I'm aiming for. And it may be that people will be
interested in something else. But I thought that the first two were important,
if for no other reason than to give me something definitive to point at for
people who read Evan Miller's article and then went around telling others to
run A/B tests in a way that made no business sense.

~~~
loup-vaillant
From your article:

> _We will follow Evan's lead and use a frequentist approach._

I thought: "how about using the _correct_ approach instead?", followed by a
helpful comment about how you are a Bad Person¹. Ahem.

Then I noticed that the way you talk about p-values doesn't sound very
frequentist. And then I saw your comment. I eagerly await the rest of the
sequence –err, series².

[1]: <https://xkcd.com/386/>

[2]: Come to think of it, this does look like LessWrong material to me.

~~~
btilly
I suspect that you may like me substantially less when I get to discussing the
trade-offs that are inherent in Bayesian statistics.

I know how wonderful it can be to feel that you have realized The Truth, but
the frequentist school of statistics does not remain in existence just because
statisticians are too stupid to recognize the obvious superiority of Bayesian
methods.

~~~
loup-vaillant
Having read chapter 2 of _Probability Theory: the Logic of Science_ by E.T.
Jaynes, my probability for the stupidity hypothesis went way, way up. There's
probably a heavy status-quo bias at work too.

I mean, many frequentist methods directly contradict Cox's theorem! That
should convince anyone that they belong to History, not Science, shouldn't it?

~~~
loup-vaillant
I know the above comment sounds extremist, but bear in mind that its truth
value has nothing to do with that. For instance, the only reason _"2 + 2 = 4,
and anyone who believes otherwise lacks either a brain or some crucial
information"_ does not sound extremist is that everyone actually agrees
with it.

So. Who in her right mind would reject the three assumptions behind Cox's
theorem, and for what reason? Assuming we don't reject those axioms, why
should we not reject Frequentism at once?

<https://en.wikipedia.org/wiki/Cox%27s_theorem>

------
gburt
I am strongly of the opinion that even if you got a non-significant result
doing your A/B test, assuming we're testing something of no cost to change
(like a web design that has already been deployed for the sake of the test),
your point estimate for the effect is the "best data you have" and you should
act on it.

To provide an example, you run a test and find Page 1 outperformed Page 2 by a
factor of 1.1, but that result was non-significant at your desired significance
level (say, α = 0.05 with power = 0.8). You should deploy Page 1 instead of
Page 2, assuming there are no other costs associated with the switch, because
your BEST GUESS RIGHT NOW is that Page 1 is 1.1x better than Page 2.
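
A minimal sketch of that reasoning in Python (all counts are hypothetical,
picked to give roughly a 1.1x ratio):

    # Sketch: act on the point estimate even when the test is non-significant.
    # All counts are hypothetical.
    from scipy.stats import norm

    conv1, n1 = 110, 10000   # Page 1: conversions, visitors
    conv2, n2 = 100, 10000   # Page 2
    p1, p2 = conv1 / n1, conv2 / n2
    print(f"point estimate: Page 1 is {p1 / p2:.2f}x Page 2")

    # Standard two-proportion z-test with a pooled variance estimate.
    pool = (conv1 + conv2) / (n1 + n2)
    se = (pool * (1 - pool) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p1 - p2) / se
    print(f"two-sided p = {2 * norm.sf(abs(z)):.2f}")   # ~0.49, nowhere near
    # significant, yet the best available guess is still that Page 1 is better.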

~~~
grzaks
But if your 1.1x result is not significant, in reality it could perform 1.1x
worse. In my opinion, in this situation you're just guessing.

~~~
rieter
Yes, it is a guess that might be wrong, but it is the best guess you have.
Staying on the old page is also a guess, which might be wrong, possibly with a
larger likelihood.

------
aaronjg
That seems to be a pretty sound approach, compared to some of the stuff about
multi-armed bandits that shows up here sometimes. And I certainly expect Noel
Welsh to chime in as well.

There are two schools of thought about approaches to sequential testing: the
Bayesian approach led by Anscombe, and the frequentist one led by Armitage. I
talked a bit about this and outlined Anscombe's approach here [1]. And it is
great to see such a nice write-up of the frequentist approach, along with the
tables of stopping criteria.

[1] <http://blog.custora.com/2012/05/a-bayesian-approach-to-ab-testing/>

~~~
mwexler
You imply that the MAB approach has problems, or at least is not as "sound".
Care to state what those problems or issues are?

~~~
aaronjg
I am mostly critical of claims like '20 lines of code that will beat A/B
Testing Every Time.' Multi-armed bandits are also not as useful for inference
as the frequentist methods that Ben presents in his posts.

------
mistercow
If you do an A/B test and get a result that leans one way but is statistically
insignificant, then it seems to me you might as well just go with that answer.
No, you don't have good reason to believe that it's better than the
alternative, but you _do_ have good reason to believe that the choice is
harmless.

~~~
prosa
> you _do_ have good reason to believe that the choice is harmless.

The issue you will run into here is that 95% confidence means that you will
only have a false positive 5% of the time. It does _not_ mean a neutral
finding is 95% likely to be neutral. The lever that controls that is
_statistical power_, which is oft-ignored in conversations about A/B testing.
Most statisticians use 80% power, which means that when a real effect exists,
a full 20% of tests will come back neutral - false negatives.
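
A quick Monte Carlo sketch of that point in Python (the rates are made up,
and the per-arm sample size is chosen to give roughly 80% power):

    # Sketch: false negatives at ~80% power. A real 10% relative lift exists,
    # yet about 20% of tests come back non-significant. Numbers are made up.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    p_a, p_b = 0.10, 0.11   # true conversion rates: a real effect exists
    n = 14750               # per-arm sample size, roughly 80% power here
    alpha = 0.05

    trials, misses = 2000, 0
    for _ in range(trials):
        ca, cb = rng.binomial(n, p_a), rng.binomial(n, p_b)
        pool = (ca + cb) / (2 * n)
        se = np.sqrt(pool * (1 - pool) * 2 / n)
        z = (cb / n - ca / n) / se
        if 2 * norm.sf(abs(z)) >= alpha:   # non-significant: a miss
            misses += 1
    print(f"false negative rate: {misses / trials:.1%}")   # ~20%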

~~~
mistercow
That is very true. However, you do have much better reason to believe that the
option is neutral than you have to believe that the option is beneficial. In
an example like the one given in the article, you also likely have enough
statistical power to be reasonably confident that the option is _close_ to
neutral, so if you're making a negative decision, it's probably not strongly
negative.

~~~
prosa
Right. And you can also ratchet up the statistical power you want, at the cost
of increased sample size requirements.
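
For a sense of that cost, here is a sketch using the standard
normal-approximation sample size formula for a two-proportion test (the 10%
vs 11% conversion rates are hypothetical):

    # Sketch: per-arm sample size as a function of power, via the usual
    # normal-approximation formula. The conversion rates are hypothetical.
    from scipy.stats import norm

    def n_per_arm(p1, p2, alpha=0.05, power=0.8):
        z_a = norm.ppf(1 - alpha / 2)   # critical value for two-sided alpha
        z_b = norm.ppf(power)           # quantile matching the desired power
        var = p1 * (1 - p1) + p2 * (1 - p2)
        return (z_a + z_b) ** 2 * var / (p1 - p2) ** 2

    for power in (0.8, 0.9, 0.95):
        print(f"power {power:.0%}: ~{n_per_arm(0.10, 0.11, power=power):,.0f} per arm")
    # power 80%: ~14,748   power 90%: ~19,745   power 95%: ~24,418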

------
ArnoVanLumig
One article I've found helpful with A/B testing is [1], which also describes
some common statistical problems (e.g. Simpson's paradox as it applies to A/B
testing) based on case studies. It's written as a research paper by some folks
at Microsoft, but it's definitely very readable.

[1]: <http://www.exp-platform.com/Documents/2009-ExPpitfalls.pdf>
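
For a taste of the Simpson's paradox pitfall the paper covers, here is a
sketch in Python (all numbers are made up): B wins in every segment yet loses
in the aggregate, because B happened to receive more of the low-converting
mobile traffic.

    # Sketch of Simpson's paradox in A/B testing. All numbers are made up.
    # B converts better in BOTH segments, yet worse overall, because B got
    # a larger share of the low-converting mobile traffic.
    segments = {
        # segment: (conv_a, visitors_a, conv_b, visitors_b)
        "desktop": (500, 5000, 110, 1000),   # A: 10.0%  B: 11.0%
        "mobile":  (20, 1000, 110, 5000),    # A:  2.0%  B:  2.2%
    }
    tot_a = tot_b = na = nb = 0
    for name, (ca, va, cb, vb) in segments.items():
        print(f"{name}: A {ca / va:.1%}  B {cb / vb:.1%}  (B wins)")
        tot_a, tot_b, na, nb = tot_a + ca, tot_b + cb, na + va, nb + vb
    print(f"overall: A {tot_a / na:.1%}  B {tot_b / nb:.1%}  (A 'wins')")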

------
majormajor
My stats knowledge is fairly rusty, so I got a bit lost midway through, but
here's a side question I've been wondering about, especially given that the
linked Miller article talks about the null hypothesis being that the two are
equal: how does that null hypothesis fit into these tests? I.e., you have to
choose _a_ button, so does it even make sense when the hypothetical employee
in that example at the start of the article says "we didn't get an answer"? A
.18 p-value isn't great, but I imagine he'd still recommend the green one --
there's much less reason to think the other button was better.

Why not do a one-sided test where the null hypothesis is "A is at least as
good as B" and the test hypothesis is "B is better than A"? It seems like
you'd be gaining some more power to detect B being better, without losing
much, since A's safe to choose regardless of whether they're about the same or
A's actually better?

~~~
btilly
The reason is that statistics is done with numbers. I do not know how to take
a statement like "A is at least as good as B" and tell you the probability
that after 15 coin flips, A is 3 ahead of B. I do know how to do that
calculation under the assumption that they are exactly equal. Or under the
assumption that A comes up 51% of the time, B 49% of the time.
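
To make that concrete, here is a sketch of the coin-flip calculation in
Python: under the exactly-equal null each conversion is A or B with
probability 1/2, so after 15 conversions "A is 3 ahead" just means A got 9 of
them.

    # Sketch: probability that A is 3 ahead of B after 15 conversions, under
    # the null hypothesis that A and B are exactly equal (fair coin flips).
    from scipy.stats import binom

    print(binom.pmf(9, 15, 0.5))   # P(A leads by exactly 3)  ~ 0.153
    print(binom.sf(8, 15, 0.5))    # P(A leads by 3 or more)  ~ 0.304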

The point of the null hypothesis is that it is an absolute worst case. So
whatever the real difference, it can't be harder to detect than the null
hypothesis. Therefore if we make very few mistakes on the null hypothesis,
then we're confident that we'll make very few mistakes, no matter what.

And finally you're right that going with the green button is better than just
tossing up your hands. But how to decide how much better is a surprisingly
complicated question, and there is no simple answer about how to quantify it.
(Remember, statistics has to be done with numbers, so quantifying the answer
matters.)

~~~
majormajor
Wouldn't it just be a normal one-sided hypothesis test? The only thing that
changes under H_0: A ≥ B vs H_0: A = B is that you're only interested in the
area under one side of the curve instead of both. The test statistic remains
the same; you just get a different p-value.
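
Concretely, a sketch with hypothetical counts: the z statistic is identical,
and when the data lean the hypothesized way, the one-sided p-value is half the
two-sided one.

    # Sketch: same test statistic, different p-values. Counts are hypothetical.
    from scipy.stats import norm

    conv_a, n_a = 120, 10000
    conv_b, n_b = 150, 10000
    pa, pb = conv_a / n_a, conv_b / n_b
    pool = (conv_a + conv_b) / (n_a + n_b)
    se = (pool * (1 - pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (pb - pa) / se

    p_two = 2 * norm.sf(abs(z))   # H1: A != B
    p_one = norm.sf(z)            # H1: B better than A
    print(f"z = {z:.2f}, two-sided p = {p_two:.3f}, one-sided p = {p_one:.3f}")
    # Here the one-sided test reaches p < 0.05 while the two-sided one does not.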

It should let you get away with a smaller sample size, at the expense of
possibly only being able to conclude "B's better" or "B's not better" instead
of "A's better," "A and B could be the same," or "B's better." (It's unclear
to me what the problem would be with looking at the p-values for both the >
test and the != test for a single experiment, and only "falling back" to the >
test if you happen to be in a range where you can say "we can't say for sure
that B's better, but we can be pretty sure that A isn't better," like how the
different p-values are presented here [1].)

It still wouldn't be valid to do something like "oh the two-sided one was
inconclusive but leaned in favor of A, so let's do a one-sided one to check if
A's better after all," though.

[1] <http://www.ats.ucla.edu/stat/mult_pkg/faq/general/tail_tests.htm>

~~~
btilly
In this article I was going for the highest possible standard. The ins and
outs of one-sided versus two-sided was not a topic that I wanted to explore.

In the next article I plan to relax things a _lot_ further than just
one-sided versus two-sided tests!

~~~
majormajor
Gotcha, thanks. And one- vs two-sided alone would hardly be a huge win; I was
just curious since I didn't remember seeing it brought up at all in any of the
prior articles I'd read.

------
will_critchlow
How do you approach this issue?
<http://www.distilled.net/blog/conversion-rate-optimization/why-your-cro-tests-fail/>

I haven't come up with a decent answer yet, especially for smaller sites that
can't run tests on very specific segments...

~~~
btilly
Based on your article and code, I believe that you have misunderstood what the
statistical test is supposed to be telling you.

When you make decisions at 95% confidence, your guarantee is that at most 5%
of the time will you wrongly conclude that one version is better when it
isn't. However, you have absolutely no guarantee of having correctly called
the direction of the test if you do hit that significance level. (Indeed, if
the null hypothesis is correct, then every time you call the test, you're
wrong!)

What you did was simulate the test many times, ignore the cases where you
were told there was no answer (thereby throwing away a large part of your
guarantee), and find that you could be wrong a large portion of the time.
Furthermore, if there was a large, discoverable random factor, you found that
it could be correlated with a lot of the mistakes. Unfortunately, in addition
to the discoverable factor, there are lots of unknown random factors that also
get randomly correlated. And even if there aren't, there is always the
possibility of experiencing bad luck.
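
Here is a sketch of that failure mode as an A/A test in Python (all numbers
are made up): the null is true, so every "winner" you declare is wrong by
construction, and you still declare one about 5% of the time.

    # Sketch: A/A test where the null is true. Every significant call is
    # wrong by construction, yet such calls happen ~5% of the time.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    p, n, alpha = 0.10, 10000, 0.05   # made-up rate and sample size
    trials, wrong_calls = 5000, 0
    for _ in range(trials):
        ca, cb = rng.binomial(n, p), rng.binomial(n, p)
        pool = (ca + cb) / (2 * n)
        se = np.sqrt(pool * (1 - pool) * 2 / n)
        z = (cb / n - ca / n) / se
        if 2 * norm.sf(abs(z)) < alpha:
            wrong_calls += 1   # a "winner" was declared, but there is none
    print(f"wrong calls: {wrong_calls / trials:.1%}")   # ~5%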

The only way to solve this is to throw enough traffic at the problem that the
underlying bias can be reliably detected statistically. There is a complicated
relationship between the size of bias you're willing to get wrong, and the
amount of data that you need to collect to reliably detect it. I hope to
explore that relationship in the next two articles.

For larger sites, that solution is perfectly workable. For smaller ones it is
not, and the best that I can suggest is that they rely heavily on design
principles that have been validated through A/B testing on larger sites, and
hope they are not going too far wrong.

~~~
will_critchlow
I'm pretty sure we should just ignore my code for the purposes of this
discussion - my math(s) may be fuzzy, but my code has _never_ been a strong
point.

I was mobile when I wrote the original question - perhaps a better way of
phrasing it would have been something like:

Don't many (most? all?) of these theoretical approaches assume that the
sequences of results for each page (call them a_i and b_i for i = 1, 2, 3...,
where each term is 0 [no conversion] or 1 [conversion]) are sequences of iid
random variables with underlying conversion probabilities p_a and p_b? In
reality these sequences are much more complex and, if the scale of the
variation in conversion probability _within the sequence_ is greater than the
difference between p_a and p_b, won't the test be much weaker than we
originally thought?

To use an example that is simplified vs. reality but hopefully indicates what
I mean, imagine that we have two traffic sources, one with a conversion
probability twice that of the other on each page variant (so p_1_a = 2 *
p_2_a and p_1_b = 2 * p_2_b). We randomly send traffic from both sources (1
and 2) to each variant (a and b). Do the standard tests work even though our
sequences of conversions are not iid?

~~~
btilly
No.

One good way to see why is this: the theoretical approaches that you're
talking about do not care about the internal details of your random number
generator. You could have p_a be the result of a single random decision
(convert or not), or the result of first randomly drawing a traffic source
and then converting with a probability that depends on that source. Either
way, as long as in the end you get a stable overall probability p_a of going
from a new visitor to an actual conversion, probability theory says that the
exact same statistical statements will be true, and you'll have entirely
equivalent results.
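
A sketch of that equivalence (the numbers are made up): drawing a source
first and then converting at the source-specific rate is indistinguishable
from a single Bernoulli draw at the blended rate.

    # Sketch: two-stage sampling (pick a traffic source, then convert at that
    # source's rate) vs. one Bernoulli draw at the blended rate. Numbers are
    # made up; both arrangements give the same distribution of conversions.
    import numpy as np

    rng = np.random.default_rng(2)
    n = 1_000_000
    mix = 0.3                 # 30% of traffic comes from source 1
    p1, p2 = 0.20, 0.10       # source 1 converts twice as often as source 2
    p_blend = mix * p1 + (1 - mix) * p2

    from_source_1 = rng.random(n) < mix
    rate = np.where(from_source_1, p1, p2)
    two_stage = (rng.random(n) < rate).mean()
    one_stage = (rng.random(n) < p_blend).mean()
    print(f"two-stage {two_stage:.4f}  one-stage {one_stage:.4f}  "
          f"blended p {p_blend:.4f}")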

(Note, the way I set up my particular approach means that actual conversion
rates can shift over the test without invalidating the results. I didn't call
that out too strongly, but I value that detail.)

However there is a gotcha lurking. The gotcha is that if conversion depends on
your source, then A/B testing will optimize for your current traffic mix, but
its result may become wrong if that mix shifts. My usual approach is to simply
assume that the world is not going to be malicious in this way, unless I have
specific reason to suspect it is.

Thus, for instance, I would not think twice about source vs button color. But
if I had a landing page with random testimonials, I'd expect that a
testimonial from pg would convert better than a testimonial from Phil Ivey.
And conversely for traffic from a gambling site.

------
monkeyfacebag
Can someone who understood this post all the way through tell me what _m_
refers to? It first shows up in this sentence

> Therefore we're faced with a series of decisions, at each number of
> conversions _n_ , _m_ more turned up _A_ than _B_ ( _m_ can, of course, be
> negative).

which I am not able to parse.

~~~
bobbles
"m more often resulted in result A than result B"

I think that would be another way of saying it. I could be completely wrong
though; it is a bit confusing.

~~~
btilly
Did I manage to clarify the wording in my latest revision?

If so, how would you like to be acknowledged?

------
carlsednaoui
Loving this, thank you for taking the time to make these. Is there any way we
could get notified by email when the rest of the series comes out? My email is
user @ gggmail.com just in case.

------
sycren
I have done many split tests on my company website using Optimizely. How do
this service and others measure up in the accuracy of their results?

------
grzaks
btilly: I read your article, but before I dig into the math and try to
understand it - can you please take a look at the tool I use now,
<http://mystatscalc.com>, and see 1) how it might work and 2) whether you
think it's giving correct results?

Waiting for the rest of the series!

Grzegorz

~~~
btilly
That tool is giving you a standard frequentist p-value. This means that if
you insist on 95% confidence and take multiple looks, then you'll have more
than a 5% chance of eventually reaching that confidence level by pure chance.

If you stop your tests on the p-value cutoffs that I gave in my third graph,
you'll get very strong a priori guarantees that the decisions that you make
will be right. The downside is that you won't have any guarantee of making
decisions in reasonable time. But that is the subject of the next article.
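
Here is a sketch of that multiple-looks problem in Python (all numbers are
made up): run an A/A test, recompute a standard p-value after every 1,000
visitors per arm, and stop at the first p < 0.05. The false positive rate
lands far above 5%.

    # Sketch: peeking at a standard p-value inflates false positives.
    # A/A test (no real difference), made-up numbers: check after every
    # 1,000 visitors per arm, stopping at the first p < 0.05.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    p, look_size, looks, alpha = 0.10, 1000, 20, 0.05
    trials, false_pos = 2000, 0
    for _ in range(trials):
        ca = cb = n = 0
        for _ in range(looks):
            ca += rng.binomial(look_size, p)
            cb += rng.binomial(look_size, p)
            n += look_size
            pool = (ca + cb) / (2 * n)
            se = np.sqrt(pool * (1 - pool) * 2 / n)
            if 2 * norm.sf(abs(cb / n - ca / n) / se) < alpha:
                false_pos += 1
                break
    print(f"false positive rate with peeking: {false_pos / trials:.1%}")  # >>5%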

