

Determining A/B test sample size - noahnoahnoah
http://37signals.com/svn/posts/3004-ab-testing-tech-note-determining-sample-size

======
equark
Somebody really needs to write a Bayesian takedown of all these A/B testing
articles. A/B testing is a Bayesian decision problem. There's really no other
way to think about it. Determining sample size and frequentist confidence
intervals are only relevant insofar as they approximate Bayesian concepts.

The issue is the proper tradeoff between exploration and exploitation. What
drives the decision is outstanding uncertainty conditional on the data
observed (not conditional on the null hypothesis of zero effect and some non-
sequential iid sampling process), the discount rate (which is totally absent
in this article), and the reward structure (which is not captured by Type I
and Type II error rates).

The absurdity of the frequentist approach is clear from the admonition not to
look at the results of the tests too often.

~~~
mturmon
I think even a Bayesian approach will have to grapple with the issue of
looking at the results too often. The problem is that if your decision about
when to stop testing depends on the test results so far, then those results
can be biased.

I'm sure you're aware of this, but I'm just trying to clarify the idea for
other readers.

The idea is not well-illustrated in the article. (Although the article does
provide some usable guidance until the whole Bayesian framework gets built and
populated with correct parameters, like the reward structure.)

So, to be concrete -- Suppose you're flipping coins and you figure (by some
procedure) you need 100 flips to reach significance. By the 70th flip, you
observe that p(head) ~= 40/70 ~= 57%, so you decide to stop the test because
clearly you're not dealing with a 50/50 coin. That's not OK, because you'll
always see favorable and unfavorable excursions in a series of coin flips --
if you choose to stop in the middle of such an excursion, you'll bias the
result. You've made the stopping time dependent on the observed values.

In some situations you can do this (it's related to
<http://en.wikipedia.org/wiki/Optional_stopping_theorem>), but the way that I
described above is not one of them.

~~~
equark
No, this is actually a common misunderstanding, and it gets to the heart of
the difference between conditioning on the data vs. considering the sampling
process. At the 70th flip your best guess is that it is 57%, given a uniform
prior. It's perfectly fine to stop based on the results you have; that doesn't
change the likelihood of seeing what you saw. Imagine looking after each flip:
clearly your best guess is still the sample mean unless you have prior
knowledge.

What's confusing is thinking about the sampling distribution. But what might
have happened in some other world is of no consequence if you condition on the
data rather than the parameter.

This is the likelihood principle.
<http://en.wikipedia.org/wiki/Likelihood_principle>. See the example there and
how it relates to sequential trials. It's actually rather deep. Other good
links are:

<http://books.google.com/books?id=_ravDT9e8nMC&lpg=PA17&dq=stopping%20rule%20robert&pg=PA15#v=onepage&q=stopping%20rule%20robert&f=false>

<http://books.google.com/books?id=oY_x7dE15_AC&lpg=PA27&dq=likelihood%20principle%20berger&pg=PA27#v=onepage&q=likelihood%20principle%20berger&f=false>

<http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&page=toc&handle=euclid.lnms/1215466210>

~~~
mturmon
My only point is that any kind of analysis has to be careful about the way its
mathematical assumptions relate to how the real-life experiment is conducted.

I'm not even going near the question of whether the Bayesian approach is
"better" than the frequentist approach.

I was trying to point out that the frequentist analysis in the OP does make
assumptions about the nature of the experiment (that you will run exactly N
trials) and that if you break those assumptions by stopping the test for some
N' < N because the answers are looking good, then you'd better understand that
your earlier analysis did not apply.

And in another reply, I wanted to add that there is a frequentist answer
(Wald's sequential probability ratio test) to the practical question: can the
scope of the analysis be widened so that you _can_ stop early if the results
point strongly in one direction?

Being sure that your assumed sampling distribution matches the actual
experiment is key, even in the Bayesian case.

My graduate statistics class was taught from Berger, your second link, so I'm
broadly sympathetic to the "Bayesian choice" -- but more important, I wanted
to give some usable insight to someone who just wants to do an A/B test.

~~~
equark
Yes, examining the data will mess up the sampling distribution and invalidate
the standard Wald test. But in the A/B testing context it's absurd to advocate
not acting on your data. Of course it's also absurd to look at conventional
p-values if you do. So it's a bit of a Catch-22.

All this confusion goes away if you realize you are interested in p(lift |
data) rather than p(data | lift=0). The sampling distribution -- the
distribution of the statistic under repeated sampling, p(data | lift=0) --
does not play a role in Bayesian statistics. Obviously the "model"
(likelihood/prior) does, but this doesn't include the experimental procedure,
provided that the stopping rule depends only on the observed data.

A/B testing, as a decision procedure, is an area where I don't think the
standard frequentist-vs-Bayesian debate applies. The Bayesian decision rule is
the _only_ profit-maximizing solution. That said, I'm sympathetic to being
practical. But all the confusion and conflicting advice around A/B testing
stems directly from trying to fit it into a frequentist frame.

------
bryanh
I rarely see people take into account the opportunity cost of letting a really
close A/B test run until it reaches 99.99% confidence, when the benefit is by
definition very marginal (that's why it's taking so long, right?). I mean, is
it really that bad to act on "close enough" results and move on to bigger and
better tests?

~~~
hammock
In academia, we used confidence levels of .95 or greater (p values below .05).
I was taught that in business, though, the rule of thumb for making decisions
is more like .8, and that's typically the standard I use as well.

------
DanielRibeiro
Another way to see this is to use this online calculator:
<http://visualwebsiteoptimizer.com/ab-split-significance-calculator/>

------
Loic
If you are lazy, you can get the functions coded in PHP here:
<http://abtester.com/calculator/>

