

A Bayesian Approach to A/B Testing - pospischil
http://blog.custora.com/2012/05/a-bayesian-approach-to-ab-testing/

======
zeroonetwothree
There's a related notion of an "adaptive" statistical design, where the
allocation of users to each test group varies based on prior performance of
the group. For example, if after the first 100 users you notice that group A
seems to be doing slightly better than group B, you will favor it by
allocating more users to that group. You can compute this allocation in such a
way as to maximize the number of successes. In particular, it will eventually
converge to always picking the better approach, assuming there is a real
difference. This also means you don't really need to "stop" the experiment to
make everyone use the better version (although you may want to for other
reasons).

Here is one paper: <http://web.eecs.umich.edu/~qstout/pap/SciProg00.pdf>
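
To make the idea concrete, here is a minimal sketch (mine, not from the paper)
of one such adaptive scheme, Thompson sampling with Beta(1, 1) priors; the
variant names and "true" conversion rates are invented purely to simulate user
responses:

    # Thompson sampling over two variants with Beta(1, 1) priors.
    import random

    arms = {"A": {"success": 0, "failure": 0},
            "B": {"success": 0, "failure": 0}}
    true_rates = {"A": 0.11, "B": 0.10}  # hypothetical, unknown in practice

    for _ in range(10_000):
        # Sample a plausible conversion rate from each arm's posterior
        # and send this user to whichever arm drew the highest rate.
        draws = {name: random.betavariate(counts["success"] + 1,
                                          counts["failure"] + 1)
                 for name, counts in arms.items()}
        chosen = max(draws, key=draws.get)

        # Simulate the user's response and update that arm's counts.
        if random.random() < true_rates[chosen]:
            arms[chosen]["success"] += 1
        else:
            arms[chosen]["failure"] += 1

    print(arms)  # the better arm ends up receiving most of the traffic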

~~~
aaronjg
Noel Welsh at untyped has been working on a cool implementation of this
adaptive design.

Check it out:
<http://untyped.com/untyping/2011/02/11/stop-ab-testing-and-make-out-like-a-bandit/>

And the HN discussion: <http://news.ycombinator.com/item?id=3867380>

------
ced
_If you aim to make inferences about which ideas work best, you should pick a
sample size prior to the experiment and run the experiment until the sample
size is reached._

That's not a very Bayesian thing to say. It doesn't matter what sample size
you decided to pick at the beginning. A Bayesian method should yield
reasonable results at every step of the experiment and allow you to keep
testing until you feel comfortable with the posterior probability
distributions.

If 10 customers have converted so far and 30 haven't, then you would expect
the conversion rate to be somewhere between 10% and 40%, as shown by this plot
of the Beta(10, 30) distribution:

<http://www.wolframalpha.com/input/?i=plot+BetaDistribution+10+30>

You then do the same with method B, and stop testing once the overlap between
the two probability distributions looks small enough.
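
To make that concrete, here is a minimal sketch (mine, not the commenter's) of
the comparison, assuming a uniform Beta(1, 1) prior; the counts for method B
are made up:

    # Monte Carlo comparison of two conversion rates via Beta posteriors.
    import numpy as np

    rng = np.random.default_rng(0)

    def posterior_samples(conversions, non_conversions, n=100_000):
        # Posterior under Beta(1, 1) is Beta(conversions + 1, non_conversions + 1).
        return rng.beta(conversions + 1, non_conversions + 1, size=n)

    a = posterior_samples(10, 30)  # method A: 10 converted, 30 didn't
    b = posterior_samples(14, 26)  # method B: hypothetical counts

    print("P(B beats A) ~=", (b > a).mean())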

Anscombe's rule is interesting, but it seems rather critically dependent on
the number of future customers, which is hard to estimate. The advantage of
the visual approach outlined above is that it's more intuitive, and people can
use their best judgment to decide whether to keep on testing or not.

 _Disclaimer_: I am not an A/B tester.

~~~
noelwelsh
This way of framing the problem is known as the bandit problem. You can find
lots of papers about it (Bayesian and frequentist). As others have mentioned
in this thread we have a startup providing bandit algorithms as SaaS:
<http://mynaweb.com/>

~~~
ced
I've looked at your website, and from what I gathered, I would make the same
criticism as I made for Anscombe's rule: it's not easy at all to decide what
rewards should be, and how to put a price on exploration vs. exploitation. The
more I think about it, the more I feel that an engineer looking at Beta
distributions could weigh the trade-offs and make a better decision than a
black-box algorithm with inadequate assumptions.

Granted, this doesn't really scale to testing many combinations of features,
and I think that I can see what you're shooting for. Best of luck with Myna.

------
prosa
This is a powerful approach when you can quantify your regret. For many
startups, however, it's important to understand the tradeoffs involved in
moving one metric upward or downward. To take Zynga as an example, they care
about virality at least as much as engagement (or perhaps more so). Adding or
removing a friendspam dialog is likely to trade some virality for user
experience. What percentages make or break the decision? Sometimes this is a
qualitative call.

In environments where you need to look at the impact of your experiments
across multiple variables, and make a subjective call about the tradeoffs,
it's really important to have statistical confidence in the movement of each
variable you're evaluating. This is a key strength of the traditional A/B
testing approach.

------
cmansley
May I ask how this is Bayesian in any way? I understand that using the term
Bayesian is good for directing clicks to a site, but this seems like good
old-fashioned frequentist math. None of the hallmarks of a Bayesian approach
to the problem are here: maintaining a distribution over hypotheses, having an
explicit prior, computing the posterior distributions.

I have some experience with the medical trial literature, specifically with
bandit algorithms and the use of cumulative regret versus other statistical
measures like PAC frameworks. And regret is most certainly not a Bayesian
idea. Instead you are explicitly modeling the cost of each action (providing
an A or B test to a user) instead of assuming all costs are equal.

Yes, this is a better approach because it explicitly models the costs
associated with the exploration/exploitation dilemma. But, it is not Bayesian.

~~~
aaronjg
In the clinical trial literature Anscombe's approach is considered Bayesian,
and Armitage is frequentist. From Armitage's 1963 response to Anscombe's
paper:

'Anscombe takes the Bayesian view that inferences should be made in terms of
the likelihood function... An immediate consequence is that stopping-rules are
irrelevant to the inference problem.'

Page 6 of the Anscombe paper that I cited may be helpful in your understanding
of the approach.

------
spitfire
This gets into the nitty-gritty of running trials (A/B, split testing). If
things like this get baked into libraries, they have a chance of pushing the
state of the art forward.

Very worthy of an HN post.

EDIT: Actually, check out their entire blog. It's worth your time.

~~~
aaronjg
What libraries are you currently using where you would like to have things like
this?

------
roryokane
This description of content optimization using bandit algorithms sounds like
an even better approach:
<http://untyped.com/untyping/2011/02/11/stop-ab-testing-and-make-out-like-a-bandit/>

That company has already made a web app and service to optimize content using
that approach, Myna, at <http://www.mynaweb.com/>. A simulated experiment
showed their approach to be better than A/B testing:
<http://www.mynaweb.com/blog/2011/09/13/myna-vs-ab.html>. Though Myna's
website doesn't say whether it is currently free or not, or what its pricing
will be when it goes out of beta.

~~~
noelwelsh
As of yesterday evening, Myna is in public beta. That is, you can sign up
straight from the website. Myna is completely free for now. When we start
charging, the cost will be in line with other companies in the same space. If
you're earning from your site the cost of Myna should be a rounding error.

------
LinaLauneBaer
"k is the expected number of future users who will be exposed to a result"

Does this mean that this approach does not make much sense if your estimate of
k is totally wrong?

How do you estimate k?

~~~
aaronjg
Anscombe talks about this and proposes two solutions:

One is to estimate it based on the number of daily visitors your site gets,
and then estimate how long you will run the winning alternative in the
campaign.
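
For example (with made-up numbers): a site that gets about 1,000 visitors a
day and plans to keep the winning variant up for roughly 90 days would use
k ≈ 1,000 × 90 = 90,000.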

He also proposes: 'perhaps k should be assessed, not as a constant, but as an
increasing function of |y|/n, since the more striking the treatment difference
indicated the more likely it is that the experiment will be noticed... One way
of introducing such a dependence of k on |y|/n is to assess k+2n as a
constant.'

This actually simplifies the math somewhat, and you can see the full details
in Anscombe's paper cited in the blog.

~~~
dfabulich
I don't get it. What if this is my home page? What if I intend to run the
campaign "forever?"

If k is an estimate of how much traffic I will ever see, it seems like I'm
going to be calculating the Phi-inverse of approximately 0.

Where can I see Anscombe's paper online? (It was published in 1963; it's not
linked in the blog post, just cited.)

~~~
aaronjg
Just sent you a copy of the paper. If you plan to use the result 'forever,'
then theoretically you would be willing to accept a huge (infinite) amount of
suboptimal performance now, so that you get the correct answer when you
finally pick the winning idea. It would be very important to have the correct
winning idea, because it is going to run for eternity.

In practice, we don't actually run the winning idea forever. We do website
re-designs periodically, we test new ideas, business needs change. So we can
pick a reasonable value for k based on these constraints.

Alternatively, you can get better performance by _not_ picking a stopping
criterion and instead dynamically choosing which homepage to show. As soon as
one idea appears to be doing better, you start showing it to more users. By
choosing an appropriate adaptive sampling strategy, you can reduce regret
below what a constant sampling strategy would incur. However, for many people the
adaptive strategy may be more trouble to implement than it is worth.

The most important takeaway is to _not_ use repeated significance tests to
determine experiment termination time. Either use the Anscombe bound with an
appropriate k, or fix the sample size before starting the experiment.
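
For the second option, here is a rough sketch (my own, not from the thread) of
fixing the sample size up front with a standard two-proportion power
calculation; the 10% baseline rate and 12% target rate are hypothetical:

    # Standard two-proportion sample-size calculation.
    from scipy.stats import norm

    def sample_size_per_arm(p_base, p_test, alpha=0.05, power=0.80):
        z_alpha = norm.ppf(1 - alpha / 2)
        z_power = norm.ppf(power)
        variance = p_base * (1 - p_base) + p_test * (1 - p_test)
        return (z_alpha + z_power) ** 2 * variance / (p_base - p_test) ** 2

    print(round(sample_size_per_arm(0.10, 0.12)))  # users needed in each group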

------
tel
It's worth noting that the zoomed-in graph (the 4th image), while it correctly
shows that using significance as a stopping rule could cause problems, also
clearly shows that the classical test is far more powerful for n < 2000, i.e.
it declares a result significant with more sensitivity.

So while Anscombe's rule looks good for massive amounts of users, smaller
tests with predefined stopping rules can be more useful if you only have a few
thousand observations.

------
Cblinks
How many tech companies use the Bayesian approach rather than the traditional
(confidence-testing) approach?

------
appleaintbad
There is nothing wrong with studying this approach and trying it out to see
whether the interpretations are more helpful. However, insufficient data and
insufficient technique are common when studying extremely complex systems;
this Bayesian approach makes assumptions that may not be correct.

~~~
chimeracoder
> However, insufficient data and insufficient technique are common when
> studying extremely complex systems;

A Bayesian approach is _especially_ well-suited to small sample sizes, unlike
a frequentist approach, which will give a nonsensical result for a sample size
of 0, 1, or 2.

As for improper technique, I can't help you there. 'Garbage in, garbage out',
as they always say.
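
As a small sketch of that contrast (my own, with hypothetical data): after a
single visitor and zero conversions, the frequentist point estimate is 0 with
no usable interval, while a Beta(1, 1) prior still yields a sensible posterior.

    # One visitor, zero conversions (hypothetical data).
    from scipy.stats import beta

    conversions, visitors = 0, 1

    print(conversions / visitors)                          # 0.0, no sensible interval
    posterior = beta(conversions + 1, visitors - conversions + 1)  # Beta(1, 2)
    print(posterior.mean())                                # ~0.33
    print(posterior.interval(0.95))                        # wide but finite interval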

