
A Formula for Bayesian A/B Testing - bufo
http://www.evanmiller.org/bayesian-ab-testing.html
======
sharnett
Very nice, much faster than simulating it. I guess if you're using an
informative prior, instead of adding 1 to the number of successes and
failures, you add the corresponding parameters of your beta prior?
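
If so, a minimal sketch of that update (the prior parameters and counts here are made up):

      # Posterior for a conversion rate p with an informative Beta prior:
      # Beta(prior_alpha + successes, prior_beta + failures).
      prior_alpha, prior_beta = 30, 70    # hypothetical prior: ~30% expected rate
      successes, failures = 12, 48        # hypothetical observed data
      posterior_alpha = prior_alpha + successes
      posterior_beta = prior_beta + failures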

A pretty good shortcut (if you don't have a log-beta function, for example) is
to approximate both A and B with the normal distribution. Then their
difference is also normal, so you can just check the probability that a normal
random variable is greater than zero.

Specifically, the mean μ of a beta distribution is α/(α+β) and the variance
σ^2 is αβ/((α+β)^2(α+β+1)). Use these as the parameters for your normal
approximation, and we have the difference D ~ N(μ_A-μ_B, σ_A^2+σ_B^2). The
probability that B beats A is just the CDF of D evaluated at 0.

In Python:

    
    
      from scipy.stats import norm
    
      def beta_mean(a, b):
          # Mean of a Beta(a, b) distribution.
          return a / (a + b)
    
      def beta_var(a, b):
          # Variance of a Beta(a, b) distribution.
          return a * b / ((a + b)**2 * (a + b + 1))
    
      def probability_B_beats_A(alpha_A, beta_A, alpha_B, beta_B):
          # D = p_A - p_B is approximately normal, so
          # P(p_B > p_A) = P(D < 0) = CDF of D at 0.
          mu = beta_mean(alpha_A, beta_A) - beta_mean(alpha_B, beta_B)
          sigma = (beta_var(alpha_A, beta_A) + beta_var(alpha_B, beta_B))**0.5
          return norm.cdf(0, mu, sigma)

~~~
dxbydt
For those of us stuck in Java/Scala land, with attached results:
[https://gist.github.com/krishnanraman/4be978cbe8979e4c97fa](https://gist.github.com/krishnanraman/4be978cbe8979e4c97fa)

------
EvanMiller
I'm pretty sure this formula is correct, but I haven't seen it published
anywhere. John Cook has some veiled references to a closed-form solution when
one of the parameters is an integer:

[http://www.mdanderson.org/education-and-research/departments...](http://www.mdanderson.org/education-and-research/departments-programs-and-labs/departments-and-divisions/division-of-quantitative-sciences/research/biostats-utmdabtr-005-05.pdf) [pdf]

But he doesn't really say what that closed form is, so I think his version
must have been pretty hairy. (My version requires all four parameters to be
integers, so I doubt we were talking about the same thing.)
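
For the curious, the sum is easy to evaluate on the log scale with SciPy's betaln, which avoids overflow. A quick sketch (the function and argument names are illustrative):

      from math import exp, log
      from scipy.special import betaln
    
      def prob_B_beats_A(alpha_A, beta_A, alpha_B, beta_B):
          # P(p_B > p_A) for independent Beta posteriors; all four
          # parameters must be positive integers.
          total = 0.0
          for i in range(alpha_B):
              total += exp(betaln(alpha_A + i, beta_A + beta_B)
                           - log(beta_B + i)
                           - betaln(1 + i, beta_B)
                           - betaln(alpha_A, beta_A))
          return total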

Sadly, I couldn't get the math to work out for producing a confidence interval
on |p_B - p_A|, so for now you're stuck with Monte Carlo for confidence bounds.

Thanks to prodding from Steven Noble over at Stripe, I'll have another formula
up soon for asking the same question using count data instead of binary data.
Stay tuned!

~~~
octernion
Thanks for doing this, Evan! Much appreciated - I've already implemented it to
compare with our traditional confidence-score-based A/B metrics.

I'm not sure what you mean by Monte Carlo for confidence bounds - is there
some reading I can do to brush up on that?

edit: here is your solution as I implemented it in Ruby
[https://gist.github.com/nickelser/6dfa0f2737c502dae267](https://gist.github.com/nickelser/6dfa0f2737c502dae267)

~~~
paraschopra
A confidence interval means: if the expected conversion rate of A is 0.5 and
B's is 0.4, what is the actual spread? In other words, does the conversion
rate fall in the 0.3-0.5 range or the 0.35-0.45 range 95% of the time? The
more data you collect, the narrower this range will be. Confidence bounds for
the individual conversion rates can be calculated from the formula, since each
rate follows a beta distribution. What Evan is saying is that a confidence
bound for the _difference_ in conversion rates would need to be calculated
through repeated simulations, taking (say) the 5th and 95th percentiles of the
values that come out.

For example, given the probabilities for A and B (say 0.5 and 0.4) and that
the number of trials was 100, you can simply run thousands of simulations,
record the conversion rates that come out in each one, and take the difference
in conversion rates at the end of each simulation. The spread of those
differences gives you a confidence bound.
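
A minimal sketch of that simulation in Python (the counts here are made up):

      import numpy as np
    
      rng = np.random.default_rng(0)
      n = 100_000
    
      # Posteriors with uniform priors: Beta(successes + 1, failures + 1).
      p_a = rng.beta(50 + 1, 50 + 1, size=n)   # A: 50 conversions out of 100
      p_b = rng.beta(40 + 1, 60 + 1, size=n)   # B: 40 conversions out of 100
    
      # 90% credible interval on the difference p_B - p_A:
      lo, hi = np.percentile(p_b - p_a, [5, 95])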

~~~
octernion
Gotcha - thanks so much for writing this up. I'll see about actually
implementing this :)

------
jessaustin
_If you are unable to find a log-gamma function lying around, rewrite the
above equation in terms of factorials using Γ(z)=(z−1)!, and notice that there
are an equal number of multiplicative terms in the numerator and denominator.
If you alternately multiply and divide one term at a time, you should be able
to arrive at an answer without encountering numerical overflow._

Something about the right side of the equation immediately preceding this
quote seems to indicate that _many_ of the terms in the numerator would cancel
with equivalents in the denominator. I'm not really familiar with computer
algebra systems, but is this the sort of thing they could do? Doing that
simplification _once_ when one writes the code seems to be a win over
calculating the original expression every time the code runs.
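
For what it's worth, the quoted multiply-and-divide trick itself is easy to sketch in Python (the helper name is mine):

      def product_ratio(numerators, denominators):
          # prod(numerators) / prod(denominators), interleaving one multiply
          # and one divide per step so the partial result stays moderate
          # instead of overflowing. Assumes equal-length sequences.
          result = 1.0
          for n, d in zip(numerators, denominators):
              result = result * n / d
          return result
    
      # Example: C(100, 50) / 2**100, i.e. P(exactly 50 heads in 100 flips),
      # written as prod(50 + k) / prod(4 * k) for k = 1..50,
      # so no factorial is ever formed.
      numer = [50 + k for k in range(1, 51)]
      denom = [4 * k for k in range(1, 51)]
      print(product_ratio(numer, denom))   # ~0.0796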

~~~
avibryant
I had the same thought, and went down the simplification rabbit-hole. Here's
where I ended up:
[https://gist.github.com/avibryant/545e63b446a840951d5e](https://gist.github.com/avibryant/545e63b446a840951d5e)

------
paraschopra
Evan, does this formula account for multiple comparisons (if you have multiple
goals and multiple variations)? I guess it would suffer from the same problem:
if you have 100 variations and 10 goals, some of them are bound to produce a
significant result just by random chance, aren't they? In classical testing,
you can fix this by using a per-comparison alpha smaller than the overall
alpha, which means you need much more data when there are multiple
comparisons. What happens in the Bayesian case?

Edit: I did some Googling and found this
[http://www.stat.columbia.edu/~gelman/research/unpublished/mu...](http://www.stat.columbia.edu/~gelman/research/unpublished/multiple2.pdf)

~~~
grayclhn
That paper is a very different setup --- the hierarchical model Gelman, Hill,
and Yajima use implicitly builds multiple comparisons into their process. You
could do the same thing here in principle by drawing alpha and beta for
different experiments from an underlying distribution, but I'd be cautious
about applying that approach mindlessly.
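
As a crude sketch of that idea (not the Gelman, Hill, and Yajima model; all the numbers here are invented), you could moment-match a shared Beta prior from past experiments and use it in place of Beta(1, 1):

      import numpy as np
    
      # Hypothetical conversion rates from many earlier experiments.
      rates = np.array([0.04, 0.05, 0.06, 0.05, 0.07, 0.04, 0.06])
    
      # Method-of-moments fit of a shared Beta(alpha0, beta0) prior:
      # mean = a/(a+b), var = mean*(1-mean)/(a+b+1).
      m, v = rates.mean(), rates.var(ddof=1)
      ab = m * (1 - m) / v - 1                 # estimate of a + b
      alpha0, beta0 = m * ab, (1 - m) * ab
    
      # A new experiment starts from the shared prior, so noisy
      # estimates get shrunk toward the group-level mean.
      successes, failures = 10, 190            # hypothetical new data
      posterior = (alpha0 + successes, beta0 + failures)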

If you haven't read it, Gelman and Hill's book is excellent:
[http://www.stat.columbia.edu/~gelman/arm/](http://www.stat.columbia.edu/~gelman/arm/)
and so is Gelman's blog.

------
vog
Great work! However, an additional _testing_ section would be nice (right
below the _implementation_ section).

That section should provide two or three lists of example input values, and
the expected output value (up to some accuracy).

As the author notes, although this formula looks pretty simple, you can make a
lot of numerical mistakes when implementing it. A test suite would help
implementors to catch those mistakes early.
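
For instance, here is a sketch of a couple of exact checks such a suite could include (the function mirrors the closed-form sum from the article): identical posteriors must give 0.5 by symmetry, Beta(1,1) vs Beta(2,1) works out to exactly 2/3, and the two orderings must sum to 1.

      from math import exp, log, isclose
      from scipy.special import betaln
    
      def prob_B_beats_A(alpha_A, beta_A, alpha_B, beta_B):
          # Closed-form sum from the article, evaluated on the log scale.
          return sum(exp(betaln(alpha_A + i, beta_A + beta_B) - log(beta_B + i)
                         - betaln(1 + i, beta_B) - betaln(alpha_A, beta_A))
                     for i in range(alpha_B))
    
      assert isclose(prob_B_beats_A(1, 1, 1, 1), 0.5)    # identical posteriors
      assert isclose(prob_B_beats_A(1, 1, 2, 1), 2 / 3)  # exact: integral of (1 - x^2)
      # Complementary orderings sum to 1 (ties have probability zero):
      assert isclose(prob_B_beats_A(3, 7, 5, 5) + prob_B_beats_A(5, 5, 3, 7), 1.0)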

~~~
akane
Assuming the Julia implementation is correct, you could write tests based on
output from that.

------
hidden-markov
Indeed, much faster than Monte Carlo integration:

    
    
      5.778  evaluation.py:20(evaluate) # Sampling version
      0.043  evaluation.py:64(evaluate) # Closed formula
    
    

(See my A/B testing library [https://github.com/bogdan-kulynych/trials](https://github.com/bogdan-kulynych/trials))

------
spitfire
Here's the Mathematica version:

    
    
      PrBbeatsA[αa_, βa_, αb_, βb_] :=
        Sum[Beta[αa + i, βa + βb] /
            ((βb + i) Beta[1 + i, βb] Beta[αa, βa]),
          {i, 0, αb - 1}]

------
psychometry
A minor complaint: that formula calculates a probability (a number in the
range [0,1]), rather than odds (a ratio such as p/(1-p)).

