
Bayesian A/B Testing - Peroni
http://developers.lyst.com/data/2014/05/10/bayesian-ab-testing/
======
EvanMiller
This is a good discussion, but the author is confused about what the "Low Base
Rate Problem" is. It doesn't have anything to do with the null hypothesis
being true most of the time -- the example the author gives is actually a
second form of repeated significance testing, which could be addressed with a
Bonferroni or Šidák correction.

The Low Base Rate Problem is when you have a binary outcome and one of the
outcomes is rare (say, less than 1%). There is so little entropy in the
information source that you have to acquire a heck of a lot of samples in
order for the statistical test to have any power. The problem is not unique to
frequentist statistics; it's a consequence of information theory and so it
affects Bayesian statistics as well.

Nonetheless, I highly recommend examining Bayesian test techniques to avoid
repeated significance testing (both within a single trial and across multiple
trials). A side benefit is that when someone says "What's the probability that
the new purple dragon logo outperforms the old one?", you can give them an
answer without backpedaling and explaining null hypotheses, p-values,
significance levels, and all that jazz.

The major drawback to Bayesian techniques is that they tend to be
computationally expensive. For example, to evaluate the A/B test and answer
the purple-dragon question with normal priors, you have to integrate a normal
distribution in two directions, and there's not a clean analytic formula for
that. That's why there's a jagged histogram in the blog post; it changes every
time you hit "Calculate" because it's being integrated with Monte Carlo
techniques, which take a lot of juice compared to (frequentist) analytic
methods.
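A minimal Monte Carlo sketch of the integration described above, assuming (purely for illustration) that each variant's conversion-rate posterior is approximately normal; the means and standard deviations are made-up numbers, not from the post:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical posterior parameters for the conversion rate of each variant;
# in practice these come from your data and your priors.
mu_a, sigma_a = 0.050, 0.004   # posterior for variant A
mu_b, sigma_b = 0.055, 0.004   # posterior for variant B

# Draw from each posterior and count how often B beats A.
# This is the "integrate in two directions" step, done numerically.
draws_a = rng.normal(mu_a, sigma_a, size=100_000)
draws_b = rng.normal(mu_b, sigma_b, size=100_000)
p_b_beats_a = np.mean(draws_b > draws_a)
print(f"P(B > A) ~ {p_b_beats_a:.3f}")
```

Because it resamples every run, the estimate (and any histogram of the draws) wiggles slightly each time, which is exactly the jagged-histogram effect mentioned above.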

~~~
kiyoto
>A side benefit is that when someone says "What's the probability that the new
purple dragon logo outperforms the old one?", you can give them an answer
without backpedaling and explaining null hypotheses, p-values, significance
levels, and all that jazz.

This is surprisingly valuable. As a former quant/math person, I am shocked
time and again by how many self-professed data-driven people have no idea
about inferential statistics. I've learned over time that most people just
want to see data and descriptive statistics, preferably in visual
representations, and interpret them somewhat creatively and pretty much
non-rigorously.

------
ballison
I think using Bayesian methods is a good thing. The reservation I have about
this post is that it's motivated as A/B testing, which in classical
frequentist testing usually equates to a null hypothesis that some parameter
is equal in two populations (in this example, conversion rate under method A
is equal to conversion rate under method B). The rest of the post then
describes a test of a single population against a known value (is the
conversion rate under method B = 5%).

These are not the same problems at all, and it's not clear how the authors
propose to extend the tests they're informally describing to the test of
equality of a parameter in two populations. It is possible in a Bayesian
framework, but it's not this simple.

~~~
kobyszcze
Thanks for the comment. We do look at comparing proportions in two samples
(have a look at our calculator:
[http://developers.lyst.com/bayesian-calculator/](http://developers.lyst.com/bayesian-calculator/)). The way we do
this is by constructing a posterior distribution of differences between the
two proportions through taking samples from the proportion posterior
distributions and taking their difference. But yes, we should have made it
more obvious in the post!
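A sketch of the sampling approach described above, assuming (hypothetically) Beta(1, 1) priors and made-up trial counts; the actual calculator's priors and counts may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed data (not from the post): successes / trials.
conv_a, n_a = 120, 2400   # variant A: 5.0% observed
conv_b, n_b = 150, 2500   # variant B: 6.0% observed

# Beta(1, 1) prior + binomial likelihood gives a Beta posterior per variant.
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=50_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=50_000)

# Posterior distribution of the difference, built by pairing the samples.
diff = post_b - post_a
print("P(B > A) ~", np.mean(diff > 0))
print("95% credible interval for the lift:", np.percentile(diff, [2.5, 97.5]))
```

The `diff` array is the posterior of the difference in proportions; summarising it with a probability or a credible interval answers the two-sample question directly.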

------
MLWiDA
My two cents:

1. As pointed out correctly in the article and in "Most A/B-Test Results are
illusionary", using a non-sequential frequentist method in a sequential way
leads to wrong conclusions. In this regard, the comparison is unfair. On the
flip side, the frequentists' sequential methods for more than one dimension
seem to be rather complicated (for one dimension and a binomial distribution,
check out the Sequential Probability Ratio Test (SPRT)).

2. Finding a good prior is really, really hard. Choosing a weak prior may lead
to jumping to wrong conclusions. I found it useful to a) choose an
uninformative prior, b) not perform a statistical analysis until 1-2 weeks of
data are in (to be sure that all special-day effects like weekends are
caught) ... which could be seen as a prior with the weight of the same
interval ;) and c) use a simple old-fashioned permutation test to control the
alpha error and prevent jumping to conclusions.

While this approach has turned out to be too conservative in the binomial
case, it becomes extremely important when dealing with revenue. Single
customers do turn the tide. Often these are clear outliers, but sometimes it
is not that easy to say. But this is a topic for another day.
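A sketch of the kind of permutation test mentioned in point c), applied to per-customer revenue; the revenue figures and the single big spender are invented to illustrate the outlier problem:

```python
import numpy as np

rng = np.random.default_rng(1)

def permutation_test(x, y, n_perm=10_000, rng=rng):
    """Two-sided permutation test on the difference in means.

    Repeatedly shuffles the pooled observations across the two groups and
    counts how often a shuffled split shows a gap at least as large as the
    observed one.
    """
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = abs(perm[:len(x)].mean() - perm[len(x):].mean())
        count += diff >= observed
    return count / n_perm

# Hypothetical per-customer revenue, with one big spender in group B
# illustrating how a single customer can turn the tide.
rev_a = rng.exponential(20.0, size=500)
rev_b = np.append(rng.exponential(22.0, size=500), 5000.0)
print("p-value:", permutation_test(rev_a, rev_b))
```

Because the test only reshuffles the observed values, it makes no normality assumption, but a single extreme customer still dominates the mean difference, which is the difficulty described above.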

------
morgante
I had the pleasure of taking a great Bayesian stats class in college and these
methods seem preferable.

Unfortunately, the requirement of priors (seemingly) precludes Bayesian stats
from many A/B testing applications. For example, our site launches every
article with 3 or 4 headline variations. A priori, we have no idea which
headline will be most successful or even what the expected conversion rate is
for that article.

Is there any way to use Bayesian methods in such a scenario?

~~~
kobyszcze
You could use identical priors for each variant, with the mean equal to the
empirical mean of your existing system, and the variance controlling how
plausible big improvements are.

In this case your prior mean could be something as simple as an average
conversion rate for all your past articles.

Remember that by using frequentist statistics you are implicitly assuming a
uniform prior. I think it's quite easy to find weakly informative priors
which are more plausible than saying that any conversion rate is equally
likely.
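One way to build the identical priors described above is to pick a Beta prior from a mean and a "strength" (a pseudo-count controlling how plausible big deviations are); the 4% mean and strength of 50 below are illustrative numbers, not from the post:

```python
# Turn a prior mean and a prior strength (pseudo-count) into Beta parameters.
# A larger strength concentrates the prior around the mean, making big
# improvements less plausible a priori.
def beta_params(mean, strength):
    return mean * strength, (1.0 - mean) * strength

alpha, beta = beta_params(0.04, 50)   # hypothetical 4% average rate
print(f"prior: Beta({alpha}, {beta})")
```

Each headline variant then starts from the same Beta(alpha, beta) prior, and the data for each variant updates it independently.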

~~~
yummyfajitas
Or even better, you can look at the conversion rates of past articles and fit
a beta distribution to them.
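A method-of-moments sketch of that fit, using invented historical per-article conversion rates:

```python
import numpy as np

# Hypothetical conversion rates of past articles (not real data).
rates = np.array([0.031, 0.045, 0.052, 0.038, 0.060, 0.042, 0.049, 0.035])

# Method-of-moments fit of a Beta distribution: match the sample mean and
# variance to the Beta distribution's mean and variance.
m, v = rates.mean(), rates.var()
common = m * (1 - m) / v - 1
alpha, beta = m * common, (1 - m) * common
print(f"fitted prior: Beta({alpha:.1f}, {beta:.1f})")
```

The fitted Beta then serves as an empirical prior for new articles, with its mean equal to the historical average rate by construction.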

------
punee
I guess this is as good a place as any to ask a more general (and naive)
question: how come Bayesian methods still haven't taken over the testing
market, given their advantages (as perceived by my narrow understanding)?

~~~
jcromartie
Because Bayesian probability is not intuitive given how we (as a society)
currently approach statistics. If we start teaching Bayesian probability in
grade school, then maybe the next generation will "get it" sufficiently to
make mainstream applications possible.

~~~
punee
Assuming I'm a marketer who's using A/B testing tools and who's only
interested in getting "statistically significant" results, and assuming
Bayesian methods provide some advantage in terms of regret minimization that
translates to real dollars earned, I feel that I could totally outsource my
understanding of the theoretical underpinnings that get me that result. After
all, what percentage of Optimizely or VWO users perfectly understands the
statistical framework the tools are based on? So I'm not really convinced by
that line of reasoning.

~~~
tel
Bayesian methods don't perform so much better in the situation you just named.
They're also more expensive.

They really shine in that they provide a uniform vocabulary for producing
new, more sophisticated models, while "frequentist" methods usually rely on
ingenuity to get to better models. So if you're actively exploring a model
space and want to attach a bunch of assumptions and degrees of freedom to
correspond with a theory you're testing... then Bayesianism is the way to go.

Theoretically, marketers are doing that exact process. In practice, they
don't see it as a statistical process, though. Maybe someday a tool will
bridge that gap successfully... but again, you're unlikely to get a huge
advantage with Bayesian methods without that increased work investment.

