
A Better Approach to A/B Test Analysis - bvanvugt
http://blog.sendwithus.com/a-better-approach-to-ab-test-analysis/
======
jessicaraygun
Hey! I'm the author of Confidence.js. Emily Malcolm and I have been working
hard on this new approach for the past few weeks and we're super excited to
share it!

We're both here to answer any questions :)

~~~
tedsanders
>First, we use Chi Squared Tests to determine if differences in the A/B test
data are meaningful or not.

Could you explain why you've chosen to take a binary approach to determining
whether differences are meaningful or not?

To me it seems like a continuous approach would be both more useful and more
realistic. Creating an artificial threshold for significance seems a bit silly
(and it also makes the model harder for users to use, because different
applications might need different significance levels to justify an action).

From my perspective, every data point contains information and if you wait for
significance you're essentially ignoring early information.

Edit: Also, when switching costs are small, significance levels become mostly
pointless and you just want to switch to the best A/B option immediately. As
evidence swings the other way, you just switch back.
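For illustration, the "every data point contains information" view can be expressed as a continuous probability-of-superiority estimate under Beta posteriors. This is a minimal Python sketch, not anything from Confidence.js; the function name, the uniform Beta(1, 1) priors, and the Monte Carlo approach are my own choices:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A), where each
    variant's conversion rate gets a uniform Beta(1, 1) prior.
    conv_* = conversions, n_* = visitors for each variant."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate is Beta(successes + 1, failures + 1)
        a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        if b > a:
            wins += 1
    return wins / draws
```

The result is a number like 0.93 rather than a yes/no verdict, so each user can apply whatever threshold their switching costs justify.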

~~~
em441
Hi! This is Emily, the stats brains behind this. My research indicated that
80% significance is commonly used for A/B testing (and of course any
significance level above 80% would be even more conservative). I definitely
see your point about the advantage of a continuous test, and I think that for
an expert, being able to see the exact significance level would be very
useful. However, our thinking was that the average user might not have the
expertise to know what a "good enough" level of significance would be for any
given test. Rather than having to educate every user on what significance
means and how to interpret it, we decided that a "yes, significant" or "no,
not significant" would be more easily interpreted by everyone, regardless of
their statistical background. If there is demand for a more continuous
approach, it could certainly be implemented.
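For context, the binary check described above can be sketched roughly as follows: a Pearson chi-squared test on a 2x2 table of conversions vs. non-conversions, reported as significant or not at the 80% level. This is a minimal Python sketch, not the Confidence.js implementation, and the function names are invented:

```python
import math

def chi_squared_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-squared test on a 2x2 table of conversions vs.
    non-conversions for variants A and B. Returns (statistic, p_value).
    A 2x2 table has one degree of freedom."""
    table = [[conv_a, n_a - conv_a],
             [conv_b, n_b - conv_b]]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 df:
    # P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

def significant_at_80(conv_a, n_a, conv_b, n_b):
    """The binary yes/no described above: significant if p < 0.20."""
    _, p = chi_squared_2x2(conv_a, n_a, conv_b, n_b)
    return p < 0.20
```

A user who wants the continuous view can simply read the p-value instead of the boolean.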

~~~
tedsanders
Thanks for the reply. I guess it's a good reminder for everyone building
statistical products (or even non-statistical products) that the goal is not
an optimal formula, but a result that makes users comfortable.

------
mrmch
Worth pointing out that our (sendwithus) js library for calculating all of
this is open source: github.com/sendwithus/confidence

------
wtracy
The statistical methods we use now were created in the context of long-running
experiments that had to be set up in advance and run in parallel. In that
situation, you have to decide up-front how many subjects to test on, and the
methods reflect this.

I'd like to see someone tackle creating a method aimed at situations like
ours, where results steadily trickle in. There ought to be a way to come up
with adaptive thresholds such that at any given time we can ask, "Do we have
statistically significant results yet, or do we keep the test running?"

~~~
tedsanders
I totally agree. And I think Bayesian methods (which compute the likelihood of
the model given the data) tend to work much better for these rolling data
applications than frequentist methods (which compute the likelihood of the
data given the model). The problem for frequentist methods here is that when
data is constantly rolling in it's hard to specify the space of all possible
data collected (because traditional bounds like the number of data points no
longer work).

Here's a link to a paper on a Bayesian approach to the multi-armed bandit
problem:
[http://onlinelibrary.wiley.com/doi/10.1002/asmb.874/abstract...](http://onlinelibrary.wiley.com/doi/10.1002/asmb.874/abstract;jsessionid=364E386ABCD27DA0D592DC7810CA558A.f01t03)
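One common Bayesian treatment of the bandit problem that paper studies is Thompson sampling, which also captures the "switch as evidence swings" behaviour mentioned earlier in the thread. Here is a minimal sketch under Beta(1, 1) priors; the arm names, simulated click rates, and function names are made up for illustration:

```python
import random

def thompson_pick(stats, rng):
    """Pick the arm whose random draw from its Beta posterior is
    highest. stats maps arm name -> [successes, failures]."""
    best_arm, best_draw = None, -1.0
    for arm, (succ, fail) in stats.items():
        draw = rng.betavariate(succ + 1, fail + 1)  # Beta(1, 1) prior
        if draw > best_draw:
            best_arm, best_draw = arm, draw
    return best_arm

def simulate(true_rates, rounds=5000, seed=1):
    """Run Thompson sampling against simulated Bernoulli arms and
    return the accumulated [successes, failures] per arm."""
    rng = random.Random(seed)
    stats = {arm: [0, 0] for arm in true_rates}
    for _ in range(rounds):
        arm = thompson_pick(stats, rng)
        reward = rng.random() < true_rates[arm]
        stats[arm][0 if reward else 1] += 1
    return stats
```

Because each arm is chosen in proportion to its posterior probability of being best, traffic concentrates on the better variant automatically, with no fixed significance threshold needed.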

~~~
em441
Hi! This is Emily, the stats brains behind this. I will definitely look into
this and other Bayesian methods. Thanks!

~~~
btilly
I would also suggest switching from chi-square to the g-test.

If you look at the history, I believe you'll find that Pearson originally came
up with the g-test as an approximation to an exact test, and then found the
chi-square as an easier-to-compute alternative. That mattered back in the days
of pencil and paper, but there is no excuse today to use the worse technique.
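For comparison, the G statistic is computed on the same 2x2 table as the chi-squared test, replacing the squared deviations with a log-likelihood ratio. A minimal sketch, not tied to any particular library:

```python
import math

def g_test_2x2(conv_a, n_a, conv_b, n_b):
    """Likelihood-ratio (G) test on a 2x2 table of conversions vs.
    non-conversions: G = 2 * sum(observed * ln(observed / expected)).
    Same expected counts and degrees of freedom (1) as chi-squared,
    so the same p-value mapping applies."""
    observed = [conv_a, n_a - conv_a, conv_b, n_b - conv_b]
    total = n_a + n_b
    conv_total = conv_a + conv_b
    rows = [n_a, n_a, n_b, n_b]
    cols = [conv_total, total - conv_total, conv_total, total - conv_total]
    g = 0.0
    for o, r, c in zip(observed, rows, cols):
        expected = r * c / total
        if o > 0:  # a zero cell contributes nothing to the sum
            g += o * math.log(o / expected)
    g *= 2
    # Survival function of chi-squared with 1 df; clamp tiny
    # negative rounding error before the square root.
    p_value = math.erfc(math.sqrt(max(g, 0.0) / 2))
    return g, p_value
```

On well-behaved tables the G and chi-squared statistics are close, so swapping one for the other is a small code change.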

I'm going to avoid long discussions about the advisability of taking multiple
looks at results with a classical statistical test. But see my incomplete
series at [http://elem.com/~btilly/ab-testing-multiple-looks/index.html](http://elem.com/~btilly/ab-testing-multiple-looks/index.html)
for some of the considerations.

I never got into Bayesian statistics in there. In general they depend on the
existence of a prior distribution. Careful treatments will talk about this.
Sloppy ones assume one, don't talk about the one that they assume, and then
quote results without letting you know about this important assumption. As
long as you accept that assumption, they work well. But they can sometimes be
confusing to explain. (Until people "get" it. Then it can become irritating
getting them to STOP explaining it!)

If you want to discuss these issues more, my email is my name at gmail.com.

~~~
em441
"Upgrading" to the g-test is certainly something we could implement in the
future - Pearson's Chi Square was simply a starting place. I will definitely
have a look over the link you provided. Thanks!

------
_deh
Here's another paper that's relevant to this topic:
[http://www.qubitproducts.com/sites/default/files/pdf/most_wi...](http://www.qubitproducts.com/sites/default/files/pdf/most_winning_ab_test_results_are_illusory.pdf).
And discussion here:
[https://news.ycombinator.com/item?id=7287665](https://news.ycombinator.com/item?id=7287665)

------
vincentbarr
There are a number of testing tools relying on this methodology. This is
useful: Evan's Awesome A/B Tools ([http://www.evanmiller.org/ab-testing/](http://www.evanmiller.org/ab-testing/)).
It includes a chi-squared test, a sample size calculator, a two-sample t-test,
and a Poisson means test.

------
robdoherty2
At my company we've implemented a Bayesian A/B test in order to minimize the
amount of time a test has to run.

[http://visualrevenue.com/blog/2013/02/tech-bayesian-instant-...](http://visualrevenue.com/blog/2013/02/tech-bayesian-instant-headline-testing.html)

~~~
em441
I will definitely have a look through this. Thanks!

------
adamcowley
Nice to see someone questioning old statistical methods and then bringing new
methods to the masses.

