
On the pitfalls of A/B testing - loarake
http://www.stavros.io/posts/pitfalls-b-testing/?new
======
kevinconroy
tl;dr: Don't bother with confidence intervals. Use a G-test instead.

Calculate it here: [http://elem.com/~btilly/effective-ab-testing/g-test-calculat...](http://elem.com/~btilly/effective-ab-testing/g-test-calculator.html)

Read more here:
[http://en.wikipedia.org/wiki/G-test](http://en.wikipedia.org/wiki/G-test)

And in plain English here:
[http://en.wikipedia.org/wiki/Likelihood_ratio_test](http://en.wikipedia.org/wiki/Likelihood_ratio_test)
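
If you want to see what the calculator is doing under the hood, here's a minimal Python sketch of the G statistic for a 2x2 conversion table (the counts are made up for illustration):

    import math

    def g_test(conv_a, total_a, conv_b, total_b):
        # Observed 2x2 table: conversions vs. non-conversions for A and B.
        observed = [[conv_a, total_a - conv_a],
                    [conv_b, total_b - conv_b]]
        row_sums = [total_a, total_b]
        col_sums = [observed[0][j] + observed[1][j] for j in range(2)]
        grand = total_a + total_b
        # G = 2 * sum(O * ln(O / E)) over all cells.
        g = 0.0
        for i in range(2):
            for j in range(2):
                expected = row_sums[i] * col_sums[j] / grand
                if observed[i][j] > 0:
                    g += 2.0 * observed[i][j] * math.log(observed[i][j] / expected)
        return g

    # Example: 50/1000 conversions for A vs. 70/1000 for B.
    # Compare G against a chi-squared distribution with 1 degree of
    # freedom (3.84 is the 5% critical value).
    print(g_test(50, 1000, 70, 1000))  # ~3.56, not quite significant

(If you have scipy around, scipy.stats.chi2_contingency(observed, lambda_="log-likelihood", correction=False) should give the same statistic.)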

------
cocoflunchy

        When A/B testing, you need to always remember three things:
    
        The smaller your change is, the more data you need to be sure 
        that the conclusion you have reached is statistically significant.
    

Is that a mathematically provable result? It seems hard to conceptualize what
a 'small' or 'big' change is. I would have expected another argument along the
lines of "If you make more than one change at a time, you are not going to be
able to know which one of your changes caused the result".

~~~
ska
This property is quite intuitive. Small and big here are relative to the
variance of the underlying distributions.

Simple case: think about trying to decide if a normal distribution has mean 0
or mean 1. If the std dev is 0.001, it won't take you very many samples to be
fairly confident at this resolution, but if the deviation is 1000, you'll need
a lot of samples.

Similarly, if the std deviation is only 1 but you are trying to decide whether
the mean is 0 or 0.001, far more samples are needed.

The intuition generalizes quite well. In the OP's case, the required sample
size will typically be proportional to the square of the ratio between the
deviation estimate and the effect size you want to measure.
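
As a rough sketch of that rule, here's the standard z-test formula in Python (1.96 and 0.84 correspond to 5% significance and 80% power; the calls plug in the examples above):

    def required_samples(sigma, delta, z_alpha=1.96, z_beta=0.84):
        # Approximate n needed to detect a mean shift of `delta` given
        # standard deviation `sigma`: n ~ ((z_a + z_b) * sigma / delta)^2
        return ((z_alpha + z_beta) * sigma / delta) ** 2

    print(required_samples(sigma=0.001, delta=1))  # << 1: trivially easy
    print(required_samples(sigma=1000, delta=1))   # ~7.8 million samples
    print(required_samples(sigma=1, delta=0.001))  # also ~7.8 million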

------
RyanZAG
I think the big issues people see in A/B testing come down to a fairly tricky
reason: the underlying distribution of the data. The usual ways of estimating
how big your sample size should be have one huge giraffe of a problem hiding
in them: they assume the underlying distribution is normal.

The correct way to estimate your sample size is to use the cumulative
distribution function of your underlying distribution. See a brief explanation
from Wikipedia here:
[http://en.wikipedia.org/wiki/Sample_size_determination#By_cu...](http://en.wikipedia.org/wiki/Sample_size_determination#By_cumulative_distribution_function)
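
As a sketch of what that looks like for conversion data, using the exact binomial CDF instead of a normal approximation (the tolerance and confidence level here are made up, and the doubling search is coarse but simple):

    import math
    from scipy.stats import binom

    def sample_size(p, tol, confidence=0.95):
        # Smallest n (by doubling) such that the observed rate hits/n
        # falls within p +/- tol with the desired probability, computed
        # from the exact Binomial(n, p) CDF.
        n = 100
        while n <= 10_000_000:
            lo = math.ceil((p - tol) * n)
            hi = math.floor((p + tol) * n)
            prob = binom.cdf(hi, n, p) - binom.cdf(lo - 1, n, p)
            if prob >= confidence:
                return n
            n *= 2
        return None

    print(sample_size(0.01, 0.002))  # on the order of 10,000 visitors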

Now what's the problem with A/B testing? Most of the stuff we test A/B for is
incredibly non-normal. Often 99% of visits do not convert. We're looking at
extremely skewed data here. Generally the more skewed the distribution, the
more samples we need.

For a very basic understanding of why: consider a very simple distribution
where 99.99% of the time you get $0 and 0.01% of the time you get $29 - fairly
similar to what we A/B test. Do you think a sample of 1000 or 10000 is going
to be anywhere near enough here? Of course not.
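
A quick simulation makes the point, with numbers straight from that example (requires numpy):

    import numpy as np

    rng = np.random.default_rng(0)
    p, value = 0.0001, 29.0  # $29 with probability 0.01%; true mean $0.0029

    for n in (1_000, 10_000, 1_000_000):
        hits = rng.binomial(n, p, size=1000)  # 1000 repeated experiments
        means = hits * value / n              # per-experiment mean payoff
        print(n, means.min(), means.max())

At n=1000 most runs see zero conversions, so the estimate swings between $0 and many times the true mean; only around a million samples does it start to settle down.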

~~~
noloqy
In statistics there is a "golden rule": when np > 5 and n(1-p) > 5, the
normal distribution is a good approximation for the binomial distribution.
Here n is the number of experiments and p is the conversion rate.

Our A/B testing data results from a Bernoulli experiment, and thus is
binomially distributed. So indeed, if we use tests that assume a normal
distribution and p is 0.0001, then np > 5 requires n > 5/0.0001, i.e. roughly
50k samples before the approximation is reasonable.
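
Or as a one-liner (purely illustrative):

    def min_samples_for_normal_approx(p):
        # n at which the binding constraint n*p (or n*(1-p)) reaches 5.
        return 5 / min(p, 1 - p)

    print(min_samples_for_normal_approx(0.0001))  # 50000.0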

