
Statistical significance & other A/B test pitfalls - japetheape
http://www.cennydd.co.uk/2009/statistical-significance-other-ab-test-pitfalls/
======
shalmanese
It's disturbing to me how p < 0.05 is used somewhat unthinkingly as the test
for statistical significance simply because it's ubiquitous in science.

It seems to me that if you have even a somewhat popular app, you're gathering
enough data that you can afford to use p < 0.001 and avoid a lot of the
complexities of statistical analysis that come with p < 0.05. If you don't
have enough data to reach p < 0.001, it's probably better to work on
increasing traffic than to chase the piddling gains from A/B testing so early.
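
For a rough sense of what p < 0.001 costs in traffic, here's a
back-of-the-envelope power calculation for a two-proportion test; the
baseline rate, lift, and power below are illustrative assumptions, not
numbers from the article:

    import math
    from scipy.stats import norm

    def n_per_arm(p1, p2, alpha, power=0.8):
        """Approximate visitors per arm for a two-sided two-proportion
        z-test, using the standard normal-approximation formula."""
        z_a = norm.ppf(1 - alpha / 2)  # critical value of the test
        z_b = norm.ppf(power)          # quantile for the desired power
        var = p1 * (1 - p1) + p2 * (1 - p2)
        return math.ceil((z_a + z_b) ** 2 * var / (p1 - p2) ** 2)

    # Assumed scenario: 5% baseline conversion with a 10% relative lift.
    base, lifted = 0.05, 0.055
    for alpha in (0.05, 0.001):
        print(f"alpha={alpha}: ~{n_per_arm(base, lifted, alpha):,} per arm")

Tightening alpha from 0.05 to 0.001 roughly doubles the required sample
size in this scenario, which is the trade-off being proposed.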

~~~
btilly
People blindly stopping at 0.05 is doubly worrying given that people tend to
stop an A/B test as soon as it shows significance. That gives them multiple
chances to be wrong. Furthermore, if you are getting close to significance
very fast, then strong significance is close behind, so why not wait?

That said, if a test has been running for a while and you don't have an
answer, it can run for a looong time before it finishes. In my A/B testing
tutorial I explored that, starting at
<http://www.elem.com/~btilly/effective-ab-testing/#slide59> (just use the
arrow keys to move forwards and backwards through those slides). I found
that depending on whether random fluctuations took you in the same direction
as the underlying bias or the opposite, there tends to be an order of
magnitude difference in how long the test takes to run. Furthermore,
whichever variation is leading after many observations is usually really
better, and in the worst case is overwhelmingly likely to not be much worse.
Therefore there are times when it really is better to declare an answer and
move on.

If you wish to formalize this, you could use the strategy used by some
medical trials, where they decide in advance what confidence levels will
cause them to cut off early after 100 trials, 1,000 trials, or 10,000
trials, or to go all the way to (say) 50,000 trials. They then arrange that
the sum of the odds of making an early mistake stays below some acceptable
threshold.
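
A crude but valid version of that scheme is to split the overall error
budget evenly across the planned looks, Bonferroni-style. The look sizes and
the equal split below are assumptions for illustration:

    from scipy.stats import norm

    total_alpha = 0.05
    looks = [100, 1_000, 10_000, 50_000]            # pre-planned look sizes
    spend = [total_alpha / len(looks)] * len(looks) # equal split, sums to 0.05

    # Stop early only when |z| beats the cutoff for that look; because the
    # per-look error budgets sum to 0.05, the overall chance of an early
    # mistake stays below that threshold (Bonferroni is conservative).
    for n, a in zip(looks, spend):
        print(f"after {n:>6} trials: stop if |z| > {norm.ppf(1 - a / 2):.2f}")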

------
_delirium
Another common issue he doesn't mention: using observed differences (or
observed significance-test values) as the stopping criterion. The common
statistical-significance tests _don't_ work if the decision of when to stop
collecting data depends on the observed levels of significance. Instead, you
must decide ahead of time how many trials to run and stick to that decision,
or use more complicated significance tests. (This is the "multiple testing"
problem.)

For example, it works to flip two coins 50 times each, and then run a
statistical-significance test. It does _not_ work to flip two coins 50 times
each, run a test; if no significance yet, continue to 100, then 150, etc.
until you either find a significant difference or give up. That greatly
increases the chance that you'll get a spurious significance, because your
stopping is biased in favor of answering "yes": if you found a difference at
50, you don't go on to 100 (where maybe the difference would disappear again),
but if you _didn't_ find a difference at 50, you _do_ go on to 100.

Put differently, it's using a separate p-value for "what is the chance I
could've gotten this result in [50|100|150|...] trials with unweighted
coins?" to reject the null hypothesis each time, as if the tests were
independent. But the p-value for the entire series has to cover the union:
"what is the chance I could've seen this result at _any_ of the 50, 100,
150, or 200, ... stopping points with unweighted coins?", which is higher.
Yet that's exactly how many A/B tests are done: you start collecting data
and let the trials run until you find "significant" differences or give up.
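
A quick simulation makes the inflation concrete. Here two fair "coins" are
compared with a two-proportion z-test at every 50-flip checkpoint, and we
count how often _any_ checkpoint crosses p < 0.05; this is a sketch of the
scenario above, not code from the article:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    runs, checkpoints = 10_000, [50, 100, 150, 200]
    false_positives = 0

    for _ in range(runs):
        a = rng.random(200) < 0.5   # coin A, truly fair
        b = rng.random(200) < 0.5   # coin B, truly fair
        for n in checkpoints:
            pa, pb = a[:n].mean(), b[:n].mean()
            pooled = (pa + pb) / 2
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if se > 0 and abs(pa - pb) / se > norm.ppf(0.975):
                false_positives += 1
                break  # stop at the first "significant" look, as people do

    # Each individual test has a 5% false-positive rate, but peeking at
    # four checkpoints pushes the overall rate well above 5%.
    print(f"fraction of runs that ever 'reject': {false_positives / runs:.3f}")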

(It's possible to set up a series of tests where you choose when to stop based
on observed values, but you have to use different statistical machinery than
the common significance-tests.)

------
ugh
Wait, so people who do A/B tests didn't already do that? It drives me
absolutely crazy when I don't have any measure to assess how likely or
unlikely it is for some difference to be random.

~~~
btilly
No, people who do A/B tests have known this for years. It is the wannabes who
haven't sat down and figured out the statistics who run into trouble. See
<http://elem.com/~btilly/effective-ab-testing/> for an OSCON tutorial that I
did on the topic a couple of years ago, which includes all the gory
statistical detail you could want.

Furthermore, I note with interest that 2 of the 3 statistical techniques he
named (Student's t-test and ANOVA) only apply to cases where the observed
variables are themselves normally distributed, which is _not_ a good
description of binary yes/no outcomes. As for the remaining test, a
chi-square is appropriate to use, but statisticians tell us that the g-test
is preferable.
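
In SciPy the g-test falls out of the same contingency-table routine as the
chi-square, via its `lambda_` parameter; the conversion counts in the table
are made up for illustration:

    from scipy.stats import chi2_contingency

    # 2x2 table of [conversions, non-conversions] for variants A and B
    # (hypothetical counts).
    table = [[120, 880],
             [150, 850]]

    chi2, p_chi2, _, _ = chi2_contingency(table, correction=False)
    g, p_g, _, _ = chi2_contingency(table, correction=False,
                                    lambda_="log-likelihood")  # g-test
    print(f"chi-square: stat={chi2:.3f}, p={p_chi2:.4f}")
    print(f"g-test:     stat={g:.3f}, p={p_g:.4f}")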

~~~
sesqu
I don't see the problem. The total is very nearly normally distributed by the
central limit theorem, is it not?

~~~
btilly
The total is indeed nearly normally distributed, but the rate of convergence
(particularly in the tails) is not fast enough to avoid having those very
sensitive tests give wrong results.

Were it otherwise there would have been no need to develop the chi-square
test. It would have been entirely redundant. (It actually _is_ redundant
because we have the g-test. But evaluating the chi-square test just involves
taking squares, while the g-test involves taking natural logarithms. This made
the less accurate chi-square test much easier to do when people didn't have
computers to calculate it on. Today we should use the g-test, but few people
have heard of it.)
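
Concretely, the two statistics differ only in the term summed over the
cells (observed counts O, expected counts E); a minimal sketch with made-up
counts:

    import numpy as np
    from scipy.stats import chi2

    O = np.array([55, 45])        # observed heads/tails (hypothetical)
    E = np.array([50.0, 50.0])    # expected counts under the null

    chi_sq = np.sum((O - E) ** 2 / E)        # squares: easy by hand
    g_stat = 2 * np.sum(O * np.log(O / E))   # natural logs: needs tables

    df = len(O) - 1
    print(f"chi-square = {chi_sq:.3f}, p = {chi2.sf(chi_sq, df):.3f}")
    print(f"g-test     = {g_stat:.3f}, p = {chi2.sf(g_stat, df):.3f}")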

~~~
sesqu
Ah, right. I spent a while drawing up a proper plot of the likelihood of the
difference and the normal approximation of the difference, and saw that the
normal had too small a variance. The effect is still pretty credible in the
OP, though.

~~~
sesqu
Noprocrast caught out my attempt to edit. The normal variance is too _large_.

Here's the plot: black for the discretized (n=1000) binomial likelihood, red
for the normal approximation. The effect is clear, but a t-test won't show
it. I'm not familiar with the theory behind the g-test, but there's clearly
a lot of room for improvement at these sample sizes.

<http://img693.imageshack.us/img693/4880/bindiff.png>

~~~
btilly
It looks like you forgot to rescale the binomial distribution.

If X_i is a series of independent, identically distributed random variables
with mean m and variance v, then X_1 + X_2 + ... + X_n is approximately a
normal variable with mean nm and variance vn. Therefore

(X_1 + X_2 + ... + X_n - nm)/sqrt(vn)

is approximately a standard normal.

If you draw that graph, visually the two lines should lie right on top of each
other. To see the problem you need to zoom in on the tail and blow it up, and
only then will you see the issues with the convergence.
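
A sketch of that zoom-in, comparing the exact binomial tail with its normal
approximation (the n and p below are assumptions for illustration):

    import numpy as np
    from scipy.stats import binom, norm

    n, p = 1_000, 0.05               # illustrative conversion-style setup
    mu, sd = n * p, np.sqrt(n * p * (1 - p))

    # In the bulk the two tails agree; deep in the tail the ratio drifts
    # far from 1, which is what throws off very sensitive tests.
    for k in (60, 70, 80, 90):
        exact = binom.sf(k, n, p)          # P(X > k), exact
        approx = norm.sf((k - mu) / sd)    # normal approximation
        print(f"k={k}: exact={exact:.2e}, normal={approx:.2e}, "
              f"ratio={exact / approx:.2f}")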

~~~
sesqu
You're right, I did have the wrong scale, and should have realized. I believe
this plot to be correct. Black is binomial, red is normal, both for the
difference in conversion rates.

<http://img267.imageshack.us/img267/6691/binnormdifflikelihoodra.png>

The normal approximation seems reasonable, so a t-test shouldn't cause
problems. I also made a plot of the relative error of the normal
approximation; its tails are indeed too fat. In particular, there's a bump
to the left of 0, so a t-test would slightly overestimate p.

<http://img205.imageshack.us/img205/6691/binnormdifflikelihoodra.png>

~~~
btilly
The problem is that when you're near confidence you're in the tail, and the
Student t-test is _extremely_ sensitive to the shape of said tail. If n is
large enough this difference will be washed away as everything converges to
normals. But with smaller sample sizes the difference can be quite
significant.

------
seis6
I have a test to propose.

Many people think they will become millionaires if they follow the style of
person X.

Person X is like a trial in which a coin was tossed 10,000 times and came up
heads 6,000 times.

Since there is no information about the other people, the other trials, many
fall into the illogical belief that they will succeed in the same way.
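
A minimal simulation of that selection effect: among many people flipping
fair coins, the best record looks like skill if you never see the other
trials. The population and flip counts below are my own illustrative
choices, not the numbers in the comment:

    import numpy as np

    rng = np.random.default_rng(0)
    people, flips = 100_000, 1_000   # illustrative numbers
    heads = rng.binomial(flips, 0.5, size=people)  # everyone's coin is fair
    best = heads.max()
    print(f"best performer: {best}/{flips} heads ({best / flips:.1%})")
    # For any single pre-chosen person, a record this extreme would be very
    # unlikely; across 100,000 people, someone achieves it by luck alone.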

