
A/B testing significance calculator (spreadsheet in Excel) - joshuacc
http://visualwebsiteoptimizer.com/split-testing-blog/ab-testing-significance-calculator-spreadsheet-in-excel/
======
nhebb
It's been years since I last cracked a statistics book. Would a t-test be more
accurate than a z-test or chi-squared for small samples? I'd love to see a
write-up on the statistics equations for A/B testing and guidelines on when to
use which.

~~~
btilly
_Would a t-test be more accurate than a z-test or chi-squared for small
samples?_

No. Here is a brief overview of the different tests.

The z-test depends on the assumption that the sum of random events is well
approximated by a normal distribution. This happens with a large enough sample
size. But the assumption is wrong for small sample sizes.

A further weakness with the z-test is that the same sample data is being used
to estimate variance and averages. Depending on the specifics of your actual
sample, both estimates can be off in a way that confuses the test. The t-test
corrects for this possibility _but_ assumes that the individual measurements
you are making are normally distributed. It is quite sensitive to this
assumption. Unfortunately in A/B testing the measurements are binary yes/no,
1/0 decisions, which are decidedly not normally distributed.

The g-test is designed to be a very good approximation to the exact problem
that A/B testing tries to solve. Unfortunately it requires you to compute
logs, which was hard before computers.

The chi-square test is designed as an easily computed good approximation to
the g-test. It only needs you to square numbers and add them together. It
therefore became the most widely used algorithm, and is the one that people
have heard of.

Now that we have computers, the g-test is the test we should use. But nobody
has heard of it. (However you're still free to use it! And should.)

If you get a large enough body of measurements then all tests converge to the
z-test.
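
A minimal sketch of how the three statistics compare on a 2x2 table of conversions vs. non-conversions. The function names and the sample counts are hypothetical, and it assumes the z-test uses a pooled variance with no continuity correction (calculators that do otherwise will give slightly different numbers). Under that assumption the squared z statistic equals the chi-square statistic, and the G statistic lands close to both once the counts are large.

```python
import math


def ab_statistics(conv_a, n_a, conv_b, n_b):
    """Return (z, chi2, g) for two variations.

    Assumes each variation has at least one visitor and that the combined
    data contains at least one conversion and one non-conversion.
    """
    observed = [
        [conv_a, n_a - conv_a],
        [conv_b, n_b - conv_b],
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]

    chi2 = 0.0
    g = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            obs = observed[i][j]
            # Chi-square: only squares, divisions, and sums.
            chi2 += (obs - expected) ** 2 / expected
            # G-test: needs a logarithm; an empty cell contributes zero.
            if obs > 0:
                g += 2.0 * obs * math.log(obs / expected)

    # Pooled two-proportion z statistic (assumption: no continuity correction).
    p_pool = (conv_a + conv_b) / total
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return z, chi2, g


def p_value_1df(stat):
    """Upper-tail p-value of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


z, chi2, g = ab_statistics(100, 1000, 130, 1000)
print(z * z, chi2, g)
print(p_value_1df(chi2), p_value_1df(g))
```

Both the chi-square and G statistics are compared against the same 1-degree-of-freedom chi-square distribution here, which is why one p-value helper serves for both.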

_I'd love to see a write-up on the statistics equations for A/B testing and
guidelines on when to use which._

<http://elem.com/~btilly/effective-ab-testing/> goes through this in some
detail.

~~~
equark
A/B testing is a Bayesian decision problem. Hypothesis testing is really the
wrong way to frame this problem. Dynamically it is a bandit problem and
statically it's just about maximizing expected utility, where the expectation
is taken with respect to the posterior distribution. This becomes especially
crucial when doing lots of tests or experimenting sequentially. All the
frequentist proposals don't have their advertised coverage rates in these
situations. Unfortunately most write-ups seem to be completely ignorant of
Bayesian statistics.
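
A rough sketch of the static version of that idea, under assumptions that are not in the thread: a uniform Beta(1, 1) prior on each conversion rate, utility that is linear in the conversion rate, and Monte Carlo draws from the posterior. The function name and its parameters are hypothetical.

```python
import random


def posterior_summary(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Estimate P(B beats A) and the expected loss of picking each variation.

    Assumption: Beta(1, 1) prior, so the posterior for each rate is
    Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    wins_b = 0
    loss_if_a = 0.0
    loss_if_b = 0.0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins_b += 1
        # Loss = conversion rate given up by choosing the worse variation.
        loss_if_a += max(rate_b - rate_a, 0.0)
        loss_if_b += max(rate_a - rate_b, 0.0)
    return {
        "prob_b_better": wins_b / draws,
        "expected_loss_choose_a": loss_if_a / draws,
        "expected_loss_choose_b": loss_if_b / draws,
    }


print(posterior_summary(100, 1000, 130, 1000))
```

Maximizing expected utility then amounts to picking the variation with the smaller expected loss. A real decision would swap in a utility that reflects revenue and switching costs rather than the raw rate.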

~~~
btilly
I mostly agree with you. However if you've ever tried to explain those
concepts to a bunch of PMs, you'll quickly discover that it isn't worth the
complication of trying to describe things the "right" way. There are more
important problems to deal with.

My one point of disagreement is with the claim that dynamically it is a bandit problem. The
reason that I disagree is that the things we are tracking and trying to
optimize have very strong time dependencies that aren't obvious. (Behavior
varies a lot by time of day, day of week, week of month, and month of year.)
Therefore attempting to dynamically optimize can very, very easily go astray.
That's not to say you can't do it. For instance I am quite supportive of using
dynamic techniques to explore a multi-variate space. But in the end I wouldn't
be comfortable unless you did a classic A/B test to validate the results.

------
ashishk
If you're interested in a web-based calculator, I've found this one to be
useful: <http://www.usereffect.com/split-test-calculator>

~~~
paraschopra
Yup, but do note that the calculator you linked is based on the Chi-square test and the
one in Excel is based on the z-test. So, results may differ slightly!

In case you need to use a web-based calculator (z-test), here is one on my
site: <http://visualwebsiteoptimizer.com/ab-split-significance-calculator/>

~~~
btilly
Why did you choose to use the z-test?

~~~
paraschopra
The z-test gives you confidence intervals on the conversion rate, which the G-test
and Chi-square test don't.

~~~
btilly
Good point. Confidence intervals that shouldn't be trusted too much, but
confidence intervals nonetheless.
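
For reference, a sketch of the kind of interval presumably meant here: the normal-approximation (Wald) interval on a single conversion rate. That is an assumption, since the thread doesn't say what the calculator actually computes, and `wald_interval` is a hypothetical name.

```python
import math
from statistics import NormalDist


def wald_interval(conversions, visitors, confidence=0.95):
    """Normal-approximation confidence interval for a conversion rate."""
    rate = conversions / visitors
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(rate * (1 - rate) / visitors)
    # Clamp to [0, 1] because the approximation can overshoot.
    return max(0.0, rate - half_width), min(1.0, rate + half_width)


print(wald_interval(100, 1000))
print(wald_interval(0, 10))
```

The second call illustrates why such intervals deserve some suspicion: with zero conversions out of ten visitors the interval collapses to a single point at zero, which overstates how much is known.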

