
A/B Testing Significance Calculator - bvanvugt
http://www.jeffpickhardt.com/abcalc
======
martian
I agree with other critiques here that this A/B testing calculator does little
to add to the conversation, and someone who uses it would be misled about how
to interpret the results.

The procedure my team uses is:

1) hypothesize an expected conversion rate and whether to use a one-sided test
(if we only care whether the variation moves the metric in one direction) or a
two-sided test (if we're testing whether the variation changed the state of
the world in either direction)

2) run those numbers through this power/sample size calculator (or the R
sketch after this list) to determine the number of visitors we need before we
can analyze the experiment:
[http://www.stat.ubc.ca/~rollin/stats/ssize/b2.html](http://www.stat.ubc.ca/~rollin/stats/ssize/b2.html)

3) wait for traffic

4) after enough visitors have come to the funnel, pass the resulting
conversion numbers through ABBA
[http://www.thumbtack.com/labs/abba/](http://www.thumbtack.com/labs/abba/) [1]
to see confidence intervals on our results
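
For those who prefer code to calculators, here is a minimal sketch of steps 1
and 2 in base R. The 10% baseline and 12% target below are assumptions for
illustration, not numbers from a real test:

    # required sample size to detect a lift from 10% to 12% conversion,
    # two-sided test at the usual 5% significance / 80% power settings
    power.prop.test(p1 = 0.10, p2 = 0.12, sig.level = 0.05,
                    power = 0.80, alternative = "two.sided")
    # the n it reports is the number of visitors needed *per group*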

For further reading, I highly recommend:

[http://visualwebsiteoptimizer.com/split-testing-blog/how-to-calculate-ab-test-sample-size/](http://visualwebsiteoptimizer.com/split-testing-blog/how-to-calculate-ab-test-sample-size/)

[1] disclaimer: my colleague wrote ABBA

~~~
blueblob
I also agree. This is effectively 2 lines of R code:

    
    
    binom.test(500, 2100)   # exact 95% CI for 500 conversions out of 2100
    binom.test(1000, 2000)  # exact 95% CI for 1000 conversions out of 2000
    

With boxplots or confidence-interval plots on top (which are not many more
lines). Whether this is useful or not depends on how much people know about
statistics, I guess.
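
For example, a quick confidence-interval plot of the two tests above:

    # plot the two conversion rates with their exact 95% CIs
    ci_a <- binom.test(500, 2100)$conf.int
    ci_b <- binom.test(1000, 2000)$conf.int
    rates <- c(500 / 2100, 1000 / 2000)
    plot(1:2, rates, xlim = c(0.5, 2.5), ylim = range(ci_a, ci_b),
         xaxt = "n", xlab = "", ylab = "conversion rate")
    axis(1, at = 1:2, labels = c("A", "B"))
    arrows(1:2, c(ci_a[1], ci_b[1]), 1:2, c(ci_a[2], ci_b[2]),
           angle = 90, code = 3, length = 0.05)  # error bars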

~~~
jrpt
This calculator makes reasonable assumptions. For example, it doesn't use
bootstrapping, and it assumes a normal distribution. I don't see how making
these assumptions makes the calculator bad; it just means that if you are a
stats person who wants to do something different, you'll need to implement
your own script. Outside of the HN crowd, there are people who don't code,
who nonetheless work with A/B tests and still need a measure of statistical
confidence.

Since a binomial distribution approaches a normal distribution for large N
(as long as the conversion rate isn't extremely close to 0 or 1), that's a
valid assumption to make.
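
You can check this with blueblob's counts; for an N this large, the normal
(Wald) interval and the exact one come out nearly identical:

    p <- 500 / 2100; n <- 2100
    p + c(-1, 1) * qnorm(0.975) * sqrt(p * (1 - p) / n)  # normal approximation
    binom.test(500, 2100)$conf.int                       # exact (Clopper-Pearson)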

The arguments in this thread over which distribution is precisely right are
overanalyzing, in my opinion. If I had an experiment that gave one winner
with the binomial distribution and a different winner with the normal
approximation, I'd feel little confidence in the results and would want to
run the test again or collect more data. I don't see how that would happen,
though.

Bootstrapping would be better, sure, but it would be confusing for an A/B
test calculator to give you different numbers each time you refresh the page.

You're free to write your own R scripts if you want to do it yourself.

I agree with martian's advice about figuring out a minimum number of visitors
you expect to need before analyzing the data, or else you're at risk of
biasing the experiment by declaring a winner prematurely.

------
altrego99
The confidence region appears symmetric around the mean. This indicates they
are using the normal approximation.

Exact confidence regions can be found if you use the binomial distribution
directly instead.

The question to ask, if m successes out of n is the observed frequency, is:
for which value of p is P(M <= m) = 0.975 (this gives the lower endpoint),
and for which value of p is P(M >= m) = 0.975 (the upper endpoint), where
M ~ Binomial(n, p).

It can be solved easily too.
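
For instance, in R the two endpoints fall directly out of beta quantiles
(the counts below are assumed for illustration):

    # exact (Clopper-Pearson) endpoints via the beta-quantile identity
    m <- 500; n <- 2100
    c(qbeta(0.025, m, n - m + 1), qbeta(0.975, m + 1, n - m))
    binom.test(m, n)$conf.int  # the same interval, out of the box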

~~~
cheesycheese
The normal approximation is notoriously [1][2] bad when the binomial
proportion you're trying to estimate is small and you don't have a big n. If
one still wants a closed-form, out-of-the-box formula (to avoid the exact
computation you suggested), one can use the Wilson score interval, for
instance.
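
As a sketch, here is the Wilson score interval written out in R, reusing the
counts from blueblob's example (base R's prop.test with correct = FALSE
reports the same interval):

    wilson_ci <- function(x, n, conf = 0.95) {
      z <- qnorm(1 - (1 - conf) / 2)
      p <- x / n
      center <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
      half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
      c(center - half, center + half)
    }
    wilson_ci(500, 2100)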

[1]:
[http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval](http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval)

[2]:
[http://projecteuclid.org/download/pdf_1/euclid.ss/1009213286](http://projecteuclid.org/download/pdf_1/euclid.ss/1009213286)

------
noelwelsh
Pretty animation, but I have some reservations about the maths:

- Normal approximation, as already noticed, ain't no good. Use the Wilson
score instead.

- No power calculation? Type II errors are far more important IMHO in typical
web applications, because switching costs are small.

- Non-overlapping 95% confidence intervals do not imply p < 0.05; the implied
p is actually much lower than that. Roughly 83% CIs correspond to p = 0.05,
because errors add in quadrature. (See the sketch below.)
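
To make the last point concrete, a back-of-the-envelope check under the
simplifying assumption of equal standard errors in both groups:

    # non-overlap of two z'-intervals means diff > 2 * z' * se, while
    # p < 0.05 needs diff > 1.96 * sqrt(2) * se (errors add in quadrature),
    # so z' = qnorm(0.975) / sqrt(2)
    z_prime <- qnorm(0.975) / sqrt(2)
    2 * pnorm(z_prime) - 1  # ~0.834, i.e. roughly an 83% interval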

There is a tension between making something simple for the lay person and
providing knobs for the expert to twiddle. I can see the case for removing the
knobs but the choices should at least be documented.

[It's late here, so this post is a bit slim on details. If you're interested,
sign up to [http://bandits.mynaweb.com/](http://bandits.mynaweb.com/), as the
next section covers confidence intervals.]

~~~
keithwinstein
Arguably the "best" confidence interval for this situation is the Blyth-Still-
Casella interval (preferred by StatXact), and the "best" hypothesis test is
Barnard's test.

Here is code to calculate both:
[https://github.com/keithw/biostat](https://github.com/keithw/biostat)

I say "arguably" literally -- there is a huge body of literature on confidence
intervals for binary proportions, much of it in disagreement about what is
important. The Wilson score interval and Agresti-Caffo and whatever else are
fine approximate methods that came of age when ease of calculation was a big
concern. But if you have a computer and you're baking one thing into a
library, may as well make it the best one you can.

Of course there is also plenty of merit to just picking some prior
distribution and integrating over the conditional probability distribution
given the data, aka a Bayesian approach.
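
For concreteness, a minimal Monte Carlo sketch of that in R, assuming uniform
Beta(1, 1) priors and the counts used earlier in the thread:

    draws_a <- rbeta(1e5, 1 + 500, 1 + 1600)      # 500 of 2100 converted
    draws_b <- rbeta(1e5, 1 + 1000, 1 + 1000)     # 1000 of 2000 converted
    mean(draws_b > draws_a)                       # posterior P(B beats A)
    quantile(draws_b - draws_a, c(0.025, 0.975))  # 95% credible interval on lift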

In practice I don't think this (stats geekery about the merits of different
confidence or credible intervals) is the most important part. The numerical
results from all these techniques will be pretty similar.

The important part is the design of the experiment, the interpretation, and
playing by the rules. If you want to dynamically tune a Web site to make the
most money as more information rolls in, that calls for a different experiment
than a standard hypothesis test. (Even if you just want to peek early at the
results and possibly abort the test, that calls for different tools and
different rules.)

~~~
noelwelsh
Hadn't heard of the Blyth-Still-Casella interval or Barnard's test before, so
thanks for that.

By "If you want to dynamically tune a Web site to make the most money as more
information rolls in, that calls for a different experiment than a standard
hypothesis test." you're talking about minimising regret / bandit algorithms?

------
alexgolive
Try out [http://www.evanmiller.org/ab-testing/chi-squared.html](http://www.evanmiller.org/ab-testing/chi-squared.html).
That's what we use at SimplePrints all the time. Great library of functions.

------
vii
This methodology is flawed, because in practice the conversion rate changes
over time. Several effects cause this temporal dependence, including the
natural effect certain techniques have on the product (for example, testing a
very loud, painful-to-escape upsell will cause some people to agree to the
upsell and never see it again, and others to feel pissed off). Other causes
of temporal dependence are the different mix of traffic, in terms of
geography and demographics, at different times of day and days of the week.

Even using proper Wilson confidence intervals and good methodology, with tens
of millions of impressions per group, we would see day-to-day variations in
rate that fell outside the previous day's confidence interval far more often
than expected (a 95% confidence interval should be exceeded about once every
few weeks, not every couple of days).

The proper methodology is to estimate by bootstrapping over a good selection
of the dangerous variables, including time.
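
As a sketch, resampling whole days rather than individual impressions
captures the day-to-day swings described above; `impressions` here is a
hypothetical data frame with one row per impression, with columns `day` and
`converted` (0/1):

    by_day <- split(impressions$converted, impressions$day)
    boot_rates <- replicate(2000, {
      resampled <- sample(by_day, length(by_day), replace = TRUE)
      mean(unlist(resampled))
    })
    quantile(boot_rates, c(0.025, 0.975))  # CI that respects daily variation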

------
tedsanders
Sorry to be another typical HN nitpicker, but one problem I see with many of
these approaches is that the confidence threshold (95% in this case) is
picked out of thin air. The reality is that even a single data point gives
you information. The information may not be reliable, but it's information
nonetheless. The only reason people don't redesign based on unreliable
information is that redesigns have costs: costs for the developer and costs
for the users. Given that different sites have different cost functions, they
should also have different significance thresholds.

One size does not fit all.

~~~
jessicaraygun
A 95% confidence level is the common recommendation across A/B testing
resources. The main goal here is to be able to say, with 95% confidence,
which variant performs best.

I agree that it would be useful to be able to change the confidence level
based on a site's specific needs - this is a feature I am looking to implement
in Confidence.js soon.

------
bertil
It's great!

Although most real-life cases I'm familiar with also include an average
basket size, even repeated purchases. Explaining that the big purchase on
version A is actually anecdotal, and that you need to consider the
distribution, the escalation slope, the rhythm of purchases… all that is very
difficult, especially with simple tools like this one around that make it
sound like such tests are actually simple and can work at any audience size.
Getting a decent separation of expected LTV on a four-pronged A/B/C/D test,
especially when your conversion rate is around 1% and your repurchase rate
well under 20%… that's a challenge that requires millions of users for months.

------
pkeane
[http://www.experimentcalculator.com/](http://www.experimentcalculator.com/)

------
timedoctor
This is a better tool, with more information provided (in an Excel
spreadsheet):

[http://visualwebsiteoptimizer.com/split-testing-blog/ab-testing-significance-calculator-spreadsheet-in-excel/](http://visualwebsiteoptimizer.com/split-testing-blog/ab-testing-significance-calculator-spreadsheet-in-excel/)

------
lifeisstillgood
OK, I give up even pretending to understand statistics anymore.

I'm going to pick up a neglected "Think Stats" from O'Reilly, and would
appreciate anyone's feedback on stats MOOCs on Coursera or similar.

(I'm finding long division difficult these days)

~~~
bvanvugt
I'm in the same boat - I've given up and leave these things to the experts :)

~~~
lifeisstillgood
Well, I'm not prepared to leave that much to the experts. I should at least
know how to design an experiment that gets me a 95% confidence result, how to
tell whether my null hypothesis is right or wrong, and at least be able to
remember the difference between calculating permutations and combinations.

It's like I know there is a castle over there and I know it has a turret, but
I have never gone in. Different from not knowing what is out there at all.

------
hammock
I wish these things would let you change the confidence level. 95% is really
only used in an academic context. If you think about the decisions you make
in a typical business context, you are not using more than 80% or so.

~~~
jessicaraygun
This is a feature that I would love to add to Confidence.js sometime in the
near future.

------
dalek2point3
Really? Now we have an entire website dedicated to the t-test? Surely this is
something you could learn to do in a spreadsheet, no?

