

Announcing Evan's Awesome A/B Tools - EvanMiller
http://www.evanmiller.org/announcing-evans-awesome-ab-tools.html

======
gostevehoward
These are great! Thank you Evan. Your sample size calculator is wonderful and
beats the hell out of the 90s tool I've been using which cautions me that my
browser must support "JavaScript" to use it :)

As an alternative to the Chi-squared calculator, people might want to check
out ABBA, a tool I wrote here at Thumbtack:

<http://www.thumbtack.com/labs/abba/>

It shares the visual component and the linkability, two great features you've
nailed. It lacks the live updating and the slider, which is really cool and
something I've wanted to add to ABBA for a long time. On the other hand, it
supports multiple groups compared against the baseline simultaneously and
incorporates a correction for multiple testing into its p-values and
confidence intervals, which can be handy. It also uses different mathematics
under the hood, but that's not going to be a concern for most users.
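
For anyone curious what a multiple-testing correction looks like in practice, here is a minimal sketch: two-proportion z-tests of each variant against a shared baseline, with a Bonferroni adjustment. This is purely illustrative -- ABBA's actual math differs (see its docs).

```python
import math

def compare_to_baseline(baseline, variants):
    """Two-proportion z-tests of each variant against a shared baseline,
    with a Bonferroni correction for testing several variants at once.

    baseline and each variant are (successes, trials) tuples.
    Returns a list of Bonferroni-adjusted two-sided p-values.
    """
    s0, n0 = baseline
    adjusted = []
    for s1, n1 in variants:
        p_pool = (s0 + s1) / (n0 + n1)                # pooled rate
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n0 + 1 / n1))
        z = (s1 / n1 - s0 / n0) / se
        p = math.erfc(abs(z) / math.sqrt(2))          # two-sided p-value
        adjusted.append(min(1.0, p * len(variants)))  # Bonferroni
    return adjusted
```

The Bonferroni correction is deliberately conservative: with k variants it just multiplies each p-value by k, which keeps the family-wise error rate under control at the cost of some power.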

Glad to see another step towards a more statistically-aware world!

~~~
yahelc
You'll be interested to hear that the digital analytics team at the Obama
campaign sometimes made use of your tool for sharable A/B calculations.

I even built a bookmarklet for quickly grabbing numbers off of a page
(usually from Google Analytics) and passing them to your calculator.
<https://github.com/yahelc/ABBA-bookmarklet>

~~~
gostevehoward
Wow this bookmarklet is awesome! I'm going to start using it. Thanks!

------
mcfunley
You scooped mine by like 20 minutes, which is weird, so I'll just put this
here:

<http://www.experimentcalculator.com/>

*edit: yours is awesome. Nice work.

~~~
arkitaip
I tried yours and liked its simplicity! I also learned that as a small
ecommerce site - 5k visitors/day - you may never have enough traffic to run
reliable tests :(

~~~
drewda
That's also why experimental psychologists have mixed feelings about running
power analyses* to figure out how many participants they'll need in a study
to yield statistically meaningful results--it's almost always a humblingly
high figure.

* <http://en.wikipedia.org/wiki/Statistical_power>
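
The humbling figures come straight out of the standard formula. Here is a rough sketch of a power analysis for comparing two conversion rates, using the usual normal approximation (a hypothetical helper, not taken from any of the tools above):

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
    """Approximate participants needed per group to detect a change
    from rate p1 to rate p2, at significance level alpha and the
    given power, via the normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)
```

Detecting a lift from 5% to 6% already demands roughly eight thousand participants per group, and halving the effect size roughly quadruples that -- hence the mixed feelings.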

------
bryanh
Great tools!

My goto reference is still the wonderful btilly presentation about how to A/B
test properly, with nice examples and code snippets:
<http://elem.com/~btilly/effective-ab-testing/>

He provided a full-on JavaScript tool that isn't as polished but works great:
<http://elem.com/~btilly/effective-ab-testing/g-test-calculator.html>

------
binarysolo
As a math/stats/data person who doesn't dabble much in web optimization -- can
someone explain to me what's awesome about this?

Not to belittle this nice package -- it looks like a basic stats calculator
for calculating sample size confidence levels with friendly visualization, and
I'm just trying to understand what's being valued on the market/industry right
now. Is it because current A/B testing software doesn't provide these basic
calculations? Or is it that it's well presented and visualized to a lay crowd?

~~~
christopheraden
Like all of A/B testing, it's applying a _very_ old statistical method (Chi-
Square was one of the first modern statistical techniques--by that I mean it's
113 years old) to an area where statistics has not commonly been used. This
makes it seem wonderful and novel as countless people suddenly realize that
statistics can be applied to fields that were previously untouched by
quantitative analyses.
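
The technique in question is just Pearson's chi-square on a 2x2 table of conversions vs. non-conversions per group. A minimal sketch (illustrative only, not any particular tool's code):

```python
def chi_square_2x2(s1, n1, s2, n2):
    """Pearson chi-square statistic for a 2x2 table built from two
    groups' successes and failures. At 1 degree of freedom, a value
    above 3.84 corresponds to p < 0.05."""
    table = [[s1, n1 - s1], [s2, n2 - s2]]
    total = n1 + n2
    chi2 = 0.0
    for row, row_total in zip(table, (n1, n2)):
        for j, observed in enumerate(row):
            col_total = table[0][j] + table[1][j]
            expected = row_total * col_total / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2
```

For a 2x2 table this statistic is exactly the square of the pooled two-proportion z-score, which is why chi-square and z-test calculators agree on significance.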

The statistics being used in the A/B testing world seems to be stuff you
would've learned in your very first statistics class. Judging from the
success of Optimizely and VWO, the focus is definitely more on the viz and
presentation than on using any cutting-edge techniques.

~~~
binarysolo
Gotcha, and thanks -- it just seemed trivial and I was under the (false)
assumption that confidence levels and selecting appropriate sample size should
be common knowledge, given how much polls are used in day-to-day life.

Good to know there's plenty of opportunity to bring better stats to high tech.
Of course, I understand a lot of the value comes from making those things
applicable and meaningful to the users...

~~~
christopheraden
Power analysis and CIs should be elementary, but I would assert that they are
actually not commonplace. Most people have a very surface-level understanding
of the latter, and little understanding of the former. In my opinion, A/B
Testing has actually done a great service to power analysis. I have seen many
experiments in the academic world (social sciences are somewhat notorious for
this) forgoing the power analysis for various reasons (fear: they would not be
able to get the sample size needed for 80% power, inability to control sample
size: you take whatever you can get with a convenience sample). As a
statistician, I breathe a sigh of relief with the amount of emphasis power
analysis receives in the A/B world. It's a step in the right direction (if
you're an acolyte to the dark world of Neyman-Pearson).

As for bringing better stats to high tech, I've thought of this as a wonderful
challenge. I'd especially like to see more focus on not violating modeling
assumptions (more non and semi-parametrics), and using some more modern
techniques from the ML and Bayes literature.

Hypothesis testing is so last century :). Would love to discuss it further
with some similarly-inclined HN folks.

Sorry for all the parentheticals. You'd think I was a Lisp programmer with
the number of parentheses I used.

------
christopheraden
"Use a dedicated statistical package from the '80s" Is your Wizard app not
also a dedicated statistical package? Also, I'm being pedantic here, but how
many "dedicated statistical packages" are actually from the 80s? The only ones
that come to mind are Stata and Statistica.

Is there a way to view the source code or formulas you use on your pages?
There's been a strong push in the academic statistics world for reproducible
research, which means public data, open source statistical code.

I ask because I'm curious about your two-sample t-test. Does it pool the
variances for all values of the two standard deviations? Pooling doesn't make
sense when one sd is 50 and the other is 2...
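
For reference, the standard alternative when the variances are that unequal is Welch's t-test, which skips pooling entirely. A quick sketch (not Evan's code -- exactly the question being asked):

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's unequal-variance t statistic plus its approximate
    Welch-Satterthwaite degrees of freedom. Unlike the pooled test,
    it stays sensible when one sd is 50 and the other is 2."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df
```

Note how the degrees of freedom collapse toward the smaller group's when one variance dominates, which is precisely the correction pooling throws away.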

------
WA
That is really awesome. One suggestion though: Make "relative" the default as
well as set "1−β" to 90 or 95%.

I assume that most people have a conversion rate of X, say 30%, and want to
increase this by Y, say 20% (30% to 36%). Consider the type of headline many
blog posts and reports have: "How we increased sales, trials, whatever by
50%". That's how people think.

And well, I usually aim for 95% significance.
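
Translating between the two conventions is trivial but worth being explicit about, since calculators disagree on which one they expect (hypothetical helper):

```python
def absolute_rates(base_rate, relative_lift):
    """Turn a base rate plus a relative lift ("up 20%") into the
    pair of absolute rates an absolute-mode calculator expects."""
    return base_rate, base_rate * (1 + relative_lift)
```

So a 20% relative lift on a 30% base rate means comparing 30% against 36% absolute, matching the example above.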

------
twog
Hey Evan,

We met a few times during last years Gig Tank (I am one of the cofounders of
<http://banyan.co>). Awesome to see you killing it. Are you planning on coming
back to Chattanooga anytime soon? Would love to grab beers. My email is in my
profile, and I would love to reconnect.

------
pwr
Are there any useful beginner resources for learning the statistical methods
needed to evaluate A/B tests?

------
viktorsr
MixPanel's Split Test calculator supports multiple groups, but doesn't have
anything visual:

<https://mixpanel.com/labs/split-test-calculator>

They use a two-proportion z-test.

------
heliostatic
These look great. For the sake of the permalinks, I'd love to see these hosted
on another domain. Maybe a Wizard app domain, to increase brand awareness?

~~~
cpsaltis
A github page would also be a good idea

------
aresant
Evan these are incredible tools, thank you for contributing another brick in
the wall for those of us that bleed A/B testing :)

I would love to see Optimizely and VWO embrace similarly non-ambiguous and
functional reporting as a default.

EG - just introducing Chi-Squared testing into a discussion with clients or
teams that think they're A/B testing properly by following Optimizely's
graphs usually turns the discussion on its head - "you mean there's a RANGE?
well how can we be certain?" etc.

Great work, thank you!

~~~
archildress
Curious - for someone statistically ignorant like me, can you provide some
detail on how we can use Evan's tools for split testing improvements?

I guess the question I'm boiling this down to is... Why are graphs and
comparisons of results that Optimizely or VWO produces not good enough?

~~~
aresant
The short answer is that they don't provide much depth.

I've seen Optimizely call something a "winner" with 95% confidence after
48hrs.

The triangular method we use w/the off-the-shelfers is something like:

a) Optimizely base stats

b) Convergence point analysis (useful to correct for day-of-week / unique
traffic swings)

c) Chi-Squared testing, which provides a range so that you can actually
assess the risk of a high-confidence test. E.g., look at the example in
Evan's tool, which shows 8.5%-22.1% and 13.3%-28.9%. This means that Sample 1
could convert as HIGH as 22.1% and Sample 2 as LOW as 13.3%. Even if the test
rated Sample 2 above Sample 1 with high confidence, you could be risking a
significant conversion decrease if you went with Sample 2. I.e., get more
data and don't just buy into the "this one is better"
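
To make that concrete, ranges of that sort fall out of a per-sample confidence interval for the conversion rate. A sketch using the simple normal (Wald) approximation -- Evan's tool may well use a different interval:

```python
import math

def conversion_ci(successes, trials, z=1.96):
    """Rough 95% confidence interval for a conversion rate, via the
    normal (Wald) approximation, clipped to [0, 1]."""
    p = successes / trials
    half = z * math.sqrt(p * (1 - p) / trials)
    return max(0.0, p - half), min(1.0, p + half)
```

With, say, 15 conversions out of 100 against 21 out of 100, the intervals come out near 8%-22% and 13%-29%: they overlap, which is exactly the "you mean there's a RANGE?" moment.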

