
Simple Sequential A/B Testing - revorad
http://www.evanmiller.org/sequential-ab-testing.html
======
warkon
A much simpler approach is to AABB test instead of AB test. Rather than
splitting your users into 2 buckets (A and B), split them into 4 buckets (A1,
A2, B1, B2). Give groups A1 and A2 one variation and groups B1 and B2 the
other variation. When A1's results match A2's and B1's match B2's, you know
you have statistical significance and can compare A1+A2 to B1+B2.
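
A minimal sketch of that four-bucket split, assuming each user has a stable ID
we can hash (the bucket names and hash choice here are just illustrative):

```python
import hashlib

BUCKETS = ["A1", "A2", "B1", "B2"]

def assign_bucket(user_id: str) -> str:
    # Hash a stable user ID so assignment is deterministic and
    # roughly uniform across the four buckets.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return BUCKETS[digest[0] % len(BUCKETS)]

def variation(bucket: str) -> str:
    # A1 and A2 see one variation, B1 and B2 the other.
    return "A" if bucket.startswith("A") else "B"

bucket = assign_bucket("user-42")
print(bucket, variation(bucket))
```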

~~~
SyneRyder
This is great advice. One of the best things about doing AABB testing is that
when your two A groups & B groups _don't_ converge, you can identify bugs in your
testing procedure or measure the margin of error (since you know those groups
are seeing the same thing and should be performing identically). Seeing two
identical A groups with wildly different results will make you more skeptical
of generic A/B results & make you more rigorous about your testing.
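
A sketch of that A1-vs-A2 consistency check, using a two-proportion z-test;
the counts and the 0.05 threshold are assumptions for illustration:

```python
from math import sqrt
from statistics import NormalDist  # Python 3.8+

def two_proportion_p_value(conv1, n1, conv2, n2):
    """Two-sided p-value for the difference between two conversion rates."""
    p1, p2 = conv1 / n1, conv2 / n2
    pooled = (conv1 + conv2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A1 and A2 saw the same variation, so a tiny p-value here points at a
# bug in bucketing or tracking rather than a real effect (though, as a
# reply below notes, it will also fire by chance ~5% of the time).
p = two_proportion_p_value(conv1=120, n1=5000, conv2=118, n2=5000)
if p < 0.05:
    print(f"A1 vs A2 differ (p={p:.3f}) -- check the test setup")
```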

~~~
jfarmer
> since you know those groups are seeing the same thing and should be
> performing identically

That's not how A/B testing works. 95% confidence means you should expect a 5%
false positive rate, i.e., you should expect the difference measured in an A/A
test to be statistically significant 5% of the time. You'll always measure
_some_ difference, since no two random samples will be 100% identical in every
regard.

The procedure you and the parent propose is tantamount to selecting 1 out of
every 20 test results and discounting it for no real reason. It adds extra
cost to your A/B testing without producing more reliable results.

See also: [https://xkcd.com/882/](https://xkcd.com/882/)

It's a different matter if you're running multiple A/A-type tests over an
extended period of time to ensure that the false positive rate is actually 5%,
a kind of meta-statistical test. As a sanity check this is sound, but vastly
more expensive than what the OP is proposing (for example). I've never seen
anyone use A/A, A/A/B, A/A/B/B, etc. tests in this way. Rather, I've only ever
seen them used as you and the OP suggest: the two A buckets should be "the
same" and if they aren't, the results should be thrown out.

------
tristanz
Why not just estimate p(A - B | observed data) and be done with it?

~~~
yummyfajitas
Because that's a lot harder. At VWO we do exactly this in our new SmartStats
tool - we believe Bayesian stats are far more intuitive to the end user. But
we've got a whole compute cluster running these calculations 24/7. For
comparison, our old frequentist calculations were done in PHP at page render
time.

A formula that can be easily plugged into Excel does have value over a formula
that requires someone to write a Python script and do a Monte Carlo
simulation.
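
For concreteness, here's the kind of Python-script-plus-Monte-Carlo estimate
being contrasted with Excel, as a minimal sketch: put a Beta posterior on each
variation's conversion rate and sample to estimate P(B > A). The uniform
priors and the counts are assumptions:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=100_000):
    """Monte Carlo estimate of P(rate_B > rate_A | observed data),
    with independent uniform Beta(1, 1) priors on each rate."""
    wins = 0
    for _ in range(samples):
        rate_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / samples

# Illustrative counts, not real data.
print(prob_b_beats_a(conv_a=120, n_a=5000, conv_b=145, n_b=5000))
```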

Evan's method is also nice in that it doesn't require you to know much about
your (possibly very small) conversion rate ahead of time.

