
A/A testing - luu
http://jvns.ca/blog/2015/02/06/a-a-testing/
======
thinkmoore
You might be interested in the statistical technique called "bootstrapping":
[http://en.wikipedia.org/wiki/Bootstrapping_(statistics)](http://en.wikipedia.org/wiki/Bootstrapping_\(statistics\))

The "A/A" method described is not a terribly robust way to estimate variance,
but the basic idea of using subsamples to estimate variance is what
bootstrapping does more systematically.
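
A minimal sketch of that idea in Python (the data here are simulated; any
array of per-session metrics would do):

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical per-session revenue for one control group.
    revenue = rng.exponential(scale=2.0, size=10_000)

    # Bootstrap: resample with replacement, recompute the statistic each time.
    boot_means = np.array([
        rng.choice(revenue, size=revenue.size, replace=True).mean()
        for _ in range(5_000)
    ])

    # The spread of the resampled means estimates the sampling variability
    # that the A/A split tries to eyeball with just two subsamples.
    print("bootstrap std. error of the mean:", boot_means.std(ddof=1))
    print("95% percentile interval:", np.percentile(boot_means, [2.5, 97.5]))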

~~~
czep
Yes, bootstrapping should be introduced far sooner. Most of the variables we
are interested in, say revenue per session, are in no way normally
distributed, thus violating the assumptions of classical two-sample t-tests.
Bootstrapping, and MC methods in general, provide a better solution than
parametric tests.

~~~
yummyfajitas
Most classical tests don't require the variable to be normally distributed,
they require the _test statistic_ to be. I.e., you don't need revenue/session
to be normal, you need sum(revenue_per_session) to be normal. As long as you
don't have any long tails and your variables are IID, that will happen:
[https://en.wikipedia.org/wiki/Central_limit_theorem](https://en.wikipedia.org/wiki/Central_limit_theorem)
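
A quick simulation makes the point concrete (the zero-inflated lognormal
below is purely illustrative, not anyone's real revenue data):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Skewed per-session "revenue": mostly zero, occasional purchases.
    def draw_sessions(n):
        buys = rng.random(n) < 0.05  # 5% of sessions convert
        return np.where(buys, rng.lognormal(3.0, 1.0, n), 0.0)

    # The raw variable is nothing like normal...
    print("skewness of raw revenue:", stats.skew(draw_sessions(100_000)))

    # ...but the mean over many sessions is close to normal (CLT).
    means = np.array([draw_sessions(5_000).mean() for _ in range(2_000)])
    print("skewness of sample means:", stats.skew(means))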

More interestingly, things like revenue/visitor have a known probability
distribution. It's not normal, but it is known. You can use a LOT fewer
samples if you use a parametric test (either Bayesian or SPRT) based on the
correct distribution.

If you use bootstrapping instead, you'll a) give up all your finite-sample
guarantees and b) wind up using a LOT more samples than you need.

~~~
eanzenberg
But how useful is comparing sum(revenue_per_session) when you want to test
significance of one batch to the other? Aren't you then just comparing 2
values and seeing which is greater?

If you compare the 2 batches of revenue/session distributions using a Monte
Carlo simulation you can calculate the probability that one is significantly
different than the other. This generalizes beyond the 2 sample t-test because
those underlying distributions are non-normal.

Please let me know if I'm thinking of this correctly (or not)

~~~
yummyfajitas
Ok, to test one relative to the other, you might test
W=sum(revenue_per_session_A - revenue_per_session_B). Interpret the
subtraction as a vector op. (Adjust a bit if you want to do a Welch test.)
Assuming the CLT holds this statistic is normally distributed. Assuming the
null hypothesis holds, it has mean 0.

Thus, you can do all your normal Stats 101 tests on it.
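
As a rough sketch with simulated data (equal group sizes so the vector
subtraction is well-defined; Welch's t-test is the packaged version of the
adjustment mentioned above):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    # Two groups drawn from the same distribution (i.e., the null is true).
    rev_a = rng.exponential(scale=2.0, size=20_000)
    rev_b = rng.exponential(scale=2.0, size=20_000)

    # W = sum(rev_a - rev_b): mean 0 under the null, ~normal by the CLT.
    w = np.sum(rev_a - rev_b)

    # The usual Stats 101 route, without assuming equal variances.
    t_stat, p_value = stats.ttest_ind(rev_a, rev_b, equal_var=False)
    print(f"W = {w:.1f}, Welch t = {t_stat:.3f}, p = {p_value:.3f}")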

 _If you compare the 2 batches of revenue/session distributions using a
Monte Carlo simulation you can calculate the probability that one is
significantly different than the other._

A frequentist test (which includes most bootstrap methods) can never tell you
this. Frequentist statistics doesn't even acknowledge this as a legitimate
question to ask.

Now I agree, if you can use the exact distribution of revenues directly in the
test, you can get answers even before you have enough samples for the CLT to
apply. But if you use a nonparametric method like bootstrap, you'll need to
use up a lot of samples unnecessarily.

------
skrebbel
I really love how this very clearly _visualises_ the need for statistical
significance. To me, a novice, the A/A/B chart is wildly more illustrative of
a point than the an A/B chart based of the same sample data with some
significance number next to it. I understand from some of the comments here
that there's all kinds of ways in which this A/A/B thing is subpar. But, if
the chance of someone misinterpreting a chart decreases more than the chance
of the chart itself being misleading, then isn't it a big win?

I'm really nerd sniped here. Is there any branch of statistics that focuses on
human understanding? For example, there's all kinds of blogs and stories out
there about how doctors routinely make wrong choices because they don't
understand statistics well enough. Is there any serious body of knowledge that
explores ways of getting these doctors to make these mistakes less frequently,
without having to send them to sites with titles like "An Intuitive
Explanation of Eliezer Yudkowsky’s Intuitive Explanation of Bayes’ Theorem"?

~~~
marvy
I have no idea. But I will chime in with one factoid I've heard a few times.
If you say things like "the false positive rate is 9%", people's intuitions
lead them astray, but if you say things like "9/100 healthy people are labeled
as sick by this test", then intuitions work much better.

------
chrisconley
Another cool way to cover your bases is to run Monte Carlo simulations.

At my previous employer, we open sourced a command line utility that we used
to validate our statistical models if anyone's interested:
[https://github.com/monetate/monte-carlo-simulator](https://github.com/monetate/monte-carlo-simulator)
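
The gist, independent of that particular tool, can be sketched in a few
lines: simulate many A/A experiments where no real difference exists and
check that the test's false positive rate matches its nominal alpha.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)

    alpha, n_sims, n = 0.05, 2_000, 5_000
    false_positives = 0
    for _ in range(n_sims):
        a = rng.exponential(scale=2.0, size=n)  # same distribution for
        b = rng.exponential(scale=2.0, size=n)  # both groups: null is true
        if stats.ttest_ind(a, b, equal_var=False).pvalue < alpha:
            false_positives += 1

    # Should print something close to 0.05 if the test is well calibrated.
    print("observed false positive rate:", false_positives / n_sims)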

------
alxv
A/A tests (also known as Null tests) are useful to validate that users are
assigned to the control and experiment groups without bias.

Offline resampling methods, like bootstrapping, are better if you're looking
to robustly estimate the variance of the experiment.
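
One concrete version of the bias check is a chi-square test on the bucket
counts, often called a sample ratio mismatch check (the counts below are
made up for illustration):

    from scipy import stats

    # With a 50/50 split, both buckets should get ~half the traffic.
    observed = [50_421, 49_830]
    expected = [sum(observed) / 2] * 2

    chi2, p = stats.chisquare(observed, f_exp=expected)
    print(f"chi2 = {chi2:.2f}, p = {p:.3f}")  # small p => biased assignment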

~~~
chucklarge
Yes, a Null test really should be the first test you run if you're looking to
get into A/B testing.

You should also run an ongoing A/A test across your site or app to have
confidence that your bucketing, data pipeline, stats tests, and effects on
metrics are working as expected over time.

------
dude_abides
Bootstrapping is a simple-to-explain and extremely powerful statistical method
that is essentially the equivalent of an A/A/A/A/A/A/.../B/B/B/B/B/B/... test
in OP's terminology.

What is especially powerful about bootstrapping is that it doesn't make any
simplifying assumptions about the underlying distribution, unlike other
methods to obtain confidence intervals.

------
std_throwaway
Your example shows that simple statistical tests aren't always that simple.
Especially when talking about small effects.

What you really want are confidence intervals which show what would be a
significant change. You can calculate that from your A-data and from your
B-data. If they overlap you probably aren't quite there yet.

Comparing A/A vs. B or A/A/...A/A vs. B/B/...B/B is a poor man's approach to
visualizing the distribution of the mean values.
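
A sketch of that comparison, using a normal-approximation interval for each
group's mean (data simulated here):

    import numpy as np
    from scipy import stats

    def mean_ci(x, confidence=0.95):
        # Normal-approximation confidence interval for the mean of x.
        m, se = x.mean(), x.std(ddof=1) / np.sqrt(x.size)
        z = stats.norm.ppf(0.5 + confidence / 2)
        return m - z * se, m + z * se

    rng = np.random.default_rng(4)
    print("A:", mean_ci(rng.exponential(2.0, 10_000)))
    print("B:", mean_ci(rng.exponential(2.1, 10_000)))
    # Heavy overlap between the two intervals => not enough data yet.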

Things get further complicated when doing a lot of tests. If you do hundreds
of A/B tests and a handful show a weakly significant result, that may actually
be a statistical fluke. The likelihood that a spurious but seemingly
significant result shows up when doing hundreds of tests can actually be
pretty high. You should rerun these tests with fresh data and check for
consistency, which in itself is some kind of A/A/B/B test.
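
For the many-tests problem, the standard fix is a multiple-comparison
adjustment; a sketch with hypothetical p-values:

    from statsmodels.stats.multitest import multipletests

    # Raw p-values from a batch of A/B tests; a few look "significant"
    # by chance alone.
    p_values = [0.003, 0.021, 0.049, 0.38, 0.71, 0.04, 0.62, 0.009]

    # Holm's step-down procedure controls the family-wise error rate.
    reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="holm")
    for p, pa, r in zip(p_values, p_adj, reject):
        print(f"raw p={p:.3f}  adjusted p={pa:.3f}  still significant: {r}")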

------
peeplaja
A/A testing is a waste of precious testing time
[https://plus.google.com/105925791633746539648/posts/EhFuZ6Fh...](https://plus.google.com/105925791633746539648/posts/EhFuZ6FhFSX)

~~~
blowski
Although the link you refer to here does say A/A testing is a waste of time,
the OP article said the headline was a bit of a misnomer - they're actually
referring to A/A/B testing.

------
WA
_and tells you how long you’ll need to run your experiment for to see
statistical significance._

I always thought that statistical significance isn't something you should
try to achieve, but merely a performance indicator of how good the experiment
was. Isn't it odd to try to "achieve significance over time"?

Shouldn't it be: "Your experiment requires 5,000 visitors and after that we'll
check if the result was significant enough to not be merely due to random
chance"?

Could someone with more statistical understanding elaborate this a bit?

~~~
tansey
_> Shouldn't it be: "Your experiment requires 5,000 visitors and after that
we'll check if the result was significant enough to not be merely due to
random chance"?_

That's basically what is happening with the tool, I think. It is asking for
how many users per day you get in order to approximate the sample size for x
days, then it's asking how much power you want. Power is the likelihood of
detecting a difference if there is one. It also asks what confidence level you
want. All of those together give you an approximate answer to the amount of
time, assuming # of users/day is roughly constant.

~~~
czep
There are 4 inputs needed to estimate sample size for a test: power,
confidence level, expected difference, and variance. You need all 4 before
you run any test. You use the A/A test to estimate variance. Power is the
probability of detecting an x% difference when one really exists. Typically
you see .8 or .9. The confidence level determines the probability of falsely
detecting a difference when one really does not exist (the significance
level, typically .05). The 4th item is the expected difference of the test.
If you want to detect a 1% difference, you will need a larger sample than if
you want to detect a 5% difference.

You have to know all 4 before you do a test. A test is designed specifically
to detect a certain difference. You cannot launch a test without knowing that
as part of your hypothesis.
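
Those 4 inputs map directly onto a standard power calculation; a sketch
using statsmodels (all numbers hypothetical, with the expected difference
and variance entering via Cohen's d):

    from statsmodels.stats.power import tt_ind_solve_power

    baseline_std = 5.0    # estimated from an A/A test
    expected_diff = 0.25  # smallest difference you care to detect
    effect_size = expected_diff / baseline_std

    n_per_group = tt_ind_solve_power(
        effect_size=effect_size,
        alpha=0.05,   # significance level
        power=0.8,    # probability of detecting the difference
        alternative="two-sided",
    )
    print(f"required sample size per group: {n_per_group:,.0f}")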

~~~
tansey
Yep, that's a more precise version of what I was saying w.r.t. estimating
sample size. I think the tool makes some assumption about variance, but the
other 3 are things you supply. Note that I wasn't saying anything about the
A/A test article, just the sample size estimator that's linked to.

------
czep
Please see the papers here: [http://www.exp-platform.com/Pages/default.aspx](http://www.exp-platform.com/Pages/default.aspx)

These are from the team that built Amazon's Weblab, the foundation of
large-scale web experimentation.

To be working in this field and not be familiar with this work, e.g. the
concept of A/A testing, is like deciding to build jet engines without having
heard the idea of a bypass ratio.

~~~
blumkvist
That's an excellent link! Wasn't aware of it, despite heavy involvement in
analytics.

------
ucha
A box plot is another - I think better - way to represent the variance in a
control group A. See
[http://www.mathworks.com/matlabcentral/fileexchange/screensh...](http://www.mathworks.com/matlabcentral/fileexchange/screenshots/9061/original.jpg)
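
A minimal matplotlib version of the same idea, splitting one simulated
control group into equal subsamples:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(5)

    # Box plots of identical control subsamples make the natural
    # variance visible, much like the A/A split in the article.
    control = rng.exponential(scale=2.0, size=12_000)
    subsamples = np.array_split(rng.permutation(control), 4)

    plt.boxplot(subsamples, labels=["A1", "A2", "A3", "A4"])
    plt.ylabel("revenue per session")
    plt.show()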

------
darkxanthos
Interesting stats hack definitely. I love the point the other comments are
making: Learn more about simulation and bootstrapping. It'll still require a
little probability but all of the results will make a ton of sense.

~~~
blumkvist
But, but, but... Optimizely?

------
calinet6
How would this be different from simply increasing the size of the control
group? Or from subdividing the control into N groups of sufficient size in
order to more effectively visualize the variation?

~~~
hcarvalhoalves
If you just have one big control, this still does not tell you about the
variance within the control. You can subdivide the control into N groups, but
I think you quickly increase the noise-to-signal ratio.

A control group split into two is a good compromise, and intuitive to reason
about, as the author points out.

~~~
calinet6
But statistically, they're exactly equivalent. I guess I don't get the
advantage.

------
TheLoneWolfling
Huh.

A quick and dirty way to avoid having to do much of any stats. Interesting.

------
lcedp
It's like a reverse of Simpson's paradox.

------
confiscate
Seriously? How is this any more accurate than pure "hunch" or eyeballing?

