
Implications of use of multiple controls in an A/B test - jsnell
https://blog.twitter.com/2016/implications-of-use-of-multiple-controls-in-an-ab-test
======
gwern
> Some experimenters at Twitter proposed running A/A/B instead of A/B tests,
> to reduce the false positive rate due to “biased” buckets. An A/A/B test is
> one in which in addition to the provided control (A1), an experimenter would
> add another control group (A2) to their experiment.

I wonder what intuitions drove that suggestion? Would they be even happier
with 3 buckets (each a third of the size)? With 4 (quarters)? 5 (fifths)?
..._n_ buckets, 1 for each sample? At what point did those experimenters think
it would stop being an improvement over a single control group?

~~~
infogulch
Multiple control groups highlight the degree to which a result doesn't have
statistical significance yet. We know that peeking is a problem[1]: if you
peek and see that B is outperforming A by 4%, you might be inclined to stop;
if you instead saw that _A is also outperforming A by 4%_ , it makes you stop
and think.

Re: N buckets: you'll get diminishing returns, and since each bucket sees
only 1/N of the traffic, it will take N times as many data points to get a
statistically significant result in any bucket.

[1]: [http://www.evanmiller.org/how-not-to-run-an-ab-test.html](http://www.evanmiller.org/how-not-to-run-an-ab-test.html)
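
To see how badly peeking inflates the false positive rate, here's a quick
simulation (my own sketch, not from the article), assuming a binary
conversion metric and a naive pooled z-test recomputed after every batch:

    # Simulate A/A comparisons, "peeking" after each batch of users and
    # stopping the first time the naive z-test crosses 1.96 (p < 0.05).
    import numpy as np

    rng = np.random.default_rng(0)
    n_sims, n_batches, batch = 2000, 20, 500
    p = 0.10  # same true conversion rate in both "A" buckets

    false_positives = 0
    for _ in range(n_sims):
        a1 = rng.binomial(1, p, n_batches * batch)
        a2 = rng.binomial(1, p, n_batches * batch)
        for k in range(1, n_batches + 1):
            n = k * batch
            x1, x2 = a1[:n].sum(), a2[:n].sum()
            pooled = (x1 + x2) / (2 * n)
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if se > 0 and abs(x1 - x2) / n > 1.96 * se:
                false_positives += 1  # stopped early on a pure fluke
                break

    print(f"A/A 'significant' rate with peeking: {false_positives / n_sims:.2%}")

Even though the two buckets are identical by construction, with 20 peeks the
"significant" rate lands well above the nominal 5%.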

~~~
btilly
So you have a random chance of noticing what statistics ALREADY tells you.
Except that you only notice the problem sometimes, by chance, whereas
statistics can let you know about it every time.

This is a net loss.

~~~
infogulch
Oh, I totally agree: people _should_ just be using statistics correctly. But
I would argue that a poor tool that gets used is more effective than a
perfect tool that's never touched. Perhaps it could be a teaching
tool/stepping stone to get people to understand and use statistics correctly.

------
LukaAl
It is interesting to see the math behind why it doesn't work, and it is
probably a good approach never to discard an idea just because it sounds
stupid. But I see very little value in this line of thinking.

One of the big issues with experiments is how much value they really drive,
that is, what the real effects are. In many situations we run A/B tests for
changes whose real impact is a fraction of a percent of the tracked metric.
Anyone who has followed the discussion about p-hacking in science [1] should
start wondering whether the way we run experiments, given the small changes
we see, isn't nearing p-hacking (especially when ideas like this float
around), and whether we need to find more rigorous ways of running our
experiments.

[1] One of the first articles I found about the issue:
[http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106](http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106)
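
For a sense of scale, a standard two-proportion power calculation (with my
own illustrative numbers, not LukaAl's) shows why fraction-of-a-percent
effects are so hard to test honestly:

    # Sample size needed to detect a 0.5% *relative* lift on an assumed
    # 10% baseline conversion rate, at alpha = 0.05 and 80% power.
    from scipy.stats import norm

    alpha, power = 0.05, 0.80
    p1 = 0.10              # assumed control conversion rate
    p2 = p1 * 1.005        # 0.5% relative lift -> 0.1005
    z_a = norm.ppf(1 - alpha / 2)  # ~1.96
    z_b = norm.ppf(power)          # ~0.84

    n = (z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    print(f"~{n:,.0f} users per bucket")  # on the order of millions

Running with far fewer users than that, and then peeking until something
crosses p < 0.05, is exactly the regime where p-hacking concerns apply.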

~~~
squarecog
Co-author of the blog post here.

We covered some of this, at a high level, here:
[https://blog.twitter.com/2015/the-what-and-why-of-product-experimentation-at-twitter-0](https://blog.twitter.com/2015/the-what-and-why-of-product-experimentation-at-twitter-0)

Certainly not everything needs to be A/B tested -- depending on the nature of
the change and the kind of insight one is looking for, a multi-armed bandit
(MAB) may be better, if it applies, or a number of other approaches can be
more beneficial, including not running an experiment at all. After all,
Twitter itself came from a side project at the podcasting company Odeo. If
Twitter had been A/B tested, I'm sure it wouldn't have moved any podcasting
metrics in a meaningful way :).
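
A minimal epsilon-greedy sketch of the bandit idea (my own illustration, not
Twitter's system): shift traffic toward the better variant while the
experiment runs, instead of holding a fixed 50/50 split.

    import random

    rates = {"A": 0.10, "B": 0.12}  # hypothetical true conversion rates
    counts = {v: 0 for v in rates}  # impressions per variant
    wins = {v: 0 for v in rates}    # conversions per variant
    eps = 0.1                       # exploration probability

    for _ in range(100_000):
        if random.random() < eps or not all(counts.values()):
            arm = random.choice(list(rates))  # explore
        else:
            # exploit: pick the variant with the best observed rate
            arm = max(counts, key=lambda a: wins[a] / counts[a])
        counts[arm] += 1
        wins[arm] += random.random() < rates[arm]

    print(counts)  # most traffic ends up on B, the better variant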

I'm obviously biased -- but I've seen really important insights gained from
A/B tests, and wouldn't dismiss them too easily.

------
p4wnc6
Reading articles like this greatly saddens and depresses me about the job
prospects of working with statistics in modern tech companies.

I know others will disagree with me, but my feeling, after working a number
of jobs in which large systems were built to automate large-scale frequentist
hypothesis testing, is that these methods just wholly fail, that business
politics leads people to read whatever conclusions they'd like from the
results, and that credit gets given back to the statistical framework even
when it is not deserved.

I should be careful not to overstate this. I understand there is a
mathematical framework for these tools, and despite being trained in machine
learning and Bayesian methods, almost all of my professional experience has
been with frequentist statistics. Sometimes you have enough data that it
would swamp whatever prior you might choose; often you don't have (or don't
want to invest in getting) expert priors, and "non-informative priors" and
their ilk rope in some of the same problems that frequentism has. You can get
a limited amount of evidence from doing simple frequentist things, but you
also open the door to all of the misunderstandings of p-values, the ignoring
of effect sizes, model selection by directly comparing t-statistics, etc.,
which seem to be way, way less of a problem with other methods.

Some of the main sources of enlightenment for me on this have been:

[0] [http://lesswrong.com/lw/g13/against_nhst/](http://lesswrong.com/lw/g13/against_nhst/)

[1] [http://www.columbia.edu/~gjw10/achen04.pdf](http://www.columbia.edu/~gjw10/achen04.pdf)

[2] [http://www.indiana.edu/~kruschke/articles/Kruschke2012JEPG.pdf](http://www.indiana.edu/~kruschke/articles/Kruschke2012JEPG.pdf)

The last article, by Kruschke, spends a lot of time on exactly the type of
issue from the Twitter article -- something like two-group mean testing.

In general, I cannot understand why people continue to use frequentist methods
on policy problems -- that is, problems where you are inherently looking for
Pr(Hypothesis | Data), and where, due to all of the many, many pitfalls from
the above sources and elsewhere, you can be sure that Pr(Data | Hypothesis) is
not at all a suitable proxy for Pr(Hypothesis | Data).
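
To make that distinction concrete, here's Bayes' rule applied to a testing
pipeline, with made-up but plausible numbers (the 10% prior is purely my
assumption):

    # Why Pr(Data | Hypothesis) is a poor proxy for Pr(Hypothesis | Data):
    # if most proposed changes do nothing, many "significant" wins are noise.
    prior_real = 0.10  # assumed: 10% of tested changes are real improvements
    power      = 0.80  # chance a real effect reaches significance
    alpha      = 0.05  # false positive rate under the null

    p_sig = power * prior_real + alpha * (1 - prior_real)
    p_real_given_sig = power * prior_real / p_sig

    print(f"Pr(real improvement | significant) = {p_real_given_sig:.2f}")  # 0.64

Over a third of the "significant" launches in that pipeline are noise, even
with every individual test run by the book -- and no p-value, which conditions
on the null, can tell you that on its own.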

I would view any labor costs associated with figuring out why A/A/B testing
is not a useful idea as pure waste -- not because the math is wrong (the math
is interesting and good!), but because, from first principles, this just
seems like totally and completely the wrong way to approach this kind of
applied problem.

And yet, everyone is doing it, and baking it right into their huge production
analytics systems -- most notoriously all of the modern fly-by-night marketing
tech startups.

~~~
squarecog
Co-author of the Twitter post here.

I don't necessarily disagree with you! Though I hope you are not calling
Twitter a fly-by-night marketing startup :).

As a company, we've found A/B testing very valuable, but it certainly has
limitations, and pitfalls that one must be fairly experienced to avoid. You
can see some descriptions of what we do to achieve that in other posts in that
series. But it's fair to say that this takes some work (like the work
described in the "second control" post) to avoid badness, and it's fair to
question whether other methods can get us to the same end result -- the
ability to evaluate the impact of proposed product changes -- more cheaply or
more accurately.

~~~
p4wnc6
> Though I hope you are not calling Twitter a fly-by-night marketing startup
> :)

That remains to be seen.

