
The multiple comparisons problem - akerl_
https://www.chrisstucchio.com/blog/2015/ab_testing_segments_and_goals.html
======
aidanf
One of the big problems with AB testing is that the people running tests
within a company often don't have a strong understanding of the statistics
behind the tests. And unfortunately, they often have little interest in
performing statistically valid tests.

They might be from the marketing department, they might be web-designers, they
might even be growth hackers (eek), but most likely they are not
statisticians.

Their motivations may not align with running tests correctly. E.g. they may
want to strengthen the argument for their design over someone else's competing
design, or they may be looking for a number to serve up to their boss at their
monthly review meeting. But performing a statistically valid test is low on
their list of priorities.

I love articles like this that go into the minutiae of how to perform AB tests
correctly. But I think they only speak to a small portion of the people who
are doing AB tests.

~~~
rfergie
I agree that this is a problem but I wouldn't say it is a big problem in
comparison to the problem that most companies have - they don't run any tests
at all.

I actively need to stop myself from going too deep into the maths and caveats
of A/B testing with clients. Running tests with bad maths trumps running no
tests at all, in my experience.

------
noelwelsh
This is an interesting problem. There are so many ways to perform A/B testing
incorrectly that I expect most A/B tests performed in the wild yield
questionable results.

I think there is a lot of pseudo-scientific talk around A/B testing that leads
people to either put too much credence in their results, or, paradoxically,
avoid advanced techniques due to fear their results will be inaccurate.

I'm coming to the conclusion that one should treat A/B testing from more of an
economic point-of-view. You have so many views this month -- how are you going
to spend them on tests? I think one should look at A/B tests as a source of
information, not of truth, and thus avoid chasing after statistical rigor that
won't, in fact, be realised in most cases.

~~~
btilly
_I'm coming to the conclusion that one should treat A/B testing from more of
an economic point-of-view. You have so many views this month -- how are you
going to spend them on tests?_

I'm curious, did our conversations a few years ago help shape this opinion? My
guess is that we've both moved somewhat towards a middle position.

~~~
noelwelsh
Yes, I think so.

It might be of interest to you that there is now some published work on bandit
algorithms for when all you have is ranking, not absolute ordering. This is
close to what we were talking about:
[http://arxiv.org/pdf/1312.3393.pdf](http://arxiv.org/pdf/1312.3393.pdf)
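For readers who haven't met the baseline these ranking variants extend, here is a minimal Bernoulli Thompson-sampling step (an illustrative sketch of the standard bandit, not the algorithm from the linked paper; the function name is made up):

```python
import random

def thompson_choose(successes, failures, rng=random):
    """Bernoulli Thompson sampling: sample a conversion rate from each
    arm's Beta posterior (uniform prior) and play the arm with the
    highest draw. Exploration falls off naturally as evidence grows."""
    draws = [rng.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)
```

With lopsided evidence (say 900/1000 wins on arm 0 vs 10/1000 on arm 1), the sampler picks arm 0 almost every time.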

------
ayy88
Another neat approach to dealing with multiple comparisons is Bayesian
hierarchical models. Instead of correcting significance thresholds after the
fact, all comparisons are represented as parameters in the model from the
start. You would still need to 'preregister' segments as you suggest. See
Gelman's paper,
[http://www.stat.columbia.edu/~gelman/research/unpublished/mu...](http://www.stat.columbia.edu/~gelman/research/unpublished/multiple2.pdf)
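The partial-pooling idea can be sketched in empirical-Bayes form: shrink each segment's observed rate toward the pooled rate, with small segments shrunk hardest. This is a cheap stand-in for Gelman's full hierarchical model, and the function name and pseudo-count prior here are illustrative:

```python
def shrink_segment_rates(segments, prior_strength=100):
    """Partial pooling for (conversions, visitors) pairs: place a Beta
    prior centred on the pooled rate with `prior_strength` pseudo-counts,
    then return each segment's posterior mean. Small segments end up
    near the pooled rate; large segments keep their own rate."""
    total_conv = sum(c for c, n in segments)
    total_n = sum(n for c, n in segments)
    pooled = total_conv / total_n
    alpha = pooled * prior_strength
    beta = (1 - pooled) * prior_strength
    return [(c + alpha) / (n + alpha + beta) for c, n in segments]
```

A tiny segment at 5/50 (raw 10%) gets pulled most of the way back toward a ~5% pooled rate, which is exactly the shrinkage that tames spurious segment "wins".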

------
fiatmoney
The Kelly criterion doesn't get enough of a shout-out in discussions of A/B
testing and multi-armed bandits. It's particularly apt when you're discussing
conversion rates from customer acquisition campaigns - turning ads into
customers into ads into customers.

[http://en.wikipedia.org/wiki/Kelly_criterion](http://en.wikipedia.org/wiki/Kelly_criterion)
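For the curious, the classic Kelly fraction for a simple repeated bet is f* = p - q/b. A one-liner, with illustrative numbers (the campaign figures are made up):

```python
def kelly_fraction(p, b):
    """Classic Kelly criterion: fraction of bankroll to stake on a bet
    that pays b-to-1 net odds and wins with probability p (q = 1 - p)."""
    q = 1.0 - p
    return p - q / b

# Illustration: if $1 of ad spend returns $3 net when it "wins"
# (a converted customer) with 30% probability, stake ~6.7% of budget.
f = kelly_fraction(0.30, 3.0)
```

The appeal for the ads-into-customers-into-ads loop is that Kelly maximizes long-run compound growth rather than per-round expected value.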

------
trjordan
In many cases, I think this is fine.

Take the case given in the article: 2 designs that appear equal, but
segmenting reveals (incorrectly) that one performs better on Android and the
other on iPhone. Statistically, it doesn't matter which one you pick, because
you don't have a strong result. Anecdotally, you might go down the wrong path
and waste some time doing something like showing design A to Android and B to
iPhone. But the only thing lost is time -- probably a small amount of time
compared to creating the different designs.

Best case, you realize there's something technically wrong with your changes
in a certain segment because things are widely different. Worst case, you make
some changes that don't matter. Overall, the risk / reward of segmenting seems
worthwhile.
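How often segmenting manufactures these spurious "wins" is easy to simulate. In this sketch (function name and parameters are made up), both arms share the same true conversion rate, yet with ~20 segments some segment looks "significant" at the 5% level in well over half of the experiments:

```python
import random

def spurious_segment_wins(n_segments=20, n_per_arm=500, base_rate=0.05,
                          trials=200, seed=1):
    """Fraction of simulated A/B tests in which at least one segment
    shows a 'significant' difference (two-sided z-test, |z| > 1.96)
    even though both arms have identical true conversion rates."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        for _ in range(n_segments):
            a = sum(rng.random() < base_rate for _ in range(n_per_arm))
            b = sum(rng.random() < base_rate for _ in range(n_per_arm))
            pool = (a + b) / (2 * n_per_arm)
            se = (2 * pool * (1 - pool) / n_per_arm) ** 0.5
            if se > 0 and abs(a - b) / n_per_arm / se > 1.96:
                hits += 1
                break  # one false positive is enough to mislead
    return hits / trials
```

This is just the family-wise error rate: roughly 1 - 0.95^20 ≈ 64% when each segment gets its own uncorrected 5% test.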

~~~
yummyfajitas
This is very true - my post is mainly coming from a classical frequentist
hypothesis testing perspective. From a Bayesian/portfolio optimization
perspective, all that segmentation isn't hurting you much.

But from a business perspective, it's a huge distraction. Your marketers
shouldn't be wasting time segmenting - there is no reason to believe it's
making any money. And due to multiple comparisons, it's virtually guaranteed
that your A/B tests will steer the marketers in the direction of more
segmentation.

------
btilly
Here is a simpler back of the envelope solution to the problem of
segmentation.

Automatically ignore all statistics on segments that are below a certain
sample size - say, 4k successful conversions. If there are no real differences,
you'll still draw lots of wrong conclusions. But your odds of making
significantly harmful decisions will be very low, and some of the conclusions
you draw will be very good.

This procedure may make statisticians curse and swear. But it provides a
simple rule of thumb that is easy for non-statisticians to understand, and it
mitigates the really bad mistakes.
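As code, the rule of thumb is a single filter (the function name and dict layout here are illustrative):

```python
def reportable_segments(segment_stats, min_conversions=4000):
    """Drop any segment whose successful-conversion count is below the
    threshold, so nobody draws conclusions from its numbers."""
    return {name: stats for name, stats in segment_stats.items()
            if stats["conversions"] >= min_conversions}
```

Anything below the cutoff simply never appears in the report, which is far easier to enforce than a per-segment significance correction.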

~~~
yummyfajitas
My strong disagreement with you here is a good illustration of the practical
differences between Bayesians and Frequentists. As a Bayesian, I'm horrified
by the idea of doing anything other than building a segment model and letting
the fatness of the posterior handle the low sample sizes.

But yeah, that is a pretty good way for a marketer to avoid being _too_
stupid. And I can't say I haven't done this a bunch of times, particularly
when doing quick exploratory calculations.

------
acveilleux
So 10 years ago when I was in academia, this was a prime way to sink a paper's
conclusion in 5 minutes. Obviously the tech world has to relearn lessons from
other fields the hard way.

------
mattyfo
This is really helpful for understanding where data analysis goes wrong, but
he totally mischaracterizes proper segmentation strategies.

------
plg
[http://xkcd.com/882/](http://xkcd.com/882/)

