
p-Hacking and False Discovery in A/B Testing - gwern
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3204791
======
boron1006
I would be shocked if it were as low as 57%. As an intern, I found that the
analysts in charge of A/B tests often didn't have a background in science or
running experiments, and didn't really care. There were a couple of data
analytics teams in the company, and I think a lot of the developers didn't
like my team because we were seen as "fussier" than the other one. We required
people to preregister hypotheses, and run experiments for predetermined
amounts of time.

I don't think the tech environment is very conducive to running experiments.
Everything moves too fast; by the time you figure out that the results someone
gave you are BS, they've already been promoted three times and work as a
director at a different company.

I work in science now, and although people still p-hack like hell, there's at
least some sort of shame about it. There's a long-term cost too: I've met a
couple of researchers who have spent years trying to replicate some finding
they got early in their career through suspicious means.

~~~
tedsanders
Running experiments for preset lengths is a mistake. If an effect is strong,
it will show up earlier, and when it does you want to be able to switch
earlier. If you are running a drug trial of drug A vs a control, and drug A
kills 100% of the first 100 patients who take it while the control kills 0,
you end the trial immediately. You don’t continue to give it to 900 patients
just because you pre-registered to treat 1,000 patients, thinking that the
effect would be small. This is one reason I think Bayesian approaches are
better than frequentist approaches for A/B testing.

~~~
yichijin
Hey all, statistician from Optimizely chiming in here. Just wanted to say that
this is exactly right.

I wanted to add one detail--there actually _are_ ways to do early stopping
while staying within a frequentist approach. For example, most clinical trial
methods are not Bayesian but rather are fixed-horizon tests that have the
allowable amount of Type 1 error "spread out" amongst the multiple looks that
are planned in advance.

At Optimizely we essentially have a continuous version of this that does in
fact allow for multiple looks with rigorous control of Type 1 error. As
tedsanders mentions, the key upside is that if you start an experiment with a
larger-than-expected lift, you _can_ terminate it early. Then over many
repeated experiments, you gain a lot in terms of average time to significance.

The dissonance in this discussion mostly stems from the fact that this paper
(which we actually collaborated on!) uses data from 2014, before we rolled out
this new Stats Engine.

For more, I would encourage a look at our paper:
[http://www.kdd.org/kdd2017/papers/view/peeking-at-ab-tests-why-it-matters-and-what-to-do-about-it](http://www.kdd.org/kdd2017/papers/view/peeking-at-ab-tests-why-it-matters-and-what-to-do-about-it)
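
(For readers who want something concrete, here is a crude sketch of the
"spread the Type 1 error across planned looks" idea in Python. Splitting alpha
evenly across the looks with a Bonferroni correction is the bluntest possible
version -- real group-sequential designs use sharper boundaries, and this is
not Optimizely's actual method; the sample sizes and conversion rates are
invented.)

```python
import numpy as np
from scipy.stats import norm

def two_prop_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * norm.sf(abs(z))

def planned_looks_test(p_a, p_b, looks=(2000, 4000, 8000, 16000),
                       alpha=0.05, rng=None):
    """Interim looks planned in advance; each look spends alpha/K of the budget."""
    rng = rng or np.random.default_rng()
    k = len(looks)
    a = rng.random(looks[-1]) < p_a   # per-visitor outcomes, control
    b = rng.random(looks[-1]) < p_b   # per-visitor outcomes, variant
    for n in looks:
        if two_prop_p(a[:n].sum(), n, b[:n].sum(), n) < alpha / k:
            return True, n            # boundary crossed: stop early
    return False, looks[-1]

# A/A check: even with early stopping allowed, the false positive rate stays
# at or below alpha (Bonferroni is conservative, so somewhat below 5% here).
rng = np.random.default_rng(0)
hits = sum(planned_looks_test(0.10, 0.10, rng=rng)[0] for _ in range(2000))
print("A/A false positive rate:", hits / 2000)
```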

~~~
smallnamespace
What's the tradeoff vs. just taking a direct Bayesian approach?

In fact, why use an inferential framework at all (estimating some sort of
probability and using it to guide action), rather than directly using a
policy-learning framework, e.g. modeling this as Q-learning or a multi-armed
bandit problem?

If at the end of the day you have some objective function (e.g. 'making
money'), some _known_ space of actions (e.g. move this widget up the page,
change the color, engage with user this way), and a reasonable way to
associate those two, then isn't the _company_ literally doing reinforcement
learning over time?

It seems one benefit of a reinforcement learning framework is that it
maintains a set of actions that will still be explored in the future, without
forcing you to prematurely 'choose' whether A or B is actually better -- if A
is better in reality, it will be explored more and more often and B will
progressively be downweighted over time.
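
(A minimal sketch of what that looks like in practice, assuming Bernoulli
conversions and a simple Thompson-sampling bandit over two variants; the
conversion rates and traffic volume are invented.)

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = {"A": 0.10, "B": 0.12}   # unknown to the system, invented for the demo
wins = {arm: 0 for arm in true_rates}     # Beta(1,1) prior on each arm's rate
losses = {arm: 0 for arm in true_rates}

for visitor in range(20_000):
    # Thompson sampling: draw a plausible rate for each arm from its posterior,
    # then show the arm whose draw is highest.
    draws = {arm: rng.beta(1 + wins[arm], 1 + losses[arm]) for arm in true_rates}
    arm = max(draws, key=draws.get)
    converted = rng.random() < true_rates[arm]
    wins[arm] += converted
    losses[arm] += 1 - converted

# Over time B (the truly better arm) absorbs most of the traffic, while A keeps
# getting a trickle of exploration and is never "rejected" outright.
total = {arm: wins[arm] + losses[arm] for arm in true_rates}
print(total, {arm: wins[arm] / max(total[arm], 1) for arm in true_rates})
```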

~~~
srean
> If at the end of the day you have some objective function

That "If" often evaluates to false.

There are tough judgement calls involved in selecting the metric that the org
wants to optimize. It is very rare that business management commits to a clear
quantitative goal. The reasons are many -- weasel room is politically
important, selecting a metric that captures both short-term and long-term
goals is difficult, there is a lot of uncertainty in the costs because of
uncertainty about how overhead should be billed, etc.

This is fairly common. Typically, in these situations it's the PMs who make
the final call. There, the goal of the experiment is to glean as much
knowledge as possible and present it to the PM. If that comes at the cost of
exposing some customers to bad choices, so be it -- in other words, explore at
the cost of losses in the opportunity to exploit.

------
aisscott
Hi, I am one of the authors. We found that people p-hack with traditional
t-tests. Most A/B tests were run this way in the past and some still are. The
paper is using Optimizely data (from 2014) before Optimizely introduced new
testing in 2015 designed to solve the issues we found in the paper.

If you want to know how Optimizely prevents p-hacking check out the math
behind Optimizely’s current testing here:
[https://www.optimizely.com/resources/stats-engine-whitepaper/](https://www.optimizely.com/resources/stats-engine-whitepaper/)

~~~
boron1006
I'm curious about the wording "effects are truly null". I was always under the
impression that you can never really "accept the null", but rather only "fail
to reject the null".

~~~
gwern
In your standard NHST test, sure. But you can do different models. If I'm
reading OP right, what they do is a mixture model, in which effects are
assumed to come either from a zero-mean distribution (with probability _P_) or
a positive-mean distribution (with probability 1-_P_), and then you fit the
collection of >2k effect sizes to find the value of _P_ that best fits the
dataset as a whole. Apparently the best fit assumes that ~70% of effects are
actually ~0.

This can also be done nicely with a Bayesian mixture model or a spike-and-slab
multilevel model, and that is what is done in "What works in e-commerce - a
meta-analysis of 6700 online experiments", Brown & Jones 2017
[http://www.qubit.com/sites/default/files/pdf/qubit_meta_analysis.pdf](http://www.qubit.com/sites/default/files/pdf/qubit_meta_analysis.pdf)
(although they don't formulate it in terms of a sharp null but ask the more
relevant 'probability of a >0 beneficial effect', which for some kinds of A/B
test has a very low prior - like only 15% for 'back to top' A/B tests).
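
(For the curious, a rough sketch of fitting that kind of two-component mixture
by EM -- not the paper's actual code. The observed "effects" are treated as
z-scores, so a truly null effect shows up as N(0,1) noise; the simulated data
below just mirrors the ~70% null / positive-mean split described above.)

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Fake "observed effects": z-scores for 2,000 experiments. A fraction P_true
# are truly null (z ~ N(0,1)); the rest have a real positive effect.
P_true, mu_true, n = 0.7, 2.0, 2000
is_null = rng.random(n) < P_true
z = np.where(is_null, rng.normal(0, 1, n), rng.normal(mu_true, 1, n))

# EM for the mixture  P * N(0,1)  +  (1-P) * N(mu,1), estimating P and mu.
P, mu = 0.5, 1.0
for _ in range(200):
    null_lik = P * norm.pdf(z, 0, 1)
    alt_lik = (1 - P) * norm.pdf(z, mu, 1)
    r = null_lik / (null_lik + alt_lik)       # posterior prob each effect is null
    P = r.mean()                              # estimated share of truly-null effects
    mu = np.sum((1 - r) * z) / np.sum(1 - r)  # mean of the non-null component

print(f"estimated share of null effects: {P:.2f}, non-null mean: {mu:.2f}")
```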

~~~
boron1006
Thanks, that was very helpful.

------
nanis
Once at a programming conference, I was talking with a very senior developer
at a well known company. He was going on and on about their A/B testing
efforts.

I asked how they decided how long they would run an experiment for. The answer
was "until we get a significant result."

I was shocked then, but now I am used to getting these kinds of responses from
developers ... That and a belief that false positives are not a thing.

~~~
yichijin
Hi, Jimmy from Optimizely here. The practice you describe is actually
perfectly fine, _so long as you're not using a method designed to be checked
at a single point in time._

Take a look at clinical trials. Often in clinical trials there are multiple
phases, where early stopping is desirable in case the drug has higher-than-
expected efficacy (or more-harmful-than-expected side effects).

The types of tests conducted in clinical trials explicitly allow for multiple
looks while maintaining correct control of the Type 1 error rate. At
Optimizely we essentially have a version of this where the monitoring can be
conducted continuously with rigorous control of Type 1 error.

Check out this paper for more details:
[http://www.kdd.org/kdd2017/papers/view/peeking-at-ab-tests-why-it-matters-and-what-to-do-about-it](http://www.kdd.org/kdd2017/papers/view/peeking-at-ab-tests-why-it-matters-and-what-to-do-about-it)
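
(To make the continuous-monitoring idea concrete: below is a toy version of a
mixture sequential probability ratio test of the kind described in the linked
paper -- a normal-likelihood, known-variance version with a normal mixing
prior, not the production Stats Engine. The variance, the prior width tau2,
and the data stream are all invented.)

```python
import numpy as np

def msprt_monitor(diffs, sigma2, tau2=1.0, alpha=0.05):
    """Mixture SPRT: monitor a stream of per-visitor differences (variant - control).
    Under H0 (mean difference = 0), Lambda_n is a nonnegative martingale with
    mean 1, so stopping the first time Lambda_n >= 1/alpha keeps the false
    positive rate <= alpha no matter how often you peek."""
    s, n = 0.0, 0
    for d in diffs:
        n += 1
        s += d
        mean = s / n
        lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
            (n ** 2 * tau2 * mean ** 2) / (2 * sigma2 * (sigma2 + n * tau2))
        )
        if lam >= 1 / alpha:
            return n            # significant: stop here
    return None                 # never crossed the boundary

rng = np.random.default_rng(2)
# Invented example: variant lifts a metric with variance ~1 by 0.1 per visitor.
stream = rng.normal(0.1, 1.0, 50_000)
print("stopped after", msprt_monitor(stream, sigma2=1.0, tau2=0.01), "observations")
```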

~~~
Jabbles
Presumably using your method takes longer/requires more samples than a method
that only checks once?

~~~
srean
I haven't looked at the KDD paper, but in general it is the other way round.
With sequential hypothesis testing you should expect to need less data on
average.

~~~
computerphage
That's highly counter-intuitive to me. Can you try to motivate why that's the
case?

My intuition is that any sequential (which I translated to online) technique
could be used in a non-sequential context. By that reasoning, there's no way a
sequential technique could do _better_; at best it could be the same.

~~~
srean
This is 1940s stuff. Check out Wald.

Short answer: in sequential testing you can ask at intermediate stages whether
a satisfactory confidence has been reached. If yes, you are done; if not, you
can continue. On average you will hit a 'yes' sooner. For non-sequential tests
you cannot do this if you care about correctness (*). So the sample size needs
to be pessimistic for non-sequential protocols, and then you are bound to that
commitment.

(*) If your method ensures correctness even after inspection at intermediate
stages, then it's a sequential method by definition. There is some confusion
in the literature about Bayesian and sequential. They are orthogonal concepts.
Both Bayesian and frequentist tests of hypotheses can be sequential.
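
(A minimal sketch of Wald's SPRT for conversion data, just to make "ask at
intermediate stages" concrete; the two hypothesized rates, the error levels,
and the data are invented.)

```python
import math, random

def sprt(stream, p0, p1, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test for Bernoulli data.
    H0: conversion rate = p0, H1: conversion rate = p1 (> p0)."""
    upper = math.log((1 - beta) / alpha)   # cross this: accept H1
    lower = math.log(beta / (1 - alpha))   # cross this: accept H0
    llr, n = 0.0, 0
    for x in stream:
        n += 1
        # log-likelihood ratio contribution of one Bernoulli observation
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "undecided", n

random.seed(0)
true_rate = 0.12
data = (random.random() < true_rate for _ in range(100_000))
# Typically decides after a couple thousand samples -- on average fewer than a
# comparable fixed-sample-size test would need.
print(sprt(data, p0=0.10, p1=0.12))
```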

~~~
computerphage
Ah! I get it. Thank you!

------
paraschopra
Hi, founder of VWO here. We revamped our testing engine to Bayesian in 2015 to
prevent the ‘peeking problem’ with frequentist approaches. You can read about
our approach here:
[https://vwo.com/blog/smartstats-testing-for-truth/](https://vwo.com/blog/smartstats-testing-for-truth/)

~~~
paraschopra
Here’s the math of our Bayesian testing engine (for those who are interested
in knowing how we do it)
[https://cdn2.hubspot.net/hubfs/310840/VWO_SmartStats_technical_whitepaper.pdf](https://cdn2.hubspot.net/hubfs/310840/VWO_SmartStats_technical_whitepaper.pdf)

------
cle
Traditional A/B testing has very poor ergonomics. Experimenters are usually
put in awkward conflict-of-interest situations that create multiple strong
incentives not to perform rigorous, disciplined, valid experiments.

Null hypothesis significance testing is fundamentally misaligned with business
needs and is not a good tool for businesses. This is true in many fields of
science as well, but at least they have some mechanisms that try to ensure
that experiments are unbiased. Businesses often don't have the same internal
and external incentives that lead to those mechanisms, and so NHST is abused
even more.

~~~
havkom
Well, I have had well-run experiments which showed that the “currently
internally hyped” way of doing things is completely inferior to the “old
boring” and inexpensive way. The project leader was extremely hyped about
doing the experiment to prove the superiority of the new hyped way. When he
got the results, though, it was clear that this was not to be talked about and
the result would not be presented to his superiors.

~~~
tzahola
“We’re a data-driven organization! [as long as the data fits our agenda]”

It’s one of my favorite methodologies, next to “agile waterfall” and
“holacracy with managers, middle-managers and minibosses”.

------
raverbashing
People are treating A/B tests as if they were like testing a revolutionary new
drug or making a big discovery. But they aren't.

Assuming 'B' is the new option, there are 3 possibilities: A is better than B,
A is equivalent to B, or A is worse than B.

If your p-hacked experiment tells you to change from A to B when the null
hypothesis was actually correct, you aren't much worse off than you were in
the first place. And if your long-term metrics are in place, you can get a
better measure for your experiment.

Not to mention experimental failures caused by unaccounted-for variables.

~~~
dahdum
A large percentage of the experiments I've run were intended only to test
whether the variant was worse than the control; we didn't care much about how
much better the variant might be.

Usually these would be positive, consumer-facing features that we were
concerned might negatively affect conversion. The switch to Bayesian testing
made those a lot easier to run.

------
yichijin
Hi all. Jimmy, statistician from Optimizely chiming in.

We were excited to collaborate with the authors on this study. Keep in mind
that the data used in this analysis is from 2014, before we introduced
sequential testing and FDR correction specifically to address this p-hacking
issue. I expect these results are in line with any platform using
fixed-horizon frequentist methods.

Check out this paper for more details:
[http://www.kdd.org/kdd2017/papers/view/peeking-at-ab-tests-why-it-matters-and-what-to-do-about-it](http://www.kdd.org/kdd2017/papers/view/peeking-at-ab-tests-why-it-matters-and-what-to-do-about-it)

------
gingerlime
I created an open source A/B test framework[0], which also uses Bayesian
analysis on the dashboard. IANAS(tatistician), but from what I understand it’s
still better to plan the check point in advance, rather than stop when
reaching significance.

A couple of articles worth reading [1] [2] (I can’t exactly vouch for their
validity, but they seem to make some good, well-thought-out arguments).

[0] [https://github.com/Alephbet/gimel](https://github.com/Alephbet/gimel)

[1] [http://varianceexplained.org/r/bayesian-ab-testing/](http://varianceexplained.org/r/bayesian-ab-testing/)

[2] [http://blog.analytics-toolkit.com/2017/the-bane-of-ab-testing-reaching-statistical-significance/](http://blog.analytics-toolkit.com/2017/the-bane-of-ab-testing-reaching-statistical-significance/)

------
emodendroket
Isn't randomly looking for a pattern and then slapping a hypothesis on it post
facto a form of "p-hacking"? Because that's a completely commonplace and
unremarkable practice in technology.

~~~
andreareina
I think that unnecessarily discounts exploratory work. There's nothing wrong
with forming a hypothesis after seeing a pattern in data. But remember that
it's just a hypothesis -- an unconfirmed guess. After the hypothesis is
formulated, an experiment can be designed to test it and see whether its
predictions hold up.

~~~
emodendroket
I don't know about you, but the "exploratory" work is, in my experience, the
start and end of it.

------
geoprofi
My very recent meta-analysis of 115 A/B tests reveals that a large proportion
are highly suspect for p-hacking:
[http://blog.analytics-toolkit.com/2018/analysis-of-115-a-b-tests-average-lift-statistical-power/](http://blog.analytics-toolkit.com/2018/analysis-of-115-a-b-tests-average-lift-statistical-power/)

Going the Bayesian way, as suggested in some comments, is no solution at all,
as I am not aware of an accepted Bayesian approach to dealing with the issue:

[http://blog.analytics-toolkit.com/2017/bayesian-ab-testing-not-immune-to-optional-stopping-issues/](http://blog.analytics-toolkit.com/2017/bayesian-ab-testing-not-immune-to-optional-stopping-issues/)

(feel free to run sims, if you do not trust the logic ;-)) as well as on a
more general level:

[http://blog.analytics-toolkit.com/2017/5-reasons-bayesian-ab-testing-debunked/](http://blog.analytics-toolkit.com/2017/5-reasons-bayesian-ab-testing-debunked/)
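
(A small sketch of the optional-stopping problem with a naive Bayesian
decision rule, for anyone who wants a starting point for such sims -- this is
an illustration, not the simulations from the posts above. Run an A/A test
with no real difference between the arms, peek at the flat-prior Beta-Binomial
posterior after every batch, and stop as soon as P(B > A) exceeds 95%. The
batch size, horizon, and conversion rate are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(3)

def aa_test_with_peeking(p=0.10, batch=1_000, max_n=20_000,
                         threshold=0.95, draws=4_000):
    """One simulated A/A test: both arms share the same true rate, but we peek
    at the posterior after every batch and stop as soon as P(B > A) > threshold."""
    conv_a = conv_b = n = 0
    while n < max_n:
        n += batch
        conv_a += rng.binomial(batch, p)
        conv_b += rng.binomial(batch, p)
        # flat Beta(1,1) priors; posterior draws for each arm's conversion rate
        post_a = rng.beta(1 + conv_a, 1 + n - conv_a, draws)
        post_b = rng.beta(1 + conv_b, 1 + n - conv_b, draws)
        if (post_b > post_a).mean() > threshold:
            return True        # declared a "winner" even though there is none
    return False

sims = 200
false_wins = sum(aa_test_with_peeking() for _ in range(sims))
print(f"A/A tests that found a 95%-probable winner: {false_wins}/{sims}")
# Noticeably more than the 5% you might naively expect from "95% probability".
```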

------
baybal2
Seconding this. The number of e-commerce companies that, as far as I can
remember, A/B-tested themselves into bankruptcy approaches 20.

My take on this: even in cases where such testing was done by disciplined
statisticians (which is not the case at least 9 times out of 10 -- a math or
CS PhD is not a professional statistician by any stretch), the value of the
advice drawn from that data is marginal at best.

As e-commerce is the bread and butter of the cheap electronics industry, I saw
time and time again that "science driven" outfits lose out to others. Not so
much because their quality of decision-making was demonstrably inferior, but
because their obsession with "statistical tasseography" drained their
resources and shifted their focus away from things of obvious importance.

------
dahdum
After listening to Optimizely reps give a talk about their success with a
client (who was present), I suspect the support reps encourage these false
positives. They presented a few tests as fantastic wins, when they all had
basic flaws (like a cold audience vs. one self-selected for interest). Maybe
that was just one bad apple (doubtful)... but it was a large client and
someone they felt should represent the company as a speaker.

Concerns from the audience were dismissed and deferred to follow-up after the
talk. I never thought the same of Optimizely after that.

------
babl-yc
How can A/B testing tools be improved to prevent p-value hacking?

Could it be as simple as declaring your test duration before starting the
experiment, and having the tool add an asterisk to your results if you stop
the experiment early?

~~~
stenl
They should use Bayesian statistics, in which case it doesn’t matter when you
stop (more precisely, stopping when you get a result does not bias the
outcome; of course running the test longer will make the result more robust).
See [http://andrewgelman.com/2014/02/13/stopping-rules-bayesian-analysis/](http://andrewgelman.com/2014/02/13/stopping-rules-bayesian-analysis/)

------
User23
I had the enjoyable experience of sitting at a tech conference and listening
to the others in my group tell one of my friends that he had no idea what he
was talking about when he said they weren't designing a proper experiment.

I was the only one there who knew he was a particle physicist.

The OP is horrifyingly right.

------
raphaelrk
Optimizely being aimed at large enterprises, I'm curious how people do A/B
tests at their respective startups. Do most roll their own? How do you make
sure your science is sound?

~~~
t3scrote
We set audience criteria so that the user account must be created after the
test launches; from there it's a 50/50 control/treatment split (based on user
id). The metric we are optimizing for is almost always conversion rate. We
will turn the experiment off early if the treatment group is having really
poor numbers; otherwise, once about 4000 accounts have been entered into the
experiment we plug the numbers into a Bayesian calculator and call it a winner
if there is a 90%+ probability that the treatment beats the control.
[https://www.abtestguide.com/bayesian/](https://www.abtestguide.com/bayesian/)
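
(For anyone curious what a calculator like that does under the hood, here is a
minimal sketch assuming independent flat Beta priors on the two conversion
rates -- not necessarily what abtestguide.com implements; the counts below are
invented.)

```python
import numpy as np

def prob_treatment_beats_control(conv_c, n_c, conv_t, n_t,
                                 draws=200_000, rng=None):
    """Monte Carlo estimate of P(treatment rate > control rate) under
    independent Beta(1,1) priors on the two conversion rates."""
    rng = rng or np.random.default_rng()
    control = rng.beta(1 + conv_c, 1 + n_c - conv_c, draws)
    treatment = rng.beta(1 + conv_t, 1 + n_t - conv_t, draws)
    return (treatment > control).mean()

# Invented numbers: ~4,000 accounts split 50/50, control converts at 8.0%,
# treatment at 9.2%; the result comes out around 0.9 for these counts.
print(prob_treatment_beats_control(conv_c=160, n_c=2000, conv_t=184, n_t=2000))
```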

~~~
tzahola
So one in ten of your findings is bogus.

~~~
t3scrote
But isn’t that better than blindly introducing changes without testing them at
all?

------
nvahalik
ELI5... what’s a p-hack?

~~~
danielvf
“the misuse of data analysis to find patterns in data that can be presented as
statistically significant when in fact there is no real underlying effect.
This is done by performing many statistical tests on the data and only paying
attention to those that come back with significant results, instead of stating
a single hypothesis about an underlying effect before the analysis and then
conducting a single test for it.“

[https://en.m.wikipedia.org/wiki/Data_dredging](https://en.m.wikipedia.org/wiki/Data_dredging)

And here’s a great example of real-life p-hacking to get a catchy article
about the health benefits of chocolate:

[https://io9.gizmodo.com/i-fooled-millions-into-thinking-chocolate-helps-weight-1707251800](https://io9.gizmodo.com/i-fooled-millions-into-thinking-chocolate-helps-weight-1707251800)
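
A toy version of that "run many tests, keep the significant ones" mechanic,
with metrics that are pure noise (the number of metrics and group sizes are
arbitrary):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(4)

# One "experiment": two identical groups, but we measure 20 unrelated noise
# metrics and only report the ones that come back with p < 0.05.
group_a = rng.normal(size=(20, 50))   # 20 metrics x 50 subjects, no real effect
group_b = rng.normal(size=(20, 50))
pvals = ttest_ind(group_a, group_b, axis=1).pvalue
print("'significant' metrics out of 20:", (pvals < 0.05).sum())
# With 20 independent null tests you expect about one false positive per
# experiment, which is essentially how the chocolate study got its headline.
```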

~~~
vitus
FiveThirtyEight actually has a pretty good demo of p-hacking that demonstrates
how one underlying dataset can be used to derive any desired conclusion(s) by
deciding which factors to include / exclude.

[https://projects.fivethirtyeight.com/p-hacking/](https://projects.fivethirtyeight.com/p-hacking/)

