I don't think the tech environment is very conducive to running experiments. Everything moves too fast; by the time you figure out that the results someone gave you are BS, they've already been promoted three times and work as a director at a different company.
I work in science now, and although people still p-hack like hell, there's at least some sort of shame about it. There's a long-term cost too: I've met a couple of researchers who have spent years trying to replicate some finding they got early in their career through suspicious means.
In my team we try to be methodical. I'm just a lowly engineer, but one of my teammates is a statistician and another is a PhD student. We know we need to pre-register the hypothesis and the significance level we're testing at first, divide the groups randomly, and run the experiment for a set time period.
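Roughly, the mechanics look like this (a minimal sketch; the alpha, sample size, and conversion counts below are made up, not our actual setup):

```python
# Minimal sketch of a pre-registered, fixed-horizon A/B test.
# The hypothesis, significance level, and per-group sample size are fixed
# *before* data collection; the test is evaluated exactly once, after the
# pre-set number of users has been reached. All numbers are made up.
from scipy import stats

ALPHA = 0.05          # significance level, declared up front
N_PER_GROUP = 20_000  # from a power calculation, also declared up front

def fixed_horizon_test(conversions_a, conversions_b, n=N_PER_GROUP):
    """Two-proportion z-test, run only after both groups reach n users."""
    p_a, p_b = conversions_a / n, conversions_b / n
    p_pool = (conversions_a + conversions_b) / (2 * n)
    se = (p_pool * (1 - p_pool) * 2 / n) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * stats.norm.sf(abs(z))   # two-sided
    return z, p_value, p_value < ALPHA

# e.g. 1,000 vs 1,100 conversions out of 20,000 users per arm
print(fixed_horizon_test(1_000, 1_100))
```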
We've gotten a lot more negative results, and been proven wrong in our guesses far more often, than the other teams. For some reason I take pride in that.
I wanted to add one detail -- there actually are ways to do early stopping while staying within a frequentist approach. For example, most clinical trial methods are not Bayesian but rather are just fixed-horizon tests that have the allowable amount of Type 1 error "spread out" amongst the multiple looks that are planned in advance.
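To make the "spread out" idea concrete, here's a crude sketch that splits the error budget Bonferroni-style across the planned looks (real group-sequential designs use alpha-spending functions like Pocock or O'Brien-Fleming, so treat this as an illustration only; all numbers are invented):

```python
# Crude illustration of spreading a 0.05 Type 1 error budget across K
# planned looks. Bonferroni-style splitting is conservative; real trials
# use alpha-spending boundaries (Pocock, O'Brien-Fleming) instead.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
ALPHA, K, N_PER_LOOK, N_SIMS, RATE = 0.05, 5, 2_000, 2_000, 0.10
per_look_alpha = ALPHA / K                    # the "spread out" budget

false_positives = 0
for _ in range(N_SIMS):
    # Simulate a true null: both arms share the same conversion rate.
    a = rng.binomial(1, RATE, K * N_PER_LOOK)
    b = rng.binomial(1, RATE, K * N_PER_LOOK)
    for k in range(1, K + 1):
        n = k * N_PER_LOOK
        se = np.sqrt(2 * RATE * (1 - RATE) / n)   # known rate, for simplicity
        z = (b[:n].mean() - a[:n].mean()) / se
        if 2 * norm.sf(abs(z)) < per_look_alpha:  # stop early at this look
            false_positives += 1
            break

print(false_positives / N_SIMS)   # stays at or below the nominal 0.05
```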
At Optimizely we essentially have a continuous version of this that does in fact allow for multiple looks with rigorous control of Type 1 error. As tedsanders mentions, the key upside is that if you start an experiment with a larger-than-expected lift, you can terminate it early. Then over many repeated experiments, you gain a lot in terms of average time to significance.
The dissonance in this discussion mostly stems from the fact that this paper (which we actually collaborated on!) uses data from 2014, before we rolled out this new Stats Engine.
For more, I would encourage a look at our paper: http://www.kdd.org/kdd2017/papers/view/peeking-at-ab-tests-w...
In fact, why use an inferential framework at all (estimating some sort of probability and using it to guide action), rather than directly using a policy learning framework, e.g. modeling this as a Q-learning or multi-armed bandit problem?
If at the end of the day you have some objective function (e.g. 'making money'), some known space of actions (e.g. move this widget up the page, change the color, engage with user this way), and a reasonable way to associate those two, then isn't the company literally doing reinforcement learning over time?
It seems one benefit of a reinforcement learning framework is it maintains a set of actions that will still be explored in the future without forcing you to prematurely 'choose' whether A or B is actually better—if A is better in reality, then it will be explored more and more often and B will progressively become downweighted over time.
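For concreteness, here's a minimal Thompson-sampling sketch of that idea (the 10%/11% conversion rates are made up): the worse arm keeps getting a trickle of traffic, but is steadily downweighted rather than ever being formally rejected.

```python
# Minimal Thompson sampling over two variants with Bernoulli rewards.
# Each arm keeps a Beta posterior over its conversion rate; traffic flows
# toward the arm more likely to be best, with no hard accept/reject step.
import numpy as np

rng = np.random.default_rng(42)
true_rates = {"A": 0.10, "B": 0.11}               # unknown to the algorithm
posteriors = {arm: [1, 1] for arm in true_rates}  # Beta(1, 1) priors
pulls = {arm: 0 for arm in true_rates}

for _ in range(50_000):
    # Sample a plausible conversion rate per arm, then play the best sample.
    samples = {arm: rng.beta(a, b) for arm, (a, b) in posteriors.items()}
    arm = max(samples, key=samples.get)
    reward = rng.random() < true_rates[arm]
    posteriors[arm][0] += reward            # successes
    posteriors[arm][1] += 1 - reward        # failures
    pulls[arm] += 1

print(pulls)   # B ends up with the large majority of the traffic
```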
That "If" often evaluates to false.
There are tough judgement calls involved in selecting the metric that the org wants to optimize. It is very rare that business management commits to a clear quantitative goal. The reasons are many -- weasel room is important politically, selecting a metric that captures both short-term and long-term goals is difficult, there is a lot of uncertainty in the costs due to uncertainty about how overhead should be billed, etc.
This is fairly common. Typically, in these situations it's the PMs who make the final call. There, the goal of the experiment is to glean as much knowledge as possible and present it to the PM. If that comes at the cost of exposing some customers to bad choices, so be it -- in other words, explore at the cost of losses in the opportunity to exploit.
Probably because of the maintenance cost of the code that was only explored but never exploited.
People don't seem to trust the system to make the right decisions even though you can do simulations and have the mathematics to show it is correct.
In a realistic scenario for us, Bayesian and frequentist approaches will probably converge to a point that's close enough for a company that runs a website (i.e., we weren't killing anyone by leaving our experiment running). We also weren't getting massive fluctuations in effect size.
The cost of the Bayesian approach, in terms of learning a new system of statistics, programming everything up from scratch, and interpreting results, probably wouldn't be worth the efficiency gains. If we were creating an A/B testing program, then I probably would.
1. Data collection is expensive (time or money)
2. Keeping with the status quo in the presence of new evidence is problematic (withholding a promising new drug)
3. Continuing with the experiment in the presence of new evidence is problematic (the new drug is hurting people)
Absent one of those features, it's probably not worth the added complexity.
Completely agree with Evan Miller here, thank you for sharing the link.
Suppose you're Facebook and you decide to test a new landing page on 1 million users. You roll out the test and notice that after 10,000 users, the new page is killing engagement. Whoops, turns out it had a bug and isn't even loading. Even though no medical patients are dying, this is still a very negative outcome for Facebook. Obviously they shouldn't test on 990,000 more users before fixing the bug, but that's what slavish adherence to pre-registered trial lengths would tell you to do, because it's 'cheating' to notice that there's a problem after the first 10,000 tests.
Some of you might say, sure, for extenuating circumstances like a bug you can break procedure. But in that case I think the logic slides down a slippery slope. What if in the example above instead of a bug we just had a feature performing terribly? In either case, the right move is to end the test early, since you don't need all the data points to measure a strong signal.
But if an effect shows up early, it is not necessarily strong. I think that's the point of running an experiment for a predetermined length: so you know you didn't get lucky (or unlucky) and hit a clump of results at the start of the experiment that would average out later on.
Obviously, if a trial drug is killing your patients at a surprising rate, you need to stop the experiment. In fact, I believe experiments are sometimes stopped on ethical grounds when a drug is found to heal the experiment group at a high rate, too, either so that the control group can also benefit, or just because it is hard to justify giving only half of your patients a life-saving drug and a placebo to the rest.
But those are ethical considerations, not practical ones. Such experiments are cut short without complete confidence in the results, when there is the merest hint of ethical issues down the line. At least that's my understanding.
If failures essentially don't matter (e.g. number of bacteria killed in a petri dish), sure, use a frequentist p-value.
If failures do matter (e.g. number of patients killed), use a Bayesian multi-armed bandit.
Wrote a blog post on it: https://www.lucidchart.com/blog/the-fatal-flaw-of-ab-tests-p...
The world would probably be a better place if we taught introductory statistics from a Bayesian perspective, but people get pretty set on their ways.
1. You see a random SKU spiking for n months in a row -- good, keep stocking it, maybe even spend a few AdWords vouchers on it. BUT the fact that nobody in the company readily knows the nature of such a spike is already an indication that the business fails at understanding its market.
2. Never in my career have I seen any of these "coffee divination"-level ideas coming from analysts be "life changing" for a company. I've been through a number of companies spending money on an "algorithmic optimisation service" for their clients' banner ads. Yes, the "optimised" banners did score progressively more clicks over time, yet a single full-time designer who was hired alongside the "optimiser" company could score more clicks and purchases - without any input from any kind of data analysis.
I've only seen massive product movement when the analysts and the idea folks work together. I've never seen an idea wonk hit the bullseye while going against simple economic theory/reasoning about their target customer.
If you want to know how Optimizely prevents p-hacking check out the math behind Optimizely’s current testing here: https://www.optimizely.com/resources/stats-engine-whitepaper...
This can also be done nicely with a Bayesian mixture model or a spike-and-slab multilevel model, and that is what is done in "What works in e-commerce - a meta-analysis of 6700 online experiments", Brown & Jones 2017 http://www.qubit.com/sites/default/files/pdf/qubit_meta_anal... (although they don't formulate it in terms of a sharp null but ask the more relevant 'probability of a >0 beneficial effect', which for some kinds of A/B test has a very low prior - like only 15% for 'back to top' A/B tests).
All that you can really do is prove it wrong, by measuring an effect when there "should", by the hypothesis, be none.
Due to what is known as the "problem of induction", it's not sufficient to accept a hypothesis because you appeared to not measure an effect in the past, as that says nothing about whether an effect will occur the next time a measurement is made.
A p-value is the "chance" of measuring an effect at least as large as the one you saw, given that no effect actually exists.
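One way to make that definition concrete is to simulate the null directly (toy numbers, not from any real test):

```python
# The p-value made concrete by simulating the null hypothesis: the fraction
# of no-effect experiments that show an effect at least as large as the one
# actually observed. All numbers are made up.
import numpy as np

rng = np.random.default_rng(1)
n = 5_000                      # visitors per arm
observed_diff = 0.012          # e.g. variant B converted 1.2 points higher

null_diffs = np.abs(rng.binomial(n, 0.10, 20_000) / n
                    - rng.binomial(n, 0.10, 20_000) / n)

p_value = np.mean(null_diffs >= observed_diff)
print(p_value)   # the "chance" of an effect this big when none exists
```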
I asked how they decided how long they would run an experiment for. The answer was "until we get a significant result."
I was shocked then, but now I am used to getting these kinds of responses from developers ... That and a belief that false positives are not a thing.
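For anyone who doubts that false positives are a thing, here's a quick simulation of "run until we get a significant result" under a true null (toy numbers):

```python
# "Run until significant" with no real effect: peeking after every batch
# and stopping at the first p < 0.05 inflates the false positive rate
# well past the nominal 5%. All numbers are made up.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
N_SIMS, BATCHES, BATCH_SIZE, RATE = 1_000, 50, 500, 0.10

false_positives = 0
for _ in range(N_SIMS):
    a = rng.binomial(1, RATE, BATCHES * BATCH_SIZE)  # A and B are identical:
    b = rng.binomial(1, RATE, BATCHES * BATCH_SIZE)  # any "win" is a false positive
    for k in range(1, BATCHES + 1):
        n = k * BATCH_SIZE
        se = np.sqrt(2 * RATE * (1 - RATE) / n)      # known rate, for simplicity
        z = (b[:n].mean() - a[:n].mean()) / se
        if 2 * norm.sf(abs(z)) < 0.05:   # "significant!" -> ship it and stop
            false_positives += 1
            break

print(false_positives / N_SIMS)   # far above 0.05
```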
Take a look at clinical trials. Often in clinical trials there are multiple phases, where early stopping is desirable in case the drug has higher-than-expected efficacy (or more-harmful-than-expected side effects).
The types of tests conducted in clinical trials explicitly allow for multiple looks while maintaining correct control of the Type 1 error rate. At Optimizely we essentially have a version of this where the monitoring can be conducted continuously with rigorous control of Type 1 error.
Check out this paper for more details: http://www.kdd.org/kdd2017/papers/view/peeking-at-ab-tests-w...
Lotsa things are OK so long as you are doing X and Y etc.
Take a look at a portion of the clinical trial guidance from the FDA. Note specifically the basic statistics guidance:
6.9.1 A description of the statistical methods to be employed, including timing of any planned interim analysis(es).
6.9.2 The number of subjects planned to be enrolled. In multicenter trials, the numbers of enrolled subjects projected for each trial site should be specified. Reason for choice of sample size, including reflections on (or calculations of) the power of the trial and clinical justification.
Of course, if every other patient is suffering serious consequences, or becoming miraculously well on the second day of the trial, you stop. In those cases, you generally don't need a statistical test to tell you that your a priori evaluation of the drug or intervention was wrong.
I fail to see what is so vital about some web site A/B test that one cannot be bothered to think ahead about what defines an observational unit, how many of those one might need to detect an improvement, and wait until after that sample has been attained to test (and, if the web site doesn't get enough visitors to fulfill your sample size requirement for that particular test, that is a different problem entirely).
My intuition was that any sequential (which I translated to "online") technique could be used in a non-sequential context. By that reasoning, there's no way a sequential technique could do better; at best it could be the same.
Short answer: in sequential testing you can ask at intermediate stages whether a satisfactory confidence has been reached. If yes, you are done; if not, you can continue. On average you will hit a "yes" sooner. For non-sequential testing you cannot do this if you care about correctness (*). So the sample size needs to be pessimistic for non-sequential protocols, and then you are bound to that commitment.
(*) If your method ensures correctness even after inspection at intermediate stages, then it's a sequential method by definition. There is some confusion in the literature about Bayesian and sequential. They are orthogonal concepts. Both Bayesian and frequentist tests of hypotheses can be sequential.
That is easier than explaining Bayesian methods to people who cannot handle classical Stats.
Done properly, it might be way more efficient than setting your parameters ahead of time. If I had gotten that response from him, I'd assume that's what he meant.
Null hypothesis significance testing is fundamentally misaligned with business needs and is not a good tool for businesses. This is true in many fields of science as well, but at least they have some mechanisms that try to ensure that experiments are unbiased. Businesses often don't have the same internal and external incentives that lead to those mechanisms, and so NHST is abused even more.
It’s one of my favorite methodologies, next to “agile waterfall” and “holocracy with managers, middle-managers and minibosses”.
Assuming 'B' is the new option, there are three possibilities: A is better than B, A is equivalent to B, or A is worse than B.
If your p-hacked experiment tells you to change from A to B while the null hypothesis was actually correct, you aren't much worse off than you were in the first place. And if your long-term metrics were in place, then you can get a better measure for your experiment.
Not to mention experimental failures from unaccounted-for variables.
Usually these would be positive consumer-facing features that we were concerned might negatively affect conversion. The switch to Bayesian made that a lot easier to run.
We were excited to collaborate with the authors on this study. Keep in mind the data used in this analysis is from 2014, before we introduced sequential testing and FDR correction specifically to address this p-hacking issue. I expect these results are in line with any platform using fixed-horizon frequentist methods.
A couple of articles worth reading   (can’t exactly vouch for their validity but seem to make some good arguments that appear thought out)
This is a huge problem in science too. I have regularly been told to just see what happens and come up with hypotheses after, or others have been unable to say what their hypotheses actually are.
Scientists seem less and less likely to be trained in statistics, and in the scientific process. Technical knowledge is important, but understanding of the scientific process is much more important.
Going the Bayesian way, as suggested in some comments, is no solution at all, as I am not aware of an accepted Bayesian approach to dealing with the issue:
(feel free to run sims, if you do not trust the logic ;-)) as well as on a more general level:
My take on this: even in cases where such testing was done by disciplined statisticians (which is not the case at least 9 times out of 10 -- yes, a math or CS PhD is not a professional statistician by any stretch), the value of advice made from that data is marginal at best.
As eCommerce is the bread and butter of the cheap electronics industry, I saw time and time again that "science driven" outfits lose out to others. Not so much because their quality of decision making was demonstrably inferior, but because their obsession with "statistical tasseography" drained their resources and shifted their focus away from things of obvious importance.
Concerns from the audience were dismissed and referred to follow up after the talk. Never thought the same of Optimizely after that.
Could it be as simple as declaring your test duration before starting the experiment, and having the tool add an asterisk to your results if you stop the experiment early?
I was the only one there who knew he's a particle physicist.
The OP is horrifyingly right.
Bayesian results aren't p-values; a 60-70% probability that treatment beats control is just that, not a p-value of .4 or .3 (which would say nothing).
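For anyone wondering where a number like "70% probability that treatment beats control" comes from, here's a minimal Beta-Binomial sketch with made-up counts:

```python
# Posterior probability that treatment beats control, Beta-Binomial model
# with flat priors. This is a direct probability statement about the arms,
# not a p-value. The conversion counts below are made up.
import numpy as np

rng = np.random.default_rng(3)

# conversions / visitors per arm, Beta(1, 1) priors
control   = rng.beta(1 + 480, 1 + 10_000 - 480, 200_000)
treatment = rng.beta(1 + 496, 1 + 10_000 - 496, 200_000)

# Fraction of posterior draws where treatment's rate exceeds control's.
print(np.mean(treatment > control))   # roughly 0.7 for these counts
```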
"You’re a social scientist with a hunch: The U.S. economy is affected by whether Republicans or Democrats are in office. Try to show that a connection exists, using real data going back to 1948. For your results to be publishable in an academic journal, you’ll need to prove that they are “statistically significant” by achieving a low enough p-value."
And here’s a great example of real-life p-hacking to get a catchy article about the health benefits of chocolate:
Generally, a 0.05 p-value means that, if the null hypothesis were true, you would observe a result at least as extreme as yours in 5% of experiments due to random sampling error alone. I.e. if I tested “is X correlated with cancer”, and my null hypothesis is “X isn’t correlated with cancer”, a 0.05 p-value would meet the threshold to reject that null hypothesis. Generally, a lower p-value means a more statistically significant result.
The problem is that 0.05 seems to be much too high of a p-value. I.e. clever experimental design and cherry picking can generate many results that are statistically significant at that level. Many academics advocate for moving to a 0.01 or even 0.001 significance threshold.
Recently, in some academic fields, there’s been widespread concern that many research studies were p-hacked. See for example, this paper that blew up last year in the finance community, because it suggests a significant number of finance papers, including some seminal ones, had p-hacked results: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3017677.
The counter-argument is that for certain scientific fields, you may never be able to reach a p-value threshold of 0.001. This means the vast majority of research couldn’t be published in journals, academics wouldn’t be able to get promoted etc.
> This. I can p-hack in my field, if I wanted to, up to a p-value of arbitrary strictness, given enough time.
I'm not a practicing scientist/academic, so I want to be careful here. But, I think both of you are being a little uncharitable/pedantic.
P-hacking is one contributor to the broader reproducibility crisis. Lowering the p-value threshold to address the lack of reproducibility is not something that I made up. Yes, lowering the p-value threshold does not eliminate the motivations/techniques that enable p-hacking, but it can make it a lot harder and a lot less worthwhile. If you work in academia and it now takes you much longer to cherry-pick a sample that meets a much lower p-cutoff, it seems to follow that we would see less of it.
This is an excerpt from a paper soon to be published in Nature: https://imai.princeton.edu/research/files/significance.pdf. The key quote: 'We have diverse views about how best to improve reproducibility, and many of us believe that other ways of summarizing the data, such as Bayes factors or other posterior summaries based on clearly articulated model assumptions, are preferable to P values. However, changing the P value threshold is simple, aligns with the training undertaken by many researchers, and might quickly achieve broad acceptance.'
With regards to the comment that you could p-hack up to any strictness, I'm not sure this is correct. If you accept the proposal laid out in that Nature paper to lower the threshold to P < 0.005, or if we go even lower to P < 0.001, I don't believe you'd be able to p-hack in any practical way. Yes, you could cherry-pick a tiny sample, but any peer reviewer or colleague of yours is going to ask questions about the sample.
A perfectly designed, un-p-hacked study should still perhaps be held to a stricter p-value criterion than 0.05.
And I am correct - because I've done it. I'm presently working on a paper where, because I primarily work with simulations, I can translate minute and meaningless differences into arbitrarily small p-values. And I used "arbitrary" for a good reason - my personal record is the smallest value R can express.
Ironically, this isn't because I have a tiny sample, but because I can make tremendously large ones. All of this is because nowhere in the calculation of a p-value is the question "Does this difference matter?"
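It's easy to reproduce this point with a toy simulation (made-up numbers, nothing to do with the actual paper):

```python
# A practically meaningless difference (0.01 standard deviations) becomes
# overwhelmingly "significant" once the simulated sample is large enough;
# nothing in the p-value asks whether the difference matters.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000_000
a = rng.normal(loc=100.00, scale=15, size=n)
b = rng.normal(loc=100.15, scale=15, size=n)   # 0.15 on a scale-of-100 metric

print(stats.ttest_ind(a, b).pvalue)   # astronomically small p-value
print((b.mean() - a.mean()) / 15)     # effect size: ~0.01 SD, who cares
```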
I can tell you that given a fixed size dataset, it is not possible to p-hack below a certain threshold in any meaningful way. My colleagues would’ve asked why the gigantic dataset we purchased had 1/4 of its observations thrown out etc.
Your claim that the threshold and the practice of p-hacking are orthogonal (independent?) is still puzzling to me. I think a better analogy would be trying to game something like your PageSpeed score. In order to get a higher score, you skimp on UX so the page loads faster, and cut out backend functionality because you want fewer HTTP requests. Making it harder to achieve a high PageSpeed score forces you at some point to evaluate the tradeoffs of chasing that score.
I have two questions for you:
1) Would it take you more time to p-hack a lower threshold, or do all your results yield you a ~0.0000 p-value?
2) In simulation based research like yours, it seems to me that even other p-hacking “fixes” like forcing there to be a pre-announcement or sample size, sample structure, etc. wouldn’t address what you say you’re able to do. What can be done to fix it?
This is only true if you haven't collected your own data, and the size of the original sample is known - and that you used all of it. I would suggest that a fixed, known sample size is a relatively rare outcome for many fields.
"Your claim that the threshold and the practice of p-hacking are orthogonal (independent?) is still puzzling to me."
The suggestion is they're unrelated. Changing to, say, p = 0.005 will impact studies that aren't p-hacked, and does not make evidence p-hacking-proof. It potentially makes things more difficult, but not in a predictable and field-agnostic fashion.
"1) Would it take you more time to p-hack a lower threshold, or do all your results yield you a ~0.0000 p-value?"
It might take me more time - but I could also write a script that does the analysis in place and simply stops when I meet a criterion. The question is whether it will take me meaningfully more time - "run it over the weekend instead of overnight" isn't a meaningful obstacle.
"In simulation based research like yours, it seems to me that even other p-hacking “fixes” like forcing there to be a pre-announcement or sample size, sample structure, etc. wouldn’t address what you say you’re able to do. What can be done to fix it?"
My preference is to move past a reliance on significance testing and report effect sizes and measures of precision at the very least. If one must report a p-value, I'd also require the reporting of the minimum detectable effect size that could be obtained by your sample.
Pre-announcing sample size would...just be a huge pain in the ass, generally.
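For what that kind of reporting might look like in practice, a rough sketch (made-up data; the 80%-power minimum detectable effect is computed with statsmodels' power calculator):

```python
# Reporting an effect size with a confidence interval, plus the minimum
# detectable effect (MDE) the design could have found, instead of leaning
# on a bare p-value. Data and numbers are made up.
import numpy as np
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(5)
a = rng.normal(10.0, 2.0, 400)
b = rng.normal(10.3, 2.0, 400)

diff = b.mean() - a.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se       # 95% CI
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd

# Smallest standardized effect detectable with 80% power at this sample size.
mde = TTestIndPower().solve_power(nobs1=len(a), alpha=0.05, power=0.8)

print(f"diff={diff:.2f}  95% CI=({ci_low:.2f}, {ci_high:.2f})  "
      f"d={cohens_d:.2f}  MDE(d)={mde:.2f}")
```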
>I can tell you that given a fixed size dataset, it is not possible to p-hack below a certain threshold in any meaningful way
Correct, but the most common methods of p-hacking involve changing the dataset size, either by repeating the experiment until the desired result is achieved (a la xkcd), or by removing a large part of the dataset due to a seemingly legitimate excuse (like the fivethirtyeight demo that has been linked already).
Pre-announcing your dataset size is pre-announcing your sample size. If you pre-announce your dataset, p-hacking is not possible. This is true. But most research doesn't use a public dataset that is pre-decided.
>Would it take you more time to p-hack a lower threshold
>In simulation based research like yours, it seems to me that even other p-hacking “fixes” like forcing there to be a pre-announcement or sample size, sample structure, etc.
This doesn't follow.
E.g. if I say “I will do 10000 runs of my simulation”, what’s to prevent me from doing those runs multiple times, and selecting the one that gives me the desired p-value? For observational research, there’s obviously a physical limit to how many subjects you can observe etc. Would still love an answer from the grandparent comment.
>given enough time.
One nice thing about simulation-based research is that it is often (more) reproducible, so a simulation can be run 10000 times, but then the paper might be expected to report how often the simulation succeeded. In other words, you can increase the simulation size to make p-hacking infeasible.
Note that in practice, pre-announcing your sample size doesn't prevent p-hacking unless your sample size is == to a known sample. If you say "our sample size will be X", but you can collect even 2 or 3x that much data, you can almost certainly p-hack.
Not to mention that I'm unaware of any field where people actually pre-announce their sample sizes. Does this happen on professor's web pages and I'm unaware, or as footnotes in prior papers?
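To make the 2-3x point concrete, a toy demonstration (pure simulation, no real effect anywhere):

```python
# Toy demo of the "collect 2-3x, report the announced n" problem. Both
# groups come from the same distribution (the null is true), yet picking
# the announced-size subset with the best p-value is usually enough to
# cross the 0.05 line. All numbers are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
ANNOUNCED_N, COLLECTED_N = 200, 600

a = rng.normal(0, 1, COLLECTED_N)
b = rng.normal(0, 1, COLLECTED_N)

best_p = 1.0
for _ in range(2_000):
    idx_a = rng.choice(COLLECTED_N, ANNOUNCED_N, replace=False)
    idx_b = rng.choice(COLLECTED_N, ANNOUNCED_N, replace=False)
    best_p = min(best_p, stats.ttest_ind(a[idx_a], b[idx_b]).pvalue)

print(best_p)   # typically well below 0.05 despite there being no effect
```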
The best preregistration plans will typically include a declared sample or population to observe (http://datacolada.org/64), or at least clear cut criteria for which participants or observations you will exclude.
I think for the type of economics/finance research I’m most familiar with, you often implicitly announce your sample when securing funding for a research proposal. E.g. if I’m trying to see if pursuing a momentum strategy with S&P 500 stocks is profitable (a la AQR’s work), it’s pretty obvious what the sample ought to be. This is partly why that meta study I linked to earlier was able to sniff out potential signs of p-hacking.
I’m not sure what’s incorrect about this statement? If you disagree with the “fix” to the problem that is most familiar to me, that’s fine. It’s one of many approaches.
But, at what point did I mislead the parent as to what p-hacking is? What’s your definition?
>p-hacking is a set of related techniques, whereby clever experimental design and cherry picking of data can generate results that falsely appear statistically significant.
There are a few important differences here:
1. The effect is not statistically significant. In fact, often there is no effect at all.
2. There is no mention of a specific significance level.
Those are both important.
A p-value threshold should be chosen (before running the experiment!) based on how confident the researcher wants to be in their result.
For my high school statistics final project, I did an experiment to test whether a stupid prank/joke was funny. It had a pretty terrible experimental design (tons of bias) and a tiny sample size (<10). I chose a p-value threshold of 0.8 and ended up with a significant result (it was more amusing than our control). And that was fine, because (A) it was not a very important experiment, and (B) my report acknowledged all of this instead of trying to sweep it under the covers and pretend like I had a strong conclusion.
That would be wildly inappropriate if I were QA testing a new model of airbag or medication. But I wasn't, and I'm not going to use the results for anything other than sharing this anecdote, so it was fine.
Similarly, I'd say in some A/B testing scenarios, it's okay to use a lower standard of proof (though p-hacking is definitely not). Especially if you're just using the test as one piece of information to help you decide on the final design. The problem is when people do bad stats and then use the result as an excuse to throw out their human judgment.
I agree that over reliance on a single metric, like a p-value, gets us Goodhart’s Law type problems.
In econometrics, and really any other statistics-adjacent field, if you’ve correctly estimated your standard errors, and are using something like https://en.m.wikipedia.org/wiki/Newey%E2%80%93West_estimator where appropriate, there is nothing wrong with using a p-value as a general approximation of significance.
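For reference, this is roughly what Newey-West (HAC) standard errors look like in statsmodels (a toy example with made-up AR(1) data, not any particular study's setup):

```python
# OLS with Newey-West (HAC) standard errors. With autocorrelated regressors
# and errors, the naive OLS standard errors are too small; the HAC ones are
# the honest basis for t-stats and p-values.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x, e = np.zeros(n), np.zeros(n)
for t in range(1, n):                 # AR(1) regressor and AR(1) errors
    x[t] = 0.7 * x[t - 1] + rng.normal()
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + e

X = sm.add_constant(x)
naive = sm.OLS(y, X).fit()
hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 5})

print(naive.bse)   # understated standard errors
print(hac.bse)     # Newey-West standard errors, noticeably larger here
```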