
If you’ve taken some statistics or econometrics, you’ve probably heard of “significance levels” and “p-values”. For some reason, academia chose 0.05 as the threshold for “meaningful” or “significant” results.

Generally, a 0.05 p-value means that, if the null hypothesis were true, you would observe a result at least as extreme in about 5% of experiments purely from random sampling error. I.e. if I tested “is X correlated with cancer”, and my null hypothesis is “X isn’t correlated with cancer”, a p-value at or below 0.05 would meet the threshold to reject that null hypothesis. Generally, a lower p-value means a more statistically significant result.
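To make that concrete, here's a toy sketch (Python with numpy/scipy; the variable names and sample size are made up) of a single test run on data where the null hypothesis is true by construction:

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(0)

  # "Exposure" x and "outcome" y are independent noise, so the
  # null hypothesis ("no correlation") is true by construction.
  x = rng.normal(size=200)
  y = rng.normal(size=200)

  r, p = stats.pearsonr(x, y)
  print(f"correlation = {r:.3f}, p-value = {p:.3f}")
  # Across many repeats of this experiment, p < 0.05 shows up in
  # roughly 5% of them purely from sampling error.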

The problem is that 0.05 seems to be much too high of a p-value. I.e. clever experimental design and cherry picking can generate many results that are statistically significant at that level. Many academics advocate for moving to a 0.01 or even 0.001 significance threshold.
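As a crude illustration of one such trick (again a Python sketch with invented numbers, not anyone's real study): test enough unrelated variables and some will clear 0.05 by chance alone.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(1)
  n, n_vars = 200, 20
  outcome = rng.normal(size=n)

  hits = 0
  for i in range(n_vars):
      predictor = rng.normal(size=n)            # pure noise, unrelated to the outcome
      _, p = stats.pearsonr(predictor, outcome)
      if p < 0.05:
          hits += 1
          print(f"variable {i}: p = {p:.3f}  <- 'significant'")

  # With 20 independent tests of noise, about one spurious
  # "discovery" at the 0.05 level is expected on average.
  print(f"{hits} of {n_vars} cleared p < 0.05")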

Recently, in some academic fields, there’s been widespread concern that many research studies were p-hacked. See, for example, this paper that blew up last year in the finance community because it suggests a significant number of finance papers, including some seminal ones, had p-hacked results: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3017677.

The counter-argument is that for certain scientific fields, you may never be able to reach a p-value threshold of 0.001. This means the vast majority of research couldn’t be published in journals, academics wouldn’t be able to get promoted etc.




This is wrong. P-hacking has nothing to do with the p-value threshold being too lenient.


This. I can p-hack in my field, if I wanted to, up to a p-value of arbitrary strictness, given enough time.


> This is wrong. P-hacking has nothing to do with the p-value threshold being too lenient.

> This. I can p-hack in my field, if I wanted to, up to a p-value of arbitrary strictness, given enough time.

I'm not a practicing scientist/academic, so I want to be careful here. But, I think both of you are being a little uncharitable/pedantic.

P-hacking is one contributor to the broader reproducibility crisis. Lowering the p-value threshold to address the lack of reproducibility is not something that I made up. Yes, lowering the threshold does not eliminate the motivations or techniques behind p-hacking, but it can make it a lot harder and a lot less worthwhile. If you work in academia and it now takes you much longer to cherry-pick a sample that meets a much lower p-cutoff, it seems to follow that we would see less of it.
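As a rough back-of-the-envelope check on the "harder and less worthwhile" point, here's a Python sketch (made-up sample sizes, and it treats each cherry-picked re-analysis as an independent shot at a true null, which is a simplification):

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(2)

  def attempts_until_significant(alpha, n=100, cap=100_000):
      # Count independent null "analyses" needed before p < alpha.
      for attempt in range(1, cap + 1):
          x, y = rng.normal(size=n), rng.normal(size=n)
          _, p = stats.pearsonr(x, y)
          if p < alpha:
              return attempt
      return cap

  for alpha in (0.05, 0.005, 0.001):
      runs = [attempts_until_significant(alpha) for _ in range(100)]
      print(f"alpha = {alpha}: ~{np.mean(runs):.0f} attempts on average")

  # Expected attempts scale like 1/alpha, so 0.005 is roughly 10x the
  # work of 0.05 -- harder and less worthwhile, though not impossible.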

This is an excerpt from a paper soon to be published in Nature: https://imai.princeton.edu/research/files/significance.pdf. The key quote: 'We have diverse views about how best to improve reproducibility, and many of us believe that other ways of summarizing the data, such as Bayes factors or other posterior summaries based on clearly articulated model assumptions, are preferable to P values. However, changing the P value threshold is simple, aligns with the training undertaken by many researchers, and might quickly achieve broad acceptance.'

With regard to the comment that you could p-hack up to any strictness, I'm not sure this is correct. If you accept the proposal laid out in that Nature paper to lower the threshold to P<0.005, or if we go even lower to P<0.001, I don't believe that you'd be able to p-hack in any practical way. Yes, you could cherry-pick a tiny sample, but any peer reviewer or colleague of yours is going to ask questions about the sample.


I'm not being nitpicky - they are both components of the reproducibility problem, but they're orthogonal to each other. Bad UI design and a poor backend are both reasons "X website sucks!", but that doesn't mean they're the same.

A perfectly designed, un-p-hacked study should perhaps still be held to a stricter p-value criterion than 0.05.

And I am correct - because I've done it. I'm presently working on a paper where, because I primarily work with simulations, I can translate minute and meaningless differences into arbitrarily small p-values. And I used "arbitrary" for a good reason - my personal record is the smallest value R can express.

Ironically, this isn't because I have a tiny sample, but because I can make tremendously large ones. All of this is because nowhere in the calculation of a p-value is the question "Does this difference matter?"
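A rough Python analogue of what I mean (not my actual analysis; the 0.01-SD "effect" and the sample size are invented):

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(3)
  n = 1_000_000                            # simulations make samples this big cheap

  control = rng.normal(loc=0.00, size=n)
  treated = rng.normal(loc=0.01, size=n)   # a practically meaningless 0.01-SD shift

  t, p = stats.ttest_ind(treated, control)
  print(f"t = {t:.1f}, p = {p:.2e}")       # typically far below 0.001
  # Nothing in this calculation ever asks whether a 0.01-SD
  # difference actually matters.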


First off, I don’t have any experience with publishing based on the results of simulations. My (short) time writing papers centered on economics research with observational datasets.

I can tell you that given a fixed size dataset, it is not possible to p-hack below a certain threshold in any meaningful way. My colleagues would’ve asked why the gigantic dataset we purchased had 1/4 of its observations thrown out etc.

Your claim that the threshold and the practice of p-hacking are orthogonal (independent?) is still puzzling to me. I think a better analogy would be trying to game something like your Pagespeed score. To get a higher score, you skimp on UX so the page loads faster, and you cut backend functionality because you want fewer HTTP requests. Making it harder to achieve a high Pagespeed score forces you at some point to evaluate the tradeoffs of chasing that score.

I have two questions for you:

1) Would it take you more time to p-hack a lower threshold, or do all your results yield you a ~0.0000 p-value?

2) In simulation-based research like yours, it seems to me that even other p-hacking “fixes”, like forcing a pre-announcement of sample size, sample structure, etc., wouldn’t address what you say you’re able to do. What can be done to fix it?


"I can tell you that given a fixed size dataset, it is not possible to p-hack below a certain threshold in any meaningful way. My colleagues would’ve asked why the gigantic dataset we purchased had 1/4 of its observations thrown out etc."

This is only true if you haven't collected your own data, the size of the original sample is known, and you used all of it. I would suggest that a fixed, known sample size is a relatively rare outcome for many fields.

"Your claim that the threshold and the practice of p-hacking are orthogonal (independent?) is still puzzling to me."

The suggestion is that they're unrelated. Changing to, say, p = 0.005 will impact studies that aren't p-hacked, and it does not make the evidence p-hacking-proof. It potentially makes things more difficult, but not in a predictable, field-agnostic fashion.

"1) Would it take you more time to p-hack a lower threshold, or do all your results yield you a ~0.0000 p-value?"

It might take me more time - but I could also write a script that does the analysis in place and simply stops when I meet a criterion. The question is whether it will take me meaningfully more time - "run it over the weekend instead of overnight" isn't a meaningful obstacle.
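The stopping rule really is only a few lines. A toy Python version (all numbers invented) that keeps adding simulated data under a true null and quits the moment the threshold is met:

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(4)
  alpha = 0.05                  # a stricter threshold just means a longer wait
  batch = 50

  x, y = np.empty(0), np.empty(0)
  for _ in range(2_000):                          # capped so the demo halts
      x = np.append(x, rng.normal(size=batch))    # null: x and y are unrelated
      y = np.append(y, rng.normal(size=batch))
      _, p = stats.pearsonr(x, y)
      if p < alpha:                               # stop the moment it "works"
          print(f"'significant' at n = {len(x)}, p = {p:.4f}")
          break
  else:
      print("no luck this run -- just leave it running longer")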

"In simulation based research like yours, it seems to me that even other p-hacking “fixes” like forcing there to be a pre-announcement or sample size, sample structure, etc. wouldn’t address what you say you’re able to do. What can be done to fix it?"

My preference is to move past a reliance on significance testing and report effect sizes and measures of precision at the very least. If one must report a p-value, I'd also require the reporting of the minimum detectable effect size that could be obtained by your sample.
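Both are cheap to report. A sketch of the sort of thing I mean, using statsmodels' power utilities (the group data, 80% power, and alpha = 0.05 are all invented for the example):

  import numpy as np
  from statsmodels.stats.power import TTestIndPower

  rng = np.random.default_rng(5)
  a = rng.normal(0.0, 1.0, size=400)     # made-up two-group data
  b = rng.normal(0.2, 1.0, size=400)

  # Effect size (Cohen's d) using a pooled standard deviation
  pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
  d = (b.mean() - a.mean()) / pooled_sd
  print(f"Cohen's d = {d:.2f}")

  # Smallest standardized effect this design could reliably detect
  mde = TTestIndPower().solve_power(effect_size=None, nobs1=400,
                                    alpha=0.05, power=0.8)
  print(f"minimum detectable effect ~ {mde:.2f} SD")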

Pre-announcing sample size would...just be a huge pain in the ass, generally.


Not the above poster, but...

>I can tell you that given a fixed size dataset, it is not possible to p-hack below a certain threshold in any meaningful way

Correct, but the most common methods of p-hacking involve changing the dataset size, either by repeating the experiment until the desired result is achieved (a la xkcd [0]), or by removing a large part of the dataset due to a seemingly-legitimate excuse (like the fivethirtyeight demo that has been linked already).

Pre-announcing your dataset size is pre-announcing your sample size. If you pre-announce your dataset, p-hacking is not possible. This is true. But most research doesn't use a public dataset that is pre-decided.

>Would it take you more time to p-hack a lower threshold

Yes.

>In simulation-based research like yours, it seems to me that even other p-hacking “fixes”, like forcing a pre-announcement of sample size, sample structure, etc.

This doesn't follow.

[0]: https://xkcd.com/882/


Sorry if the second question was unclear. My point was that for simulation-based research, it doesn’t seem that pre-announcing your sample size would do much to prevent p-hacking.

E.g. if I say “I will do 10000 runs of my simulation”, what’s to prevent me from doing those runs multiple times, and selecting the one that gives me the desired p-value? For observational research, there’s obviously a physical limit to how many subjects you can observe etc. Would still love an answer from the grandparent comment.
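Concretely, the worry is something like this Python toy (a 'simulation' with zero true effect and a made-up number of re-runs): every batch honors the announced size, but only the friendliest one gets reported.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(6)

  def preannounced_study(n_runs=10_000):
      # One "study" of the pre-announced size, with zero true effect.
      outcomes = rng.normal(size=n_runs)
      _, p = stats.ttest_1samp(outcomes, 0.0)
      return p

  p_values = [preannounced_study() for _ in range(50)]
  print(f"best of 50 re-runs: p = {min(p_values):.4f}")
  # Every re-run honors the announced sample size, yet reporting only
  # the best one is still p-hacking.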


I believe that's where the original post's

>given enough time.

comes in.

One nice thing about simulation-based research is that it is often (more) reproducible, so a simulation can be run 10000 times, but then the paper might be expected to report how often the simulation succeeded. In other words, you can increase the simulation size to make p-hacking infeasible.

Note that in practice, pre-announcing your sample size doesn't prevent p-hacking unless your sample size is equal to a known, fixed sample. If you say "our sample size will be X" but you can collect 2x or 3x that much data, you can almost certainly p-hack.

Not to mention that I'm unaware of any field where people actually pre-announce their sample sizes. Does this happen on professors' web pages and I'm unaware, or as footnotes in prior papers?


Again, academia/research is not my profession. But, some cool efforts in this area include osf.io, which is trying to be the Arxiv or Github of preregistration for scientific studies.

The best preregistration plans will typically include a declared sample or population to observe (http://datacolada.org/64), or at least clear cut criteria for which participants or observations you will exclude.

I think for the type of economics/finance research I’m most familiar with, you often implicitly announce your sample when securing funding for a research proposal. E.g. if I’m trying to see if pursuing a momentum strategy with S&P 500 stocks is profitable (a la AQR’s work), it’s pretty obvious what the sample ought to be. This is partly why that meta study I linked to earlier was able to sniff out potential signs of p-hacking.


The parent asked very straightforwardly what p-hacking is, and you replied with a red herring. If I'm being pedantic, you're being unhelpful.


> clever experimental design and cherry picking can generate many results that are statistically significant at that level.

I’m not sure what’s incorrect about this statement? If you disagree with the “fix” to the problem that is most familiar to me, that’s fine. It’s one of many approaches.

But, at what point did I mislead the parent as to what p-hacking is? What’s your definition?


p-hacking has nothing to do with any specific significance level. You can p-hack at a significance level of .5 or .05 or .000005. A better definition would be:

>p-hacking is a set of related techniques, whereby clever experimental design and cherry picking of data can generate results that falsely appear statistically significant.

There are a few important differences here:

1. The effect is not genuinely statistically significant; in fact, often there is no effect at all.

2. There is no mention of a specific significance level.

Those are both important.


If you've got the money, you can always just increase your sample size until significance is achieved.


The problem with 0.05 isn't how lenient it is, but rather the fact that a default exists at all.

A significance threshold should be chosen (before running the experiment!) based on how confident the researcher wants to be in the result.

For my high school statistics final project, I did an experiment to test whether a stupid prank/joke was funny. Had a pretty terrible experimental design (tons of bias) and tiny sample size (<10). Chose a significance threshold of 0.8 and ended up with a significant result (it was more amusing than our control). And that was fine, because (A) it was not a very important experiment, and (B) my report acknowledged all of this instead of trying to sweep it under the rug and pretend like I had a strong conclusion.

That would be wildly inappropriate if I were QA testing a new model of airbag or medication. But I wasn't, and I'm not going to use the results for anything other than sharing this anecdote, so it was fine.

Similarly, I'd say in some A/B testing scenarios, it's okay to use a lower standard of proof (though p-hacking is definitely not), especially if you're just using the test as one piece of information to help you decide on the final design. The problem is when people do bad stats and then use the result as an excuse to throw out their human judgment.


If you chose a significance threshold of 0.8 and your result came in right around there, that would mean a result at least as extreme as yours shows up about 80% of the time under the null hypothesis alone. So in any case, you do not have a strong conclusion.

I agree that over-reliance on a single metric, like a p-value, gets us Goodhart’s Law-type problems.

In econometrics, and really any other statistics-adjacent field, if you’ve correctly estimated your standard errors, and are using something like the Newey-West estimator (https://en.m.wikipedia.org/wiki/Newey%E2%80%93West_estimator) where appropriate, there is nothing wrong with using a p-value as a general approximation of significance.
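For anyone curious, Newey-West (HAC) standard errors are a one-liner in statsmodels; a toy sketch with made-up data and lag length:

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(7)
  n = 500
  x = rng.normal(size=n)
  # Errors with some serial correlation (a 5-period moving average)
  e = np.convolve(rng.normal(size=n + 4), np.ones(5) / 5, mode="valid")
  y = 1.0 + 0.5 * x + e

  X = sm.add_constant(x)
  nw = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 5})
  print(nw.summary())   # coefficients with Newey-West (HAC) standard errors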



