This proposal is a great pragmatic step forward. Like they say in the paper, it doesn't solve all problems, but it would be an improvement with reasonable cost and tremendous benefits.
>Such an increase means that fewer studies can be conducted using current experimental designs and budgets. But Fig. 2 shows the benefit: false positive rates would typically fall by factors greater than two. Hence, considerable resources would be saved by not performing future studies based on false premises.
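To put rough numbers on that cost (a back-of-the-envelope sketch, assuming a two-sided two-sample t-test at 80% power and a medium effect of d = 0.5 -- all illustrative choices on my part):

```python
# Required n per group at the old vs proposed alpha, holding power fixed.
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
for alpha in (0.05, 0.005):
    n = solver.solve_power(effect_size=0.5, alpha=alpha, power=0.8,
                           alternative='two-sided')
    print(f"alpha={alpha}: n per group ~ {n:.0f}")
# roughly 64 vs 109 -- about a 70% bigger sample for the same power
```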
So this proposal is really the opposite of pragmatic. Pragmatic would be requiring effect size estimates and confidence intervals in all published papers. It is surprising how many papers talk about highly significant effects without ever reporting how large the estimated effect actually is, which gives authors a lot of leeway to exaggerate the importance of their findings.
Using p-values as our primary metric means an overemphasis on finding small effects (which are usually not clinically relevant anyway) and too little focus on effects that are actually large.
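For what it's worth, reporting the estimate is cheap. A minimal sketch with simulated stand-in data (nothing here comes from a real study):

```python
# Minimal sketch: report the estimated effect with a 95% CI and a
# standardized effect size, not just a p-value. Data are simulated.
import numpy as np

rng = np.random.default_rng(0)
treated = rng.normal(10.4, 2.0, 50)
control = rng.normal(10.0, 2.0, 50)

diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1)/50 + control.var(ddof=1)/50)
lo, hi = diff - 1.96*se, diff + 1.96*se          # normal-approx 95% CI
d = diff / np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)  # Cohen's d

print(f"effect = {diff:.2f} (95% CI {lo:.2f} to {hi:.2f}), Cohen's d = {d:.2f}")
```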
If an effect is real, but very small, that too may well cause replicability problems because it suggests the effect may not be very robust to small changes in experimental conditions, whereas a big effect would be more likely to be robust.
If you think about the really important scientific findings -- the ones that made a big impact and are indisputably true -- statistics usually aren't necessary to prove them, because the effect size is so large it is simply obvious. I'm not against using statistics, of course, but the point is that we should be looking mainly for effects with large effect sizes if we are after important findings, IMO. That large effects are also the most likely to replicate is a major bonus.
In practice it's harder than simple number crunching due to methodological differences between studies, but it seems a shame to call a study essentially worthless.
I like to be a little optimistic and think, "Data is data and data is good. The challenge is in interpreting it."
This is a reasonable intuition, but is hard to square with the increased risk of publication bias for small studies (i.e. we should question whether the typical small study published is really "data" in the sense we'd like).
Since small studies are more sensitive to noise and "fishing" for an effect, and since non-results are rarely published, what happens with meta-analysis of many published small studies is that you end up primarily looking at outliers rather than typical results. Since most research questions have a directionality to what counts as "interesting", the outliers also tend to be clustered on one side rather than evenly split.
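A toy simulation of that filter (the true effect, sample size, and one-sided publication rule are all invented for illustration):

```python
# Many small studies of a tiny true effect; only positive, significant
# results get "published". The published record overstates the effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_effect, n = 0.1, 20
published = []
for _ in range(5000):
    a = rng.normal(true_effect, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    t, p = stats.ttest_ind(a, b)
    if p < 0.05 and t > 0:               # the one-sided publication filter
        published.append(a.mean() - b.mean())

print("true effect:", true_effect)
print("mean published effect:", round(float(np.mean(published)), 2))
# the published mean comes out several times the true effect
```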
The two papers I recommend anyone read if they care about these problems are:
Why Most Published Research Findings Are False
The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time
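The headline arithmetic behind the forking-paths point, for anyone who hasn't seen it: even testing pure noise, a handful of implicit comparisons makes some "significant" result likely (k = 20 is an arbitrary illustrative number).

```python
# Chance of at least one p < 0.05 "finding" across k independent looks
# at pure noise.
alpha, k = 0.05, 20
print(1 - (1 - alpha) ** k)   # ~0.64
```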
Alt-Text: "Correlation doesn't imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing 'look over there'."
I think you could replace correlation with "small studies" or "anecdotes". These are all things that suggest there may be some effect, and there may be merit in further study.
Good science requires a tension between hypothesis generation and skepticism. Perhaps if we rewarded the _debunking_ of findings as much as we do the discovery of findings, things would change.
The funding bodies etc., who want "quantitative" measures of research, look at publications. Why would we expect debunking papers to be published, even when they are debunking something interesting?
That said, it also seems like low-hanging fruit. At least in some fields, replication should be a lot cheaper than doing things in the first place -- because a lot of the cost is in pissing about trying to find something that even seems to work.
For funding bodies to explicitly support replication studies, even if each gets only 20% of what the original study got, should be a winner at reasonably low cost.
"In the review process they did not disagree with my points at all, but they refused to publish the correction on the grounds that they only publish the very best submissions to ASR."
I think the situation would improve with better teaching of philosophy of science and statistics (this would educate better reviewers too).
On the other hand, you might expect that new discoveries by nature have less data behind them, since data is likely more expensive for brand-new research, and by extension a lower likelihood of meeting these sorts of stringent statistical requirements. Decreasing the p-value threshold may be counter-productive if we dismiss legitimate new discoveries due to essentially economic constraints on data gathering; that would make it less likely such work gets funding to be pursued in more depth, thereby slowing the advance of discoveries.
I could see the reverse happening, where higher p-value standards lead to normalization of deviance in the form of worse p-hacking.
Statistics is just another method and almost every method can be hacked or abused. Science is not about putting checkmarks in tables but about reading and understanding ideas and reproducing the results. Tweaking some numeric values is not going to help the review process which is fundamentally broken these days.
This is a classic tradeoff between exploration and exploitation in active learning.
If your view of the world is that there are only a very few hypotheses worth exploring, and you have a good lay of the scientific land, then requiring a higher bar of proof is probably good.
If it's a new field that's extremely complex and where very little is known of the governing principles, then requiring very high stats could severely slow progress and waste lots of research dollars.
I completely agree that rather than setting arbitrary barriers for significance, it would seem much better to let people actually understand what was found, at whatever significance it was. Even setting up the null model to get a p-value requires tons of assumptions. The better test is reproducibility and predictive models that can be validated or invalidated. That's where the science is, not in the p.
I am not at all in favor of this proposal, but one thing it may do is stem the tidal wave of misinformation.
The very practical concern is that entire areas of research have been based on studies replicated and backed up entirely through p-hacking and selectively publishing only papers with positive results. This is a proven issue today. See https://en.wikipedia.org/wiki/Replication_crisis for more.
It may be that there is a pendulum that needs to swing a few times to get to a good tradeoff. But it is clear, now, which direction it needs to swing.
It's something that affects a few fields, not all science. And the problem has been completely 100% overblown.
If the problem is that things aren't replicating, changing the p-value cutoff for significance isn't going to fix everything. It can just as easily be a bad null model that's the problem, in which case you can't trust any p-value. The MRI scan problem was closely related to that.
It's a field-specific and null-model-specific thing. Broadly changing the p-value cutoff for everybody isn't going to fix this issue.
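To illustrate the bad-null-model point with a stand-in example (AR(1) autocorrelated noise is my choice here; the MRI case differs in detail): if the test assumes independent samples that the data don't have, the false-positive rate blows past any cutoff, stricter or not.

```python
# One-sample t-test applied to autocorrelated (AR(1)) noise with zero true
# mean. The test's null model assumes i.i.d. data, so its p-values are
# wrong at ANY threshold, including the proposed 0.005.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
trials, n, rho = 2000, 100, 0.8
false_pos = 0
for _ in range(trials):
    e = rng.normal(size=n)
    x = np.empty(n)
    x[0] = e[0]
    for i in range(1, n):
        x[i] = rho * x[i - 1] + e[i]      # positively correlated noise
    if stats.ttest_1samp(x, 0.0).pvalue < 0.005:
        false_pos += 1
print(false_pos / trials)   # far above the nominal 0.005
```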
Just the fact that so few replications have been published indicates huge cultural problems. When I did biomed, in my tiny area of expertise there had been 1-2 thousand papers published since the 1980s. Out of these maybe 2-3 were close to direct replications. None of those showed the results were reproducible, but no one cared...
Usually there were "minor differences" in the methods so it resulted in stuff like:
"Protein P has effect E by acting through receptor R in cell line L from animal A of sex S and age Y when in media M".
However if you changed L, A, S, Y, or M apparently totally different things were going on (there were then supposedly dozens of receptors for each ligand, each receptor having dozens of ligands in different circumstances, etc).
In the end I found that E was nearly perfectly correlated with the molecular weight of P (using data from one of the most cited papers on the topic, in which they specifically claimed there was no correlation with any physical properties of the ligands).
So the effect has nothing to do with specific ligand-receptor interactions at all, but no one cared. Situations like this (with few published direct replications, the ones that are published are contradictory, the results are all being misinterpreted anyway, and everyone just continues on their way when problems are pointed out) are totally standard for biomed. The replication aspect of the issue is really only the tip of the iceberg of problems.
As a Psychology student, this is a well-known initiative: https://cos.io/prereg/
(Though I can't confirm or deny its widespread usage.)
The publication bias is harder, and pre-registration won't solve this. But I think this is a separate issue, and it's important to address each issue in its own right.
I've seen the proposal from TFA before, and with my very limited knowledge I'm still fairly certain it will never come to pass in Psychology, as nearly half of all modern studies have reproducibility issues (!). It would be beneficial to our field, in the way that a band-aid is beneficial to a gaping wound, but it would require a lot more rigor than has evidently been displayed so far (and more rigor is more work, and time is limited).
So... Don't hold your breath.
(Sorry if my comment sounds pessimistic, I don't know much and I'm open to being corrected. I still have enough critical thought to be skeptical of some researchers' dedication to intellectual rigor.)
This is necessary, but not sufficient. What's needed is a way to know for sure that the hypothesis was not changed after data collection. I think predeclaring the hypothesis is the way to go.
Not that education can fix all of this (you can't prevent evil), but if reviewers and journals and conferences started to accept negative results more readily, the incentive to lie would quickly decrease. And people would probably start to "disprove" interesting theories, instead of trying to "prove" niche results...
Fabricating data is essentially fraud. And while it does happen, most of the problems with reproducibility are not problems of fraud.
It won't protect against outliers, but removing outliers will not solve most problems. It'll happen, but again, I don't think the majority of irreproducible studies are due to misuse of outliers.
>Plus it's almost impossible to imagine a world where, before any experiment in any field, you predeclare it.
Not at all. I'm not saying you predeclare every experiment - just every experiment you try to publish.
The way it works is:
1. You make observations (i.e. collect data - no predeclaring anything). If you see interesting patterns, you'll form a hypothesis.
2. This is the stage where you predeclare your hypothesis, and the criterion of falsification.
3. You now collect new data and test it against your hypothesis.
The hard part is ensuring people won't use some of the old data and claim they collected it after their declaration. It's a hard problem, but not an impossible one.
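One low-tech sketch of the commitment half: hash the pre-registration document and publish the digest somewhere public before collecting the confirmation data (the filename here is hypothetical):

```python
# Publish this digest (e.g., in a journal's registry or a public repo)
# before step 3; later, anyone can verify the hypothesis text is unchanged.
import hashlib

with open("preregistration.txt", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print("commit before data collection:", digest)
```

That pins down the hypothesis and falsification criterion; proving when the data were actually collected is the genuinely hard part.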
People are in the habit these days of collecting a lot of data, seeing patterns, and publishing them. That's really not how a lot of early science was done. Once you see the patterns, you need to conduct more experiments to falsify them.
The opposite hypothesis is the null hypothesis which is "gene X is NOT important in disease Y" or "priming DOESN'T affect outcome Z".
Since we assume that most interventions will not affect most outcomes, these are much less surprising and interesting results. They are seen as "water is wet" type of findings and are thus hard to publish because no one is interested.
Now if your hypothesis is something like "X will cause Y to go up" and you actually find it causes Y to go down, that IS publishable. It is only when X has no effect on Y that you will have problems.
This is assuming predeclaring ever becomes the norm.
For example, here the sample size is huge: the USA population shows a significantly increased risk, while the EU population does not. Mixing the two together would result in a smaller but still significant increased risk.
Given the size, it's quite clear that the USA population has many other confounding factors that cannot be eliminated by mathematics alone (there is no control).
Similarly, the technical solution involves technology that does not require drivers and has no risk of human error anymore. The pragmatic solution is to just limit the acceptable speeds.
As the pedantry points out, if "wreckless" were a word, it would actually be more of an opposite of "reckless."
Edit: or rather, it would limit false positives that show up as a result of accidental p-value hacking, if not the process itself.
Any game can be gamed. Bayesian statistics just isn't there yet.
I prefer the PDF version for print outs.
1) The p-value filter leads to publication bias.
-You should publish your results anyway, or the study wasn't designed/performed correctly. The raw data and description of methods should be valuable.
2) The null hypothesis is (almost) always false anyway.
-Everything in bio/psych/etc has a real (not spurious) non-zero correlation with everything else, so the significance level just determines how much data needs to be collected to reject it (see the sketch after this list).
3) Rejection or not of the null hypothesis does not indicate whether the theory/explanation of interest is correct, so is inappropriate for deciding whether a result is interesting to begin with.
-Usually the null hypothesis is very precise and the "alternative statistical hypothesis" that maps to the research hypothesis is very vague, so many alternative research hypotheses may explain the results.
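The sketch promised in point 2: a real but trivially small effect (d = 0.02 here, an arbitrary choice) crosses any fixed significance cutoff once enough data is collected.

```python
# p-values for a tiny but real shift (0.02 SD) as n grows: the null is
# technically false, so p eventually dives under any cutoff you pick.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
for n in (1_000, 10_000, 100_000, 1_000_000):
    x = rng.normal(0.02, 1.0, n)
    print(n, stats.ttest_1samp(x, 0.0).pvalue)
```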
Imagine psychology... done properly. The beginning of a science.
(I realize that "beginning" is too harsh, but psychology does have very serious problems with replicability. At the moment, it deserves its tarnished reputation.)
But whether some amount of evidence is "significant" or not is entirely dependent on your prior. If you believe something has about a 50:50 chance of being true to start with, then a factor of 20 in evidence is quite enough. Now you believe it is 20:1 likely to be true.
But for something like xkcd's "green jelly beans cause cancer", your prior should be something like 1 to 100,000 or even smaller. After all, there are a lot of possible foods and a lot of possible diseases. Unless you believe a significant number of them are dangerous, your prior for any specific food causing any specific disease must be pretty low. And then even a factor of 200 in evidence is nowhere near enough to convince me that green jelly beans cause cancer.
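Spelled out, with posterior odds = prior odds × Bayes factor (the numbers are the ones used above):

```python
# Posterior odds for the two priors discussed: a 50:50 hypothesis vs a
# jelly-bean-style long shot.
def posterior_odds(prior_odds, bayes_factor):
    return prior_odds * bayes_factor

print(posterior_odds(1.0, 20))           # 50:50 prior, BF 20 -> 20:1 in favor
print(posterior_odds(1 / 100_000, 200))  # long-shot prior, BF 200 -> 1:500 against
```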
Of course, this system only worked because academia was a bastion of the male WASP elites that didn't have much pretense of serving the broader public. But at least you didn't have the torrent of mediocre papers that you see today.
Have things really changed? I suspect there are fewer males, but any job that demands 20 years of full-time concerted effort is likely to be dominated by men. Similarly, the western world is overwhelmingly Caucasian, so again... the best predictor (now as then) is that white male professors will be represented disproportionately.
> at least you didn't have the torrent of mediocre papers that you see today.
That certainly is true. The stats for the humanities and social sciences are that 80% of papers have zero citations, i.e., they make no contribution to the greater body of human work.
In Physics (my background), most papers have 2-3 citations, and only a small percentage have 1 or fewer.
I would say that if a discipline is dominated by uncited papers, then that discipline is probably a waste of time. And the professors who work in it are a net drain on society.
> In Physics (my background), most papers have 2-3 citations, and only a small percentage have 1 or fewer
Does that account for self-citations?
No, it refers to White Anglo-Saxon Protestants, an ethno-religious group that cuts across socio-economic class divides and includes plenty of people that are neither old money nor descended from families that have been in the US since the founding (and excludes some old-money, from-the-founding families.)
But then you get the problem of selecting those people without easily measured objective indicators. That's why it worked reasonably well when those were slightly low-paying jobs restricted to a caste.
This seems unintuitive and the claim is unreferenced. Can anyone explain why this is the case (if true)?
Sounds scientific, doesn't it?
> [...] we judge to be reasonable.
And tomorrow someone else judges it differently?
Maybe they should not try to redefine significance but simply introduce something called 'well-reproducible' or some such.
Money saved by using a small sample size is wasted trying to replicate a false positive result and by groups around the world that rely on that false result.
Requiring larger sample sizes would mean fewer experiments are carried out but we will have more confidence in the positive results produced. The outcome is fewer experiments wasted on following up on false positives. None of this requires a change in funding.
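One way to put numbers on that confidence is the standard positive-predictive-value arithmetic (the 10% base rate of true hypotheses and 80% power are assumptions I picked for illustration):

```python
# PPV: of the experiments that come back "significant", what fraction
# are studying a real effect? Base rate and power are illustrative.
def ppv(prior, alpha, power):
    true_pos = prior * power
    false_pos = (1 - prior) * alpha
    return true_pos / (true_pos + false_pos)

for alpha in (0.05, 0.005):
    print(alpha, round(ppv(prior=0.10, alpha=alpha, power=0.80), 2))
# 0.05 -> ~0.64, 0.005 -> ~0.95: fewer resources chasing false positives
```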
For instance, in the field I work in you have to spend days to months waiting for tumors to grow and then go and treat the animals every day for a couple of weeks with an IV drug (weekends too!). That is a lot of work, and at the end it only tells you one piece of information about the drug: does it slow tumor growth in this one experimental model. It may in fact do that -- and you may get a really great p-value if you increase the number of mice -- but you still need to study the drug's pharmacokinetics, tissue distribution, and in vivo mechanism of action (assuming you already know the in vitro mechanism). These are not just optional experiments we happen to require for publication today: this kind of work is essential to presenting a story about a new drug. It's not just about what it does, but how it works and how universalizable it is.
Is this along the lines of what you were hoping to find? Here's more: http://andrewgelman.com/2016/12/13/bayesian-statistics-whats...
You can take a look at the three axioms people use to justify statistics. If you are willing to accept them, all else that relies on them (without using new axioms) must be true:
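Presumably these are Kolmogorov's axioms, which for concreteness read:

```latex
% The three Kolmogorov axioms, for a sample space $\Omega$, event set
% $\mathcal{F}$, and probability measure $P$:
\begin{enumerate}
  \item Non-negativity: $P(A) \ge 0$ for every event $A \in \mathcal{F}$.
  \item Unit measure: $P(\Omega) = 1$.
  \item Countable additivity: for pairwise disjoint events $A_1, A_2, \dots$,
        $P\bigl(\bigcup_{i=1}^{\infty} A_i\bigr) = \sum_{i=1}^{\infty} P(A_i)$.
\end{enumerate}
```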
This same logic is used to justify development in pure mathematics: choose a set of axioms which you accept as ground truths, and prove things using them. As long as you are unable to prove your axioms are contradictory, and the axiom choice seems acceptable, then the work that you've done (with respect to them) is philosophically justified.
One entry into this set of ideas is what Peter Walley has called the "Bayesian dogma of precision" - that every event has a precise probability, that every outcome has a known cost. There are real-world situations when these probabilities cannot be assessed, or may not even exist; same for utilities.
Some examples are in betting and markets (asymmetric information, bounded rationality), and in complex simulation environments having so many parameters and encoded physics that the interpretation of their probabilistic predictions is unclear.
You mean you aren't happy with...probability?