Imagine a society where scientists are really, really bad at hypothesis generation. In fact, they're so bad that they only test null hypotheses that are true. So in this hypothetical society, the null hypothesis in every scientific experiment ever done is true. But statistically, using a p-value threshold of 0.05, we'll still reject the null in 5% of experiments. And those experiments will then end up being published in the scientific literature. This society's scientific literature therefore contains only false results - literally every published scientific result is false.
Of course, in real life, we hope that our scientists have better intuition for what is in fact true - that is, we hope that the "prior" probability in Bayes' theorem, p(null), is not 1.
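To make the role of that prior concrete, here's a quick Python sketch of Bayes' theorem applied to p(null | significant). The power value of 0.8 is an illustrative assumption, not something from the post:

```python
# Bayes' theorem: what fraction of *significant* results have a true null?
# alpha = p(significant | null); power = p(significant | alternative).
# The default power of 0.8 is an illustrative assumption.
def p_null_given_sig(prior_null, alpha=0.05, power=0.8):
    p_sig = alpha * prior_null + power * (1 - prior_null)
    return alpha * prior_null / p_sig

print(p_null_given_sig(1.0))  # the pathological society: 1.0, every finding is false
print(p_null_given_sig(0.5))  # ~0.059: most significant findings are real
```

With prior_null = 1 the posterior is 1 regardless of alpha, which is the hypothetical society above; with a 50/50 prior, only about 6% of significant results are false.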
The problem with this picture is that it treats publication as the end of the scientific story, and acceptance of the finding as fact.
Publication should be the start of the story of a scientific finding. Then additional published experiments replicating the initial publication should comprise the next several chapters. A result shouldn't be accepted as anything other than partial evidence until it has been replicated multiple times by multiple different (and often competing) groups.
We need to start assigning WAY more importance, and way more credit, to replication. Instead of "publish or perish" we need "(publish | reproduce | disprove) or perish".
Edit: Maybe journals could issue "credits" for publishing replications of existing experiments, and require a researcher to "spend" a certain number of credits to publish an original paper?
Tim Lash, the editor of Epidemiology, has some particularly cogent thoughts about replication, including some criticisms of what is rapidly becoming a "one size fits all" approach.
Suppose all experiments use a significance threshold of 0.05 and have power 0.95. Suppose scientists generate 400 true hypotheses and 400 false hypotheses. One experiment on each hypothesis validates 380 true hypotheses and 20 false ones, for a cost of 800 experiments. If we do one layer of replication on each validated hypothesis, then, among the validated hypotheses, the 380 true become 361 doubly-validated true hypotheses and 19 once-validated-once-falsified (let's abbreviate "1:1") true hypotheses; the 20 false become one 2:0 false hypothesis and 19 1:1 false hypotheses; all this increases the cost by 50%. Then it seems clear that a third test on the 38 1:1 hypotheses would be decently justified, and those become 18.05 2:1 true hypotheses, 0.95 1:2 true hypotheses, 0.95 2:1 false hypotheses, and 18.05 1:2 false hypotheses. If we then accept the 2:0 and 2:1 hypotheses, we get 379.05 true and 1.95 false hypotheses at a cost of 1238 experiments, vs the original 380 true and 20 false at a cost of 800 experiments; the cost increase is about 55%.
On the other hand, suppose scientists generate 400 true and 4000 false hypotheses. The first experiments yield 380 1:0 true and 200 1:0 false hypotheses, at the cost of 4400 experiments. The validation round yields 361 2:0 true, 19 1:1 true, 10 2:0 false, and 190 1:1 false, costing 580 extra experiments; re-running the 1:1s, we get 18.05 2:1 true, 0.95 1:2 true, 9.5 2:1 false, and 180.5 1:2 false, costing 209 extra experiments. Taking the 2:0 and 2:1s, we get 379.05 true and 19.5 false hypotheses for 5189 experiments, instead of 380 true and 200 false hypotheses costing 4400 experiments; the cost increase is 18%.
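This bookkeeping is easy to slip on by hand, so here is a small Python sketch (mine, not the poster's) that recomputes the expected counts for both scenarios, assuming, as the numbers imply, power 0.95 and threshold 0.05:

```python
# Expected-count bookkeeping for the two-round-plus-tiebreaker scheme described
# above. These are expected values, not a random simulation; power and alpha
# match the figures implied in the worked examples (0.95 and 0.05).
def replication_tiers(n_true, n_false, power=0.95, alpha=0.05):
    cost = n_true + n_false                                # round 1: test everything
    true_10, false_10 = n_true * power, n_false * alpha    # validated after round 1
    cost += true_10 + false_10                             # round 2: replicate each
    true_20, false_20 = true_10 * power, false_10 * alpha  # 2:0 tallies
    true_11, false_11 = true_10 * (1 - power), false_10 * (1 - alpha)  # 1:1 tallies
    cost += true_11 + false_11                             # round 3: break the ties
    accepted_true = true_20 + true_11 * power              # accept 2:0 and 2:1
    accepted_false = false_20 + false_11 * alpha           # note: 1 + 0.95 = 1.95 false
    return accepted_true, accepted_false, cost

print(replication_tiers(400, 400))   # roughly (379.05, 1.95, 1238)
print(replication_tiers(400, 4000))  # roughly (379.05, 19.5, 5189)
```

The second call reproduces the low-prior field: far more false hypotheses killed per extra experiment, and a proportionately smaller cost increase.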
So it's clear that, in a field where lots of false hypotheses are floating around, the cost of extra validation is proportionately not very much, and also you kill more false hypotheses (on average) with every experiment.
What is the "cost" of believing false hypotheses? It depends on what one does with one's belief. Hmm.
It would be nice if someone made a stab at estimating the overall costs and benefits and making a knock-down argument for more validation.
This would cripple small labs, unless people's startup packages come with potentially millions of dollars in funding to get their first few "credits".
If a null hypothesis is invariably true, it's impossible to reject it. Which means the scientists will not be able to find any statistic or data to support any of their bad, original hypotheses. Not 5%, not 0.005%, nor whatever.
p-values are not flawed. They are a useful tool for a certain category of jobs: namely to check how likely your sample is, given a certain hypothesis.
The argument in the original post is a bit of a straw man fallacy.
"I want to know the probability that the null is true given that an observed effect is significant. We can call this probability 'p(null | significant effect)'."
OK, hypothesis testing can't answer this type of question.
Then "However, what NHST actually tells me is the probability that I will get a significant effect if the null is true. We can call this probability "p(significant effect | null)"."
Not quite correct. It's "p(still NOT a significant effect whatever it means | null)".
EDIT. Fixed the last sentence.
Why argue when you can simulate:
> n <- 50
> simulations <- 10000
> sd <- 1
> se <- sd/sqrt(n)
> crit <- 1.96 * se
> mean(abs(colMeans(sapply(rep(n, simulations), rnorm))) > crit)  # fraction of true nulls rejected; comes out near 0.05
May I ask you why you chose to use the normal distribution in your example or any distribution at all, for that matter? What I was replying to was
">they only test null hypothesis that are true."
Which means that the null hypothesis is always true no matter what data you collect trying to reject it. It does not depend on the null distribution (normal in your example), the value of the test statistic (the mean of the sample in your example), or the threshold (crit in your example). In fact, the null distribution in this case is not a distribution at all since there's no randomness in the null hypothesis. We know for a fact that it is always true (in the hypothetical situation we are considering).
It's more like
> rep(FALSE, simulations) # is the null hypothesis false? nope
> abs(colMeans(sapply(rep(n, simulations), rnorm))) > +Inf
> n <- 50
> simulations <- 10000
> x <- sapply(rep(n, simulations), rnorm)
> p <- sapply(apply(x, 2, FUN=t.test), function(tt) tt$p.value)
> pa <- p.adjust(p, method="fdr")
> library(boot)
> boot.out <- boot(pa, function(d, i) mean(d[i]), R=1000)
> boot.ci(boot.out, conf=0.95, type="basic")
boot.ci(boot.out = boot.out, type = "basic")
95% ( 0.9774, 0.9780 )
Calculations and Intervals on Original Scale
P.S. p-values are great when used appropriately.
The distribution is not important, any other data generator would do.
> Which means that the null hypothesis is always true no matter what data you collect trying to reject it.
The idea behind the thought experiment was that we live in a world in which researchers always investigate things that will turn out not to exist / be real, but the researchers themselves don't know this! Otherwise they wouldn't bother to run the investigations in the first place.
> In fact, in your example, since you are essentially running 1000 hypothesis tests on different samples, multiple hypothesis correction would solve the "problem" with p-value.
They're not multiple tests. They're multiple simulations of the same test, to show how the test performs in the long run.
Perhaps you're a wonderful statistician, I wouldn't know, but nothing you have said thus far about null hypothesis significance testing makes any sense or is even remotely correct.
You've never heard of random error? Just because a null hypothesis accurately describes a data-generating phenomenon doesn't mean you will never get samples skewed enough to show a significant effect.
Pretend we are comparing neighborhoods. Say the true average age in my neighborhood and in your neighborhood is actually equal, at 40, but my alternative hypothesis is that the average age of residents in my neighborhood is lower than in yours (thus the null is that they are the same, which unbeknownst to me is the truth). You are claiming that no matter how many random samples of residents of our two neighborhoods we take, they will always be close enough in average age that we will always fail to reject the null. That's obviously not the case.
In fact, by definition, a significance threshold of 0.05 means we should expect 5% of the samples we draw to indicate my neighborhood is significantly younger than yours, even though that isn't true, solely due to the randomness of our samples. That's literally the purpose of the p-value.
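A Python sketch of exactly this setup (two neighborhoods with the same true mean age of 40; the spread of 10 years is an illustrative choice) shows the rejection rate landing near 5% even though the null is true. The large-sample critical value 1.96 stands in for the exact t threshold:

```python
import random
import statistics

random.seed(0)
n, sims, crit = 50, 2000, 1.96       # sample size, repetitions, approx. 5% two-sided cutoff
false_positives = 0
for _ in range(sims):
    # Both neighborhoods truly average age 40: the null is true by construction.
    a = [random.gauss(40, 10) for _ in range(n)]
    b = [random.gauss(40, 10) for _ in range(n)]
    # Welch-style t statistic for the difference in sample means.
    se = (statistics.variance(a) / n + statistics.variance(b) / n) ** 0.5
    t = (statistics.mean(a) - statistics.mean(b)) / se
    if abs(t) > crit:
        false_positives += 1
rate = false_positives / sims
print(rate)  # close to 0.05 despite the null being true in every repetition
```

About one repetition in twenty "discovers" an age difference that does not exist, which is the random error being described.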
But the thing is, the public and the scientific community have to be the ones who judge the extraordinariness of a claim. If experimenters were to wrap their results in their own belief in the likelihood of the hypothesis, the observer wouldn't be able to judge anything. So experimenters reporting p-values seems as good a process as any. It's just that readers of results need to be critical and not assume .05 is a "gold standard" in all cases.
Precisely. That's the point. Hypothesis testing is inherently absurd.
What's absurd is thinking that the output of a single experiment alone gives certainty about a hypothesis, or a fixed probability of a hypothesis, or anything fully quantified.
You're always going to have the context of reality. Not only will you have the null hypothesis, you'll have competing hypotheses to explain the given data.
But the point of science isn't blindly constructing experiments; it's forming something you think might be true and doing enough careful experiments to convince yourself and others, in the context of our overall understanding of the world, that the hypothesis is true. Common sense, Occam's Razor, the traditions of a given field and so forth go into this.
In the cartoon, the scientists are making multiple comparisons, which is something strictly forbidden in frequentist hypothesis testing. One way to get around it is to apply a correction by dividing the significance threshold ("alpha") by the number of comparisons being made, in this case 20. The cartoon does not state its actual p-value, as most journals would require, but the hope would be that after dividing by the corrective factor the significance of that particular comparison goes away.
So p-value methods still lead to a lot of Type I and Type II errors, but in the past they have been the best science has been able to come up with. Actually, probably the greatest issue with false results in the scientific literature is that null results are not publishable. This leads to cases where 20 scientists might independently perform the same experiment where the null is true, only for one to find a significant result. The demand for positive results acts as a filter that lets only the Type I errors through! This is just one problem with the publishing culture, and doesn't take into account researchers' bias toward manipulating the data or experiment until p < .05.
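The arithmetic behind both points is short: with 20 comparisons (or 20 independent labs) and every null true, the chance of at least one spurious hit is large, and the alpha-division correction described above pulls the family-wise rate back near 5%. A Python sketch:

```python
# Family-wise error rate for m independent tests of true nulls at level alpha,
# before and after a Bonferroni-style correction (alpha / m per test).
alpha, m = 0.05, 20

p_any = 1 - (1 - alpha) ** m              # P(at least one spurious "discovery")
print(round(p_any, 3))                     # about 0.642

bonferroni = alpha / m                     # corrected per-test threshold, 0.0025
p_any_corrected = 1 - (1 - bonferroni) ** m
print(round(p_any_corrected, 3))           # about 0.049, back near the nominal 5%
```

So the jelly-bean scenario, and the 20-labs scenario, both produce a spurious positive roughly two times out of three if no correction is applied and only positives get published.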
An alternative to the frequentist methodology of p-values is the Bayesian approach, which has its own problems. First there are practical concerns, such as choosing priors that can affect your results despite sometimes being arbitrarily chosen, and the high computational demand of calculating results (less of an issue in the 21st century, which is why the method is seeing a revival in the scientific community). Probably the main problem right now is that practitioners simply aren't familiar with how to employ Bayesian methods, so there's some cultural inertia preventing their immediate adoption.
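As a toy illustration of the prior-sensitivity concern, here's a Python sketch (hypothetical numbers throughout) of a conjugate Beta-binomial update, where two different but defensible priors give noticeably different posterior means for the same data:

```python
# Beta(a, b) prior + binomial data -> Beta(a + successes, b + failures) posterior.
# Posterior mean is (a + successes) / (a + b + trials). All numbers hypothetical.
def posterior_mean(successes, trials, prior_a, prior_b):
    return (prior_a + successes) / (prior_a + prior_b + trials)

data = (7, 10)                          # 7 successes in 10 trials
print(posterior_mean(*data, 1, 1))      # flat Beta(1,1) prior: ~0.667
print(posterior_mean(*data, 10, 10))    # skeptical Beta(10,10) prior: ~0.567
```

With only 10 observations the prior visibly moves the answer; with enough data the two posteriors converge, which is the usual rejoinder to the arbitrariness objection.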
It seems odd to talk about "results" as an average across all fields, rather than for a specific field. It's much more common for people to claim that psychology rather than physics has a reproducibility crisis, and thus I don't think it makes sense to talk about the combined reproducibility across both fields. What research are you referencing, and what fields did they look at? Given the differences across fields, if the average is 2:1 reproducible, I'd guess that some fields must be lower than 1:1.
Originally it was intended that peer review in published journals and study reproduction would verify findings. In a small community where all results are treated equally, this works fine. In a world without data systems to organize data and documents, this was really the only reasonable method, too.
However, we don't live in that world anymore. The community isn't small, and information science and data processing are much advanced. Unfortunately, since careers are built on novel research, reproduction is discouraged. Since studies where the null hypothesis is not rejected are typically not published at all, it can be difficult to even know what research has been done. There are also a large enough number of journals that researchers can venue shop to some extent, as well.
Many researchers are abandoning classic statistical models entirely in favor of Bayes factors [https://en.wikipedia.org/wiki/Bayes_factor]. Others are calling for publishing more studies where the null hypothesis is not rejected (some journals specializing in this like [http://www.jasnh.com/] have been started). Others are calling for all data to be made available for all studies to everyone (open science data movement). Others are trying to find ways to make reproduction of studies increasingly important.
It's really a very complicated problem.
Look at how much damage was done by science misleading people about nutrition in regards to carbs and fats. How often, especially from the social sciences, does some scientific finding get reported by popular media as some major finding which should have drastic effects on social/legal policy, only for the root experiment to be a single study with a p < 0.05 where the authors caution against drawing any conclusions other than 'more research is needed'? Violence and media is a good example, and even more so when we consider the more prurient variants thereof.
I think this is the basis of why I am more willing to trust new research in physics more than in sociology.
Also training your scientists - especially those outside the physical sciences - that effects likely aren't fixed in a meaningful sense (i.e. the effect of smoking on lung cancer isn't a universal constant in the way the speed of light is), at which point multiple estimates of the same effect from different groups and populations have value.
And if you extend the hypothetical such that everyone in that society always had true nulls, there wouldn't even be a need for science. We'd all be too used to never being wrong.
More seriously, you do make a good point, which is that all scientists lie on a spectrum from always generating true hypotheses to always generating false hypotheses. Scientists in different fields tend to lie closer to one or the other of the extremes. My experience is that the observational sciences are shifted more toward the always-false end than the experimental sciences.
More seriously the social sciences do have a lot of problems, some driven by the methodologies used, some by ideology, and some by the inherent noisiness and unreliability of the data available. Not an easy area to do science in.
In the experimental sciences you can get far using the rule of thumb that if you need statistics you did the wrong experiment, while in the observational sciences the use of statistics is inherent.
When we start to treat the hypotheses as "true" instead of "likely", we fall into a trap of not being able to reconsider the past evidence in the light of new evidence. We hold onto the "truths" of previous hypotheses instead of taking a fresh look.
An example of this is the current model used for astrophysics, where the basic "truth" that is the consensus of the majority working in the area is that "gravity" is the only significant force operating at distances above macroscopic. I use "gravity" because there is much debate in various areas as to what this force actually is.
There is evidence that our explanations are either incomplete or wrong. Yet this fundamental "truth" is unquestioned in the majority and where it is questioned, those questioners become personae non gratae.
It happens in the climate change debate. Here the "truth" is that the causes are anthropogenic. So if you question that "truth", you are persona non grata. Yet the subject is so complex that we do not know to what extent, if much at all, human activity changes the climate over long periods of time. To question the "truth" of essentially anthropogenic causes of climate change means that detailed investigations into the actual causes do not get undertaken if they do not support the "truth" hypothesis.
In real life, scientists are people with the same range of foibles and fallibilities as everyone else. Just because one is "smart" doesn't mean one is clear-headed and logical. Just because the "scientific consensus" is for one model or another doesn't make that "scientific consensus" any more true than an alternative model that explains what we see.
We need to stop getting uptight about our favourite models and those who dispute them. We need to be able to take fresh looks at the data and see if there are alternatives that may provide a better working model. We also need to get away from considering successful models as "truth" and more as the "current successful working" models.
The same seems to happen in the climate change debate: there is a huge range of experiments, across which anthropogenic warming is the maximum-likelihood model. Many people select a single experiment, find a model with a better fit, and then loudly proclaim that anthropogenic warming is a conspiracy. However, their model is a terrible fit to the other experiments, which they did not do due diligence in checking.
Scientists grow tired of playing politics. If you have an alternate model, it needs to fit a vast set of observations, not a cherry picked one. If you only test against one observation and make a press release about it, you will definitely not be seen as a serious scientist.
To say that "the fundamental truth that there is dark matter..." is problematic from the get-go. No experiment has demonstrated that "dark matter" of any kind exists. You cannot say that there is a fundamental "truth" anywhere in science. We have observation; we develop a hypothesis, which should suggest experiments to test said hypothesis; and with further evidence we develop theory. At no point is either hypothesis or theory "truth". Unless, of course, your intention is to make science into a religion.
When it boils down to it, science is a way of developing understanding of the physical world about us. It may lead to changes in one's philosophical or religious viewpoint, but it doesn't have to. It is not the be all and end all of anything. It is simply a means of hopefully increasing one's understanding. Sometimes it does and sometimes it doesn't. There are many examples of experiments and the results that have been considered anathema to the consensus view that the scientists who did those experiments have been made pariahs. This is very problematic as politics and religion become the driving forces that maintain the orthodox view.
There has been and is a significant push for science to be the authoritative voice as to what one should believe. However, science gives no guidance on any matters relating to human interaction or action. If anything, it is a cause of significant problems for human interaction and action.
However if you are speaking in terms of what "we know", I think you have to acknowledge that the scientific consensus is that AGW is real. That doesn't prove it is true -- nothing in our world outside of math is ever truly proven. But it puts the burden of proof on doubters to not only provide a different/better theory, but also to explain why everyone else is wrong.
If your position is that everyone else is wrong, but the "actual causes" are not known yet, then you just end up looking like someone who has their thumb on the scale, and is invested in a particular outcome.
Just biting a bit: it depends on the logic you use. The logic might not be true in our Universe, just in mathematicians' heads / an idealized Universe. And even for those idealized Universes there is no real consensus on whether they, or which of them, are true, or just useful.
Imagine you add Maybe as a third value between True and False. Later you might find one Maybe is not enough; you might need four different Maybes. Then suddenly it dawns on you that a countable number of Maybes is the minimum. Then you throw away such a logic because it's practically useless, even if it models reality better, with the side effect of breaking established math as well. Then you wonder why simple binary logic is quite good at describing many things in the real Universe, yet you have no means to prove any relation between this logic, the math derived from it, and the reality you live in and observe.
It should be irrelevant what the consensus view may be. If an alternative model or theory is proposed, then the model or theory should stand on its merits not on whether or not it agrees with the consensus view.
My view is that science is about gaining some understanding of the universe about us. If a model or theory is useful in that understanding then good, it is useful. But if a theory or model develops big holes then perhaps we should be looking for alternatives with lesser holes.
Take the example of the Standard Model of subatomic physics. Within it, there are some quite large holes that are papered over with theoretical mathematics. Yet if one steps back and takes another look at what is being seen, there are some interesting observations to be made that raise questions about the validity of the Standard Model.
As for the Standard Model - scientists would dearly love to find observations that challenge it, but so far there’s been no consistent, high-quality evidence of physics beyond it.
The standard model requires a couple of base assumptions that are contradictory and problematic.
I have questions that I have posed to climate scientists, and if a reasonable answer comes back then anthropogenic climate change is on the cards. But in fifteen years, nary an answer to those questions has come back, so any prognostications by climate scientists based on their models are, as far as I am concerned, worthless.
As far as the evidence is concerned, it may or may not support an anthropogenic causal regime. But, on the basis of that evidence, I lean towards a non-anthropogenic majority cause for climate change.
As far as how science works: climate scientists make many assumptions about their proxies that have not been verified as conclusively accurate. There is sufficient evidence, if you actually look around for it, to say that the interpretation of the proxy evidence is either incomplete, wrong, or meaningless.