Why I've lost faith in p values (ucdavis.edu)
354 points by anacleto 10 months ago | 173 comments

Here's a simpler thought experiment that gets across the point of why p(null | significant effect) /= p(significant effect | null), and why p-values are flawed as stated in the post.

Imagine a society where scientists are really, really bad at hypothesis generation. In fact, they're so bad that they only test null hypotheses that are true. So in this hypothetical society, the null hypothesis in any scientific experiment ever done is true. But statistically, using a significance threshold of 0.05, we'll still reject the null in 5% of experiments. And those experiments will then end up being published in scientific literature. But then this society's scientific literature now only contains false results - literally all published scientific results are false.

Of course, in real life, we hope that our scientists have better intuition for what is in fact true - that is, we hope that the "prior" probability in Bayes' theorem, p(null), is not 1.
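
To put rough numbers on that, here is a minimal R sketch of Bayes' theorem for p(null | significant effect). The function name `p_null_given_sig` and the default power of 0.8 are my own assumptions for illustration, not something from the post.

```r
# Posterior probability that the null is true, given a significant result.
# prior_null: p(null)
# alpha:      p(significant effect | null), the significance threshold
# power:      p(significant effect | null is false), an assumed value
p_null_given_sig <- function(prior_null, alpha = 0.05, power = 0.8) {
  false_pos <- alpha * prior_null        # significant although the null is true
  true_pos  <- power * (1 - prior_null)  # significant and the null is false
  false_pos / (false_pos + true_pos)
}

p_null_given_sig(1)    # the hypothetical society: every tested null is true
p_null_given_sig(0.5)  # half of the tested nulls are true
p_null_given_sig(0.9)  # most of the tested nulls are true
```

With a prior of 1 the function returns 1, which is the thought experiment above: every significant result is a false one. Under the assumed power, the answer drops quickly as the prior moves away from 1, which is why p(null) matters so much.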

> But statistically, using a significance threshold of 0.05, we'll still reject the null in 5% of experiments. And those experiments will then end up being published in scientific literature. But then this society's scientific literature now only contains false results - literally all published scientific results are false.

The problem with this picture is that it's showing publication as the end of the scientific story, and the acceptance of the finding as fact.

Publication should be the start of the story of a scientific finding. Then additional published experiments replicating the initial publication should comprise the next several chapters. A result shouldn't be accepted as anything other than partial evidence until it has been replicated multiple times by multiple different (and often competing) groups.

We need to start assigning WAY more importance, and way more credit, to replication. Instead of "publish or perish" we need "(publish | reproduce | disprove) or perish".

Edit: Maybe journals could issue "credits" for publishing replications of existing experiments, and require a researcher to "spend" a certain number of credits to publish an original paper?

That's a good idea: encourage researchers to focus on a mix of replication and new research. When writing grants, part of the grant might go towards replicating interesting/unexpected results and the rest towards new research. Moreover, given that the experiment has already been designed, replication could end up demanding much less effort from a PI and allow their students to gain some deliberate practice in experiment administration and publication. On the other hand, scholarly publication might have to change to allow for summary reporting of replication results, to stave off a lot of repetition.

My field has less of a "You publish first or you're not interesting" culture than many others, and part of that is recognizing that estimating an effect in a different population, with different underlying variables, is an interesting result in its own right.

Tim Lash, the editor of Epidemiology, has some particularly cogent thoughts about replication, including some criticisms of what is rapidly becoming a "one size fits all" approach.

Let's think about costs.

Suppose all experiments use a significance threshold of 0.05 and, symmetrically, validate a true hypothesis 95% of the time. Suppose scientists generate 400 true hypotheses and 400 false hypotheses. One experiment on each hypothesis validates 380 true hypotheses and 20 false ones, for a cost of 800 experiments. If we do one layer of replication on each validated hypothesis, then, among the validated hypotheses, the 380 true will become 361 doubly-validated true hypotheses and 19 once-validated-once-falsified (let's abbreviate "1:1") true hypotheses; the 20 false will become one 2:0 false hypothesis and 19 1:1 hypotheses; all this increases the cost by 50%. Then it seems clear that doing a third test on the 38 1:1 hypotheses would be decently justified, and those will become 18.05 2:1 true hypotheses, 0.95 1:2 true hypotheses, 0.95 2:1 false hypotheses, and 18.05 1:2 false hypotheses. If we then accept the 2:0 and 2:1 hypotheses, we get 379.05 true and 1.95 false hypotheses at the cost of 1238 experiments, vs the original of 380 true and 20 false at the cost of 800 experiments; the cost increase is about 55%.

On the other hand, suppose scientists generate 400 true and 4000 false hypotheses. The first experiments yield 380 1:0 true and 200 1:0 false hypotheses, at the cost of 4400 experiments. The validation round yields 361 2:0 true, 19 1:1 true, 10 2:0 false, and 190 1:1 false, costing 580 extra experiments; re-running the 1:1s, we get 18.05 2:1 true, 0.95 1:2 true, 9.5 2:1 false, and 180.5 1:2 false, costing 209 extra experiments. Taking the 2:0 and 2:1s, we get 379.05 true and 19.5 false hypotheses for 5189 experiments, instead of 380 true and 200 false hypotheses costing 4400 experiments; the cost increase is 18%.

So it's clear that, in a field where lots of false hypotheses are floating around, the cost of extra validation is proportionately not very much, and also you kill more false hypotheses (on average) with every experiment.
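
For anyone who wants to try other mixes of true and false hypotheses, the bookkeeping can be sketched in R. The function name `replication_costs` is mine, and the power of 0.95 is an assumption read off the "380 of 400" figure; it works in expected counts, like the arithmetic in this comment, not simulated ones.

```r
# Expected counts for: test everything once, replicate the hits,
# then run one tie-breaker on every 1:1 hypothesis.
# alpha: chance a false hypothesis validates in one experiment
# power: chance a true hypothesis validates in one experiment (assumed 0.95)
replication_costs <- function(n_true, n_false, alpha = 0.05, power = 0.95) {
  # Round 1: one experiment per hypothesis
  t1 <- n_true * power
  f1 <- n_false * alpha
  cost1 <- n_true + n_false
  # Round 2: replicate every validated hypothesis
  t20 <- t1 * power
  t11 <- t1 * (1 - power)
  f20 <- f1 * alpha
  f11 <- f1 * (1 - alpha)
  cost2 <- t1 + f1
  # Round 3: one more experiment for each 1:1 hypothesis
  t21 <- t11 * power
  f21 <- f11 * alpha
  cost3 <- t11 + f11
  total <- cost1 + cost2 + cost3
  c(accepted_true     = t20 + t21,
    accepted_false    = f20 + f21,
    experiments       = total,
    cost_increase_pct = 100 * (total - cost1) / cost1)
}

round(replication_costs(400, 400), 2)   # equal numbers of true and false
round(replication_costs(400, 4000), 2)  # ten false hypotheses per true one
```

Accepting the 2:0 and 2:1 hypotheses is the rule used in the two scenarios; a different acceptance rule would need different bookkeeping.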

What is the "cost" of believing false hypotheses? It depends on what one does with one's belief. Hmm.

It would be nice if someone made a stab at estimating the overall costs and benefits and making a knock-down argument for more validation.

Some of those false hypotheses were very expensive. Especially those related to nutrition science.

"Maybe journals could issue "credits" for publishing replications of existing experiments, and require a researcher to "spend" a certain number of credits to publish an original paper?"

This would cripple small labs, unless people's startup packages come with potentially millions of dollars in funding to get their first few "credits".

It depends on the field. The policy would best be followed in an area like experimental psychology, where replication is not extremely costly (and where it might be an especially large problem).

>they only test null hypotheses that are true.

If a null hypothesis is invariably true, it's impossible to reject it. Which means the scientists will not be able to find any statistic or data to support any of their bad, original hypotheses. Not 5%, not 0.005%, nor whatever.

p-values are not flawed. They are a useful tool for a certain category of jobs: namely to check how likely your sample is, given a certain hypothesis.

The argument in the original post is a bit of a straw man fallacy.

"I want to know the probability that the null is true given that an observed effect is significant. We can call this probability "p(null | significant effect)"

OK, hypothesis testing can't answer this type of question.

Then "However, what NHST actually tells me is the probability that I will get a significant effect if the null is true. We can call this probability "p(significant effect | null)"."

Not quite correct. It's "p(still NOT a significant effect whatever it means | null)".

EDIT. Fixed the last sentence.

> If a null hypothesis is invariably true, it's impossible to reject it. Which means the scientists will not be able to find any statistic or data to support any of their bad, original hypotheses. Not 5%, not 0.005%, nor whatever.

Why argue when you can simulate:

    > n <- 50
    > simulations <- 10000
    > sd <- 1
    > se <- sd/sqrt(n)
    > crit <- 1.96 * se
    > mean(abs(colMeans(sapply(rep(n, simulations), rnorm))) > crit)
    [1] 0.0494
Lo and behold, we reject the null hypothesis that the mean of a normal distribution is equal to zero in 5% of all simulations, even though the null hypothesis is in fact true. (`rnorm` defaults to 0 mean and 1 sd)

It's always refreshing to meet a fellow R hacker on HN!

May I ask you why you chose to use the normal distribution in your example or any distribution at all, for that matter? What I was replying to was

">they only test null hypothesis that are true."

Which means that the null hypothesis is always true no matter what data you collect trying to reject it. It does not depend on the null distribution (normal in your example), the value of the test statistic (the mean of the sample in your example), or the threshold (crit in your example). In fact, the null distribution in this case is not a distribution at all since there's no randomness in the null hypothesis. We know for a fact that it is always true (in the hypothetical situation we are considering).

It's more like

     > rep(FALSE, simulations) # is the null hypothesis false? nope
or, if you insist on using the normal distribution,

     > abs(colMeans(sapply(rep(n, simulations), rnorm))) > +Inf

In fact, in your example, since you are essentially running 10000 hypothesis tests on different samples, multiple hypothesis correction would solve the "problem" with p-value. This is how I would do it.

     > n <- 50
     > simulations <- 10000
     > x <- sapply(rep(n, simulations), rnorm)
     > p <- sapply(apply(x, 2, FUN=t.test), function(tt) tt$p.value)
     > pa <- p.adjust(p, method="fdr")
     > library(boot)
     > boot.out <- boot(pa, function(d, i) mean(d[i]), R=1000)
     > boot.ci(boot.out, conf=0.95, type="basic")

     CALL :
     boot.ci(boot.out = boot.out, type = "basic")

     Intervals :
     Level      Basic
     95%   ( 0.9774,  0.9780 )
     Calculations and Intervals on Original Scale

P.S. p-values are great when used appropriately.

> May I ask you why you chose to use the normal distribution in your example or any distribution at all, for that matter?

The distribution is not important, any other data generator would do.

> Which means that the null hypothesis is always true no matter what data you collect trying to reject it.

The idea behind the thought experiment was that we live in a world in which researchers always investigate things that will turn out not to exist / be real, but the researchers themselves don't know this! Otherwise they wouldn't bother to run the investigations in the first place.

> In fact, in your example, since you are essentially running 10000 hypothesis tests on different samples, multiple hypothesis correction would solve the "problem" with p-value.

They're not multiple tests. They're multiple simulations of the same test, to show how the test performs in the long run.

Perhaps you're a wonderful statistician, I wouldn't know, but nothing you have said thus far about null hypothesis significance testing makes any sense or is even remotely correct.

> If a null hypothesis is invariably true, it's impossible to reject it. Which means the scientists will not be able to find any statistic or data to support any of their bad, original hypotheses. Not 5%, not 0.005%, nor whatever.

You've never heard of random error? Just because a null hypothesis may accurately describe a data generating phenomenon doesn't mean you will never get samples that are skewed enough to show a significant effect.

Pretend we are comparing neighborhoods. Say the true average age of the people in my neighborhood and your neighborhood is actually equal, at 40, but my alternative hypothesis is that the average age of residents in my neighborhood is younger than yours (thus the null is they are the same, which unbeknownst to me is the truth). You are claiming that no matter how many random samples of residents of our two neighborhoods we take, they will always be close enough in average age that we will always fail to reject the null. That's obviously not the case.

In fact, by definition, a significance threshold of 0.05 means we should expect 5% of the samples we draw to indicate my neighborhood is significantly younger than yours, even though that isn't true, solely due to the randomness of our samples. That's literally what the threshold on the p-value controls.
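
The neighborhood example is easy to check by simulation. This is only a sketch: the sample size of 30 per neighborhood, the standard deviation of 15 years, the normal distribution of ages, and the seed are all assumptions of mine; only the shared true mean of 40 comes from the example.

```r
set.seed(1)               # arbitrary seed, only for reproducibility
n_per_neighborhood <- 30  # assumed sample size
age_sd <- 15              # assumed spread of ages, in years
simulations <- 10000

p_values <- replicate(simulations, {
  # Both neighborhoods truly have the same mean age of 40
  mine  <- rnorm(n_per_neighborhood, mean = 40, sd = age_sd)
  yours <- rnorm(n_per_neighborhood, mean = 40, sd = age_sd)
  # One-sided test: is my neighborhood younger than yours?
  t.test(mine, yours, alternative = "less")$p.value
})

# Share of sample pairs that "find" my neighborhood significantly younger
mean(p_values < 0.05)
```

The last line should land close to 0.05, even though the two neighborhoods are identical by construction.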

It seems like your reasoning and the reasoning of the author could be applied to any statistic testing the reliability of a hypothesis, not simply p values. Further, you could mitigate this problem if you knew the prior probability, sure. But how do you expect a bad hypothesis generator to be good at knowing the prior probability? The usual standard is "extraordinary claims require extraordinary evidence." The less likely a hypothesis, the stronger the evidence required, measured as p-values or otherwise.

But the thing is, the public and the scientific community have to be the ones who judge the extraordinariness of a claim. If an experimenter were to wrap their results in their own belief in the likelihood of the hypothesis, the observer wouldn't be able to judge anything. So it seems like experimenters reporting p-values is as good a process as any. It's just that the readers of results need to be critical and not assume .05 is a "gold standard" in all cases.

> It seems like your reasoning and the reasoning of the author could be applied to any statistic testing the reliability of a hypothesis, not simply p values.

Precisely. That's the point. Hypothesis testing is inherently absurd.

Hypothesis testing is the "soul" of science.

What's impossible is thinking that just the output of a single experiment gives hypothesis certainty, or a fixed probability of a hypothesis or anything fully quantified.

You're always going to have the context of reality. Not only will you have the null hypothesis, you'll have competing hypotheses to explain the given data.

But the point of science isn't blindly constructing experiments but instead forming something you think might be true and doing enough careful experiments to convince yourself and others, in the context of our overall understanding of the world, that the hypothesis is true. Common sense, Occam's Razor, the traditions of a given field and so forth go into this.

Then, hypothesis testing was born in the context of industrial quality control, where the true data generating process is very close to being well-known and deviation from the norm raises a red flag rather than suggests new knowledge about how breweries work.

XKCD already imagined such a universe: https://xkcd.com/882/

While intended as light humor, this actually seems like a really damning argument to me. It's conceptually similar to overfitting a machine learning model by aggressively tuning hyperparameters without proper cross-validation, etc. What serious defenses are there after this sort of attack?

Serious defense: p-values have been in use for a long time, and while they are error prone a larger number of true results has been found than false results (according to recent research the ratio is 2:1 reproducible to non-reproducible).

In the cartoon, the scientists are making multiple comparisons, which is something strictly forbidden in frequentist hypothesis testing. One way to get around it is to apply a correction by dividing the significance threshold ("alpha") by the number of comparisons being made, in this case 20. The cartoon does not state its actual p-value, as most journals would require, but the hope would be that, once the threshold is divided by the corrective factor, the significance of that particular comparison goes away.
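
A small R sketch of what dividing alpha by the number of comparisons buys you. It assumes the 20 comparisons are independent and that every null is true; the p-value of 0.04 for the one "significant" comparison is a made-up number, since the cartoon does not give one.

```r
alpha <- 0.05
comparisons <- 20

# Chance of at least one false positive across 20 independent true nulls
fwer_uncorrected <- 1 - (1 - alpha)^comparisons

# Dividing alpha by the number of comparisons (the Bonferroni correction)
alpha_corrected <- alpha / comparisons
fwer_corrected <- 1 - (1 - alpha_corrected)^comparisons

round(c(uncorrected = fwer_uncorrected, corrected = fwer_corrected), 3)

# Same idea via p.adjust(): one hypothetical p = 0.04 among 20 tests
p <- c(0.04, rep(0.5, 19))
p_adjusted <- p.adjust(p, method = "bonferroni")
p_adjusted[1] < alpha   # FALSE: no longer significant after correction
```

Without correction the chance of at least one spurious hit is about 64%; with the corrected threshold it falls back to just under 5%.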

So p-value methods still lead to a lot of Type I and Type II errors, but in the past they have been the best science has been able to come up with. Actually, probably the greatest issue with false results in the scientific literature is that null results are not publishable. This leads to a case where 20 scientists might independently perform the same experiment where the null is true, for only one to find a significant result. The demand for positive results only acts as a filter where only Type I errors get made! This is just one problem with the publishing culture, and doesn't take into account researchers' bias to manipulate the data or experiment until p < .05.

An alternate approach to the frequentist methodology of using p-values is the Bayesian method, which has its own problems. First there are practical concerns such as choosing initial parameters that can affect your results despite sometimes being arbitrarily chosen, and also the high computational demand to calculate results (less of an issue in the 21st century, which is why the method is seeing a revival in the scientific community). Probably their main problem right now is that practitioners simply aren't familiar with how to employ Bayesian methods, so there's some cultural inertia preventing their immediate adoption.

while they are error prone a larger number of true results has been found than false results (according to recent research the ratio is 2:1 reproducible to non-reproducible)

It seems odd to talk about "results" as an average across all fields, rather than for a specific field. It's much more common for people to claim that psychology rather than physics has a reproducibility crisis, and thus I don't think it makes sense to talk about the combined reproducibility across both fields. What research are you referencing, and what fields did they look at? Given the differences across fields, if the average is 2:1 reproducible, I'd guess that some fields must be lower than 1:1.

You're right, it definitely depends on the field. The paper I am referencing looked at psychology, I believe. It is likely that a social science would have greater issues with reproducibility than a physical science.

Oh, it's definitely damning. The real joke in the XKCD comic is that, if we assume each panel is a different study, the only study that would be published in a journal is the one where p < 0.05.

Originally it was intended that peer review in published journals and study reproduction would verify findings. In a small community where all results are treated equally, this works fine. In a world without data systems to organize data and documents, this was really the only reasonable method, too.

However, we don't live in that world anymore. The community isn't small, and information science and data processing are much advanced. Unfortunately, since careers are built on novel research, reproduction is discouraged. Since studies where the null hypothesis is not rejected are typically not published at all, it can be difficult to even know what research has been done. There are also a large enough number of journals that researchers can venue shop to some extent, as well.

Many researchers are abandoning classic statistical models entirely in favor of Bayes factors [https://en.wikipedia.org/wiki/Bayes_factor]. Others are calling for publishing more studies where the null hypothesis is not rejected (some journals specializing in this like [http://www.jasnh.com/] have been started). Others are calling for all data to be made available for all studies to everyone (open science data movement). Others are trying to find ways to make reproduction of studies increasingly important.

It's really a very complicated problem.

As you point out, there is already a major issue when dealing with honest scientists who have to work in a publish or perish model where funding is based on getting results. But if we were to tweak the parameters so that there are at least some biased scientists and that the funding sources are biased for certain results (other than just any result where p < 0.05), and we take into account a subset of society looking for 'scientific support' of their personal convictions, the issue becomes much worse.

Look at how much damage was done by science misleading people about nutrition in regards to carbs and fats. How often, especially from the social sciences, does some scientific finding get reported by popular media as some major finding which should have drastic effects on social/legal policy, only for the root experiment to be a single study with a p < 0.05 where the authors caution against drawing any conclusions other than 'more research is needed'? Violence and media is a good example, and even more so when we consider the more prurient variants thereof.

I think this is the basis of why I am more willing to trust new research in physics more than in sociology.

Effect estimation, rather than relying on p-values, is one approach that provides far more context than just "Is or is not significant".

Also, train your scientists - especially those outside the physical sciences - that effects likely aren't fixed in a meaningful sense (i.e. the effect of smoking on lung cancer isn't a universal constant in the way the speed of light is), at which point multiple estimates of the same effect from different groups and populations have value.
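
As a rough illustration of reporting an effect estimate instead of a bare "significant / not significant", here is a sketch in R. The data are simulated and entirely hypothetical: two groups of 100 with a true difference of 0.3 standard deviations and an arbitrary seed.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical outcome in two groups; the true difference in means is 0.3
control <- rnorm(100, mean = 0,   sd = 1)
treated <- rnorm(100, mean = 0.3, sd = 1)

fit <- t.test(treated, control)

# Point estimate of the effect and its 95% confidence interval
estimate <- unname(fit$estimate[1] - fit$estimate[2])
ci <- fit$conf.int

cat(sprintf("difference = %.2f, 95%% CI [%.2f, %.2f], p = %.3f\n",
            estimate, ci[1], ci[2], fit$p.value))
```

The interval says how large or small the effect could plausibly be, which is also what makes estimates from different groups and populations comparable with each other.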

Off topic, but is there an XKCD comic about a time where there isn't an XKCD comic for a subject?

the ones about red spiders

This example implies only statistically significant results get published. But 'proving' a null may also have value depending on how non-trivial it is to the non-scientists in that society.

And if you extend the hypothetical such that everyone in that society always had true nulls, there wouldn't even be a need for science. We'd all be too used to never being wrong.

I don’t have to imagine that society since they already exist and are called social scientists.

More seriously you do make a good point which is all scientists lie on a spectrum from always generating true hypotheses, to always generating false hypotheses. Scientists in different fields tend to lie more to one or the other of the extremes. My experience is the observational sciences are more shifted to the always false end than the experimental sciences.

Did you observe social scientists being wrong or was it verified experimentally?

I do hope you are not being serious, but assuming you are not, then quite an amusing response.

It was in fact a joke, but with some truth. You're making serious claims about a vast body of literature and methodologies without having actually understood their entirety. This is exactly what you're criticizing social scientists for doing: drawing conclusions based on observations from systems no one has fully isolated for experimentation. If you think this is methodologically unsound, that's fine, but you shouldn't then do it yourself.

I was making the pointed armchair observation that all hypotheses being tested in the social sciences are false. Of course none of the down voters seemed to notice that my hypothesis is a social science hypothesis. Subtlety is lost on HN most of the time.

More seriously the social sciences do have a lot of problems, some driven by the methodologies used, some by ideology, and some by the inherent noisiness and unreliability of the data available. Not an easy area to do science in.

This is being down-voted for the shots fired, but the underlying point is almost certainly true to a degree. People aren't ideologically invested in (say) the weight of electrons in the same way that they are in IQ curves across demographic groups.

Ideology certainly plays an important role in generating false hypotheses, but all the observational sciences suffer from the problem that you can't run experiments to rigorously test the robustness of your hypothesis.

In the experimental sciences you can get far using the rule of thumb that if you need statistics you did the wrong experiment, while in the observational sciences the use of statistics is inherent.

It is not about true or false hypothesis generation. It is about likely and false hypotheses.

When we start to treat the hypotheses as "true" instead of "likely", we fall into a trap of not being able to reconsider the past evidence in the light of new evidence. We hold onto the "truths" of previous hypotheses instead of taking a fresh look.

An example of this is the current model used for astrophysics, where the basic "truth" that is the consensus of the majority working in the area is that "gravity" is the only significant force operating at distances above macroscopic. I use "gravity" because there is much debate in various areas as to what this force actually is.

There is evidence that our explanations are either incomplete or wrong. Yet this fundamental "truth" is unquestioned in the majority and where it is questioned, those questioners become personae non gratae.

It happens in the climate change debate. Here the "truth" is that the causes are anthropogenic. So if you question that "truth", you are persona non grata. Yet, the subject is so complex that we do not know to what extent, if much at all, human activity changes the climate over long periods of time. To question the "truth" of essential anthropogenic causes to climate change means that detailed investigations into the actual causes do not get undertaken if they do not support the "truth" hypothesis.

In real life, scientists are people with the same range of foibles and fallibilities as everyone else. Just because one is "smart" doesn't mean one is clear-headed and logical. Just because the "scientific consensus" is for one model or another doesn't make that "scientific consensus" any more true than an alternative model that explains what we see.

We need to stop getting uptight about our favourite models and those who dispute them. We need to be able to take fresh looks at the data and see if there are alternatives that may provide a better working model. We also need to get away from considering successful models as "truth" and more as the "current successful working" models.

Having worked quite closely with cosmologists I can tell you that you have the wrong impression. Cosmologists perform maximum likelihood parameter estimations of models. Often included in these models are parameters that control deviations from general relativity or parameters that completely switch from GR to another form of gravity. The fundamental truth that there is dark matter is the fundamental fact that GR + visible matter alone is a terrible fit, GR + visible matter + invisible matter is an amazing fit and all other models tried so far are also bad fits if multiple distinct experiments are compared. They continue to try to replace the invisible matter term with terms from first principles all the time. However, often someone comes along and fits a model to a single dataset and proclaims loudly that they have solved the dark matter or dark energy problem. However, there are many distinct datasets which also need to be modeled, and invariably when this is performed the model was seen to be a worse fit than GR + visible matter + invisible matter. I've been involved in various alternate model discussions with cosmologists and I wasn't even a cosmologist, so it is definitely not true that testing alternatives to gravity is the third rail.

The same seems to happen in the climate change debate: there is a huge range of experiments, where anthropogenic warming is the maximum likelihood model. Many people select a single experiment, find a model with a better fit and then loudly proclaim that anthropogenic warming is a conspiracy. However, their model is a terrible fit to the other experiments, which they did not perform due diligence in checking.

Scientists grow tired of playing politics. If you have an alternate model, it needs to fit a vast set of observations, not a cherry picked one. If you only test against one observation and make a press release about it, you will definitely not be seen as a serious scientist.

My apologies that it has taken some time to respond to your points. The problem I have is that cosmologists incorporate entities that have not been experimentally verified or are impossible (at least at this time) to be experimentally verified. Just because the models appear to work actually means nothing when you cannot get any experimental verification of all the elements on which a theory or model depends. Proxy evidence is used to enhance the belief in some specific entities, but proxies are only proxies and the use of such can be very misleading.

To say that "the fundamental truth that there is dark matter..." is problematic from the get go. No experiment has demonstrated that "dark matter" of any kind exists. You cannot say that there is a fundamental "truth" anywhere in science. We have observation, we develop hypothesis which should suggest experiments to test said hypothesis and with further evidence we develop theory. At no point is either hypothesis or theory "truth". Unless, of course, your intention is to make science into a religion.

When it boils down to it, science is a way of developing understanding of the physical world about us. It may lead to changes in one's philosophical or religious viewpoint, but it doesn't have to. It is not the be all and end all of anything. It is simply a means of hopefully increasing one's understanding. Sometimes it does and sometimes it doesn't. There are many examples of experiments and results that were considered so anathema to the consensus view that the scientists who did those experiments were made pariahs. This is very problematic as politics and religion become the driving forces that maintain the orthodox view.

There has been and is a significant push for science to be the authoritative voice as to what one should believe. However, science gives no guidance on any matters relating to human interaction or action. If anything, it is a cause of significant problems for human interaction and action.

I think you make a good point overall, and I think anthropogenic global warming should be open to questioning. It should be able to prevail on its merits in the face of competing theories.

However if you are speaking in terms of what "we know", I think you have to acknowledge that the scientific consensus is that AGW is real. That doesn't prove it is true -- nothing in our world outside of math is ever truly proven. But it puts the burden of proof on doubters to not only provide a different/better theory, but also to explain why everyone else is wrong.

If your position is that everyone else is wrong, but the "actual causes" are not known yet, then you just end up looking like someone who has their thumb on the scale, and is invested in a particular outcome.

> nothing in our world outside of math is ever truly proven

Just biting a bit: depends on the logic you use. The logic might not be true in our Universe, just in mathematicians' heads/idealistic Universe. And even on those idealistic Universes there is no real consensus if they/which are true, or just useful.

Imagine you add Maybe as third value between True and False. Later you might find one Maybe is not enough, you might need four different Maybes. Then suddenly it dawns on you that countable amount of Maybes is the minimum. Then you throw away such logic because it's practically useless, even if it models reality better with the side effect of breaking established math as well. Then you wonder why simple binary logic is quite good in describing many things in real Universe, but you have no means to prove any relation between this logic, math derived from it, and reality you live and observe.

None of the logics mentioned even touch on the idea of quantifiers, which is the way most math proofs are written nowadays. It is strictly more powerful than any multivalued logic.

Anthropogenic global warming has been extremely thoroughly questioned, both rationally and irrationally, and it still stands.

I agree. My point was only that such questioning is healthy and necessary. But even so, people should not misrepresent what the consensus is.

The problem, as I see it, is that the "consensus" view is taken to be true. As has been pointed out elsewhere, the "97% of scientists" who believe that climate change is anthropogenic comes from a study of papers. From what I understand, the 97% is 97% of the 1/3 of papers on climate change that made any reference to climate change being anthropogenic. The other 2/3's made no reference to climate change being anthropogenic or not.

It should be irrelevant what the consensus view may be. If an alternative model or theory is proposed, then the model or theory should stand on its merits not on whether or not it agrees with the consensus view.

My view is that science is about gaining some understanding of the universe about us. If a model or theory is useful in that understanding then good, it is useful. But if a theory or model develops big holes in it then mayhaps we should be looking for alternatives that have lesser holes.

Take the example of study of standard model of sub-atomic physics. Within it, there are some quite large holes that are papered over with theoretical mathematics. Yet, if one steps back and takes another look at what is being seen there are some interesting observations to be made that raise questions about the validity of the standard model.

You’re confusing 97% of scientists with 97% of papers - which isn’t a very scientific thing to do.

As for the Standard Model - scientists would dearly love to find observations that challenge it, but so far there’s been no consistent, high quality evidence of physics beyond it.

The question is 97% of what group of scientists? Secondly, where did the figure 97% come from in the first place?

The standard model requires a couple of base assumptions that are contradictory and problematic.

In the beginning, the consensus view was that human activity was not a significant factor in climate change. The consensus came about because of overwhelming evidence. In this matter, the causal relationship is the opposite of what you state, and you are making a false claim about how science works because you refuse to accept the evidence.

In the beginning the consensus view was that we were heading for an imminent "ice age" and then that view changes to "hockey stick global warming" and now to cover all bets, climate change.

I have questions that I have posed to climate scientists and if a reasonable answer comes back then anthropogenic climate change is on the cards. But in fifteen years, nary an answer to those questions have come back, so, any prognostications by climate scientists based on their models are, as far as I am concerned, worthless.

As far as the evidence is concerned, it may or may not support an anthropogenic causal regime. But, on the basis of that evidence, I lean towards a non-anthropogenic majority cause for climate change.

As far as how science works, climate scientists make many assumptions about their proxies that have not been verified as being conclusively accurate. There is sufficient evidence, if you actually look around for it, to say that the interpretation of the proxy evidence is either incomplete or wrong or meaningless.

One of the best articles covering these issues is Meehl[1][2]. You can find discussion in various places like Gelman[3] and Reinhart[4].

[1] Meehl, Paul E (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1), 195–244.

[2] http://meehl.umn.edu/files/144whysummariespdf

[3] http://andrewgelman.com/2015/03/23/paul-meehl-continues-boss...

[4] https://www.refsmmat.com/notebooks/meehl.html

'The fundamental problem is that p values don't mean what we "need" them to mean, that is p(null | significant effect).'

From Bayes' theorem, this more useful probability is given by p * x, where x = p(null) / p(significant effect). Maybe we could just lower the accepted threshold for statistical significance by several orders of magnitude so that, for statistically significant p, p * x is still small even for careful (i.e. big) estimates of x (e.g. maybe a Fermi approximation of the total number of experiments ever performed in the field in question). This doesn't necessarily imply impractically big sample sizes, although obviously this depends on the specifics (I believe the p value for a given value of the t-statistic decays exponentially with sample size).

I don't follow your argument. You've got two premises:

1) You are saying that people are committing the fallacy of transposing the conditional: p(H0|data) != p(data|H0):

- OK

2) You say to use Bayes theorem to get the value we want:

- OK, but actually a better formulation is

  p(H_0|data) = p(H_0)*p(data|H_0)/[p(H_0)*p(data|H_0) + p(H_1)*p(data|H_1) + ... + p(H_n)*p(data|H_n)]

You probably don't need to add up all the way to hypothesis n since the terms eventually become negligible and can be dropped from the denominator. The point is that you have to compare how likely the result would be under other hypotheses, not just H_0.

3) You propose lowering the threshold for "significance"

- How does this follow from the premises? Let's say you get a very low value for p(H_0)p(data|H_0); this can still be much higher than p(H_1)p(data|H_1), etc., so H_0 is still the best choice. I.e., you can get a low p-value given H_0, but if there is no better model out there you should still keep H_0.
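To make the point in 3) concrete, here is a minimal Python sketch of the formula from 2). The priors and likelihoods are made-up numbers chosen only for illustration, and `posterior` is a hypothetical helper name, not anything from the article.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a finite list of competing hypotheses.

    priors[i]      = p(H_i), assumed to sum to 1
    likelihoods[i] = p(data | H_i)
    Returns the list of p(H_i | data).
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # denominator: sum of p(H_i) * p(data | H_i)
    return [j / evidence for j in joint]


# Hypothetical numbers: the data are unlikely under H_0 (probability 0.01),
# but even less likely under the only alternative being considered.
post = posterior(priors=[0.5, 0.5], likelihoods=[0.01, 0.0001])
print(post)  # H_0 still receives most of the posterior mass
```

With these assumed inputs a small p(data|H_0) does not by itself make H_0 the worse choice; what matters is how it compares with the alternatives.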

I’m assuming the question we are trying to answer is not “which H_n is most probable”, but rather “how safe is it to conclude that H_0 is not true”. For example, say we are concerned with whether the difference between two groups is less than a certain amount or greater than a certain amount.

>“how safe is it to conclude that H_0 is not true”

I would just take "very safe" as a principle, there is even the truism "all models are wrong".

>"the difference between two groups is less than a certain amount or greater than a certain amount"

You are ignoring a lot of the model being tested here (eg, normality, independence of the measurements, etc) and only considering one parameter.

“how safe is it to conclude that H_0 is not true”

What I meant here by H_0 was the hypothesis that the difference between groups is less than some particular threshold. I think if you made the threshold large enough then it would not be safe to conclude that H_0 is not true.

"You are ignoring a lot of the model being tested here (eg, normality, independence of the measurements, etc) and only considering one parameter. "

I said enough so that if you were arguing in good faith you could fill in the gaps yourself.

I am arguing in good faith. There may be some cases in physics where there is some theoretical distribution derived related specifically to the problem, and they believe the model to be actually 100% true. Otherwise, the model should be assumed to only be an approximation at best.

Yes, the t-test does assume normality and you can never be sure of perfect normality if that's what you are getting at (although I believe that simulation tests of the robustness of the t-test against deviations from normality generally show that this isn't too much of a practical concern). I wasn't trying to address every potential weakness with the t-test (or p-values in general); I was addressing the one stated in the article.

I'm saying that unless you have bothered to derive a statistical model from your theory, and you believe that theory may actually be correct, then you know that you will reject your model if enough time/money is spent on testing it.

Ok, I will assume that by "model" here you mean a probability distribution on the parameters relevant to your experiment. In that case I agree with what you just said: knowing exactly the correct model is impossible in a similar way that knowing someone's height to an infinite degree of precision is impossible. But I never said anything to the contrary. The H_0 I gave corresponds to an infinite set of models and not a single one (note that I said the difference is less than a certain threshold, not that the difference is 0 (although it still would be an infinite set in that case, but the probability would be 0)).

And in anticipation of the rebuttal that the probability is still 0 because it's never exactly normal, what I really meant originally but didn't write out explicitly for brevity and because I assumed it would be implicit: when I talk about giving an upper bound on p(null | data) from p(data | null), what I really mean is giving an upper bound on p(null | data, normal) from p(data | null, normal) where normal is the assumption that the distribution of whatever parameter we are looking at is normally distributed and null is the event that the difference in means between the two groups we are looking at is less than some predetermined positive threshold. Or, for a 1-sample test, that the mean of a single group is within that threshold of some default value.

If you write out the actual calculation you will see normality (which was just one example of an assumption) is actually part of the null model being tested. It is not something different or outside of it.

That is just a trivial semantics issue, and yes I am familiar with the calculation.

Where do you think the flaw is specifically? Say we are doing a 1-sample test.

0. (setup) Suppose we have a real number mu and a positive epsilon. Define the interval I as [mu – epsilon, mu + epsilon]. For each “candidate mean” within this interval, we have a corresponding t-statistic. Let the statistic t_0 be the inf of all these t-statistics. Let T be the event that t_0 is at least as big as the observed value.

1. You can use Student’s t-distribution to compute an upper bound for the probability of T under the assumptions that the observations are iid normal and the mean lies in I. I will call this probability p(T | null, normal, iid), where “null” is the event that the mean exists and is in I. It makes no difference that it is more typical to lump these assumptions together as “null” because in math you can define things however you want as long as you are consistent.

2. We have that p(null | T, normal, iid) = p(T | null, normal, iid) * p(null | normal, iid) / p(T | normal, iid).

3. Therefore, if we have an upper bound for x = p(null | normal, iid) / p(T | normal, iid) then we can get an upper bound for p(null | T, normal, iid). That is my main claim.

Which of the above statements do you object to?

>"You can use Student’s t-distribution to compute an upper bound for the probability of T under the assumptions"

I'm not sure what you are arguing anymore. I am saying you will never test a parameter value in isolation, it is always part of a model with other assumptions. There is simply no such thing as testing a parameter value alone. To define a likelihood you need more than simply a parameter...

You seemed to be disagreeing with that, but are now acknowledging the presence of the other assumptions.

“I'm not sure what you are arguing anymore.”

It’s the claim I make in 3, and then the secondary claim that making our upper bound on p(null | T, normal, iid) small for significant p-values (i.e. p(T | null, normal, iid)) could be used as a criterion for whether our threshold for statistical significance is small enough.

“You seemed to be disagreeing with that”

I’m not sure what I said that gave that impression. I didn’t mention anything about the normal / iid assumptions initially not because I thought we weren’t making these assumptions but because I didn’t think these details were essential to my point.

"Let the statistic t_0 be the inf of all these t-statistics. Let T be the event that t_0 is at least as big as the observed value." Oops, I meant the inf of their absolute values, and T is the event that t_0 is at least as big in absolute value as the observed value.

Also, every probability mentioned should also include in the list of conditions that the number of samples observed matches our experiment.

>"The H_0 I gave corresponds to an infinite set of models and not a single one"

How do you calculate a p-value based on this infinite set of models? Normally it is done using just one.

Ok for simplicity let’s assume this is a 1-sample test. So there is a certain “default” mean, say mu, and we are concerned with whether the mean of some random variable on whatever population we are sampling from is within, say, epsilon of this default value. For every number in [mu – epsilon, mu + epsilon] we can get a p-value giving the probability that we would have observed the data if this was the true mean. In order to get a probability that we would have observed the data given that the mean was somewhere in this interval, we need some prior distribution on the means in this interval, which we don’t have. However, we can just take the sup of all the p-values for each mean in this range to get an upper bound. (I think this is similar to how confidence intervals work also but take that with a grain of salt)

Looking closer at this you are describing one model but testing different values of one of the model parameters.

“Looking closer at this you are describing one model but testing different values of one of the model parameters.”

I am inferring from context that this is probably supposed to be a criticism but it doesn’t make much sense to me. Of course we have to consider different values of the mean, the whole point is to get a p-value corresponding to a range of potential different means.

But anyways, I do think that an explanation based on the t_0 statistic I defined in my other post is better.

1. We can define a statistic t_0 that is the infimum of the absolute value of all t-statistics for every candidate mean in the interval.

2. Suppose the mean is some value mu’ in the interval. Whenever the t_0 statistic is at least as big as its observed value, the t-statistic corresponding to mu’ is also at least as big in absolute value as the observed value of t_0, by definition.

3. So we can give an upper bound for the probability of the former event by the probability of the latter event. But the probability of the latter event is the same no matter what mu’ is, and can be computed using Student’s T distribution.

4. Therefore, we have an upper bound for the probability of t_0 attaining a value at least as big as the observed value, assuming that the mean is somewhere in the specified interval (plus the other standard assumptions). This is the p-value.
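Steps 1 to 4 can be sketched in Python roughly as follows. This is only an illustration under the stated iid-normal assumption: the function names and the sample data are hypothetical, and the t-distribution tail is obtained by numerically integrating the density (Simpson's rule) so that nothing beyond the standard library is needed.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))


def two_sided_tail(t0, df, steps=20000):
    """P(|T| >= t0) for T ~ Student's t, via Simpson's rule on [0, t0]."""
    if t0 <= 0:
        return 1.0
    h = t0 / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t0, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    inside = 2 * total * h / 3  # P(-t0 < T < t0), using symmetry
    return max(0.0, 1.0 - inside)


def interval_null_p_value(data, mu, epsilon):
    """Bound on P(t_0 >= observed) when the true mean lies in
    [mu - epsilon, mu + epsilon] and the observations are iid normal."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    # inf over candidate means m in the interval of |mean - m| / (sd / sqrt(n))
    t0 = max(0.0, abs(mean - mu) - epsilon) / (sd / math.sqrt(n))
    return two_sided_tail(t0, n - 1)


# Hypothetical sample, default mean 5.0, tolerance 0.1
sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.7, 5.4, 5.5]
print(interval_null_p_value(sample, mu=5.0, epsilon=0.1))
```

Setting epsilon to 0 reduces this to the ordinary two-sided 1-sample t-test, and widening the interval can only increase the reported value.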

Do you agree with those assertions? If not, specifically where is the problem?

“Looking closer at this you are describing one model but testing different values of one of the model parameters.”

Furthermore, this is even done with the usual 1-sample t-test because the variance can be anything.

By the way, it's not normally done using one either. With a 1-sample t test for example, the null hypothesis is that the underlying distribution has a certain prescribed mean, but the variance can be anything.

Ok, you seem to accept that there is an assumption that the data is generated by a distribution with a mean, so start with that. This is not necessarily true: https://en.wikipedia.org/wiki/Cauchy_distribution

I use that only as an example. If you look closer you will find many other assumptions being made as well that are used to derive the actual calculation (for whatever statistical test you choose to look at).

I've already addressed that. See "in anticipation of the rebuttal" post.

Minor point, but you are missing p(H_0)*p(data|H_0) in the denominator in bullet point 2.

Thanks, fixed.

This was basically the suggestion here:


Also previous HN discussion:


Responding to this but getting rid of the intense nesting:

  “I'm not sure what you are arguing anymore.”

  It’s the claim I make in 3, and then the secondary claim that making our upper bound on p(null | T, normal, iid) small for significant p-values (i.e. p(T | null, normal, iid)) could be used as a criterion for whether our threshold for statistical significance is small enough.

  “You seemed to be disagreeing with that”

  I’m not sure what I said that gave that impression. I didn’t mention anything about the normal / iid assumptions initially not because I thought we weren’t making these assumptions but because I didn’t think these details were essential to my point.

Please give some example code or calculation steps for what you are talking about.

You mean for computing the p-value associated with a range of means rather than a single null mean value? That’s the only thing I can think of for which providing code or calculation steps would be applicable (well, there’s also getting a Fermi approximation for the total number of experiments ever performed in a discipline, which I gave as a preliminary suggestion for a cautious estimate of x, but I don’t have time to do that and the main difficulty of that isn’t the math anyways). Anyways I would be happy to provide that if you can confirm that this is what you meant and that you are asking out of genuine curiosity.

>"our upper bound on p(null | T, normal, iid) small for significant p-values (i.e. p(T | null, normal, id))"

This sounds like nonsense to me so I would like to see an example of what you mean.

Oops, I forgot to include something in the list of conditions. For every probability I mentioned in the post you just quoted and its grandparent post, we should add to the list of conditions that the number of samples used is the same as the number of samples used in our experiment.

Anyways, if it sounds like nonsense then I will try rephrasing. Right now the threshold for statistical significance is p = 0.05, sometimes lower depending on the field. Let’s say we want to lower this threshold, and we need some way of determining how low it should be. I am suggesting that this could be done by deciding on an acceptably safe threshold, say q, for p(null | T, num samples, normal, iid), and an acceptably safe upper bound, say x_0, for p(null | num samples, normal, iid) / p(T | num samples, normal, iid). Then we use q / x_0 as the threshold for statistical significance.
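As a rough Python sketch of that rule (the function names and the example values of q and x_0 are hypothetical; choosing a defensible x_0 is the hard part and is not addressed here):

```python
def significance_threshold(q, x0):
    """Threshold for p(T | null, ...) chosen so that p * x0 <= q.

    q  : acceptable upper bound on p(null | T, ...)
    x0 : assumed upper bound on p(null | ...) / p(T | ...)
    """
    if q <= 0 or x0 <= 0:
        raise ValueError("q and x0 must be positive")
    return q / x0


def posterior_bound(p_value, x0):
    """Upper bound on p(null | T, ...) from Bayes' theorem, capped at 1."""
    return min(1.0, p_value * x0)


# Hypothetical inputs: accept at most a 5% chance that the null is true,
# and assume the ratio x is no larger than 1000.
alpha = significance_threshold(q=0.05, x0=1000)
print(alpha, posterior_bound(alpha, 1000))
```

With the same assumed x_0, the conventional p = 0.05 gives a bound of 1, i.e. no information, which is the motivation for lowering the threshold.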

We would also be using p-values for a null hypothesis corresponding to a range of means rather than a single value (or a range of differences in means, or something else suitably adapted to the type of test we are doing), because with a single value, having an upper bound on the probability of the null hypothesis is vacuous, as you very eagerly point out.

You could argue that this is useless because what we really want to bound is p(null | num samples, T), not p(null | num samples, T, normal, iid). I would respond that, yes this would be the more desirable quantity to know, but we need to make some simplifying assumptions in order to make this problem tractable, and while the normal and iid assumptions probably aren’t true, there are at least situations where we can be confident that they approximate reality reasonably well.

Also, as an aside, I’m not sure what you mean specifically when you say “model”. The usual definition is a set of probability distributions on a common sample space. But when you say things like “all models are wrong”, I’m assuming you are just using “model” as a synonym for “distribution”, because this claim is vacuously false using the usual definition of model (just take the set of distributions to be every distribution on that sample space; this must contain the true distribution by definition). But then when you say “Looking closer at this you are describing one model but testing different values of one of the model parameters”, this seems to make more sense if you are using the usual definition of model. So I’m not sure what definition you are using.

Also, in case this clears up some confusion, I was using the nonstandard definition in some posts because I inferred that this was the definition you were using.

Here is an example of a model.

The other day I got a robocall from one of those spoofers that uses the same area code + three digits as the number being called. This call happened to be a number I did know. Did they just happen to hit upon this number, or is something more sinister going on (eg using a hacked address books)?

Say I've gotten nCalls calls like this so far, what's the probability at least one of them would be from a number I know?

The probability of any given number being used will be 1/9999; if I know 5 numbers with the same first digits as my own, the probability of a known number being used is 5/9999, etc. This leaves a probability of 1 - nKnown/9999 that an unknown number is used. The probability they keep using unknown numbers will then be, e.g., (9998/9999)^nCalls, and we take 1 minus this value to get the probability that at least one of the calls will be from a known number. Here is the model:

  model_1 = 1 - (1 - nKnown/9999)^nCalls

Let's say I've gotten about one call like this every day for the last two years. So nCalls = 730, and I know 3 numbers that share the same digits including my own (I am assuming they also spoof the number they are calling). Then the probability of getting at least one call from the known numbers would be 20%. I made a simulation in R and see the same results:

  # Returns percent of calls coming from a known number for nSim experiments
  sim <- function(nSim, nCalls, nNum, nKnown, replace = FALSE){
    # Each column of res is one experiment; TRUE marks a call from a known number
    res = replicate(nSim, sample(1:nNum, nCalls, replace = replace) %in% 1:nKnown)
    # Assumed return value: percent of known-number calls in each experiment
    100 * colMeans(res)
  }

  model_1 = mean(sim(1e4, 730, 9999, 3, TRUE) > 0)

I would guess that robocallers are pretty cheap so are probably going with the simplest approach possible. But perhaps they are slightly more advanced and they avoid reusing the same number (i.e., if it went unanswered and I never answer unknown numbers) and avoid using the callee's number. This can be done by changing a few arguments in the sim:

  model_2 = mean(sim(1e4, 730, 9998, 3, FALSE) > 0)

The two models predict nearly the same thing in the range of 730 calls, so p(data|model_1) = p(data|model_2) = 20%. I think the simpler model is still more probable though, so let's say p(model_1) = 0.75, p(model_2) = 0.24, and p(model_x) = .01. Model x is that something more shady, like using hacked contact info, is going on; this is pretty vague and can explain anything, so give p(data|model_x) = 1.

Then we use Bayes' rule:

  p(model_1|data) = .75*.2/(.75*.2 + .24*.2 + .01*1) = 72%
  p(model_2|data) = .24*.2/(.75*.2 + .24*.2 + .01*1) = 23%
  p(model_x|data) = .01*1 /(.75*.2 + .24*.2 + .01*1) = 5%


Actually, I made an error for model 2. Since I am not including my own number, nKnown should be 2 instead of 3.

  model_2 = mean(sim(1e4, 730, 9998, 2, FALSE) > 0)

This gives p(data|model_2) = .14, and:

  p(model_1|data) = .75*.2 /(.75*.2 + .24*.14 + .01*1) = 77%
  p(model_2|data) = .24*.14/(.75*.2 + .24*.14 + .01*1) = 17%
  p(model_x|data) = .01*1  /(.75*.2 + .24*.14 + .01*1) = 5%
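The two likelihoods can also be cross-checked without simulation. The following Python sketch computes them in closed form, assuming the same setup as the R code (uniform draws, with replacement for model 1 and without replacement for model 2); the function names are hypothetical.

```python
def p_at_least_one_known_with_reuse(n_calls, n_numbers, n_known):
    """Model 1: each call spoofs a number drawn uniformly, with replacement."""
    return 1 - (1 - n_known / n_numbers) ** n_calls


def p_at_least_one_known_no_reuse(n_calls, n_numbers, n_known):
    """Model 2: spoofed numbers are drawn uniformly without replacement."""
    p_none = 1.0
    for i in range(n_calls):
        # chance that the (i+1)-th distinct number is also an unknown one
        p_none *= (n_numbers - n_known - i) / (n_numbers - i)
    return 1 - p_none


model_1 = p_at_least_one_known_with_reuse(730, 9999, 3)
model_2 = p_at_least_one_known_no_reuse(730, 9998, 2)
print(round(model_1, 3), round(model_2, 3))  # roughly 0.20 and 0.14
```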

Here is a better way to think about this.

The proper role of data is to update our existing beliefs about the world. It is not to specify what our beliefs should be.

The question that we really want to answer is, "What is the probability that X is true?" What p-values do is replace that with the seemingly similar but very different, "What is the probability that I'd have the evidence I have against X by chance alone were X true?" Bayesian factors try to capture the idea of how much belief should shift.

The conclusion at the end is that replication is better than either approach. I agree. We know that there are a lot of ways to hack p-values. Bayesian factors haven't caught on because they don't match how people want to think. However if we keep consistent research standards, and replicate routinely, the replication rate gives us a sense of how much confidence we should have in a new result that we hear about.

(Spoiler. A lot less confidence than most breathless science reporting would have you believe.)

This is like functional programming, and people have a very hard time with it. Instead of passing around numbers ("95% true" or whatever), we're passing around a function: "It's 2x as likely as you thought it was, please insert your own prior and update". But even worse, it's "please apply this complicated curve function at whatever value you chose for your prior". It's just too hard for people to manage. Computers can do it (but it's hard for them too, very computationally intensive), and you have to really trust your computer program to be working properly (and you have to put your ego in the incinerator!) to hand over your decision-making to the computer.

I question whether computers can do it at all in useful practice.

Take a look at the results quoted in https://en.wikipedia.org/wiki/Bayesian_network#Inference_com... about how updating a Bayesian net is an NP-hard problem, and even an approximation algorithm that gets the probability right to within 0.5 more than 0.5 of the time is NP-hard.

> The proper role of data is to update our existing beliefs about the world. It is not to specify what our beliefs should be.

Create the schema beforehand, I get that. But feature extraction does work, providing models from data. Sometimes takes much time to analyze and understand the models.

If you have enough data and a strong enough signal, then all reasonable belief systems should converge on the same answer. Do not let that fact fool you into believing that raw data is the only thing necessary to make good decisions when faced with realistic situations.

My fear is that in 10 years time, we will have learned to hack Bayesian factors.

The way that you hack Bayesian factors is selectively including data. And the selection process can be as simple as publication bias causing some results to be published and others not.

This is the fundamental weakness of meta analysis.

My favorite probability theory problem is related to this article.

You have a test for a disease that is 99% accurate. This means that 99% of the time the test gives a correct result. You test positive for the disease and it is known that 1% of the population has the disease. What is the probability you have the disease?

The answer is not at all the one most people think at first when given this problem. This problem is why getting two tests is always a good thing to do when testing positive for a disease.

EDIT: I updated the statement of the problem to be one that can be answered!

>The answer is not at all the one most people think at first when given this problem.

The answer depends on the disease!

If it's the common cold, it's probably close to 99%. If it's Huntington's disease, the likelihood is much lower. (When asked, this question is normally posed as "you are given a test, which is 99% accurate, for some rare and deadly disease"; the "rare" part is important.)

Sorry, you are correct. I'll update the problem.

I'll follow up by mentioning that even as you have it worded now, it's not precise enough to answer. You need to explicitly describe the false positive and negative rates separately. As is, a test that is just "return false" (100% false-negatives, 0% false-positives) will be 99% accurate, but gives no information, whereas a test with 1% false positive rate can have a 1% false negative rate and still be correct 99% of the time, and will provide much more information.
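A small Python sketch of that point, assuming 1% prevalence as in the problem statement (the `accuracy` helper is hypothetical):

```python
def accuracy(prevalence, sensitivity, specificity):
    """Fraction of all test results that are correct."""
    return prevalence * sensitivity + (1 - prevalence) * specificity


prevalence = 0.01
# A "test" that always answers negative: misses every sick person,
# clears every healthy one.
always_negative = accuracy(prevalence, sensitivity=0.0, specificity=1.0)
# A test with a 1% false positive rate and a 1% false negative rate.
balanced = accuracy(prevalence, sensitivity=0.99, specificity=0.99)
print(always_negative, balanced)  # both about 0.99
```

Both come out "99% accurate", so the headline accuracy alone does not pin down what a positive result means.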

Statistics is weird and unintuitive.

I guess I should say that everyone takes the test. I'm not a statistician, I'm a mathematician. The way I understand the phrase, "the test is 99% accurate" is that this means: assuming everyone were to be tested then 99% of the time you get an accurate result. Thus .99(1%) = 0.99% of the people will correctly test positive and 0.01(99%) = 0.99% of the population will incorrectly test positive.

The problem still stands then.

I'm not sure you phrased the problem correctly. If we follow your explanation, then the probability of having the disease is indeed 99%.

If you want to show the implication of Bayes' Theorem then you need to be more precise: say you have 1% false positive and false negative rates (99% reliability) and 1% of the population is sick. If you test positive, then the probability of being sick is much less than 99%.

> If we follow your explanation, then the probability of having the disease is indeed 99%.

This is not correct; the probability of having the disease is unknown. He didn't say what he meant by the test being "99% accurate", but that doesn't mean you can just make your own assumption.

Note that in your more precisely specified scenario, when the test has 99% reliability, it is perfectly true that "99% of the time the test gives a correct result", which immediately disproves the claim that, if we follow that definition, the probability of having the disease given a positive test result is 99%.

The problem is that "99% of the time gives a correct result" is imprecise.

It can be understood as both:

- p(sick|positive) = 0.99

- p(positive|sick) = 0.99

We get totally different results: the first one is obvious (99% chance of being sick), and the second one needs Bayes' Theorem (and is the one we want to use).

I would only interpret "the test gives a correct result 99% of the time" to mean that out of every 100 test results, 99 are correct and one is wrong. Neither of your interpretations matches that. You need all kinds of additional information to say anything more specific. "99% of results are correct" can easily be true while p(sick | positive) and p(positive | sick) each vary anywhere between 0 and 1.

I updated the problem. Sorry for the mistake.

Let's see... Let's say you test 10,000 people, so about 100 actually have the disease. Since the test is only 99% accurate, only 99 of those will test positive. Of the remaining 9,900 actually negative people, 99 will test falsely positive. So if you test positive, you have a 50% chance of actually having the disease?
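The same count can be written as a short Python sketch, assuming "99% accurate" means both a 99% true positive rate and a 99% true negative rate (the function name is hypothetical):

```python
def p_sick_given_positive(prevalence, sensitivity, specificity):
    """Bayes' theorem for a single positive test result."""
    true_pos = prevalence * sensitivity               # sick and flagged
    false_pos = (1 - prevalence) * (1 - specificity)  # healthy but flagged
    return true_pos / (true_pos + false_pos)


# 1% of people are sick; the test is right 99% of the time in both groups.
print(p_sick_given_positive(0.01, 0.99, 0.99))  # about 0.5

# Same thing as head counts out of 10,000 people tested.
sick, healthy = 100, 9900
true_positives = sick * 0.99      # 99 people
false_positives = healthy * 0.01  # 99 people
print(true_positives / (true_positives + false_positives))  # about 0.5
```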

As I understand the phrase, "99% accurate", this is correct. However, I gather that to statisticians this phrase could mean other things. I think the source of the ambiguity comes from whether or not everyone gets tested or something along those lines.

To me the phrase should mean that when I test people who have the disease then 99% of the time I get a positive result and when I test people who don't have the disease then 99% of the time I get a negative result. That seems the most reasonable interpretation but I'm not a statistician.

Great point. This is the effect of a low base-rate.


Here's a paper on its impact on network intrusion detection.


> This problem is why getting two tests is always a good thing...

It's important to note that the test results should ideally be as uncorrelated as possible. At worst a test always gives the same result as its first outcome, so further testing would give zero information.

In practice this means that you probably want tests that are based on completely different mechanisms.
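Here is a rough Python sketch of the two extremes, assuming the 1% prevalence and 99%/99% test from the problem above. The `update` helper is hypothetical, and the second update is only valid if the two tests are independent given the patient's true status.

```python
def update(prior, sensitivity, specificity, positive=True):
    """One Bayesian update of p(sick) after a single test result."""
    if positive:
        like_sick, like_healthy = sensitivity, 1 - specificity
    else:
        like_sick, like_healthy = 1 - sensitivity, specificity
    numerator = prior * like_sick
    return numerator / (numerator + (1 - prior) * like_healthy)


after_first = update(0.01, 0.99, 0.99)

# Independent second test (e.g. a different mechanism): update again.
after_second_independent = update(after_first, 0.99, 0.99)

# Perfectly correlated second test: it just repeats the first result,
# so it carries no new information and the probability does not move.
after_second_correlated = after_first

print(after_first, after_second_independent, after_second_correlated)
# about 0.5, 0.99 and 0.5
```

Real tests fall somewhere between these extremes, so the gain from a second positive is usually smaller than the independent case suggests.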

Also that those tests are actually testing the same thing, which is tricky with your "two different mechanisms" requirement a lot of the time. My field is currently struggling to get a handle on what one should do in a particular circumstance when one test is negative and the other positive (for those interested, PCR + and toxin assay - for C. difficile).

The core issue is that p-values are cheaper to get than replicating the study, but replicating the study is the only reliable way to see if it's true or not. Sometimes the expensive/time-consuming way, is the only good way.

Replication by itself is not enough. You need pre-registration too. Otherwise you can p-hack the replications.

Not really disagreeing with you necessarily, but you can hack pre-registration too, in a way that becomes more tenable with smaller studies, which benefit less from pre-registration anyway.

The tables are misleading too: of course p(h0=T|p<.05) is going to be about 0.50 if your power is 0.50 and p(h1) = 0.10. That is maybe the bigger problem than p-values per se. Also, he fails to compute the other cell, which is p(h0=T|p>.05), which ends up being about 0.95. The problem in that table is that you have no power to detect anything.

In real life, you don't know p(h0)--that's the point of doing the study. So Bayesianism doesn't really help you.

The real problem is lack of power.
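The two cells quoted above can be checked with a few lines of Python, assuming alpha = 0.05, power = 0.50 and p(h1) = 0.10 as stated; the function name is hypothetical.

```python
def p_null_given_result(p_h1, power, alpha, significant=True):
    """p(h0 is true | test outcome) from the prior, power and alpha."""
    p_h0 = 1 - p_h1
    if significant:
        null_part = p_h0 * alpha        # false positives
        alt_part = p_h1 * power         # true positives
    else:
        null_part = p_h0 * (1 - alpha)  # true negatives
        alt_part = p_h1 * (1 - power)   # false negatives
    return null_part / (null_part + alt_part)


print(p_null_given_result(0.10, 0.50, 0.05, significant=True))   # about 0.47
print(p_null_given_result(0.10, 0.50, 0.05, significant=False))  # about 0.94
```

Raising `power` while keeping the other inputs fixed lowers the first number, which is one way to read the "lack of power" point.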

True, but the article was making a point about "even if we set aside p-hacking for a moment", so I was going with that. But I agree, pre-registration is important as well.


I'm not trying to be facetious, but isn't this something you learn in junior-level stats? I had this drilled in in both undergrad math courses and grad machine learning courses; I'm confused to see it warrant an article.

It's well known what p-values show. But they are, in practice, used as a gatekeeping mechanism in academic journals in many fields (including mine). Worse, getting p<0.05 is informally seen as a measure of practical significance, rather than simply as one statistical test amongst many passed.

So yes, it is something you learn in introductory quantitative methods classes. But I don't think most researchers understand just how much it matters.

Also, a key R package for producing regression tables of coefficients for journal articles is called 'stargazer'. Given the unwarranted focus of many readers on those indicia of 'significant' results, I think it's well named.

I currently have the opposite problem. Given that I work with very large online datasets (N=1M or so) everything, including the random noise, is statistically significant to p<0.05. It really is effect sizes or bust at that point.
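To make the large-N point concrete, here is a rough sketch using a two-sample z-test with unit variance. The standardized difference of 0.005 and the group sizes are assumptions picked for illustration, not numbers from any real dataset:

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect_size_d, n_per_group):
    """Two-sided p-value of a two-sample z-test when the observed
    standardized mean difference is exactly effect_size_d (unit variance)."""
    z = effect_size_d / sqrt(2.0 / n_per_group)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Same negligible effect, increasingly large samples.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p(0.005, n))
```

The effect never changes, only the sample size does, yet the p-value drops below 0.05 once n reaches a million per group.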

Real measures of practical significance are OR (odds ratio), effect size and dose response curve. Response histogram for statistical effects. (Or the 2D component analysis island histogram.)

The harsh reality is that most scientists are not sieving through every statistics book they can get their hands on in order to find out all the reasons they might be wrong. The "individual motivation" to become statistics experts is only present in a few fields, and in the others it is ousted into applied courses taught by other departments.

Statistics is directly necessary in ML, so it's a "profit center" and emphasized. In many sciences it's treated like a cost center (something that you need, like IT, but that lies outside of your central expertise.)

TIL thanks for the explanation, I guess I never thought about the fact that other STEM fields would not emphasize its meaning.

To misquote Upton Sinclair you can’t get a scientist to understand statistics when their job depends on misunderstanding statistics.

The basic problem is under the current funding environment it is far better to pump out a dozen wrong papers than one carefully researched paper.

The literal answer to your question is, "no, generally not". That a greater emphasis on statistics should be included in science is certainly the case, but then there is a school of thought that knowing a little bit of frequentist statistics is better than knowing none at all. But regardless, I am fairly confident that most scientists (or engineers) do not actually learn this as juniors (or seniors) (or Ph.D's).

You can end up with a Ph.D. in some fields being exposed to almost no statistics, or only statistics which work in very confined settings (certain experimental sciences where "Just do an ANOVA..." is genuinely the answer to almost every question).

That often works...right up until the moment when a scientist has to step outside that context.

This often cuts both ways though. I have seen beautiful math and statistics around problems that don't make any sense if you've taken more than one semester of microbiology.

Andrew Gelman's blog provides regular insightful commentary on this issue, I highly recommend it:


The post that turned me on to all of this is at:


The article says:

> Note: this has nothing to do with p-hacking (which is a huge but separate issue).

I disagree. p-hacking is when one experimenter checks many statistical tests to find one that is significant. The effect the author is discussing is that many experimenters do many experiments and the significant ones get published. One is more unethical (or maybe just incompetent) than the other, but they’re essentially the same phenomenon.

They are the same in that they both create a situation in which the p-value cannot be trusted. However, in one case this is deliberate. In the other it's a problem with the whole enterprise.

Also, running multiple tests without correcting for multiple testing (usually by reducing the threshold for significance) is just one form of p-hacking. The more insidious version is when one runs the test after every few participants until random chance makes it "slip over the edge of significance". In that case there might not even be enough variables for multiple testing to have occurred, and it becomes very difficult to detect.
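The "test after every few participants" version is easy to simulate. A minimal sketch, assuming a z-test with known unit variance, a true null, and a look after every 10 participants up to 200 (all of these settings are illustrative assumptions):

```python
import random
from math import sqrt
from statistics import NormalDist


def peeking_false_positive_rate(n_sims=2000, max_n=200, look_every=10,
                                alpha=0.05, seed=0):
    """Fraction of null experiments declared significant at ANY interim look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)  # the null is true: mean 0, sd 1
            # z-test with known sd = 1: z = mean * sqrt(n) = total / sqrt(n)
            if n % look_every == 0 and abs(total / sqrt(n)) > z_crit:
                hits += 1
                break  # stop collecting as soon as it "works"
    return hits / n_sims


rate = peeking_false_positive_rate()
print(f"false positive rate with peeking: {rate:.3f}")
```

Setting look_every equal to max_n gives the single-look baseline, which should sit near the nominal 5%; with repeated looks the rate comes out well above that.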

The difference between something that requires a bad actor and one which is an outcome of a system working as intended is pretty huge.

I'm honestly more tired of essays about p-values than p-values.

It's true that, like all metrics, if it becomes a target then it may be abused (Goodhart's Law).

However if you abolished p-values people would start hacking or misunderstanding priors or confidence limits or OR.

It's an easy dumb stat that most anyone can do in excel and most everyone recognises. The emphasis should be that it remains a quick shorthand for casual use but that more complex studies have more sophisticated models and probabilistic reasoning.

But the emphasis on the p-values is bizarre. As best illustrated by JT Leek the pipeline of data research has multiple points of failure that may lead to false findings or irreproducible research. But we talk very little about them whilst essays about p-values come out every week...


This was a really interesting article. I've worked with researchers who try to defend a small but statistically significant finding that just doesn't seem likely to be real, and this provides a statistical explanation for my skepticism. The p-value mentality is deeply entrained in a lot of researchers, though.

The challenge for journal editors seems very real. There's another group that deals with this challenge of interpreting the validity of significant findings for a living, though: biotech VCs. A lot of times trying to reproduce the work is their best way of addressing this, and often the first work done by startups is to try to replicate the academic work. For some other heuristics VCs use to assess "reproducibility risk", see here;


2 solutions: a) stop doing experiments that just look for correlation without any attempt to get at mechanism. Of course sometimes you can't avoid this, and then b) use lower p values. Don't waste thousands or millions (more) dollars following up 5% results.

When I was first taught statistics, I was told that the researcher had to justify a plausible hypothesis first - and then do a hypothesis test/p-value to prove their theory.

If this combination of the scientist's intuitive understanding and the p-value test result align, then this is a credible result.

On the other hand, the trend now is to conduct every possible test whether or not there is any justification for doing so (corrected for multiple testing, no p-hacking, yes, sure)

For example, in tech, we might test every shade of blue. Some of those blues are gonna come up as p-value hits - but since we had no good reason to do this test, this was probably just random noise.

Similarly, in genetics, we're gonna test every single gene against everything - just to see what happens (yes, yes, do a Bonferroni correction on each set of tests). Hmm, recent results in genetics don't seem to be very robust or repeatable, for some reason.
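The arithmetic behind "some of those are gonna come up as hits" is short. A sketch assuming independent tests where every null is true; the test counts are arbitrary:

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) across independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


for m in (1, 20, 1000):
    print(m, round(familywise_error_rate(m), 3), bonferroni_threshold(m))
```

Note that Bonferroni only controls the chance of any false positive; it says nothing about how plausible each hypothesis was before testing, which is the concern raised here.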

The likelihood of a truthful link in these tests is incredibly low. When we have no particular reason to believe there is a truthful link, and are just blind testing, the false positive rate is very high (as described in the article), and probably even higher than the article speculates with - almost all hits are gonna be false positives.

Maybe p-values just don't work well with modern day data. Or, maybe, Big Data just doesn't contain information about mysterious, unexplored, and innovative correlations that we hope it does.

“On the other hand, the trend now is to conduct every possible test whether or not there is any justification for doing so (corrected for multiple testing, no p-hacking, yes, sure)”

You are literally describing p-hacking.

He's describing a multiple comparisons problem, not p-hacking, enabled by, essentially, ease of statistical computation. An honest researcher can trigger his problem without ever p-hacking.

See: Genome Wide Association Studies.

It isn't so much that there is no "truthful link", it is that everything is linked to everything else to some degree and the mathematical models they use for the null hypothesis are just "defaults". These assumptions are almost always violated. The statistical tests detect that, and are providing "true positives".

Yes, this is kind of exactly the attitude change I am talking about.

The “true positives” found in your model can be expected to have no predictive value, so what is the point of identifying these?

If you run an experiment twice and the same shade of blue wins both times, that should persuade you that the winning shade is better. And if you keep replicating the experiment and it keeps winning, that should increase confidence further.
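As a rough sketch of how repeated wins add up, assume each replication is independent, has the same power, and is judged at alpha = 0.05. The prior of 0.10 and power of 0.50 are placeholder assumptions borrowed from the article's table discussion, not measurements:

```python
def prob_null_after_replications(k, prior_h1, power, alpha=0.05):
    """P(h0 | k independent experiments all came out significant)."""
    null_path = (1.0 - prior_h1) * alpha ** k
    real_path = prior_h1 * power ** k
    return null_path / (null_path + real_path)


for k in range(1, 4):
    p = prob_null_after_replications(k, prior_h1=0.10, power=0.50)
    print(k, round(p, 3))
```

Under these assumptions the probability that the null is true falls from about 0.47 after one significant result to under 0.01 after three. Correlated or p-hacked replications would not shrink it nearly as fast.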

Yes, modern-day data is not obtained from controlled clinical trials.

Here's a follow-up to the original blog post: https://lucklab.ucdavis.edu/blog/2018/4/28/why-ive-lost-fait...

> instead of asking whether an effect is null or not, we should ask how big the effect is likely to be given the data. However, at the end of the day, editors need to make an all-or-none decision about whether to publish a paper

Yet another way in which the traditional publishing structure actively harms science.

Do you have an alternative way of publishing in mind?

My criticism is not necessarily constructive. ;) But it's not too hard to imagine something along the lines of arXiv combined with a rating/commenting system not unlike HN itself, combined with a Facebook-ish algorithm for surfacing articles relevant to each reader. The devil is in the details, of course, but it shouldn't be hard to do better than the expensive, artificially constrained, and arbitrary system we're saddled with now. The real trick will be convincing entrenched academics to switch -- I'm still not convinced this is actually possible.

I remember reading http://andrewgelman.com/2016/11/13/more-on-my-paper-with-joh...

with its graph "This is what power = 0.06 looks like". So I got the point that you have to have sufficient statistical power. A useful rule of thumb is that you need a power of at least 0.8. You need to have some idea how big the effect is likely to be. Perhaps from previous exploratory research, from claims of other researchers, from reasoning "well, if this is happening the way we think it is, there has to be an effect greater than x waiting to be discovered". Then you work out how big a sample size you need to use. Then you roll up your sleeves and get down to work.
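The "work out how big a sample size you need" step can be sketched with the usual normal approximation for a two-sided comparison of two group means. The effect sizes below are assumed standardized differences chosen for illustration, and an exact t-test calculation gives slightly larger numbers:

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size_d, power=0.8, alpha=0.05):
    """Approximate per-group n for a two-sided, two-sample comparison of means
    (normal approximation; effect_size_d is the standardized difference)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1.0 - alpha / 2.0)
    z_power = z(power)
    return ceil(2.0 * ((z_alpha + z_power) / effect_size_d) ** 2)


for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))
```

Because the required n scales with 1/d^2, halving the plausible effect size roughly quadruples the sample you need.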

But the reason for using p values rather than Bayesian inference is that it gets you out of the tricky problem of coming up with a prior. You only need to think about the null hypothesis and ask yourself whether the probability of the data, given the null hypothesis, is less than 0.05.

So there is a bit of contradiction. p values don't really work unless you ensure that you have sufficient power. To do that you need a plausible effect size, to feed into your power calculation. And that is implicitly a rough approximate prior, 50:50 either null or that effect. You could just do a Bayesian update, stating how much you shifted from 50:50.

Basically, if you don't already know enough to have an arguable prior to get a Bayesian approach started, you don't know enough to do a power calculation, so you shouldn't be using p-values either.

I went looking on andrewgelman.com for a reference for wanting power = 0.8 and found a more recent post


Oh shit! The situation is much worse than I realised :-(

> But the reason for using p values rather than Bayesian inference is that it gets you out of the tricky problem of coming up with a prior.

It technically doesn't even do this. Using a frequentist approach is equivalent to a Bayesian approach with an uninformative prior, which is itself an assumption being baked into the analysis, only one that is almost unquestionably incorrect. It's essentially saying you have literally no idea how the data is being generated, which is certainly not true.

James Abdey wrote his Ph.D. on this subject several years ago and proposed an alternative method for making decisions based on statistical evidence: http://etheses.lse.ac.uk/31/

> Many researchers are now arguing that we should, more generally, move away from using statistics to make all-or-none decisions and instead use them for "estimation". In other words, instead of asking whether an effect is null or not, we should ask how big the effect is likely to be given the data.

I couldn’t agree more with this statement, and even moreso in a business setting than in research. It’s just so easy to get caught up in statistical significance and lose perspective on practical significance. I’ve found confidence intervals most informative and easy to understand.

This is an old thread already and I don't know if I'm getting my voice heard. But at any rate: hypothesis testing (slightly different philosophically from p-values, but anyway) is bogus because conjectures-and-refutations falsificationism is bogus. That's not how good science has ever happened, only how bogus research programs have dressed themselves in science.

The core of science is "the unity of science". Signal-to-noise measurements tell you very little outside a general coherentist/holistic verificationist framework.

This is especially troubling when combined with confirmation bias. The whole point of data is that it anchors us to reality. Data should be the check that prevents us from believing something simply because we want it to be true. But if we only test theories we already suspect are true, we are already biasing the kinds of false positives we will get.

pvals are a lot like the weather - everyone complains about it, nobody does anything about it. Specifically, what tends to be missing from these conversations is a good alternative - the author seems to be asking for false discovery rates/q values. Or maybe effect sizes? The reality is one size doesn't fit all, and the most useful statistic depends on the context. Oh, and the target: good luck submitting your work to a biological journal without pvals. I'm sure the editor will briefly marvel at your courage in taking a stand as she rejects without review.

While we're on the subject: there's a tendency to appeal to larger sample sizes, as the author also mentions. Worth remembering that for some of us data isn't a thing you download from the interweb, it's something you generate - and it costs money and time to do so. (And for human subjects research, the stakes are even higher...)

I don't think there's any cause to abandon p-values and NHST if you're running experiments with high power and intelligent, deliberate priors.

With power = 0.8 and p(h1) = 0.6, p(h0 | p < 0.05) = 0.04. Even if power = 0.8 and p(h1) = 0.2 then p(h0 | p < 0.05) = 0.2.

Does anyone have that article about how a pro/anti parapsychology both designed a study, analyzed the data, and got conflicting results? (There was a joke about how it was the only paper published that explained a discrepancy by saying the other side cheated)

Is there a book (written in plain language) that goes into the history of academic journals and details the current state of the "replication crisis", "data dredging", etc?

Despite using statistics daily, I still feel utterly uncomfortable about its philosophical grounding. Are there any resources HN can suggest to soothe the heart of a sceptic?

Skeptic * quite different from sceptic :)

I think it's just an alternate spelling that is preferred in some locales: https://en.wiktionary.org/wiki/sceptic

I never understood why people took p-values seriously. They never seemed to mean anything of use.

Whenever I brought it up around other academics, no one seemed to want to comment on it. Maybe they were afraid to admit they didn't understand a topic that's apparently important to publishing? Anyone can follow the formula to make a p-value, but there's no requirement to understand its meaning.

I'd love to find their use, but I still haven't found it.

Before computers could do the complicated math of proper Bayesian updating, it was impossible to get the answer to the question people were asking ("probability of fact given evidence?"), and the science community decided that they'd rather get good answers to bad questions ("probability of evidence given fact?") than bad answers to good questions.

Well, consider the reverse. A 0.05 p-value might mean anything, but what about a 0.5 p-value?

No simple test is going to tell you a hypothesis is definitely correct, but if you can weed out the probably-garbage hypotheses, you can focus on evaluating the remainder more critically.

> there's no requirement to understand its meaning.

This is a feature, not a bug. The fact that the p-value has nothing to do with whether or not the hypothesis is correct is intended to force people to think for themselves rather than using math as a substitute for logic.

If a feature is working in reverse it's probably a bug though- it might be intended to force independent thinking, but in reality all it does is serve as a gameable goalpost, no thought required.

p-values that are not in the physics ranges are ridiculous.

It's a shame everyone started copying physics but decided for higher acceptance/rejection values.

I was a little bit disappointed when I realized that a bunch of valid modern science is just proper experiment design and number crunching. If it's not physics, there's no models of why things work, there's just a p-value on the correlation or some other comparison function.

Medicine has turned into a field where you can't know a thing.


I love reading reports like the above:

> There is currently no evidence that supports or refutes that these interventions(chiropractic intervention) provide a clinically meaningful difference for pain or disability in people with [lower back pain] when compared to other interventions.

p-values really do not help that much.

> It's a shame everyone started copying physics but decided for higher acceptance/rejection values.

P-values were developed outside of physics, it's not like people took them from physics relaxing the significance thresholds.

"If your experiment needs statistics, you ought to have done a better experiment." Lord Ernest Rutherford (maybe)

Sometimes dumbing down a concept can totally screw up a person's learning curve. During early days a lot of Java tutorials (in Indian engineering books) mentioned that the reason Java has Interfaces is because otherwise it is not possible to inherit from multiple classes. While it is true that you can implement multiple interfaces, the whole point of interface is to define "interface" without forcing an implementation. It has nothing to do with the "limitation" of single inheritance.

Coming back to p values. A simple google search will find you many articles that say

> A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis.

The whole idea of p-values is to warn a scientist that they should look for statistical significance. The behaviour of a hypothesis over infinite trials is what matters, and hence more data => better reliance. But "more", "better" etc. are subjective ideas, and in many cases, when everything else is considered normal, <0.05 might be good, but not always. There are far too many factors, such as wrong sampling method, things you cannot measure vs things you can measure etc., that truly affect this number.

I think the author nails it when he writes "Replication is the best statistic."

Always think of these tests from a biological evolution perspective. Do you think this hypothesis would survive the test of time, where it has to frequently face the real world?

It seems to me that a number of things are not clear at all.

1) p-values is a metric on a whole library of procedures, not a single check (although one of them, the Normal distribution is the one usually meant). There's 5 basic ones everyone should know, and you could fill a decent bookshelf with details on all of them.

This means a true p-value should be accompanied by

a) what the source data was

b) how it's distributed, and why (better yet, proof). This should include assurances that there are no attempts to game the measurement (or otherwise any change in the source data directly related to this measurement), as that of course invalidates it.

c) a sanity check (like a normality test, or redoing the procedure on generated data and verifying the expected result)

d) what the exact claim is (e.g. this is normally distributed with a mean > X)

e) what the procedure was to verify this claim (e.g. normality test + mean > X ... you need to COMBINE your p values, because if the data isn't normal your P test is invalid and of course your hypothesis should be rejected even if the numbers in the t-test say it shouldn't)

f) a HUGE disclaimer that this measurement and its actual numerical result only apply to past values, and if it's used to change something it is no longer valid, even if its numerical value changes as a result, and you can still calculate it.

2) it's not a given that it's possible at all. The truth is that there are many things that don't follow the central limit theorem and therefore cannot correctly be used for statistical approximation.

Essentially you need to be utterly convinced that every measurement is the result of somehow combining a large number of effects that keep reappearing.

For instance, planetary orbits do NOT satisfy this criterion. Sure they're the result of large numbers of influences, but almost none of those influences ever repeat (e.g. comets passing in close orbits is a big modifier of orbits, and in 4-5 billion years that'll happen once with any given asteroid). Almost every perturbation is unique and doesn't repeat, so it can't be predicted and won't follow statistical laws. That is the sort of effect that no form of statistics will ever find, and that invalidates your results. Because these are rare it works in the short term, but it doesn't work in the long term. E.g. if you repeat the experiment to determine the speed of light with Jupiter's moons, that won't work with the original data because their orbits have shifted too much.

You are far more articulate than I am.

Replication is extremely important but not the solution to the problems Luck is discussing: How do you decide whether some experiment is worth a replication attempt? How do you decide whether a replication was successful? The standard (but not the only) answer is: through p-values. This shows that stats and replication are largely orthogonal issues. So what /is/ the solution to the problems with p-values? Bayesian statistics, which gives you the probability of an hypothesis given the data and some clearly defined set of assumptions, exactly what you typically want.

tldr: The author got a PhD in 1993[1] and is just now figuring out that p-values are not false positive rates


I was lucky and figured it out before getting a degree. It's got to be hard for people in this position to look back on their previous work where the most fundamental aspect of interpreting the results was incorrect.

He gets it right that statistics are good for estimation, but there is a part two. You need to come up with a theory that makes a prediction to compare to these estimates, and then test that. Ie, your prediction about the distribution of the results is the "null hypothesis". I think p-values are probably ok for that.

I've met Steve Luck multiple times over the last decade. He is a very rigorous and insightful scientist, with expertise in a wide range of methods (especially eeg) and psychological phenomena. His wife Prof. Lisa Oaks may be even more impressive in these regards, but that is an aside. The point I would like to make though is that there are lots of aspects to research and stats is just one part to master of many. It should not be up to a neuroscientist to make posts like this. The stats community should be actively pushing the scientific community to alternatives that match interpretative intuition with reality of the statistical metric.

I went through the same thing. I was taught the same half-assed BS statistics with all the wrong interpretations and was surrounded by people just following along with that.

Still, it was a point of duty/honor for me to figure out what these statistical results meant. How is interpreting results not a key part of a scientist's job? I guess if you don't want to bother learning to interpret results then you can do science by either being a lab tech or theorist.

Sounds like someone who never understood statistics, still doesn't, and doesn't want to.

A particularly glaring issue is this offhand comment:

> this is a statement about what happens when the null hypothesis is actually true. In real research, we don't know whether the null hypothesis is actually true. If we knew that, we wouldn't need any statistics! In real research, we have a p value, and we want to know whether we should accept or reject the null hypothesis.

That isn't a question that any statistical approach will help you with. There's a reason we talk in terms of "rejecting" or "failing to reject" a hypothesis. We don't do statistical tests to accept hypotheses, only to reject them.

The concept of accepting one hypothesis based on a comparison between it and one other hypothesis is ludicrous on its face, suffering exactly the problems associated with Pascal's wager.

He clearly meant "fail to reject" where he said "accept", you're quibbling about semantics. The whole post is about how p-values (and indeed, known statistical techniques) don't actually help you decide whether or not you should reject the null.

The post is about a confusion between the question that p-values address ("how likely is this data to have come from the null hypothesis?") and another very different question ("given several findings that were unlikely to have come from various different hypotheses, how many are likely to be spurious?").

>"That isn't a question that any statistical approach will help you with. There's a reason we talk in terms of "rejecting" or "failing to reject" a hypothesis. We don't do statistical tests to accept hypotheses, only to reject them."

In (Neyman-Pearson's) hypothesis testing you definitely can accept an hypothesis. This is not true for (Fisher's) significance testing though.

This paper is probably fine to explain it: https://onlinelibrary.wiley.com/doi/pdf/10.1002/0470013192.b...

The only comment to the point so far has gotten downvoted. How telling... Bashing p-values without really understanding what they are about is quite trending these days.

I've never understood this semantic point of statistics.

When used in English, the word "accept" literally means "fail to reject", and has no more implication than that. Do people making this point use a different definition of "accept"?
