
P values are not as reliable as many scientists assume (2014) - e0m
http://www.nature.com/news/scientific-method-statistical-errors-1.14700
======
kazinator
Previous post with discussion, 563 days ago:
[https://news.ycombinator.com/item?id=7225739](https://news.ycombinator.com/item?id=7225739)

PDF via same nature.com:
[https://news.ycombinator.com/item?id=8404620](https://news.ycombinator.com/item?id=8404620)

Related, dupes of each other:

[https://news.ycombinator.com/item?id=9463806](https://news.ycombinator.com/item?id=9463806)

[https://news.ycombinator.com/item?id=9486059](https://news.ycombinator.com/item?id=9486059)

Related:

[https://news.ycombinator.com/item?id=9119228](https://news.ycombinator.com/item?id=9119228)

------
BorisVSchmid
In my experience, scientists (at least in biology, where, as in
sociology, you may have a lot of noise to deal with) have an internal
intuition that a single paper with a significant result does not mean
we have found the truth. The recent study that reported a
reproducibility rate of about 36% in psychology strikes me as pretty
accurate.

I think the scientific system can work with that. It means that if you
build follow-up experiments on a single paper, there is a good chance
the experiment fails. In some ways the publishing system is self-
correcting in this regard: a failed follow-up lets you cast doubt on
the previous paper, which is easier to publish than a standalone
negative result (p-value > threshold).

~~~
jsprogrammer
If the p-values were accurate and averaged around 0.05, ~95% of results should
be reproducible.

That only 36% were points to deep, fundamental errors.

~~~
gwern
No. P-values don't work that way and don't mean what you think they
mean. Read the OP or, heck, any of the classics like "Why Most
Published Research Findings Are False":
[http://dx.plos.org/10.1371/journal.pmed.0020124](http://dx.plos.org/10.1371/journal.pmed.0020124)

(36% may or may not be bad, but you can't know without additional stuff like
power or prior probability of hypotheses being true; p-values have no
intuitive meaning and aren't an answer to any question that people are asking,
which is a major reason why Bayesian approaches can be useful. And from a
Bayesian perspective, I find 36% totally unsurprising - if anything,
substantially better than I had expected given the gross underpowering of most
psych studies, the statistical-significance publication filter, and the
dubiousness of most hypotheses.)
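
A sketch of that arithmetic in R, with illustrative power/prior
numbers that are assumptions, not data: the share of significant
findings that are true depends on power and the prior, not just on
alpha.

    # Positive predictive value of a "p < alpha" finding, given the
    # power of the studies and the prior odds of a true hypothesis:
    ppv <- function(power, prior, alpha = 0.05) {
      power * prior / (power * prior + alpha * (1 - prior))
    }
    ppv(power = 0.35, prior = 0.5)  # 0.875: 1 in 8 findings false even at 50/50
    ppv(power = 0.35, prior = 0.1)  # 0.4375: long-shot hypotheses, most findings false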

~~~
jsprogrammer
A proper rebuttal would show what a p-value actually is and how it differs
from what I claimed. Now, since a p-value is exactly what I previously
claimed, you obviously can't do that. I'm not even sure what you are arguing
against me here.

~~~
Dylan16807
The p-value is the chance of a false positive. But you don't know what the
rate of true positives is, or the rate of false negatives.

In a world where there are only false positives and true negatives,
and people publish all positive and negative results, reproduction of
a paper should be 95%.

But the reproduction rate when there actually is an effect is not 95%.
Depending on sample size, I might get a true positive 20% of the time and a
false negative 80% of the time, or I might get a true positive 99.8% of the
time and a false negative .2% of the time.

So the average reproduction rate, where an effect actually exists, can
be almost any number between 5% and 100%. There is no reason to assume
it will be 95%.

So the average reproduction rate, where some effects are real and some are
imaginary, will almost certainly not be exactly 95%, and that is not a problem
in and of itself.

(And when you talk about an average p-value of .05, that sounds like only
publishing positive results, which is blatantly going to fail reproduction.
100 false hypotheses -> 5 publications, all false positives -> 5% reproduction
rate)
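
A quick way to see this is simulation. A sketch in R (the effect size
and sample sizes are made-up numbers): the same true effect
"replicates" at very different rates depending only on power.

    # Replication rate tracks power, not 1 - p: one true effect,
    # two sample sizes, very different rates of p < 0.05.
    set.seed(1)
    replication_rate <- function(n, effect, trials = 5000) {
      mean(replicate(trials, t.test(rnorm(n, effect), rnorm(n))$p.value < 0.05))
    }
    replication_rate(n = 10, effect = 0.5)   # low power: roughly 0.2
    replication_rate(n = 100, effect = 0.5)  # high power: roughly 0.94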

~~~
jsprogrammer
>In a world where there are only false positives and true negatives, and
people publish all positive and negative results, then reproduction of a paper
should be 95%.

This is the world p-value assumes and is therefore the only one worth
considering in relation to my comment.

If an experiment is not well-formed then of course you won't see reproduction
at the expected rate. This is what I'm referring to when I say that the low
reproduction rate points to deep, fundamental flaws in the experiments.

I agree that the reproduction rate will never be exactly 95% (or 1 - p) due to
the discrete nature of experimentation [that's why I used a ~ in front :)],
but the reproduction rate of a well-formed experiment should very closely
track 1 - p.

~~~
Dylan16807
>This is the world p-value assumes and is therefore the only one worth
considering in relation to my comment.

I'm not sure if that was clear enough. In that world, no one has ever had a
hypothesis that was correct. The whole field is useless, measuring things that
are wrong and getting the occasional false positive.

You can talk about that world if you want, but it has no connection to
reality. It's not p-values that assume that world, it's your misunderstanding
of p-values.

>If an experiment is not well-formed then of course you won't see reproduction
at the expected rate. This is what I'm referring to when I say that the low
reproduction rate points to deep, fundamental flaws in the experiments.

Experiments don't have to have enormous sample sizes to be well-formed. That's
the whole point of having a cutoff value.

It's not like an experiment that reproduces 80% of the time disproves
the result the rest of the time; it just doesn't quite reach .05 on
those trials.

>the reproduction rate of a well-formed experiment should very closely track 1
- p

I'm suspicious of this. I don't have time to do the math right now, but an
experiment that averages .01 might clear a .05 hurdle far more than 99% of the
time, and would definitely be well-formed. And if you set a hurdle at .01 it
would only clear it half the time, but it would still be well-formed.
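
(Under one concrete reading, a one-sided z-test whose median p-value
is .01, the numbers come out like this:)

    # Effect size for which the median p-value is 0.01, and how often
    # it clears each hurdle (one-sided z-test, sigma = 1):
    mu <- qnorm(1 - 0.01)    # 2.33
    pnorm(mu - qnorm(0.95))  # P(p < .05) ~ 0.75: clears it most of the time
    pnorm(mu - qnorm(0.99))  # P(p < .01) = 0.5: the "half the time" case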

~~~
jsprogrammer
Hypotheses can never be proven to be correct. I don't want to be in any world
where it is believed that a hypothesis is or could be correct.

This is a fundamental tenet of science. All that can be done is to
reject hypotheses.

You (along with Gwern) have now claimed that I don't understand p-values, but
you present no alternative understanding. The reason, of course, is that when
you look at the mathematics behind p-value, it is obvious that it is exactly
as I claim.

Edit to address your edit:

>I'm suspicious of this. I don't have time to do the math right now, but an
experiment that averages .01 might clear a .05 hurdle far more than 99% of the
time, and would definitely be well-formed. And if you set a hurdle at .01 it
would only clear it half the time, but it would still be well-formed.

You are right that you need to be careful here about what you are comparing
across instances. There will be variability since you are only sampling a
distribution (most likely at a very low rate) and not observing the entire
distribution (which, for continuous distributions, is impossible).

~~~
Dylan16807
On a certain philosophical level you can never be absolutely sure of anything,
and p-values are meaningless.

On a practical level, p-values are the chance that a correlation is reported
where 'reality' does not have a correlation. This is not the same number as
the chance that the result agrees with 'reality'.

You can reject the concept of objectivity, but you cannot reject that logic.
So I have explained the alternative understanding fine, just go back and
replace 'true' and 'false' and 'correct' with a philosophically-hedged
version.

~~~
jsprogrammer
On a practical level, people may not be able to execute a well-formed
experiment. I completely agree with that.

However, that doesn't change the meaning of the mathematics, only that your
reality has diverged from what you originally intended/believed.

What is the meaning of the number that people call 'p-value' when it is not
calculated on a well-formed experiment? I'm not sure if there is a general
formula, but you may be able to find some meaning in a particular instance.

~~~
Dylan16807
You're either defining "well-formed" as there being no such thing as a true
hypothesis, or you have completely lost me. Either way I don't think there's
anything more I can say.

p does not tell you how likely a result is to be true.

~~~
jsprogrammer
A well formed experiment tests only a null hypothesis.

p-value is exactly the probability that you observed X given that the
previously stated null hypothesis was true at the time of observation. The
value (1 - p-value) is exactly the probability that you will make an
observation consistent with your hypothesis (ie. expected replication rate).

Wikipedia has a decent treatment that might help:
[https://en.wikipedia.org/wiki/P-value#Definition_and_interpretation](https://en.wikipedia.org/wiki/P-value#Definition_and_interpretation)

~~~
Dylan16807
But the importance of a p-value is showing when it's _not_ the null
hypothesis.

The only time you get 95% reproduction is a result that says the null
hypothesis is true.

You're entirely right about that specific case.

But this only happens when nothing correlates. (And almost no science has been
done, because most things in fact don't correlate.)

A result that disagrees with the null hypothesis at .05 does not imply
any particular chance of another result that also disagrees with the
null hypothesis at .05.

If there is no correlation, then replication will happen 5% of the time. If
there is correlation, it will be somewhere over 5%, but no particular value.

When people talk about reproduction, they talk about that chance. It will only
be 95% by coincidence.

~~~
jsprogrammer
>You're entirely right about that specific case.

In fact, this is the only case that matters. All other (valid) cases can be
reduced to a single, null hypothesis design.

p-value is undefined for hypotheses that are not a null hypothesis. It is also
undefined for hypotheses which do not hold.

Sure, you can go through the motions, put some numbers together, and
eventually produce a number between 0 and 1. However, that does not
mean you have computed a p-value. If you are testing a non-null
hypothesis, you have not computed a p-value. If you are testing a null
hypothesis that doesn't hold, you have not computed a p-value.

~~~
Dylan16807
The null hypothesis is where nothing happens. You're supposed to be showing
evidence against it. If you redefine things so your "null hypothesis" is where
something happens, and you're showing evidence for it, you have done something
very very wrong, and you should not be using a .05 threshold either.

------
haddr
It's not that p-values are bad by definition; it's that they are often
wrongly interpreted. Putting too much confidence in p-values alone can
lead to wrong conclusions, and this is what some meta-analyses
discover. Many scientists try hard just to reach the "golden" <0.05 in
order to claim a discovery and publish it. This is why so many papers
mysteriously cluster around 0.05...
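
One mechanism behind that clustering is optional stopping. A sketch in
R (the peeking schedule and sample sizes are assumptions, for
illustration only):

    # "Trying hard to reach 0.05": peek after every 10 samples of pure
    # noise and stop as soon as the p-value dips below the threshold.
    set.seed(2)
    peek_until_significant <- function(max_n = 200) {
      x <- numeric(0)
      for (i in seq(10, max_n, by = 10)) {
        x <- c(x, rnorm(10))     # the null is true: there is no effect
        p <- t.test(x)$p.value
        if (p < 0.05) return(p)
      }
      p
    }
    ps <- replicate(2000, peek_until_significant())
    mean(ps < 0.05)  # several times the nominal 0.05 false positive rate
    # ...and the resulting "discoveries" pile up just under 0.05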

~~~
tel
There's also the systemic effect of prioritizing particular p-values:
negative results are omitted, leading to replication bias across the
community.

------
danharaj
Scientists have to do their work in a system that incentivizes bad science.
How many people actually get to do their work in an environment that isn't
hostile to them?

~~~
themodelplumber
a) Are you serious about that second question and b) if so, can we discount
thermodynamics in our answer? Otherwise it's kind of boring.

~~~
danharaj
We can restrict ourselves to social factors. Nature isn't hostile; it
doesn't have human intentionality like that. It seems to me we make
work unpleasant for everyone in the misguided belief that people work
harder for it.

------
rndn
Isn't a main problem with p-values that you don't know whether
significance (a low p-value) is the result of a big effect and a small
sample, or a big sample and a small effect? This is why you also need
a measure of the effect itself, for example the distance between the
two measurements in terms of standard deviations.
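
A small R illustration of that ambiguity (the means and sample sizes
are made up): two experiments can land at similar p-values from very
different effect sizes.

    # Similar p-values from a ~30x difference in effect size:
    set.seed(3)
    big_effect  <- t.test(rnorm(20, mean = 1), rnorm(20))           # n = 20/group
    tiny_effect <- t.test(rnorm(20000, mean = 0.03), rnorm(20000))  # n = 20,000/group
    big_effect$p.value   # typically well below 0.05
    tiny_effect$p.value  # often below 0.05 too, despite the tiny effect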

~~~
jjoonathan
I agree with TFA that p-hacking is a bigger problem.

Low p-value <=> null hypothesis is unlikely.

Choose a shitty null hypothesis ("aliens did it!", "everything is Gaussian",
etc) and you trivially get low p. Peer review checks this to some extent (you
won't get away with "aliens did it") but there's a large gray area of null
hypotheses shitty enough to give low p but not shitty enough to be rejected by
peer review. Choosing the hypothesis after-the-fact is the most common
strategy because it's undetectable except by repeating the experiment, which
is hard.
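
The after-the-fact strategy is easy to simulate (testing 20 candidate
hypotheses is an assumed number, for illustration):

    # Choose the hypothesis after seeing the data: run 20 independent
    # tests on pure noise and keep the smallest p-value.
    set.seed(4)
    best_p <- replicate(1000, min(replicate(20, t.test(rnorm(30), rnorm(30))$p.value)))
    mean(best_p < 0.05)  # ~1 - 0.95^20 = 0.64: "significance" from noise alone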

------
marvy
I'm probably commenting too late to get my question answered, but here goes:
the article has a pretty picture where they show how likely your p-values will
mislead you depending on how likely the null hypothesis is. For instance, they
say if you think that the null hypothesis has a 50% probability of being
right, and you get p=5%, then there's still a 29% chance the null hypothesis
is true. But according to my calculations, the right number should be 1/21 =
4.8%. What am I missing here? Or are they wrong? My calculations are below:

Curious George has 200 fascinating phenomena he wishes to investigate. In
reality, 100 of those are real, and the other hundred are mere coincidences.
The experiments for the 100 real phenomena all show that "yes, this is for
real". (I'm assuming no false negatives.) Most of the 100 experiments that
test bogus phenomena show that "this is bogus", but 5 of them achieve a
significance of p=5%, as expected. George then runs off to tell the Man in the
Yellow Hat about his 105 amazing discoveries. If Yellow Hat Man knows that
half of the phenomena that capture George's attention are bogus, he knows that
5/105 = 1/21 = 4.8% of George's discoveries are likely bogus, even though he
doesn't know which ones.
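
(The same arithmetic in R:)

    # 100 real phenomena, all detected (no false negatives assumed);
    # 100 bogus ones yielding 5 false positives at p = 5%.
    real_hits  <- 100
    false_hits <- 100 * 0.05
    false_hits / (real_hits + false_hits)  # 5/105 = 0.0476, i.e. 4.8%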

~~~
kgwgk
Assume you're sampling from a normal distribution with known standard
deviation sigma (1 for simplicity) and unknown mean mu. To test
whether the mean is larger than mu=0 (the null hypothesis), you can
check whether the observed value is larger than 1.64 sigma (for the
95% confidence test). So if your observation is larger than 1.64, you
reject the null hypothesis.

Your calculation would be correct only if the assumption of "no false
negatives" is approximately valid. This is the case when the true
value is large in terms of sigma (say mu=6). Then for the 100 cases
with mu=0 you'll reject the null 5 times on average, and for each of
the 100 cases with mu=6 you will reject the null (unless you're
unlucky: there will be a false negative around once in 150,000
trials).

But you're conditioning on p<0.05, not on p~0.05. It's easy to see that it's
much easier to get p=0.05 if mu=0 (this is a 1.64 sigma event) than if mu=6
(it's a 4.36 sigma event). If mu=0, you will get on average 1 (out of 100)
observation with 0.04<p<0.05 (i.e. 1.64<x<1.75). The probability of obtaining
an observation in that range when mu=6 is very small (0.0004 out of 100).
Almost 100% of the "discoveries" with p~0.05 will be false (when mu=6 you will
get p-values around 1e-9).

When the true value of mu gets closer to 0, you cannot ignore the
false negatives. For example, if mu=0.1 the rejection rate will be
quite similar to the mu=0 case (the probability of getting
0.04<p<0.05 is 1.2% and 1%, respectively) and almost 50% of the
"discoveries" with p~0.05 will be false.
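
(Those numbers can be checked directly in R for this one-sided
z-test:)

    # Probability of landing at 0.04 < p < 0.05 (i.e. 1.64 < x < 1.75)
    # for each true mean:
    lo <- qnorm(0.95); hi <- qnorm(0.96)
    pnorm(hi) - pnorm(lo)              # mu = 0:   0.010  (1 in 100)
    pnorm(hi - 6) - pnorm(lo - 6)      # mu = 6:   ~4e-6  (0.0004 in 100)
    pnorm(hi - 0.1) - pnorm(lo - 0.1)  # mu = 0.1: ~0.012 (1.2 in 100)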

Somewhere between the two extreme cases, there is a lower bound for this
"false discovery rate".

See
[http://faculty.washington.edu/jonno/SISG-2011/lectures/sellkeetal01.pdf](http://faculty.washington.edu/jonno/SISG-2011/lectures/sellkeetal01.pdf)
and in particular figure 2.

~~~
marvy
I think I'm still missing something here. In particular, I'm still not
getting 29% as a lower bound; I'm getting around 20%. If we compare
mu=0 with mu=1.64, the probability density at x=1.64 is roughly 0.1
and 0.4 respectively, so the lower bound should be .1/(.1+.4) = 1/5.
No? Unless they were assuming something other than "two normal
distributions with the same variance"?

~~~
kgwgk
You're absolutely right! My example was mainly for illustration; I
wasn't sure it would give exactly the same lower bound (but I was
indeed surprised that it's below the 29% in those papers, which I had
thought a "hard" bound).

It seems the bound that you are calculating (that I have reproduced, R code
below) was already published more than 50 years ago for this specific case of
a normal distribution. See slide 10 in
[http://www.biostat.uzh.ch/teaching/master/previous/seminarbayes/LeonhardHeld_slides.pdf](http://www.biostat.uzh.ch/teaching/master/previous/seminarbayes/LeonhardHeld_slides.pdf)

I have not really read the Sellke et al. paper in its entirety, but it
seems the "calibration" they propose is more general; it makes some
assumptions about the distribution of the p-value and is therefore
approximate.

    
    
      p=0.05                                   # observed p-value
      null=0.5                                 # prior probability of the null (used in the title)
      c0=qnorm(1-p)                            # one-sided cutoff: ~1.64 sigma
      x=seq(0,5,0.01)                          # candidate alternative means mu1
      y=100*dnorm(c0)/(dnorm(c0)+dnorm(x-c0))  # % null given an observation at the cutoff
      calib=100/(1-1/(exp(1)*p*log(p)))        # Sellke et al. calibration: ~28.9
      actual=min(y)                            # minimum over mu1: ~20.5
      plot(x,y,type="l",ylim=c(0,100),ylab="%null",xlab="mu1",bty="l",xaxs="i",yaxs="i")
      title(paste(null*100,"% null  p = ",p,sep=""))
      legend("topleft",c(paste("Sellke, Bayarri, Berger (2001) =",format(calib,digits=3)),
                       paste("Edwards, Lindman, Savage (1963) =",format(actual,digits=3))),
           lwd=2,col=c("red","blue"),bty="n")
      abline(h=calib,col="red")
      abline(h=actual,col="blue")
      grid()

~~~
marvy
I don't know R, but I found a few sites that happily run R code for me. I find
the shape of that curve somehow pretty. 29% clearly can't be a hard bound,
since we can get 4.8% by assuming no false negatives. I just wish I understood
whether there is anything particularly natural about the number 29, or did
they make their distributional assumptions for the same reasons you did:
"mainly for illustration". If so, then the Nature article was terribly
misleading by presenting that number as some kind of "speed of light"-type
limit, because that makes p-values look worse than they really are. It seems
that p-values are bad enough without making up more bad stuff about them! :)

Anyway, thanks for all your help. Your ability to dig up references (and pump
out R code) at a moment's notice makes me think you are someone who knows
quite a bit of statistics. I'll happily look at anything else you care to
point me to.

~~~
kgwgk
Assuming no false negatives is just not an option :-) I think that can only
happen if the situation is such that there can be no false positives either
(i.e. the p-value when the null hypothesis is not true is always zero). EDIT:
What I wrote is true only if the distributions under the null and the
alternative are completely disjoint. You can actually have very low false
negative rates if the distributions are not symmetric, and if you allow
different distributions you can do even better: imagine the null hypothesis is
x~Normal(0,1) and the alternative is x=C0=1.64 (exactly the cutoff value for
0.05 significance). If we get exactly p=0.05 then the probability of the null
being true is 0%. I mean, we get x in [C0-epsilon, C0+epsilon] with
probability 1 under the alternative, but with probability -> 0 under
the null as epsilon -> 0. Of course, this alternative is very
unlikely, and mixing continuous and discrete distributions is always
tricky. This is why it makes sense to average over prior distributions
of the alternative.

As you can see in slide 14, there are multiple calibrations proposed
under different assumptions. I agree it is misleading to give one as
the "real" error rate, but it's interesting that all of them give
rates well above the nominal alpha. EDIT: note as well that this is
for the case where the null hypothesis is true in 50% of the cases
(nowhere in the calculation of p-values do we consider how often the
null hypothesis is true, but obviously if it's always true, 100% of
the significant results will be false positives, and if it's never
true, 0% of the significant results will be false positives).

In slide 11 there are other calculations for the normal case, this
time for a two-sided test. But instead of looking for the mu1 giving
the lowest bound, they calculate the aggregate error rate under some
assumptions about the distribution of mu1. For example, if I
understand the results correctly: assuming mu1 is normally distributed
around mu0=0, if you get a p-value of 0.05 (in the two-sided test;
some modifications to our calculation are required), you should expect
the null hypothesis to be true at least 32.1% of the time. (If the
distribution of mu1 is very concentrated around 0, the 50% rejection
rate on the left side of the chart dominates; if the standard
deviation is very high, the region of almost 100% rejection rate far
from mu0 on the right of the chart dominates; for some intermediate
standard deviation one will hopefully get the 32.1% lower bound.)

Unfortunately, I think the assumption behind the nice result
1/(1-1/(e p log(p))) is that p-values follow a beta distribution when
the null hypothesis is not true, and I don't think there is a clear
interpretation of that.

~~~
marvy
Ok, so the pretty result is mostly arbitrary. Fair enough. Re: false
negatives... you seem to be living in a world of bell curves, or at least a
mostly continuous world. I can easily make (very contrived) experiments where
false negatives just don't happen. For instance: I have two coins. One is a
perfectly fair coin. The other is a two-headed coin. You see me flip one of
them. The null hypothesis is that I flipped the fair coin. A false negative
means deciding the coin is fair but it's really not. This will never happen,
because you will only decide that if it lands tails, and then it must be fair.
(If I only flip it once, the false positive rate is something like 1/3, not
0.) But this is probably much too contrived for your taste, and maybe even for
mine. But it's almost 5am, and I must go to sleep now, or else it will get
bright soon, and I never will. I now appreciate the value of the 20min
procrastination setting.

~~~
kgwgk
I agree on your point, if we are sufficiently creative we can get many extreme
results. For example, I made an addition to the first paragraph of my previous
comment, that you might have missed, giving an example where the probability
of the null hypothesis being true when p=0.05 is zero (or arbitrarily small,
if we replace the discrete probability lump under the alternative hypothesis
by a continuous distribution which is concentrated enough). I also added a
comment on the second paragraph, by the way.

One minor comment on your example. If H0:fair coin and H1:two-headed and the
statistic is the number of heads h, I cannot reject (at the 0.05 level) the
null when n (the number of flips) is small even if I'm only getting heads. For
one flip, p[h=1|H0]=0.5. For two flips, p[h=2|H0]=0.25. For n>=5 you
will of course reject the null hypothesis in every case where H1 is
true (and in ~5%
of the cases where H0 is true). There will be no false negatives. But I guess
you have noticed that this doesn't help with the false discovery rate in this
example: when H1 is true the p-value will be very small (1/2^n) so if the
observed p-value is ~0.05 (or any other value larger than 1/2^n) then it's for
sure a false positive (because there will be at least one occurrence of
tails).
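
(The coin numbers, for concreteness:)

    # p[h = n | H0] for n all-heads flips: the null first becomes
    # rejectable at the 0.05 level at n = 5.
    n <- 1:6
    round(0.5^n, 4)  # 0.5000 0.2500 0.1250 0.0625 0.0312 0.0156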

Ok, enough time wasted on this subject :-)

~~~
marvy
> so if the observed p-value is ~0.05 (or any other value larger than 1/2^n)
> then it's for sure a false positive (because there will be at least one
> occurrence of tails).

Good point! I didn't think of that.

> Ok, enough time wasted on this subject :-)

Even better point!

------
RA_Fisher
Great article. I'm not sure that replication itself will solve the
problem, since the Type 1 error rate requires asymptotics. We'd have
to run many replications and then show convergence. That'll be broadly
cost-prohibitive for all but the most important conclusions. Lower
thresholds probably won't do it either. Right now, the only solutions
I see are:

a) Bayesian methods

b) Fisher's single-hypothesis method

c) Tukey's Exploratory Data Analysis method

d) All of the above

~~~
sgerrish
I don't see why (e) teaching scientists to be statistically literate
so they don't abuse or misunderstand these tests, and/or (f) focusing
on reproducible results and shaming researchers with sloppy
methodology, wouldn't work. The hypothesis test has known limitations,
but it's not clear that we should blame null hypothesis tests for
people misusing them, when researchers untrained in stats are just as
likely to misuse any method you give them.

------
eruditely
Relevant, from Deborah Mayo.

[http://errorstatistics.com/2015/03/16/stephen-senn-the-pathetic-p-value-guest-post/](http://errorstatistics.com/2015/03/16/stephen-senn-the-pathetic-p-value-guest-post/)

------
themodelplumber
"Essentially, all models are wrong, but some are useful." \--George E.P. Box

~~~
danparsonson
The p-value test isn't a model; it's a measure of the significance of
an effect in data against random noise.

~~~
tel
Which arises from a model (!) of random noise and of your effect.

~~~
danparsonson
I see - my mistake. That's a very broad definition of 'model' though,
isn't it? Including 'random numbers'? You might as well say everything
is a model, in which case the original quote says nothing :-)

~~~
tel
It's perhaps a bit like "everything is a model" in the sense that all of these
tests, even the model-free ones, arise from a coherent choice of assumptions
and, if you for a moment take the Bayesian perspective very seriously, prior
distributions over conditionals. The original quote should be taken to mean
that any particular choice of assumptions is limiting, but making interesting
choices can drive interesting questions which are thought provoking and
meaningful even if they are wrong.

