
Why I've lost faith in p values - anacleto
https://lucklab.ucdavis.edu/blog/2018/4/19/why-i-lost-faith-in-p-values
======
davidxc
Here's a simpler thought experiment that gets across why
p(null | significant effect) /= p(significant effect | null), and why p-values
are flawed in the way the post describes.

Imagine a society where scientists are really, really bad at hypothesis
generation. In fact, they're so bad that they only test null hypotheses that
are true. So in this hypothetical society, the null hypothesis in every
scientific experiment ever done is true. But using a significance threshold of
0.05, we'll still reject the null in 5% of experiments, and those experiments
will then end up being published in the scientific literature. So this
society's scientific literature contains only false results - literally
every published scientific result is false.

Of course, in real life, we hope that our scientists have better intuition for
what is in fact true - that is, we hope that the "prior" probability in Bayes'
theorem, p(null), is not 1.
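
A quick Monte Carlo sketch of that society (the two-group t-test, the group
size of 30, and the 10,000-experiment count are all arbitrary choices for
illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_experiments = 10_000   # the society's entire research output
    n_per_group = 30
    alpha = 0.05

    published = 0
    for _ in range(n_experiments):
        # Every null hypothesis is true: both groups come from the same distribution.
        a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
        b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
        if stats.ttest_ind(a, b).pvalue < alpha:
            published += 1   # "significant", so it gets published

    print(f"published {published} of {n_experiments} experiments "
          f"({published / n_experiments:.1%}); every one is a false positive")

Roughly 5% of the experiments clear the threshold, and by construction every
one of them is wrong.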

~~~
taneq
> But statistically using a p value of 0.05, we'll still reject the null in 5%
> of experiments. And those experiments will then end up being published in
> scientific literature. But then this society's scientific literature now
> only contains false results - literally all published scientific results are
> false.

The problem with this picture is that it's showing publication as the end of
the scientific story, and the acceptance of the finding as fact.

Publication should be the _start_ of the story of a scientific finding. Then
additional published experiments replicating the initial publication should
comprise the next several chapters. A result shouldn't be accepted as anything
other than partial evidence until it has been replicated multiple times by
multiple different (and often competing) groups.

We need to start assigning WAY more importance, and way more credit, to
replication. Instead of "publish or perish" we need "(publish | reproduce |
disprove) or perish".

Edit: Maybe journals could issue "credits" for publishing replications of
existing experiments, and require a researcher to "spend" a certain number of
credits to publish an original paper?

~~~
thisiszilff
That's a good idea: encourage researchers to focus on a mix of replication and
new research. When writing grants, a part of that grant might be towards
replicating interesting/unexpected results and the rest for new research.
Moreover, given that the experiment has already been designed, replication
could demand much less effort from a PI and give their students some
deliberate practice in experiment administration and publication. On the
other hand, scholarly publishing might have to change to allow summary
reporting of replication results, to stave off a lot of repetition.

~~~
Fomite
My field has less of a "You publish first or you're not interesting" culture
than many others, and part of that comes from recognizing that estimating an
effect in a different population, with different underlying variables, is
itself an interesting result in its own right.

Tim Lash, the editor of Epidemiology, has some particularly cogent thoughts
about replication, including some criticisms of what is rapidly becoming a
"one size fits all" approach.

------
wrp
One of the best articles covering these issues is Meehl [1][2]. You can find
discussion in various places like Gelman [3] and Reinhart [4].

[1] Meehl, Paul E (1990). Why summaries of research on psychological theories
are often uninterpretable. Psychological Reports, 66(1), 195–244.

[2]
[http://meehl.umn.edu/files/144whysummariespdf](http://meehl.umn.edu/files/144whysummariespdf)

[3] [http://andrewgelman.com/2015/03/23/paul-meehl-continues-boss/](http://andrewgelman.com/2015/03/23/paul-meehl-continues-boss/)

[4]
[https://www.refsmmat.com/notebooks/meehl.html](https://www.refsmmat.com/notebooks/meehl.html)

------
tprice7
'The fundamental problem is that p values don't mean what we "need" them to
mean, that is p(null | significant effect).'

From Bayes' theorem, this more useful probability is given by p * x, where x =
p(null) / p(significant effect). Maybe we could just lower the accepted
threshold for statistical significance by several orders of magnitude so that,
for statistically significant p, p * x is still small even for conservative
(i.e. large) estimates of x (e.g. a Fermi estimate based on the total number
of experiments ever performed in the field in question). This doesn't
necessarily imply impractically big sample sizes, although obviously that
depends on the specifics (I believe the p value for a given value of the
t-statistic decays exponentially with sample size).
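
A back-of-the-envelope version of that proposal (the numbers below are
invented just to show the arithmetic): pick the threshold so that the bound
p * x stays small even for a generous estimate of x.

    # Hypothetical numbers, purely for illustration.
    x_estimate = 1e4          # Fermi-style guess at x = p(null) / p(significant effect)
    target_posterior = 0.01   # how much residual belief in the null we will tolerate

    required_threshold = target_posterior / x_estimate
    print(f"require p < {required_threshold:.0e} for significance")   # p < 1e-06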

~~~
nonbel
I don't follow your argument. You've got two premises and a proposal:

1) You are saying that people are committing the fallacy of transposing the
conditional: p(H0|data) != p(data|H0):

- OK

2) You say to use Bayes' theorem to get the value we want:

- OK, but actually a better formulation is

    p(H_0|data) = p(H_0)*p(data|H_0) / [p(H_0)*p(data|H_0) + p(H_1)*p(data|H_1) + ... + p(H_n)*p(data|H_n)]

You probably don't need to add up all the way to hypothesis _n_ since the
terms eventually become negligible and can be dropped from the denominator.
The point is that you have to compare how likely the result would be under
other hypotheses, not just H_0 (a toy numerical version of this is sketched at
the end of this comment).

3) You propose lowering the threshold for "significance"

- How does this follow from the premises? Let's say you get a very low value
for p(H_0)p(data|H_0); this can still be much higher than p(H_1)p(data|H_1),
etc., so it is still the best choice. I.e., you can get a low p-value given
H_0, but if there is no better model out there you should still keep H_0.
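
To make the formulation in (2) concrete, here is a toy numerical version (the
priors and likelihoods below are invented purely for illustration):

    # Three toy hypotheses with made-up priors and likelihoods p(data | H_i).
    priors      = {"H0": 0.90, "H1": 0.09, "H2": 0.01}
    likelihoods = {"H0": 0.02, "H1": 0.30, "H2": 0.60}

    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    posterior_h0 = priors["H0"] * likelihoods["H0"] / evidence
    print(f"p(H0 | data) = {posterior_h0:.2f}")   # ~0.35

    # Even with p(data | H0) = 0.02 ("significant"), how much H0 suffers depends
    # on how well the rival hypotheses explain the same data.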

~~~
tprice7
I’m assuming the question we are trying to answer is not “which H_n is most
probable”, but rather “how safe is it to conclude that H_0 is not true”. For
example, say we are concerned with whether the difference between two groups
is less than a certain amount or greater than a certain amount.

~~~
nonbel
>“how safe is it to conclude that H_0 is not true”

I would just take "very safe" as a principle; there is even the truism "all
models are wrong".

>"the difference between two groups is less than a certain amount or greater
than a certain amount"

You are ignoring a lot of the model being tested here (eg, normality,
independence of the measurements, etc) and only considering one parameter.

~~~
tprice7
“how safe is it to conclude that H_0 is not true”

What I meant here by H_0 was the hypothesis that the difference between groups
is less than some particular threshold. I think if you made the threshold
large enough then it would not be safe to conclude that H_0 is not true.

"You are ignoring a lot of the model being tested here (eg, normality,
independence of the measurements, etc) and only considering one parameter. "

I said enough so that if you were arguing in good faith you could fill in the
gaps yourself.

~~~
nonbel
I am arguing in good faith. There may be some cases in physics where a
theoretical distribution has been derived specifically for the problem, and
the model is believed to be actually 100% true. Otherwise, the model should
be assumed to be only an approximation at best.

~~~
tprice7
Yes, the t-test does assume normality and you can never be sure of perfect
normality if that's what you are getting at (although I believe that
simulation tests of the robustness of the t-test against deviations from
normality generally show that this isn't too much of a practical concern). I
wasn't trying to address every potential weakness with the t-test (or p-values
in general); I was addressing the one stated in the article.
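
For what it's worth, a minimal simulation of that robustness claim (the
exponential null and n = 30 per group are arbitrary choices): draw both groups
from the same skewed distribution and check how often a nominal 5% t-test
rejects.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_sims, n_per_group = 20_000, 30
    rejections = 0
    for _ in range(n_sims):
        a = rng.exponential(scale=1.0, size=n_per_group)   # skewed data, but the
        b = rng.exponential(scale=1.0, size=n_per_group)   # null is true by construction
        if stats.ttest_ind(a, b).pvalue < 0.05:
            rejections += 1
    print(f"empirical type I error: {rejections / n_sims:.3f} (nominal 0.05)")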

~~~
nonbel
I'm saying that unless you have bothered to derive a statistical model from
your theory, and you believe that theory may actually be correct, you know
that you will reject your model if enough time/money is spent on testing it.

~~~
tprice7
Ok, I will assume that by "model" here you mean a probability distribution on
the parameters relevant to your experiment. In that case I agree with what you
just said: knowing exactly the correct model is impossible in a similar way
that knowing someone's height to an infinite degree of precision is
impossible. But I never said anything to the contrary. The H_0 I gave
corresponds to an infinite set of models and not a single one (note that I
said the difference is less than a certain threshold, not that the difference
is 0 (although it still would be an infinite set in that case, but the
probability would be 0)).

~~~
tprice7
And in anticipation of the rebuttal that the probability is still 0 because
it's never exactly normal: what I really meant originally, but didn't write
out explicitly for brevity (and because I assumed it would be implicit), is
this. When I talk about giving an upper bound on p(null | data) from
p(data | null), what I really mean is giving an upper bound on
p(null | data, normal) from p(data | null, normal), where "normal" is the
assumption that whatever quantity we are looking at is normally distributed
and "null" is the event that the difference in means between the two groups
we are looking at is less than some predetermined positive threshold. Or, for
a 1-sample test, that the mean of a single group is within that threshold of
some default value.

~~~
nonbel
If you write out the actual calculation you will see normality (which was just
one example of an assumption) is actually part of the null model being tested.
It is not something different or outside of it.

~~~
tprice7
That is just a trivial semantics issue, and yes I am familiar with the
calculation.

Where do you think the flaw is specifically? Say we are doing a 1-sample test.

0. (Setup) Suppose we have a real number mu and a positive epsilon. Define
the interval I as [mu - epsilon, mu + epsilon]. For each "candidate mean"
within this interval, we have a corresponding t-statistic. Let the statistic
t_0 be the inf of all these t-statistics. Let T be the event that t_0 is at
least as big as the observed value.

1. You can use Student's t-distribution to compute an upper bound for the
probability of T under the assumptions that the observations are iid normal
and the mean lies in I. I will call this probability p(T | null, normal, iid),
where "null" is the event that the mean exists and is in I. It makes no
difference that it is more typical to lump these assumptions together as
"null", because in math you can define things however you want as long as you
are consistent.

2. We have that p(null | T, normal, iid) = p(T | null, normal, iid) * p(null
| normal, iid) / p(T | normal, iid).

3. Therefore, if we have an upper bound for x = p(null | normal, iid) / p(T |
normal, iid), then we can get an upper bound for p(null | T, normal, iid).
That is my main claim.

Which of the above statements do you object to?

~~~
nonbel
>"You can use Student’s t-distribution to compute an upper bound for the
probability of T under the assumptions"

I'm not sure what you are arguing anymore. I am saying you will _never_ test a
parameter value in isolation, it is always part of a model with other
assumptions. There is simply no such thing as testing a parameter value alone.
To define a likelihood you need more than simply a parameter...

You seemed to be disagreeing with that, but are now acknowledging the presence
of the other assumptions.

~~~
tprice7
“I'm not sure what you are arguing anymore.”

It’s the claim I make in 3, and then the secondary claim that making our upper
bound on p(null | T, normal, iid) small for significant p-values (i.e. p(T |
null, normal, iid)) could be used as a criterion for whether our threshold for
statistical significance is small enough.

“You seemed to be disagreeing with that”

I’m not sure what I said that gave that impression. I didn’t mention anything
about the normal / iid assumptions initially not because I thought we weren’t
making these assumptions but because I didn’t think these details were
essential to my point.

------
btilly
Here is a better way to think about this.

The proper role of data is to update our existing beliefs about the world. It
is not to specify what our beliefs should be.

The question that we really want to answer is, "What is the probability that X
is true?" What p-values do is replace that with the seemingly similar but very
different, "What is the probability that I'd have the evidence I have against
X by chance alone were X true?" Bayes factors try to capture the idea of
how much belief should shift.

The conclusion at the end is that replication is better than either approach.
I agree. We know that there are a lot of ways to hack p-values. Bayes
factors haven't caught on because they don't match how people want to think.
However if we keep consistent research standards, and replicate routinely, the
replication rate gives us a sense of how much confidence we should have in a
new result that we hear about.

(Spoiler. A lot less confidence than most breathless science reporting would
have you believe.)

~~~
kolpa
This is like functional programming, and people have a very hard time with
it. Instead of passing around a number ("95% true" or whatever), we're passing
around a function ("it's 2x as likely as you thought it was; please insert
your own prior and update"), or even _worse_, "please apply this complicated
curve function at whatever value you chose for your prior". It's just too hard
for people to manage. Computers can do it (but it's hard for them too, very
computationally intensive), and you have to really trust your computer program
to be working properly (and you have to put your ego in the incinerator!) to
hand over your decision-making to the computer.

~~~
btilly
I question whether computers can do it at all in useful practice.

Take a look at the results quoted in
[https://en.wikipedia.org/wiki/Bayesian_network#Inference_com...](https://en.wikipedia.org/wiki/Bayesian_network#Inference_complexity_and_approximation_algorithms)
about how updating a Bayesian net is an NP-hard problem, and even an
approximation algorithm that gets the probability right to within 0.5 more
than 0.5 of the time is NP-hard.

------
sykh
My favorite probability theory problem is related to this article.

You have a test for a disease that is 99% accurate. This means that 99% of the
time the test gives a correct result. You test positive for the disease and it
is known that 1% of the population has the disease. What is the probability
you have the disease?

The answer is not at all the one most people think at first when given this
problem. This problem is why getting two tests is always a good thing to do
when testing positive for a disease.

EDIT: I updated the statement of the problem to be one that can be answered!
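
Under the common reading of "99% accurate" as 99% sensitivity and 99%
specificity (the replies below discuss why the phrasing is ambiguous), Bayes'
theorem gives roughly a coin flip:

    prevalence  = 0.01
    sensitivity = 0.99   # p(positive | sick)    -- one reading of "99% accurate"
    specificity = 0.99   # p(negative | healthy) -- the other half of that reading

    p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    p_sick_given_positive = sensitivity * prevalence / p_positive
    print(f"p(sick | positive) = {p_sick_given_positive:.2f}")   # 0.50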

~~~
cdancette
I'm not sure you phrased the problem correctly. If we follow your explanation,
then the probability of having the disease is indeed 99%.

If you want to show the implication of Bayes' theorem then you need to be more
precise: say the test has 1% false positive and false negative rates (99%
reliability) and 1% of the population is sick. If you test positive, then the
probability of being sick is much less than 99%.

~~~
thaumasiotes
> If we follow your explanation, then the probability of having the disease is
> indeed 99%.

This is not correct; the probability of having the disease is unknown. He
didn't say what he meant by the test being "99% accurate", but that doesn't
mean you can just make your own assumption.

Note that in your more precisely specified scenario, when the test has 99%
reliability, it is perfectly true that "99% of the time the test gives a
correct result", which immediately disproves the claim that, if we follow that
definition, the probability of having the disease given a positive test result
is 99%.

~~~
cdancette
The problem is that "99% of the time the test gives a correct result" is
imprecise.

It can be understood as both:

- p(sick|positive) = 0.99

- p(positive|sick) = 0.99

We get totally different results: the first one is obvious (99% chance of
being sick), and the second one needs Bayes' theorem (and is the one we want
to use).

~~~
thaumasiotes
I would only interpret "the test gives a correct result 99% of the time" to
mean that out of every 100 test results, 99 are correct and one is wrong.
Neither of your interpretations matches that. You need all kinds of additional
information to say anything more specific. "99% of results are correct" can
easily be true while p(sick | positive) and p(positive | sick) each vary
anywhere between 0 and 1.

------
rossdavidh
The core issue is that p-values are cheaper to get than replicating the study,
but replicating the study is the only reliable way to see if a result is true
or not. Sometimes the expensive/time-consuming way is the only good way.

~~~
lisper
Replication by itself is not enough. You need pre-registration too. Otherwise
you can p-hack the replications.

~~~
tomrod
Pre-registration?

~~~
lisper
[http://www.apa.org/science/about/psa/2015/08/pre-registration.aspx](http://www.apa.org/science/about/psa/2015/08/pre-registration.aspx)

------
tyrankh
I'm not trying to be facetious, but isn't this something you learn in junior-
level stats? I had this drilled into me in both undergrad math courses and
grad machine learning courses; I'm confused to see it warrant an article.

~~~
pmyteh
It's well known what p-values show. But they are, in practice, used as a
gatekeeping mechanism in academic journals in many fields (including mine).
Worse, getting p<0.05 is informally treated as a measure of practical
significance, rather than simply one statistical test among many that a study
passed.

So yes, it is something you learn in introductory quantitative methods
classes. But I don't think most researchers understand just how much it
matters.

Also, a key R package for producing regression tables of coefficients for
journal articles is called 'stargazer'. Given the unwarranted focus of many
readers on those indicia of 'significant' results, I think it's well named.

I currently have the opposite problem. Given that I work with very large
online datasets (N=1M or so), _everything_, including the random noise, is
statistically significant at p<0.05. It really is effect sizes or bust at that
point.
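
A toy illustration of that large-N effect (the numbers are invented): with a
million observations per group, even a 0.01 standard-deviation difference -
practically nothing - clears any conventional threshold.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n = 1_000_000
    a = rng.normal(0.00, 1.0, n)
    b = rng.normal(0.01, 1.0, n)   # a 0.01 SD shift: practically meaningless

    print(f"p = {stats.ttest_ind(a, b).pvalue:.1e}")   # far below any conventional threshold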

~~~
AstralStorm
Real measures of practical significance are the odds ratio (OR), the effect
size, and the dose-response curve - or a response histogram for statistical
effects (or the 2D component-analysis island histogram).

------
keithfma
Andrew Gelman's blog provides regular insightful commentary on this issue; I
highly recommend it:

[http://andrewgelman.com/](http://andrewgelman.com/)

The post that turned me on to all of this is at:

[http://andrewgelman.com/2016/09/21/what-has-happened-down-here-is-the-winds-have-changed/](http://andrewgelman.com/2016/09/21/what-has-happened-down-here-is-the-winds-have-changed/)

------
amluto
The article says:

> Note: this has nothing to do with p-hacking (which is a huge but separate
> issue).

I disagree. p-hacking is when one experimenter checks many statistical tests
to find one that is significant. The effect the author is discussing is that
many experimenters do many experiments and the significant ones get published.
One is more unethical (or maybe just incompetent) than the other, but they’re
essentially the same phenomenon.

~~~
anbende
They are the same in that they both create a situation in which the p-value
cannot be trusted. However, in one case this is deliberate. In the other it's
a problem with the whole enterprise.

Also, running multiple tests without correcting for multiple testing (usually
by reducing the threshold for significance) is just one form of p-hacking. The
more insidious version is when one runs the test after every few participants
until random chance makes it "slip over the edge of significance". In that
case there might not even be enough variables for multiple testing to have
occurred, and it becomes very difficult to detect.
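
A rough simulation of that "test after every few participants" procedure (the
batch size of 10 and the cap of 200 per group are arbitrary): even with no
effect at all, peeking after every batch pushes the false positive rate well
above the nominal 5%.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n_sims, batch, max_n = 2_000, 10, 200
    false_positives = 0

    for _ in range(n_sims):
        a, b = [], []
        while len(a) < max_n:
            a.extend(rng.normal(0, 1, batch))   # the null is true: no difference
            b.extend(rng.normal(0, 1, batch))
            if stats.ttest_ind(a, b).pvalue < 0.05:
                false_positives += 1            # stop as soon as it "slips over the edge"
                break

    print(f"false positive rate with optional stopping: {false_positives / n_sims:.2f}")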

------
Malarkey73
I'm honestly more tired of essays about p-values than p-values.

It's true that, like all metrics, if it becomes a target then it may be abused
(Goodhart's Law).

However, if you abolished p-values, people would start hacking or
misunderstanding priors or confidence limits or odds ratios instead.

It's an easy, dumb stat that almost anyone can compute in Excel and almost
everyone recognises. The emphasis should be that it remains a quick shorthand
for casual use, but that more complex studies call for more sophisticated
models and probabilistic reasoning.

But the singular emphasis on p-values is bizarre. As JT Leek illustrates best,
the pipeline of data research has multiple points of failure that may lead to
false findings or irreproducible research, yet we talk very little about them
whilst essays about p-values come out every week...

[https://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412#/pipe](https://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412#/pipe)

------
aaavl2821
This was a really interesting article. I've worked with researchers who try to
defend a small but statistically significant finding that just doesn't seem
likely to be real, and this provides a statistical explanation for my
skepticism. The p-value mentality is deeply ingrained in a lot of researchers,
though.

The challenge for journal editors seems very real. There's another group that
deals with this challenge of interpreting the validity of significant findings
for a living, though: biotech VCs. A lot of times trying to reproduce the work
is their best way of addressing this, and often the first work done by
startups is to try to replicate the academic work. For some other heuristics
VCs use to assess "reproducibility risk", see here:

[https://lifescivc.com/2012/09/scientific-reproducibility-begleys-six-rules/](https://lifescivc.com/2012/09/scientific-reproducibility-begleys-six-rules/)

------
wmnwmn
Two solutions: a) stop doing experiments that just look for correlation without
any attempt to get at mechanism. Of course, sometimes you can't avoid this, and
then b) use lower p-value thresholds. Don't waste thousands or millions (more)
of dollars following up 5% results.

------
learnstats2
When I was first taught statistics, I was told that the researcher had to
justify a plausible hypothesis first - and then do a hypothesis test/p-value
to prove their theory.

If the scientist's intuitive understanding and the p-value test result align,
then that is a credible result.

On the other hand, the trend now is to conduct every possible test whether or
not there is any justification for doing so (corrected for multiple testing,
no p-hacking, yes, sure)

For example, in tech, we might test every shade of blue. Some of those blues
are gonna come up as p-value hits - but since we had no good reason to do this
test, this was probably just random noise.

Similarly, in genetics, we're gonna test every single gene against everything
- just to see what happens (yes, yes, do a Bonferroni correction on each set
of tests). Hmm, recent results in genetics don't seem to be very robust or
repeatable, for some reason.

The likelihood of a truthful link in these tests is incredibly low. When we
have no particular reason to believe there is a truthful link, and are just
blind testing, the false positive rate is very high (as described in the
article), and probably even higher than the article speculates - almost all
hits are gonna be false positives.

Maybe p-values just don't work well with modern day data. Or, maybe, Big Data
just doesn't contain information about mysterious, unexplored, and innovative
correlations that we hope it does.

~~~
teej
“On the other hand, the trend now is to conduct every possible test whether or
not there is any justification for doing so (corrected for multiple testing,
no p-hacking, yes, sure)”

You are literally describing p-hacking.

~~~
Fomite
He's describing a multiple comparisons problem, not p-hacking, enabled by,
essentially, the ease of statistical computation. An honest researcher can
trigger this problem without ever p-hacking.

See: Genome Wide Association Studies.

------
stevenjluck
Here's a follow-up to the original blog post:
[https://lucklab.ucdavis.edu/blog/2018/4/28/why-ive-lost-faith-in-p-values-part-2](https://lucklab.ucdavis.edu/blog/2018/4/28/why-ive-lost-faith-in-p-values-part-2)

------
piotrkaminski
> instead of asking whether an effect is null or not, we should ask how big
> the effect is likely to be given the data. However, at the end of the day,
> editors need to make an all-or-none decision about whether to publish a
> paper

Yet another way in which the traditional publishing structure actively harms
science.

~~~
Bromskloss
Do you have an alternative way of publishing in mind?

~~~
piotrkaminski
My criticism is not necessarily constructive. ;) But it's not too hard to
imagine something along the lines of arXiv combined with a rating/commenting
system not unlike HN itself, combined with a Facebook-ish algorithm for
surfacing articles relevant to each reader. The devil is in the details, of
course, but it shouldn't be hard to do better than the expensive, artificially
constrained, and arbitrary system we're saddled with now. The real trick will
be convincing entrenched academics to switch -- I'm still not convinced this
is actually possible.

------
alan-crowe
I remember reading [http://andrewgelman.com/2016/11/13/more-on-my-paper-with-john-carlin-on-type-m-and-type-s-errors/](http://andrewgelman.com/2016/11/13/more-on-my-paper-with-john-carlin-on-type-m-and-type-s-errors/)

with its graph "This is what power = 0.06 looks like". So I got the point that
you have to have sufficient statistical power. A useful rule of thumb is that
you need a power of at least 0.8. You need to have some idea of how big the
effect is likely to be - perhaps from previous exploratory research, from
claims of other researchers, or from reasoning "well, if this is happening the
way we think it is, there has to be an effect greater than x waiting to be
discovered." Then you work out how big a sample size you need. Then you roll
up your sleeves and get down to work.
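
For the "work out how big a sample size you need" step, here is a minimal
sketch using statsmodels (the 0.4 SD effect size is an arbitrary stand-in for
whatever your prior reasoning suggests):

    from statsmodels.stats.power import TTestIndPower

    # Sample size per group for a two-sample t-test at alpha = 0.05 and power = 0.8,
    # assuming the plausible effect size is 0.4 standard deviations.
    n_per_group = TTestIndPower().solve_power(effect_size=0.4, alpha=0.05, power=0.8)
    print(f"need about {n_per_group:.0f} participants per group")   # roughly 100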

But the reason for using p values rather than Bayesian inference is that it
gets you out of the tricky problem of coming up with a prior. You only need to
think about the null hypothesis and ask yourself whether the probability of
the data, given the null hypothesis, is less than 0.05.

So there is a bit of a contradiction. p-values don't really work unless you
ensure that you have sufficient power. To do that you need a plausible effect
size to feed into your power calculation. And that is implicitly a rough,
approximate prior: 50:50 either the null or that effect. You could just do a
Bayesian update, stating how much you shifted from 50:50.

Basically, if you don't already know enough to have an arguable prior to get a
Bayesian approach started, you don't know enough to do a power calculation, so
you shouldn't be using p-values either.

I went looking on andrewgelman.com for a reference for wanting power = 0.8 and
found a more recent post

[http://andrewgelman.com/2017/12/04/80-power-lie/](http://andrewgelman.com/2017/12/04/80-power-lie/)

Oh shit! The situation is much worse than I realised :-(

~~~
thousandautumns
> But the reason for using p values rather than Bayesian inference is that it
> gets you out of the tricky problem of coming up with a prior.

It technically doesn't even do this. Using a frequentist approach is
equivalent to a Bayesian approach with an uninformative prior, which is itself
an assumption baked into the analysis, only one that is almost unquestionably
incorrect. It's essentially saying you have literally no idea how the data are
being generated, which is certainly not true.

------
vcdimension
James Abdey wrote his Ph.D. thesis on this subject several years ago and
proposed an alternative method for making decisions based on statistical
evidence: [http://etheses.lse.ac.uk/31/](http://etheses.lse.ac.uk/31/)

------
jssmith
> Many researchers are now arguing that we should, more generally, move away
> from using statistics to make all-or-none decisions and instead use them for
> "estimation". In other words, instead of asking whether an effect is null or
> not, we should ask how big the effect is likely to be given the data.

I couldn’t agree more with this statement, and even more so in a business
setting than in research. It’s just so easy to get caught up in statistical
significance and lose perspective on practical significance. I’ve found
confidence intervals the most informative and easiest to understand.

------
thanatropism
This is an old thread already and I don't know if I'm getting my voice heard.
But at any rate: hypothesis testing (slightly different philosophically from
p-values, but anyway) is bogus because conjectures-and-refutations
falsificationism is bogus. That's not how good science has ever happened, only
how bogus research programs have dressed themselves in science.

The core of science is "the unity of science". Signal-to-noise measurements
tell you very little outside a general coherentist/holistic verificationist
framework.

~~~
NPMaxwell
About unity of science:
[https://en.wikipedia.org/wiki/Unity_of_science](https://en.wikipedia.org/wiki/Unity_of_science)

------
haberman
This is especially troubling when combined with confirmation bias. The whole
point of data is that it anchors us to reality. Data should be the check that
prevents us from believing something simply because we want it to be true. But
if we only test theories we already suspect are true, we are already biasing
the kinds of false positives we will get.

------
cassowary37
p-values are a lot like the weather - everyone complains about them, but
nobody does anything about them. Specifically, what tends to be missing from
these conversations is a good alternative - the author seems to be asking for
false discovery rates/q-values. Or maybe effect sizes? The reality is that one
size doesn't fit all, and the most useful statistic depends on the context.
Oh, and the target: good luck submitting your work to a biological journal
without p-values. I'm sure the editor will briefly marvel at your courage in
taking a stand as she rejects without review.

While we're on the subject: there's a tendency to appeal to larger sample
sizes, as the author also mentions. Worth remembering that for some of us data
isn't a thing you download from the interweb, it's something you generate -
and it costs money and time to do so. (And for human subjects research, the
stakes are even higher...)

------
rfdearborn
I don't think there's any cause to abandon p-values and NHST if you're running
experiments with high power and intelligent, deliberate priors.

With power = 0.8 and p(h1) = 0.6, p(h0 | p < 0.05) = 0.04. Even if power = 0.8
and p(h1) = 0.2 then p(h0 | p < 0.05) = 0.2.
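
Those figures follow directly from Bayes' theorem; a quick check:

    def p_h0_given_significant(alpha, power, p_h1):
        """Posterior probability of the null given a significant result."""
        p_h0 = 1 - p_h1
        return alpha * p_h0 / (alpha * p_h0 + power * p_h1)

    print(p_h0_given_significant(alpha=0.05, power=0.8, p_h1=0.6))   # 0.04
    print(p_h0_given_significant(alpha=0.05, power=0.8, p_h1=0.2))   # 0.20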

------
vadansky
Does anyone have that article about how pro- and anti-parapsychology
researchers jointly designed a study, analyzed the data, and got conflicting
results? (There was a joke about how it was the only paper published that
explained a discrepancy by saying the other side cheated.)

------
look_lookatme
Is there a book (written in plain language) that goes into the history of
academic journals and details the current state of the "replication crisis",
"data dredging", etc.?

------
tw1010
Despite using statistics daily, I still feel utterly uncomfortable about its
philosophical grounding. Are there any resources HN can suggest to soothe the
heart of a sceptic?

~~~
tyrankh
Skeptic * quite different from sceptic :)

~~~
nkurz
I think it's just an alternate spelling that is preferred in some locales:
[https://en.wiktionary.org/wiki/sceptic](https://en.wiktionary.org/wiki/sceptic)

------
imjustsaying
I never understood why people took p-values seriously. They never seemed to
mean anything of use.

Whenever I brought it up around other academics, no one seemed to want to
comment on it. Maybe they were afraid to admit they didn't understand a topic
that's apparently important to publishing? Anyone can follow the formula to
make a p-value, but there's no requirement to understand its meaning.

I'd love to find their use, but I still haven't found it.

~~~
Alex3917
> there's no requirement to understand its meaning.

This is a feature, not a bug. The fact that the p-value has nothing to do with
whether or not the hypothesis is correct is intended to force people to think
for themselves rather than using math as a substitute for logic.

~~~
k3d
If a feature is working in reverse it's probably a bug, though - it might be
_intended_ to force independent thinking, but in reality all it does is serve
as a gameable goalpost, no thought required.

------
btcindivist
p-values that are not in the physics ranges are ridiculous.

It's a shame everyone started copying physics but decided for higher
acceptance/rejection values.

I was a little bit disappointed when I realized that a bunch of valid modern
science is just proper experiment design and number crunching. If it's not
physics, there are no models of why things work; there's just a p-value on a
correlation or some other comparison function.

Medicine has turned into a field where you can't know a thing.

[http://www.cochrane.org/CD005427/BACK_combined-chiropractic-interventions-for-low-back-pain](http://www.cochrane.org/CD005427/BACK_combined-chiropractic-interventions-for-low-back-pain)

I love reading reports like the above:

> There is currently no evidence that supports or refutes that these
> interventions (chiropractic intervention) provide a clinically meaningful
> difference for pain or disability in people with [lower back pain] when
> compared to other interventions.

p-values really do not help that much.

~~~
kgwgk
> It's a shame everyone started copying physics but decided for higher
> acceptance/rejection values.

P-values were developed outside of physics; it's not like people took them
from physics and then relaxed the significance thresholds.

"If your experiment needs statistics, you ought to have done a better
experiment." Lord Ernest Rutherford (maybe)

------
Karishma1234
Sometimes dumbing down a concept can totally screw up a person's learning
curve. In the early days, a lot of Java tutorials (in Indian engineering
books) said that the reason Java has interfaces is that it is otherwise not
possible to inherit from multiple classes. While it is true that you can
implement multiple interfaces, the whole point of an interface is to define an
"interface" without forcing an implementation. It has nothing to do with the
"limitation" of single inheritance.

Coming back to p-values. A simple Google search will find you many articles
that say

> A small p-value (typically ≤ 0.05) indicates strong evidence against the
> null hypothesis, so you reject the null hypothesis.

The whole idea of p-values is to prompt a scientist to check for statistical
significance. What matters is the behaviour of the hypothesis over many
repeated trials, and hence more data => better reliability. But "more" and
"better" are subjective ideas; in many cases, with everything else in order,
<0.05 might be good, but not always. There are far too many factors - the
wrong sampling method, things you cannot measure versus things you can, etc. -
that truly affect this number.

I think author nails it when he writes "Replication is the best statistic."

Always think of these tests from an evolutionary perspective: do you think
this hypothesis would survive the test of time, where it has to repeatedly
face the real world?

~~~
candiodari
It seems to me that a number of things are not clear at all.

1) p-values are a metric over a whole library of procedures, not a single
check (although one of them, the one based on the normal distribution, is
usually what's meant). There are 5 basic ones everyone should know, and you
could fill a decent bookshelf with the details of all of them.

This means a true p-value should be accompanied by

a) what the source data was

b) how it's distributed, and why (better yet, proof). This should include
assurances that there are no attempts to game the measurement (or otherwise
any change in the source data directly related to this measurement), as that
of course invalidates it.

c) a sanity check (like a normality test, or redoing the procedure on
generated data and verifying the expected result)

d) what the exact claim is (e.g. this is normally distributed with a mean > X)

e) what the procedure was to verify this claim (e.g. normality test + mean > X
... you need to COMBINE your p values, because if the data isn't normal your P
test is invalid and of course your hypothesis should be rejected even if the
numbers in the t-test say it shouldn't)

f) a HUGE disclaimer that this measurement and its actual numerical result
only apply to past values; if it's used to change something, it is no longer
valid, even though you can still calculate it and its numerical value will
change as a result.

2) it's not a given that it's possible at all. The truth is that there are
many things that don't follow the central limit theorem and therefore cannot
correctly be used for statistical approximation.

Essentially you need to be utterly convinced that every measurement is the
result of somehow combining a large number of effects that keep reappearing.

For instance, planetary orbits do NOT satisfy this criterion. Sure, they're
the result of large numbers of influences, but almost none of those influences
ever repeat (e.g. a comet passing in a close orbit is a big modifier of an
orbit, and in 4-5 billion years that'll happen about once for any given
asteroid). Almost every perturbation is unique and doesn't repeat, so it can't
be predicted and won't follow statistical laws. That is the sort of effect
that no form of statistics will ever find, and it invalidates your results.
Because these events are rare, statistics works in the short term, but it
doesn't work in the long term. E.g. if you repeat the experiment to determine
the speed of light from Jupiter's moons, that won't work with the original
data because their orbits have shifted too much.

~~~
Karishma1234
You are far more articulate than I am.

------
nonbel
tldr: The author got a PhD in 1993[1] and is just now figuring out that
p-values are not false positive rates

[1][http://mindbrain.ucdavis.edu/people/sjluck](http://mindbrain.ucdavis.edu/people/sjluck)

I was lucky and figured it out before getting a degree. It's got to be hard
for people in this position to look back on their previous work knowing that
the most fundamental aspect of interpreting the results was incorrect.

He gets it right that statistics are good for estimation, but there is a part
two. You need to come up with a theory that makes a prediction to compare to
these estimates, and then test _that_. I.e., your prediction about the
distribution of the results is the "null hypothesis". I think p-values are
probably OK for that.

~~~
SubiculumCode
I've met Steve Luck multiple times over the last decade. He is a very rigorous
and insightful scientist, with expertise in a wide range of methods
(especially EEG) and psychological phenomena. His wife, Prof. Lisa Oaks, may
be even more impressive in these regards, but that is an aside. The point I
would like to make is that there are lots of aspects to research, and stats is
just one of many parts to master. It should not be up to a neuroscientist to
make posts like this. The stats community should be _actively_ pushing the
scientific community toward alternatives that match interpretative intuition
with the reality of the statistical metric.

~~~
nonbel
I went through the same thing. I was taught the same half-assed BS statistics
with all the wrong interpretations and was surrounded by people just following
along with that.

Still, it was a point of duty/honor for me to figure out what these
statistical results meant. How is interpreting results not a key part of a
scientist's job? I guess if you don't want to bother learning to interpret
results then you can do science by being either a lab tech or a theorist.

------
thaumasiotes
Sounds like someone who never understood statistics, still doesn't, and
doesn't want to.

A particularly glaring issue is this offhand comment:

> this is a statement about what happens when the null hypothesis is actually
> true. In real research, we don't know whether the null hypothesis is
> actually true. If we knew that, we wouldn't need any statistics! In real
> research, we have a p value, and we want to know whether we should accept or
> reject the null hypothesis.

That isn't a question that _any_ statistical approach will help you with.
There's a reason we talk in terms of "rejecting" or "failing to reject" a
hypothesis. We don't do statistical tests to accept hypotheses, only to reject
them.

The concept of accepting one hypothesis based on a comparison between it and
one other hypothesis is ludicrous on its face, suffering exactly the problems
associated with Pascal's wager.

~~~
s17n
He clearly meant "fail to reject" where he said "accept", you're quibbling
about semantics. The whole post is about how p-values (and indeed, known
statistical techniques) don't actually help you decide whether or not you
should reject the null.

~~~
thaumasiotes
The post is about a confusion between the question that p-values address ("how
likely is this data to have come from the null hypothesis?") and another very
different question ("given several findings that were unlikely to have come
from various different hypotheses, how many are likely to be spurious?").

