
The problem with p-values - vmuhonen
https://aeon.co/essays/it-s-time-for-science-to-abandon-the-term-statistically-significant
======
nkurz
_Take the proposition that the Earth goes round the Sun. It either does or it
doesn’t, so it’s hard to see how we could pick a probability for this
statement._

Sometimes I fail to follow the distinctions made in certain strains of
classical statistics. How is his "conundrum" different from this one? "Roll a
six-sided die whose result we cannot see. Take the proposition that the top
shows a 6. It either does or it doesn’t, so it’s hard to see how we could pick
a probability for this statement."

 _What matters to a scientific observer is how often you’ll be wrong if you
claim that an effect is real, rather than being merely random._

I think "scientific observer" may mean statistician here.

For the scientist, what should matter is the probability that the claimed
effect is real -- period. That is, unlike the statistician, the scientist
isn't (shouldn't be) allowed to blame "modeling error" when it turns out that
the measurements are biased, the samples are correlated, or the effects are
nonlinear. False assumptions that "randomness" is the only (or main) danger
can lead to unrealistic error bars and unwarranted confidence in the
effectiveness of flawed models.

~~~
bbctol
Well, it's less of a conundrum in the case of the die because it's easier to
estimate the priors. That's the power and difficulty of using Bayesian
reasoning; you need some estimate of how likely something is to be true before
you perform the statistical test.

In the case of the six-sided die, we have a good physical model and years of
memory to know that there should be a 1/6 chance of any number coming up, so
it's easy to estimate that prior. Similarly, when doing a disease screening
test, we have data on how common diseases are in the general population that
can easily be used as the prior.
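
As a minimal sketch of that screening calculation (the base rate, sensitivity,
and false-positive rate below are made-up illustrative numbers, not real data):

    # Posterior probability of disease given a positive screening test.
    base_rate = 0.01        # prior: assume 1% of the population has the disease
    sensitivity = 0.90      # assumed P(positive test | disease)
    false_pos = 0.05        # assumed P(positive test | no disease)

    p_positive = sensitivity * base_rate + false_pos * (1 - base_rate)
    posterior = sensitivity * base_rate / p_positive
    print(f"P(disease | positive) = {posterior:.3f}")   # ~0.154

Even a fairly accurate test leaves a modest posterior when the base rate is
low, which is exactly why the prior matters.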

The problem comes when using Bayesian methods on unsolved scientific problems.
Suppose you don't know if the Earth goes around the Sun or the other way
around, it's the early Renaissance and you've gathered some data that could
indicate a probability of one or the other. What are the prior odds that one
theory is correct? You have no idea; that's why you're investigating! The
worry when using Bayes's theorem to replace deductive p-value methods is that
the prior probabilities may just be made up out of baseless intuition, and
skew the final calculation. (It can still be used effectively, you just have
to get a little fancy.)

~~~
nkurz
_Well, it's less of a conundrum in the case of the die because it's easier to
estimate the priors._

I agree, but in the case of a die, "estimating the priors" is the same thing
as "knowing the probability of a 6". So you are saying (correctly) that once
we have a model that we believe in, we can assign a probability because we
believe we know the probability in advance.

I was pointing more to the author's emphasis on "either it does or it
doesn't". One view is that probability requires replication, and we can only
speak of probabilities in the long run of many trials. The other view
(Bayesian, to my limited understanding) is that probability can also be used
to measure an appropriate degree of belief in a proposition.

While it's hard to quantify, and dependent on one's assumptions, I do think
it's possible to speak of "the probability that the earth revolves around the
sun", much as I think it's meaningful to talk about "the probability of
anthropogenic warming" even though we only have a single Earth to study.

I don't really understand how classical statistics disallows discussion of the
first while allowing discussion of the second. Maybe it doesn't?

~~~
bbctol
I'd say classical statistics lets you deduce, based on some knowledge that
something is the case, the probability of an event. It doesn't let you induce
the probability of something being the case (that's why we have all this
indirect dodging with p-values). So I think there is a fundamental difference
between "What are the odds of rolling a six?", understood as "What are the
odds of that event happening?", and "What are the odds the Sun revolves around
the Earth?", which is a question about whether or not something is true.

That's what the author seems to mean by "either it does or it doesn't," though
it isn't worded that well. Even if the die has already been rolled, we know
that it _could_ have been a 4 or a 5, and have good knowledge of the
probabilities of those events. This is more like looking at a picture of the
top of a die and seeing a six: what are the odds that it's a standard die, vs.
a die with sixes on each face? I don't think it's meaningful to talk about
that, or the sun going around the Earth, without using a Bayesian prior.
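
To make that concrete, here's a small sketch; the priors on having been handed
a trick die are made up for illustration:

    # Seeing a six in the picture: standard die, or all-sixes trick die?
    def posterior_trick(prior_trick):
        """P(trick die | we see a six); the prior is an assumption we supply."""
        p_six_trick, p_six_fair = 1.0, 1 / 6
        joint_trick = p_six_trick * prior_trick
        joint_fair = p_six_fair * (1 - prior_trick)
        return joint_trick / (joint_trick + joint_fair)

    print(posterior_trick(0.5))    # ~0.857: agnostic prior, the six matters
    print(posterior_trick(0.01))   # ~0.057: trick dice are rare, prior wins

The likelihood ratio is fixed at 6:1 in favor of the all-sixes die, but the
number you actually care about is dominated by the prior.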

~~~
duneroadrunner
> So I think there is a fundamental difference between "What are the odds of
> rolling a six" as "What are the odds of that event happening," and "What are
> the odds the Sun revolves around the Earth" as a question about whether or
> not something is true.

Right, using Bayes makes the difference not a "fundamental" one, but just a
practical one of coming up with the Bayesian prior. Even if it would be hard
to establish a consensus on the most appropriate complete set of factors
determining the Bayesian prior, there are clearly some examples of meaningful
inputs. Like, for example, if you somehow had information about how many of
the researcher's previous hypotheses on the subject failed to reach a
"significant" p-value.

But perhaps more practically, you could consider things like the (Kolmogorov)
complexity of the hypothesis. Since the number of "low complexity" hypotheses
is finite, they are less susceptible to "p-value mining"[1]. The challenge
is deciding which inputs to use to evaluate the complexity.
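
Kolmogorov complexity itself is uncomputable, so any practical version needs a
proxy. One toy stand-in, in the spirit of minimum description length (an
illustrative sketch, not a serious prior), is just the compressed size of a
hypothesis's description:

    import zlib

    def description_length(hypothesis: str) -> int:
        """Toy complexity proxy: compressed byte length of the statement."""
        return len(zlib.compress(hypothesis.encode("utf-8")))

    simple = "y = a * x + b"
    baroque = "y = a * x + b + c * sin(d * x) ** 3 - e / (f + x ** g)"
    print(description_length(simple), description_length(baroque))
    # The longer, fiddlier hypothesis gets the larger description length,
    # so a complexity-penalizing prior would weight it down.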

This seems to me like an area where machine learning should be applicable.
Rather than lament the impracticality of determining an appropriate,
tractable set of (quantifiable) criteria for determining a Bayesian prior, why
not just include every potentially relevant piece of information and let Deep
Thought[2] figure out which are actually relevant? So what we really need is
unified data about all published results that have been confirmed and
discredited.

[1] obligatory xkcd: [https://www.xkcd.com/882/](https://www.xkcd.com/882/)

[2] for the youngsters:
[https://en.wikipedia.org/wiki/List_of_minor_The_Hitchhiker%2...](https://en.wikipedia.org/wiki/List_of_minor_The_Hitchhiker%27s_Guide_to_the_Galaxy_characters#Deep_Thought)

------
cuchoi
I think the article is a very good explanation of the application of Bayes'
theorem to p-values.

My only issue with these kinds of calculations is that their assumptions
dramatically change the results. He assumes that 10% of the drugs tested are
going to be effective. This might apply to a specific area, but not to others.
Here is a Nature publication that varies the most important parameters, power
and the % of programs/drugs/tests that truly have an impact:
[http://www.nature.com/nmeth/journal/v10/n12/full/nmeth.2738....](http://www.nature.com/nmeth/journal/v10/n12/full/nmeth.2738.html).

If you assume a power of 80% (not crazy in rigorous studies) and 50% of
programs having an effect (not crazy in many areas), then you get that 94% of
the programs you claimed had an impact were truly effective.
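
For reference, here is the arithmetic behind those numbers (a quick sketch,
with alpha fixed at the conventional 0.05):

    def ppv(power, prior, alpha=0.05):
        """Share of p < alpha 'discoveries' that are real effects."""
        true_pos = power * prior         # real effects that get detected
        false_pos = alpha * (1 - prior)  # nulls crossing the threshold anyway
        return true_pos / (true_pos + false_pos)

    print(f"{ppv(0.80, 0.50):.0%}")  # ~94%: the optimistic scenario above
    print(f"{ppv(0.80, 0.10):.0%}")  # ~64%: with the article's 10% prior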

~~~
apathy
> If you assume power of 80% (not crazy in rigorous studies) and 50% of
> programs having an effect...

You're making another assumption: that the power calculations are actually
rigorous and the effect size (not "having an effect", but "having an effect of
at least the size used to calculate power at a given alpha") is sufficient. If
you've ever designed a clinical trial or written up the statistics for a
grant, you know damned well that these numbers are cooked 6 ways from Sunday.

Colquhoun is a sharp guy (I have debated various fine points with him on
several occasions and he has convinced me with reproducible examples that I
was wrong in my beliefs). It would be nice if the title of the article were
"It's time for science to abandon the term statistically significant", as in
the URL, because that's the real point here.

"Signficance" can only be judged in context. Post all your data and then we'll
see whether you can be believed. But that notion of transparency is absolutely
terrifying to the sorts of senior scientists who control most funding and peer
review at the moment. It's OK, though; the rest of us put up preprints
knocking down the most egregious lies and those who care about the truth can
have it (for free, no paywall).

I've got plenty of Cell, Nature, and NEJM papers on my CV; direct support on
grants where I am PI, Co-I, or KP is over $5M; and I still think the situation
is fucked up. I'm not bitter because I feel left out; I'm bitter because I
fear that the bad money is driving out the good, as it always does.

Take away the monetary incentive, the Journal Impact Factor BS, the "prestige"
of hoodwinking 3-5 referees with a pile of STAP or single-sample comparisons
(had to warn a student about this when he pulled some data from a Nature paper
that a respectable MSKCC scientist gooned for...) and let's see what's left...

------
Homunculiheaded
This statement is a misunderstanding of the Bayesian approach:

"Take the proposition that the Earth goes round the Sun. It either does or it
doesn’t, so it’s hard to see how we could pick a probability for this
statement."

The Bayes factor, the Bayesian alternative to an NHST, is quite a bit different
from simply creating the Bayesian equivalent of a t-test. It asks "How many
times better is my hypothesis at explaining the data than an alternate
hypothesis?" So the Bayesian approach would first pit one model of the Earth's
orbit against another. The Bayesian statement of the question would be:

"How much more likely is the astronomical data we've observed given that the
Earth revolves around the Sun than given that the Sun revolves around the
Earth?"

For a more concrete example, let's suppose that we have a coin. I think the
coin has only heads and you think it is a fair coin, with a 50/50 chance of
getting heads or tails. We observe three heads in a row. My hypothesis says
that the probability of getting 3 heads in a row given a trick coin is 1. Your
hypothesis says that the probability of getting 3 heads in a row given a fair
coin is 0.5 x 0.5 x 0.5 = 0.125. My hypothesis explains the data 1/0.125 = 8
times better than your hypothesis. Now suppose the next flip is a tail. The
probability of HHHT in my model is 0 and in yours is 0.5 x 0.5 x 0.5 x 0.5 =
0.0625. Your hypothesis explains the data infinitely better than mine!

Now we can say that our new hypothesis is that the coin is fair. Suppose
another friend comes along and claims that they thought the coin had a 75%
chance of getting heads and only a 25% chance of tails. We flip the coin 5
more times and get HHTTH. Your hypothesis says 0.5^5 = 0.03125, and the
friend's says 0.75^3 x 0.25^2 = 0.0263... Your hypothesis explains the data
only 1.2 times better than theirs. Clearly, we need more data to feel really
confident in one hypothesis over the other.
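
Those ratios are easy to reproduce; a minimal sketch of the same arithmetic:

    def seq_likelihood(p_heads, flips):
        """Probability of observing this exact flip sequence given P(heads)."""
        prob = 1.0
        for flip in flips:
            prob *= p_heads if flip == "H" else 1 - p_heads
        return prob

    # Fair coin vs. the friend's 75%-heads coin, on the sequence HHTTH:
    bf = seq_likelihood(0.50, "HHTTH") / seq_likelihood(0.75, "HHTTH")
    print(f"Bayes factor ≈ {bf:.2f}")   # ≈ 1.19: barely favors the fair coin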

If you want an even longer example, I wrote a post awhile back about "Bayesian
Reasoning in the Twilight Zone" that goes into more detail (including
priors)[0]

[0] [https://www.countbayesie.com/blog/2016/3/16/bayesian-reasoni...](https://www.countbayesie.com/blog/2016/3/16/bayesian-reasoning-in-the-twilight-zone)

~~~
SubiculumCode
So in the real world, are there developed techniques to build relatively
complex models that account for covariates and sources of variability (e.g.
random effects) and repeated measures?

------
jsprogrammer
>The problem is that the p-value gives the right answer to the wrong question.
What we really want to know is not the probability of the observations given a
hypothesis about the existence of a real effect, but rather the probability
that there is a real effect – that the hypothesis is true – given the
observations. And that is a problem of induction.

The problem of induction is real and unavoidable in the general case, but
there is no such thing as "the probability that there is a real effect".
Either there is a "real effect" or there is not.

It might be possible to find a "probability" that you observed a "real
effect".

>The problem of induction was solved, in principle, by the Reverend Thomas
Bayes in the middle of the 18th century.

The problem of induction is fundamentally unsolvable (hence, "problem"). The
article just states that it was solved and never mentions it again. Is it a
widely held view that induction was solved by Bayes? Does anyone know where I
can read more detailed claims about how people believe Bayes _solved
induction_?

~~~
bayeslives
Bayes did not "solve induction". If we define induction as telling which
specific model generated this data, that is not possible: countless models
could in theory generate our data. What we need is some restriction. Like a
prior. And when was the last time you started a research project without _any_
idea what to expect? And if you did, wouldn't it be wiser to do some
literature study, expert interviews etc. before starting experiments?
Modeling the state of the art, pre-experiment, seems like a clever move
anyway.

To name just a few of the NHST/p-value flaws:

1- I'm interested in P(H1 | data) but I get P(data | H0). Contrary to popular
belief, P(data | H0) != P(H0 | data). Let alone that conclusions about
P(H1 | data) can be drawn.

2- it is vulnerable to wrong interpretations.
* No, a 95% confidence interval (a,b) does NOT mean there is a 95% chance that
mu is in (a,b).
* No, p=0.04 does not mean there is a 96% chance that H1 is true.

3- the p-value depends on the intentions of the scientist. If you end your
experiment after 80 observations, as planned, your p-value is different from
that of an experiment that ended unplanned after 80 observations. So the same
data have different evidential power, influenced by results you did not see in
experiments you did not do. This is very unsatisfactory. (See the simulation
sketch after this list.)

4- the idea of "an effect that exists or does not exist", based on some
arbitrary threshold. The reality is, in many cases, uncertainty and variation.
In group A I see effects of medicine A, with lots of variation between
persons. In group B I see varying effects of medicine B. Then I introduce
uncertainty by drawing random samples from A and B. Let's say I use those
samples to make an inference: is, on average, medicine A better than medicine
B? Mantras like "there is an effect, or there isn't" are not very helpful.
Statistics should be about quantifying uncertainty rather than giving false
yes/no statements.

5- Basing decisions and knowledge on the data only makes it vulnerable to
outliers, unlucky samples and so on. And why should you NOT use information
when it's there?
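
A quick simulation of point 3's peeking effect (a sketch with numpy/scipy;
the sample sizes and simulation counts are arbitrary illustrative choices):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_sims, n_max, alpha = 2000, 80, 0.05
    fixed, peeking = 0, 0

    for _ in range(n_sims):
        x = rng.normal(0.0, 1.0, n_max)          # H0 is true: true mean is 0
        if stats.ttest_1samp(x, 0.0).pvalue < alpha:
            fixed += 1                           # one test at the planned n=80
        if any(stats.ttest_1samp(x[:n], 0.0).pvalue < alpha
               for n in range(10, n_max + 1)):
            peeking += 1                         # test after every observation

    print(f"false positives, fixed n=80: {fixed / n_sims:.2f}")    # ~0.05
    print(f"false positives, peeking:    {peeking / n_sims:.2f}")  # well above

The same data, with a different stopping intention, yield a very different
error rate.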

~~~
vmuhonen
The p-value is defined as the probability of observing a result equally or
more extreme than the one seen, under a model H0. So if you start with the
assumption that H0 is true, there's not much you can say about alternative
hypotheses.

The American Statistical Association actually put out a statement this year on
the issue of p-value. You can find the whole article here
[http://dx.doi.org/10.1080/00031305.2016.1154108](http://dx.doi.org/10.1080/00031305.2016.1154108)
but here are the main points:

1) P-values can indicate how incompatible the data are with a specified
statistical model.

2) P-values do not measure the probability that the studied hypothesis is
true, or the probability that the data were produced by random chance alone.

3) Scientific conclusions and business or policy decisions should not be based
only on whether a p-value passes a specific threshold.

4) Proper inference requires full reporting and transparency.

5) A p-value, or statistical significance, does not measure the size of an
effect or the importance of a result.

6) By itself, a p-value does not provide a good measure of evidence regarding
a model or hypothesis.

As an additional curiosity, the group of writers was not completely in
agreement when coming up with those definitions, and the article contains a
number of supplemental articles by the individual authors to clarify/dispute
some of the points made.

[edit] fixed formatting

~~~
bayeslives
Good points. The NHST thing was invented by Neyman & Pearson as a tool for
decision making, not for finding the truth. 95% confidence means your
intervals will be not too far off in 95% of all samples.

Perhaps this is nice for Quality Assurance in factories, where I do repeated
measurements and want a simple YES or NO.

But science usually asks: "What can I learn from this specific data? I don't
do 100 samples and I'm not interested in being 'not too far off most of the
time'. I want a best estimate based on this specific sample."

NHST does not give that answer. Bayes does.
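
For what it's worth, the simplest conjugate-prior version of "a best estimate
from this specific sample" looks like this (a sketch; the flat prior and the
7-successes-in-10-trials data are made up):

    # Beta-Binomial: prior Beta(a, b), then observe k successes in n trials.
    a, b = 1, 1            # flat prior: no strong opinion before the data
    k, n = 7, 10           # this specific sample
    post_a, post_b = a + k, b + (n - k)
    posterior_mean = post_a / (post_a + post_b)
    print(f"posterior mean ≈ {posterior_mean:.3f}")   # 8/12 ≈ 0.667

The full posterior here is Beta(8, 4), so you get quantified uncertainty about
the parameter rather than a bare yes/no.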

~~~
jsprogrammer
Science only asks, "Have I observed something contrary to my theories?" For
non-deductive theories, the only real approach is to count the number of times
you observed something agreeing with your theories vs. the total number of
times you observed something (i.e. a p-value).

------
qwrusz
It's ironic that the author starts off saying scientists need to get more
rigorous in their science and statistics, then goes on to write a relatively
short and not very rigorous overview of Bayesian statistics.

Statistics and probability are difficult subjects. They are also not intuitive
subjects for many people.

Besides the publish-or-perish thing, I would guess many authors of these
unreliable/non-replicable biomedicine papers were focused on biomedicine
during grad school; while they learned statistics, it was a secondary subject.

A solution seems to be more statistics training for academics doing studies or
requiring a trained statistician on the team/reviewing a study prior to
publication. I don't see either happening, to be honest, and this problem will
likely continue.

