
P values are not as reliable as many scientists assume - feelthepain
http://www.nature.com/news/scientific-method-statistical-errors-1.14700
======
Blahah
This is probably the most annoying problem in my daily life (yeah I know,
first-world problems). I have daily conversations with biologists where I've
analysed some data and associated (say) a posterior probability with each
condition in the model. They insist I give them p-values or something they can
present as though they were p-values. At the beginning of my PhD I complied.
Then they throw out all the nuance in the data, put one asterisk for p < 0.05,
two asterisks for p < 0.01, etc., and (this is the horrifying part) _believe_
that an asterisk indicates that something is true. They put stupid asterisks
all over my beautiful plots and then think their arbitrary cutoffs mean
something biologically meaningful. I die a little inside every time I see an
asterisk on a plot.

Now I refuse to use p-values and deliberately construct analyses that are
incompatible with Fisherian statistics. And rather than giving people raw
numbers, I produce a massive document of interpretation. Takes a huge amount
of time, but I'm hoping it will mean my publishing track will contain
significantly (ha!) fewer false results than most biologists'.

~~~
mtdewcmu
Noble, but it sounds risky. Don't you need false results to pad your CV to
keep up?

~~~
gwern
He may not be long for this (scientific) world. As they say, _p_-ublish or
perish.

~~~
Blahah
Publishing is easy. It's just a bit easier for the people who are (perhaps
unconsciously) cheating. I'm pretty confident my loudly-refuse-to-cheat
strategy will prevent me perishing before my time.

------
gabemart
I was recently trying to explain Bayesian logic to a friend, and came up with
the following analogy. I would be interested to hear feedback on it.

---

Imagine everyone in the USA gets sudden amnesia. We want to find out who the
President is, but no one can remember.

A scientist comes up with a test to determine if someone is the President.

If they _are_ the President, there is a 100% chance the test will say they are
the President and a 0% chance the test will say they are not the President.

If they are _not_ the President, there is a 99.999% chance the test will say
they are not the President, and a 0.001% chance the test will falsely say they
are the President.

Giving the test to the person sitting in the big chair in the Oval Office is
useful, because it's already quite likely this person is the President. If the
test is positive for Presidency, it's extremely likely that person is the
president.

Giving the test to the 10 people nearest the Oval Office is useful, because
it's fairly likely the President is one of these people. A positive result
will indicate strongly that that person is the President, and if no-one in
that group is actually the President, there's a 99.99% chance the test will
say so.

Giving the test to the 1000 people in the White House is pretty useful,
because it's pretty likely the President is in the White House, and if none of
these people are the president, there's still a 99% chance the test will be
correct. A positive result for any one person will indicate quite strongly
that that person is the President.

But giving the test to everyone in America is not very useful at all, because
it's very unlikely that any particular person is the President, and we can
expect the test will give a positive result for around 3200 people. For any
particular person in this group, it's much more likely they're not the
President than they are.
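
For concreteness, here is a minimal Python sketch of the arithmetic behind the
analogy (the 3200 figure assumes a population of roughly 320 million; the
priors for the smaller groups are illustrative guesses, not numbers from the
analogy itself):

```python
def posterior(prior, sensitivity=1.0, false_positive_rate=0.00001):
    """P(is President | test says President), by Bayes' theorem."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# Illustrative priors (my own guesses, except the last, which is roughly
# 1 / US population):
for label, prior in [("person in the big chair", 0.9),
                     ("one of 10 near the Oval Office", 0.09),
                     ("one of 1000 in the White House", 0.001),
                     ("a random American", 1 / 320_000_000)]:
    print(f"{label}: P(President | positive) = {posterior(prior):.6f}")

# Expected false positives if the whole country is tested:
print("expected false positives:", 320_000_000 * 0.00001)  # about 3200
```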

---

Is this a broadly correct, if non-rigorous, analogy? I realize most HNers will
be much more familiar with this stuff than I am, I'm interested chiefly in
whether or not I misled my friend.

~~~
klodolph
Right. And the frequentist version is also useful.

If you test one person and the test is positive, then that person is the
president (p=0.00001).

If you test a thousand people, and the test is positive for one of them, then
that person is the president (p=0.01).
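
A rough sketch of where those two numbers come from, treating "not the
president" as the null hypothesis and reusing the 0.001% false positive rate
from the analogy above:

```python
fpr = 0.00001  # false positive rate of the test, from the analogy

# Test one person: under the null, a positive result happens with
# probability 0.00001.
p_one = fpr

# Test 1000 people: probability of at least one positive result if all
# 1000 nulls are true.
p_thousand = 1 - (1 - fpr) ** 1000

print(p_one)                 # 1e-05
print(round(p_thousand, 4))  # ~0.01
```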

So you don't really need Bayesian logic to reason that you should test fewer
people if you want a more significant result. (Note I'm not saying you don't
need Bayes' Theorem, which _everyone_ uses.)

Edit: I think most people on HN get their knowledge of frequentist and
bayesian statistics from XKCD #1132. That's sad.

~~~
jey
> So you don't really need Bayesian logic to reason that you should test fewer
> people if you want a more significant result.

That's misleading by the use of the word "significant", which apparently means
something different in frequentism than it does in normal speech. I certainly wouldn't
use "significant" in that way as a non-frequentist, I would instead rephrase
what you said as:

> So you don't really need Bayesian logic to reason that you should test fewer
> people if you want _more confirmation bias_ in your result.

And that's a statement I can definitely get behind!

~~~
_delirium
It can definitely be used misleadingly, but it's not too out of line with
normal scientific usage of the term. The "significant" in "significant
figures" is the same: if a number has "11 significant figures", it doesn't
mean the 11th digit is significant in the sense of being important or having a
big impact, just that the 11th digit is within the measurement precision (as
propagated through any subsequent calculations).

------
jawns
This is the reason why I don't include P values on
[http://www.correlated.org](http://www.correlated.org).

They would muddle my otherwise irreproachable statistics.

~~~
klodolph
I bet the p-values on that site would be very high if properly calculated.
Generally, if you are reporting correlations between a large number of
variables, the p-values shoot through the roof.

Of course, TONS of people forget this and publish a p-value as if those two
variables are the only ones under consideration. Which is just sad.

~~~
singingfish
With correlation statistics the p value depends strongly on the sample size.

~~~
klodolph
I think we're talking about different experiments. I'm not talking about,
"here are the correlation statistics for 100 variables", I'm talking about, "I
tested 100 variables, and these are the five pairs which are most strongly
correlated."

~~~
singingfish
At which point you need to do some variant on principal components analysis on
the covariance matrix.

------
relaunched
The last sentence in this paragraph is hilarious:

P values have always had critics. In their almost nine decades of existence,
they have been likened to mosquitoes (annoying and impossible to swat away),
the emperor's new clothes (fraught with obvious problems that everyone
ignores) and the tool of a “sterile intellectual rake” who ravishes science
but leaves it with no progeny. One researcher suggested rechristening the
methodology “statistical hypothesis inference testing”, presumably for the
acronym it would yield.

~~~
pcrh
As is:

"One researcher suggested rechristening the methodology “statistical
hypothesis inference testing”, presumably for the acronym it would yield."

~~~
Nicholas_C
I'm guessing you didn't read the last sentence of the paragraph in the post
you were replying to.

~~~
pcrh
Oops!

------
snowwrestler
Statistics are descriptive, not predictive--period.

I'm continually surprised at how many people either don't know, or don't
internalize, that. Look at how often "risk factors"--which are a descriptive
concept--are converted to advice--which is predictive.

Doing so in the absence of a causal hypothesis is a basic violation of
"correlation does not equal causation."

If you want to construct a scientific theory you must be able to articulate
some predictive tests, and that means you must hypothesize a causal mechanism.

~~~
cschmidt
As in "I used to think correlation implied causation. Then I took a statistics
class. Now I don't..."

[http://xkcd.com/552/](http://xkcd.com/552/)

~~~
snowwrestler
The reason that comic is funny is that there _is_ a known mechanism that we
would expect to cause that effect: teaching.

~~~
dredmorbius
Which is why the alt text on that particular comic is so key.

Correlation _suggests that if you're looking for causation, it might be
somewhere over here_. It doesn't _insist_ that the two are the same, but if
you're looking for clues, it's a hell of a dowsing rod.

------
gwern
Further reading:
[http://lesswrong.com/lw/g13/against_nhst/](http://lesswrong.com/lw/g13/against_nhst/)

------
shas3
The root of this problem is that most data sets in psychology, anthropology,
and epidemiology are not as large in terms of sample size as what computer
scientists and electrical engineers encounter. p-values are a surrogate to
explicitly describing the data using probability distributions or as random
processes. In essence, you sacrifice granularity for simplicity. If you look
at the original works of Fisher, etc. and their widespread utility, a large
part of early statistics is intended for 'practical statisticians' who seldom
encounter data-sets that are 'large' in terms of sample size. As someone who
works in electrical engineering/computer science, I've never used the p-value
because:

1. The field, in general, demands far more mathematical rigor when dealing
with statistics.

2. The demand for mathematical rigor is justified because most data sets we
deal with are many orders of magnitude larger than what psychologists and
others encounter. So predictions based on limit theorems, etc. are often
testable.

~~~
Fomite
As an Epidemiologist, some of the most appalling uses of statistics I've ever
seen have been in electrical engineering and CS.

~~~
shas3
Can you elaborate?

~~~
Fomite
A little late in coming (blame the Southeast U.S.'s snowstorm), but I work in
a very stats heavy field and interact with a number of CS types because I work
on computational models.

I've gotten a fair amount of "I just need a p-value" requests, and some
assumptions that everything can be hit with a t-test or an ANOVA and it'll all
work out fine.

------
JoshTriplett
I'd love to see a comprehensive article that shows what a research paper's
analysis would look like using Bayesian methods. I've seen plenty of general
hints about Bayesian methods, discussion of priors, and similar, but I haven't
found any specific guide on how to apply those methods to the types of
research papers that would traditionally use a null hypothesis significance
test with a p value.

~~~
Homunculiheaded
Not articles but there are two very excellent books on the subject that I
can't recommend enough:

If you read calculus with about the same fluency as you read comic books, then
"Data Analysis: A Bayesian Tutorial" is awesome:
[http://www.amazon.com/Data-Analysis-A-Bayesian-Tutorial/dp/0198568320](http://www.amazon.com/Data-Analysis-A-Bayesian-Tutorial/dp/0198568320)

And if you would like a little more exposition (but still a mathematically
sophisticated treatment), "Doing Bayesian Data Analysis: A Tutorial with R and
BUGS" is fantastic:
[http://www.amazon.com/Doing-Bayesian-Data-Analysis-Tutorial/dp/0123814855/](http://www.amazon.com/Doing-Bayesian-Data-Analysis-Tutorial/dp/0123814855/)

The latter will also give you more details of how to approach classical,
frequentist tests and summary statistics with their Bayesian equivalents.

Honestly I would say get both books as they're cheap and provide different
insights. You only need to read a few chapters of each to see how you approach
basic experiments from a Bayesian perspective.

------
tokenadult
As the article reports, "Perhaps the worst fallacy is the kind of self-
deception for which psychologist Uri Simonsohn of the University of
Pennsylvania and his colleagues have popularized the term P-hacking; it is
also known as data-dredging, snooping, fishing, significance-chasing and
double-dipping. 'P-hacking,' says Simonsohn, 'is trying multiple things until
you get the desired result' — even unconsciously."

Simonsohn has a whole website about "p-hacking" and how to detect it.

[http://www.p-curve.com/](http://www.p-curve.com/)

He and his colleagues are concerned about making scientific papers more
reliable. You can use the p-curve software on that site for your own
investigations into p values found in published research.

Many of the interesting issues brought up by the comments on the article
kindly submitted here become much more clear after reading Simonsohn's various
articles

[http://opim.wharton.upenn.edu/~uws/](http://opim.wharton.upenn.edu/~uws/)

about p values and what they mean, and other aspects of interpreting published
scientific research. He also has a paper

[http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2259879](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2259879)

on evaluating replication results with more specific tips on that issue.

"Abstract: "When does a replication attempt fail? The most common standard is:
when it obtains p>.05. I begin here by evaluating this standard in the context
of three published replication attempts, involving investigations of the
embodiment of morality, the endowment effect, and weather effects on life
satisfaction, concluding the standard has unacceptable problems. I then
describe similarly unacceptable problems associated with standards that rely
on effect-size comparisons between original and replication results. Finally,
I propose a new standard: Replication attempts fail when their results
indicate that the effect, if it exists at all, is too small to have been
detected by the original study. This new standard (1) circumvents the problems
associated with existing standards, (2) arrives at intuitively compelling
interpretations of existing replication results, and (3) suggests a simple
sample size requirement for replication attempts: 2.5 times the original
sample."

------
analog31
"If your experiment needs statistics, you ought to have done a better
experiment." \-- Ernest Rutherford

------
yread
I don't understand how they got to the number 71% for a 0.05 p-value.

A 0.05 p-value means that there is a 5% probability that (for a t-test, as an
example) a difference in averages of two sequences (the statistic) arises by
chance and not because of a difference in the means of their underlying normal
distributions.

I assume that the "toss-up" means that there is no difference in the means in
reality (so the null hypothesis is true). Am I understanding it correctly?
Shouldn't, in this case, the probability of getting a p-value < 0.05 in fact be
less than 5% and not 29%?

How did they get the 29%?

~~~
runarberg
No, you've got it wrong. A _p_ -value of .05 means that given a _true_ null
hypothesis, you have a 5% chance of getting this result anyway. That is, if
there is _no difference_ in the population mean, then you still have a 5%
chance of finding a sample that contains this difference anyway.

Do you see the difference? The _p_ -value doesn't tell you anything about the
population, it only gives you information about your sample -- a confusion
that this article was pinpointing.

So what the 29% is telling you is that the chances of _finding_ a
statistically significant result ( _p = 0.05_ ) given a population factor of
_µ = 0_ are up to 29%, whereas given a population factor of _µ = 0_ the
chances of _getting_ a significant result ( _p = 0.05_ ) are 5%.

EDIT: erased a wrong and confusing claim.
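
For the 29% itself, here is a sketch of one way to reproduce the article's
numbers. I believe (but treat this as my reconstruction) the article uses
Goodman's minimum Bayes factor bound, B >= -e·p·ln(p), with "toss-up" meaning
1:1 prior odds that the effect is real:

```python
import math

def min_false_alarm_prob(p, prior_odds_null=1.0):
    """Lower bound on P(null is true | observed p-value), using the
    minimum Bayes factor bound B >= -e * p * ln(p)."""
    bf_null = -math.e * p * math.log(p)  # minimum Bayes factor in favour of the null
    posterior_odds_null = prior_odds_null * bf_null
    return posterior_odds_null / (1 + posterior_odds_null)

for p in (0.05, 0.01):
    print(f"p = {p}: false-alarm probability is at least "
          f"{min_false_alarm_prob(p):.0%}")
# p = 0.05 -> ~29%, p = 0.01 -> ~11%, matching the article's figures
```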

~~~
yread
Thanks, can you also explain why 29%?

EDIT: also, what is population factor µ? The difference in means?

------
cwyers
If you're interested in the subject, some additional reading (warning, PDF):

[http://www.deirdremccloskey.com/docs/jsm.pdf](http://www.deirdremccloskey.com/docs/jsm.pdf)

------
yetanotherphd
Most of the objections here and in the article are not inherent problems with
frequentist p-values.

First, the reported p-value might be wrong. E.g. basing it on assumptions of
normality when the data is non-normal. However modern non-parametric
approaches like the bootstrap can avoid this issue.
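
For example, a minimal bootstrap sketch (numpy assumed; the skewed sample here
is made up purely to illustrate a non-normal case):

```python
import numpy as np

rng = np.random.default_rng(0)
# Skewed, clearly non-normal sample, e.g. reaction times or incomes.
sample = rng.lognormal(mean=0.0, sigma=1.0, size=40)

# Bootstrap the sampling distribution of the mean: resample with
# replacement many times instead of assuming normality.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# Percentile 95% confidence interval for the mean.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {sample.mean():.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")
```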

Second, testing multiple hypotheses. If you test 10 hypotheses then you cannot
reject the null (that all 10 null hypotheses hold) simply because one single
hypothesis is rejected in isolation. But this is well known, and failing to
account for it is an issue with the researcher, not with frequentist
statistics. I actually think that the main practical difference between
Bayesian and Frequentist statistics is whether accounting for the issue of
multiple hypotheses is done formally or informally.

~~~
hootener
The article doesn't bash the p-value as a statistical test specifically, so
much as its use and interpretation by scientists over the years.

You're absolutely correct about using non-parametric tests, and more
scientists should be using them. The normality assumption is flat out
laughable when using real-world data most of the time.

You're also correct about multiple hypothesis testing. Accounting for
familywise error (e.g., Holm adjustments) can help to keep your p-value
reporting honest.
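
A small sketch of the Holm step-down adjustment, in plain Python with invented
p-values, just to show the mechanics:

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values (controls familywise error rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        adj = min((m - rank) * p_values[idx], 1.0)
        running_max = max(running_max, adj)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted

raw = [0.003, 0.02, 0.04, 0.30]  # hypothetical raw p-values
print(holm_adjust(raw))          # roughly [0.012, 0.06, 0.08, 0.30]
```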

That doesn't negate the underlying problem, though. A p-value is simply an
indication, nothing more. The p-value never promised to be more than that. The
issue isn't in the p-value's construction, the issue lies in its misuse and
how easily it can be abused in statistical reporting (see: p-hacking).

The p-value as a test statistic is perfectly honest in my opinion. But like
many other statistical methods, it comes with its own set of baggage that I
feel gets conveniently glossed over more often than it should.

------
mandor
I fully agree with the criticisms of p-values, but what are the best
alternatives for analyzing and comparing data? Most of the time, scientists
have to compare the outcome of treatment 1 versus treatment 2; how should they
do it "properly"?

What is the HN recommendation?

~~~
Fomite
Effect measures. Don't just report your p-value, report the actual effect
measure, and a measure of uncertainty around it, be it a frequentist
confidence interval, Bayesian posterior distribution, etc.

More information is better.
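
For mandor's treatment 1 vs treatment 2 case, a rough sketch of what that
report might look like (scipy assumed, made-up data; a real analysis would use
Welch's degrees of freedom rather than the pooled value used here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treatment_1 = rng.normal(loc=10.0, scale=2.0, size=30)  # made-up outcomes
treatment_2 = rng.normal(loc=11.0, scale=2.0, size=30)

diff = treatment_2.mean() - treatment_1.mean()
se = np.sqrt(treatment_1.var(ddof=1) / 30 + treatment_2.var(ddof=1) / 30)
ci_low, ci_high = stats.t.interval(0.95, 58, loc=diff, scale=se)

# Report the effect and its uncertainty, not just whether p < 0.05.
print(f"estimated effect: {diff:.2f} units, 95% CI ({ci_low:.2f}, {ci_high:.2f})")
```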

~~~
capnrefsmmat
Agreed. A good reference is Geoff Cumming's "Understanding The New
Statistics," although its focus on analysis in Excel may put off the typical
HN audience.

Effect sizes and confidence intervals are much more useful than a p value or
two.

~~~
dewarrn1
Cumming had a short treatment of the subject published in Psych Science
recently:
[http://pss.sagepub.com/content/25/1/7.short](http://pss.sagepub.com/content/25/1/7.short)
; doi: 10.1177/0956797613504966.

------
socrates1998
I have always thought people rely on the normal distribution too much.

Does it work? Sometimes.

The problem is that people tend to believe something they use a lot.

Even the 0.05 threshold is sort of made up.

Correlation does not mean causation.

~~~
jheriko
p-values need not arise from the normal distribution - they can arise from any
distribution - and selecting the right one is another source of error when
producing them.

the normal distribution is also quite well justified by the central limit
theorem.
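
A quick simulation sketch of that justification (numpy assumed): means of
samples from a decidedly non-normal distribution become increasingly
normal-looking as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample means from an exponential distribution (skewness 2, very non-normal).
# By the central limit theorem their distribution approaches a normal as n
# grows, so the skewness of the means should shrink roughly like 2 / sqrt(n).
for n in (2, 10, 100):
    means = rng.exponential(scale=1.0, size=(100_000, n)).mean(axis=1)
    skew = np.mean((means - means.mean()) ** 3) / means.std() ** 3
    print(f"n = {n:3d}: skewness of the sample means = {skew:.2f}")
```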

i do however agree that a p-value of 0.05 is not worth very much.

------
Finster
> In 2005, epidemiologist John Ioannidis of Stanford University in California
> suggested that most published findings are false; since then, a string of
> high-profile replication problems has forced scientists to rethink how they
> evaluate results.

That's what is supposed to happen, though, right? You publish your findings.
Others try to reproduce. They publish THEIR findings, etc. etc. If most
published findings are false, it sounds like the process is working as
designed.

~~~
pessimizer
Bad papers can be generated, published, and cited a lot faster than failures
to replicate.

~~~
Finster
So, doesn't that mean we should be blaming idiotic news media outlets that
tend to generalize and publish scoops on scientific papers too quickly?

~~~
capnrefsmmat
Yes and no. Yes, the media sucks, but no, most scientific papers are not
replicated -- or if they are, it's by a hapless grad student using the results
in their research, only to find they don't hold up. The hapless grad student
usually doesn't get to publish this, because negative results are boring and
not usually published in prestigious journals.

Most scientists have better things to do than replicate previous findings,
unless that previous finding directly bears on their own work.

------
loderunner
Another reason to be skeptical of the statistics thrown around in popular
news.

~~~
pacaro
Unfortunately it's not just the nightly news one needs to be skeptical of,
it's also (amongst many others) your doctor—who probably isn't particularly
well educated w.r.t. statistics, but who will nonetheless give advice and
prescribe treatments based on them.

~~~
mattfenwick
This is a great -- and scary -- point. These statistical fallacies and
misunderstandings are so deeply ingrained into our scientific and medical
systems that it's hard to see how and when they will be removed. I can attest
from personal experience that many scientists 1) don't understand these
statistical tests, 2) don't care to find out, and 3) don't think there's a
problem.

------
milliams
This is why, to be on the safe side, in particle physics we have a requirement
of a p-value of 0.0000003 for a discovery.
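
That number is just the one-sided tail probability of a 5-sigma normal
fluctuation, e.g. (scipy assumed):

```python
from scipy import stats

# One-sided tail area beyond 5 standard deviations of a standard normal:
# the particle-physics "5 sigma" discovery threshold.
print(stats.norm.sf(5.0))  # ~2.9e-07, i.e. about 0.0000003
```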

~~~
stinos
But isn't the whole point that no matter how low the P-value is, it is not a
reliable measure?

------
mrcactu5
i routinely get jobs from doctors at prestigious universities who say, "here's
a study with 3 samples, see if we can get p < 0.05"

------
jheriko
a p-value of 0.05 or even 0.01 is stupidly high. it only takes a little
thought experiment about what that means in reality to realise how permissive
it is and you can find demonstrations of this without going particularly far,
looking very hard or being especially well educated...

consider the wikipedia example with heads vs. tails.

[http://en.wikipedia.org/wiki/P-value#Examples](http://en.wikipedia.org/wiki/P-value#Examples)

the idea that 5 coin tosses can produce a p-value < 0.05 that 'demonstrates'
that the coin is biased towards heads is intuitively 'obviously wrong'. even
if we take it to 10 coin tosses (the p-value you get is 0.001 - which looks
really strong if we accept that 0.01 is acceptable) it clashes with my own
ideals for what statistical significance should mean. this is in a loose way a
proof by contradiction that p-values of 0.05 or 0.01 do not have utility (at
least for these kinds of small n).
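
(those numbers check out: for the one-sided "biased towards heads" test, the
p-value for all heads out of n fair tosses is just 0.5**n)

```python
# One-sided p-value for "n heads out of n tosses" under a fair coin.
for n in (5, 10):
    print(f"{n} heads in a row: p = {0.5 ** n:.4f}")
# 5 heads: p = 0.0312 (< 0.05); 10 heads: p = 0.0010
```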

aside from that consider running the experiment 5 times or 20 times. how many
false positives do you expect? what is the expected number of false positives?
is that significant?

it also bothers me how connected the value itself is to the problem
formulation. if we analyse the same situation with an identical test but a
different formulation of the problem, why should the values differ?

why is five heads in a row less significant as a result when the test is
whether a coin is biased at all rather than a test that it is biased towards
heads only? sure i understand the probability involved there that we have all
these potential coins biased towards tails that mean nothing in the first case
- but there is something very deeply wrong with that.

shouldn't this be the other way around? if 5 consecutive heads is good
evidence that a coin is biased towards heads, isn't it equally good evidence
that it is biased at all? classical logic says that it is because being biased
towards heads is a subset of being biased in either direction. the truth is
that it really is equally good evidence - i challenge someone to explain why
it is not! ( actually i kinda want to be wrong about that because i might
learn something new then :) )

probability is counter-intuitive and useless for the kinds of small n usually
used in experiments - the intuition about it recovers when we deal with
sensible n - numbers like 1000 or 10000 - but these are still small n really
if you need to scale up, or be confident that your result is correct. even at
100 samples it's obvious that our idealisation of percentage and what happens
in reality do not marry up neatly...

to make a very crude software analogy what about those 1 in 10,000 bugs? they
are still a very real problem if you have millions of customers...

or - IMO even 10,000 is a very exceedingly small n to try and draw robust
conclusions from.

~~~
sp332
0.05 == 5% == 1/20. If you flip a coin 5 times and get heads every time, do
you intuitively feel that there is more than 1-in-20 odds that the coin is
fair?

You should really get used to the idea that stating a different problem will
give you a different answer. You need to be very careful when asking a
question, or your answer might not mean what you think it means.

~~~
jheriko
maybe my presentation was unclear. i seem to have given the impression that i
am really quite dim...

> You should really get used to the idea that stating a different problem will
> give you a different answer

this is entirely normal and expected, for as long as i can remember... what i'm
saying is that you can analyse the same data in two different ways and reach
differing conclusions because of the nature of the p-value (vs. the nature of,
well... nature)

what i was mainly trying to get across is that a coin being biased towards
heads /logically implies that/ it is biased. so the idea that 5 heads in a row
is less evidence of a coin being biased than it is that it is biased towards
heads is not only counter-intuitive but in disagreement with a much stronger
and more intuitive form of reasoning.

the fact that the p-values are different in these cases leads me to expect
that p-values on their own are not a good indicator of strength of evidence
without a lot more context - and /really/ understanding what that context is
and means - in which case why use the value at all? nobody else is likely to
interpret it correctly unless you lay it out that way which then negates the
supposed utility of the p-value...

and yes, i intuitively consider 5 heads in a row to be unspectacular for a
fair coin, certainly not a 19/20 chance that it is biased (maybe i am very,
very wrong though).

~~~
sp332
Well, try it a few times and see if you can convince yourself :)

The evidence is different because there are different outcomes. For the two-
tailed test the possible outcomes are: biased toward heads or tails, or not
biased at all. For the one-tailed experiment, the outcomes are: biased toward
heads, or not. Getting 5 tails in a row would be evidence in favor of the coin
being biased, but not being biased toward heads.

Think of it this way: the two-tailed test is running 2 experiments at the same
time (one for heads and one for tails) with the option of picking the one that
gives you better results. So obviously the standard for significance has to be
higher, because you're cherry-picking results.
[https://xkcd.com/882/](https://xkcd.com/882/)
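
Concretely, a sketch of the two p-values for 5 heads out of 5 (assumes scipy's
binomtest, available in recent scipy versions):

```python
from scipy import stats

# 5 heads out of 5 tosses of a supposedly fair coin.
one_tailed = stats.binomtest(5, n=5, p=0.5, alternative="greater").pvalue
two_tailed = stats.binomtest(5, n=5, p=0.5, alternative="two-sided").pvalue

print(f"one-tailed (biased towards heads): p = {one_tailed:.4f}")  # 0.0312
print(f"two-tailed (biased either way):    p = {two_tailed:.4f}")  # 0.0625
```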

------
michaelochurch
Many frequentists tend to Bayeslessly trust them.

