
Redefine statistical significance - arstin
https://www.nature.com/articles/s41562-017-0189-z.epdf
======
imh
>For a wide range of common statistical tests, transitioning from a P value
threshold of α = 0.05 to α = 0.005 while maintaining 80% power would require
an increase in sample sizes of about 70%.

This proposal is a great pragmatic step forward. Like they say in the paper,
it doesn't solve all problems, but it would be an improvement with reasonable
cost and tremendous benefits.

>Such an increase means that fewer studies can be conducted using current
experimental designs and budgets. But Fig. 2 shows the benefit: false positive
rates would typically fall by factors greater than two. Hence, considerable
resources would be saved by not performing future studies based on false
premises.
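
For intuition, here's a rough back-of-the-envelope check of the 70% figure (my
own sketch, not the paper's calculation), using a two-sample t-test at a medium
effect size:

    from statsmodels.stats.power import TTestIndPower

    # Sample size per group for 80% power at a medium effect (Cohen's d = 0.5),
    # under the old and the proposed two-sided thresholds.
    analysis = TTestIndPower()
    n_old = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
    n_new = analysis.solve_power(effect_size=0.5, alpha=0.005, power=0.8)
    print(f"{n_old:.0f} -> {n_new:.0f} per group ({n_new / n_old - 1:.0%} more)")
    # prints roughly "64 -> 108 per group (69% more)"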

~~~
stdbrouw
In some fields, like psychology, power is more likely to be 10% or 20% than 80%
for the majority of studies, and in fact p-hacking and low standards of
evidence would be far less harmful if power were higher, because low power
leads to inflated effect size estimates. Additionally, power calculations are
always just a guess and easy to fudge, so it's pretty much a given that current
statistical power would _not_ be maintained under more stringent critical
values. See [http://andrewgelman.com/2014/11/17/power-06-looks-like-get-used/](http://andrewgelman.com/2014/11/17/power-06-looks-like-get-used/)
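
To see the inflation concretely, here's a quick simulation (my own sketch, not
from the linked post): with low power, the estimates that happen to clear
p < 0.05 are necessarily exaggerated, because only overestimates reach
significance.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_effect, sigma, n = 0.2, 1.0, 20       # small true effect, small sample
    se = sigma / np.sqrt(n)
    estimates = rng.normal(true_effect, se, size=100_000)  # simulated study means
    significant = np.abs(estimates / se) > stats.norm.ppf(0.975)

    print(f"power: {significant.mean():.0%}")  # ~15%
    print(f"mean estimate among significant results: "
          f"{estimates[significant].mean():.2f}")  # ~0.53, well over double the truth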

So this proposal is really the opposite of pragmatic. Pragmatic would be
requiring effect size estimates and confidence intervals in all published
papers. It is surprising how many papers talk about highly significant effects
without ever discussing how large the estimated effect is thought to be, which
gives authors a lot of leeway to exaggerate the importance of their findings.

~~~
maxerickson
So ultimately the issue is that push-button statistics don't work?

~~~
xaa
Ultimately the issue is that we somehow got fixated on the p-value, which
(roughly and imprecisely speaking) quantifies the probability that there is
any effect, even a small one, rather than using effect size estimates which
estimate both the _magnitude_ of the observed effect _and_ the uncertainty in
that estimate.

Using p-values as our primary metric means an overemphasis on finding small
effects (which are usually not clinically relevant anyway) and unduly little
focus on things with big effects.
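
To make that concrete, here's a toy sketch (mine, with made-up numbers): two
"highly significant" results whose p-values hide completely different
magnitudes - only the estimates and intervals tell them apart.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    samples = {
        "tiny effect, n = 10000": rng.normal(0.05, 1.0, size=10_000),
        "big effect,  n = 12":    rng.normal(1.00, 1.0, size=12),
    }
    for name, x in samples.items():
        t, p = stats.ttest_1samp(x, 0.0)   # both p-values come out "significant"
        half = stats.t.ppf(0.975, len(x) - 1) * stats.sem(x)
        print(f"{name}: p = {p:.1g}, estimate = {x.mean():.2f} +/- {half:.2f}")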

If an effect is real, but very small, that too may well cause replicability
problems because it suggests the effect may not be very robust to small
changes in experimental conditions, whereas a big effect would be more likely
to be robust.

If you think about the really important scientific findings -- the ones that
made a big impact and are indisputably true -- statistics usually aren't
necessary to prove them, because the effect size is so large it is simply
obvious. I'm not against using statistics anyway, of course, but the point is
that we _should_ be looking mainly for effects with big effect sizes if we are
after important findings, IMO. It is a major added bonus that big effect sizes
are also the most likely to be replicable.

------
SubiculumCode
This is fine, but without other simultaneous changes it will do harm to young
scientists. We need to give credit for publishing null results, or to stop
judging scientists by publication count. The proposal would lead to larger,
better-powered studies (good), but such studies tend to acquire multiple
measures that can be inappropriately data-mined, and they channel large grants
to established investigators while leaving fewer grants for new investigators.

~~~
jerrytsai
Definitely. The main problem is that in the current system no one is rewarded
for doing good science; people are rewarded for showing something interesting,
bolstered by a declaration of (statistical) significance. The incentives are
not aligned with societal objectives.

Good science requires a tension between hypothesis generation and skepticism.
Perhaps if we rewarded the _debunking_ of findings as much as we do the
discovery of findings, things would change.

~~~
adrianratnapala
Why doesn't this happen already?

The funding bodies and the like, who want "quantitative" measures of research,
look at publications. Wouldn't we expect debunking papers to be published, if
they are debunking something interesting?

~~~
adrianN
Once you have debunked something interesting, I suppose you don't have trouble
getting it published. But you have a hard time writing a grant application
that reads something like "I want to replicate study X; no new results are
expected."

~~~
adrianratnapala
I think this hits the nail on the head.

That said, it also seems like low-hanging fruit. At least in some fields,
replication should be a lot cheaper than doing things in the first place,
because a lot of the cost is in pissing about trying to find something that
even seems to work.

For funding bodies, explicitly supporting replication studies, even at only
20% of what the original studies got, should be a winner at reasonably low
cost.

------
gattilorenz
It can hardly hurt, but it is still a stopgap measure. It won't solve
publication bias, and people will still change the hypothesis or the test
after the measurements are done.

I think the situation would improve with better teaching of philosophy of
science and statistics (this would educate better reviewers too).

~~~
epistasis
It can hurt, in that it can slow the spread of information. If a fixed budget
buys roughly 40% fewer experiments because each one needs about 70% more
samples to hit p=0.005 instead of p=0.05, then you explore in fewer directions.

This is a classic tradeoff between exploration and exploitation in active
learning.

If your view of the world is that there are only a very few hypotheses worth
exploring, and you have a good lay of the scientific land, then requiring a
higher bar of proof is probably good.

If it's a new field that's extremely complex, where very little is known of
the governing principles, then requiring very strong statistical evidence
could severely slow progress and waste lots of research dollars.

I completely agree that rather than setting arbitrary barriers for
significance, it would seem much better to let people actually understand what
was found, at whatever significance it was. Even setting up the null model to
get a p-value requires tons of assumptions. The better test is
reproducibility and predictive models that can be validated or invalidated.
That's where the science is, and not in the p.

~~~
btilly
Yours is a theoretical concern.

The very practical concern is that entire areas of research have been based on
studies replicated and backed up entirely through p-hacking and the selective
publication of positive results. This is a proven issue _today_.
See
[https://en.wikipedia.org/wiki/Replication_crisis](https://en.wikipedia.org/wiki/Replication_crisis)
for more.

It may be that there is a pendulum that needs to swing a few times to get to a
good tradeoff. But it is clear, now, which direction it needs to swing.

~~~
epistasis
I disagree 100%, having read that Wikipedia page and its sources.

It's something that affects a few fields, not all science. And the problem has
been completely 100% overblown.

If the problem is that things aren't replicating, changing the p-value cutoff
for significance isn't going to fix everything. It can just as easily be a bad
null model that's the problem, in which case you can't trust _any_ p-value.
The MRI scan problem was closely related to that.

It's a field-specific and null-model-specific thing. Broadly changing the
p-value cutoff for everybody isn't going to fix this issue.

~~~
nonbel
>"And the problem has been completely 100% overblown"

Just the fact that so few replications have been published indicates huge
cultural problems. When I did biomed, in my tiny area of expertise there had
been 1-2 thousand papers published since the 1980s. Out of these maybe 2-3
were close to direct replications. None of those showed the results were
reproducible, but no one cared...

Usually there were "minor differences" in the methods so it resulted in stuff
like: "Protein P has effect E by acting through receptor R in cell line L from
animal A of sex S and age Y when in media M".

However if you changed L, A, S, Y, or M apparently totally different things
were going on (there were then supposedly dozens of receptors for each ligand,
each receptor having dozens of ligands in different circumstances, etc).

In the end I found that E was nearly perfectly correlated with the molecular
weight of P (using data from one of the most cited papers on the topic, in
which they specifically claimed there was no correlation with any physical
properties of the ligands).

So the effect has nothing to do with specific ligand-receptor interactions at
all, but no one cared. Situations like this (few published direct
replications, published replications that contradict each other, results that
are misinterpreted anyway, and everyone just continuing on their way when
problems are pointed out) are totally standard for biomed. The replication
aspect of the issue is really only the tip of the iceberg of problems.

------
jimmar
Not a huge fan of this idea. For example, people who analyze Twitter data can
get very small p-values because they analyze millions of tweets, even though
the effects they find are very small. See
[https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1336700](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1336700)
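
A rough illustration with made-up numbers (my own sketch, not from the linked
paper):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 10_000_000                        # "millions of tweets"
    x = rng.normal(size=n)
    y = 0.003 * x + rng.normal(size=n)    # true correlation ~0.003
    r, p = stats.pearsonr(x, y)
    print(f"r = {r:.4f}, p = {p:.1g}")    # p far below 0.005, yet r**2 explains
                                          # only ~0.001% of the variance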

~~~
moultano
I'd rather hear about small things that are true than large things that are
false.

~~~
cropsieboss
The thing is that these small effects might just be noise from some
confounding factors.

[http://jaoa.org/article.aspx?articleid=2517494](http://jaoa.org/article.aspx?articleid=2517494)

For example, here the sample size is huge: the USA population shows a
significantly increased risk, while the EU population does not. Mixing the two
together would result in a smaller, but still significant, increased risk.

Given the size, it's quite clear that the USA population has many other
confounding factors that cannot be eliminated by mathematics alone (there is
no control).

------
leemailll
Changing the p-value threshold from 0.05 to 0.005 won't stop p-hacking. And it
might also lead to more disgruntled graduate students, as they will have to
increase sample sizes to satisfy the new test, which inevitably stretches the
already painfully long time span for projects to get published.

~~~
taeric
To be fair, this is a pragmatic, not a technical, solution. Similarly, we
limit the speeds we allow in residential areas not because it prevents
wreckless driving, but because it decreases the actual risk of it.

In that analogy, the technical solution is driverless technology that removes
any risk of human error. The pragmatic solution is to just limit the
acceptable speeds.

~~~
nkrisc
A bit of humorous pedantry: We seek to prevent _reckless_ driving. _"
Wreckless"_ driving is what we're trying to promote.

~~~
taeric
Thanks for the pedantry. It is amusing to me, because the "w" looks way more
correct to me. Can't say why, though.

~~~
noxToken
Though reck is a word, it isn't commonly used, whereas wreck is a pretty
elementary word. Reck and wreck are homophones. That's probably your answer.

~~~
taeric
Right. But to me the word "wreckless" even looks like what I want in this
sentence. I think I take the meaning to be "as if wrecks weren't a thing."

As opposed to how the pedantry points out, that if "wreckless" were a word, it
would actually be more of the opposite of "reckless."

------
Fomite
Stop worshipping p-values set to an arbitrary threshold, whether it be 0.05 or
0.005, and start actually critically engaging with the statistics and results
themselves.

------
eelkefolmer
It's time to ditch significance levels altogether and use Bayesian inference
and analysis.

~~~
analog31
I'm concerned that prior-hacking will become the new p-hacking.

~~~
Houshalter
So you require people to report Bayes factors, not posteriors or priors.
Bayes factors are invariant to the prior.
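
A toy example of that invariance (my own sketch): the Bayes factor depends
only on the data and the two hypotheses being compared; readers then apply
their own prior odds afterwards.

    from scipy import stats

    x = 1.8  # a single observation
    # Likelihood of the data under two simple hypotheses (both sd = 1):
    bf = stats.norm.pdf(x, loc=2.0) / stats.norm.pdf(x, loc=0.0)  # ~5, prior-free

    for prior_odds in (1.0, 1 / 100):    # a believer and a skeptic
        print(f"prior odds {prior_odds}: posterior odds {prior_odds * bf:.2f}")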

------
iovrthoughtthis
When can we have scientific papers formatted for the web? Reading PDFs with
many tiny columns spread across each page puts me off reading so much.

~~~
folli
Most journals have an HTML and a PDF version (as does Nature):
[https://www.nature.com/articles/s41562-017-0189-z](https://www.nature.com/articles/s41562-017-0189-z)

I prefer the PDF version for printouts.

~~~
iovrthoughtthis
Thank you for this. I had no idea. Now I can actually read the paper!

------
logicallee
Can someone explain why this three-page article has 72 "authors"? That works
out to about as much writing per author as this comment.

~~~
arstin
Given the kind of paper this is, I assume the names should be understood as an
endorsement. Sorta like signatures on a petition.

------
aheilbut
No one in biology would be able to publish anything.

~~~
pfortuny
Imagine psychology... The end of a science.

~~~
Strilanc
Funny, I was thinking the opposite.

Imagine psychology... done _properly_. The beginning of a science.

(I realize that "beginning" is too harsh, but psychology does have very
serious problems with replicability. At the moment, it deserves its tarnished
reputation.)

------
Houshalter
The concept of statistical significance is nonsense. In Bayesian statistics
there is only evidence. A p value of 0.05 is roughly equivalent to a factor of
20 of evidence. That means you multiply the odds you assign to a hypothesis by
20 (or add 13 decibels). Similarly, a p value of 0.005 is roughly equivalent
to a factor of 200 (23 decibels).

But whether some amount of evidence is "significant" or not is entirely
dependent on your prior. If you believe something has about a 50:50 chance of
being true to start with, then a factor 20 of evidence is quite enough. Now
you believe it 20:1 likely to be true.

But for something like xkcd's "green jelly beans cause cancer", your prior
should be something like 1 to 100,000 or even smaller. After all, there are a
lot of possible foods and a lot of possible diseases. Unless you believe a
significant number of them are dangerous, your prior for any specific food
causing any specific disease must be pretty low. And then even a factor 200 of
evidence is nowhere near enough to convince me that green jelly beans cause
cancer.
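
Working through that arithmetic explicitly:

    # Prior odds that one specific food causes one specific disease, times
    # the ~200-fold evidence that p = 0.005 is (roughly) worth:
    prior_odds = 1 / 100_000
    bayes_factor = 200
    posterior_odds = prior_odds * bayes_factor
    print(f"posterior odds: 1 to {1 / posterior_odds:,.0f} against")  # 1 to 500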

~~~
maxerickson
If it is nonsense you shouldn't be able to translate it coherently into
various levels of evidence.

~~~
Houshalter
P values aren't nonsense, and they do correlate with Bayesian evidence. I
think that interpreting levels of evidence as "significant" or not is
nonsense.

------
s17n
It used to be possible to have a successful academic career without
publishing much - for example, one of my philosophy profs in college (at a top
10 school) had never published anything after his dissertation (he got his
PhD in the early 60s).

Of course, this system only worked because academia was a bastion of the male
WASP elites that didn't have much pretense of serving the broader public. But
at least you didn't have the torrent of mediocre papers that you see today.

~~~
adekok
> academia was a bastion of the male WASP elites that didn't have much
> pretense of serving the broader public

Have things really changed? I suspect there are fewer males, but any job that
demands 20 years of full-time concerted effort is likely to be dominated by
men. Similarly, the Western world is overwhelmingly Caucasian, so again... the
best predictor (now as then) is that white male professors will be
disproportionately represented.

> at least you didn't have the torrent of mediocre papers that you see today.

That certainly is true. The stats for the humanities and social sciences are
that 80% of papers have zero citations, i.e. they make _no_ contribution to
the greater body of human work.

In Physics (my background), most papers have 2-3 citations, and only a small
percentage have 1 or fewer.

I would say that if a discipline is dominated by uncited papers, then that
discipline is probably a waste of time. And the professors who work in it are
a net drain on society.

~~~
tnecniv
As a note, WASP refers to old money families with ties going back to the
colonial era, not just middle-class/wealthy white dudes in America. Also, at
least in STEM departments, you will see plenty of non-white names.

> In Physics (my background), most papers have 2-3 citations, and only a small
> percentage have 1 or fewer

Does that account for self-citations?

~~~
dragonwriter
> WASP refers to old money families with ties going back to the colonial era

No, it refers to White Anglo-Saxon Protestants, an ethno-religious group that
cuts across socio-economic class divides and includes plenty of people that
are neither old money nor descended from families that have been in the US
since the founding (and excludes some old-money, from-the-founding families.)

~~~
PeterisP
It does exclude many _large_ subpopulations from the immigration waves that
came after the colonial era. All the Irish, Italian, Polish, and Jewish
people; the vast majority of the (very large!) 19th- and early-20th-century
immigrants and their descendants aren't WASPs.

------
wavegeek
> For a wide range of common statistical tests, transitioning from a P value
> threshold of α = 0.05 to α = 0.005 while maintaining 80% power would require
> an increase in sample sizes of about 70%.

This seems unintuitive and the claim is unreferenced. Can anyone explain why
this is the case (if true)?
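
Edit: sketching an answer to my own question with a normal approximation (so a
rough check, not the paper's derivation) - for a two-sided z-test the required
n scales with (z_crit + z_power)^2, and the ratio between the two thresholds
comes out at about 1.7:

    from scipy.stats import norm

    z_power = norm.ppf(0.80)           # 0.84
    z_old = norm.ppf(1 - 0.05 / 2)     # 1.96
    z_new = norm.ppf(1 - 0.005 / 2)    # 2.81
    print(f"{((z_new + z_power) / (z_old + z_power)) ** 2:.2f}")  # ~1.70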

------
cameronraysmith
I've had some luck showing people John Kruschke's "Bayesian estimation
supersedes the t test" (BEST) approach, along with this simple demonstration:
[http://www.sumsar.net/best_online/](http://www.sumsar.net/best_online/)
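
For anyone curious what BEST actually fits, here is a minimal sketch in PyMC
(my own paraphrase of Kruschke's setup, with illustrative priors and stand-in
data):

    import numpy as np
    import pymc as pm

    rng = np.random.default_rng(0)
    group1 = rng.normal(1.0, 1.0, size=30)   # stand-in data
    group2 = rng.normal(0.3, 1.5, size=30)
    pooled = np.concatenate([group1, group2])

    with pm.Model():
        # Broad priors on each group's mean and spread
        mu1 = pm.Normal("mu1", mu=pooled.mean(), sigma=10 * pooled.std())
        mu2 = pm.Normal("mu2", mu=pooled.mean(), sigma=10 * pooled.std())
        sd1 = pm.Uniform("sd1", pooled.std() / 100, pooled.std() * 100)
        sd2 = pm.Uniform("sd2", pooled.std() / 100, pooled.std() * 100)
        nu = pm.Exponential("nu_minus_1", 1 / 29) + 1   # heavy-tailed likelihood
        pm.StudentT("obs1", nu=nu, mu=mu1, sigma=sd1, observed=group1)
        pm.StudentT("obs2", nu=nu, mu=mu2, sigma=sd2, observed=group2)
        pm.Deterministic("diff", mu1 - mu2)             # the quantity of interest
        trace = pm.sample()

    # The posterior on "diff" replaces the t-test's binary verdict with a
    # full distribution over the difference in means.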

------
JepZ
> The choice of any particular threshold is arbitrary [...]

Sounds scientific, doesn't it?

> [...] we judge to be reasonable.

And tomorrow someone else judges it differently?

Maybe they should not try to redefine significance, but simply introduce
something called 'well-reproducible' or the like.

------
mnarayan01
Curious if anyone's done any work to determine whether changing the p-value
threshold for e.g. psychology studies (as they call out psychology in
particular) would measurably affect replicability - that is, whether results
with p < 0.005 actually replicate better than those with p > 0.005?

~~~
zeckalpha
There's overlap in authorship with this paper:
[http://www.sciencemag.org/content/349/6251/aac4716](http://www.sciencemag.org/content/349/6251/aac4716)

------
md224
Just curious: would this have an effect on testing the efficacy of new drugs?
I'd hate to see a false negative result for a drug that could actually help
people...

------
rgejman
Animal experiments will get A LOT more expensive. Will there be a concomitant
increase in agency funding to offset the increased costs?

~~~
siginfo
They do briefly mention "the relative cost of type I versus type II errors".
Both error types (Type I - false positive; Type II - false negative) have an
associated cost.

Money saved by using a small sample size is wasted trying to replicate a false
positive result and by groups around the world that rely on that false result.

Requiring larger sample sizes would mean fewer experiments are carried out but
we will have more confidence in the positive results produced. The outcome is
fewer experiments wasted on following up on false positives. None of this
requires a change in funding.

~~~
rgejman
I really don't think the proposal to do "fewer, but better" experiments works
for animal studies. They are so expensive, so complicated, and so much work,
and they answer such singular, small questions, that you almost always need a
ton of further follow-up work.

For instance, in the field I work in you have to spend days to months waiting
for tumors to grow and then go and treat the animals every day for a couple of
weeks with an IV drug (weekends too!). That is _a lot_ of work, and at the end
it only tells you one piece of information about the drug: does it slow tumor
growth in this one experimental model? It may in fact do that -- and you may
get a really great p-value if you increase the number of mice -- but you still
need to study the drug's pharmacokinetics, tissue distribution, and in vivo
mechanism of action (assuming you already know the in vitro mechanism of
action). These are not just optional experiments that we happen to require
today to publish: this kind of work is essential to presenting a story about a
new drug. It's not just about what the drug does, but how it works and how
universalizable it is.

------
tw1010
This still doesn't feel satisfying. Part of me is still not really happy with
the philosophical foundations of statistics. Does anyone know of any
legitimately competing theory to statistics? Maybe something that doesn't rest
on the same types of mathematics that Fisher and crew relied on when all this
started? Pure mathematics has come a long way in the last fifty years but
little has seeped into the applied world.

~~~
scryder
Statistics arises from a set of axioms, assumed truths, which can be used to
prove all other things in the field.

You can take a look at the three axioms people use to justify statistics. If
you are willing to accept them, all else that relies on them (without using
new axioms) must be true:

[https://en.wikipedia.org/wiki/Probability_axioms](https://en.wikipedia.org/wiki/Probability_axioms)

This same logic is used to justify development in pure mathematics: choose a
set of axioms which you accept as ground truths, and prove things using them.
As long as you are unable to prove your axioms are contradictory, and the
axiom choice seems acceptable, then the work that you've done (with respect to
them) is philosophically justified.

~~~
tw1010
Statistics and probability are different things. I'm fine with the foundations
of probability.

~~~
mturmon
Just for reference, not everyone is OK with the foundations of probability -
what you might call "conventional mathematical probability" as axiomatized by
Kolmogorov. See
[http://www2.idsia.ch/cms/isipta-ecsqaru/](http://www2.idsia.ch/cms/isipta-ecsqaru/)
for the most recent in a series of workshops.

One entry into this set of ideas is what Peter Walley has called the "Bayesian
dogma of precision" - that every event has a precise probability, and that
every outcome has a known cost. There are real-world situations where these
probabilities cannot be assessed, or may not even exist; the same goes for
utilities.

Some examples are in betting and markets (asymmetric information, bounded
rationality), and in complex simulation environments having so many parameters
and encoded physics that the interpretation of their probabilistic predictions
is unclear.

