
It will be much harder to call findings ‘significant’ if a team gets its way - nonbel
http://www.sciencemag.org/news/2017/07/it-will-be-much-harder-call-new-findings-significant-if-team-gets-its-way
======
elsherbini
There are two legitimate ways to get lower p-values: You can have a larger
effect size, or you can have a larger number of samples[0]. Of course, you
can't change the effect size, so this would lead to larger necessary sample
sizes to study smaller effects.
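
To make the sample-size cost concrete, here is a minimal sketch (mine, not
from the article) using the textbook one-sided z-test approximation; the
effect size and power values are only illustrative:

    # Approximate n for a one-sided, one-sample z-test:
    #   n = ((z_{1-alpha} + z_{power}) / d)^2
    from scipy.stats import norm

    def required_n(effect_size, alpha, power=0.8):
        z_alpha = norm.ppf(1 - alpha)  # critical value at the threshold
        z_power = norm.ppf(power)      # quantile for the desired power
        return ((z_alpha + z_power) / effect_size) ** 2

    for alpha in (0.05, 0.005):
        print(alpha, round(required_n(effect_size=0.2, alpha=alpha)))
    # 0.05 -> ~155 samples; 0.005 -> ~292 samples, for the same small effect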

I think in general though, at least in biology, people are waking up to the
fact that p-values aren't magical, and that having a really small p-value
isn't a goal in and of itself. It is still necessary to do some statistics on
your data to get it published, but the p-value is more about checking a box
than serving as a tool for discovery.

Yesterday a cool dataset was released by Jeff Leek which has over 3.6 million
p-values from the scientific literature[1]. The distribution is fun to look
at by discipline[2].

[0] [http://rpsychologist.com/d3/NHST/](http://rpsychologist.com/d3/NHST/)

[1] [https://github.com/jtleek/tidypvals](https://github.com/jtleek/tidypvals)

[2]
[https://twitter.com/drob/status/890260541876338690](https://twitter.com/drob/status/890260541876338690)

~~~
BeetleB
>There are two legitimate ways to get lower p-values: You can have a larger
effect size, or you can have a larger number of samples[0]. Of course, you
can't change the effect size, so this would lead to larger necessary sample
sizes to study smaller effects.

Be careful with this. Larger sample sizes are more likely to give you a
significant result even when no meaningful effect exists.

Say my null hypothesis is that X=100. The alternative hypothesis is X>100.

What if in reality X is really 100.5? Depending on the problem domain, this
may well be the same as the null hypothesis. But a larger sample size is much
more likely to give a significant result.

There are ways to fix this, but one should be aware of it.
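
A quick simulation sketch of this point (mine, with made-up numbers): the
true mean is only trivially above the null value, yet a large enough sample
still yields a "significant" p-value.

    import numpy as np
    from scipy.stats import ttest_1samp

    rng = np.random.default_rng(0)
    null_value = 100.0
    true_mean = 100.5   # practically the same as the null
    sd = 10.0

    for n in (50, 50_000):
        sample = rng.normal(true_mean, sd, size=n)
        res = ttest_1samp(sample, popmean=null_value, alternative='greater')
        print(n, res.pvalue)
    # n=50 typically gives p well above 0.05; n=50,000 gives p far below
    # 0.005, even though a 0.5 shift on a scale with sd=10 may be meaningless.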

~~~
dragonwriter
> Say my null hypothesis is that X=100. The alternative hypothesis is X>100.

> What if in reality X is really 100.5?

Then the alternative hypothesis is true and the null is false. (Though
normally if the tested hypothesis was X > 100, the null would be X ≤ 100.)

> Depending on the problem domain, this may well be the same as the null
> hypothesis.

No, X = 100.5 is not the same as X = 100, irrespective of problem domain, so
long as the mathematical symbols have their usual definitions.

> But a larger sample size is much more likely to give a significant result.

Yes, if something is only _just barely_ true, it is more likely to take a
large sample size to distinguish it from the case where it is false. But since
in this case, the alternative hypothesis actually _is_ true, it's not a
problem that a bigger sample is more likely to reject the null hypothesis.

~~~
BeetleB
>Then the alternative hypothesis is true and the null is false. (Though
normally if the tested hypothesis was X > 100, the null would be X ≤ 100.)

Mathematically, yes. Practically, not always.

In many real-world settings, we're not 100% sure what the null hypothesis
should be. We may think the existing process suggests a temperature increase
of 1C, and so we set that as the null hypothesis.

But in that problem domain, we may have difficulty being more precise.
Suppose that 1C in the existing process came from the mean of a sample of
100, so the current belief is that the expected value of the increase is 1C.
Suppose in reality it is 1.02C. Now we propose a new process and claim it is
better than the existing one. This time we take a much larger sample and see
the number hovering at 1.02C - we can then claim the new process is better
than the existing process and have a p-value to demonstrate it.

The math will be correct, and the conclusion will be wrong.

~~~
benchaney
> Mathematically, yes. Practically, not always.

P values measure statistic significance, not practical significance. If you
are extrapolating practical significance from statistical significance, then
you have a problem that is unrelated to sample size.

~~~
BeetleB
>P values measure statistic significance, not practical significance. If you
are extrapolating practical significance from statistical significance, then
you have a problem that is unrelated to sample size

If you never extrapolate practical significance from statistical significance,
then you have a problem. p-values exist for the purpose of doing that. The
problem in research isn't the use of p-values per se, but the _misuse_ and
_misunderstanding_ of them.

An argument to do away with p-values demonstrates a basic misunderstanding of
them.

------
Tloewald
I think that a very stringent threshold for p values (it's going to be a _lot_
harder to get to p < 0.005 than p < 0.05 and it's going to encourage a lot
more gaming of the system) is not the solution. At best, it will lead to
smaller numbers of much bigger studies with more authors and fewer interesting
results.

I'd suggest that a better, broader, and more practical solution is to provide
a public clearing house for studies and associated data -- let's assume it's a
website.

a) When you start a study you must describe your study, methodology, and
hypothesis, and publish these things on the site. It should be possible to
find the study the same way you find any paper that results from it.

b) All data must be stored in the same place (but not made public) as it
becomes available, and this must be demonstrable. This data should be
available to anyone trying to replicate, review, etc.

c) Final data sets and analyses must also be available, ideally with the code
used to do the analyses.

All of this should be a prerequisite for review and publication, and it
should apply to anyone working in the field even if no papers result.

This solves a lot of problems: it makes it easier to conduct meta-analyses
when looking for effects that aren't significant in smaller studies, it
addresses the lack of availability of negative results, and it even helps
combat theft of methodology (e.g. where someone sees work in progress and is
able to publish using the methodology). It would also allow researchers to
better take advantage of existing, unpublished results (e.g. to hone their
study methods).

~~~
brownbat
If we were designing things from scratch, I'd even want to disaggregate two
career tracks for scientists:

The first, theorists, design interesting studies with open ended questions,
explain how the study would provide useful scientific results, or theorize
about the implications of existing published results.

The second, practitioners, are hired to actually run the studies according to
the specifications provided. The practitioners would publish null results of
course. Practitioners would also have to acknowledge how many of their results
are significant, so it'd be transparent if they were just confirming
everything.

A bit like barristers and solicitors in many common law countries, a
distinction that isn't strictly necessary for the system to work, but prevents
some conflicts of interest.

~~~
Tloewald
Yes, and indeed with this system a theorist can look for interesting data out
there and publish with attribution, while experimenters can look for
interesting failures and hone methodology.

------
shmageggy
Totally agree with Timothy Bates here:

> _[Bates] called the proposal “a risky distraction”... [It] wouldn’t address
> many other practices linked to irreproducible results: poor study design, a
> bias toward publishing positive results, and the practice of “p-value
> hacking”—fishing for significant-looking results from a huge number of
> hypotheses._

Since the reproducibility crisis started getting attention, I've noticed that
these problems are really ubiquitous. Tweaking study designs and modifying
hypotheses after collecting data are very commonplace and much more subtly
dangerous than a potential lack of statistical power. Gelman says "Valid
p-values cannot be drawn without knowing, not just what was done with the
existing data, but what the choices in data coding, exclusion, and analysis
would have been, had the data been different." [1]

This proposal also says nothing about effect sizes, which also should play a
role in how we interpret scientific results. We should probably draw different
conclusions given a tiny effect at p=0.05 versus a large, obvious effect at
p=0.05.

[1][http://www.stat.columbia.edu/~gelman/research/published/asa_...](http://www.stat.columbia.edu/~gelman/research/published/asa_pvalues.pdf)

~~~
petters
> We should probably draw different conclusions given a tiny effect at p=0.05
> versus a large, obvious effect at p=0.05

I'm not sure that "large, obvious effect" and "p=0.05" are compatible, though.

~~~
shmageggy
Sure they are, depending of course on your domain-dependent definition of
"large, obvious". You can have different effect sizes at any given p-value.

------
pcrh
The term "significant" is itself part of the problem. In statistics it is
often used to describe an arbitrary threshold where a difference between
datasets is assumed to be real. In common parlance this term implies
"important" or "meaningful", but the statistical tests are not designed to
draw such forward-looking conclusions.

It would be far easier, and more honest, to simply report the P-value obtained
by performing a statistical test without adding the term "significant".

~~~
jaggederest
I'd be interested in a compound metric composed of the likelihood ratio and
the p-value.

Essentially, the more likely your hypothesis, the higher your p-value can be,
and conversely unlikely hypotheses require stronger evidence.

I don't think this solves the problem but maybe it makes that assumption about
the hypothesis likelihood more specific.

------
SubiculumCode
Fishing for findings is a problem, but so are overly stringent multiple
comparison correction requirements. Those who preach Bonferroni should
consider a life-long proposal.

Let us propose that correction for multiple comparisons be a life-long
endeavor. At the start of a scientist's career as principal investigator,
p<.10 is fine. But by their career's 10th comparison, they have to meet a
Bonferroni-corrected p<.01, and at their career's 100th comparison, they have
to meet p<.001. When they are old and gray, or doing genomics, at their
1000th comparison they have to meet p<.0001.
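
For anyone who hasn't met Bonferroni: it just divides the family-wise alpha
by the number of comparisons, so the thresholds above amount to a lifetime
correction at alpha = 0.1. A toy sketch (mine, in the same joking spirit):

    def bonferroni_threshold(alpha, m):
        """Per-comparison threshold keeping the family-wise error rate at alpha."""
        return alpha / m

    for m in (1, 10, 100, 1000):
        print(m, bonferroni_threshold(0.1, m))
    # 1 -> 0.1, 10 -> 0.01, 100 -> 0.001, 1000 -> 0.0001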

~~~
carbocation
Genomics generally uses 5e-8 (p<0.00000005) as its significance threshold.
That aside, your point is clever. How to handle coauthors of different ages is
left as an exercise to the reader.

~~~
SubiculumCode
I figured the p-value requirement goes with the PI, the funded investigator.

------
nonbel
Imagine if p <= 1.0 were considered "significant" - wouldn't that cut down on
publication bias? Shouldn't every study worth funding be worth publishing?
They should all be considered significant results. So I think this proposal
is:

1) Not really addressing the issues behind the replication crisis

2) Doing the exact opposite of what should be done ( _raise_ the significance
level)

~~~
tpeo
What's the point of fitting arbitrary functions to noise? Because that's what
high p-values would lead to.

Undue emphasis on p-values is bad, but what you're suggesting is bonkers.

~~~
nonbel
Stop funding people to fit arbitrary functions to noise?

~~~
tpeo
Don't be disingenuous. That would require someone to actually read their
papers, which would already be a net loss and a massive waste of time for
everyone involved. Not to mention the amount of department politics which
could be involved in this issue.

Look, not every paper is worth publishing. There are many ways to write noise
that can pass for signal given the reader's level of understanding (e.g. crack
pottery and "fashionable nonsense"). The usage of p-values implements self-
filtering to some degree, and makes everyone better off by increasing the
likelihood that at least some papers aren't complete nonsense. And that's
good.

The only issue is that it is a much weaker constraint on publishing than was
previously thought.

~~~
JoshTriplett
> Look, not every paper is worth publishing.

Many more _results_ should be documented than currently are, though, including
negative results.

But I agree that changing p-value thresholds is unrelated to _that_.

~~~
tpeo
I agree, publication bias is a thing. There's no questioning that.

------
SurrealSoul
>You won't believe what this team did to the word significant!

>10 significant things that WILL shock you, #7 is this team

>THIS crazy trick that will make significant much harder [teams hate him!!]

>This finding prank gone significant [gone sexual!?]

Sorry, it was such a click-baity title for no reason.

------
BeetleB
This is just silly.

 _Any_ golden value for p is bad. If I go back to my statistics textbook (and
I assume all of them say something to this effect): the appropriate value of p
depends on the particular topic of study.

Not on the whole discipline.

And definitely not across all studies and across all disciplines.

You cannot pick a value of p and say "this is good". Pick it too low and
you'll likely not be able to reproduce many legitimate results.

------
currymj
If I were the King of All Science and were feeling despotic, I would do the
opposite, I think -- put a moratorium for a few years on any publication of a
p-value.

Scientists would still be free to make use of hypothesis testing to guide
their research. In publications, one could use Bayesian methods and report a
full posterior. One could also just give enough information about the null
hypothesis and whatever distributions are involved to allow the computation of
a p-value.
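
As a minimal sketch of what "report a full posterior" could look like in the
simplest case (mine, not the parent's; the counts and prior are hypothetical),
here is a Beta-Binomial model for a success rate:

    from scipy.stats import beta

    successes, failures = 14, 6    # hypothetical experiment outcome
    prior_a, prior_b = 1, 1        # flat Beta(1, 1) prior

    posterior = beta(prior_a + successes, prior_b + failures)

    print("posterior mean:", posterior.mean())
    print("95% credible interval:", posterior.interval(0.95))
    print("P(rate > 0.5 | data):", 1 - posterior.cdf(0.5))
    # The whole distribution is the reported result; readers can apply
    # whatever threshold (or none) makes sense for their problem.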

However, any attempt to explicitly publish the p-value itself would result in
creative and embarrassing punishments.

~~~
BeetleB
>In publications, one could use Bayesian methods and report a full posterior.

I fail to see how this is superior.

I'd never advocate banning a legitimate tool just because of abuse.

~~~
currymj
I mean, it was a "modest proposal" type of suggestion.

There are a lot of researchers who have a limited knowledge of statistics and
who basically treat the p-value as a magic number.

I don't have any sort of philosophical problem with p-values or the tools of
hypothesis testing, but I think it would be good for science if the magic
number were less salient, if that makes sense.

------
SubiculumCode
What is funny about p<.05 is that while we frequently have a specific
hypothesis in mind (e.g. Mean1<Mean2), convention still dictates that we
should use a two-sided t-test instead of a one-sided t-test. If I am asked to
go to .001, then I will use one-sided t-tests when I have a priori one-sided
hypotheses.
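
A quick illustration (mine, with simulated data): for the same sample, the
one-sided p-value is half the two-sided one whenever the observed difference
goes in the predicted direction.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    group1 = rng.normal(0.0, 1.0, size=30)
    group2 = rng.normal(0.5, 1.0, size=30)

    two_sided = ttest_ind(group1, group2, alternative='two-sided')
    one_sided = ttest_ind(group1, group2, alternative='less')  # H1: mean1 < mean2

    print("two-sided p:", two_sided.pvalue)
    print("one-sided p:", one_sided.pvalue)  # half the two-sided value when the
                                             # sample means fall in that order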

------
sna1l
[https://www.youtube.com/watch?v=8qrfSh07rT0](https://www.youtube.com/watch?v=8qrfSh07rT0)
\-- Taleb on p-values, technical, but pretty interesting.

------
leeoniya
sounds expensive. 0.01 may be more reasonable.

~~~
mrob
Choosing the correct value is a social problem.

For example, the Journal of Wacky Speculation publishes "The Moon is Made of Green
Cheese". Using a highly sophisticated cheese detector, the authors measure
results consistent with this hypothesis. They calculate it would happen only
4% of the time if the moon were not in fact made of green cheese, and
therefore the result is significant. The cheese detector works perfectly, the
calculations were accurate, but this does not mean the moon is made of green
cheese! We've been there, and we're very certain that it's made of rock.
Because the prior probability of the hypothesis was so low, the "significant"
result was most likely a fluke. The Journal of Wacky Speculation specializes
in this type of hypothesis, so they should use a much lower threshold for p.

Meanwhile, the Sensible and Boring Society publishes a study showing that
round wheels roll better on ordinary roads than square wheels. We already had
good reason to believe this, so p<0.05 shows it's most likely true. The
Sensible and Boring Society publishes tests of hypotheses with higher prior
probability, so they can use a higher p threshold.
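
A back-of-the-envelope sketch of that prior-probability argument (mine, with
made-up priors and power): apply Bayes' rule to a "significant" result and
the two journals end up in very different places.

    def prob_true_given_significant(prior, alpha=0.05, power=0.8):
        """P(hypothesis true | p < alpha), by Bayes' rule."""
        true_pos = power * prior          # real effects that reach significance
        false_pos = alpha * (1 - prior)   # null effects that reach it by chance
        return true_pos / (true_pos + false_pos)

    print(prob_true_given_significant(prior=1e-6))  # green cheese: ~0.00002
    print(prob_true_given_significant(prior=0.9))   # round wheels: ~0.99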

