
Statisticians want to abandon science’s standard measure of ‘significance’ - respinal
https://www.sciencenews.org/article/statisticians-standard-measure-significance-p-values
======
TallGuyShort
The problem isn't p-values, the problem is a binary distinction between
p=0.049 and p=0.051. The problem would go away if everyone understood
p-values, or we replaced use of the term "statistically significant" with "3%
probability we're just seeing a pattern by accident". Renaming the term to
something that sounds just as binary isn't any different.

~~~
roenxi
We use hard cutoffs for a bunch of things; they aren't perfect, but they are
fine.

The problem is that we are imbuing the words "statistical significance" with a
whole bunch of math. This would be fine, except for the inconvenient fact that
people also want to use the word "significance" as it is defined in English.

It is not only possible, but _likely_ that people will be producing results
that are insignificant but statistically significant. I mean seriously, if I
tell my boss that the results are statistically significant, how do I expect
him to understand that the results might reasonably be insignificant? How do I
expect anyone non-technical to take that sentence seriously? You lose all
credibility pretty quickly when people start saying that the result is
significant except it doesn't matter.

This is even more stupid than calling complex numbers "imaginary" and
"complex". Those names are just arbitrary. Statistical significance is
overloading words that are likely to come up with meanings other than what
they are expected to mean.

Anyway, this is the problem that articles like this are getting at. People
are treating significance values _as though they are significant_. They aren't
significant, they are statistically significant. That means something different
from significance.

For example: if I run an experiment 20 times and one of those runs produces a
p-value at the 5% threshold, that run is statistically significant, but the result
is obviously not significant in the English-language meaning of the word.
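
A minimal simulation of that scenario in Python (the two-sample t-test, the group
sizes and the seed are arbitrary choices, not anything from the article): run 20
"experiments" on pure noise and count how many clear p < 0.05.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    p_values = []
    for _ in range(20):
        # two groups drawn from the *same* distribution: any difference is noise
        a = rng.normal(0.0, 1.0, 30)
        b = rng.normal(0.0, 1.0, 30)
        p_values.append(stats.ttest_ind(a, b).pvalue)

    print(sum(p < 0.05 for p in p_values), "of 20 null experiments came out 'significant'")
    print("chance of at least one false positive:", 1 - 0.95 ** 20)   # ~0.64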

~~~
abakker
This is an excellent point. In my work, the interpretation error I see the
most among people around me is to confuse significance with the magnitude of
the predicted effect. For example, if you use OLS to create a model, it is
possible to have a statistically significant beta with essentially no magnitude.

If you look at research on carcinogens, it is frequent to see something like
"it is statistically significant that eating chemical X increases your risk of
cancer" but when you read on, the magnitude of the effect is "increases risk
by .0023%, for a cancer that already has a .6% incidence in the population".

My point is that people who are not statisticians (or people who use
statistics), tend to be concerned with the magnitude of an effect, and
frequently misconstrue the statistical significance for that magnitude.
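
A hedged sketch of that OLS situation (the slope of 0.005 and the million-row
sample are invented purely for illustration): the beta is statistically
significant, yet its magnitude is negligible.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 1_000_000
    x = rng.normal(size=n)
    y = 0.005 * x + rng.normal(size=n)   # true slope is tiny by construction

    fit = sm.OLS(y, sm.add_constant(x)).fit()
    print("beta =", fit.params[1])     # ~0.005: essentially no magnitude
    print("p    =", fit.pvalues[1])    # typically far below .05
    print("R^2  =", fit.rsquared)      # the model explains almost nothing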

------
just_steve_h
This is a very old argument. I got my bachelor's degree in psychology at
Harvard in 1993, and was told repeatedly that p-tests are abused, overused,
and not terribly useful.

To my mind, the most hackable flaw is that the number of subjects sits in the
denominator of the standard error, so the test statistic grows with the sample
size. Any study with a sufficiently large sample will find "significance" with
p<.05, but it won't be meaningful.

We were taught that "effect size" measures were crucial to understanding and
interpreting results. If you have p<.05 but a tiny effect size, you're likely
not seeing a meaningful difference.
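
A rough illustration of that point (a made-up 0.01-SD difference, not any real
study): the same tiny effect typically flips from "not significant" to
"significant" as the sample grows, while the effect size stays negligible.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    def cohens_d(a, b):
        # standardized mean difference using a pooled SD
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        return (a.mean() - b.mean()) / pooled_sd

    for n in (50, 5_000, 500_000):
        a = rng.normal(0.01, 1.0, n)   # true difference: one hundredth of an SD
        b = rng.normal(0.00, 1.0, n)
        p = stats.ttest_ind(a, b).pvalue
        print(f"n = {n:>7}   p = {p:.4f}   effect size (Cohen's d) = {cohens_d(a, b):+.4f}")

With half a million subjects per group, p is usually well below .05 even though d
hovers around 0.01, a difference nobody would call meaningful.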

~~~
DataWorker
Odd that in some ways psychology seems to be more informed, as a field, about
these nuances and pitfalls, and yet it has failed as a field to adequately
police itself. The problem isn't on the statistics end of things, it's the
labor-market and gatekeeping end of things. Another aspect of the inability to
scale higher ed in the stupid way that many think is optimal.

------
nestorD
The paper "Frequentist and Bayesian inference: A conceptual primer" has some
great things to say about the p-value and why a bayesian approach might solve
some of the conceptual problems :
[https://www.researchgate.net/publication/326112369_Frequenti...](https://www.researchgate.net/publication/326112369_Frequentist_and_Bayesian_inference_A_conceptual_primer)

(Premise) If Tracy is an American, then it is very unlikely that she is a US
congresswoman.
(Premise) Tracy is a US congresswoman.
(Conclusion) It is very likely that Tracy is not an American.

(Premise) If the H0 is true, then it is very unlikely that I will observe
result X.
(Premise) I observe result X.
(Conclusion) Therefore, it is very likely that the H0 is not true.

------
danepowell
Most comments here point to cherry picking and "p hacking" as being the
primary problems with p values. Certainly those are major issues, but I think
they miss the real point of the article, which is that null hypothesis testing
is fundamentally broken, or at the very least doesn't do what most people
think.

A simple example of this can be shown with the following pair of tests:

Testing for a fair coin:

    
    
      - Null hypothesis is you have a fair coin
      - You observe 100 heads in a row
      - Given a fair coin, it's extremely unlikely to observe 100 heads
      - Therefore it's not a fair coin
    

Okay, that makes sense, but this is logically the same as:

Testing whether a person "Bill" is an American:

    
    
      - Null hypothesis is Bill is an American
      - You observe *Bill is a US congressman*
      - Given Bill is an American, it's extremely unlikely to be a congressman
      - Therefore he's not an American
    

Obviously that's some broken logic, but it's a perfectly valid way to get p <
.05
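
One way to pin down where the logic breaks is a back-of-the-envelope Bayes
calculation. The prior and population figures below are rough, made-up numbers;
the only structural fact used is that a member of Congress must be a US citizen.

    p_american = 0.96                        # invented prior for the person in question
    p_congress_given_american = 535 / 330e6  # being a congressman is extremely rare
    p_congress_given_not_american = 0.0      # non-citizens cannot serve in Congress

    # Bayes' theorem: P(American | congressman)
    numerator = p_congress_given_american * p_american
    denominator = numerator + p_congress_given_not_american * (1 - p_american)
    print(numerator / denominator)           # 1.0

P(congressman | American) is tiny, but P(congressman | not American) is zero, so
observing "congressman" actually confirms "American". The null-hypothesis recipe
only ever looks at the first of those two numbers.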

~~~
mokus
The problem in the second case is not null-hypothesis testing, it’s sample
size.

~~~
mar77i
What about the fact that the probability of a non-American getting elected to
Congress is rather low?

------
kerkeslager
If you're looking for a replacement you don't understand the problem.

The problem isn't that P=.05 is an arbitrary measure of significance. The
problem is that _only publishing significant results is a bias against the
null hypothesis_.

Let's say you're doing a study of flipping coins. The null hypothesis is that
the coin is evenly weighted. If the null hypothesis is true, when you flip a
coin once, it will come up heads with P=.5. If you flip the coin twice, then
under the null hypothesis _both_ flips come up heads with P=.25. The probability
of all flips coming up heads is P=0.125 for 3 flips, P=0.0625 for 4 flips, and
P=.03125 for 5 flips. So if we flip a coin 5 times and get heads all 5 times,
we can conclude that the coin is weighted in some way, with P=0.03125.

Let's say all the major journals of coin flipping only publish results with
the high significance of P<0.05. Alice flips a quarter 5 times and gets 3
heads and 2 tails, and nobody will publish her study because it has P=.3125
(see [1] for an intuitive explanation of how this was calculated). Bob flips a
quarter 5 times and gets 2 heads and 3 tails--again, no journal will publish
him. Catherine, David, Ellen, Frank, and Geri all perform the same experiment,
most of them getting 3:2 result ratios, some getting 4:1 ratios, but nobody
getting all heads or all tails, just as one would expect
given the null hypothesis. And the journal editors tirelessly send out
rejection letters to all their studies.

Now somewhere down the line, Robert flips a coin 5 times and gets 5 heads.
This result has P=.03125, which meets the requirement of P<.05! He sends it to
the American Journal of Coin Flipping Studies (AJCFS), and they are very
excited to publish his results! _Nature_ and _Science_ magazines do front page
pieces with headlines "Quarters Found Heavy-headed" and "Washington Shows His
Face" respectively. A casino hires Robert as a consultant for the design of
their coin-flipping games. His quarter-flipping study is cited in the
abstracts of two dime-flipping studies and a half-dollar flipping study.
During the trial of a murderer who placed quarters tails-side up on his
victims, Robert is called as an expert witness to say that the coins were
placed there, not flipped there.

A week later Sally flips a quarter 5 times and gets 2 heads and 3 tails. She
sends the results of her study to the AJCFS noting a failure to reproduce
Robert's result, but her study is rejected because it has P=.3125.

Now, if you survey the AJCFS and all the other academic journals on coin
flipping, you'd conclude that quarters are significantly weighted towards
heads. But in fact, the null hypothesis is true: quarters are pretty evenly
weighted. Robert's low-P result is exactly what you'd expect to happen
eventually if you have enough people perform the 5-quarter-flip experiment--in
fact, if a lot of people are studying coin flips, the P of getting a low-P
result approaches P=1. But because the AJCFS has a P=.05 requirement, they've
created a bias against the null hypothesis, which deceives the public into
thinking that flipped quarters are more likely to come up heads than tails.
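
A quick simulation of that file-drawer effect (assuming 1,000 labs, fair coins
throughout, and a journal that only prints P < .05):

    import numpy as np

    rng = np.random.default_rng(3)
    n_labs, flips = 1_000, 5

    heads = rng.binomial(flips, 0.5, size=n_labs)     # heads per 5-flip study
    rejects_null = (heads == 0) | (heads == flips)    # P = .03125, so "significant"

    print("labs that ran the experiment:        ", n_labs)
    print("studies rejecting 'the coin is fair':", rejects_null.sum())
    print("studies the journal will publish:    ", rejects_null.sum())
    # every published study "finds" a weighted coin even though every coin was
    # fair; the ~94% of unpublished null results are the ones telling the truth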

This is likely the reason why so many fields, most notably psychology[2], are
having a replication crisis[3], and a similar effect can be used in
P-hacking[4] to bolster results that are essentially fake.

Unlike coin-flipping, fields with replication crises like psychology and
medicine have real effects on real people's lives. It's irresponsible and
unethical for journals to publish with a bias against the null hypothesis, and
adjusting the P-value requirements to another significance requirement, even a
less arbitrary one, doesn't fix the issue.

The solution, I think, is for journals to commit to publish studies _before_
the study has been performed, based on the methodology, previous studies on
the subject, and qualifications of the researcher. This would mean that many,
many studies would be published with null results, and _that would be a good
thing_.

[1] There are 32 possible outcomes for flipping a coin 5 times. If we group
them by how many heads and tails, we can calculate a probability for each
outcome:

    
    
        HHHHH 1 result of 5 heads       ->      P  = 1/32 = .03125
    
        HHHHT
        HHHTH
        HHTHH 5 results of 4 heads, 1 tails  -> P  = 5/32 = .15625
        HTHHH
        THHHH
    
        HHHTT
        HHTHT
        HHTTH
        HTHHT
        HTHTH
        HTTHH 10 results of 3 heads, 2 tails -> P = 10/32 = .3125
        THHHT
        THHTH
        THTHH
        TTHHH
    
        HHTTT
        HTHTT
        HTTHT
        HTTTH
        THHTT 10 results of 2 heads, 3 tails -> P = 10/32 = .3125
        THTHT
        THTTH
        TTHHT
        TTHTH
        TTTHH
    
        HTTTT
        THTTT
        TTHTT 5 results of 1 heads, 4 tails  -> P  = 5/32 = .15625
        TTTHT
        TTTTH
    
        TTTTT 1 result of 5 tails       ->      P  = 1/32 = .03125
    

[2] [https://thepsychologist.bps.org.uk/what-crisis-
reproducibili...](https://thepsychologist.bps.org.uk/what-crisis-
reproducibility-crisis)

[3]
[https://en.wikipedia.org/wiki/Replication_crisis](https://en.wikipedia.org/wiki/Replication_crisis)

[4]
[https://journals.plos.org/plosbiology/article?id=10.1371/jou...](https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106)

EDIT: Also see roenxi's excellent post on how "significant" means different
things in statistics and colloquial English:
[https://news.ycombinator.com/item?id=20895893](https://news.ycombinator.com/item?id=20895893)

~~~
omarhaneef
This is actually a much better explanation of the problem than the main
article.

The main article kept saying that p values were not meant to be definitive,
but without explaining what was wrong with them. At least, not very clearly.

This comment is a much clearer explanation -- imho -- as to what goes wrong in
the industry. I also think it is right to focus the attention away from a
particular statistical measure.

~~~
kerkeslager
It seems clear to me that the author just didn't understand the problem with P
values. It's part of a larger problem of science journalism being done by
journalists without scientific backgrounds. It's not their fault--even if you
have natural ability and interest in both communication and discovery, it's
hard to get an education in both.

I have the opposite problem: my abilities lie more in the statistics/science
than the communication--I suspect the only reason that this explanation is
being received so well is that I happened across an effective example by pure
luck. The probability of me communicating effectively is probably P < .5. ;)

~~~
kqr
Let me check if I get it right: can this be said to be a case of incorrectly
aggregating experiments? We're in a sense taking min(p) over all p-values, or
any(significant) over all results, when we should use an aggregation method
that takes into account the total number of studies aggregated?

~~~
kerkeslager
Yes.

For example, if you dig, you'll find that a lot of the evidence for a
correlation between telomere shortening (associated with aging) and processed
meat consumption comes from a 2008 study in the American Journal of Clinical
Nutrition[1]. They found a P=.006 correlation between processed meat
consumption and telomere shortening. But when you look into the study further,
they collected data on _47 different food groups_.

Let's calculate how unlikely a P=.006 event is if you try 47 times to find it:

A. If you have a probability of something happening once P_1, then the
probability of that happening x times in a row, P_x, is P_x = P_1^x. For
example, the probability of flipping a coin once and getting heads is P_1=.5,
so the probability of that happening twice in a row is P_2=.5^2=.25, which you
can verify by enumerating the possibilities (1 in 4 of the following
combinations is 2 heads in a row: HH, HT, TH, TT). The probability of flipping
3 heads in a row is P_3=.5^3=.125 (only 1 of these 8 possibilities is all
heads: HHH, HHT, HTH, HTT, THH, THT, TTH, TTT).[2]

B. The chances of a P=.006 event NOT happening P' are P'=1-P=.994 (the rule of
1).

C. Combining facts A and B, the chance of a P_1=.006 event NOT happening
(P'_1=.994) 47 times in a row is P'_47 = .994^47 = .754 (rounded to 3
significant digits).

D. Applying the rule of 1 again, the probability of finding at least one P=.006
result if you try 47 times is 1 - P'_47 = .246.

So basically, the actual confidence value on that study is P=.246: there's
about a 1 in 4 chance they would find a P=.006 result for one of the 47 food
groups tested if the null hypothesis is true. The null hypothesis in this case
being "diet doesn't affect telomere length".

The paper doesn't seem to list their hypothesis, but they say things like "all
others were P>.05", which indicates that they were hypothesizing a P<.05
result. Doing the same math again for the probability of at least one P<.05
result occurring if you try 47 times gives 1 - (1 - .05)^47 ≈ 0.910.
That's over a 90% chance! In other words, by doing the test on 47 food groups,
they nearly _guaranteed_ that they would find a result within their definition
of significance. This is _very_ bad science, and a study with this bad of a
statistical design never should have been funded in the first place.
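
The same arithmetic in a few lines (this assumes, as above, that the 47 tests
are independent; correlated food groups would change the exact numbers but not
the lesson):

    def familywise_rate(p_single, n_tests):
        # chance of at least one "significant" result across n independent tests
        return 1 - (1 - p_single) ** n_tests

    print(familywise_rate(0.006, 47))   # ~0.246: a 1-in-4 chance of some P=.006 hit
    print(familywise_rate(0.05, 47))    # ~0.910: some P<.05 hit is nearly guaranteed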

This study has not been replicated as far as I know, but is often cited in pop
science[3][4][5].

[1]
[https://academic.oup.com/ajcn/article/88/5/1405/4649028](https://academic.oup.com/ajcn/article/88/5/1405/4649028)

[2] Note that this rule only applies to _independent probabilities_; the
result of one coin flip does not affect the result of the next coin flip, and
the telomere correlation with one food group does not affect the telomere
correlation with another food group. This rule won't work when applied
to _dependent probabilities_, where earlier tries affect later tries.
For example if you draw an ace from a deck of cards (P=4/52) and don't place
the ace back in the deck, the probability of drawing another ace is lower
(P=3/51).

[3] [https://www.livestrong.com/article/506649-foods-that-
boost-t...](https://www.livestrong.com/article/506649-foods-that-boost-
telomeres-telomerase/)

[4] [https://resources.teloyears.com/diet-and-
telomeres/processed...](https://resources.teloyears.com/diet-and-
telomeres/processed-meat-not-unprocessed-red-meat-inversely-associated-
leukocyte-telomere-length-strong-heart-family-study)

[5] [https://siimland.com/how-to-increase-telomere-
length/](https://siimland.com/how-to-increase-telomere-length/)

------
patientplatypus
I have a degree in statistics and I've never understood p-values. Even if any
single result would only happen by chance 5% of the time, there are enough
people doing enough tests that you're going to end up with that 5% of wrong
answers. And that philosophical problem doesn't go away by choosing a different
percentage.

Likewise, we're supposed to assume that there is something magical about our
prior assumptions? Why? Where is it written that our null and alternative
hypotheses should conform to what is true, when our best ability to predict
the future (as people) is based on our past experience? The world doesn't
always behave predictably, and yet that seems to be baked into the assumption
of every regression test ever run. Completely bonkers if you ask me.

~~~
Filligree
> Likewise, we're supposed to assume that there is something magical about our
> prior assumptions? Why?

Because that's how the math works. It isn't possible to compute a posterior
without first having a prior, so you have to decide on a prior _somehow_.

You can dress it up and try to hide it, but you can't avoid Bayes' theorem
forever.

~~~
wnoise
Well, you can report odds ratios (or their log).

~~~
Filligree
That's good advice in general, but doesn't really answer the question of "How
likely is it?".

~~~
wnoise
Right, because you can't answer that without prior information. Log-odds is a
step towards giving a function of prior information (but it is limited to the
actual class of hypotheses investigated).

------
arafa
I am pretty surprised to see no discussion of statistical power in the article
and very little mention in the comments here. To me, having more statistical
power solves many of the issues mentioned in the article. Many of the rest can
be handled with use of Bayesian priors, context-specific p-values thresholds
(.01, .1, etc.), and replication.

There are decent working guidelines for statistical power, but a lot of the
issues (especially around replication) I see are mainly due to sample sizes
being too small to adequately detect effects, especially if those effects are
small.

~~~
edmundsauto
Can you recommend any texts, videos, or websites to learn more about
statistical power? The term is quite overloaded in google.

~~~
xenocyon
A quick 101:
[https://www.statisticsdonewrong.com/power.html](https://www.statisticsdonewrong.com/power.html)

~~~
arafa
See also this, for more specifics on the math involved:
[https://effectsizefaq.com/2010/05/31/what-is-statistical-
pow...](https://effectsizefaq.com/2010/05/31/what-is-statistical-power/)

The basic idea is: Get more data, especially if you're measuring something
subtle.
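
A small simulation of what power means in practice (the 0.3-SD effect,
alpha=.05 and the sample sizes are arbitrary choices): the estimated power is
the fraction of simulated experiments that detect a real effect.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)

    def estimated_power(n_per_group, effect=0.3, alpha=0.05, reps=2_000):
        # fraction of simulated experiments that detect the (real) effect
        hits = 0
        for _ in range(reps):
            a = rng.normal(effect, 1.0, n_per_group)
            b = rng.normal(0.0, 1.0, n_per_group)
            if stats.ttest_ind(a, b).pvalue < alpha:
                hits += 1
        return hits / reps

    for n in (20, 80, 200, 500):
        print(f"n per group = {n:>3}   estimated power = {estimated_power(n):.2f}")

Small samples miss a real 0.3-SD effect most of the time, and the underpowered
studies that do happen to clear p < .05 will tend to overestimate it.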

------
jaclaz
Finally! If only we could really get rid of this largely meaningless 0.05
parameter.

Though I would have personally used _statistically_ instead of strictly in the
sentence:

> Strictly speaking, he says, "there's no difference between a P value of 0.049
> and a P value of 0.051."

------
jostmey
P-values are just a symptom of a much larger problem--the incentives of being
an honest researcher don't exist, and the incentives for appearing to generate
results are massive. Picking on p-values is like attacking Toyota for
contributing to pollution. The problem goes beyond one company and one
country.

------
buboard
At the very least, Romain Brette suggests changing the wording to
"statistically detectable".

I'm not sure abandoning tests altogether is good though. What does "it's
detectable but not clear" mean for communication? How do you e.g. communicate
global warming like that?

~~~
hannob
I usually don't see climate science being communicated with p-values and poor
significance cutoffs. The place where you usually see this used is for
communicating single study results (which often enough are things that you
probably shouldn't communicate at all).

To get an idea how climate science is trying to communicate look at the
summary for policymakers of the SR15 report:
[https://report.ipcc.ch/sr15/pdf/sr15_spm_final.pdf](https://report.ipcc.ch/sr15/pdf/sr15_spm_final.pdf)

They have different confidence levels they indicate instead of a single
cutoff.

~~~
buboard
That's no different from communicating significance with p < 0.05, p < 0.01,
p < 0.001 etc., as is usually done in papers. There is again a cutoff.

~~~
jerf
Cutoffs aren't the problem. They are inevitable; the logic is virtually
identical to the question "Why can you drink at 21 and vote at 18?" Of course
the Responsibility Fairy doesn't visit you that night and make you suddenly
able to handle it when you weren't the day before, it's just that at scale,
you don't have much choice but to operate that way because everything else is
just too expensive.

The problem with p-values is that the shape of the resulting cutoff just isn't
what we're looking for. We want something along the lines of "What is the
probability that this hypothesis is true, and how true is it?", and "What is
the probability that this result could have happened even if the hypothesis is
false?" is only an approximation at the best of times, and at worst, downright
misleading. That's true even before we consider some of the other issues that
my English kind of elides over, but the math contains; one of my problems with
significance testing as it is commonly done is that there are actually ranges
of hypotheses, and it _really_ overprivileges "the" "null" hypothesis; I could
write a decently-sized HN post just criticizing those two quoted words.

There is no rigid process that can produce the answer we really want, but I
think we can provide a selection of better default tools. An example of one
that has already been deployed to some extent is "power analysis", which is
not a direct answer but lets people crafting studies analyze how big their
studies will have to be before running them. We can build more tools like
that.

~~~
buboard
> "What is the probability that this hypothesis is true, and how true is it?",
> and "What is the probability that this result could have happened even if
> the hypothesis is false?

The first is impossible to calculate by definition. The second can be derived
from the p-value. Virtually all journals require rigorous reporting of p values
along with averages, and the justification of the statistical test used.

~~~
jerf
You've mangled my sentence in your quote. The rest matters. It may not be a
masterpiece of the English language, but I already said the first is not
directly possible and I wrote the second as a description of p-values quite on
purpose.

"Virtually all journals require rigorous reporting of p values along with
averages, and the justification of the statistical test used."

That's begging the question. The entire topic of conversation is whether or
not the standards of justification are adequate.

~~~
buboard
ah yes, I agree with those. As you said, the selection of hypothesis is the
actual problem, and statistical power doesn't solve that one either.

There are some scientists thinking over this though. This is an idea from a
neuroscience lab:
[https://www.researchmaps.org/](https://www.researchmaps.org/)

The idea was to create causal directed graphs for biology from the literature,
which would be used to identify what experiments are missing and thus inform
future science.

------
fela
I think the best term would have been "statistically surprising", because it
strongly hints at the fact that the result would be surprising under the null
hypothesis, which is all that "statistically significant" really means.
Sometimes surprising results happen, but all other things being equal they
might hint at the null hypothesis being false. I could also live with
"statistically interesting". "Detectable", suggested in another comment, seems
to have some of the same issues as significant, it is too strong and seems to
imply that now we know something is really there.

~~~
vharuck
By the reasoning behind significance, "surprising" would be a great drop-in.
However, in most studies, it would be more surprising if the null hypothesis
were true. Statistically significant results are pretty much a given.

------
ineedasername
There's no way to guard against all false positives. And while changing the P
value cutoff would reduce them, it would also increase false negatives.

The answer doesn't lie in hard-line stances for or against P values or with an
alternative that will have its own set of problems. It lies with greater
education of those who run experiments & those who consume the literature
about proper interpretation and other methods of analysis that should be used
alongside it.

~~~
merpnderp
Other small things could help, like publishing failures, incentivizing
duplication of experiments, increasing access to results.

~~~
ineedasername
I think duplication and access to raw data would be two huge steps forward.
It's actually a bit mystifying that researchers are able to present their own
final conclusions & interpretations without making the actual data available.

------
bregma
The problem is not with using statistical significance as a means to accept or
reject a hypothesis (ie. the scientific method). The problem is P-hacking
(effectively, choosing your results then customizing analysis to obtain them).

It's not the statistical analysis that's the problem, it's the bad "science"
and irresponsible journalism.

------
be_reasonable
Although well written, this article misses the main reason statistical
significance leads us astray and needs to be deemphasized: "reproducibility"
needs to be the new gold standard that displaces "significance testing". There
are too many highly significant results that are totally unreproducible.

------
pfdietz
There's a reason particle physicists want to see a five sigma effect before
they believe a detection is real.

~~~
bigfudge
Start funding psychology etc like we fund physics and the problems would be
much smaller. Most studies are underpowered. It’s not because scientists are
lazy, it’s because they’re busy.

------
riotman
"More than 800 statisticians and scientists are calling for an end to judging
studies by statistical significance in a March 20 comment published in
Nature."

While their sources support this statement, I'm getting mixed signals.

[https://www.nature.com/magazine-
assets/d41586-019-00857-9/da...](https://www.nature.com/magazine-
assets/d41586-019-00857-9/data-and-list-of-co-signatories)

This is their primary source: the statisticians calling to retire statistical
significance. However, their primary reasoning is that statistics is misused to
draw erroneous conclusions. It seems like the problem is a lack of
understanding of the philosophy and mathematics behind statistics on the part
of its practitioners, not statistics itself.

------
Hitton
A major problem with "significance" is that in some areas of research (say
psychology), it's possible to gather lots of data (have 1000 people fill out a
complex questionnaire) and then fish that data for a theory that's significant
(in your sample, Republicans might have been dumber than Democrats). But given
the size of the sample and the number of theories you test, you are bound to
find something significant even if it isn't true.

Tightening the significance threshold just makes this fishing more difficult,
and Bayesian reasoning doesn't help much either, because you have to
guesstimate reasonable priors. What really helps against fishing like this is a
requirement to preregister your studies.

~~~
bonoboTP
Many people talk about preregistration, but I'm not sure it would result in
the hoped benefits.

What I predict would happen:

- Either lots of studies are allowed to preregister, most of which cannot
reject the null hypothesis. You end up with a lot of "boring" null-result
papers in those high-profile journals that nobody gets excited about and nobody
gets promoted for and no media coverage happens, bad marketing for universities
and research centers.

- Or there would be a strict filter for the pre-registration, so that
researchers cannot chase their gut intuitions, some authority would need to
approve the study even before it's done. This hinders research and hinders the
dissemination of truly unexpected discoveries.

The human incentives are way deeper than any one solution could touch on. The
whole science funding structure requires flashy and sexy results that are just
not possible to produce on that scale.

~~~
MaxBarraclough
> or there would be a strict filter for the pre-registration, so that
> researchers cannot chase their gut intuitions, some authority would need to
> approve the study even before it's done

I see two things going on here.

Having additional 'oversight' for the scientific validity of a planned
experiment is presumably a good thing.

I don't see that it would be necessary to set out to prevent researchers
investigating their intuitions.

------
teekert
I'm not a statistician, just a biologist. But I have seen some pretty low
P-values indicating that two distributions have differing means while the plots
sit right on top of each other. This happens when there are a lot of
measurements (in my case there were over 100,000). I prefer to make ROC curves,
or simply to look at the plotted distributions. ROC curves give a nice idea of
the mixed-ness of positive and negative measurements.
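
A sketch of that situation (the 0.02-SD shift and the 100,000-per-group sample
are invented): the t-test reports a tiny p-value while the ROC AUC sits barely
above chance.

    import numpy as np
    from scipy import stats
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(5)
    n = 100_000
    neg = rng.normal(0.00, 1.0, n)
    pos = rng.normal(0.02, 1.0, n)   # distributions that plot right on top of each other

    print("t-test p:", stats.ttest_ind(pos, neg).pvalue)    # usually far below .05
    labels = np.concatenate([np.zeros(n), np.ones(n)])
    scores = np.concatenate([neg, pos])
    print("ROC AUC :", roc_auc_score(labels, scores))        # only ~0.506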

------
adj83
This gets at the heart of fake science. Everyone today thinks surveys or
studies are science.

The purpose of a paper is supposed to be to contribute something new to a
field, not just perform a statistical study.

But you or I can't say that now, as so many organisations use "studies" as the
basis for their field while maintaining that they are scientific.

For me, the fact that this is even being discussed really is good news.

------
slowhand09
Here is a related article on NOVA.
[https://www.pbs.org/wgbh/nova/article/rethinking-sciences-
ma...](https://www.pbs.org/wgbh/nova/article/rethinking-sciences-magic-
number/)

------
stefco_
p=0.05 is an absurdly low bar even with an extremely well-defined hypothesis.
Once you open things to p-value hacking (i.e. trying a bunch of hypotheses
until one seems significant), p=0.05 is almost guaranteed for one of your
hypotheses and is hence meaningless.

The physics community is pretty good about dealing with both of these
shortcomings. You often need a p-value of around p=0.003 (3 sigma) or lower to
say something is significant, and a few more sigmas to claim an actual
discovery. And in addition to this, you're expected to include a trials
factor to correct for multiple hypotheses. It's not perfect, but it makes
claims of statistical significance more meaningful.
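
Rough numbers behind those conventions (two-sided normal tails; the
100-hypothesis trials factor below is just an example count):

    from scipy import stats

    for sigma in (2, 3, 5):
        print(f"{sigma} sigma -> two-sided p = {2 * stats.norm.sf(sigma):.2e}")

    # naive trials factor: chance that at least one of N independent hypotheses
    # clears a local threshold by luck alone
    p_local, n_trials = 0.003, 100
    print("global p ~", 1 - (1 - p_local) ** n_trials)   # ~0.26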

------
SquishyPanda23
I tried convincing people of this ten years ago. Others have been trying for
something like half a century.

I am very happy to see it in Nature.

------
DoofusOfDeath
Can someone help me understand the concern about p-value hacking?

One of the comments below references this XKCD comic [0], which IIUC is an
example of p-hacking.

But in that comic, the only difference I notice between the original
hypothesis (jelly beans cause acne) and the p-hacked hypothesis ( _green_
jelly beans cause acne) is whether or not the hypothesis occurred to the
researcher at the _beginning_ of the study. And I don't understand why that
would bear on the importance of each hypothesis.

[0] [https://xkcd.com/882/](https://xkcd.com/882/)

~~~
DangitBobby
We already expect data to sometimes indicate significant evidence against the
null hypothesis when the null hypothesis is actually correct. That's just the
nature of the way we structure our statistics practices. But for each
additional set of variables you have and could test for relationships, you
increase your chances of finding a strong relationship that appears to be
significant due to random chance, but was in fact not. (You may notice that
this is just a different flavor of the issues that arise from the way
researchers use significance tests).

The reason you need to pick your hypothesis ahead of time is because otherwise
you may use intuition or just your eyeballs to find relationships in the data
that are only there due to random chance. This will supposedly increase the
probability of detecting spurious relationships and erroneously rejecting the
null.
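
A sketch of the jelly-bean situation (20 colors tested on pure noise; the group
sizes and repeat count are arbitrary):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    n_colors, n_subjects, n_screens = 20, 50, 1_000

    screens_with_a_hit = 0
    for _ in range(n_screens):
        # one full "jelly bean" screen: test 20 colors, no real effect anywhere
        hit = False
        for _ in range(n_colors):
            eaters = rng.normal(0, 1, n_subjects)
            controls = rng.normal(0, 1, n_subjects)
            if stats.ttest_ind(eaters, controls).pvalue < 0.05:
                hit = True
        screens_with_a_hit += hit

    print("screens 'linking' some color to acne:", screens_with_a_hit, "/", n_screens)
    print("theoretical chance per screen:", 1 - 0.95 ** 20)   # ~0.64

Picking the hypothesis after seeing which color popped is what makes the
headline p < .05 hollow; a fresh, pre-specified experiment on that one color
does not carry the hidden 20-way search with it.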

~~~
DoofusOfDeath
Thanks, although I must admit I'm having some trouble following your logic.

IIUC, you're pointing out that when examining the empirical data gathered
during an experiment, it's often possible to find some identifiable subset of
the data that are consistent with a _refined version of_ the original
hypothesis. E.g., maybe jellybeans in general don't correlate with acne, but
_green_ ones do.

Assuming that the experiment is part of a larger effort to build or refine
some model, I don't see the problem.

Suppose that someone refrained from p-hacking during the first run of that
experiment. IIUC, they'd look at the experimental results, and wonder if they
had missed some "X" factor. So they might conjecture that jellybean color was
relevant, and rerun the experiment with the hypothesis, "consumption of
jellybeans, but only of a particular color, correlates with acne." And
(assuming sample sizes were big enough), the data gathered during that second
experiment would likely confirm that green jellybeans correlate with acne.

But what's the point of having run that second experiment, when they could
have just reached the same conclusion by testing additional hypotheses from
the first experiment's data?

It seems like regardless of whether you just ran one experiment and did
"p-hacking", or instead ran a follow-on experiment, you end up with the same
refinements to the model you're working on.

------
blululu
IMO the expression ‘statistical significance’ is a big part of the problem.
Popular reporting of research translates a tiny but discriminable effect size
into a SIGNIFICANT effect. Changing the nomenclature to ‘statistically
discriminable’ would go a long way to improving popular understanding.

~~~
inetknght
> _Changing the nomenclature to ‘statistically discriminable’ would go a long
> way to improving popular understanding._

I think you overestimate the intelligence of people prone to misunderstanding.
I don't think they're likely to understand big words like "statistically" or
"discriminable" when they already don't understand big words like
"statistical" or "significance".

~~~
dagw
_when they already don't understand big words like "statistical" or
"significance"._

They do understand what "significant" means; they use the word every day. The
problem is that it means something different from the intuitive everyday
meaning when used in the context of p-values. Using a word like "discriminable"
might help clear things up, since it's a word that doesn't have so much meaning
packed into it already.

It's like when a mathematician says that something is "almost always" true,
they mean something very different than when a non-mathematician says
something is almost always true.

~~~
inetknght
> _The problem is that it means something different than the intuitive every
> day meaning_

Yes that's my point. Using words that are hard to understand or only
understandable in the context of P-values won't solve the problem; the problem
being that the general public won't understand the subtle differences in
meaning.

~~~
dagw
I'm not sure. I think much of the confusion comes from people instinctively
applying their everyday understanding of the word "significance" and falsely
believing they understand what it means in the statistical context. Had it
been called statistical confliburance or some other made up term then they
wouldn't think they knew what it meant and might find out what it actually
meant.

------
pcvarmint
I always thought p-values were arbitrary.

Frequentism is antiquated.

------
mrkeen
tl;dr:

> Is there a better way to judge if a study is solid?

> Unfortunately, there is no single alternative that everyone agrees would be
> better for all experiments.

~~~
ekianjo
How about replication?

~~~
hartator
Replication until statistical significance is found?

~~~
ekianjo
In a way, yes, because replication lowers the risk that the findings in a
single experiment are due to random luck. Anyway, there's no way around
replication when you want to prove something.

