
Scientists rise up against statistical significance - bookofjoe
https://www.nature.com/articles/d41586-019-00857-9
======
krisrm
Maybe I'm just being jaded, and I'm certainly not a researcher or
statistician, but I don't see how removing "statistical significance" from
scientific nomenclature is going to prevent lazy readers (or science
reporters) from trying to distill a "yes/no" or "proven/unproven" answer from
P values listed in a complex research paper.

~~~
bunderbunder
Well, that's why the article doesn't propose simply ditching P-values; it
proposes reporting confidence intervals instead. Not only do they provide more
information (by simultaneously conveying both statistical and practical
significance), they're also easier to interpret correctly without special
training.

~~~
curiousgal
> they're also easier to interpret correctly without special training.

Heh, you'd be surprised! Most people I've met would interpret a 95% CI by
saying that there is a 95% chance that it contains the true mean.

~~~
taktoa
Is this not the definition of "confidence interval"? The first few Google
results all define it this way...

~~~
lutorm
Unless I've forgotten more than I hope, I believe the formal definition of a
95% confidence interval is that "if the model is true, 95% of the experiments
would result in a point estimate within the interval." This is distinctly
different from "a 95% probability that the true value is contained within the
confidence interval", but that is typically what is loosely inferred.

~~~
BeetleB
Nope. It is (from a frequentist's model of statistics) exactly what the
article is claiming it isn't: if the model is true, and we repeat the
experiment several times, 95% of the intervals we calculate will contain the
true value. The actual CI you get in each experiment will differ.
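
Here's a minimal simulation of that repeated-sampling definition (all numbers
are illustrative):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_mean, n, reps = 10.0, 30, 10_000

    covered = 0
    for _ in range(reps):
        sample = rng.normal(true_mean, 5.0, size=n)
        # 95% t-interval for the mean of this one experiment
        lo, hi = stats.t.interval(0.95, df=n - 1,
                                  loc=sample.mean(), scale=stats.sem(sample))
        covered += lo <= true_mean <= hi

    # ~0.95: the *procedure* covers the true value 95% of the time;
    # any single interval either contains it or doesn't.
    print(covered / reps)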

Another discrepancy between frequentist statistics and the article is that
yes, the values at the boundary of your interval are as credible as in the
center.

~~~
kgwgk
> the values at the boundary of your interval are as credible as in the
> center.

What would that mean in the frequentist framework?

~~~
BeetleB
To be explicit, and using an example similar to the one in the article, if
your CI is (2, 40), with the center being 21, there is no reason to believe
that the true value is closer to 21 than to, say, 3.

To provide an extreme case, during the Iraq war, epidemiologists did a survey
and came up with an estimated number of deaths. The point value was 100K, and
that's what all the newspapers ran with. But the actual journal paper had a CI
of (8K, 194K). There's no reason to believe the true value is closer to 100K
than it is to 10K. Or to 190K.

~~~
kgwgk
You're right: from the frequentist definition of a confidence interval (2, 40),
we can't say that the true value is more likely to be closer to 21 than to 3.

But neither can we say that the true value is equally likely to be close to
21 as to 3.

The point is that, from the frequentist definition of a confidence interval,
there is nothing at all that we can say about how likely the true value is to
be here or there.

It could be 3, 21, or 666, and there is nothing that can be said about the
likelihood of each value (unless we go beyond the frequentist framework and
introduce prior probabilities).

~~~
BeetleB
>The point is that, from the frequentist definition of a confidence interval,
there is nothing at all that we can say about how likely the true value is to
be here or there.

Yes - sorry if I wasn't clear. I did not mean to imply that each value in the
interval is equally likely (and looking over my comments, I do not think I did
imply that).

The complaint is that the article is stating otherwise as _fact_.

>One practical way to do so is to rename confidence intervals as
‘compatibility intervals’

>The point estimate is the most compatible, and values near it are more
compatible than those near the limits.

They simply are not in a frequentist model (which is the model most social
scientists use). I agree with the main thrust of the article in that there
are many problems with P values. But I am surprised that a journal like Nature
is allowing clearly problematic statements like these.

I don't know enough about the Bayesian world to say whether the statement is
wrong there as well, but if it is correct there, it is problematic that the
authors did not state clearly that they were referring to the Bayesian model
and not the frequentist one.

(Not to get into a B vs F war here, but I remember a nice joke amongst
statisticians. There are 2 types of statisticians: Those who practice Bayesian
statistics, and those who practice both).

~~~
kgwgk
> I did not mean to imply that each value in the interval is equally likely
> (and looking over my comments, I do not think I did imply that).

When you said that "the values at the boundary of your interval are as
credible as in the center" you kind of implied that, which is why I asked.

I won't defend the article being discussed, but you opposed their statement
that "the values in the center are more compatible than the values at the
boundary" with an equally ill-defined "the values at the boundary are as
credible as in the center".

~~~
BeetleB
I do not read my statement as implying a uniform distribution.

What I meant was "there is no reason to prefer values at the center more than
values at the boundary" based on the CI (there may be external reasons,
though). To me, this is equivalent to your:

>there is nothing that can be said about the likelihood of each value

~~~
kgwgk
Ok, we agree. "As credible as" implies uniform "credibility". "More compatible
than" implies non-uniform "compatibility". Without any clear definition of
"credibility" or "compatibility", it's impossible to interpret precisely what
those claims are supposed to mean.

------
whatshisface
Although my failure to see an elephant on the table does not rule out
completely that there could be an elephant there, it does limit the possible
size of the elephant to a few micrometers. Failure to reject the null
hypothesis does in fact provide evidence against the other possibilities, so
long as "other possibilities" are understood to mean "other possibilities with
big effects."

I don't see why a scientist at a conference who's saying that two groups are
the same has to be heard as claiming, "we have measured every electron in
their bodies and found that they have the same mass, forget about six sigma,
we did it to infinity." Instead they could simply be understood to be saying
that the two groups must be similar enough to not have ruled out the null
hypothesis in their study.
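
A minimal sketch of that point, with made-up numbers: a null result with
tight error bars rules out big effects even though it can't rule out tiny
ones.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 200
    diff = rng.normal(0.0, 1.0, n)  # per-subject group difference; true effect 0

    lo, hi = stats.t.interval(0.95, df=n - 1,
                              loc=diff.mean(), scale=stats.sem(diff))
    # A narrow interval around 0: effects much bigger than ~0.2 SD are
    # "elephants" this study would have seen.
    print(lo, hi)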

~~~
rfeather
That's the thing. P values don't prove that anything must be. They simply say
that if rerunning the experiment again, it would be surprising to get a
different result. Conversely, if you don't find "statistical significance" it
definitely doesn't mean there isn't a difference. In practice, it might
(often) mean the study didn't have enough samples to find a relatively small
effect, but the layperson making decisions (do I allow right turn on red or is
that dangerous?) may not get that nuance. A book that really helped clarify my
thinking on this is _Statistics Done Wrong_ by Alex Reinhart.

Edit: remove "interpret" from last sentence to clarify

~~~
kgwgk
> They simply say that if rerunning the experiment again, it would be
> surprising to get a different result.

Not really. A low p-value says that it was surprising to get the result that
you got, assuming that the null hypothesis is true. And if the null hypothesis
is true it would be surprising to get again the same result (i.e. a result as
extreme). If the null hypothesis is not true, the result would not be so
surprising (or maybe more, if the true effect is in the “wrong” direction).

The result we got gives some evidence for the null hypothesis being false, but
if the null hypothesis was very very likely to be true before it may still be
very likely to be true afterwards. In that case it wouldn’t be surprising to
get a different result if the experiment is performed again.

Illustration: I roll a die three times. I get three ones. P<0.01 (for the null
hypothesis of a fair die and the two-tailed test on the average). This is not
simply saying that if I roll the die three times again it would be surprising
to get something other than ones.

~~~
nkurz
_I roll a die three times. I get three ones. P <0.01 (for the null hypothesis
of a fair die and the two-tailed test on the average)._

Hmm. At a glance, that doesn't seem right. Yes, the chance of rolling three
1's is 1/(6^3), but if we only rolled once and got a single 1, we wouldn't
have any reason to suspect that the die was unfair. So maybe we should only
consider the second two rolls, and conclude with p ~ .03 that the die is
unfair? Otherwise, consider the case that we rolled a 1, 5, 2 --- certainly we
shouldn't use this series of non-repeated outcomes as p < .01 evidence of an
unfair die?

~~~
kgwgk
If the die is fair, the average score will be 3.5. One can define a test based
on that value and reject the null hypothesis when the average score is too low
or too high.

The sampling distribution for the average can be calculated and for three
rolls the extreme values are 1 (three ones) and 6 (three sixes) which happen
with probability 1/216 each. Getting three ones or three sixes is then a
p=0.0093 result.

You raise a valid point. This is clearly not the best test for detecting
unfair dice, because for a die which has only two equally probable values 3
and 4 we would reject the null hypothesis even less often than for a fair die!
(In that case, the power would be below alpha, which is obviously pretty bad.)
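
That enumeration is easy to check directly (a minimal sketch):

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=3))  # all 6^3 = 216 rolls
    averages = [sum(o) / 3 for o in outcomes]

    observed = 1.0              # three ones
    dist = abs(observed - 3.5)  # distance from the fair-die mean
    # Two-tailed p: fraction of outcomes at least this far from 3.5
    p = sum(abs(a - 3.5) >= dist for a in averages) / len(outcomes)
    print(p)  # 2/216 ~ 0.0093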

------
i_phish_cats
I predict nothing will change. Flaws in p-values and confidence intervals have
been apparent almost since their inception. Jaynes spoke out strongly against
them from the 60's on (see, for example, his 1976 paper "Confidence
Intervals vs Bayesian Intervals"). Although I can't find it right now, there
was a similar statement about p-values from a medical research association in
the late 90's. It's not just a problem of misunderstanding the exact meaning
of p-values, either. There are deep-rooted problems like optional stopping
which render them even less useful.

The problem is that, for all its flaws, statistical significance provides
one major advantage over more meaningful methods: it provides pre-canned tests
and a number (.05, .01, etc.) that you need to 'beat'. That pre-canned-
ness/standardization provides benchmarks for publication.

I once worked in a computational genomics lab. We got a paper into PNAS by
running a Fisher exact test on a huge (N=100,000+) dataset, ranking the
p-values, taking the lowest ones, and reporting those as findings. There's so
much wrong with that procedure it's not even funny.
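
A minimal sketch of why that ranking procedure is broken (a t-test on pure
noise stands in for the actual Fisher exact pipeline, and the test count is
scaled down for speed):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_tests = 10_000

    pvals = np.array([
        stats.ttest_ind(rng.normal(size=20), rng.normal(size=20)).pvalue
        for _ in range(n_tests)
    ])

    print((pvals < 0.05).mean())  # ~0.05: hundreds of "hits" from noise alone
    print(np.sort(pvals)[:5])     # the "top findings" look very impressive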

~~~
abecedarius
Hippocratic medicine lasted well into the 19th century, centuries after the
scientific revolution. There'd been critics correctly calling it an
intellectual fraud before then. You could've taken this as proof that no force
on Earth could drag medicine into modernity, but it did sort of happen, as it
became public, common knowledge that doctors were harming more people than
they helped. They did start cleaning up their act (literally) though it took a
long time and I think they're still collectively irrational about chronic
conditions.

I hope we aren't worse at reform than they were in the 1800s.

------
randcraw
As I recall, instead of "compatibility intervals" (or confidence intervals),
other gainsayers of P tests have proposed simply making the existing P
criterion stricter, like a threshold of .01 rather than .05, which
equates to increasing the minimum sample size from about 10 per cohort to
20 or more.

I suspect this will be the eventual revision that's adopted in most domains,
since some sort of binary test will still be demanded by researchers. Nobody
wants to get mired in a long debate about possible confounding variables and
statistical power in every paper they publish. As scientists they want to
focus on the design of the experiment and results, not the methodological
subtleties of experimental assessment.

~~~
ordu
Raising the threshold will not just reduce the probability of a false
positive; it will also raise the probability of a false negative. The social
sciences deal with complex phenomena, and it may be that there is no simple
hypothesis like A -> B that describes reality with p<0.05. In reality A may
cause B, but there are also C, D, ..., Z, some of which also cause B, while
others work the other way and cancel some of the others out. And some of them
work only when the Moon is in the right phase.

p<0.01 is good when we have a good model of reality that generally works.
When we have no good model, there is no good value for p. The trouble is that
all the hypotheses are lies. They are false. We need more data to find good
hypotheses. And we think like "there is useful data and there is useless data;
we need to collect the useful while rejecting the useless". But we do not know
which data is useful while we have no theory.

There is an example from physics I like: static electricity. Researchers
described in their works what causes static electricity. There was a lot of
empirical data. But all that data was useless, because the most important part
of it didn't get recorded. The most important part was the temporal nature of
the phenomenon. Static electricity worked for some time after charging and
then discharged. Why? Because no material is a perfect insulator, there was a
process of electrical discharge; there was voltage and current. It was a link
to all the other known electrical phenomena. But physicists missed it because
they had no theory; they didn't know what was important and what was not. They
chased what was shiny, like the sparks from static electricity, not the lack
of sparks after some time.

We are modern people. We are clever. We are using statistics to decide what is
important and what is not. Maybe it is a key, but we need to remember that it
is not a perfect key.

~~~
nradov
In social sciences if the effects aren't clear and readily apparent without
quibbling over whether p<0.05 or p<0.01 is the right standard then perhaps the
whole thing is a waste of time. If our experimental techniques are
insufficient for dealing with multiple factors and complex webs of causality
then why bother?

~~~
light_hue_1
How clear or important the effect is has nothing at all to do with a p-value.
I can have p < 10^(-10) and still have an effect so weak that it's
meaningless. The confusion you're having is pervasive and a big part of the
problem. You have to use effect sizes to measure this.
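
A minimal illustration with made-up numbers: with a big enough sample, a
difference of 0.005 standard deviations gets an astronomically small p-value
while staying practically meaningless.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 5_000_000
    a = rng.normal(loc=0.0, scale=1.0, size=n)
    b = rng.normal(loc=0.005, scale=1.0, size=n)  # effect size ~0.005 SD

    t, p = stats.ttest_ind(a, b)
    print(p)                    # far below 10^(-10)...
    print(b.mean() - a.mean())  # ...for a difference of ~0.005 SD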

~~~
nradov
No I'm not confused by the distinction between p-values and effect sizes,
which is why I intentionally used the phrase "clear and readily apparent".

------
twoslide
I agree significance is misused, but in the opposite way from these authors.
They are concerned about authors claiming "non-significant" means "no effect";
I see a lot of authors claiming "significant" means "causal effect". They
don't account for the consequences of running multiple tests, or for
endogeneity.

Differences between the means of any two groups (e.g. treatment and control)
on any outcome will tend to be non-zero. Interpreting this sample difference
as a population difference without considering the confidence interval seems
risky.

~~~
usgroup
It turns out science requires really careful thinking ... who would have
thought?

I think that funding and publishing pressure turns the already hard problem of
doing good science into a Pareto optimisation between doing science,
publishing, and getting funding. The result is partially coerced results,
stronger-than-justified conclusions, lenient interpretations, and funding, of
course.

It'll go this way whatever the metric. I see the same sort of crap in ML
papers, where the authors report far more and lie far more at the same time.

------
mshron
I gave a relevant talk a few years ago: How to Kill Your Grandmother with
Statistics[1].

The authors are spot on that the problem is not p-values per se but
dichotomous thinking. People want a magic truth box that doesn’t exist.
Unfortunately there are a ton of people in the world who continue to make
money off of pretending that it does.

[https://www.youtube.com/watch?v=iRpAHS5_hDk](https://www.youtube.com/watch?v=iRpAHS5_hDk)

~~~
Mortiffer
I like your statement "there is no such thing as a truth machine". But your
talk, like this article, seems to suggest that the world would be better off
if we just dropped significance testing, a.k.a. dropping the objectivity of
science. My guess is that in that world there would be many more talks about
how we need to introduce some kind of objective test to prevent all these
subjectively accepted drugs from getting to market.

------
tomrod
What happens when your model errors aren't normally distributed?

If the kurtosis is high, p-values are over-stated. If fat-tailed then p-values
are understated.

Why? Because the sampling distribution of your test statistic isn't guaranteed
to be normal.

Normality is a nice assumption, but the asymptotics can take a long time to
kick in. The CLT is beautiful analytically, but fortunes are made off people
who assume it.
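
One way to see it, as a minimal sketch: simulate a two-sample t-test under the
null with an (illustrative) fat-tailed error distribution and compare the
realized false-positive rate to the nominal 5%.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    reps, n, alpha = 20_000, 10, 0.05

    def false_positive_rate(sampler):
        hits = 0
        for _ in range(reps):
            a, b = sampler(n), sampler(n)  # same distribution: null is true
            if stats.ttest_ind(a, b).pvalue < alpha:
                hits += 1
        return hits / reps

    print(false_positive_rate(lambda k: rng.normal(size=k)))         # ~0.05
    print(false_positive_rate(lambda k: rng.standard_t(2, size=k)))  # drifts from 0.05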

~~~
Balgair
> What happens when your model errors aren't normally distributed?

Honestly...? You're screwed. At least in bio, where most researchers haven't
taken calculus, most folks will screw up the t-test or their ANOVA if you are
not _super_ careful. For non-Gaussian data you'd better pray it's Poisson or
has some other exotic name that you can at least Google.

Especially with low N, you just kinda pray it's normal, and then you go and
try to get grant funding with those results.

Cynically, in the end, it barely matters. It's all about that grant money.
Whatever way you can tease that data to get more grants, you just do that. No
one ever checks anyway (Statcheck excepted).

------
tzhenghao
Lack of statistical literacy is a huge problem these days. As the modern
workforce trends towards more analytical methods, statistics can be used as a
weapon to fool and bend the truth.

I'm frankly tired of seeing executives go on stage showing some numbers
and graphs to prove a point about some variables. You see this in board
meetings too. The sample sizes are too small to conclude anything significant!

------
chasedehan
> When was the last time you heard a seminar speaker claim there was ‘no
> difference’ between two groups because the difference was ‘statistically
> non-significant’?

I'm an Economics PhD (and former professor), and if someone were to say those
lines at an academic conference there is a high likelihood that they would be
_literally_ laughed at.

Maybe it is because of my background in a quantitative field where we place a
huge emphasis on statistical rigor, but t-tests were pretty much dismissed by
anyone serious 20+ years ago. The issue seems confined to disciplines without
a stats/math background, where people just point to t-stats. My wife reads
medical literature for her work, and I gag pretty much every time she asks
me to look at the results.

~~~
Fomite
Yeah - I'm an epidemiologist, and the medical literature is painful.
"Statistical curmudgeon" is, as far as I can tell, my primary role as peer
reviewer.

------
rossdavidh
The fundamental problem here seems to be that you cannot get around the need
for a statistician (or someone from another field who has a similarly deep
understanding of statistics) to look at the data. There is no shortcut for
this, but we simply do not, as a society, have enough people with statistical
knowledge sufficient to the task.

There is not, I suspect, any other solution but that we must train a whole lot
more statisticians. This means we will need to give more credit, and
authority, and probably pay, to people who choose to pursue this field of
study.

~~~
o10449366
I'm afraid that this problem will only be exacerbated by the influx of "data
scientists" who may know how to implement linear regression and a few machine
learning models, but lack the statistical expertise to design experiments, vet
assumptions, and verify results.

This comment is not meant to disparage anyone who considers themselves a data
scientist. However, as someone who has advanced degrees in both statistics and
computer science, employers and recruiters outside of R&D roles have shown
very little interest in my statistical background aside from my machine
learning experience. My experimental design, data handling (not just data
cleaning, but data collection) skills, and theoretical understanding are
rarely discussed. Statisticians are compensated much less than programmers--
maybe deservedly so--but to that extent that I'm compensated the same as a
"data scientist" who only studied computer science and didn't study any
statistics, I feel like many employers, even those who benefit heavily from
them, don't properly compensate statisticians.

At my alma mater, the poor wages have really hurt the statistics program as
more students have decided to enroll in "data science" programs (typically
housed within business or computer science departments). I think this is a
really unfortunate trend because while those programs teach you how to
implement gradient descent, do basic data wrangling in Python and R, make data
visualizations, etc., they don't teach experimental design or the statistical
theory that drives applied statistics. Perhaps as the supply of well-trained
statisticians decreases and demand increases there will be upward wage
pressure, but I think it's more likely that unqualified and inexperienced
"data scientists" will continue to be shoehorned into these empty roles
instead.

~~~
Eridrus
I'm curious, what are the best examples of where your statistics knowledge was
useful?

What do you think practicing data scientists should learn to be more
effective? Experimental design? Something else?

~~~
o10449366
I could talk about a lot of specific circumstances, but in the interest of
brevity I'll say this: The reason I believe a well-rounded statistics
education is so valuable is because it ultimately teaches you to think
critically about questions and answers. Good experimental design is driven by
a solid understanding of theory--what statistical assumptions must be
satisfied, which ones can be reasonably assumed or violated, and whether
they're realistic to achieve.

I don't believe you need a PhD to call yourself a scientist, but I do think
one common trait most scientists share is curiosity. To that end, I would
encourage practicing data scientists who might not have a formal statistics
education to not shy away from statistical theory. A solid theoretical
understanding is what guides you when the questions and answers aren't clear--
and I think the fundamental shortcoming of many data science programs is that
they prepare their students for extremely simplified (and therefore
unrealistic) questions with easily obtainable answers relative to what will be
encountered in the real world.

I apologize for giving you an answer that isn't as coherent as I would like it
to be. I tried not to be too verbose, but I think I failed at that anyway. I
have a lot of feelings on this topic that I haven't fully articulated to
myself yet. I could answer your first question if you're still interested, but
as an addendum to my original comment: even though my statistics degree has
gained me nothing in terms of career advancement or an increase in salary or
opportunities, it's been truly invaluable to me as a programmer, public
speaker, and advocate, for the critical thinking skills it taught me. I hope
others continue to recognize the value of statistics in academia and stay
curious about it.

------
fifnir
Yet Nature forces you to provide p-values and n counts for everything you can
in any figure, as if that's enough to guarantee significance of results.

We need to start publishing with transparent and reproducible code from raw
data to figure. Show me the data and let me draw my own conclusions.

It's not too hard: I'm writing my PhD thesis, and every figure is produced
from scratch and placed in the final document by a compilation script. My
Jupyter notebooks are then compiled to PDF and attached to the thesis document
as well. Isn't this a better way of doing the "methods" section?
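
For what it's worth, the compilation step can be a script as simple as this (a
minimal sketch; the "notebooks" and "build" paths are hypothetical, and it
assumes jupyter/nbconvert plus a LaTeX toolchain are installed):

    import subprocess
    from pathlib import Path

    # Re-execute every notebook so each figure is rebuilt from the raw data,
    # then export the executed notebook to PDF for the appendix.
    for nb in sorted(Path("notebooks").glob("*.ipynb")):
        subprocess.run(
            ["jupyter", "nbconvert", "--to", "pdf", "--execute",
             str(nb), "--output-dir", "build"],
            check=True,
        )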

~~~
atlasair
This doesn't work for data gathered on humans, which has to be kept private.

Unless you have implemented some new method, I don't see why the code would be
of any interest.

~~~
fifnir
Because instead of vaguely describing what I did, I can show you exactly what
I did.

Instead of saying "we normalized the counts", I can show you EXACTLY what that
normalization was.

If I can't see your code, I don't trust you.

I think it's cleaner, more honest, more reproducible, and it helps teach
younger researchers.

ps. Huge amounts of "human" data are normally public and available for anyone
to work with; it's only specific subsets that need to be private.

------
bitxbit
[https://wolfweb.unr.edu/homepage/zal/STAT758/Granger_Newbold...](https://wolfweb.unr.edu/homepage/zal/STAT758/Granger_Newbold_1974.pdf)

was written over 45 years ago. Granger is rolling over in his grave every time
someone "discovers" a magical relationship between two time series. In all
honesty, statistics is hard, and it's something you need to practice on a
regular basis.
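
The Granger-Newbold point is easy to reproduce (a minimal sketch): regress one
random walk on another, independent one, and the fit looks "significant" far
more often than the nominal rate.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 500
    x = np.cumsum(rng.normal(size=n))  # two independent random walks:
    y = np.cumsum(rng.normal(size=n))  # unrelated by construction

    res = stats.linregress(x, y)
    print(res.pvalue)  # typically tiny: a "magical relationship" from noise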

------
kbutler
Statistical significance is required, but not sufficient to prove an effect.
Lack of statistical significance means you did not prove an effect, but you
also didn't prove there is no effect.

So the answer is more likely "statistical significance and more" rather than
"ditch statistical significance".

~~~
headsoup
I think maybe you didn't read the article. This is addressed constantly
through it.

Or are you just summarising?

~~~
pmoriarty
From the HN Guidelines:

 _" Please don't insinuate that someone hasn't read an article. "Did you even
read the article? It mentions that" can be shortened to "The article mentions
that.""_

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

------
qwerty456127
I have always said that if an experimental medication produced a useful effect
in 1 subject out of 1000, that doesn't mean it's garbage that should be
dismissed on the spot. It can perfectly well mean that this one person was
different in the same way 1 out of every 1000 other people is, and 0.1% of the
Earth's population is 7.55 million people, still worth curing.

~~~
jefftk
Unfortunately it's much more likely that this person happened to get better
for unrelated reasons than because this medication cured them. Likely enough
that it's not worth looking into unless this is a condition that effectively
no one recovers from.
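
A minimal sketch with made-up rates: if the condition has any plausible
spontaneous recovery rate, one recovery among 1000 treated patients is roughly
what an ineffective drug would produce anyway.

    from scipy import stats

    spontaneous = 0.005  # assumed background recovery rate (illustrative)
    n = 1000

    # Probability of seeing at least one recovery if the drug does nothing:
    p_at_least_one = 1 - stats.binom.pmf(0, n, spontaneous)
    print(p_at_least_one)  # ~0.99: a single "responder" proves nothing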

------
safgasCVS
A few points:

\- The basis of a p-value is very much aligned with the scientific process, in
that you aren't trying to prove something 'is true'; rather, you're trying to
prove something false. Rejection of p-values / hypothesis testing is a bit
like rejecting the scientific method. I am lucky enough to be friends with one
of the physicists who worked on finding the Higgs boson, and he hammered it
into my head that their work was to go out of their way to prove the Higgs
boson was a fluke - a statistical anomaly - sheer randomness. This is a very
different mentality from trying to prove your new wunder-drug is effective -
especially when those pesky confidence intervals get in the way of a
promotion or a new grant. It's much easier to say p-values are at fault.

\- Underpinning p-values is a distributional assumption: the distribution used
to compute your p-value needs to match that of whatever process you're trying
to test, or else the p-values become less meaningful.

\- The 5% threshold is far too lax. This means at least 5% of published papers
are reporting nonsense and nothing but dumb luck (even if they got lucky with
the distribution). If the distributional assumptions aren't met, then it's
even higher. Why are we choosing a 5% threshold for a process/drug that can
have serious side effects?

\- p-value hacking. So many sneaky ways to find significance here. Taleb goes
into some detail into the problem of p-values here
[https://www.youtube.com/watch?v=8qrfSh07rT0](https://www.youtube.com/watch?v=8qrfSh07rT0)
and in a similar vein here
[https://www.youtube.com/watch?v=D6CxfBMUf1o](https://www.youtube.com/watch?v=D6CxfBMUf1o).

Doing stats well is hard and open to wilful and naive abuse. The solution is
not to misuse or throw away these tools but to understand them properly. If
you're in research, you should think of stats as part of your education, not
just a tickbox used to validate whatever experiment you're doing.

------
vharuck
It definitely needs to be left out of anything with non-statisticians in the
intended audience. I've started leaving it out of most reports. If I write
about a difference, it's statistically significant. The test just gives me
confidence to write it.

~~~
sidesentists
As someone who does a lot of meta-analyses, I'd prefer you left in
non-significant values as well, if they bear on the hypotheses at hand.
Aggregating over non-significant effect sizes can still result in an overall
effect that is significant.

~~~
Fomite
This. A dozen "non-significant" studies that all have effects of the same
magnitude and direction are telling you something.
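
A minimal sketch of that, with illustrative numbers (pooling via scipy's
Stouffer method): a dozen underpowered studies of the same small effect are
mostly "non-significant" on their own, yet pool into a clearly significant
result.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    effect, n = 0.2, 50  # small true effect, underpowered studies

    pvals = []
    for _ in range(12):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(effect, 1.0, n)
        pvals.append(stats.ttest_ind(b, a, alternative="greater").pvalue)

    print([round(p, 3) for p in pvals])  # mostly above 0.05 individually
    stat, pooled_p = stats.combine_pvalues(pvals, method="stouffer")
    print(pooled_p)                      # typically far below 0.05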

------
jknz
From Brad Efron in [1]: "The frequentist aims for universally acceptable
conclusions, ones that will stand up to adversarial scrutiny. The FDA, for
example, doesn't care about Pfizer's prior opinion of how well its new drug
will work; it wants objective proof. Pfizer, on the other hand, may care very
much about its own opinions in planning future drug development."

Significance requirements should be approached differently depending on the
use case. The above are two extreme cases: FDA authorization of a new drug,
where significance guarantees should be rigorously established beforehand,
and, at the other extreme, exploratory data analysis inside a private company,
where data scientists may use fancy priors or unproven techniques to fish for
potential discoveries in the data.

Now, how much significance guarantee should be required from a lab scientist
is unclear to me. Why not let lab scientists publish their lab notebooks with
all experiments/remarks/conjectures and no significance requirement? The
current situation looks pretty much like this anyway, with many papers making
significance claims that are not reproducible.

We should ask how much the requirement of statistical significance hinders the
exploratory process of science. Maybe the current situation is fine; maybe we
should have new journals for "lab notebooks" with no significance
requirements, etc.

On the other hand, in the mathematical literature, wrong claims are published
often; see [2] for some examples. But mathematicians do not seem to be as
critical of this as the public is of non-reproducible papers in the life
sciences. Wrong mathematical proofs can be fixed, and wrong proofs that can't
be fixed sometimes still contain a fruitful argument that could be helpful
elsewhere. More importantly, the most difficult task is to come up with what
to prove; if the proof is wrong or lacks an argument, the result can still be
pretty useful.

[1]: [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.179.1454&rep=rep1&type=pdf](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.179.1454&rep=rep1&type=pdf)

[2]: [https://mathoverflow.net/questions/35468/widely-accepted-mathematical-results-that-were-later-shown-to-be-wrong](https://mathoverflow.net/questions/35468/widely-accepted-mathematical-results-that-were-later-shown-to-be-wrong)

------
duxup
As a layman who probably didn't understand that whole article, I ask:

If "statistical significance" is just sort of an empty phrase used to dismiss
or prove something somewhat arbitrarily, then isn't the same person writing
the same study likely to be just as arbitrary in declaring what is or isn't
significant... anyway?

~~~
BeatLeJuce
There are strict rules that define when something is "statistically
significant"; it's not at all arbitrary. The problem is people thinking that
just because something is statistically significant, it is automatically true.
Which it isn't. Statistical significance _by definition_ includes the
possibility that something was just a statistical oddity. This article
essentially just reminds people of that, and urges them to abandon the
"statistical significance is the same as ultimate truth" conclusion.

~~~
neaden
There aren't, is the thing. There are widely used standards, but they don't
actually have any real basis. Fisher, I believe, is the one who popularized
it, but it was for specific circumstances, and he acknowledged that it was
just a convenient convention.

~~~
BeatLeJuce
Sure there are. You can't just call a result "significant" at will. You can
pull numbers out of thin air, pre-filter your data, or carefully pick a
statistical test in your favor. But it's still well-defined which outcomes
you're allowed to call significant and which ones you can't.

------
S_A_P
I feel like this is one area where clickbait media has pushed things
backwards. Everyone wants the clicks, so facts from studies get skewed into
binary results when it's almost always shades of gray. If I see a study
showing that you may be slightly less likely to get Alzheimer's if you drink
green tea every day, but only on the order of half a percent or so, I don't
have a magic cure-all for Alzheimer's. But you will see news headlines: "Green
tea cures Alzheimer's! And may even be effective for ED!" Maybe we shouldn't
rise up against statistical significance, but instead push back on incorrect
dissemination of the results?

------
sgt101
It's interesting that they talk about a category error around "no
association". In fact there is a category error in applying statistical
thinking in cases where objects are not comparable - like human metabolisms,
ecosystems, art...

------
rscho
Most arguments used in this discussion are laid out much better in the book

[https://www.statisticsdonewrong.com/](https://www.statisticsdonewrong.com/)

------
Myrmornis
The article spends too much time saying what not to do and not enough time
saying what to do instead. People treating p-values dichotomously are doing so
because they think that's what they're supposed to be doing. So while it's
amusing to rant as this article does, it should have devoted its efforts to
presenting exactly how a paper _should_ be written in this new age. I suppose
there's plenty of opportunity for others to do that, but this piece is
high-profile.

------
scotty79
I would never have guessed that people would read:

"I measured no significant difference."

to mean:

"There's no difference."

and not:

"I couldn't measure precisely enough to see what difference there is, if any."

------
jpatokal
So we have 800 scientists signing a paper, but there are on the order of 7
million scientists worldwide. To prove the hypothesis "scientists rise up
against statistical significance" with 95% confidence level and a .1
confidence interval, we need a sample size of 857,462. A sample size of just
800 is clearly not statistically significant, so the paper is meaningless and
the hypothesis can be rejected. Am I doing this right?

~~~
pulisse
> Am I doing this right?

Sampling error is independent of population size, so, no.

~~~
ilyaeck
Also, the forces influencing society are not statistical, at least not in
this sense. A small minority in a group is definitely capable of challenging
old norms and establishing new ones. In fact, that is how it usually happens.

------
subjoriented
Ever hear of the statistician who drowned in a river an average of 1 foot deep?

------
6gvONxR4sf7o
It's worth emphasizing that the concept of statistical significance is still
super useful. It just doesn't deserve to be as central to the scientific
process as it currently is.

~~~
mamon
But it should be central to the scientific process. If you do any kind of
experiment and gather the results, the first thing you need to do is make
sure that your results are real and not just random noise in the data.
Otherwise you're not doing science anymore.

~~~
drcode
...but does statistical significance really allow you to distinguish real
results from random noise? I think it's pretty clear that it isn't a good tool
for achieving this goal, due to the "p hacking" phenomenon
[https://en.wikipedia.org/wiki/Data_dredging](https://en.wikipedia.org/wiki/Data_dredging)

------
SubiculumCode
I for one flagged this kind of crap in one of my recent peer reviews. It is a
transgression that is all the worse when the sample size is woefully small.

------
blunte
I must be stupid. It feels like a Monty Python word game designed to confuse.
There's no way I can not disbelieve a statistically insignificant study
outcome unless I want to allow myself to believe in some (but which ones?)
statistically insignificant ones.

~~~
thedataangel
The key is that there is a distinction between "not statistically significant"
and "statistically insignificant".

------
_rpd
I see a lot of health-related signatures on the list. Is this a backlash
because of the findings against homeopathy?

~~~
BeatLeJuce
which findings?

~~~
_rpd
Reviews like this ...

[https://bpspubs.onlinelibrary.wiley.com/doi/full/10.1046/j.1...](https://bpspubs.onlinelibrary.wiley.com/doi/full/10.1046/j.1365-2125.2002.01699.x)

that have led to various national health systems refusing to pay for
homeopathic treatment.

The studies are almost universally p-value based.

------
davidw
How many of them rose up?

------
killjoywashere
My general rule of thumb is if I'm having a debate about statistical
significance, I'm debating the wrong thing, should stop talking, and get more
data. Preferably so much more data that the question answers itself without
having to test for significance.

~~~
atlasair
This doesn't work if you need to expose patients to a novel drug to gather data.

~~~
killjoywashere
The era of medicine where we administer a 100 mg pill of a small molecule to
3000 patients, calculate a P value and release it for sale is dead. There's a
huge industry, so it's not aware of its death yet, but the research world has
moved on. I think the paradigm will be based around deep understanding of the
genomics, proteomics, histology, spatial distribution of the problem in the
body, and licensing for sale platforms that produce custom proteins, T cells,
small molecules, <intervention of choice>, designed on the fly. It's 30 years
out, but if you're betting the bank on a small molecule in a phase 3 trial,
that model is only going to hold up for so long.

------
peterlk
I think I will forever be bitter about an argument that I had with a professor
in college about statistical significance. There was a study with 12 (or some
other tiny N) people who were told to click a button when they saw some change
as a laser was moving in a circle. They were then asked to go back and pick
the place where the laser was when they observed the change. The study found
that people remembered clicking the button 10ms (or some other tiny value)
before they actually did. This was clearly grounds for all of us to question
whether humans had free will at all, because the result was statistically
significant, after all! When I challenged the professor on this, I was told
that I should take a statistics class. I think that professor still turns me
off from philosophy to this day. (This happened in a philosophy class.)

~~~
jobigoud
Your description almost matches an experiment done in EEG research called the
Libet experiment [1]; although it's a bit different from how you describe it,
I'm confident this is what you are referring to.

They found that when you perform an action, there is an EEG spike [2] in the
motor cortex well before you consciously decide to perform the action. The
experiment is conducted with a dot running around a circle, and the subject
has to report when (by where the dot was) they decided to act. The EEG
potential is seen prior to that decision moment.

This is related to free will: it is as if the decision to act is not coming
from your conscious self but from a deeper layer.

[1] Libet experiment:
[https://www.youtube.com/watch?v=OjCt-L0Ph5o](https://www.youtube.com/watch?v=OjCt-L0Ph5o)

[2] Bereitschaftspotential:
[https://en.wikipedia.org/wiki/Bereitschaftspotential](https://en.wikipedia.org/wiki/Bereitschaftspotential)

~~~
worldsayshi
Perhaps I'm missing the point here, but why does this say anything about free
will? Of course you can't make split-second conscious decisions. But you've
still made a conscious decision to prime your subconscious faculties to act
in a certain way.

Consciousness is super slow. It doesn't make sense to have it "do" anything.
But it's good for making executive decisions. The CEO doesn't make the
product.

~~~
azernik
It meshes with certain other results, for example on timing of predictions
([http://www.kurzweilai.net/more-evidence-that-youre-a-mindless-robot-with-no-free-will](http://www.kurzweilai.net/more-evidence-that-youre-a-mindless-robot-with-no-free-will)),
and most interestingly on split-brain patients
([http://www.powerofstories.com/our-brains-constantly-confabulate-stories-which-builds-a-meaningful-narrative-for-our-life](http://www.powerofstories.com/our-brains-constantly-confabulate-stories-which-builds-a-meaningful-narrative-for-our-life)).
In the latter case, split-brain people would make choices based on information
only available to their right brain, then when asked to explain them would
unconsciously invent an explanation for that choice which was based only on
information from their left brain.

I'm not sure philosophically whether our inability to understand our decisions
undermines our free will, but it certainly undermines any ability to
consciously prime ourselves to make certain decisions - hard to have that
feedback loop when you don't even know what decision you made!

~~~
wallace_f
Is saying it "meshes with certain other results" just a signal of cognitive
bias?

I personally would confidently guess there are unconscious faculties in the
mind, but I don't see how this experiment is remarkable in proving this; see
here (1). Is it not an equally likely conclusion that the brain takes a few
hundredths of a second to develop a conscious decision? Actually, the inverse
of that is what would be remarkable.

1 - [https://goo.gl/images/MkT2PQ](https://goo.gl/images/MkT2PQ)

~~~
azernik
The issue isn't whether a decision is unconscious or conscious. It's that we
often _think_ that we've made a conscious decision when the decision process
we narrate for ourselves and others is provably impossible.

I should have phrased the original as "it is a relatively weak example of this
specific set of experimental results".

------
bo1024
Wow. The issue discussed in the beginning of the article is really basic
(evidence of absence vs absence of evidence). There are much more
intellectually challenging issues with statistical significance. If scientists
don't understand this one, it's a really sad sign.

I'm not sure how well a difference in nomenclature can fix such serious
misunderstandings, but I do like the "compatibility" suggestions and the way
they talk about the point estimate and endpoints of the confidence interval.

