
Time to Abolish "Statistical Significance"? - swibbler
http://conversableeconomist.blogspot.com/2019/03/time-to-abolish-statistical-significance.html
======
yosefzeev
I suspect if you take away tenure being based upon publication, you will find
that many statistical measures become more honest. You can abolish statistical
significance, but it won't stop the abuse of knowledge, which is the real
problem here anyway.

~~~
reubens
Could you expand on what you mean by the 'abuse of knowledge'?

I agree that the focus on this metric negatively influences research outcomes,
which extends to university structuring, but I'd like to hear your thoughts on
how this extends to abuse of knowledge in general.

~~~
pathseeker
> Could you expand on what you mean by the 'abuse of knowledge'?

A nice way of saying "lying". Learning how to game the statistics (e.g.
publishing the 20th experiment that showed significance while failing to
mention the other 19).
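
To make that concrete, here is a minimal simulation sketch (all numbers are
illustrative: 20 experiments, 30 samples per group, a 0.05 cutoff) of how often
at least one experiment on a nonexistent effect comes out "significant":

```python
# Illustrative sketch: run 20 experiments where the true effect is zero and
# see how often at least one of them clears p < 0.05 purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_runs, n_experiments, n_per_group, alpha = 10_000, 20, 30, 0.05

hits = 0
for _ in range(n_runs):
    pvals = []
    for _ in range(n_experiments):
        a = rng.normal(0.0, 1.0, n_per_group)  # control group, no real effect
        b = rng.normal(0.0, 1.0, n_per_group)  # "treatment", same distribution
        pvals.append(stats.ttest_ind(a, b).pvalue)
    hits += min(pvals) < alpha                 # at least one "significant" hit

print(f"P(at least one p < {alpha} across {n_experiments} null experiments)"
      f" ~ {hits / n_runs:.2f}")               # roughly 1 - 0.95**20 ~ 0.64
```

Publishing only that one "hit" is exactly the kind of gaming described above.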

------
brofallon
Sometimes I wonder if the discussion about p < 0.05 has diverged a bit from
practical considerations. In my field (population genetics and bioinformatics)
for instance, I'm not sure any current journal would reject a paper whose
primary result has p=0.051 but accept an (almost) identical paper with
p=0.049. Most papers seem to involve many separate analyses that together tell
some story, and that story may even be an interesting negative result (where p
>> 0.05). Whether or not statistical significance is a useful concept at all
is a separate question, but I suspect the discussion of whether the threshold
of 0.05 is useful might be out of touch with actual practice.

~~~
radus
This is my experience in my slice of biology as well.

~~~
gbhn
From the discussion on this I've read, I think a good direction would be to
consider statistical tests like this as simply not "publishable" at all, in
the way we currently think of publishing.

That is, if you have a theory about how a gene relates to height in tomatoes,
and you do a test, that test can show you you're likely on the wrong track if
it fails to clear some p-value threshold, but the only thing clearing the
threshold tells you is that "there may be something here."

I think this is true for many fields with a replication crisis. The problem
isn't statistical, the problem is no theory. If you have a functional theory,
there are all kinds of things you can do to gain confidence in it, and mostly
those will contribute to the ability to predict statistical results, but that
is completely different in kind from sending out a survey and noting that
questions 2 and 6 are statistically correlated.

When a field thinks that the kind of early suggestive work like this is worth
talking about, they should probably just talk about it in conferences and
similar venues, rather than "publish" it where journalists will pick it up in
a "science shows" story that 95% (lol) of the time turns out to be wrong.

In other words, I think it is fine that fields talk about early non-theory
results -- that can be interesting for specialists to advance faster.
"Publishing" this mostly-going-to-be-wrong stuff is leading to confusion among
the public about what the scientific process demands and how trustworthy it
is. That is not a good outcome in my opinion.

------
nestorD
One very important thing to know about the 0.05 threshold, and which I did not
find in this thread, is that the ideal p-value threshold for a problem is a
function of the number of samples (and of the effect size, though it has a
lesser impact).

0.05 is way too stringent if you have 10 samples and way too lenient if you
have 1 million samples.

But, by force of convention, everyone uses 0.05 (a value suggested by Fisher
when basically all datasets were small) independently of their sample size,
and in a world where we sometimes reach dataset sizes that would have been
inconceivable when the threshold was suggested.

Here is a good article to start thinking about how one would select a p-value
threshold for an experiment:
[https://journals.plos.org/plosone/article?id=10.1371/journal...](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0032734)
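
As a rough sketch of that idea (not the paper's exact procedure), one can pick
the alpha that minimizes a weighted sum of the Type I and Type II error rates
for a given sample size and a smallest effect size you care about. The effect
size, cost ratio, and test (a two-sided z-test) below are assumptions made
purely for illustration:

```python
# Hedged sketch: choose alpha by balancing Type I and Type II error,
# instead of fixing it at 0.05 regardless of sample size.
import numpy as np
from scipy.stats import norm

def type2_error(alpha, n, d):
    """beta of a two-sided one-sample z-test with n samples and true effect d."""
    z = norm.ppf(1 - alpha / 2)
    power = (1 - norm.cdf(z - d * np.sqrt(n))) + norm.cdf(-z - d * np.sqrt(n))
    return 1 - power

def balanced_alpha(n, d, cost_ratio=1.0):
    """Coarse grid search for the alpha minimizing cost_ratio*alpha + beta."""
    alphas = np.linspace(1e-4, 0.5, 5000)
    total = cost_ratio * alphas + np.array([type2_error(a, n, d) for a in alphas])
    return alphas[np.argmin(total)]

for n in (10, 100, 10_000, 1_000_000):
    print(f"n = {n:>9,}  balanced alpha ~ {balanced_alpha(n, d=0.1):.4f}")
# Small n: the balanced alpha lands far above 0.05 (otherwise you miss
# everything); huge n: it lands far below 0.05 (at the bottom of this grid).
```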

~~~
cuchoi
It is not a function of only the number of samples. It is also a function of
how costly false positives (Type I errors) and false negatives (Type II
errors) are. That is (from my understanding) the paper's main point.

But then you have to be able to calculate how costly a Type I and a Type II
error is! That seems like a relatively straightforward question for a business
(for example in A/B testing), but how do you measure that cost in academia?

I think this would only introduce confusion and another variable for
p-hacking.

~~~
nestorD
You can also use the a priori probability of a positive, instead of a cost,
which can be roughly deduced from the existing literature.

The strength here is that you get rid of an arbitrary decision (the p-value
threshold) and instead use quantities that can be measured and critiqued by a
skeptical reviewer.

But, in my experience with small data, the impact of the size of the dataset
dwarfed the impact of the cost/probability.
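
For what it's worth, here is a back-of-the-envelope sketch of how that a
priori probability changes what a p < 0.05 result means; the 0.8 power and the
priors below are made-up numbers for illustration:

```python
# Hedged sketch: combine the a priori probability that an effect is real
# (estimated from the literature) with alpha and power, via Bayes' rule,
# to get the probability that a "significant" finding is actually true.
def prob_finding_is_real(prior, alpha=0.05, power=0.8):
    """P(effect is real | test came out significant)."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

for prior in (0.5, 0.1, 0.01):
    print(f"prior {prior:>4}: P(real | significant) ~ "
          f"{prob_finding_is_real(prior):.2f}")
# 0.5 -> ~0.94, 0.1 -> ~0.64, 0.01 -> ~0.14: a fixed 0.05 threshold means
# very different things depending on how plausible the hypothesis was.
```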

------
kalium-xyz
Why is 0.05 so common to begin with? It's an arbitrary value, ain't it?

~~~
lisper
Yep. But if you're going to make binary publishing decisions then you have to
draw a line somewhere.

~~~
kazinator
However "1/20 chance that, if the hypothesis is false, our research confirms
it anyway" is not a good choice of somewhere for any important research that
can affect people's lives.

~~~
feral
This debate isn't that simple. The other side of this isn't that everything
magically just gets more accurate. The cost will be withholding true
discoveries from the medical community for longer.

Imagine you've done a study of something else and you accidentally discover a
(properly FDR controlled) correlation between some drug and heart attacks,
significant at 0.03 and with no big contrary prior.

Should you publish or wait 5 years for a follow-up study to complete?

What if three different groups stumble across this same finding? Should they
all publish and maybe someone will realize 'shit this drug is killing people'
or should they all wait to hit some other higher standard of significance?

The point is there's always a balance between specificity and sensitivity, a
tradeoff in terms of costs.

I'd personally be happy with keeping 0.05 as the threshold for 'probably
something here'. The real issues are publication bias, incentives, and naive
interpretation of published work ('here's one study in Nature so it must be
true'). I don't see any purely statistical change, Bayesian (which has
essentially all the same problems, aside from taking priors into account) or
otherwise, that will solve these without an unacceptable cost in sensitivity.

------
no_identd
Here's an alternative:

[https://arxiv.org/abs/1904.06605](https://arxiv.org/abs/1904.06605)

Victor Coscrato, Luís Gustavo Esteves, Rafael Izbicki, Rafael Bassi Stern —
Interpretable hypothesis tests (2019)

Abstract:

Although hypothesis tests play a prominent role in Science, their
interpretation can be challenging. Three issues are (i) the difficulty in
making an assertive decision based on the output of an hypothesis test, (ii)
the logical contradictions that occur in multiple hypothesis testing, and
(iii) the possible lack of practical importance when rejecting a precise
hypothesis. These issues can be addressed through the use of agnostic tests
and pragmatic hypotheses.

Note that this enables acquiring one of the Holy Grails of Statistics, namely,
controlling Type I & II errors simultaneously.

------
ken666
In particle physics there is the concept of the "look-elsewhere" effect, which
exists precisely to take into account that if you look for a signal, for
example from a particle of any mass in some range, there is the possibility
that just by chance you find some statistical deviation at some mass value.

Confirming a prediction (i.e. looking for a particle with a precisely
predicted mass) is very different from fishing for some unexpected signal in
your data.

In some cases Economics could do the same: Looking for an effect in any age
range could be post-processed to take into account that you are looking into
many age groups.

~~~
nestorD
That's called the multiple comparisons problem in statistics, and it is both
well known and compensated for in most studies (the hard part being not to
introduce too many false negatives in the effort to keep the quantity of false
positives constant):
[https://en.wikipedia.org/wiki/Multiple_comparisons_problem](https://en.wikipedia.org/wiki/Multiple_comparisons_problem)
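
For illustration, here is a small sketch (with made-up effect sizes and test
counts) of that trade-off, comparing the conservative Bonferroni correction
with the Benjamini-Hochberg FDR procedure:

```python
# Sketch: 200 tests, 20 of which have a real effect. Bonferroni keeps false
# positives rare but misses many real effects; Benjamini-Hochberg (FDR
# control) recovers more of them at the cost of an occasional false positive.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
n_true, n_null, n = 20, 180, 50

pvals, is_real = [], []
for i in range(n_true + n_null):
    effect = 0.6 if i < n_true else 0.0        # made-up effect size
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effect, 1.0, n)
    pvals.append(stats.ttest_ind(a, b).pvalue)
    is_real.append(i < n_true)
pvals, is_real = np.array(pvals), np.array(is_real)

for method in ("bonferroni", "fdr_bh"):
    reject, _, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method:10s}  true hits: {int(np.sum(reject & is_real)):2d}/{n_true}"
          f"   false hits: {int(np.sum(reject & ~is_real))}")
```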

------
JadeNB
Who is Timothy Taylor? (The 'about me' on his Blogger page is blank, and
Googling doesn't turn up results that are obviously about him.) This refrain
is so well worn by now that it seems one really ought to have something
fundamentally new to say, or be enough of a heavyweight that one's own opinion
might significantly swing the pendulum, before hoping that another repetition
of it will make any difference. (On the other hand, I guess it's just a blog
post, so I shouldn't spend too much energy ranting about how someone uses his
own blog.)

------
tambourine_man
P-values Broke Scientific Statistics—Can We Fix Them?

[https://www.youtube.com/watch?v=tLM7xS6t4FE](https://www.youtube.com/watch?v=tLM7xS6t4FE)

------
hashkb
This is throwing the baby out with the bathwater. Let's redefine it so it's
useful again and reform academia. I'm running into more and more people who
use headlines like this as an excuse to ride bikes with no helmets.

------
paulddraper
That was a lot of words.

There are two alternatives to the current methodology:

* Remove significance requirement for publishing

* Adopt another statistical measure like Bayesian stats

~~~
s1artibartfast
There are a few more listed in the article and elsewhere:

* Pre-registered experiments

* Listing the number of regression models run

* P-values reported with no significance declaration

------
RocketSyntax
The model either converges or diverges

------
buboard
Betteridge's law of headlines applies, with p < 0.001.

