
What's the significance of 0.05 significance? (2013) - xtacy
http://www.p-value.info/2013/01/whats-significance-of-005-significance_6.html
======
avs733
The problem with all of this is that the reductionist paradigm for
understanding science has long been extended to communicating and teaching
about science.

As the author does a really nice job explaining, and as seen in the Fisher
quote, these articulations of heuristics and guidelines are often taken as
closed form rules.

Look beyond statistics and you see it everywhere. Common examples include
entrepreneurship and design...in both areas, experts' ways of thinking are
often highly situated, highly metacognitive, and the actions they take are
inherently inseparable from their thinking process. However, because of the
academic drive toward objective/deterministic/observable phenomena, the
research tends to report and attribute only the actions. The result is that
those actions, rather than the underlying thinking processes, are valued and
taught.

The result is simulations of expertise masquerading as knowledge. It's one
thing when it's students, but as you are seeing in psychology's 'replication
crisis' (which, side note, is kind of a meta-version of its own critique), it
can create real problems when surface-level understanding is accepted and
generalized as a normative 'truth' in a field. You see it in economics and
business a lot...they strive to appear scientific, but do so in ways that
inherently betray the underlying structure of what is being studied. It comes
from an underlying value in those communities, and in society, that the only
truth is objective truth.

If I have an experiment where I am screening 5 possible predictors and I get
p-values of .9 for 4 of them and .52 for 1...I would be an idiot not to
pursue the 1. If I get 4 .49s and 1 .00000000001...same thing. Statistics is
relative, literally.

[happy to provide citations...not sure anyone really cares]

~~~
thedailymail
I care, and agree with your points. It seems at least part of the
fetishization of p < 0.05 has arisen from its common use as the cutoff point
in drug clinical trials, where it represents a make-or-break point in a
multi-billion-dollar industry. There needs to be some predefined standard for
such trials to prevent cherry-picking and other games, but as the blog
describes, it could just as easily have been a different threshold. Similar
observations can be made about the values used to set study sample size
(based on somewhat arbitrary alpha and beta).
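
To see how directly those arbitrary alpha and beta choices drive the numbers,
here's a toy sketch (the standard normal-approximation sample-size formula
for a two-arm comparison of means; the defaults are just the conventional
0.05 / 80%-power choices, not a recommendation):

```python
from statistics import NormalDist

def n_per_arm(alpha=0.05, beta=0.20, effect=0.5):
    """Rough per-arm sample size for detecting a standardized effect
    `effect`, via n = 2 * ((z_{1-alpha/2} + z_{1-beta}) / effect)^2."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)
    z_b = z.inv_cdf(1 - beta)
    return 2 * ((z_a + z_b) / effect) ** 2

print(round(n_per_arm()))            # → 63 per arm with the conventional choices
print(round(n_per_arm(0.01, 0.10)))  # stricter alpha/beta: n roughly doubles
```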

The arbitrariness at the heart of the regulatory enterprise may seem
disconcerting, but the alternative (no shared standards for evaluating drug
efficacy) has also been tried historically, and the result was markets flooded
with useless, often dangerous, products many of which nonetheless sold very
well.

~~~
avs733
> It seems at least part of the fetishization of p < 0.05 has arisen from its
> common use as the cutoff point in drug clinical trials, and thus represent a
> make or break point in a multi-billion dollar industry.

I think that is a bit recursive. It's not so much a fetishization as it is a
misunderstanding that results in it being a valuable target.

It would be a heck of a lot more useful as a target (and granted...I would
argue the target should be more like .001) if more research adopted Bayesian
statistical techniques, where you can't as easily p-hack.
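
As a toy illustration of the Bayesian alternative (a coin-flip Bayes factor;
the uniform prior on the bias is an arbitrary choice for the sketch): a Bayes
factor tracks relative evidence rather than a pass/fail threshold, which is
part of why it is harder to game.

```python
from math import comb

def bayes_factor(k, n):
    """Bayes factor for H1 (bias ~ Uniform(0,1)) over H0 (fair coin),
    given k heads in n flips. The H1 marginal likelihood
    integral of C(n,k) p^k (1-p)^(n-k) dp is 1/(n+1) in closed form."""
    like_h0 = comb(n, k) * 0.5 ** n
    like_h1 = 1 / (n + 1)
    return like_h1 / like_h0

print(bayes_factor(60, 100))  # ~0.9: barely any evidence either way,
                              # even though the p-value sits near 0.05
print(bayes_factor(80, 100))  # overwhelming evidence of a biased coin
```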

------
thearn4
The standard ritual for measuring significance in research seems to me to be
some strange marriage of the ideas of Fisher, Neyman, and Pearson that I'm
not sure any of them would actually have agreed with. I'd be interested to
hear any historians of statistics or scientific methodology comment more on
that angle, or correct my misinterpretation if that's what it is.

~~~
kgwgk
I think you’re right.
[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4347431/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4347431/)

------
netcraft
Semi-off topic - as someone who does a lot of data analysis with SQL but has
never taken a statistics course - can anyone recommend resources for learning
how to calculate / apply p-values, r-squared, etc.?

~~~
madhadron
I wrote a couple papers for clinicians introducing p-values and the necessary
apparatus to understand them correctly:
[http://madhadron.com/posts/2016-01-25-p_values_for_clinician...](http://madhadron.com/posts/2016-01-25-p_values_for_clinicians.html)

~~~
nonbel
>"The P-value is the smallest relevant value of α given your data (i.e., the
smallest probability of making a Type I error and deciding there is an effect
when there isn't one)."

Nope, the p-value calculation assumes there is no effect. How can it be the
probability there is an effect?

~~~
madhadron
Reading comprehension: "deciding that there is an effect when there isn't
one."

------
pontus
p-values are widely misunderstood and a lot more subtle than most people
realize.

If you're interested in p-values, I wrote a post on them here with some
counterintuitive examples (one of them shows how a lower p-value can sometimes
increase your belief in the null-hypothesis).

[https://mindbowling.files.wordpress.com/2016/07/pvalues.pdf](https://mindbowling.files.wordpress.com/2016/07/pvalues.pdf)
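
One way to see the flavor of that counterintuitive example: under a
unit-information prior, a BIC-style approximation to the Bayes factor for the
null is roughly sqrt(n) * exp(-z^2 / 2), so the same "significant" z-score can
come to favor the null once n is large enough (the Jeffreys-Lindley paradox;
the approximation is crude but directionally right).

```python
from math import exp, sqrt

def bf01_approx(z, n):
    """BIC-style approximation to the Bayes factor in favor of the null,
    for a z-statistic computed from a sample of size n."""
    return sqrt(n) * exp(-z * z / 2)

# The same z = 2.5 (two-sided p ~ 0.012) at two sample sizes:
print(bf01_approx(2.5, 10))         # < 1: data favor the alternative
print(bf01_approx(2.5, 1_000_000))  # > 1: same p-value now favors the null
```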

~~~
hnhg
This is great - you should really put your name and details on there. You
deserve the recognition!

~~~
pontus
Thanks! I'm glad you liked it. I guess I didn't think about putting my name on
there, but maybe I will :)

------
lokimedes
What frightens me more is how rarely I see talk of decision theory and
hypothesis testing in the (deep) machine learning community. It is as if
people consider the classification output sufficient evidence of recognition,
just quoting max(p(class)) rather than the significance of the class given
its classifier score. Am I missing something?

~~~
nonbel
What do you think calculating the "significance" would add?

~~~
lokimedes
Most deep-CNN classifiers I’ve seen simply rely on softmax to provide a
“probability” among the classes. But it really is a score that has been
normalized across the classes. A 0.7 for one class does not imply the same
level of discrimination from the remaining classes as a 0.7 for another
class. By only using the maximum-scoring class you don’t account for what
score value is sufficient to claim a significant discrimination between the
maximum score and the alternatives.
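
A concrete (and hedged) version of that idea: a classifier with a reject
option that abstains unless the top softmax score clears the runner-up by
some margin. The 0.2 threshold here is arbitrary; a calibrated or conformal
approach would be more principled.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_with_margin(logits, margin=0.2):
    """Return the top class only when its softmax score beats the
    runner-up by at least `margin`; otherwise abstain (None)."""
    ranked = sorted(enumerate(softmax(logits)), key=lambda kv: -kv[1])
    (top_cls, top_p), (_, second_p) = ranked[0], ranked[1]
    return top_cls if top_p - second_p >= margin else None

print(classify_with_margin([4.0, 0.5, 0.2]))  # → 0 (clear winner)
print(classify_with_margin([1.2, 1.0, 0.9]))  # → None (scores too close)
```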

~~~
nonbel
I see how the probability (softmax output) contains more info than simply the
class with highest probability, but not what you want regarding "significant
discrimination".

Perhaps you want to weight different types of errors differently? Eg:
[https://github.com/Hezi-Resheff/paper-log-bilinear-loss](https://github.com/Hezi-Resheff/paper-log-bilinear-loss)

------
VikingCoder
[https://xkcd.com/882/](https://xkcd.com/882/)

------
hackeraccount
Someone needs to reference the XKCD strip on jelly beans.

[https://xkcd.com/882/](https://xkcd.com/882/)

------
BoiledCabbage
Scientists need to start using a "training set" and a "test set".

If you have 2000 samples of data, you don't train your model on all of them
and then report the training accuracy as your success rate. You'll end up
with conclusions that don't generalize.

Instead train on 1600 and measure your success on the remaining 400.

Similarly, don't look for statistical significance among your 2000 samples
and conclude that's the result. Do it across 1600 and then validate it on the
400. If there is a real result there, it'll reproduce. This makes your
process robust to overfitting / parameter hacking.

You avoid the green jelly bean problem entirely.
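
A toy simulation of the two-stage idea (all numbers hypothetical): under a
true null, a p-value is just a Uniform(0,1) draw, and independent samples
give independent draws, so a single 0.05 screen passes spurious effects 5%
of the time while screen-then-confirm cuts that to 0.25%.

```python
import random

random.seed(0)

ALPHA, TRIALS = 0.05, 10_000

# How often does a null "effect" survive one screen vs. the
# screen-on-1600-then-confirm-on-400 pipeline?
single = sum(random.random() < ALPHA for _ in range(TRIALS)) / TRIALS
both = sum(random.random() < ALPHA and random.random() < ALPHA
           for _ in range(TRIALS)) / TRIALS
print(single, both)  # roughly 0.05 vs. roughly 0.0025
```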

~~~
kbutler
This is the standard recommendation, but does it really help?

You do that test on the second subset, and then you discard every theory that
doesn't pass both the 1600 set and the 400 set.

So you end up with the predictions that pass all the data in your original
2000 samples.

Is it really any better at generalizing to new data? If so, can you just
evaluate your theories on the unpartitioned data by doing randomized subset
testing after-the-fact?

~~~
BoiledCabbage
It would probably help to see the math on it to confirm, but I believe that
if it's a spurious relationship, the probability it will pass both the 1600
and the 400 is less than the probability of it passing the 2000.
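
The math is direct for independent halves: a spurious (null) relationship
clears an alpha = 0.05 test on both the 1600 and the 400 with probability
alpha^2, versus alpha on the pooled 2000. The trade-off is power on real
effects, sketched here with a one-sided z-test (the 0.1 effect size is just
an illustrative number):

```python
from statistics import NormalDist

Z = NormalDist()
alpha = 0.05
crit = Z.inv_cdf(1 - alpha)  # one-sided threshold, for simplicity

def power(effect, n):
    """One-sided power of a z-test for a standardized mean `effect`."""
    return 1 - Z.cdf(crit - effect * n ** 0.5)

# Spurious relationship (effect = 0): passing both halves is alpha^2.
print(alpha * alpha, "vs", alpha)  # 0.0025 vs 0.05
# Real relationship: the split costs some power, the price of robustness.
print(power(0.1, 1600) * power(0.1, 400), "vs", power(0.1, 2000))
```

Note that power(0, n) is exactly alpha, so the null case is just the same
formula squared.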

~~~
ralmeida
It may (or may not) be less, but if scientists just discard the tests that
don't pass both, then passing both becomes the new filter, and we're back in
the same situation.

