

P-Values are not Error Probabilities (2003) [pdf] - gwern
http://www.uv.es/sestio/TechRep/tr14-03.pdf

======
cl42
Thanks for sharing this. In general, it's a real shame how few people know how
to interpret and use p values correctly. We work with a lot of businesses that
ask us to compare populations (e.g., through A/B tests), and people either (a)
don't care about the significance of the differences between populations, or
(b) are irrationally attached to p values.

Case in point #1: debating whether a p value of 0.051 versus one of 0.049
makes any real difference to the significance test.

Case in point #2: a p value of <0.001 but with a vanishingly small difference
in means between the populations. With enough data, everything is significant!
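
A quick sketch of case #2, assuming scipy (every number here is made up):

```python
# A tiny, practically meaningless shift in the mean becomes "significant"
# once the sample is large enough. All numbers are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=100.0, scale=15.0, size=1_000_000)
b = rng.normal(loc=100.1, scale=15.0, size=1_000_000)  # trivial 0.1 shift

t, p = stats.ttest_ind(a, b)
print(f"p = {p:.2e}")  # far below 0.001 despite a negligible effect
```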

End rant. :)

~~~
return0
Amen for #2! It's like the entire field of biology is an endless quest for low
p-values, regardless of whether the hypothesis is even interesting. It's as if
people aren't interested in thinking; they just want to publish something
significantly different.

~~~
cl42
haha, Sociology as well -- especially now that the web provides huge amounts
of behavioral data.

I much prefer how machine learning folks tend to focus on predictive accuracy,
though I guess that's not quite the same as understanding relationships
between specific variables while controlling for others.

~~~
sdenton4
There is a notion of 'feature importance,' which especially comes up in
decision trees and random forests, giving a notion of how much a particular
feature contributes to the overall prediction. It seems like combining
predictive power with feature importance would be an interesting alternate
route to demonstrating important correlations. (For example, maybe a model
predicts lung cancer with 90% precision, and 'is_smoker' has an 80% feature
importance.) Of course, these importances depend a lot on the other features
used by the model! If you include a lot of junk features and/or exclude other
important features, the importance of your pet feature will shoot up.
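
A minimal sketch of that route, assuming scikit-learn and an entirely made-up
dataset (the feature names and probabilities are hypothetical):

```python
# Fit a random forest on a synthetic dataset where one feature (called
# 'is_smoker' for illustration) drives the outcome and the rest are junk,
# then report accuracy alongside feature importances.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000
is_smoker = rng.integers(0, 2, size=n)
junk = rng.normal(size=(n, 5))  # irrelevant noise features
X = np.column_stack([is_smoker, junk])
y = (rng.random(n) < np.where(is_smoker == 1, 0.6, 0.1)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)

print("accuracy:", model.score(X_te, y_te))
print("importances:", model.feature_importances_.round(3))
# column 0 (is_smoker) should dominate; the junk columns split the remainder
```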

~~~
cl42
Hmm, interesting -- I never considered the idea of including junk features to
bias a model's importances toward whatever theoretically ambiguous idea you're
trying to promote. That's actually brilliant.

~~~
mbq
Shameless plug: I'm a co-author of a method that adds artificial junk features
and removes original ones that are likely noise, in order to approximate the
set of _all_ features relevant to the problem (rather than the standard "build
the best model" approach, which can be pretty deceiving).
[https://m2.icm.edu.pl/boruta](https://m2.icm.edu.pl/boruta)
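
For the flavor of it, here is a hand-rolled, single-pass sketch of the
shadow-feature idea (the actual method iterates and applies statistical
tests; `shadow_screen` is just an illustrative name):

```python
# Permute each column to create provably irrelevant "shadow" copies, fit a
# forest on the combined matrix, and keep only the features whose importance
# beats the best shadow.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def shadow_screen(X, y, random_state=0):
    rng = np.random.default_rng(random_state)
    shadows = np.column_stack([rng.permutation(col) for col in X.T])
    model = RandomForestClassifier(n_estimators=300, random_state=random_state)
    model.fit(np.hstack([X, shadows]), y)
    imp = model.feature_importances_
    k = X.shape[1]
    threshold = imp[k:].max()  # the best shadow sets the bar
    return np.flatnonzero(imp[:k] > threshold)  # indices of kept features
```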

------
bijection
[https://www.youtube.com/watch?v=5OL1RqHrZQ8](https://www.youtube.com/watch?v=5OL1RqHrZQ8)
is a good demonstration of this.

"I use pictures from the ESCI software to give a brief, easy account of the
Dance of the p Values. The simulation illustrates how enormously and
disastrously variable the p value is, simply because of sampling variability.
Never trust a p value!"
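
A rough simulation of the same dance, assuming scipy (sample size and effect
size chosen arbitrarily):

```python
# Ten repeated experiments drawn from the same two populations (a real
# half-standard-deviation effect) produce wildly different p values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
for _ in range(10):
    a = rng.normal(0.0, 1.0, size=32)
    b = rng.normal(0.5, 1.0, size=32)  # true effect of 0.5 sd
    print(f"p = {stats.ttest_ind(a, b).pvalue:.3f}")
# output typically ranges from p < .001 to p > .3 across the ten runs
```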

------
amluto
I found this paper to be quite interesting, but I have two issues with it.

1. At least at the beginning, it focuses excessively on the historical
aspects of statistics. For example, it says that "most applied researchers are
unmindful of the historical development of methods of statistical inference,
and of the conflation of Fisherian and Neyman–Pearson ideas." To me,
statisticians shouldn't /have/ to understand the history at all. As a
physicist, for example, I have absolutely no need to understand the evolution
of Ampère's theories, Faraday's theories, Maxwell's theories, etc. to apply
the laws of electricity and magnetism correctly.

2. The difference between p and alpha is central to the paper, but it doesn't
seem to have a cogent explanation of what that difference is. (It's very clear
who advocated for one and who advocated for the other, but that's not why the
difference is important.)

~~~
capnrefsmmat
The difference between p values and alpha levels is a bit subtle, and when I
first read this paper (while preparing my book, _Statistics Done Wrong_) it
took me a while to figure out.

Here's the idea. If you set alpha = 0.05, you will declare statistically
significant any result that gets a p value of 0.05 or less. When there is no
true difference to be found, you will have a 5% chance of falsely detecting
one.

But crucially, this applies on average to _all_ tests you conduct with this
alpha level. Even if an individual test gets p = 0.000001 or p = 0.04, the
_overall_ false positive rate will be 5%.

More succinctly, it doesn't make sense to ask for the false positive _rate_ of
a single test. What does that even mean? You can only ask for the false
positive rate of a procedure you use many times. So you can't get p = 0.01 and
declare this means you have a false positive rate of 1%.
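
A quick simulation of that long-run guarantee, assuming scipy (both samples
come from the same distribution, so every rejection is a false positive):

```python
# 10,000 t-tests under a true null: individual p values land all over the
# place, but the procedure "reject when p <= 0.05" fires ~5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_values = np.array([
    stats.ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
    for _ in range(10_000)
])
print("fraction rejected:", (p_values <= 0.05).mean())  # ~0.05
```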

~~~
philh
Possibly worth clarifying: the false positive rate
([https://en.wikipedia.org/wiki/False_positive_rate](https://en.wikipedia.org/wiki/False_positive_rate))
is "probability that a test will return positive, conditional on the
hypothesis being false". It's the rate of false positives within the set of
negatives, not the rate of false positives within all tests.
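
In toy numbers (all counts made up):

```python
# The FPR conditions on the negatives only, not on all tests run.
false_positives, true_negatives = 5, 95   # the 100 true negatives
true_positives, false_negatives = 40, 10  # the 50 true positives

total = false_positives + true_negatives + true_positives + false_negatives
fpr = false_positives / (false_positives + true_negatives)  # 5/100 = 0.05
frac_of_all = false_positives / total                       # 5/150 ~ 0.033
print(fpr, frac_of_all)  # only the first number is the false positive rate
```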

------
RA_Fisher
This is exactly why, when working with my team on A/B testing, I'm _always_
careful to use the phrase _meaningful_ difference (or not). With
internet-based tests, the volume of participants can be huge, so finding
statistical significance is a hell of a lot easier than finding a meaningful
difference. I like to say, "As N tends to infinity, we are guaranteed to find
significance."

The sad thing is that there are more testing platforms than I can count on one
hand that introduce this p-value fallacy to users. They encourage things like
repeatedly checking tests. I don't believe users realize that the p-value is
oscillating around alpha: if your test isn't significant at the moment, just
check a few hours later (it likely will be). Even with Bayesian methods, I've
learned the hard way that you really have to be patient and let certainty
accrue. More and more I'm led toward bandit methods for this reason. When
sampling real-time data, you're in effect treating the historical sample as
one part of a population whose other part lies in the future --- that's a
pretty dangerous assumption.

My solution has been to guide my team toward large tests and toward work that
_really_ moves the needle (with high statistical power). That's where
statistical analysis has the best chance of providing certainty. In general,
with web experiments it's best to proceed with a healthy amount of humility.
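
A sketch of that repeated-checking problem, assuming scipy (peek counts and
batch sizes are arbitrary; both arms are identical, so any detected "effect"
is a false positive):

```python
# Run a null A/B test, check after every batch, and stop at the first
# p <= 0.05. The nominal 5% error rate inflates substantially.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
hits = 0
for _ in range(1_000):
    a, b = [], []
    for _ in range(20):                 # 20 peeks, 50 users per arm per peek
        a.extend(rng.normal(size=50))
        b.extend(rng.normal(size=50))   # identical arms: the null is true
        if stats.ttest_ind(a, b).pvalue <= 0.05:
            hits += 1
            break
print("false positive rate with peeking:", hits / 1_000)  # well above 0.05
```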

------
thearn4
A good read. I remember asking my stats professor in undergrad about the
"whys" of the various hypothesis testing schemes we use, and he eventually
just told me that I should take a math-stats class in graduate school. I did
that (and it was pretty enjoyable), but it certainly raised just as many
questions as it answered, when it comes to the null testing --> hypothesis
rejection ritual that scientific disciplines have converged on!

------
misiti3780
I just finished a pretty interesting book on this topic:

"The Cult Of Statistical Significance":

[http://www.amazon.com/Cult-Statistical-Significance-
Economic...](http://www.amazon.com/Cult-Statistical-Significance-Economics-
Cognition/dp/0472050079/ref=sr_1_1?ie=UTF8&qid=1428371506&sr=8-1&keywords=statistical+significance)

It basically goes through a bunch of examples, mostly in economics but also in
medicine (Vioxx), where statistical significance has failed us and people have
died as a result. As someone who works with statistics for a living, I found
the book interesting - but it was pretty depressing to find out that most
scientists are using t-tests and p-values because they seem to be the status
quo and the easiest way to get published. The authors suggest a few different
things -- publishing the size of your coefficients and using a loss function.
In the end, they make the point that statistical significance is different
from economic significance, political significance, etc.

------
eranki
Here's an interesting paper on the prevalence of these misconceptions in both
students and teachers (at least in Germany).

[http://myweb.brooklyn.liu.edu/cortiz/PDF%20Files/Misinterpre...](http://myweb.brooklyn.liu.edu/cortiz/PDF%20Files/Misinterpretations%20of%20Significance.pdf)

TLDR: 80% of methodology instructors have a misconception about significance.
Scientific psychologists and students perform even worse.

------
cozzyd
Perhaps the FDA (or NIH?) should employ statisticians to evaluate claims in
medical journals where the stakes are potentially higher.

------
jules
That's why significance and hypothesis testing should die out and be replaced
by Bayesian inference.

------
bryanl
TLDR:

"p’s and α’s are not the same thing; they measure different concepts"

------
enupten
Also see
[http://www.nature.com/nature/journal/v483/n7391/full/483531a...](http://www.nature.com/nature/journal/v483/n7391/full/483531a.html),
and a counter-view: [http://www.nature.com/news/reproducibility-the-risks-
of-the-...](http://www.nature.com/news/reproducibility-the-risks-of-the-
replication-drive-1.14184)

