

The harm done by tests of significance (2003) [pdf] - luu
http://andrewgelman.com/wp-content/uploads/2014/12/1154-The-Harm-done-by-tests-of-significance.pdf

======
capnrefsmmat
I found this paper a couple years back when I was writing about statistical
errors. I love the right turn on red example -- an everyday situation where
bad statistics leads to extra deaths every year. (Somewhere on the order of
10-100, I think.)

The problem, turning "not statistically significant" into "there is no
difference," happens all the time in just about every field of science. Often
you see people report "three studies found that this medicine works, but two
found that it didn't" and conclude that the evidence is contradictory and
can't be trusted. But if you look at the effect sizes, you see the five
studies found nearly the same answers -- it's just that two of them didn't
quite cross the threshold for significance.

I wish I had a way of teaching statistical thinking more clearly than standard
intro classes. It's so weird and counter-intuitive that very few people get it
right. I've given it a shot by writing a book
([http://www.statisticsdonewrong.com/](http://www.statisticsdonewrong.com/))
but there's a lot more to be done.

~~~
christopheraden
>But if you look at the effect sizes, you see the five studies found nearly
the same answers -- it's just that two of them didn't quite cross the
threshold for significance.

This is why I wish meta-analysis were introduced much earlier than it is in
statistics education. There are sensible ways of combining the information
from the five studies, weighting them according to their sample size (provided
the studies are similar in design and cohorts). =)
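
A minimal sketch of what such a combination can look like -- a fixed-effect
(inverse-variance) meta-analysis in Python. The effect sizes and standard
errors below are invented purely for illustration, not taken from any real
studies:

    # Fixed-effect (inverse-variance) meta-analysis sketch.
    # The effect sizes and standard errors are made-up illustrations.
    import math

    # Hypothetical per-study effects (e.g. log risk ratios) and standard errors.
    effects = [0.30, 0.28, 0.33, 0.25, 0.31]
    std_errs = [0.12, 0.16, 0.14, 0.15, 0.13]

    # Weight each study by the inverse of its variance (bigger studies count more).
    weights = [1.0 / se**2 for se in std_errs]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))

    print(f"pooled effect = {pooled:.3f} +/- {1.96 * pooled_se:.3f} (95% half-width)")

Five studies that individually straddle the significance threshold can easily
give a pooled estimate that is both precise and clearly nonzero, which is the
whole point of combining them.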

------
jeffreyrogers
Deirdre McCloskey (an economist) has an entire book devoted to this[1]. Her
article here:
[http://www.deirdremccloskey.com/docs/jsm.pdf](http://www.deirdremccloskey.com/docs/jsm.pdf)
covers the main argument in the book. One important point she makes is that
not all fields misuse p-values and statistical significance. In physics,
significance is almost always used appropriately, while in the social sciences
(including economics) statistical significance is often conflated with actual
significance.

[1]: http://www.amazon.com/The-Cult-Statistical-Significance-Economics/dp/0472050079

~~~
sukilot
That difference is likely because reality won't believe you if you state the
significance wrong, but people will.

------
mdbco
The single paragraph in the postscript of this paper (part 6) is actually
really important. It's very common for people using statistical testing in
applied settings to entirely forget about type II error (and correspondingly,
the power of the test), so when they see a p-value that isn't significant at a
certain level (say 5%), they simply assume the null hypothesis is true.

Of course, this is not correct; all we can really say is that the test did not
reject the null, given the size (type I error rate) and power (one minus the
type II error rate) of the test. It's entirely possible that the null should
be rejected, but the test is just not very good (i.e. it might have the
correct size, but very poor power).

So given some complex and eccentric real-world data, how can we figure out
what the power of a given test might be in practice? If you have some idea of
what the data generating process might look like then one option is to do some
simulations. This enables you to see what the size and power properties of
your test are by empirically measuring the type I and type II error rates.
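
A minimal sketch of such a simulation, assuming a two-sample t-test and a
normal data-generating process; the sample size, true effect, and noise level
are arbitrary choices for illustration, not anything from the paper:

    # Estimate the size and power of a two-sample t-test by simulation.
    # Sample size, true effect, and noise level are illustrative assumptions.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, sigma, alpha, reps = 30, 1.0, 0.05, 10_000

    def reject_rate(true_effect):
        """Fraction of simulated experiments rejecting the null at level alpha."""
        rejections = 0
        for _ in range(reps):
            control = rng.normal(0.0, sigma, n)
            treated = rng.normal(true_effect, sigma, n)
            _, p = stats.ttest_ind(treated, control)
            if p < alpha:
                rejections += 1
        return rejections / reps

    print("empirical size  (no effect):    ", reject_rate(0.0))  # should be near alpha
    print("empirical power (effect = 0.3): ", reject_rate(0.3))  # often surprisingly low

With 30 observations per group and a true effect of 0.3 standard deviations,
the power comes out well under 50%, so "not significant" says very little
about whether the effect is real.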

------
vancan1ty
People often have trouble applying statistical methods correctly, and
statistics can often be manipulated to tell a given story. And indeed, P=0.03
on its own is meaningless without an understanding of how the study was set up
and a plausible hypothesis.

But inferential statistics is grounded in sound theory, and, used correctly
and with appropriate assumptions, it is a powerful tool for reasoning about
data. Without it, how are you supposed to reason about (for example) study
results? Appeal to intuition?

It seems to me that significance tests are not all-powerful or foolproof, but
they are still a very valuable tool.

~~~
jwmerrill
> Without [inferential statistics], how are you supposed to reason about (for
> example) study results? Appeal to intuition?

The alternative, which the author of the linked article mentions, but doesn't
really emphasize, is to report a best estimate of whatever effect you are
trying to measure, along with some measure of uncertainty in that estimate.

For example, instead of "we failed to find significant evidence that right
turn on red increases the expected number of fatalities," you say "our best
estimate of the expected increase in fatalities due to right turn on red is
200 +/- 210."

This approach puts the most relevant information front and center, and it
seems to me, encourages better intuitive reasoning. It's what engineers and
most of the hard sciences do most of the time.

You do also need to say something about the meaning of your uncertainty
estimate (e.g. it's 1 sigma, or 2 sigma, or 95%), or alternatively, there
needs to be an understood convention for your field.
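
As a concrete (and entirely made-up) illustration of this reporting style,
here is a sketch that turns hypothetical before/after fatality counts into a
best estimate plus a 95% interval; treating the counts as Poisson is my
assumption, not the paper's:

    # Report an estimate and its uncertainty rather than a significance verdict.
    # The fatality counts are invented; treating them as Poisson is an assumption.
    import math

    before, after = 1000, 1200  # hypothetical annual fatality counts

    estimate = after - before
    # For Poisson counts, the variance of each count is roughly the count itself.
    std_err = math.sqrt(before + after)

    lo, hi = estimate - 1.96 * std_err, estimate + 1.96 * std_err
    print(f"estimated increase: {estimate} (95% CI roughly {lo:.0f} to {hi:.0f})")

Whether the interval happens to exclude zero matters less than seeing how
large the effect could plausibly be.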

~~~
vancan1ty
Is the method you described not another form of inferential statistics?
(Definitely not hypothesis testing however).

It seems that this is a combination of several techniques:

1. First, we either explicitly or implicitly choose a model to relate deaths
and RTOR laws (perhaps a linear relationship, i.e. [deaths w/ RTOR] =
a*[deaths w/o RTOR]).

2. Then we perform point estimation to estimate the parameter "a".

3. Then we compute a confidence interval.

With respect to the RTOR example in the paper, it seems to me that it WOULD be
incorrect to reject the null hypothesis that the change in crash numbers
arises from random chance for ANY INDIVIDUAL STUDY. In this case it seems that
you must figure out a way to transfer information between studies to establish
this idea of "statistical significance." Perhaps a survey of studies or the
use of Bayesian techniques would have resolved the difficulty.
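
One way to make "transfer information between studies" concrete is sequential
Bayesian updating. A minimal sketch, assuming a normal prior on the parameter
"a" and normal per-study estimates, with all the numbers invented for
illustration:

    # Sequential normal-normal Bayesian updating across studies.
    # The prior and the per-study (estimate, standard error) pairs for "a"
    # are invented purely for illustration.
    import math

    # Weakly informative prior on a: centered at 1 (no effect), very uncertain.
    post_mean, post_var = 1.0, 10.0**2

    # Hypothetical (estimate, standard error) pairs from individual studies.
    studies = [(1.3, 0.4), (1.2, 0.5), (1.4, 0.45)]

    for est, se in studies:
        # Combine the current posterior with the new study by precision weighting.
        prec = 1.0 / post_var + 1.0 / se**2
        post_mean = (post_mean / post_var + est / se**2) / prec
        post_var = 1.0 / prec

    print(f"posterior for a: {post_mean:.2f} +/- {1.96 * math.sqrt(post_var):.2f} (95%)")

With a vague enough prior this reduces to the same inverse-variance pooling a
fixed-effect meta-analysis would give.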

------
cgold
Related to this might be the decision of a psychology journal to remove
p-values from published articles:
[http://www.nature.com/news/psychology-journal-bans-p-values-1.17001](http://www.nature.com/news/psychology-journal-bans-p-values-1.17001)

Even though p-hacking is bad, I'm not sure whether banning p-values is good or
bad. The real question is what will be used to replace the p-values.

