

Sifting the evidence --- what's wrong with significance tests? - nkurz
http://physicaltherapyjournal.net/cgi/content/full/81/8/1464

======
idm
If a study cannot be replicated, then it's ripe for criticism; replication is
the last step in the scientific method, after all.

If a study, significant at p=0.05, is replicated again at p=0.05, then the
odds of that happening when there actually is no effect (assuming
independence) are the product of the two (0.0025). Note: there are more
rigorous ways to perform a meta-analysis than this...
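
A quick way to check that specific number (a sketch, assuming p-values are
uniform on [0, 1] when the null is true and that the studies are independent):

    import numpy as np

    rng = np.random.default_rng(0)
    n_pairs = 1_000_000

    # Under the null hypothesis, each study's p-value is uniform on [0, 1].
    p1 = rng.uniform(size=n_pairs)
    p2 = rng.uniform(size=n_pairs)

    # Fraction of independent pairs where *both* studies reach p < 0.05.
    print(np.mean((p1 < 0.05) & (p2 < 0.05)))  # ~0.0025 = 0.05 * 0.05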

I think p=0.05 is a good balance for noisy data. It's enough to publish, and
publication is frequently enough to prompt others to replicate the work.
That's good science.

The conclusion shouldn't be that "we need to use a stricter test of
significance (e.g., p<0.001)." Instead, I think it means the field is ripe for
replication studies and meta-analyses.

It can be hard on people's reputations when something is falsified, but let's
face it: falsifiability is one of the best things going for science. It looks
bad in the media, but it's really healthy for the field.

EDIT: I do want to mention that this paper is a really thoughtful analysis...

~~~
efaith
"If a study, significant at p=0.05, is replicated again at p=0.05, then the
odds of that happening when there actually is no effect (assuming
independence) are the product of the two (0.0025)."

I think that must be approximately right, but I'm not sure it is precisely
right. Imagine you run a very large number of studies, and suppose the null
hypothesis is in fact true in every one of them. Each P will land somewhere
between 0 and 1, usually below 1 (occasionally it may even equal 1). Now
multiply these Ps together: with enough studies, the product becomes
vanishingly small even if all the individual Ps are close to 1 and even though
the null hypothesis is true.

So I don't think a simple multiplication is exactly the right formula, though
it does seem as though it must be about right.
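
A toy simulation of that argument (assuming, again, that P is uniform on
[0, 1] under the null - here I deliberately keep every P above 0.5):

    import numpy as np

    rng = np.random.default_rng(0)

    # 100 studies, the null hypothesis true in every one of them.
    # Suppose every study comes back with P between 0.5 and 1 --
    # none of them individually significant at all.
    ps = rng.uniform(0.5, 1.0, size=100)

    # Yet the product shrinks toward zero as studies accumulate.
    print(np.prod(ps))  # on the order of 1e-14, despite no effect anywhere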

I googled the question and found a formula for combining the Ps of two
independent studies, but it's not simple multiplication. You start with the
Ps, find the corresponding Zs (I do not know what those are), then add them
and divide by sqrt(2). This gives you a new Z, and you then take the
corresponding P. Also, it requires that the Ps be one-tailed, so it is not a
fully general formula. I don't understand the Zs, but my point is: if it were
simply a matter of multiplying the Ps, why go to the trouble of adding Zs? I
found it here:

[http://books.google.com/books?id=nxOFMQYMIlgC&lpg=PA527&...](http://books.google.com/books?id=nxOFMQYMIlgC&lpg=PA527&ots=FJ_CKPasRN&dq=combining%20two%20studies%20significance&pg=PA527#v=onepage&q=combining%20two%20studies%20significance&f=false)
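
For what it's worth, here is my reading of that recipe in Python. The Zs
appear to be standard-normal quantiles; treat this as a sketch, with
one-tailed Ps as the book requires:

    from math import sqrt
    from scipy.stats import norm

    def combine_two(p1, p2):
        # Convert each one-tailed P to its standard-normal Z score.
        z1 = norm.isf(p1)  # isf = inverse survival function
        z2 = norm.isf(p2)
        # Add the Zs and divide by sqrt(2) so the result is again
        # standard normal under the null hypothesis.
        z = (z1 + z2) / sqrt(2)
        # Convert the combined Z back to a one-tailed P.
        return norm.sf(z)

    print(combine_two(0.05, 0.05))  # ~0.01 -- not 0.0025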

As for why it doesn't match the familiar formula for combining independent
probabilities (i.e., you simply multiply them), I think the answer lies in the
nature of P. P is not really "the probability of that result" but "the
probability of a result that is at least that extreme", and this subtly
different meaning changes how the individual Ps must be combined.
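
One consequence of that "at least that extreme" definition: the product of two
Ps is not itself a P, but it can be turned back into one. A standard way to do
this (Fisher's method, not from the linked book) uses the fact that -2*ln(P)
follows a chi-squared distribution with 2 degrees of freedom under the null:

    import numpy as np
    from scipy.stats import chi2

    def combine_by_product(pvalues):
        # Fisher's method: convert the product of Ps into a proper
        # combined P via the chi-squared distribution.
        stat = -2 * np.sum(np.log(pvalues))
        return chi2.sf(stat, df=2 * len(pvalues))

    print(combine_by_product([0.05, 0.05]))  # ~0.017 -- again, not 0.0025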

~~~
idm
You're right about my claim being approximate - the biggest problem with what
I said was "assuming independence." If you're replicating a study, it's
definitely not going to be a completely independent event. Usually, you copy
at least part of the methods, and that right there is a major dependency.

As for the definition of P, that's something they address in the original
article, and you are in agreement with the authors: P is _definitely_ not "the
probability of that result." Under the usual interpretation, P is the chance
of rejecting the null hypothesis when it is actually true (i.e., there is no
real effect, but your data randomly showed one).

------
drewcrawford
I'm not a statistician, but I found the following juxtaposition rather
contradictory:

>* Confidence intervals for the main results should always be included, but
90% rather than 95% levels should be used

>* When there is a meaningful null hypothesis, the strength of evidence
against it should be indexed by the P value. The smaller the P value, the
stronger is the evidence

This sounded to me like "You shouldn't use P-values to decide, except when you
do." (For non-mathy types, a P value is 1 minus the confidence level, so the
two measure the same quantity.)

I think some of this stems from their strongly voiced preference for Bayesian
statistics (there's something of a civil war among statisticians between
Bayesian and classical approaches). I've never been able to swallow the
Bayesian theories; they're just too weird for my feeble mind to grasp.

------
jibiki
Here's an interesting question for you. I do an experiment, and I reject the
null hypothesis with p = 0.04. Then I do the experiment again, and get p =
0.09. Does the second run of the experiment make me more or less confident
about rejecting the null hypothesis?
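
For what the methods upthread are worth here: combining the two runs with the
Z-based formula (assuming independent, one-tailed Ps - my assumption, not part
of the question) suggests the pair is jointly stronger evidence than either
run alone:

    from math import sqrt
    from scipy.stats import norm

    p1, p2 = 0.04, 0.09
    z = (norm.isf(p1) + norm.isf(p2)) / sqrt(2)
    print(norm.sf(z))  # ~0.014, smaller than either 0.04 or 0.09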

