

The Ironic Effect of Significant Results on Credibility (2012) [pdf] - gwern
http://www.ubc-emotionlab.ca/wp-content/uploads/2012/09/Schimmack-2012-Effect-of-Significance-on-Article-Credibility.pdf

======
DustinCalim
Here's the TLDR:

power = P(reject null hypothesis | null hypothesis is false)
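
(To make that concrete, here's a quick simulation sketch in Python; the
effect size, sample size, and alpha are made-up numbers for illustration,
not anything from the paper:)

    import numpy as np
    from scipy import stats

    # Assumed setup: true effect d = 0.4, n = 30 per group, alpha = .05.
    rng = np.random.default_rng(0)
    d, n, alpha, trials = 0.4, 30, 0.05, 10_000

    rejections = 0
    for _ in range(trials):
        a = rng.normal(0.0, 1.0, n)  # control group
        b = rng.normal(d, 1.0, n)    # treatment group: the null is false
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1

    # Fraction of rejections when the null really is false = power.
    print(rejections / trials)  # roughly 0.33 for these numbers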

From the abstract:

 _I conclude with several recommendations that can increase the credibility of
scientific evidence in psychological journals. One major recommendation is to
pay more attention to the power of studies to produce positive results without
the help of questionable research practices and to request that authors
justify sample sizes with a priori predictions of effect sizes. It is also
important to publish replication studies with nonsignificant results if these
studies have high power to replicate a published finding._

------
analog31
I have a rule of thumb, which is to multiply p by 10. My limited knowledge of
statistics tells me that the "correct" value of p depends on knowing in
advance all of the hidden correlations and sources of bias in the experiment.
Padding the p value tenfold takes care of this.

So the gold standard of p < 5% (p stands for publish) is, in practical
terms, about as good as a coin toss.

------
bksenior
Anyone got a tl;dr for this?

~~~
capnrefsmmat
Yes.

Most scientific studies have a sample size too small to reliably detect the
effect they're looking for, so they'd have to be very lucky to get a
statistically significant result _even if there's a real effect_. There's
just not enough data to distinguish signal from noise.

So if you see a paper that says "We ran ten different experiments and every
one was statistically significant, proving we were right", you should be
_less_ convinced, because this suggests they cheated. Even if all ten
experimental hypotheses were right, there's a very small chance they would
obtain statistically significant results for all of them. This suggests they
conducted other unsuccessful experiments but did not report their results, or
that they chose their hypotheses _after_ running the experiments.
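
(To put a rough number on it: if each of the ten studies had, say, 60%
power, the chance of all ten coming out significant would be 0.6^10, or
about 0.6%.)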

There's a psychologist, Gregory Francis, who has published a series of papers
investigating this. He looks at a series of results published on some
phenomenon, calculates how many you'd _expect_ to be statistically significant
if the effect genuinely exists, and then shows that this number is greatly
exceeded. This suggests there is publication bias.
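
(A toy version of the logic, with made-up power values; this is my sketch,
not Francis's actual code:)

    import math

    # Hypothetical post-hoc power estimates for five published experiments,
    # all of which reported statistically significant results.
    powers = [0.55, 0.62, 0.48, 0.70, 0.58]

    # If each study were an honest, independent test of a real effect, the
    # chance that *all* of them came out significant is the product of the
    # individual powers.
    p_all_significant = math.prod(powers)

    print(p_all_significant)        # ~0.066
    # Francis treats values below 0.1 as evidence of bias in the set.
    print(p_all_significant < 0.1)  # True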

Of course, Francis got into hot water by selectively publishing the results of
his studies into publication bias, only reporting a result if he found
evidence of bias. So there's a bit of irony in it all.

~~~
gfrancis
Nice summary. I wanted to clarify one point. Selective reporting (of
publication bias or most other kinds of tests) limits the inferences that can
be drawn from the analyses, but it does not necessarily invalidate the
properties of the test itself. If my selective reporting indicates that 6 out
of 7 articles appear to be biased, it would be improper to infer that 86% of
psychology articles are biased. Nevertheless, those 6 articles appear biased
whether you make such an inference or not. I think scientists care about the
validity of specific data/theory more than about the rate of bias across a
field. However, if you want an estimate of the latter, you can look at a paper
I published that systematically investigated bias in the journal Psychological
Science:

[http://link.springer.com/article/10.3758%2Fs13423-014-0601-x](http://link.springer.com/article/10.3758%2Fs13423-014-0601-x)

-Greg Francis

~~~
capnrefsmmat
I'm not sure I completely buy that argument. In the extreme case, you could
test 100 articles and find p < 0.05 evidence that 5 of them are biased. If you
reported only on those 5, we would be misled, because it's exactly the result
we'd expect under the null. I think this is a point Simonsohn made when
critiquing your papers, although I didn't follow the discussion after that.

Thanks for the link, though; that's exactly the kind of analysis I was hoping
to see. I'll have to read through it.

~~~
gfrancis
If I selectively reported only 5 out of 100 analyses, you would only be misled
if you inferred that a high percentage of articles were biased. I am
specifically warning you not to make such an inference because the selective
reporting invalidates it.

Perhaps those 5 articles that failed the analysis are actually unbiased and
were (un)lucky in repeatedly rejecting the null with seemingly low-powered
experiments. That would be a kind of Type I error regarding the conclusion of
bias. But such errors are inevitable when making decisions in a stochastic
setting. The findings in those articles _do_ appear odd, so scientists
_should_ be skeptical about their conclusions. To do otherwise is to throw out
the basic ideas of hypothesis testing (in which case you are skeptical about
all of the articles rather than just the 5 odd ones).

In one sense, I think you are worrying about prior probabilities. Your concern
is that if some of the articles actually generated data or theorised in an
unbiased way, then we have to worry about the Type I errors. You can reduce
the occurrence of such errors by using a more stringent criterion, but you can
never remove them entirely (and the 0.1 criterion commonly used for tests of
publication bias seems very conservative; in simulated cases the Type I error
rate is close to 0.01). Either we accept that we will sometimes make
errors, or we do not make decisions.

Curiously, the publication bias analyses have also been criticised for exactly
the opposite prior: that all articles are biased, so there is nothing learned
when the analysis concludes bias. (Simonsohn makes both arguments, even though
they contradict each other.) I think this prior is unjustified. Publication
bias seems to be a common problem, but I think it is not warranted to assume
that experiments are biased by default. If that negative view is appropriate,
then we should all close shop (and funding agencies should stop supporting
such flawed research). My view is more pragmatic. There are problems with many
experimental studies, but some work appears quite solid. We should make an
effort to distinguish between them.

~~~
capnrefsmmat
Another way of putting the argument is that, if the base rate of publication
bias is low, your bias tests will have a very high false discovery rate. That's
inevitable, yes, but it means I would not treat your individual bias results
as particularly strong.

But the survey of _Psychological Science_ papers makes this point moot, since
it shows that the base rate is fairly high.

It would be interesting to see whether there are other approaches to detecting
publication bias. In medicine, several reviews have spotted outcome reporting
bias by comparing clinical trial protocols given to ethics committees with the
published papers. Typically the clinical trial protocol includes several
outcomes that aren't in the paper. Other reviews search clinical trial
registries to find publication bias.

Is there something analogous in psychology? You could presumably track down
ethics board documents and figure out what fraction of experiments never
appear, then do a sensitivity analysis on the published results. I don't know
if this has been done.

I assume you're familiar with this survey that found about 50% of
psychologists admit to publication bias: John, L. K., Loewenstein, G., &
Prelec, D. (2012). Measuring the prevalence of questionable research practices
with incentives for truth telling. Psychological Science, 23(5), 524–532.
doi:10.1177/0956797611430953

~~~
gfrancis
If you feel that the conclusions of the individual bias analyses are not
convincing, then you probably should not trust the estimated base rate from
the analysis of the Psych Science articles. I think the results are
convincing, and I am not sure why you say there will be a high false discovery
rate (a false discovery means concluding bias when it does not exist, a Type I
error). If bias does not exist, then (in the ideal case where true
experimental power is known) the Type I error rate is 0.1. When power is
estimated from the data, the Type I error rate is generally much smaller than
0.1.

As for a conclusion of bias being "strong", I think we have to define the
term. Certainly I would not suggest that a conclusion of bias means that
authors are being dishonest or stupid. There are many ways for published
results to become biased. However, I think the appearance of bias does mean
that scientists should be skeptical about the validity of the reported
results. The onus is on the original authors to provide good support for
their theoretical ideas, and biased data generally do not provide such
support.

The closest example I know of that relates to your description of contrasting
registries and published results is a comparison of theses and subsequent
journal articles. Details are at

[http://jom.sagepub.com/content/early/2014/03/18/014920631452...](http://jom.sagepub.com/content/early/2014/03/18/0149206314527133.full)

~~~
capnrefsmmat
The false discovery rate isn't the same as the type I error rate. It's larger.
If, for example, 10% of articles have publication bias, your test has 100%
power to detect these, and the type I rate is 10%, your false discovery rate
will be 47%.

(That is, of 100 papers tested, the 10 biased ones are all caught, and 10% of
the 90 unbiased ones, 9 papers, are wrongly flagged, so 9 / 19 ≈ 47%.)

So in that ideal scenario, about half of your published findings of bias would
be false positives. This means that your analysis of _Psychological Science_
would overestimate the base rate by a factor of 2. Of course, the
overestimation amount depends on the true rate, which we don't know.
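
(The arithmetic as a tiny Python function, with a made-up name:)

    def false_discovery_rate(base_rate, power, alpha):
        """Expected share of 'bias detected' findings that are false alarms."""
        true_pos = base_rate * power         # biased papers correctly flagged
        false_pos = (1 - base_rate) * alpha  # unbiased papers wrongly flagged
        return false_pos / (true_pos + false_pos)

    print(false_discovery_rate(0.10, 1.0, 0.10))  # ~0.47, the example above
    print(false_discovery_rate(0.10, 0.5, 0.10))  # ~0.64: lower power, worse FDR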

Now, you say the true type I rate is much smaller, so perhaps that solves some
of the problem. But I doubt you detect bias with 100% power, and smaller power
means a higher false discovery rate.

Thanks for the link; that looks like what I was hoping for. It'd be
interesting to see something similar for a broader sample of papers (not just
those that started as theses).

~~~
gfrancis
Thanks for the clarification about the FDR. Obviously, its calculation very
much depends on the base rate probabilities and the sensitivity of the method.
The method's ability to detect bias depends heavily on the type of bias being
applied. When only a file-drawer bias is at work, the method is unlikely to
detect it (because the power values are dramatically overestimated). On the
other hand, if optional stopping and a file drawer are both used, the method
has a good chance of detecting bias (because all the reported experiments stop
as soon as they satisfy the p < .05 criterion).
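
(For readers unfamiliar with optional stopping, a toy simulation; the peeking
schedule is arbitrary and my own. Even with no effect at all, stopping
whenever p dips below .05 inflates the false positive rate well past the
nominal 5%:)

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    trials, max_n = 2_000, 100
    false_positives = 0

    for _ in range(trials):
        data = rng.normal(0.0, 1.0, max_n)  # the null is true: zero effect
        # Peek after every 10 subjects; stop as soon as p < .05.
        for n in range(10, max_n + 1, 10):
            if stats.ttest_1samp(data[:n], 0.0).pvalue < 0.05:
                false_positives += 1
                break

    print(false_positives / trials)  # well above .05 (around 0.15-0.2 here)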

Ioannidis used some of my simulation results to calculate likelihood ratios,
which are somewhat related to the FDR. Details are at

[http://www.sciencedirect.com/science/article/pii/S0022249613...](http://www.sciencedirect.com/science/article/pii/S0022249613000278)

