
Psychology study that induced "reproducibility crisis" was wrong: researchers - signor_bosco
http://www.slate.com/blogs/the_slatest/2016/03/03/psychology_study_that_induced_the_reproducibility_crisis_was_wrong.html
======
Certhas
Or it wasn't, e.g.:

[http://neuroneurotic.net/2016/03/03/the-non-replicability-of...](http://neuroneurotic.net/2016/03/03/the-non-replicability-of-soft-science/)

 _From where I’m standing, social and other forms of traditional psychology
can’t say the same. Small contextual or methodological differences can quite
likely skew the results because the mind is a damn complex thing. For that
reason alone, we should expect psychology to have low replicability and the
effect sizes should be pretty small (i.e. smaller than what is common in the
literature) because they will always be diluted by a multitude of independent
factors. Perhaps more than any other field, psychology can benefit from
preregistering experimental protocols to delineate the exploratory garden-path
from hypothesis-driven confirmatory results.

I agree that a direct replication of a contextually dependent effect in a
different country and at a different time makes little sense but that is no
excuse. If you just say that the effects are so context-specific it is
difficult to replicate them, you are bound to end up chasing lots of phantoms.
And that isn’t science – not even a “soft” one._

~~~
vacri
The article explicitly says that the 39% reproducibility isn't that far off
what you would probably expect anyway.

The article then goes on to add that the meta-study was awful in its study
selection, giving the example: _OSC researchers tried to reproduce an American
study that dealt with Stanford University students’ attitudes toward
affirmative action policies by using Dutch students at the University of
Amsterdam_.

The article then goes on to say that reproducibility is still a problem in
science, but that the meta-study was simply terrible in its own methodology.
Low reproducibility is less of an issue given the context of the problem
space, the article says; the meta-study _was_ wrong, both methodologically and
in conclusion ('low reproducibility' != 'unethical' or 'crisis').

I recommend reading the fine article.

~~~
logicrook
>OSC researchers tried to reproduce an American study that dealt with Stanford
University students’ attitudes toward affirmative action policies by using
Dutch students at the University of Amsterdam.

If they draw conclusions about "students" based on this study, then reproducing
it on Dutch students is perfectly fine. If they draw conclusions about
"Stanford University students", then it is a bad reproduction. There's a very
good justification for this distinction, which you can interpret in terms of
subtyping, or the Liskov substitution principle (see the sketch below).

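To illustrate with a throwaway Python sketch (the class names and claim-checking
functions here are hypothetical, not anything from the actual study):

    class Student:
        """The supertype: a claim about "students" should hold for any student."""
        def __init__(self, opinion_on_affirmative_action):
            self.opinion = opinion_on_affirmative_action

    class StanfordStudent(Student):
        """One specific population of students."""

    class AmsterdamStudent(Student):
        """Another specific population: Dutch students at the UvA."""

    def claim_about_students(s: Student) -> bool:
        # If the paper's conclusion is really about Student (the supertype),
        # substitutability says it must hold for an AmsterdamStudent too, so a
        # Dutch replication is a fair test of it.
        return s.opinion is not None

    def claim_about_stanford_students(s: StanfordStudent) -> bool:
        # If the conclusion is only about StanfordStudent (a subtype), testing
        # it on an AmsterdamStudent checks a different population than the one
        # the claim was made about, i.e. a bad reproduction.
        return s.opinion is not None
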
>'low reproducibility' != 'unethical'

This is akin to saying "ok there's this theorem, but I won't give proof, just
a bogus heuristic argument", or even worse, saying "that person is guilty but
I won't justify why". Both these behaviors are unacceptable. Why would "social
sciences" be above that?

> or 'crisis'

A field is filled with politically oriented baseless claims, but this is not a
crisis, just another day in Oceania.

That said, if the meta-study was sloppy, then that still shows that research
standards should have been upped a while ago; the "terrible" meta-study was
accepted _precisely_ because terrible papers are accepted in the first place.

~~~
vacri
> _If they do conclude things on "students" based on this study, then
> reproducing it on Dutch students is perfectly fine._

No. Just fucking 'no'. It was a study on affirmative action policies, repeated
in a distant country with different public opinions on welfare and social
fairness. If you think that that's fine, then you really have no claim to be
declaring what psychology should and shouldn't be.

> _filled with politically oriented baseless claims_

Fuck me, but it's funny when detractors claim that psychology is full of
unscientific behaviour, while they're wildly overgeneralising about the entire
field based on a subsection of it.

~~~
mystikal
His point was about whether the term "students" is qualified.

~~~
vacri
If that's the reason for including the study, then it was a bad study to
choose to replicate, as it would clearly not be replicable for students of
other cultures - in which case, the meta-study researchers are stacking the
deck; they would be guilty of the very thing they're decrying.

------
edtechdev
To me, a big problem with traditional psychology studies (at least in
educational psychology) is generalizability (or what is called external
validity), not so much reproducibility. A typical study “includes participants
[like freshmen psych majors] who have no specific interest in learning the
domain involved and who are also given a very short study time”, often mere
minutes. When researched in real-world settings or even just real classroom
settings, the effects might not only diminish but even reverse.

Most of the popular books on learning now suffer from this, emphasizing
strategies that work fine for short-term rote memorization tasks (like
memorizing random word lists), but have little impact on or can even hurt
higher-order learning and understanding.

~~~
mettamage
IMO generalizability is a big problem in psychology as a field. The paper
"The weirdest people in the world?" explains it quite well:
[http://www.econstor.eu/bitstream/10419/43616/1/626014360.pdf](http://www.econstor.eu/bitstream/10419/43616/1/626014360.pdf)
(open access)

~~~
Xcelerate
That's a _very_ interesting article by the way. I've been reading the whole
thing front to back.

~~~
mettamage
Haha, I had the same experience! :D

------
aaron695
This article is just plain scary.

That the 'researchers' chose to do an exclusive pre-release with a
journalist/media outlet before anyone could check the data says it all: they
are not scientists.

The more I read about psychology, the more I think it really, really isn't
currently a real science.

[https://xkcd.com/435/](https://xkcd.com/435/) \- Currently I think it starts
to cut off around chemistry to be honest.

------
analog31
Perhaps a parallel question might be: Does the publication of psychology (or
medical, etc.) studies have the net effect of producing a body of reliable
knowledge?

------
ars
Am I reading this right?

They are saying there is no reproducibility problem, because it's perfectly
normal to be able to reproduce just 39% of studies?

~~~
rosser
If, for sake of a convenient number, half of the original studies had
methodological errors that would make the result irreproducible — irrespective
of whether the study's hypothesis was correct, mind! — and half of the studies
attempting to reproduce the results of the original study committed some form
(possibly even the same form) of methodological error as well, then just on
that basis, we should reasonably expect to see a whopping _25%_ of studies
being successfully reproduced.

Whatever the actual error rates turn out to be, something like that is going
to cut pretty deeply into measured reproducibility rates, isn't it?

~~~
p4wnc6
Why should we expect methodological errors for the replications to be anywhere
near that? I'd expect them to be in the low single digit percentages. And
across different replications, such methodological errors should rarely be
correlated, so as long as the rate is merely less than 50%, several distinct
failures to replicate count as solid evidence that the original result was
spurious.

~~~
rosser
You're nit-picking the numbers I directly admitted I was making up on the spot
for the sake of illustrating the broader _and more salient_ point: that the
rates of methodological error in the original and replication studies have a
multiplicative effect on the overall reproducibility rate.

~~~
p4wnc6
I don't think so. Saying that only 1/4 of replications are, on average, useful
because of methodological flaws is an extreme claim. Given the costs
associated with reproductions, if anything even close to 3/4 of them are on
average untrustworthy it completely changes the game in terms of spending
resources on reproduction, which in turn frees original investigators to care
even less about reproducibility from the start.

On the other hand, if probabilities of methodological flaws are low for
reproductions, then the multiplicative effects of the original method and the
reproduction method are dominated by the original method, in such a way that
more replications means multiplying probabilities together on the replication
side, which (if flaw rates are less than 50%) goes to zero pretty quickly.

Meaning that consistency among reproductions is strong evidence regarding the
validity of the original, and so the replications are worth the cost.
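
As a rough Python sketch (the flaw rate is a made-up placeholder; any value
under 0.5 behaves the same way):

    # Hypothetical per-replication flaw rate, for illustration only.
    flaw_rate = 0.05

    # Probability that ALL k independent replications are flawed, i.e. that a
    # consistent set of replication results tells us nothing about the original.
    for k in range(1, 6):
        print(k, flaw_rate ** k)
    # k=1 -> 0.05, k=2 -> ~0.0025, ...: it drops off geometrically, so agreement
    # across several independent replications is strong evidence either way.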

There's a significant _qualitative_ difference between cases where
replications are likely to be flawed vs. cases where they aren't. Cases where
different replications are likely to have correlated flaws (even when flaws
overall are unlikely) would also be problematic, because different
replications might not give you additional information.

I think it matters to be very clear about all this, and not give any
impressions that could be easily misconstrued to cast doubt on the utility of
replications. Making some back of the envelope estimates using a 50% flaw rate
strikes me as a dangerous way to talk about it (unless we have evidence that
the flaw rate really is that high).

~~~
rosser
You're still picking on the _actual, specific number_, and ignoring (or
otherwise completely missing) the part where I specifically said, "for sake of
a convenient number". I chose 50% _because it makes the math easy_. Period.
Full stop. End of its value or contribution to the discussion.

I have exactly zero idea what the actual ratios are. But the actual ratios
absolutely _do not matter_ for the purposes of illustrating how the
methodological errors compound.

~~~
p4wnc6
The most salient detail about the way methodological errors compound is that
by engaging in more replications, you can beat down the rate of spurious
results in the replications. It's of little importance that it is
multiplicative with the flaw rate of the original, because the flaw rate of a
set of independent replications will go to zero super fast in the number of
replications.

If the flaw rate is >= 50% and the number of replications is low, it presents
a case where it's of little social value to do replications, depending on
their cost (which is usually high).

From your original comment, the way a naive reader will view it is that
replications often don't work, because you primed them with a number you just
pulled out of nowhere (50%). That will matter much more than the other detail
about it being multiplicative, which doesn't really matter much at all no
matter what the flaw rates are, because it all hinges on the flaw rates of the
replications and how many replications you do anyway.

I know that _you_ did not mean to misrepresent it with the 50% number. I'm
just harping on this because by happening to use a number like 50% it has the
potential to distract and do more harm than good, regardless of anything else
in your comment or even whether or not the 50% had anything to do with your
main point.

------
hackuser
The claims of the original study, the one that found problems with
reproducibility, are being overstated. From the news report at the time:

 _Strictly on the basis of significance — a statistical measure of how likely
it is that a result did not occur by chance — 35 of the studies held up, and
62 did not. (Three were excluded because their significance was not clear.)
The overall “effect size,” a measure of the strength of a finding, dropped by
about half across all of the studies. Yet very few of the redone studies
contradicted the original ones; their results were simply weaker._

Also: _The only factor that did [affect the likelihood of successful
reproduction] was the strength of the original effect — that is, the most
robust findings tended to remain easily detectable, if not necessarily as
strong._

Here's the HN discussion of the study:

[https://news.ycombinator.com/item?id=10131387](https://news.ycombinator.com/item?id=10131387)

~~~
nonbel
>"Strictly on the basis of significance — a statistical measure of how likely
it is that a result did not occur by chance"

Nope!

~~~
AstralStorm
Indeed, what it might have meant is "strictly on the basis of p-values" or
confidence intervals. Still, that is very risky: p-values attained using
different tests cannot be compared. More to the point, p-values have no meaning
if you do not have a control group, because you do not have access to a
population that might be under the null hypothesis. P-values are sometimes
computed against a non-null hypothesis, e.g. the "null" condition is giving
people $5 rather than nothing, while the main condition gives $100. This is not
valid. Or against the same subgroup, which might not really be representative.

~~~
nonbel
I don't think your point is the same as mine. The p-value only tells you how
likely it would be to see a deviation at least as extreme as the one in your
data, if the model you are testing were true. That's it.
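
A tiny Python simulation of exactly that definition (the sample size and
observed mean are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # Invented example: the model under test says the population mean is 0 with
    # unit variance, and we observed a sample mean of 0.4 from n = 25 people.
    n, observed_mean = 25, 0.4

    # Simulate many datasets from the model being tested and ask how often their
    # sample mean deviates at least as far from 0 as the observed one did.
    sim_means = rng.normal(loc=0.0, scale=1.0, size=(100_000, n)).mean(axis=1)
    p_value = np.mean(np.abs(sim_means) >= abs(observed_mean))
    print(p_value)  # roughly 0.045: how likely a deviation this extreme would be
                    # if the tested model were true, and nothing more than that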

------
aaron695
This is incorrect, right?

"If all 100 of the original studies examined by OSC had reported true effects,
then sampling error alone should cause 5% of the replication studies to “fail”
by producing results that fall outside the 95% confidence interval of the
original study"

It'd have to be a pretty poor (read: corrupt) field of study for all of the
random-ish studies chosen to sit close to 95%?

If the original studies are legit, then many/most of them would be at much
higher confidence levels?

Only in scam fields would they all be around 95%?

[https://xkcd.com/882/](https://xkcd.com/882/) \- this is more about how bad
science creeps into legitimate fields; the authors would be implying all the
studies in this field are this bad.

This would be my reading, not sure....

~~~
maxander
You aren't by any chance a physicist or something, are you? :)

95% confidence is just about the lowest that a scientist in any field would
think about publishing with, but in many fields- especially those whose
subjects are the delicate and high-variance animals known as humans-
statistical confidence is _expensive_. It wouldn't surprise me if lots of
papers are put out that just barely reach the lowest standards of evidence,
because grant-funded researchers couldn't afford the 1000 more test subjects
it would take to get another sigma.

Which isn't _great_, but strictly speaking it's not doing society a
disservice, either- even uncertain knowledge decreases the entropy of our vast
ignorance. It does, however, add an important dimension to how these studies
should be interpreted- that small chance that any given study is incorrect
_matters_, particularly when there's such a mind-bogglingly huge amount of
research being done in the modern day. It's entirely unsurprising to read about
a study claiming with p < 0.01 that a glass of wine a day will make your hair
fall out, when you consider that there's been _many thousands_ of papers
published about things like wine. Confidence- and p-values alone aren't going
to save us there.
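
The wine example in back-of-the-envelope Python form (all numbers invented,
just to show how research volume alone produces false positives):

    # Invented numbers, purely to illustrate the research-volume point above.
    null_effect_studies = 5_000   # hypothetical studies of effects that aren't real
    alpha = 0.05                  # conventional significance threshold

    expected_false_positives = null_effect_studies * alpha
    print(expected_false_positives)  # about 250 "significant" findings expected
                                     # by chance alone, with no real effect anywhere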

~~~
aaron695
Yes, I see you're correct.

Most studies would choose a 95% confidence level, so their statement stands.

------
jkldotio
The Bayesian take on this issue is interesting.[0] In their analysis of 72
replications, only 8 had evidence pointing in the other direction with a
strong Bayes factor.

[0] [http://alexanderetz.com/2016/02/26/the-bayesian-rpp-take-2/](http://alexanderetz.com/2016/02/26/the-bayesian-rpp-take-2/) and
[http://journals.plos.org/plosone/article?id=10.1371/journal....](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0149794)
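
For anyone unfamiliar with the term, a Bayes factor is just the ratio of how
well two competing hypotheses predict the observed data. A toy Python
calculation with invented numbers (not the linked analysis):

    from scipy.stats import norm

    # Invented toy data: an observed effect estimate and its standard error.
    effect_estimate, standard_error = 0.30, 0.15

    # How well does each hypothesis predict this estimate?
    # H0: the true effect is zero. H1 (toy point alternative): the effect is 0.5.
    likelihood_h0 = norm.pdf(effect_estimate, loc=0.0, scale=standard_error)
    likelihood_h1 = norm.pdf(effect_estimate, loc=0.5, scale=standard_error)

    bayes_factor_10 = likelihood_h1 / likelihood_h0
    print(bayes_factor_10)  # values > 1 favour H1, < 1 favour H0; here about 3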

------
flycaliguy
I've read a couple of write-ups on this situation and I still haven't figured
out how they picked the 100 studies. Was it random? It must have been, right?

~~~
neilc
From the original paper:

"We constructed a sampling frame and selection process to minimize selection
biases and maximize generalizability of the accumulated evidence.
Simultaneously, to maintain high quality, within this sampling frame we
matched individual replication projects with teams that had relevant interests
and expertise. We pursued a quasi-random sample by defining the sampling frame
as 2008 articles of three important psychology journals: _Psychological
Science_ (PSCI), _Journal of Personality and Social Psychology_ (JPSP), and
_Journal of Experimental Psychology: Learning, Memory, and Cognition_ (JEP:
LMC)...The first replication teams could select from a pool of the first 20
articles from each journal, starting with the first article published in the
first 2008 issue. Project coordinators facilitated matching articles with
replication teams by interests and expertise until the remaining articles were
difficult to match. If there were still interested teams, then another 10
articles from one or more of the three journals were made available from the
sampling frame. Further, project coordinators actively recruited teams from
the community with relevant experience for particular articles."

------
AstralStorm
The real question is whether the research commits the so-called type 3 error:
providing a correct answer to a different question than the one posed. This
might be very common in psychology.

A typical example would be premature generalisation of results.

