
‘Generous’ approach to replication confirms many social science findings - SiempreViernes
http://www.sciencemag.org/news/2018/08/generous-approach-replication-confirms-many-high-profile-social-science-findings
======
cultus
>And the 62% figure “certainly is consistent with there being a problem” in
the field, he says. “It seems funny that there’s been a drift in standards to
the point where 62% seems very respectable.”

I'd sure say so. Statistics in science is just so broken: lots of
inappropriate techniques chosen out of tradition, ignorance (or concealment)
of multiple comparisons, the base-rate fallacy, and so on. It's not just the
social sciences, but medicine, biology, and just about every field that
relies heavily on low-power studies.

Bayesian techniques are really called for in those situations.
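
To make the multiple-comparisons point concrete, here's a toy simulation
(independent tests, all numbers made up): run 20 tests of true null
hypotheses at p < 0.05 and you get at least one "significant" result about
64% of the time.

    # Toy illustration of the multiple-comparisons problem: with 20
    # independent tests of true nulls at alpha = 0.05, the chance of
    # at least one false "discovery" is 1 - 0.95**20, about 0.64.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    runs, any_hit = 2_000, 0
    for _ in range(runs):
        p_values = [stats.ttest_ind(rng.normal(size=30),
                                    rng.normal(size=30)).pvalue
                    for _ in range(20)]
        any_hit += min(p_values) < 0.05

    print(any_hit / runs)  # ~0.64, matching 1 - 0.95**20 = 0.6415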

~~~
tenaciousDaniel
I'm not a scientist, nor am I familiar with statistics. I can imagine this
being troubling for a scientist or someone familiar with the field, but for
a non-scientist it's fairly devastating if you consider the implications.

What it means is that if I trust a study - _any_ study (from my perspective) -
I'm essentially flipping a coin. If non-scientist citizens can't rely on what
comes out of the field, then it seems like a massive problem that needs
solving before any other.

~~~
the8472
You _shouldn't_ trust any single study. Researchers are not perfect,
reviewers are not perfect, and the systems they observe are complex. Even in
the "hard" sciences, even if the experimental data itself is solid, there can
be confounders that were overlooked and will only be uncovered by follow-up
studies; depending on the importance of the result, that might take years.

That's why we have meta-analyses and systematic reviews, and even those
aren't exactly bullet-proof.

~~~
btrettel
> Even in the "hard" sciences, even if the experimental data itself is solid,
> there can be confounders that were overlooked and will only be uncovered by
> follow-up studies; depending on the importance of the result, that might
> take years.

I work in fluid dynamics. In one subfield I work in, there is at least one
confounder in most studies (the Weber number and Reynolds number are
confounded, in particular) and an important variable is often omitted
(turbulence intensity). Many people don't seem to be aware that turbulence
intensity matters, and most of those who do think it matters don't want to
measure or estimate it, so it tends to be ignored. Just because it's hard to
measure does not mean it's not important! This is supposed to be "hard"
science, but sometimes I feel it's not much better than social science.
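
To make the confounding concrete: both numbers scale with the jet velocity,
so a study that varies only the velocity (a common design) moves both at
once and can't attribute the outcome to either. A minimal sketch, with
water-like fluid properties and a made-up nozzle diameter:

    # We and Re both rise with velocity v, so varying v alone
    # confounds them. Properties are roughly those of water at
    # room temperature; the diameter is hypothetical.
    rho   = 998.0    # density, kg/m^3
    sigma = 0.0728   # surface tension, N/m
    mu    = 1.0e-3   # dynamic viscosity, Pa*s
    d     = 5.0e-3   # nozzle diameter, m (made up)

    for v in (1.0, 2.0, 4.0, 8.0):       # jet velocity, m/s
        we = rho * v**2 * d / sigma      # Weber number, scales as v^2
        re = rho * v * d / mu            # Reynolds number, scales as v
        print(f"v={v:3.1f} m/s  We={we:8.0f}  Re={re:7.0f}")

To decouple the two you have to vary the diameter or the fluid properties as
well, which is exactly what most studies don't do.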

How long has this been a problem? I'd say over 50 years, easily!

At a recent conference, one of my papers addressed these issues briefly (I
tried to avoid them in a data analysis), but I think it'll take more than that
to correct these problems. So I'm planning a new article specifically
addressing these concerns. I still expect progress to be slow, but any
movement in the right direction would be appreciated...

~~~
tenaciousDaniel
That's really interesting. Seems like good science is always going to be an
uphill battle against human nature/complacency.

------
hirundo
> If experts [in prediction markets] can instinctively spot an irreproducible
> finding, “that kind of begs the question of why that doesn’t seem to be
> happening in peer review,” says Fiona Fidler, a philosopher of science at
> The University of Melbourne in Australia. But if future studies can identify
> and weigh the best predictors of replicability, reviewers might be given a
> rubric to help them weed out problematic work before it’s published.

That's a troubling suggestion. Results are valuable to the extent that they're
both accurate and surprising. To systematically suppress surprising results as
a negative predictor of accuracy sounds like a formula for suppressing
surprisingly valuable papers.

~~~
closed
The danger with surprising results is that researchers often use the
surprise as an excuse for small sample sizes, etc. Another way to think
about this situation is to flip it and say that surprising (and spurious)
results are more likely to come from exploring a small dataset.

Andrew Gelman talks a lot about this issue on his blog:

[https://andrewgelman.com/2014/08/01/scientific-surprise-two-...](https://andrewgelman.com/2014/08/01/scientific-surprise-two-step/)

------
jblow
> If an initial replication attempt failed, the researchers added even more
> participants.

This is an obvious way to tamper with the results; it's just more of the same
kind of p-hacking that bad researchers are so often doing. They are using "we
re-do the study with a larger population" as a way to re-roll the dice if the
first die roll doesn't come up the way they want. (Note that if the die roll
_did_ come up the way they want, they don't re-do the study with a larger
population in order to see if the replication fails).

Nobody should be taking this seriously.
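
For what it's worth, the inflation from unadjusted optional stopping is easy
to demonstrate. A toy sketch (not the project's actual procedure; sample
sizes are made up): simulate a true null, test at n=25 per group, and add 25
more per group and re-test whenever the first look misses significance.

    # Optional stopping under a true null: two unadjusted looks
    # push the false-positive rate above the nominal 5%.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    trials, hits = 10_000, 0
    for _ in range(trials):
        a = rng.normal(size=50)  # group A, no real effect
        b = rng.normal(size=50)  # group B, no real effect
        if stats.ttest_ind(a[:25], b[:25]).pvalue < 0.05:
            hits += 1            # "significant" at the first look
            continue
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1            # add participants, test again

    print(f"false-positive rate: {hits / trials:.3f}  (nominal 0.05)")

The two looks are correlated, so the rate lands around 8% rather than
doubling, but it is meaningfully above 5% unless the threshold at each look
is adjusted, as sequential-analysis methods do.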

~~~
timr
There are known methods to handle this; if you adjust the statistics
properly to account for the repeat look, it isn't p-hacking.

~~~
jblow
Did they? I can't tell from the article!

------
claytonjy
The prediction-market aspect is fascinating: they asked a group of experts to
predict which studies would and wouldn't replicate, and while we don't see
the experiment-level results, the overall average was apparently very close
to the actual replication rate.

Is this common in modern replication studies? Do the results of the
prediction market ever get announced/published prior to the replication
results themselves?

------
_rpd
> experts also participated in an online “prediction market,” trading shares
> that corresponded to studies, which paid out only if the given study was
> replicated.

> Both approaches did well at predicting the outcome for individual studies,
> and they predicted an overall replication rate very close to the actual
> figure of 62%.

I think the real headline here is that we should supplement peer review with
replication prediction markets.
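
For readers unfamiliar with the mechanics: a standard way to run such a
market is Hanson's logarithmic market scoring rule (LMSR), where an
automated market maker quotes a price that doubles as the crowd's
probability estimate. The article doesn't say which mechanism this project
used, so treat this as a generic sketch:

    # Generic LMSR market maker for a binary "will it replicate?"
    # contract; not necessarily the mechanism the project used.
    import math

    class LMSRMarket:
        def __init__(self, b=100.0):
            self.b = b        # liquidity parameter
            self.q_yes = 0.0  # shares sold on "replicates"
            self.q_no = 0.0   # shares sold on "fails to replicate"

        def cost(self, q_yes, q_no):
            return self.b * math.log(math.exp(q_yes / self.b) +
                                     math.exp(q_no / self.b))

        def price_yes(self):
            # current price = implied probability of replication
            e_yes = math.exp(self.q_yes / self.b)
            return e_yes / (e_yes + math.exp(self.q_no / self.b))

        def buy_yes(self, shares):
            # a trader pays the change in the cost function
            before = self.cost(self.q_yes, self.q_no)
            self.q_yes += shares
            return self.cost(self.q_yes, self.q_no) - before

    m = LMSRMarket()
    print(m.price_yes())  # 0.5 before any trades
    print(m.buy_yes(50))  # cost of buying 50 YES shares
    print(m.price_yes())  # implied probability moves up

Each YES share pays out one unit if the study replicates, so the quoted
price aggregates the traders' beliefs into a per-study probability.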

~~~
philipodonnell
Is there any information on these prediction markets? Is there a reason only
"experts" can participate? This would be an interesting application for ML
side projects that are monetarily beneficial while also benefiting society as
a whole.

~~~
creaghpatr
The experts aren't necessarily credentialed experts as we would typically
think of them. You can participate in forecasting tournaments like the Good
Judgment Project, and if you score high enough you can participate in
invite-only events run in partnership with entities like DARPA. The people
come from all backgrounds - I got all this from the book Superforecasting by
Philip Tetlock.

Edit: got the title wrong

------
forapurpose
The facts are spread throughout the article, so this might help:

* Prior replication studies replicated 39% of papers in psychology journals [Ed note: That doesn't mean the other 61% were complete failures; most just didn't produce results as statistically strong as the originals IIRC] and 61% in economics journals.

* This replication study greatly increased the number of participants in some experiments, and for two papers, that changed the replication results from failure to success. Overall 62% of the studies were replicated successfully.

* One researcher "points out that the project repeated only one experiment from each paper, and in his case, it wasn’t the strongest or the most important."

~~~
TangoTrotFox
You're missing arguably the single most important sentence in this article:

* If an initial replication attempt failed, the researchers added even more participants.

~~~
bencollier49
Which is not unlike doubling up at blackjack, which will get you thrown out
of a casino.

Start with 25 participants; if you don't get the result you want, add 50 and
try again. Carry on doubling until you get the result you want.

Not a statistician here, but since each addition is essentially a new trial,
shouldn't they have applied a Bonferroni correction to the results?
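
For reference, the Bonferroni correction just divides the significance
threshold by the number of looks at the data. It's conservative here,
because the looks overlap rather than being independent, but it's the
simplest fix. A minimal sketch with made-up p-values:

    # Bonferroni: with k planned looks, require p < alpha/k at each
    # look to keep the overall false-positive rate at or below alpha.
    alpha = 0.05
    p_values = [0.08, 0.03]       # look 1 (n=25), look 2 (n=75); made up
    k = len(p_values)

    threshold = alpha / k         # 0.025 per look
    significant = [p < threshold for p in p_values]
    print(threshold, significant) # 0.025 [False, False]

Sequential designs usually use purpose-built boundaries instead (Pocock,
O'Brien-Fleming), which account for the correlation between looks and give
back some power.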

------
forapurpose
I assume Science knows what it's talking about, so I'm confused on a couple of
points. Perhaps some practicing scientists could clear them up for everyone:

For any arbitrary paper, what is your assumption about its accuracy? How much
do you rely on it? Can you put a number to it? The null hypothesis that
research papers, especially in very difficult fields like social science, are
near-infallible seems to be the error, AFAICT. It seems like something people
outside science would assume, but I see scientists who are surprised by a 62%
replication rate. I'm not surprised or concerned, but maybe I should be.

Why do they call increasing the number of participants "generous"? Doesn't
that increase accuracy, and isn't accuracy the whole point of the replication
study? Generosity implies some sort of favor, something above and beyond,
while in this case it seems necessary given that the results changed - it
would be a failure of the replication study not to increase the number of
participants.

~~~
Fomite
"For any arbitrary paper, what is your assumption about its accuracy? How much
do you rely on it? Can you put a number to it? The null hypothesis that
research papers, especially in very difficult fields like social science, are
near-infallible seems to be the error, AFAICT. It seems like something people
outside science would assume, but I see scientists who are surprised by a 62%
replication rate. I'm not surprised or concerned, but maybe I should be."

There is no such thing as "an arbitrary paper". It will depend very much on
the study in question - including soft factors like who wrote it, but also the
study design, sample size, if I think they approached the statistics
appropriately, etc.

"Why do they call increasing the number of participants "generous"? Doesn't
that increase accuracy, and isn't accuracy the whole point of the replication
study? Generosity implies some sort of favor, something above and beyond,
while in this case it seems necessary given that the results changed - it
would be a failure of the replication study not to increase the number of
participants."

The reason it might be thought of as generous is that the criterion they use
for saying something replicated is a significant effect in the same direction
as the original study (for the record, I _hate_ this criterion). A larger
study is more likely to find a significant result if there is indeed one
there, so they're giving studies that report an effect a very strong chance
of being seen to replicate if the effect was indeed real.
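
That last point is just statistical power: for a fixed true effect, the
probability of hitting significance grows with the sample size. A rough
sketch using the normal approximation for a two-sided, two-sample test (the
effect size and sample sizes are made-up illustrative values):

    # Approximate power of a two-sided two-sample z-test as the
    # per-group sample size grows; d is a hypothetical effect size.
    from scipy.stats import norm

    d = 0.4                   # true standardized effect (made up)
    z_crit = norm.ppf(0.975)  # two-sided alpha = 0.05

    for n in (25, 50, 100, 200):  # participants per group
        power = norm.cdf(d * (n / 2) ** 0.5 - z_crit)
        print(f"n={n:3d}  power={power:.2f}")  # 0.29, 0.52, 0.81, 0.98

So under the "significant effect in the same direction" criterion, adding
participants stacks the deck toward declaring a replication whenever any
real effect exists at all.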

------
soneca
I would appreciate a reader-friendly list of the studies that were
successfully replicated, be it the 39% from the 2015 paper or the 62% from
this one.

Knowing that I have no strong reason to trust most of the conclusions is
useful. But knowing which papers I _can_ trust is even more useful.

Does anyone know of such a list?

