Take the time to read the story (many comments here don't reflect more than the headline); it's far more complex and interesting than the headline suggests. For one thing, almost all of the studies' effects were reproduced, but they were generally weaker.
* Most importantly, from the Times: Strictly on the basis of significance — a statistical measure of how likely it is that a result did not occur by chance — 35 of the studies held up, and 62 did not. (Three were excluded because their significance was not clear.) The overall “effect size,” a measure of the strength of a finding, dropped by about half across all of the studies. Yet very few of the redone studies contradicted the original ones; their results were simply weaker.
* Also: The research team also measured whether the prestige of the original research group, rated by measures of expertise and academic affiliation, had any effect on the likelihood that its work stood up. It did not.
* And: The only factor that did [affect the likelihood of successful reproduction] was the strength of the original effect — that is, the most robust findings tended to remain easily detectable, if not necessarily as strong.
* Finally: The project’s authors write that, despite the painstaking effort to duplicate the original research, there could be differences in the design or context of the reproduced work that account for the different findings. Many of the original authors certainly agree.
* According to several experts, there is no reason to think the problems are confined to psychology, and it could be worse in other fields. The researchers chose psychology merely because that is their field of expertise.
* I haven't seen anything indicating the 100 studies are a representative sample of the population of published research, and at least one scientist raised this question.
> The only factor that did [affect the likelihood of successful reproduction] was the strength of the original effect — that is, the most robust findings tended to remain easily detectable, if not necessarily as strong.
This is probably just regression to the mean. The comment above suggests to me that the tendency for the findings in replicated experiments to be weaker does NOT necessarily come from any flaw in the experimental design, but from the criteria for findings to be published.
You would expect any given effect to show some variation around a mean effect size. My lab and your lab might arrive at slightly different results, varying around some mean/expected result. If your lab's results meet statistical significance, you get to publish. If my lab's don't, I don't get to. So the published results are the studies that, on average, show a stronger effect than you might see if you ran the study 100 times.
> Yet very few of the redone studies contradicted the original ones; their results were simply weaker.
If a third lab replicates the experiment, their results are more likely to be close to the (possibly non-publishable) mean value than the (publishable) outlier value. So on average, repeating an experiment will give you a LESS significant result.
If the strength of the original effect (and thus probably the mean effect strength over many repeated experiments) is larger, the chance of replicated experiments also being statistically significant is higher.
In other words, these new results are very predictable and don't necessarily indicate that anything is wrong.
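Here's a minimal simulation sketch of that argument (my own toy numbers, nothing from the study): assume a modest true effect, let only the significant results get "published", then replicate each published study once.

```python
# Toy sketch of publication selection plus replication (hypothetical numbers).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

true_effect = 0.3   # assumed true standardized effect size
n = 50              # assumed participants per group
runs = 20_000       # number of simulated original studies

def one_study():
    a = rng.normal(true_effect, 1.0, n)   # treatment group
    b = rng.normal(0.0, 1.0, n)           # control group
    t, p = stats.ttest_ind(a, b)
    return a.mean() - b.mean(), p         # observed effect (sd = 1 by construction), p-value

originals = np.array([one_study() for _ in range(runs)])
published = originals[originals[:, 1] < 0.05]   # only significant results get "published"
replications = np.array([one_study() for _ in range(len(published))])

print("mean published effect:    %.2f" % published[:, 0].mean())     # inflated above 0.3
print("mean replication effect:  %.2f" % replications[:, 0].mean())  # back near 0.3
print("replications significant: %.0f%%" % (100 * (replications[:, 1] < 0.05).mean()))
```

The published effects come out inflated, the replications regress back toward the true value, and well under half of them reach significance, even though no individual study did anything wrong.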
My expectation is that this regression to the mean should not apply to strong effects, where I define "strong" to mean the effect is large enough that the significance threshold and publication criteria don't filter anything out. In that case, I would expect about half the replications to come back stronger.
The original result is one random draw and the replication is another; if the publication cutoff introduces no selection bias, either one should be the larger about 50% of the time.
It would be concerning if the strong results consistently re-tested weaker. That would indicate a systematic bias.
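A quick sketch of that expectation (same toy setup as above, but with the effect made large enough that essentially every study would clear the significance bar, so the filter selects nothing):

```python
# Toy sketch: with a very strong assumed true effect, the significance filter
# passes nearly everything, so replications beat the originals about half the time.
import numpy as np

rng = np.random.default_rng(1)
true_effect, n, runs = 1.2, 50, 20_000   # assumed values; power is essentially 1 here

def observed_effect():
    a = rng.normal(true_effect, 1.0, n)   # treatment group
    b = rng.normal(0.0, 1.0, n)           # control group
    return a.mean() - b.mean()            # sd = 1 by construction

originals = np.array([observed_effect() for _ in range(runs)])
replications = np.array([observed_effect() for _ in range(runs)])

# Both are unfiltered draws around the same mean, so either one is larger
# roughly half the time.
print("replication stronger: %.1f%%" % (100 * (replications > originals).mean()))
```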
Oh, the Millikan experiment. I had to do it in a university lab as a four-hour experiment. It's impossible to gather enough data in that short a time, and your eyes practically fall out from staring into the microscope to measure the drops' velocities. I can assure you it is the worst experiment a student can be given.
> * According to several experts, there is no reason to think the problems are confined to psychology, and it could be worse in other fields. The researchers chose psychology merely because that is their field of expertise.
There's a tendency to stop collecting data once there are publishable findings. There's also a tendency to ignore (find reason to discount) results that aren't reproducible or don't make sense. That's even the case in physics. There's a tendency to debug until you get results that are consistent with prior work.
> There's also a tendency to ignore (find reason to discount) results that aren't reproducible or don't make sense. ... There's a tendency to debug until you get results that are consistent with prior work.
Even Einstein (Einstein, of all people), when his equations implied an expanding universe rather than a static one, adjusted them (by adding the cosmological constant) so they would fit the then-accepted static-universe model. He reportedly called it his greatest blunder. Confirmation bias.
> The overall “effect size,” a measure of the strength of a finding, dropped by about half across all of the studies.
In general, if you conduct an experiment, then conditional on finding significance, the estimate of the effect size will tend to overestimate the true effect size. This is because data extreme enough to produce a low p-value is also likely to be more extreme than typical data from the underlying population. This bias matters less when the true effect size is large, i.e. in high-powered studies. So the drop might not actually be so important: it may simply mean that the effect sizes estimated in the original papers were overestimates, inflated by this same selection.
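A back-of-the-envelope sketch of that conditional bias (my own made-up numbers, assuming a simple one-sided z-test): the expected estimate given significance is the mean of a normal truncated at the critical value, which sits well above the true effect when power is low and converges to it when power is high.

```python
# Expected effect estimate conditional on p < 0.05 (one-sided z-test), vs. the true effect.
from scipy.stats import norm

def mean_given_significant(true_effect, se, alpha=0.05):
    """E[estimate | significant] when estimate ~ Normal(true_effect, se)."""
    crit = norm.ppf(1 - alpha) * se                       # threshold on the estimate scale
    z = (crit - true_effect) / se
    return true_effect + se * norm.pdf(z) / norm.sf(z)    # mean of a normal truncated below at crit

se = 0.20  # assumed standard error of the effect estimate
for true in (0.1, 0.3, 0.8):
    power = norm.sf((norm.ppf(0.95) * se - true) / se)
    print(f"true={true:.1f}  power={power:.2f}  "
          f"E[estimate | significant]={mean_given_significant(true, se):.2f}")
```

With these numbers, the low-powered cases report roughly 0.4 to 0.5 when the truth is 0.1 or 0.3, while the high-powered case is essentially unbiased.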