
Daryl Bem and the Replication Crisis - wellpast
https://redux.slate.com/cover-stories/2017/05/daryl-bem-proved-esp-is-real-showed-science-is-broken.html
======
onli
I might have missed it in this rather long article, but I think the whole
debate could use more people who have read Karl Popper and are familiar
with positivism vs. falsifiability. I would not be surprised if all this
amounts to the fact that you can of course "replicate" results if you try
often enough and ignore the falsifying results when it does not work.

I mean, of course the big problem is that applying strict falsifiability to
the social sciences does not work. You can always find a counter-example; the
problems studied do not work like that. It is hard to reconcile this, but I
think that is the heart of the replication crisis (together with some bad
statistics). But (having done a PhD in a somewhat related field) I see only
two options:

1. We need a complete overhaul of how experiments and study results are
published, so that observers can see the failed results and we can try to
assess how often a theory holds up.

2. We have to limit the non-hard sciences to questions that are not ambiguous,
where one well-done negative result really shows a theory is wrong.

Of course, there is a third option: Just continue as it is now and ignore all
results that seem unlikely, because given how those fields work, they are most
likely wrong.

~~~
adrianN
Why does strict falsifiability not work in the social sciences? If you can
otherwise find counterexamples easily, you just need to state your theories
better. Physicists don't say "smash these two particles together and you'll
see a Higgs boson"; they have sufficiently nuanced theories that sifting
through petabytes of data to find what you claim exists is justified.

~~~
onli
It's about the problems you can study.

Take the Cornell Food Lab as an example. One of its findings was that people
will eat fewer sweets if they are stashed away and not fully visible on a
table. It's interesting and it might very well be true, overall. But it is
absurdly hard to know for sure. For one, devising an experiment for this is
very hard. But the main problem is that you will absolutely find at least one
person who eats more sweets when they are stashed away. So, the theory is
wrong? Not really.

It might be an invalid theory, though. But if you think that, there are very
few things those fields could study. I'd say there would be no use for them.

> _they have sufficiently nuanced theories that sifting through petabytes of
> data to find what you claim exists is justified._

That does not sound like searching for falsifiability to me. It sounds more
like the workaround I also tried: backing up a theory - one that predicts an
overall result, not a result true for each individual - with as much data as
possible, so one can reasonably assume it is true. But that's not really the
correct approach.

No, for something like the Higgs boson they try to see it or its effects (and
if they were not to see it in the right circumstances, the theory would be
false). It might be a bad example, though, given how it touches on physicists'
theory building.

~~~
marcus_holmes
Is a theory that is true only part of the time actually any use to anyone?

Take your example: I might be the guy who eats more sweets when they're
hidden. I go to my psych to talk about my sweet problem. The psych says "it's
a well-studied phenomenon that you will eat fewer sweets if they're hidden".
I hide my sweets. Boom, I eat more. What happens then?

We know there's a huge variety in humans. But statistics weeds them out and
talks about "the average human". Which is great for statistics and
researchers. But since there's no such thing as the average human, how does
this actually help us?

~~~
upvotinglurker
If, say, 95% of people tend to eat fewer sweets when they're hidden, it's
worth trying the method first (then discontinuing it if you turn out to be
one of the 5%). Is a drug that cures most, but not all, instances of a
specific infection useless?
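
To put rough numbers on that intuition (all of them made up for illustration),
a minimal expected-value sketch in Python:

```python
# Made-up numbers: a method that helps 95% of people is worth a cheap
# trial even though it fails for the other 5%.
p_works = 0.95      # assumed fraction of people the method helps
benefit = 1.0       # payoff (arbitrary units) if it works for you
trial_cost = 0.05   # assumed small cost of trying it and stopping

expected_value = p_works * benefit - (1 - p_works) * trial_cost
print(f"expected value of a trial: {expected_value:+.3f}")  # clearly positive
```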

------
jonstokes
From the article:

"A few students—all of them white guys, Wu remembers—would hang around to ask
about the research and to probe for flaws in its design. Wu still didn’t
believe in ESP, but she found herself defending the experiments to these
mansplaining guinea pigs. "

Was this race- and gender-related jab really necessary? (Honest, non-
rhetorical question.)

~~~
nkurz
No, it was a mistake and distracts from the article. But empirically, the 3
previous discussions about this article have been too distracted by this "easy
target" to discuss the interesting (to me) underlying issues, and have been
user flagged to death. You can find others' answers to your question by
searching for the earlier postings. So without malice, I downvoted your
comment not because it's wrong to ask, but to try to move it to the bottom of
the page where it can do less harm.

~~~
jonstokes
Thanks for the heads-up and backstory. I didn't know this article had this
kind of history here on HN.

------
nkurz
If you can get past the linkbait title and some questionable word choices on
the part of the author, this is actually an excellent article. The article is
not trying to convince you that ESP is real, at least not in the sense of in-
the-world reality. Rather, it's an article about the standards of scientific
proof, showing that there are cases where standard statistical practices can
"prove" apparently absurd results.

The issue is that in general, it's hard to tell the difference between
"absurd" and "unexpected". In theory, the scientific method is about designing
experiments that produce replicable results that cannot be explained by
current theory, and then refining (or replacing) the theory until it can
explain the new results, while still making correct predictions about all
cases covered by the old theory.

But what do you do when results are obtained that violate the foundations of
science, such as the time order of cause and effect? Naturally, one should
start by being skeptical of the experiment. Was the data accidentally recorded
wrong? Was the data inappropriately filtered before being analyzed? Is the
experimenter lying about the data that was obtained?

Usually, thinking about these issues yields some apparent cause of error that
would explain the unexpected results without violating one's basic beliefs.
Unfortunately, an apparent reason for disbelief can often be found even if the
results truly are impeccable. Often, the result is that the doubters continue
their disbelief and the believers continue believing, until one faction or
the other retires from the field and the surviving belief becomes "consensus".

What Bem has done is to design an experiment that surpasses the statistical
standards of many fields, yet "proves" a result that on the surface seems
impossible. Most of the usual scientific errors have been avoided, and his
methodology and analysis are better than most. One possibility is that our
current conceptions of causality are wrong: we think that for A to cause B, A
must happen before B, but in fact this is not a requirement.

Another possibility is that something is grossly wrong with our current
interpretation of scientific results, and many longstanding theories which are
considered "scientifically" proven may in fact be mistakes, or at least have
no more "proof" than Bem has managed to show for ESP. For most scientists,
both of these answers are problematic, yet it would seem that at least one of
them must be true. One twist is that some believe Bem's goal is not actually
to prove that ESP is real, but rather to show that the foundations of science
are faulty:
[http://andrewgelman.com/2013/08/25/a-new-bem-theory/](http://andrewgelman.com/2013/08/25/a-new-bem-theory/).

~~~
wuch
There is a video of a debate with Daryl Bem about those results. It is worth
a look, because he admits to using various methods that increase the degrees
of freedom, and to not taking those into account during the statistical
analysis. Essentially, he honestly admits to p-hacking, but doesn't seem to
recognize anything wrong with that.

On those grounds, I think it is a little misleading to suggest that those
studies surpass the statistical standards of many fields. The statistical
tests are completely misapplied, but you can't tell just by looking at the
publications.
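
As a toy illustration (a generic sketch, not Bem's actual protocol): test
several outcome measures on pure noise and report whichever one "works", and
the false-positive rate inflates well past the nominal 5%.

```python
# A generic sketch of uncorrected researcher degrees of freedom:
# five outcome measures, all pure noise, and we report whichever
# one happens to reach p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_outcomes, n_subjects = 10_000, 5, 40

false_positives = 0
for _ in range(n_studies):
    # Every outcome is noise: the null hypothesis is true throughout.
    data = rng.normal(size=(n_outcomes, n_subjects))
    pvals = [stats.ttest_1samp(row, 0.0).pvalue for row in data]
    if min(pvals) < 0.05:  # "find" whichever measure worked
        false_positives += 1

# Roughly 1 - 0.95**5 ≈ 0.23, not the nominal 0.05.
print(f"nominal alpha: 0.05, realized rate: {false_positives / n_studies:.3f}")
```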

~~~
nkurz
Do you have a link to the debate you mean? I'd be interested to see it, but
when I searched I found several. One theory is that while the initial
experiments were p-hacked to some extent, the replications (by definition)
were not. Some trustworthy sources argue that the problem is that the
replications actually weren't, but I haven't looked into the specifics closely
enough to know whether that's the case. That said, I thought Bem came across
quite well in this interview: [http://skeptiko.com/daryl-bem-responds-to-
parapsychology-deb...](http://skeptiko.com/daryl-bem-responds-to-
parapsychology-debunkers).

 _I think it is a little misleading to suggest that those studies surpass the
statistical standards of many fields_

I didn't mean to imply that the statistical standards were good ones, rather
that the standards in some fields are abysmally low. For example, there's a
high-profile case going on with the Cornell Food Lab, where the extremely
prolific lead researcher seemed not just blissfully unaware of the unseen
pitfalls around him, but proud of them:
[https://web.archive.org/web/20170312041524/http:/www.brianwansink.com/phd-advice/the-grad-student-who-never-said-no](https://web.archive.org/web/20170312041524/http:/www.brianwansink.com/phd-advice/the-grad-student-who-never-said-no).

 _The statistical tests are completely misapplied, but you can't tell just by
looking at publications._

I think that's true, but (in my opinion) the problem is finding papers outside
the very hard sciences where they _aren't_ completely misapplied. For example,
I feel there is an insurmountable issue with applying formal statistics to any
sort of meta-studies, where the data is at best a convenience sample, and
(almost?) all conclusions are conditional on the unknown biases of the sample.
This of course doesn't mean that all (or even most) meta-analyses produce the
wrong answer, but I think it does mean that they should be treated as
rhetorical rather than logical arguments.
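
A minimal sketch of that convenience-sample worry (the selection mechanism
below is invented for illustration): once inclusion in the sample correlates
with the quantity being measured, collecting more data just converges on the
wrong answer rather than fixing it.

```python
# Invented selection mechanism: the probability of ending up in the
# sample rises with the very quantity we want the mean of.
import numpy as np

rng = np.random.default_rng(1)
true_mean = 0.0

for n in (100, 10_000, 1_000_000):
    population = rng.normal(loc=true_mean, scale=1.0, size=n)
    include = rng.random(n) < 1.0 / (1.0 + np.exp(-population))
    estimate = population[include].mean()
    print(f"n={n:>9,}  biased estimate={estimate:+.3f}  (truth {true_mean})")
# The estimate settles near +0.4 instead of 0.0: more data, same bias.
```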

~~~
heymijo
This caught my attention: "...insurmountable issue with applying formal
statistics to any sort of meta-studies..."

If I wanted to get a better understanding of what formed your opinion on this,
where might I look?

P.S. I enjoyed the way you formatted and responded to your parent comment.

~~~
nkurz
My belief is mostly intuitive and likely more extreme than most, and I don't
know of a good single source to point to. John Ioannidis' writings are
probably a good starting point: [http://retractionwatch.com/2016/09/13/we-
have-an-epidemic-of...](http://retractionwatch.com/2016/09/13/we-have-an-
epidemic-of-deeply-flawed-meta-analyses-says-john-ioannidis/). Searching for
"convenience sampling" on Andrew Gelman's blog yields lots of good discussion
in the comments:
[http://andrewgelman.com/?s=%22convenience+sample%22](http://andrewgelman.com/?s=%22convenience+sample%22).
Miguel Hernan's book on Causal Inference gives a good sense of the pitfalls of
biased sampling: [https://www.hsph.harvard.edu/miguel-hernan/causal-
inference-...](https://www.hsph.harvard.edu/miguel-hernan/causal-inference-
book/). Sorry I can't do better, and maybe others can add better sources.

 _P.S. I enjoyed the way you formatted and responded to your parent comment._

Thanks, the style is sometimes referred to as "interleaved" or "inline", as
opposed to "top posting" and "bottom posting":
[https://brooksreview.net/2011/01/interleaved-email/](https://brooksreview.net/2011/01/interleaved-email/).
It was the norm for early online communications, but has mostly fallen out of
favor. I think it works very well for some situations, although it too is
surprisingly controversial:
[https://news.ycombinator.com/item?id=5233428](https://news.ycombinator.com/item?id=5233428).

------
Dowwie
Last week, I attended PyCon in Portland. Two keynote speakers were from
academia, and both touched on the challenges scientific communities are
having with replication. One attempt to address this challenge is to use
Jupyter notebooks (Python) during research and to share them as reference
material.

~~~
Bartweiss
> and sharing as reference material

I'm hugely in favor of this. Lots of replication-crisis stuff focuses on
handling statistical issues like salami slicing. That's great, but it's a one-
problem fix. Releasing more comprehensive information is a far, far bigger
advance, which enables everything from catching data fraud to verifying
statistical analysis.

The Reinhart and Rogoff paper, for instance, was afflicted with nontrivial
formula errors in its analysis. A standard of releasing data and code with
papers would have allowed this to be discovered in months rather than years.

------
Bartweiss
This is a very worthwhile article in terms of simple facts. It covers largely
the same ground as the Slate Star Codex piece a few years back, updated and
made more accessible. But I can't help wondering at the article's urge to
challenge all of science or none of it. It feels rather protective, as though
the author is unwilling to countenance the possibility that studies in
psychology (and related subfields) _in particular_ are in jeopardy.

 _" The replication crisis as it’s understood today may yet prove to be a
passing worry or else a mild problem calling for a soft corrective. It might
also grow and spread in years to come, flaring from the social sciences into
other disciplines, burning trails of cinder through medicine, neuroscience,
and chemistry."_

This brackets the possible outcomes, certainly. But it's one hell of an
excluded middle, implying that perhaps there will be no serious errors found
(already out of the question) and perhaps entire fields will be wiped away.
Realistically, we have a much better understanding of the crisis than this
already.

Medicine has a disturbing number of process issues, but many (like ignoring
NNTH) are unrelated to replication errors. The neuroscience result is serious
but specific to fMRI studies, and the chemistry link there is aggressively
misleading. It concerns documentation and yield statistics for very real
reactions, not the sort of "whole theories are junk" issues social psych is up
against. There is effectively zero chance that chemistry and psychology come
out of this looking equally good (or bad), and I'm disturbed by the
equivocation.

 _" If you bought into those results, you’d be admitting that much of what you
understood about the universe was wrong. If you rejected them, you’d be
admitting something almost as momentous: that the standard methods of
psychology cannot be trusted, and that much of what gets published in the
field—and thus, much of what we think we understand about the mind—could be
total bunk."_

This is hyperbole, fine, but it still aggravates me in light of the other
section. These two things are not comparable! One contravenes fundamental
physics; the other says most psychology results are unproven (not even
_false_, just _unproven_). Michael Inzlicht says of the crisis, "I feel like
the ground is moving from underneath me, and I no longer know what is real
and what is not," but even he doesn't consider this comparable to the
collapse of basic physics.

A quick look at Engber's other articles suggests a similar thread runs
through all of them. He covers specific replication failures, but runs
headlines like "Science Is Broken. How Much Should We Fix It? More rigor in
research could stamp out false positive results. It might also do more harm
than good." He writes about Gary Taubes (recently) as though past errors on
carbs make Taubes's (consistently failed) claims true. The list goes on.

There is a weird, pervasive implication - not just in Engber's writing - that
the replication crisis means we must throw everything up in the air equally.
That maybe ego depletion is still true-as-studied, and maybe basic chemistry
and biology are false. This sows needless confusion around the information we
already have, and fuels a "teach the controversy" attitude on topics from
stereotype threat to insulin. We would be better served by a less hedged but
more cautious approach that makes a real effort to discuss how confident we
should be on which points.

------
valuearb
It really brings into question whether psychology can ever be a science.

------
theparanoid
Scott Alexander wrote about Bem and parapsychology back in 2014:
[http://slatestarcodex.com/2014/04/28/the-control-group-is-out-of-control/](http://slatestarcodex.com/2014/04/28/the-control-group-is-out-of-control/)

~~~
nkurz
I thought the back and forth between Johann and others deep in that thread
added a lot to the piece:
[http://slatestarcodex.com/2014/04/28/the-control-group-is-out-of-control/#comment-67143](http://slatestarcodex.com/2014/04/28/the-control-group-is-out-of-control/#comment-67143)

One conclusion seems to be that seeing successive replications of an
experiment you believe to be flawed does not mean that one should eventually
lose one's doubt. Unfortunately, it also does not mean that one's doubt is
justified and that the results can safely be ignored. Rather, it means that
at some point (if you want to improve your knowledge) you have to figure out
some way to analyze the experiment from another angle and remove (or confirm)
the root cause of the doubt.

~~~
Bartweiss
A line from Andrew Gelman I really appreciate:

"Again, they’re placing the original study in a privileged position. There’s
nothing special about the original study, relative to the replication. The
original study came first, that’s all. What we should really care about is
what is happening in the general population."

There are two very different questions about replication.

One is whether the study got its results by chance, including forced-chance
techniques like forking paths and salami slicing. This can be handled with
either preregistration or exact replication. (And at p < .05, replication is
a must, because even without any forcing, 5% of studies of true-null effects
will still come up significant!)
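
A quick numerical check of that parenthetical (assuming independent, honestly
analyzed studies and α = .05):

```python
# Independent, honestly analyzed studies under a true null: ~5% reach
# p < .05, but an original AND its exact replication both reaching it
# happens only about alpha**2 = 0.25% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n_pairs, n_subjects = 0.05, 20_000, 30

def significant() -> bool:
    sample = rng.normal(size=n_subjects)  # the null is true
    return stats.ttest_1samp(sample, 0.0).pvalue < alpha

originals = np.array([significant() for _ in range(n_pairs)])
replications = np.array([significant() for _ in range(n_pairs)])

print(f"original significant: {originals.mean():.4f}  (expected ~{alpha})")
print(f"both significant:     {(originals & replications).mean():.4f}  (expected ~{alpha**2})")
```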

But the other is whether the study got its non-chance results from
methodological flaws or from an actual insight about the world. Exact
replications are no good for this - doing the wrong thing twice is no better
than doing it once. The power poses study, for instance, used testosterone
sampling procedures that introduced known confounders. What would help is a
study equivalent of N-version programming: settle, preferably via
preregistration, on multiple distinct tests for the same effect. If they all
work, you win. If some work (repeatably) and others don't, you've either made
a design error or found a different effect than the one you were looking for.

This also explains how to update your confidence levels (a topic discussed in
that SSC thread). You can't replicate a study endlessly and gain confidence
every time. Given a prior for P(effect), exact replications boost P(effect ∪
bad study), and your P(effect) belief is bounded by the odds of a methodology
error. It's a point I'd never considered until that SSC post, and one a lot
of actual researchers still seem to miss.
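
A toy Bayesian version of that bound (every number here is an illustrative
assumption, not an estimate from the Bem literature): if a flawed study tends
to "succeed" regardless of the truth, piling up exact replications drives
P(effect or flaw) toward 1 while P(effect) plateaus.

```python
# Toy model: E = "effect is real", F = "methodology is flawed",
# independent priors. A flawed study "succeeds" at a high rate no
# matter what; a sound one succeeds at `power` if E, else `alpha`.
p_e, p_f = 0.01, 0.05                      # assumed priors
power, alpha, flaw_rate = 0.8, 0.05, 0.9   # assumed per-study success rates

def posterior(k: int):
    """P(E | k successes) and P(E or F | k successes) after k exact replications."""
    joint = {}
    for e in (0, 1):
        for f in (0, 1):
            prior = (p_e if e else 1 - p_e) * (p_f if f else 1 - p_f)
            rate = flaw_rate if f else (power if e else alpha)
            joint[(e, f)] = prior * rate**k
    total = sum(joint.values())
    p_effect = (joint[(1, 0)] + joint[(1, 1)]) / total
    p_effect_or_flaw = (total - joint[(0, 0)]) / total
    return p_effect, p_effect_or_flaw

for k in (1, 5, 20, 100):
    pe, pef = posterior(k)
    print(f"k={k:>3}: P(effect)={pe:.3f}  P(effect or flaw)={pef:.3f}")
```

With these made-up rates, P(effect) rises for the first few replications and
then sinks back toward its prior while P(effect or flaw) goes to 1: past a
point, a methodology flaw explains a long winning streak better than the
effect does.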

------
lloydde
Frontpage six days ago with different title
[https://news.ycombinator.com/item?id=14364573](https://news.ycombinator.com/item?id=14364573)

~~~
dang
Yes, we invited this repost in the hope that a different title would make for
a less lame discussion:
[https://news.ycombinator.com/item?id=14372695](https://news.ycombinator.com/item?id=14372695).

~~~
misnome
Resampling the comments until you get the results you want is ironic,
considering the problems this causes in some scientific fields (including
those related to the article).

~~~
dang
We're not running a study to see what comments arise under random conditions.
We already know that! In fact it's what we're trying to avoid.

