
The Flawed Reasoning Behind the Replication Crisis - dnetesn
http://nautil.us/issue/74/networks/the-flawed-reasoning-behind-the-replication-crisis
======
dwheeler
The article says:

> The main reason scientists have historically been resistant to using
> Bayesian inference instead is that they are afraid of being accused of
> subjectivity. The prior probabilities required for Bayes’ rule feel like an
> unseemly breach of scientific ethics. Where do these priors come from?

It's not just "being afraid"; the problem is that random guessing (of priors)
is not a reasonable replacement for science.

Bayes' rule is great if you can find a reasonable justification for a prior.
Bayes is widely used for decision-making (for example), where you need an
answer quickly & you aren't trying to make general scientific claims. But if
you can't find a justifiable prior in a scientific work, using Bayes' rule
just replaces one statistical fallacy with another. After all, Bayes' rule
will give nonsense answers if you give it a nonsense prior!

Bayes' rule is a great tool in many circumstances! But it has a great
weakness: it requires a prior. That doesn't make it useless; few tools are
useful in all circumstances. But requiring "everyone to use Bayes' rule, even
though we have no reasonable way to find a good estimate of the priors," is
unlikely to ever happen (and rightly so). The article rightly points out a
serious problem with the typical application of statistics, but there needs to
be a better justification for priors than is suggested in this article.

I could imagine systemic worldwide ways to deal with this. For example,
perhaps the scientific community could allow people to propose initial priors,
and then allow multiple different papers to improve the estimation of the
probability over time. But that would require much more than articles
repeatedly saying "there's no serious problem with priors"; finding a
justifiable way to estimate and update priors _is_ the fundamental problem
with Bayesian analysis in the scientific community.
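
As a rough sketch of what that could look like, here's a toy Beta-Binomial
setup where each paper reports a count of positive results out of n trials and
the community posterior is carried forward as the next prior. The model and
the study counts are invented purely for illustration.

    # Toy sketch: a proposed community prior updated by successive studies.
    # The Beta-Binomial model and the study counts below are invented.
    from scipy.stats import beta

    a, b = 1.0, 1.0                          # proposed initial prior: Beta(1, 1)
    studies = [(7, 20), (11, 25), (30, 80)]  # (positives, trials) per paper

    for positives, trials in studies:
        a += positives                       # each paper's data updates the prior
        b += trials - positives

    lo, hi = beta.ppf([0.025, 0.975], a, b)
    print(f"posterior mean {a / (a + b):.2f}, 95% interval ({lo:.2f}, {hi:.2f})")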

~~~
pdonis
_> random guessing (of priors) is not a reasonable replacement for science._

Priors in Bayesian statistics are not randomly guessed. The prior is supposed
to reflect your state of knowledge prior to looking at the data from the
experiment/study/test/whatever (hence the name "prior"). For example, in the
mammogram example, it is assumed that the doctor knows the base rate of breast
cancer in the population (which is a very reasonable assumption since there
are mountains of data on such rates) and the false positive and false negative
rates of the test (which come from data on previous tests). Those are the
prior. The doctor doesn't just make up those numbers; they come from prior
knowledge.
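
As a concrete sketch, with illustrative numbers of my own rather than the
article's (a 1% base rate, 90% sensitivity, 9% false positive rate), that
prior knowledge enters like this:

    # Posterior probability of cancer given a positive mammogram.
    # The base rate and error rates below are illustrative, not the article's.
    base_rate = 0.01       # prior: prevalence of cancer in this population
    sensitivity = 0.90     # P(positive | cancer)
    false_positive = 0.09  # P(positive | no cancer)

    p_positive = sensitivity * base_rate + false_positive * (1 - base_rate)
    p_cancer_given_positive = sensitivity * base_rate / p_positive
    print(round(p_cancer_given_positive, 3))  # ~0.092, despite the positive test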

 _> if you can't find a justifiable prior in a scientific work, using Bayes'
rule just replaces one statistical fallacy with another_

In other words, if you have no prior knowledge about something, you can't
expect to do statistics on some small set of data and get reliable answers.
Yes, that's true. And Bayesianism tells you this is true, by telling you that
you can't find a justifiable prior. (Technically, you can always find a
maximum entropy prior, but in most cases that's not really any better than
having no prior at all since it is basically telling you you need a lot more
data before you can conclude anything.) Whereas statistical methods as they
are currently done in many scientific fields simply ignore this issue
completely and pick arbitrary thresholds for statistical significance. Which
is the article's point.

In other words, Bayesianism, properly viewed, is not a magic bullet for
extracting statistical answers from nowhere. It is a tool for exercising
discipline, so that you _know_ when you have to say "sorry, not enough data
for a meaningful answer".
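
A toy illustration of that last point, with numbers I've made up: a uniform
(maximum entropy) prior plus a handful of observations leaves a posterior
interval too wide to conclude much.

    # Uniform Beta(1, 1) prior plus sparse (invented) data: the posterior
    # interval stays wide, i.e. "you need a lot more data".
    from scipy.stats import beta

    successes, trials = 6, 10
    a, b = 1 + successes, 1 + (trials - successes)

    lo, hi = beta.ppf([0.025, 0.975], a, b)
    print(f"95% credible interval: ({lo:.2f}, {hi:.2f})")  # roughly (0.31, 0.83)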

~~~
mcguire
" _It is a tool for exercising discipline, so that you know when you have to
say "sorry, not enough data for a meaningful answer"._"

As opposed to the frequentist approach of saying, "We have data D and
hypothesis H. If the probability of D when not-H is true is below a threshold,
H is true," which hides an awful lot of assumptions.

The advantage of Bayesian statistics is that _you have to explicitly state your
assumptions_ in terms of a prior.

~~~
BeetleB
>As opposed to the frequentist approach of saying, "We have data D and
hypothesis H. If the probability of D when not-H is true is below a threshold,
H is true," which hides an awful lot of assumptions.

My textbook (what would be called a frequentist book) is very explicit: You
either reject the _null_ hypothesis, or you fail to reject the null
hypothesis. A significance level test never concludes that the null hypothesis
is true. It's nuanced, but is not quite the same as your statement.

~~~
pdonis
Yes, and under this description, the equivalent of the Bayesian prior is
picking what the null hypothesis is. Frequentists simply fail to admit (or
perhaps fail to understand) that that choice is just as much of a subjective
judgment as the Bayesian choice of prior.

~~~
BeetleB
>Frequentists simply fail to admit (or perhaps fail to understand) that that
choice is just as much of a subjective judgment as the Bayesian choice of
prior.

Again, this claim is something I see only from self-described Bayesians. I've
never met a professional statistician who failed to admit it. My textbook
(written in the '90s) talks about it in at least two places. In one place it
explicitly warns against assuming the "traditional" approach is more
objective, and points out that the assumptions exist in the model and the
practitioner should be aware of those assumptions.

And really, when it comes to p-values, I've never seen any stats textbook not
give a proper description and interpretation. No statistics textbook I've read
describes significance testing as a mere prescription from which you can
deduce binary answers. My book discusses factors in picking your p-values, and
the implications, and that the appropriate value is very dependent on the
problem.

~~~
nkurz
I think it would be useful if you could apply these principles to one of the
examples in the article. Assume we have a woman who has a mammogram that
indicates malignancy. The question that the patient would like answered is
"What is the chance that I have breast cancer?"

How does frequentist statistics answer this question? What's the null
hypothesis that should be used here? Can this question be answered without
assuming a prior on the likelihood that the individual had breast cancer
before the test was performed?

~~~
BeetleB
>How does frequentist statistics answer this question?

By applying Bayes' theorem, just as the article did. By invoking Type I and
Type II errors.

This is a standard problem in stats textbooks - not breast cancer but the
general problem of a diagnostic measure that is 99% accurate, for a disease
that has less than 1% prevalence.

Frequentists don't insist on using p-values to solve all problems.

>Can this question be answered without assuming a prior on the likelihood that
the individual had breast cancer before the test was performed?

Why should one not assume a prior? There is a base rate for breast cancer. Why
should a frequentist not use that information? I've never heard a statistician
who does not describe himself as a Bayesian say one shouldn't. Every textbook
suggests one should. Where are people getting the idea that frequentists don't
use Bayesian methods? They always have used them.

I'm finding this whole discussion orthogonal to Bayesian vs frequentist
statistics. The main issue frequentists have with a lot of Bayesian approaches
is that Bayesians often want to assign probabilities to one off events,
whereas frequentists insist only on _repeatable_ events. That is where the
accusation of "subjective" comes from. Frequentists like to believe that any
question of probability can be decided by taking N samples and seeing the
outcome (even if only conceptually).

For problems where there is a population, and one can do sampling (i.e.
repeatability), there's never a problem with using Bayesian methods.

~~~
FabHK
Good point. It would be illustrative to get some examples where a prototypical
(but competent) Bayesian and a prototypical (but competent) frequentist would
give different answers.

It seems to me that when there really is a known base rate, the answers would
coincide.

And if there is no base rate, the Bayesian would guess (but explicitly) or
refuse to answer, while the frequentist would give a plausible answer, which,
however, very much hinges on the unknown base rate.

In the light of the replication crisis, maybe the Bayesian approach is better.

------
tunesmith
One other example of the flaw (I think) that came out in pop culture was a few
years ago during the Serial podcast about Adnan Syed.

Near the end when the hosts were wrestling about whether they should think he
was guilty, the interviewer's friend (the producer? Dana?) came on and shared
why she thought he was guilty, and it was this convoluted argument about if
Syed was innocent, he'd have to be the unluckiest guy in the world, so
therefore he was probably guilty.

Her point was that it was just too strong a coincidence that it was _his_ ex-
girlfriend that got killed, on that particular day during some period of time
when he didn't have an alibi, etc etc.

That seemed to me like it could be a bad Bayesian argument - like it would
have been a good argument had Serial selected a random citizen to interview,
but they selected someone they already knew was connected to the story.
Murders happen and by definition the circumstances are always highly unlikely
because murders are so rare, and innocent people closely connected to the
circumstances of the crime are very, very unlucky by definition. You can't
point to that unluckiness as evidence that they're probably guilty.

~~~
roywiggins
Add to that, highly unlikely events occur all the time, but most of the time
they're not anywhere near a murder. Once you start digging around for unlikely
things, you'll find them.

Heck, even if you dig into a random person _associated with a murder_ , you'll
find some unlikely things, but if you dig into a person who has already been
accused of the crime (rightly or wrongly), of course there will be loads of
suspicious circumstances. That's why they were accused in the first place.

~~~
richk449
> highly unlikely events occur all the time

They do?

~~~
bzbarsky
Yes. Your chance of being hit by lightning in a given year is quite low. The
average number of people struck by lightning in the US annually in the last
decade is 270, according to
[https://www.weather.gov/safety/lightning-odds](https://www.weather.gov/safety/lightning-odds)
(and while that's an estimate, the 27 average deaths/year over that period is
not).

So a specific person being killed by lightning within a given year is pretty
unlikely (less than one in 10 million chance), and on average it happens once
every two weeks in the US. Lots of people, see.

Same thing with other situations where there are lots of observations...

~~~
richk449
The likelihood of me being hit by lightning is very low. Therefore, it is a
highly unlikely event. And it doesn't happen very often.

The chance of someone being hit by lightning is not very low. Therefore, it is
not a highly unlikely event. And it does happen relatively frequently.

Unlikely events don't happen very often. That's just a definition. If event X
happens frequently, then it isn't an unlikely event.

~~~
bzbarsky
Any given unlikely event does not happen very often.

But if you have a whole bunch of possible unlikely events, then one of them (a
different one each time, usually) can happen fairly often.

Back to the lightning example, any given person being hit by lightning is an
unlikely event. But as you note, "someone being hit by lightning" is not,
because we are now observing these unlikely events across so many people.

All of which is to say, observing that an unlikely event happened doesn't
provide much information on its own, if you have observed a lot of things
happening in general...

~~~
dTal
Beautifully put - I nominate this as the official summary of the subthread.
It's the reason why we need to pad time estimates for projects and leave early
to catch trains, even when we can't think of any likely reason for delay - the
sum of all unlikely reasons can cross the threshold into "likely".

So it's quite correct to say "unlikely events happen all the time".

------
chrisco255
"The point is that we have good reason to be skeptical, and we should follow
the mantra of the mathematician (and Bayesian) Pierre-Simon Laplace, that
extraordinary claims require extraordinary evidence. By ignoring the necessity
of priors, significance testing opens the door to false positive results."

Such an important quote. I believe the replication crisis is especially
present in nutrition studies. We live in an age where headlines go viral and
new wave diets are taken on in rapid succession. Take all nutrition studies
with a grain of salt. (assuming, of course, that salt is good for you...or is
it???)

------
bambax
> _we should follow the mantra of the mathematician (and Bayesian) Pierre-
> Simon [de] Laplace, that extraordinary claims require extraordinary
> evidence_

We should, but we don't, because we crave extraordinary results and are ready
to give up reason to get them. But if it's too good to be true (or too
spectacular), it probably is.

------
mcguire
" _Harvard Business School professor Amy Cuddy’s 2010 study of “power posing:”
the idea that adopting a powerful posture for a couple of minutes can change
your life for the better by affecting your hormone levels and risk
tolerances._ "

Hey, now. I _like_ standing with my hands on my hips, imagining I have a cape
flowing in the breeze behind me.

------
madhadron
<sigh> Okay, let's do this again.

Anyone arguing over frequentist versus Bayesian statistics is missing the
foundations of their statistical training. Both are subsumed by the framework
of decision theory. This isn't new. It goes back to Wald's work in the 1950's.

And the examples he is talking about wouldn't be saved by different
statistical methods. No fiddling with techniques at the end can save a failed
design of the trial.

~~~
skygazer
I get that you're probably trying to leave it as an exercise to the reader,
but the word "frequentist" doesn't even appear on the wikipedia article for
Decision Theory, nor the Stanford Encyclopedia of Philosophy.

I can't tell if you're being hyperbolic, or if it requires deep study to grasp
how bayesian and frequentist statistics are rendered irrelevant by decision
theory.

Anyone care to take a stab at a layman's summary for the benefit of the under-
educated folks around here?

~~~
James_Henry
Googling <statistical decision theory> might help you out. My opinion is the
Wikipedia page for decision theory could use a good rewrite. Someday I may get
around to it.

Also, madhadron didn't quite say that Bayesian and frequentist statistics are
rendered irrelevant by decision theory. Rather, statistical decision theory
includes Bayesian and frequentist statistics as possible statistical decision
rules. You might want to not use Bayesian or frequentist rules and instead use
minimax regret or something else.

I do think that it is fine that people argue about Bayes vs frequentist. I
wish they'd consider everything else though.

~~~
jbay808
>> You might want to not use Bayesian or frequentist rules and instead use
minimax regret or something else.

... What? Any such decision algorithm at least in theory still takes a
probability distribution as an input. You still need to approximately follow
the proper rules of probability to get that distribution, there's no way
around it.

~~~
James_Henry
You don't need to have a distribution to make a decision. I don't think I
understand what you are saying.

~~~
jbay808
Sorry for the super late reply. I don't know if anyone will see this... But...

In minimax regret, you have a set of available decisions D, and a set of
possible states of nature N, and a utility U(D,N). Each state of nature also
has a probability P(N) (which can be influenced by the decision too in some
problems).

States of nature include "interest rates rise 1%", "interest rates fall 1%",
and "interest rates stay the same". Decisions include "invest in stocks" and
"invest in bonds".

Minimax regret proposes to ignore the probabilities P(N), instead suggesting a
way to make a decision purely based on the utilities of the outcomes. But that
is actually an illusion.

Outside of math class word problems, we don't have N or U(D,N) handed to us on
a silver platter. There is always an infinite range of possible states of
nature, many of which have a probability approaching but never reaching zero,
including states such as "win the lottery", "communist revolution", and
"unexpected intergalactic nuclear war".

In commonsense decision-making we don't include those states of nature in our
decision matrix, because our common sense rules them out as being implausible
before we even think about our options. You wouldn't choose to invest in bonds
just because stocks have the most regret in the event of a communist takeover.

So what actually happens is we intuitively apply some probability threshold
that rules out states of nature falling below it from our consideration. Then
we minimize max regret on the remaining "plausibly realistic" states of
nature.

Humans are so good at doing probability mentally that this step happens before
we even realize it. But if you are writing code that makes decisions, you'll
need to do it, and so you'll need to have at least a rough stab at the
probability distributions.
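
A minimal sketch of that two-step process; the decisions, states of nature,
utilities, and probabilities are all invented:

    # Minimax regret with a plausibility cutoff (all numbers invented).
    # Step 1: drop states of nature whose probability falls below a threshold.
    # Step 2: minimize the maximum regret over the remaining states.
    utilities = {                         # utilities[decision][state]
        "stocks": {"rates rise": -2, "rates fall": 10, "revolution": -100},
        "bonds":  {"rates rise":  3, "rates fall":  4, "revolution":  -90},
    }
    p_state = {"rates rise": 0.45, "rates fall": 0.55, "revolution": 1e-6}

    threshold = 0.01                      # the implicit probability judgment
    plausible = [s for s, p in p_state.items() if p >= threshold]

    best = {s: max(utilities[d][s] for d in utilities) for s in plausible}
    max_regret = {d: max(best[s] - utilities[d][s] for s in plausible)
                  for d in utilities}
    print(min(max_regret, key=max_regret.get))
    # -> "stocks"; leave "revolution" in the matrix and the answer flips to "bonds"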

------
CodiePetersen
Wow, I had read an argument criticising statistical significance before, but it
made nowhere near as good a case as this article does.

------
YeGoblynQueenne
I was reminded of this article, particularly the bit about Sally Clark's case,
when I read this today:

[https://www.theguardian.com/uk-news/2019/aug/02/louise-porto...](https://www.theguardian.com/uk-news/2019/aug/02/louise-porton-jailed-for-life-for-murdering-young-daughters)

It's about a woman who was jailed for killing two of her kids. The
prosecutors alleged she wanted to sleep around and she couldn't because of the
kids. I'm struck by the fact that there was no direct evidence that the mother
was responsible for the deaths. The only "evidence" seems to have been that it
was not clear why the girls died:

 _Both deaths were consistent with deliberate airway obstruction, and doctors
could not find “any natural reason why either, let alone both, should have
died”, prosecutors said._

I would have thought that if there was uncertainty about the cause of death
(and that is exactly what "the doctors could not find any natural reason"
states: uncertainty about the cause of death) then there is not sufficient
evidence to convict.

But, I'm going by only what's in The Guardian article and I don't know the
details of the case.

------
mltony
I see one more problem with the Thinker sculpture study, one that has nothing
to do with how we analyze the data. There was another study a while ago, where
a dime was placed (or not placed, for the control group) in a public copying
machine. The researchers would then ask the people who were using that copying
machine a few questions, one of them being "on a scale from 0 to 10, how happy
are you?" People who found their dimes reported significantly higher happiness
scores, though I don't remember by how much. Assuming this experiment was
real, and not some journalist twisting the data without understanding
statistics, I see some analogy with the sculpture study. In particular, the
conclusion I would draw is that asking people questions about happiness or
religion or anything whatsoever doesn't tell you much about their actual
beliefs, since their answers are heavily affected by their mood at the moment.
I assume this statement could be turned into another study verifying its
correctness, and then after we collect the results, we'd have to start arguing
about Bayes vs. Fisher all over again...

------
6gvONxR4sf7o
It's easy to challenge the subjectivity of a prior, pretending that
frequentist testing is objective... until you design your own experiment,
rather than just analyze one. Turns out this is a more important part. It'd be
illustrative to explain that process.

A p-value threshold of 0.05 (or a test at the 95% confidence level) means
that, if there's actually nothing there, you'll think there's something there
5% of the time (wrongly). It says nothing about error rates if there _is_
something there, which is presumably what we're interested in. For that, you
do what's called a power analysis. The conventional 80% power tells you that,
if there's actually an effect of size X and you know all about your noise,
you'll think there's nothing there 20% of the time (wrongly).
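
As a rough sketch of where those numbers enter a sample-size calculation,
here's the standard normal approximation for a two-sided comparison of two
group means; the effect size of 0.5 standard deviations is just an assumed
input.

    # Rough sample-size calculation for a two-group comparison of means.
    # alpha, power, and the effect size are all assumed inputs.
    from scipy.stats import norm

    alpha = 0.05    # the conventional 5%
    power = 0.80    # the conventional 80%
    effect = 0.5    # assumed effect size X, in standard-deviation units

    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    n_per_group = 2 * ((z_alpha + z_beta) / effect) ** 2
    print(round(n_per_group))  # ~63 per group under these assumptions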

5% comes out of nowhere. 80% comes out of nowhere. A priori knowing your noise
is often feasible, but not always. X, the presumed effect size you have to
magically know before you run your experiment, comes out of nowhere. Getting
this right is crucial to a reliable experiment, and it needs to really reflect
your prior belief, even though it's frequentist. It's totally subjective.

What's worse is that X is often taken to be whatever leads to a
feasible/fundable experiment that'll be done fast enough, not anything
scientific.

It's waaay harder to understand the impact of the subjective 5%/80%/X
decisions than it is to understand the impact of your prior. A prior takes way
less training. Better yet, you can report your results assuming many different
priors and let your reader subjectively decide what to think, so you don't
have to commit as hard.

tl;dr The "gold standard" way of doing science is already really subjective
and that's okay. Equally subjective alternatives can still be better science.

------
fromthestart
So how does one go about choosing reasonable bayesian priors for an
experiment?

~~~
afthonos
It becomes part of your experimental design. Just like people can quibble with
your setup, with your questions, with your procedures, they can quibble with
your priors. The difference is that it's out there, explicit.

Note that a big weakness of Bayes rule is that you can look at any data and
specify a prior that will make it look good. To continue with the mammogram
example, suppose the doctor says "We really don't know if you are likely to
have cancer or not. So we're just going to give 50-50 odds, and see what the
test comes back with." That's a very different prior from the known base rate.
The results would be, where C means "Cancer" and R means "Positive Result":

    
    
      P(C|R) = P(R|C) * P_prior(C) / P(R)
             = 1.0 * 0.5 / (1.0 * 0.5 + 0.05 * 0.5)
             = 0.5 / 0.525
             ≈ 0.95
    

A _much_ higher probability. As you can imagine, you can do that in a paper as
well: you know the data you have, and you come up with a "plausible" prior to
make the data seem important.
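
To see how much the choice of prior moves the answer, here's a quick
comparison using the same test characteristics as above (sensitivity 1.0,
false positive rate 0.05) and, for contrast, an illustrative 1% base rate:

    # Same test, two different priors; the 1% base rate is illustrative.
    def posterior(prior, sensitivity=1.0, false_positive=0.05):
        evidence = sensitivity * prior + false_positive * (1 - prior)
        return sensitivity * prior / evidence

    print(round(posterior(0.50), 3))  # ~0.952 with the "we just don't know" prior
    print(round(posterior(0.01), 3))  # ~0.168 with a 1% base rate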

In my opinion, in any switch to using Bayesian analysis in scientific work,
pre-registering priors will be essential.

~~~
pdonis
_> a big weakness of Bayes rule is that you can look at any data and specify a
prior that will make it look good_

This isn't a weakness in Bayes' rule, it's a weakness in your experimental
protocol. You're supposed to pick the prior _before_ doing the experiment and
seeing the data.

 _> In my opinion, in any switch to using Bayesian analysis in scientific
work, pre-registering priors will be essential._

Pre-registering statistical criteria and assumptions should already be
essential, whether you're a Bayesian or not. The fact that it isn't is a key
factor behind the replication crisis.

~~~
afthonos
> _This isn 't a weakness in Bayes' rule, it's a weakness in your experimental
> protocol. You're supposed to pick the prior before doing the experiment and
> seeing the data._

Sure, if the goal is to get to something true. If the goal is to publish or to
maintain your position, though, you’ll work differently.

You don’t even need to have seen the data. If I set my priors for the earth
being flat extreme enough, it’ll take a long time for even good faith updating
to converge to reality.

As I said in another reply, I was simply pointing out that Bayesian analysis
can also be abused, and that proper protocols still need to be followed. A
point on which I believe we agree. :-)

~~~
pdonis
_> Bayesian analysis can also be abused, and that proper protocols still need
to be followed. A point on which I believe we agree. :-)_

Yes, indeed. :-)

------
pmisans
Has anyone heard of a scientific journal adopting a peer-replicated
approach (only accepting papers that peers can replicate)? With replication
rates so low, I'd imagine some of these prestigious journals would want to vet
their papers more strictly, but a cursory Google search didn't dig anything up
for me.

------
cortesoft
My favorite way to show the base rate fallacy is to take the most extreme
example.

Suppose I have an invisible dragon detector that is accurate 99.9% of the
time... if it tells me there is an invisible dragon in the room, it doesn't
mean there is only a 0.1% chance there is not a dragon... there is a 100%
chance there is not an invisible dragon in the room, because they don't exist.

~~~
TheOtherHobbes
That only works if you already know something is/isn't true with absolute
certainty.

It's useless for fundamental research, because by definition you're exploring
what you don't know yet.

The real problem is more one of labelling. "True according to science" isn't a
binary, but it's treated as if it is - especially by marketers.

Science is more like a set of concentric circles of decreasing confidence. You
can be very confident indeed about the contents of the centre circle which
includes undergraduate physics and engineering. You can also be confident that
there are commonly agreed edge cases, areas of inaccuracy, and extreme
circumstances where the science stops being reliable.

As you get further away from the centre confidence decreases. A lot of the
debate about replication is about research that is a long way from the centre,
where uncertainty is high.

But neither researchers nor the science press nor the mainstream media will
report this. Studies are usually presented as "Science says...", as if you're
supposed to be just as confident of the results of a psychological study that
asks a population of 30 undergrads from the same college and the same year
some poorly designed questions as you are in Special Relativity.

~~~
cortesoft
It is just to show why the base rate matters, using the most extreme example...
it doesn't imply that the base rate is easy to calculate.

------
BoiledCabbage
It's incredible it's held up this long. It shows how much science actually is
dogma. Human nature wins out over rationality.

~~~
xamuel
Not so incredible if you consider how the sausage is made. People who hate
math pursue social science PhDs and are taught "p-values = truth". These same
people are pressured to produce lots of research, and not just boring research
but preferably shocking research. Journals are incentivized to publish such
shocking research because that kind of research gets cited. When you think
about it, this stuff will probably hold up for a long time to come.

~~~
SantalBlush
It has little to do with stereotypes about social scientists hating math or
misunderstanding p-values.

The current system incentivizes p-hacking. Nobody wants to throw away their
work if it doesn't meet the p < 0.05 criterion, especially when their career
is on the line.

~~~
SkyBelow
While it isn't the most concrete sample size, when I was in college the
mathematics requirements for the soft sciences were far inferior to those for
the hard sciences, to the extent that even the easier statistics course in the
math department wasn't required; it was instead replaced by a statistics class
in the social science department that was only valid for social science
degrees.

------
master_yoda_1
The author teaches a basic course on statistics on harvard extension school
and charges ~$2800 for that. He wrote a clickbait random article and I can see
it on hacker news front page. And we say only facebook can publish fake news
;)

~~~
marcus_holmes
sooo... what's fake? You made a ton of ad-hominem there, but what's the actual
criticism of the article?

~~~
reallydude
Just from a casual look, the cancer diagnosis modeling is flawed.

> the doctor would need to consider the overall incidence rate of cancer among
> similar women with similar symptoms, not including the result of the
> mammogram

That's a bad assumption. Mammogram is a radiology tool to investigate tissue.
It's not a randomizing element as it's fundamental to arriving at the thesis,
which is then correlated from MULTIPLE vectors.

> a similar patient finds a lump it turns out to be benign

A manual inspection is not the same. In good faith, let's assume they are the
same for no reason but to argue about how not to do medicine as some precept
for "Base rates are effectively random".

Turns out, Base rates are not random guesses.

What's interesting about Bayesian Theory is we use it all the time and then
observe and collect statistical data about the outcomes, after making
assumptions (like a specific Base rate) and use back propagation to correct
until models fit measurable events. This is why tests often have caveats
about efficacy. The Base rate is sometimes, reasonably, unknown because there is
no additional correlation. This doesn't indict Base rates, since in the vast
majority of cases there are multiple vectors (or new vectors are generated)
that show this process is reliable (beyond a few dice rolls). There have also
been cases where there is no corroboration from new measures and the deduction
is that the original measure and Base rate were random.

It's a lot of hand waving from a classic troll. Why? Probably for students who
want to feel like they have "discovered" how the establishment is ignorant.

~~~
DanBC
The author mangled the mammogram example, but when written correctly it's a
good example to use.

The author should have used a woman with no symptoms who goes for a mammogram
screening test, not a woman with symptoms who goes for a diagnostic test.

[https://www.harding-center.mpg.de/en/fact-boxes/early-detect...](https://www.harding-center.mpg.de/en/fact-boxes/early-detection-of-cancer/breast-cancer-early-detection)

