
Thomas Bayes and the crisis in science - agonz253
https://www.the-tls.co.uk/articles/public/thomas-bayes-science-crisis/
======
rossdavidh
On the one hand, I have (in the semiconductor manufacturing industry)
encountered statisticians who were greatly averse to making any prior
assumptions about the likelihood of something. Also, any article with the
phrase "nonsense on stilts" in it is not entirely unworthy. It also does a
credible job of explaining Bayes' theorem.

On the other hand, I think it somewhat exaggerates the extent to which Bayes'
theorem and its associated work are rejected by statisticians, and also the
extent to which a magical p < 0.05 limit was advocated. I believe it was non-
statisticians who wanted easy tests that they could apply without
understanding much about statistics, who were to blame in that case.

~~~
analog31
A statistician could talk me out of this, but I've always been puzzled by the
use of "prior." Nothing in Bayes's Theorem says that the prior has to be
established _before_ anything else. It makes Bayesian statistics seem like a
matter of letting your expectations influence your results.

Instead, an interpretation that seems more favorable to me is that you
consider all of the information at your disposal that can be brought to bear
on a problem, and this could include known constraints on likelihoods. Bayes's
Theorem becomes a tool for working through problems where a single statistic
can't be readily used to analyze an entire data set, e.g., when data come from
disparate sources and can't be readily combined.

~~~
sl8r
Part of the issue is that you can't _not_ assume a prior; it's unavoidable.
The Bayesian POV just makes this assumption clear / explicit, while many
frequentist methods (if naively applied) amount to choosing a uniform prior.

E.g., imagine you have an e-commerce site that has, historically, had a 2%
conversion rate (landing page to purchase). Now you run an A/B test with two
variants, a control (A) and a treatment (B). Both buckets get 10,000 landings,
of which A converts 200 of them and B converts 250. How can you tell if B is
better than A? Cutting to the chase, the frequentist approach (if applied
naively) would be to model B as some distribution centered around 2.5%, for
example N(2.5%, 0.15%) or Beta(251, 9751).

A Bayesian would say that this assumes a uniform prior - but that this is
probably a bad prior because it ignores what we know about the historical
conversion rate of 2.0%. Said another way, the above amounts to saying (before
we run the test) that we think it's just as likely for B to have a conversion
rate of 2.5% as it is to have a conversion rate of 100%. Clearly we don't
actually believe this.
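To make the contrast concrete, here's a minimal sketch (the helper name is
mine, not from the thread) of bucket B's posterior under both priors:

```python
# Posterior for a conversion rate under a Beta prior: observing
# `conversions` successes in `trials` attempts updates Beta(a, b)
# to Beta(a + conversions, b + trials - conversions).
def beta_posterior_mean(a, b, conversions, trials):
    a_post = a + conversions
    b_post = b + (trials - conversions)
    return a_post / (a_post + b_post)

# Bucket B: 250 conversions out of 10,000 landings.
uniform = beta_posterior_mean(1, 1, 250, 10_000)    # Beta(1, 1): "know nothing"
informed = beta_posterior_mean(3, 99, 250, 10_000)  # Beta(2+1, 100-2+1): 2% history

print(f"uniform prior:  {uniform:.4%}")
print(f"informed prior: {informed:.4%}")  # pulled slightly toward 2.0%
```

With 10,000 landings the prior barely matters; the difference shows up with
smaller samples, as discussed further down the thread.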

~~~
adwn
> _Cutting to the chase, the frequentist approach (if applied naively) would
> be to model B as some distribution centered around 2.5%_

Why would that be wrong?

> _A Bayesian would say that this assumes a uniform prior - but that this is
> probably a bad prior because it ignores what we know about the historical
> conversion rate of 2.0%._

What would a Bayesian conclude instead?

~~~
sl8r
> Why would that be wrong?

The issue is that modeling B with a distro centered around 2.5% ignores what
we know about the historical conversion rate (2.0%) and the control bucket's
conversion rate (also 2.0%). If our goal is to make the best estimate for the
future that we can, we should take this data into account when evaluating B.
As a thought experiment, imagine that you have A at 2.0% and B at 2.5%
conversion for Week 1, with a historical conversion rate of 2.0%. Someone says
they'll pay you $100 if you correctly guess what B's conversion rate will be
next week, either (i) in the range 2.0% to 2.5%, or (ii) in the range 2.5% to
3.0%. I'd prefer to bet on (i) than on (ii).

> What would a Bayesian conclude instead?

One simple approach would just be to start with a more informative prior, like
Beta(2+1,100-2+1) instead of Beta(1,1). This would pull bucket B's posterior
distribution closer to 2.0%. Another approach is to use a hierarchical model
[1], which will fit the individual buckets' priors for you.

[1] Here's something I wrote on this a couple years ago, more focused on
solving multiple comparisons problems but with the same proposed solution:
[http://normal-extensions.com/2014/07/16/ab-testing-hierarchical-model/](http://normal-extensions.com/2014/07/16/ab-testing-hierarchical-model/)

~~~
adwn
> _The issue is that modeling B with a distro centered around 2.5% ignores
> what we know about the historical conversion rate (2.0%) and the control
> bucket's conversion rate (also 2.0%)._

Both the historical and the control bucket used version A of the website, and
they are consistent in their 2.0% conversion rate. Version B is different, and
it appears to have a different conversion rate of 2.5%. So why should it not
have a future conversion rate close to 2.5%?

Let's replace the website with a 6-sided die. Historically, the probability of
throwing a 3 was 1/6. Now you replace your die with a _different_ die and
throw it 10,000 times; the 3 comes up 2560 times. If I had to guess how many
times the 3 comes up the next 10,000 throws, I certainly would bet that it's
closer to 2560 times than to 1667 times.

> _Someone says they'll pay you $100 if you correctly guess what B's
> conversion rate will be next week, either (i) in the range 2.0% to 2.5%, or
> (ii) in the range 2.5% to 3.0%._

Case A: The historical version A of the online shop had some influence on the
conversion rate during the testing of version B, drawing the conversion rate
of B down. This influence will fade away in the future, so B's conversion rate
will be closer to [2.5%, 3.0%] than to [2.0%, 2.5%].

Case B: The historical version A of the online shop _did not_ have any
influence on the conversion rate during the testing of version B (compare the
dice example above). Then both ranges are equally plausible. But "[2.0%, 2.5%]
vs [2.5%, 3.0%]" is a bad dichotomy. A more relevant one would be "[1.75%,
2.25%] vs [2.25%, 2.75%]". In that case, I would bet on [2.25%, 2.75%].

~~~
sl8r
Late to the party, but:

> Both the historical and the control bucket used version A of the website,
> and they are consistent in their 2.0% conversion rate. Version B is
> different, and it appears to have a different conversion rate of 2.5%. So
> why should it not have a future conversion rate close to 2.5%?

It's all a matter of degree. You'd model B's rate as _closer_ to 2.5%, but
probably not centered around 2.5%. As you observe more data, the prior becomes
less important. E.g., with 10k samples as in the original example, if you used
Beta(2+1,100-2+1) as your prior, your posterior would be Beta(252+1,
10100-252+1), which is centered at 2.495%. But if you only had 1000 samples
(and 25 conversions), you'd get a distro centered at 2.45%. And if you only
had 200 samples (and 5 conversions), you'd get a distro centered at 2.33%.
Etc.
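Those figures are easy to check with a few lines (a sketch of mine; the
"centered at" values are the peak, i.e. the mode, of the Beta posterior):

```python
# Peak (mode) of the Beta posterior under a Beta(3, 99) prior,
# i.e. 2 pseudo-conversions out of 100 pseudo-landings on top of Beta(1, 1).
def posterior_mode(conversions, trials, prior_a=3, prior_b=99):
    a = prior_a + conversions
    b = prior_b + (trials - conversions)
    return (a - 1) / (a + b - 2)  # mode of Beta(a, b) for a, b > 1

for conv, n in [(250, 10_000), (25, 1_000), (5, 200)]:
    print(f"{conv}/{n}: peak at {posterior_mode(conv, n):.3%}")
# -> 2.495%, 2.455%, 2.333%: the less data, the stronger the pull toward 2.0%
```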

> Let's replace the website with a 6-sided die. Historically, the probability
> of throwing a 3 was 1/6. Now you replace your die with a different die and
> throw it 10,000 times; the 3 comes up 2560 times. If I had to guess how many
> times the 3 comes up the next 10,000 throws, I certainly would bet that it's
> closer to 2560 times than to 1667 times.

In the case of a die where you believe any weighting of the faces is equally
likely, this would be true. So this may be an appropriate model in this case.
But in the case of the website, I don't think the conversion rates are equally
likely, even for a new, un-tested site. If the historical conversion rate is
2.0%, and I'm forced to bet on the most likely conversion for a new (never
before seen) variant B, I'd much rather bet on a number near 2.0% than a
number like 99%.

> Case B: The historical version A of the online shop did not have any
> influence on the conversion rate during the testing of version B (compare
> the dice example above). Then both ranges are equally plausible.

This is exactly what I'm claiming _is not_ true. It's not that A influences B,
it's that A tells you something about the likely range of A and B (in this
specific case of an e-commerce site). (The reason I chose the ranges [2.0%,
2.5%] vs [2.5%, 3.0%] is that if you model B independently, you'd be
indifferent between these ranges; but if you use A to inform a prior, you'd
prefer [2.0%, 2.5%].)

------
hooloovoo_zoo
It's amusing that the people who are most militantly Bayesian aren't Bayesian
statisticians. It's almost as if there are advantages and drawbacks to the
Bayesian perspective. By the way, frequentism vs. Bayesianism has very little
to do with Thomas Bayes. All statisticians accept the validity of Bayes'
Theorem. Moreover, the two approaches are not mutually exclusive. It's always
interesting to investigate the frequentist properties of Bayesian estimators
and the implicit priors of frequentist estimators.

~~~
RobertRoberts
I don't mean to segue here, but I got caught in a logical argument that ended
abruptly with no definitive answer.

I mentioned the Monty Hall problem to a friend (an engineer) and discussed the
statistical analysis done on this issue. (i.e., statistically it's better to
switch doors after the first is opened, as it improves your odds of winning
from 1/3 to 2/3)

But he only answered with "nope, Bayes' Theorem says the odds are the same no
matter what you do". The only other point he added was that if you do
something only once, you have the same odds every single time.

I found this frustratingly simplistic, because we _know_ through testing that
this isn't true. That the odds are better if you switch doors.

Is this the fanaticism of Bayesians? He even called me a "frequentist" as if
it were some kind of pejorative.

I've researched this issue, and even mentioned to him his logical flaw (you
choose twice, not once) and still "Bayes' theorem says there's no difference
and you're wrong because you are a frequentist"

sigh.

~~~
AlexCoventry
The easiest way to convince your friend of his error is to imagine the odds
when there are _n_ doors, and Monty opens _n-2_ of them, the ones which don't
contain the prize and aren't your first pick.

~~~
DougBTX
How could Bayesian statistics be applied in this case? I’m wondering if the
situation is just too “simple” to make it applicable.

Before Monty opens a door, the prior probability is 1/3 for each door.

After he opens a door, that door has zero probability, since we know he only
opens a door without the car. But how do we update the remaining
probabilities?

Seems the simple way to look at it is to not partition by door, but by chosen
vs not chosen. Chosen is 1/3 and not chosen is 2/3 before and after Monty
opens a door, so perhaps there is no “Bayesian information” revealed by
opening the door anyway.

~~~
Bromskloss
> Seems the simple way to look at it is to not partition by door, but by
> chosen vs not chosen.

Right. That might be the easiest way for this problem.

More straightforwardly, without that shortcut:

- Call the doors _a_, _b_, and _c_. Assume, without loss of generality, that
we choose door _a_ initially. Let the random variable _X_ ∈ {_a_, _b_, _c_}
be the door with the car.

- Our prior probability is uniformly distributed: Pr(_X_ = _a_) = Pr(_X_ =
_b_) = Pr(_X_ = _c_) = 1/3.

- The data _Y_ that we collect is our observation of which door gets opened
by the host. The likelihood function Pr(_Y_ = _y_ | _X_ = _x_) is the
probability of the observation being _y_ (_i.e._, that _Y_ = _y_), given that
the underlying state is _x_ (_i.e._, that _X_ = _x_). The only non-zero
likelihoods are Pr(_Y_ = _b_ | _X_ = _a_) = Pr(_Y_ = _c_ | _X_ = _a_) = 1/2
and Pr(_Y_ = _c_ | _X_ = _b_) = Pr(_Y_ = _b_ | _X_ = _c_) = 1.

- Bayes' theorem, Pr(_X_ = _x_ | _Y_ = _y_) = Pr(_Y_ = _y_ | _X_ = _x_)·Pr(
_X_ = _x_)/Pr(_Y_ = _y_), gives the answer, the posterior probability, which
should be seen as a function of _x_. The denominator Pr(_Y_ = _y_) =
∑ Pr(_Y_ = _y_ | _X_ = _x_)·Pr(_X_ = _x_), sum over _x_ ∈ {_a_, _b_, _c_},
is a normalising factor that makes the posterior probability distribution sum
to 1. It is also the probability we, at the start of the game, assign to
_Y_ = _y_. In our problem, Pr(_Y_ = _a_) = 0 and Pr(_Y_ = _b_) = Pr(_Y_ =
_c_) = 1/2.

How about you put in the numbers and see if it comes out right or if I have
made a mistake? :-)

Edit: Sorry, I just realised that we could have made it simpler by assuming
that the host opens, say, door _b_.
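Plugging the numbers in (a quick sketch; we picked door _a_ and observe the
host open door _b_):

```python
prior = {'a': 1/3, 'b': 1/3, 'c': 1/3}
# Likelihood of observing Y = b (host opens door b) given the car is behind x:
likelihood_y_b = {'a': 1/2, 'b': 0.0, 'c': 1.0}

# Normalising factor Pr(Y = b) = sum over x of Pr(Y = b | X = x) * Pr(X = x)
evidence = sum(likelihood_y_b[x] * prior[x] for x in prior)  # 1/2

posterior = {x: likelihood_y_b[x] * prior[x] / evidence for x in prior}
print(posterior)  # a: 1/3, b: 0, c: 2/3
```

So it does come out right: sticking with _a_ wins 1/3 of the time, while
switching to _c_ wins 2/3.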

------
AlexCoventry
You can abuse Bayesian methods just as easily as you can hack a p-value. Maybe
more easily, since fewer people would be aware of the issues.

What's needed is a shift in researcher attitudes and incentives, to emphasize
development of reliable knowledge instead of publication record. Just changing
the rules of the game slightly will only lead to people adjusting their game
slightly.

[http://www.stat.columbia.edu/~gelman/research/unpublished/p_...](http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf)

~~~
outlace
Yes but with traditional frequentist approaches most scientists don’t
understand what’s going on under the hood of the statistical tools they’re
using. There are dozens and dozens of named tests like “Fisher’s exact test”
and “Wilcoxon signed-rank test” and most scientists just sort of follow
received wisdom about which test to use in which situation.

Bayesian methods force you to actually think about how your data and model
parameters are distributed, and explicitly specify a model.

~~~
eli_gottlieb
>Bayesian methods force you to actually think about how your data and model
parameters are distributed, and explicitly specify a model.

Yep. I know some very senior scientists who get a lot of mileage out of
finding places where "model-free" methods are implicitly assuming a _bloody
stupid_ model, and then attacking them.

------
Eliezer
For a more detailed take advocating the particular solution "report
likelihoods, not posteriors or p-values", see "Likelihoods, p-values, and the
replication crisis":
[https://arbital.com/p/likelihoods_not_pvalues/?l=4xx](https://arbital.com/p/likelihoods_not_pvalues/?l=4xx)

------
eli_gottlieb
>We are living in a new Bayesian age. Applications of Bayesian probability are
taking over our lives. Doctors, lawyers, engineers and financiers use
computerized Bayesian networks to aid their decision-making. Psychologists and
neuroscientists explore the Bayesian workings of our brains. Statisticians
increasingly rely on Bayesian logic. Even our email spam filters work on
Bayesian principles.

While that's true, strictly speaking, it's akin to writing "We are living in a
new logical age" when Boolean algebra was first finding wide application in
engineering and the natural sciences. "Bayesian methods" just
mean using statistical modeling techniques that conform to Bayes' rule as
their normative guide. In "machine learning" or "frequentist" terms, it just
means that good approximate-Bayesian reasoning minimizes the KL divergence
between the true posterior and the approximate model (whether variational or
by sampling or by training a neural network, whatever), as opposed to
minimizing the classification hinge-loss or the mean squared error (though
some of those losses have formally equivalent Bayesian priors).
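As a toy illustration of that last point (a sketch of mine; the distributions
are made up), the KL divergence from an approximation _q_ to a "true"
posterior _p_ over three discrete states:

```python
import math

def kl(p, q):
    """KL divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]          # "true" posterior
q_close = [0.6, 0.25, 0.15]  # a decent approximation
q_flat = [1/3, 1/3, 1/3]     # a uniform approximation

print(kl(p, q_close))  # small
print(kl(p, q_flat))   # larger -- the flat fit is a worse approximation
```

An approximate-Bayesian method (variational or otherwise) is "good" exactly
when this divergence to the true posterior is small.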

~~~
plassma
Strictly speaking, health care and genocide are the same thing, you just
minimize different metrics.

------
gfodor
Is there a methodology for Bayesian analysis that avoids the choice of a
specific prior but instead provides conclusions in the form of
boundary/regions in "prior-space" and their effect on belief? For example, it
would be incredibly useful if the output of research allowed a reader to gauge
support of the conclusion in a minimally subjective way by explaining what
effect choices in prior have on results. I'm not a statistician so I'm
assuming this is a well understood thing, but would be curious to know if and
how it is practiced.

~~~
hooloovoo_zoo
Generally speaking, a good Bayesian analysis includes what's known as a
"sensitivity analysis" which seeks to measure how sensitive the results are to
the particular choice of prior. Additionally, if strong prior assumptions are
not available, an "uninformative" prior is used. In such cases, the results
tend to be pretty close to those from frequentist methods, except the
frequentist methods lack the Bayesian probabilistic interpretation.
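A bare-bones version of such a sensitivity check (a sketch of mine, reusing
the 250-conversions-out-of-10,000 numbers from the A/B discussion above):

```python
conversions, trials = 250, 10_000
priors = {
    "uninformative Beta(1, 1)": (1, 1),
    "Jeffreys Beta(0.5, 0.5)":  (0.5, 0.5),
    "informative Beta(3, 99)":  (3, 99),
}

results = {}
for name, (a, b) in priors.items():
    # Posterior mean of the conversion rate under this prior.
    results[name] = (a + conversions) / (a + b + trials)
    print(f"{name}: {results[name]:.4%}")

# With this much data, all three posterior means land within roughly
# 0.005 percentage points of each other: the conclusion is insensitive
# to the choice of prior.
```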

