
What’s Wrong with Bayes - luu
https://statmodeling.stat.columbia.edu/2019/12/03/whats-wrong-with-bayes/
======
signalsmith
For me, I really appreciate the Bayesian approach because it makes it very
explicit that you pick a prior.

Perhaps my experience is limited, but every (supposedly non-Bayesian) model
I've used in practice has been possible to re-express using Bayesian terms,
priors and beliefs and so on. Then I get to look at the initial assumptions
(model/prior) and use suitable human hand-wavey judgement about whether they
make sense.

Bayes is a good way to _update_ models, but if you lose sight of the fact that
the bottom of your chain of deduction was a hand-wavey guess, you're in
trouble.

~~~
eanzenberg
Yeah, no thanks though. I don't want every rando adding "priors" that "feel"
right to their analysis. Frequentist is straightforward. Both can be (and
are) abused to prove bias.

~~~
jules
The difference between a frequentist and a Bayesian is that the latter admits
that he picks a prior. A frequentist smushes together (1) the statistical
assumptions (2) the approximations that make the problem computationally
tractable and (3) the mathematical derivations, into one big mess. Just
because you're not stating your assumptions doesn't mean there are none.
Consider maximum likelihood estimation. It is not invariant under coordinate
transformations. So which coordinates you pick is an assumption. In fact, with
Bayesian estimation you can do the same thing: picking a prior is equivalent
to picking the uniform prior in a different coordinate system. So frequentist
estimation does involve picking a prior by picking a coordinate system, even
if the frequentist does not admit this.

Frequentist methods are conceptually anything but straightforward. The
advantage of frequentist methods is that they are computationally tractable.
Usually they are best understood as approximations to Bayesian methods. For
instance, MLE can be viewed as the variational approximation to Bayes where
the family of probability distributions is the family of point masses, and the
prior is uniform.
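The coordinate-dependence point can be seen numerically. A small sketch (hypothetical data: 7 heads in 10 flips): a prior that is flat on p and a prior that is flat on logit(p) give different posterior modes, because the change-of-variables Jacobian acts like a non-flat prior.

```python
import numpy as np

# Sketch with hypothetical data: 7 heads in 10 flips.
k, n = 7, 10
p = np.linspace(1e-6, 1 - 1e-6, 100_000)
lik = p**k * (1 - p)**(n - k)

# "Uniform" prior on p: the posterior mode is the MLE, k/n = 0.7.
mode_flat_p = p[np.argmax(lik)]

# "Uniform" prior on theta = logit(p): by change of variables its
# density on the p scale is 1/(p(1-p)), so the posterior on p is
# proportional to p^(k-1) (1-p)^(n-k-1), with mode (k-1)/(n-2) = 0.75.
mode_flat_logit = p[np.argmax(lik / (p * (1 - p)))]

print(round(mode_flat_p, 3), round(mode_flat_logit, 3))  # 0.7 0.75
```

So "use the uniform prior" is not a coordinate-free instruction, which is the sense in which picking coordinates smuggles in a prior.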

~~~
Akababa
What do you mean by coordinate transformation? MLE is invariant under
parameter transformations because it's just the argmax of the likelihood.

~~~
knzhou
Adding to the other comments, you still have prior-dependence on a more subtle
level, because it depends on what hypotheses are allowed.

Here's an extreme example. Consider flipping an apparently fair coin and
getting "THHT". The hypothesis that the coin is fair gives this result with
likelihood 1/16. The hypothesis that a worldwide government conspiracy has
been formed with the sole purpose of ensuring this result... has a likelihood
of 1.

But nobody would ever declare this the MLE, because "government conspiracy"
isn't one of the allowed options. But it isn't precisely because it's
unlikely, i.e. because of your prior. Of course this is an extreme example,
but there are more innocuous prior-based assumptions baked in too.

~~~
closed
Wait, in frequentist statistics getting, say, a p-value of 1 is not a bad
thing--unless you erroneously assume that value is _evidence for your null
hypothesis_.

Consider that if your data generating process really is a fair coin, then the
conspiracy outcome you mention only occurs 1 out of 16 times, so 15 out of 16
times you observe a likelihood of 0. 15 out of 16 times you reject the
conspiracy case.

There is also a tricky component here, because the notion of sample size is
not clearly defined (can we generate multiple 4-tuples of flips, and consider
each one a sample? Is your example really just a funky way of discussing type
II power?)

~~~
knzhou
> Wait, in frequentist statistics getting, say, a p-value of 1 is not a bad
> thing--unless you erroneously assume that value is evidence for your null
> hypothesis.

That's exactly what I'm saying. Suppose you get HHTHT. Then you run the
following statistical test:

Hypothesis: a government conspiracy has been hatched to make you get HHTHT.

Null hypothesis: this is not the case.

The p-value is 1/32, so the null hypothesis is rejected.

This is bad reasoning for two reasons: first the alternative hypothesis is
incredibly unlikely, and second the choice of alternative hypothesis has been
rigged after seeing the data. These are exactly the two reasons so many social
science studies running on frequentist stats have done terribly, and why we
would benefit from Bayesian stats which force you to make these issues
explicit.

~~~
bonoboTP
> The p-value is 1/32, so the null hypothesis is rejected.

No, the p-value is defined as the likelihood of a result _at least as extreme_
as the one we obtained, under the null hypothesis. It's not simply the
likelihood of the particular result you obtained, as that would always be zero
for continuous quantities! (Remember that the p-value's distribution is
_uniform_ over the 0-1 interval under the null, so any criticism that says the
p-value is almost always small just by chance must be wrong somewhere).

So first you need to establish a way to say what result is how extreme. This
is very often trivial and quite objective (the more people cured/made sick,
the more extreme the effect of the drug). For the coin flip case, one way
would be to call results with more imbalanced ratio more extreme. Then in your
3 heads out of 5 case, the (one sided) p-value would be the likelihood of
getting 3, 4 or 5 heads out of 5. You can also come up with a different way to
define what "more extreme" means (and put it forward in a convincing way),
otherwise you can just not talk about p-values. You can keep talking about
likelihoods, but not p-values.
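For the concrete case above, the one-sided p-value follows from binomial tail counts (a quick sketch):

```python
from math import comb

# One-sided p-value under a fair-coin null, with "more heads = more
# extreme": probability of seeing 3, 4, or 5 heads in 5 flips.
p_value = sum(comb(5, h) for h in range(3, 6)) / 2**5
print(p_value)  # 0.5
```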

~~~
knzhou
> No, the p-value is defined as the likelihood of a result at least as extreme
> as the one we obtained, under the null hypothesis.

Define for me in an objective way what "at least as extreme" is. Let's say I
think the string "HHTHT" is extremely indicative of conspiracy. Then the
p-value is 1/32 on the measure of "strings of coin flips at least this
extremely indicative of conspiracy".

See, this sounds completely ridiculous, but it's not in principle any
different from what is done in thousands of social science papers a year. All
these supposedly objective procedures have tons of ambiguity. For example:

> For the coin flip case, one way would be to call results with more
> imbalanced ratio more extreme.

Why an imbalanced total ratio? Why not average length of heads? Average number
of occurrences of "HT"? Frequency of alternations between H and T? Average
fraction of times H appears counting only even tosses? Given the combinatorial
explosion of possible criteria, I guarantee you I can find a simple-sounding
criterion on which any desired string of fair tosses gets a low p-value.

~~~
6gvONxR4sf7o
> Define for me in an objective way what "at least as extreme" is.

Come up with some one-dimensional test statistic T whose distribution you
know under your null hypothesis. Define a one-sided p-value for data x as
P(T <= T(x)) under that null distribution.

It sounds like your statistic is 0 if the sequence is always "HHTHT" and 1
otherwise? In this case your p value is 1 unless every attempt is "HHTHT" in
which case it's zero, so the test statistic is 0 with probability 1/32^k for k
attempts. The more attempts you do, the smaller p gets if the null is false.
It's working as intended. For this test, a threshold of p=0.05 would be dumb,
but it's always dumb.

It's not an awful test assuming you came up with your test statistic and
"HHTHT" before collecting your data. It meshes with the intuition of betting
your friend "Hey I bet if you flip this coin you'll get HHTHT." If they
proceed to flip it and see HHTHT, they are reasonable to think maybe you know
something they don't.

If you come up with your test statistic after the fact, there's theory around
p hacking to formalize the intuition of why it's not convincing to watch your
friend flip some sequence of coins and then tell them "dude, I totally knew it
was going to be that" after the fact.
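A sketch of the pre-registered indicator test described above (assuming 5 flips per attempt and the target sequence fixed in advance):

```python
# One-sided p-value for the indicator statistic: T = 0 iff every one of
# k pre-registered attempts comes up exactly "HHTHT", else T = 1.
def p_value(all_match: bool, k: int) -> float:
    # Under the fair-coin null, P(T <= 0) = (1/32)^k; otherwise P(T <= 1) = 1.
    return (1 / 32) ** k if all_match else 1.0

print(p_value(True, 1))   # 0.03125
print(p_value(False, 3))  # 1.0
```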

~~~
cygaril
A more general method is to use the likelihood ratio, ie the ratio of the
likelihood of an outcome under the alternative hypothesis to its likelihood
under the null hypothesis. Then pick the outcomes for which this
ratio is highest as the ones which will cause you to reject the null
hypothesis. Equivalently, the p-value is the probability under the null
hypothesis that the likelihood ratio would be at least this large.

This works in the discrete case too, and gives p=1/32 in the original coin
flip case.

~~~
6gvONxR4sf7o
Is the likelihood ratio test more general? I thought that one of the benefits
of the usual NHST framework was that you only need the distribution of your
stat under the null. With LRT don't you need the distribution under both the
null and the alternative? How do you frame a null of mu = 0 against an
alternative of mu != 0 with x ~ D_mu in this way?

~~~
cygaril
You don't necessarily need the distribution under the alternative to determine
the values for which the likelihood ratio will be highest. In your example,
the tails will be the areas of maximum likelihood for any (symmetric)
alternative.

~~~
6gvONxR4sf7o
Huh, TIL. Thanks :)

------
abeppu
I feel this post should be considered along with its sibling:
[https://statmodeling.stat.columbia.edu/2019/12/04/whats-wrong-with-null-hypothesis-significance-testing/](https://statmodeling.stat.columbia.edu/2019/12/04/whats-wrong-with-null-hypothesis-significance-testing/)

I think reading either alone is prone to lead readers to a false understanding
of Gelman's perspective.

------
syrrim
If the goal is to avoid bankruptcy, then the probability needs to be
interpreted differently. If you bet the house every time, you're guaranteed to
go bankrupt eventually. Suppose instead you bet half your money on an event of
50% probability. If you take 1:1 odds on this, then when you lose, your money
is divided by 2, but when you win it is only multiplied by 1.5. Your money
will tend to decrease over time. You need to pick odds 1:a such that 1+a/2=2
=> a=2.

We recover our regular betting odds by betting a smaller portion of our money.
If we bet a portion 1/d of our money on an event of probability 1/p, we need
odds 1:a such that 1+a/d=(d/(d-1))^(p-1). For large enough d we get a=p-1, as
we would expect.
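The formula can be sanity-checked numerically (a quick sketch):

```python
# Odds 1:a when staking 1/d of the bankroll on an event of probability
# 1/p, chosen so the long-run log growth rate is zero:
#   (1 + a/d)^(1/p) * ((d-1)/d)^((p-1)/p) = 1  =>  1 + a/d = (d/(d-1))^(p-1)
def fair_odds(d, p):
    return d * ((d / (d - 1)) ** (p - 1) - 1)

print(fair_odds(2, 2))                 # 2.0: bet half on a 50% event, need 1:2
print(round(fair_odds(10_000, 2), 4))  # 1.0001: tiny stakes recover a = p - 1
```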

Assume again you're betting half your money each round, but take a probability
of winning of 84%, as in the article. You should take that bet at 1:1.14 odds,
much less than the recommended 1:5 odds.

~~~
ikeboy
This has nothing to do with interpreting probability, but with a utility
function that's not linear in terms of wealth. With decreasing marginal
returns to wealth, the same bet becomes less attractive at lower wealth
levels.

Although this can't fully explain risk aversion, see
[https://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.15.1.219](https://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.15.1.219)

------
jefft255
In robotics, particularly in bayesian filtering (KFs and so on), I find the
idea of a "prior" solid and I don't see any frequentist alternatives. Your
prior is easy to understand: whatever your posterior for your state was at the
previous timestep, updated using the actions you wanted your robot to
accomplish. Inference is then refining this prior using the observation that
the robot makes.
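The predict-then-update loop can be sketched in a few lines of 1-D Kalman filtering (illustrative numbers, not any particular robot):

```python
# Minimal 1-D Kalman filter sketch: the prior at each timestep is last
# step's posterior pushed through the commanded motion.
def predict(mean, var, control, process_var):
    # Prior for this timestep: previous posterior + commanded motion.
    return mean + control, var + process_var

def update(mean, var, z, obs_var):
    # Refine the prior with the observation z.
    k = var / (var + obs_var)          # Kalman gain
    return mean + k * (z - mean), (1 - k) * var

mean, var = 0.0, 1.0                   # initial belief, e.g. after reboot
mean, var = predict(mean, var, control=1.0, process_var=0.25)
mean, var = update(mean, var, z=1.2, obs_var=0.25)
print(round(mean, 3), round(var, 3))   # 1.167 0.208
```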

There's nothing hand-wavy about that; if you do bayesian statistics with bad
priors of course you're going to get bad inference. I guess the author just
warns about being careful about your assumption which is always good.

~~~
skybrian
I'm curious what happens when you reboot your robot. What's the first prior?

~~~
dTal
What do you do when you wake up? Assume you're in the same place as when you
went to sleep. You won't be surprised to find yourself on the other side of
the bed - slightly more surprised to find yourself on the floor, and very
surprised to find yourself in another country. A large belief update is always
a bit of a shock.

------
Majromax
> Example abridged: a draw from N(phi,1) for unknown phi is 1. Bayesian
> reasoning with a uniform prior gives an 84% posterior probability that phi >
> 0

I'm not sure I see the problem here? If it's counterintuitive, it's only
because we treat N(0,1) as _the_ normal distribution, so our true prior is
that if we pick a distribution out of a hat we're more likely to have N(0,1)
than anything else.

Suppose I truly know nothing but what is given in the quote. On the basis of
symmetry, I'd have to conclude that P(phi<0) is the same as P(phi>2). If the
blogger had phrased this as "84% posterior probability that phi < 2", I don't
think it would be so surprising.
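For reference, the number checks out directly: with a flat prior and one draw x = 1 from N(phi, 1), the posterior for phi is N(1, 1), so P(phi > 0) = Phi(1), and the symmetry claim follows (a quick stdlib check):

```python
from math import erf, sqrt

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + erf(x / sqrt(2)))

# Flat prior, one draw x = 1 from N(phi, 1): the posterior is N(1, 1).
print(round(norm_cdf(1), 3))      # 0.841 = P(phi > 0)
print(round(1 - norm_cdf(1), 3))  # 0.159 = P(phi < 0) = P(phi > 2)
```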

In fact, the blogger describes this draw as:

> after seeing an observation that is statistically indistinguishable from
> noise.

which to me presupposes a _great deal_ of information about what 'noise' is
supposed to look like.

------
Akababa
I don't know, this seems to be a really low-effort blog post. The given
example is obviously contrived from the unreasonable improper (-\infty,\infty)
prior and the low \sigma^2=1 likelihood. If it was really "pure noise" then
you'd have \sigma^2=\infty which rightly gives you a flat posterior.

For sure Bayesian gives you more flexibility with your assumptions, so it's
easier to shoot yourself in the foot. But when used correctly it can be more
powerful, and often easier to interpret.

~~~
contravariant
Ironically the article that the example is from offers quite a nice rebuttal:

> None of these examples are meant to shoot down Bayes. Indeed, if posterior
> inferences don’t make sense, that’s another way of saying that we have
> external (prior) information that was not included in the model. (“Doesn’t
> make sense” implies some source of knowledge about which claims make sense
> and which don’t.) When things don’t make sense, it’s time to improve the
> model. Bayes is cool with that.

------
roenxi
There is a certain intellectual laziness in this perspective as might be
expected from a short blog post - obviously Bayes' formula is theoretically
sound because it is trivial to deduce and prove.

So we know that if the conclusion is not acceptable then either the method,
the prior or the evidence is not acceptable. Evidence and method can be ruled
out; so the prior was not reasonable.

Basically, he's saying that he doesn't believe the prior is flat. A reasonable
thing to say too - as he says practically speaking if we suspect the
distribution is probably random noise then the prior is we are probably
looking at noise. So in practice the prior is heavily weighted towards 0. It
isn't intellectually honest to use an uninformative prior unless you think the
probability of a process being statistical noise is almost 0.

~~~
6gvONxR4sf7o
> obviously Bayes' formula is theoretically sound because it is trivial to
> deduce and prove.

Quantum mechanics doesn't follow the usual probability rules, so you can't
really say "obviously Bayes' formula is theoretically sound." It certainly
seems like Bayes theorem should apply universally but apparently it doesn't.
Or at least, the jury's still out.

[https://en.wikipedia.org/wiki/Quantum_probability](https://en.wikipedia.org/wiki/Quantum_probability)

------
knzhou
But this isn't actually a criticism of Bayes at all. Yes, the result depends
on your prior. But the result _always_ depends on your preconceptions -- even
in frequentist statistics, where it determines which statistical tests you use
and which hypotheses you test and what p-value cutoff is reasonable. It's
better to have this up front.

Or, you can publish Bayesian update factors, which are prior-independent.
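As a sketch of that idea using the coin example from upthread: for two simple hypotheses the update factor is just the likelihood ratio, and each reader multiplies in whatever prior odds they hold.

```python
# Update factor (Bayes factor) for the coin example upthread: "rigged to
# produce HHTHT" vs. "fair coin", for the observation HHTHT itself.
bf = 1.0 / (1 / 32)   # P(data | rigged) / P(data | fair) = 32.0

# The factor involves no prior; readers supply their own prior odds.
def posterior_odds(prior_odds):
    return prior_odds * bf

print(posterior_odds(1e-12))  # a conspiracy-sized skeptic stays skeptical
```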

------
j7ake
The example should of course ring caution bells but at least in Bayes you can
figure out why your inference is doing unreasonable things by examining each
of your assumptions. In this case it’s the prior that needs fixing.

Are there alternative methods that are better than the Bayes method for this
toy example?

~~~
TTPrograms
Seriously, as soon as he said "flat prior on theta" I had huge alarm bells go
off. Garbage in garbage out.

------
olooney
Just for context, Andrew Gelman is one of the creators of Stan[1], one of the
most popular probabilistic programming platforms for Bayesian inference. He
has written a popular textbook on Bayesian methods, _Bayesian Data Analysis_
[2].

Everyone hates picking priors in Bayesian analysis. If you pick an informative
prior, you can always be criticized for it (in peer review, for a business
decision, etc.) The usual dodge is to use a non-informative prior (like the
Jeffreys prior[3].) I interpret Gelman's point as saying this can also lead to
bad decisions. Thus, Bayesian analysts must thread the needle between Scylla
and Charybdis when picking priors. That's certainly a real pain point when
using Bayesian methods.

However, it's pretty much the same pain point as choosing regularization
parameters (or choosing not to use regularization) when doing frequentist
statistics. For example, sklearn was recently criticized for turning on L2
regularization by default which could be viewed as a violation of the
principle of least surprise, as well as causing practical problems when inputs
are not standardized. But leaving regularization turned off is equivalent to
choosing a non-informative or even improper prior (informally in many cases,
and formally identical for linear regression with normally distributed
errors[4]). So Scylla and Charybdis still loom on either side.
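The linear-regression equivalence is easy to verify numerically (a sketch with made-up data): the ridge closed form and the MAP estimate under a Gaussian prior minimize the same objective, so they coincide.

```python
import numpy as np

# Hypothetical data: the L2 penalty lam * ||w||^2 is the negative log of
# a zero-mean Gaussian prior on the weights (up to constants).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
lam = 1.0

# Frequentist view: ridge regression's closed-form normal equations.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Bayesian view: MAP estimate, found by descending the negative log
# posterior (Gaussian likelihood times Gaussian prior) -- same objective.
w_map = np.zeros(3)
for _ in range(5000):
    grad = X.T @ (X @ w_map - y) + lam * w_map
    w_map -= 1e-3 * grad

print(np.allclose(w_ridge, w_map, atol=1e-8))  # True
```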

 _My_ problem with Bayesian models, completely unrelated to Gelman's
criticism, is that the partition function is usually intractable and really
only amenable to probabilistic methods (MCMC with NUTS[5], for example.) This
makes them computationally expensive to fit, and this in turn makes them
suitable for (relatively) small data sets. But using a lot more data is the
single best way to allow a model to get more accurate while avoiding over-
fitting! That is why I live with the following contradiction: 1) I believe
Bayesian models have better theoretical foundations, and 2) I almost always
use non-Bayesian methods for practical problems.

[1]: [https://mc-stan.org/](https://mc-stan.org/)

[2]: [https://www.amazon.com/Bayesian-Analysis-Chapman-Statistical-Science/dp/1439840954](https://www.amazon.com/Bayesian-Analysis-Chapman-Statistical-Science/dp/1439840954)

[3]:
[https://en.wikipedia.org/wiki/Jeffreys_prior](https://en.wikipedia.org/wiki/Jeffreys_prior)

[4]: [https://stats.stackexchange.com/questions/163388/l2-regularization-is-equivalent-to-gaussian-prior](https://stats.stackexchange.com/questions/163388/l2-regularization-is-equivalent-to-gaussian-prior)

[5]:
[http://www.stat.columbia.edu/~gelman/research/published/nuts...](http://www.stat.columbia.edu/~gelman/research/published/nuts.pdf)

~~~
perl4ever
"Everyone hates picking priors in Bayesian analysis."

Everybody hates searching for their keys in the dark.

------
howlin
Bayesian modeling can be very powerful when it works but it can also be
catastrophic when it fails. It helps to think about this in an adversarial
decision theoretic context where you play a prediction game against an
opponent (usually called Nature).

We can think of the game as discovering the best model to explain a set of
observations. The Bayesian believes that Nature picks the true model that
generated the observations by sampling the prior. This is actually a huge
assumption to make, which is why Bayesian methods work so well when the
assumption is close to the truth.

Frequentists make the assumption that Nature chooses the underlying true model
from a set of possible models. Beyond restricting the set of models Nature can
choose from, frequentists make no further assumptions about the selection
process. This is a strictly weaker assumption than the Bayesian makes, which
means frequentist methods will do better when the specified prior grossly
misrepresents Nature's decision making process.

There are even weaker assumptions that can be made about how Nature chooses
the data. Regret-based model inference allows for a more adversarial game with
Nature where the data may not come from the class of models considered at all.
If Nature truly behaves this way, then Bayesian decision making can
catastrophically fail.

~~~
c2471
This ignores the main strength of a Bayesian workflow. You can
straightforwardly quantify the effect of your prior choice on your inference - pick a
different prior; how much does that change the inference, etc etc. A good
Bayesian workflow does not assume a prior to be true; it should be based on
available evidence, and then stressed. To be a bit more concrete, let's say we
wish to model the height of kangaroos. We come up with a model form, say
regression, and a bunch of potential features. If we are Bayesian we might
say; "I think nature prefers simple stable solutions, so I'll put a N(0,d)
prior on my weights." We then compute a posterior and get a range of credible
values. We can then say, "hey, what if I'm wrong and actually it's a student
t, or it's flat prior or X or y or z", and use principled tools like marginal
likelihood to say which family of models works best, do prior posterior
comparisons to see how observations changed our prior etc etc.

If we do this under a frequentist framework we compute the regression
coefficients, and can get some confidence bounds with some appeal to
asymptotics (and nobody I've ever seen actually makes any attempt to validate
these assumptions). And even when we are done, we get a confidence interval
that has such a truly unintuitive definition that almost every person who is
not a stats PhD fundamentally misinterprets.

To say frequentists make fewer assumptions is not true - they are just less
explicit, and I consider it a strength not a weakness to highlight choices
made by the statistician.

~~~
nazgulnarsil
Right, one should run a sensitivity analysis in general, and your prior is one
of the parameters you definitely check the sensitivity of.
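A minimal sketch of such a sensitivity check, using the article's toy setup (one draw x = 1 from N(phi, 1)) with a hypothetical family of N(0, tau^2) priors:

```python
from math import erf, sqrt

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + erf(x / sqrt(2)))

# One draw x = 1 from N(phi, 1) with a N(0, tau^2) prior: the conjugate
# update gives posterior N(tau^2/(1+tau^2), tau^2/(1+tau^2)).
def p_positive(tau):
    m = tau**2 / (1 + tau**2)          # posterior mean (for x = 1)
    s = sqrt(tau**2 / (1 + tau**2))    # posterior std dev
    return norm_cdf(m / s)

for tau in (0.5, 1.0, 2.0, 10.0):
    print(tau, round(p_positive(tau), 3))
```

The flat-prior answer of about 0.84 is recovered only as tau grows; tight priors around zero pull P(phi > 0) well below it, which quantifies how much the original conclusion leans on the prior.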

~~~
analog31
As a thought experiment, could you choose priors by setting the derivative of
the solution with respect to the priors equal to zero? This would be the case
of minimal sensitivity.

------
selectionbias
My problem with the 'Bayes=rationality' type of argument is that it ignores
context and isn't really a case for reporting Bayesian vs frequentist
estimates. If I am a researcher publishing results then I have an audience who
interpret my results. If my audience is Bayesian and accept my model then all
I need to do is report sufficient statistics and they can make their own
Bayesian inferences given their priors, or better yet, I can just post my
whole dataset. The very reason we need to report things like credible sets or
confidence intervals rather than just sufficient statistics is because
audiences in the real world want summary stats that they can easily interpret
and are transparent. The best approach to inference is one that is the most
useful to audiences, and that depends on context and practicalities rather
than on some underlying philosophy of subjective vs objective probabilities.

------
metasj
Many analyses of the world aren't bayesian /or/ frequentist, they use much
simpler pattern-matching, with feedback loops that update the approach used as
well as the conclusion. Problems start w/ assuming you have to choose one of
those approaches to estimate the future...

------
ummonk
_> Put a flat prior on theta and you end up with an 84% posterior probability
that theta is greater than 0. Step back a bit, and it’s saying that you’ll
offer 5-to-1 odds that theta>0 after seeing an observation that is
statistically indistinguishable from noise. That can’t make sense. Go around
offering 5:1 bets based on pure noise and you’ll go bankrupt real fast._

If you think it's likely to be pure noise, why the hell would you put a flat
prior on it?

Note also that nonflat priors are implicit in significance testing - e.g. p95
significance is similar to putting a 95% prior on the null hypothesis, and p99
significance is similar to putting a 99% prior on the null hypothesis.

------
pontusrehula
To criticize is easy but it feels incomplete if one doesn't provide any clues
of what the supposedly better alternatives would be.

------
mycall
84% isn't that great for predictions compared to DNNs, RNNs or other modern ML
algorithms.

------
gweinberg
The author has a major fundamental misconception as to how probability works.
If I say "the probability that proposition X is true is 0.5", that means that
based on the information available to me right now it's equally likely
to be true as false. That's not even remotely similar to saying I would offer
an even money bet.

~~~
baron_harkonnen
Ignoring the fact that “the author” is one of the most respected statisticians
in the world today... there is no debate on how to translate probabilities
into odds:

odds(x) = p(x)/(1-p(x))

That's the definition of “odds”, so in this case it is quite clear that the
odds for X is 1, implying an even money bet.
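In code form (trivial, but it pins down the convention):

```python
def odds(p):
    # Odds in favor of an event with probability p.
    return p / (1 - p)

print(odds(0.5))             # 1.0 -> an even-money bet
print(round(odds(0.84), 2))  # 5.25 -> roughly the article's 5-to-1
```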

------
sunstone
The human brain is the best Bayesian model builder that evolution has yet
devised. A good place to start assessing its weaknesses is to observe your own
brain messing up. This shouldn't be hard to do.

~~~
madhadron
Why do you think that the human brain is Bayesian?

~~~
c2471
I bring out a coin; I tell you nothing, and ask you to guess what the
probability of heads is. What do you guess?

Unless you have reason to believe I am trying to deceive you, it will be about
50% because you have a lot of knowledge from other contexts that tells you
this is true.

The arrow is probably the other way round than you state - the brain probably
isn't Bayesian; being Bayesian is modelled on how humans process and
contextualise decisions.

I'm not even sure how a frequentist would construct a model to estimate an
outcome with no observations.

~~~
madhadron
> I'm not even sure how a frequentist would construct a model to estimate an
> outcome with no observations.

The same way a Bayesian would, since it's a question about probabilities of
hypothetical experiments, not about statistics. Or you go through decision
theory instead of mucking about with half baked ideologies.

> being Bayesian is modelled on how humans process and contextualise
> decisions.

This is false.

------
kylebenzle
A good post, but here's the TL;DR.

What's wrong with Bayes? Nothing.

~~~
neonate
That is not what the article says.

------
bonoboTP
"Bayesian" is an overloaded term. There's Bayes' theorem/rule, which basically
everyone agrees with, since it's a theorem that's very simple to prove with a
few high school math operations.

Then there is the philosophical Bayesian interpretation of probability, that
claims that probabilities are fundamentally about our own mental state of
belief, as opposed to frequencies at the limit of infinite repetition of some
experiment.

Then there is the Bayesian methods of statistics / machine learning etc, which
are about handling parameters as random variables and the observed data as
fixed, as opposed to assuming that there's one fixed parameter (without a
distribution to talk about) and the data should be modeled as random (from an
oversimplified bird's eye view). And it was also oversold as a miracle cure
for all our problems: for some time, before the deep learning era, you just
_had_ to have "Bayesian" in your ML paper title to make it sexy and
interesting.

Then there is the online Bayesian rationalist community, where Bayes is used
to explain the meaning of life, the universe, it's the great grand explanation
of everything, a self help tool, the key to seeing the light, a semi-religious
experience, the way to enlightenment (they even call it the Way, capitalized -
I guess a Buddhist reference?). As if being Bayesian was this secret club,
that sets you apart from average people, a symbol of belonging to the in-group
etc. [1]

It's important to keep these apart.

[1] For example:
[https://youtu.be/NEqHML98RgU?t=73](https://youtu.be/NEqHML98RgU?t=73) (it's
explicitly not about the math but about self-help and intuition to benefit our
lives etc...)

