
Bayesian vs. frequentist: squabbling among the ignorant - madhadron
http://madhadron.com/posts/2014-08-30-frequentist_and_bayesian_statistics.html
======
bjterry
This article stands in contrast to the thoughts expressed in a recent
interview with George Ellis[1]. The question of Bayesian vs. frequentist is
fundamentally a philosophical question, and it's often discussed in these
terms in the articles about the debate, including the recent one which caused
this post. The philosophy that we accept affects the tools that we use to
solve problems, and the way we think about problems, so you can't simply
ignore the debate. Ellis' argument even extends to his example of ψ (the
quantum wave function), about which there is extensive philosophical debate.

When people identify as supporting Bayesian methods over frequentist methods,
that actually changes the way they perform science. It's an argument against
bad statistical interpretations that frequently arise in the social sciences
and medicine[2] and elsewhere because p-values have become the altar at which
all publishing researchers must prostrate themselves. It's not a case of a
debate without substance, which would have no observable effects.

I can understand the frustrations of the author. Every tool has its use, and
frequentist statistical methods are frequently useful. They help you get
published, they are broadly supported by software packages, they are often
faster and easier to perform, etc., so there are many situations in which they
are the right choice. But I feel like in attempting to be agnostic about
methods, the author is losing nuance in their argument. Bayesian reasoning
really does represent the ground truth (assuming imperfect knowledge), and
even when we are using frequentist methods, we are only doing so because the
benefits outweigh the costs.

1: [http://blogs.scientificamerican.com/cross-
check/2014/08/21/q...](http://blogs.scientificamerican.com/cross-
check/2014/08/21/quantum-gravity-expert-says-philosophical-superficiality-has-
harmed-physics/?-k)

2: [http://www.sciencebasedmedicine.org/prior-probability-the-
di...](http://www.sciencebasedmedicine.org/prior-probability-the-dirty-little-
secret-of-evidence-based-alternative-medicine-2/)

~~~
keithwinstein
Don't think that's quite what the author was getting at. A "frequentist" who
makes a decision based on a p-value or confidence interval, and a "Bayesian"
who makes a decision based on a posterior or predictive probability
distribution, can both be viewed as making a decision according to a procedure
that minimizes some statistic of a loss function.

Both of these broad families of methods are diverse, but to generalize
broadly, decisions informed by "frequentist" tools are often concerned with
the worst-case value of the loss function within some universe, and decisions
based on "Bayesian" methods can be viewed as caring about the expected value
of the loss function given some prior.

Both of those families (and many others) are "the ground truth," in that they
both make statements that can be proved as theorems of mathematics. A
"frequentist" confidence interval will always achieve its guaranteed coverage
no matter the true parameter value, as long as the likelihood function is
correctly specified. A "Bayesian" credibility
interval will include the true value of the parameters at exactly the
advertised rate, when averaged over each possible value of the parameters
weighted according to the prior, and assuming the likelihood function is true.

The author says, and I agree, that the thing worth arguing over is the "norm
we use to choose our optimal procedure" and the nature of the loss function.
Whether you care about controlling the worst case or the expected value or
something else depends on what matters. (The author points out that in an
adversarial situation where your strategy is known to your opponent,
minimizing the worst case may be advisable...)

Some days, we care about the worst-case performance of QuickSort, and some
days we care about its average-case performance (given an assumption that all
input orderings are uniformly likely). It's ok to care about different things
depending on the application; we don't have to split into warring tribes over
it.
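
A quick Python sketch of that QuickSort point (a naive first-element-pivot variant written for this thread, not taken from the comment): on an already-sorted input it performs the full n(n-1)/2 comparisons, while on a random permutation it stays near n log n.

```python
import random

def quicksort(xs, counter):
    """Naive quicksort, first element as pivot; counter[0] tallies comparisons."""
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    counter[0] += len(rest)  # one comparison per remaining element
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    return quicksort(left, counter) + [pivot] + quicksort(right, counter)

n = 300
worst = [0]
quicksort(list(range(n)), worst)  # sorted input: worst case for this pivot choice

random.seed(1)
avg = [0]
quicksort(random.sample(range(n), n), avg)  # a random permutation

print(worst[0], avg[0])  # 44850 vs. a few thousand
```

Same algorithm, two different performance questions; which number matters depends on what your inputs look like.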

More here: [http://blog.keithw.org/2013/02/q-what-is-difference-
between-...](http://blog.keithw.org/2013/02/q-what-is-difference-between-
bayesian.html)

[http://jsteinhardt.wordpress.com/2014/02/10/a-fervent-
defens...](http://jsteinhardt.wordpress.com/2014/02/10/a-fervent-defense-of-
frequentist-statistics/)

[http://cs.stanford.edu/~jsteinhardt/stats-
essay.pdf](http://cs.stanford.edu/~jsteinhardt/stats-essay.pdf)

~~~
jules
Bayesian statistics gives you a posterior distribution. What you do with that
distribution is up to you. If you want to find the decision that minimizes the
maximum loss instead of the expected loss, that fits perfectly well within the
Bayesian framework. The posterior gives you all the information that you need
to make a decision, whatever your loss function.

Frequentist statistics on the other hand gives you no such thing. You can't
make decisions based on frequentist statistics. Frequentist statistics is all
about reasoning about things that _did not happen_. Things that did not happen
are irrelevant for making decisions in a situation where that thing did
happen. So while frequentist statistics is mathematically correct, it's also
strictly speaking useless in practice. It's only useful insofar as it gives us
heuristics for decision making when the Bayesian approach is impractical.

Here's an example. Suppose there are two types of berries: edible and
poisonous. We devise a statistical procedure where you measure some properties
of the berry (let's say we look at the color), and the procedure should help
you decide whether to eat that berry. Now in frequentist statistics, you'll
get a procedure that gives you the correct answer with probability at least p
regardless of what your measurement was. Suppose that p=90% and we observe
that the color is blue, and the procedure says: this berry is edible. Can we
now eat the berry? No! This says absolutely nothing about the probability of
the blue berry being poisonous or not. For example suppose 95% of berries are
edible and red, and 5% of the berries are poisonous and blue. Then a valid
procedure would be one that says that all berries are edible. It's correct
>90% of the time, so yay! But if the berry we are holding in our hands is
blue, it's incorrect 100% of the time. The fact that the procedure would have
given us the right answer if the berry we found was red is _irrelevant_ for
making the decision of whether to eat the berry in the situation where the
berry we found was blue. Things that did not happen are irrelevant for making
decisions in a situation where that thing did happen!
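
The marginal-vs-conditional gap in this example can be checked in a few lines of Python (a sketch written for this thread, using the comment's 95%/5% population and its "declare every berry edible" procedure):

```python
# Population from the example: 95% of berries are edible and red,
# 5% are poisonous and blue.
p_edible_red = 0.95
p_poisonous_blue = 0.05

# Procedure: declare every berry edible.
# Marginal (unconditional) accuracy -- the advertised >90% guarantee:
p_correct = p_edible_red * 1.0 + p_poisonous_blue * 0.0
print(p_correct)  # 0.95, so the guarantee holds

# But conditional on having observed a BLUE berry in your hand:
p_correct_given_blue = 0.0  # every blue berry in this population is poisonous
print(p_correct_given_blue)  # the marginal guarantee says nothing about this
```

The procedure's 95% accuracy is an average over berries you might have picked; it tells you nothing about the blue one you actually did pick.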

tl;dr: frequentist vs bayesian is not about worst case vs average case. It's
about P(measurement | true value) vs P(true value | measurement). The former
is irrelevant for making decisions, the latter is exactly what you want.

~~~
keithwinstein
That's a good example to demonstrate one of the major criticisms of the
frequentist tools.

But there's no free lunch here. We can flip the example around and produce an
example that demonstrates one of the criticisms of the Bayesian tools.

"Suppose there are two villages, Frequentistburg and Bayesianville, harvesting
berries grown in a field between them. There are two types of berries: edible
and poisonous. Suppose 86% of berries in the field are edible and float in
water, 9% are edible and sink in water, 4% are poisonous and float, and 1% are
poisonous and sink. Both towns are interested in devising a decision rule
where a citizen measures some property of the berry (let's say we look at
whether it floats or sinks) and the procedure should let them decide whether
they can eat that berry with at least 80% certainty that it is edible.

"In Bayesianville, the town leaders announce this procedure in the newspaper:
'Take the berry out of the wrapper and see if it floats in water. Given this
observation, calculate the posterior probability that the berry is edible, and
if that number is more than 80%, eat away. If everybody follows this
procedure, on average at most 20% of our town will get poisoned by their morning
berry.' Is this a good decision rule for the town? Not really. In practice,
the town's citizens will end up eating ALL berries. (p(edible|floats) = 86 /
(86 + 4) = 95.6% and p(edible|sinks) = 9 / (9 + 1) = 90%). The town's faraway
enemy subscribes to their newspaper, learns the decision rule, and exploits a
vulnerability: they rearrange the berry crops on the field so that the berries
closest to the Bayesianville harvesters are all the poisonous crops. The next
day, 100% of the citizens will do the experiment, 100% of the citizens will
conclude that they have a <= 10% chance of getting poisoned by their morning
berry, 100% of the citizens will eat that berry, and 100% of the citizens will
get poisoned by it.

"In Frequentistburg, the town leaders announce a different procedure in the
newspaper: 'We have devised a hypothesis test to reject the hypothesis that
your morning berry is poisonous. If the berry sinks, then with p = 0.2, you
can reject the hypothesis that the berry is poisonous. If the berry floats,
then with p = 0.8, you can reject the hypothesis that the berry is poisonous.'
A citizen who uses a tolerance for false positives (mistaken eating) of
alpha=20% will end up eating the berry if and only if it sinks. The
Bayesianvillagers regard this behavior as bizarre: it's the _floating_ berries
that have a higher posterior probability of being edible! But in this
procedure, because of the minimax criterion, there is no similar vulnerability
that can be exploited by an enemy town -- the procedure will preserve 80% of the
citizenry even if all of their morning berries are somehow manipulated to be
the poisonous kind. (Of course, the procedure also ends up discarding 90% of
the berries.)
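
The numbers in the story check out; here is a short Python sketch (written for this thread) that recomputes both towns' rules and the adversarial outcome:

```python
# Joint probabilities from the story.
joint = {("edible", "floats"): 0.86, ("edible", "sinks"): 0.09,
         ("poisonous", "floats"): 0.04, ("poisonous", "sinks"): 0.01}

def p_edible_given(obs):
    """Posterior probability the berry is edible, given float/sink."""
    num = joint[("edible", obs)]
    den = joint[("edible", obs)] + joint[("poisonous", obs)]
    return num / den

print(p_edible_given("floats"))  # 0.86/0.90, about 0.956 -> above 80%, eat
print(p_edible_given("sinks"))   # 0.09/0.10 = 0.900 -> above 80%, eat
# So Bayesianville ends up eating every berry either way.

# Frequentistburg: likelihood of the observation under "berry is poisonous".
p_poisonous = joint[("poisonous", "floats")] + joint[("poisonous", "sinks")]
p_sinks_given_poisonous = joint[("poisonous", "sinks")] / p_poisonous    # 0.2
p_floats_given_poisonous = joint[("poisonous", "floats")] / p_poisonous  # 0.8
# At alpha = 0.2, reject "poisonous" (and eat) only when the berry sinks.

# Adversarial day: every berry handed out is poisonous.
bayes_poisoned = 1.0                     # they eat everything
freq_poisoned = p_sinks_given_poisonous  # only sinking berries get eaten
print(bayes_poisoned, freq_poisoned)     # 1.0 vs 0.2
```

The frequentist guarantee is conditioned on each possible true state, so it survives the enemy's rearrangement of the crops; the Bayesian guarantee is averaged over the prior, which the enemy has invalidated.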

Frequentistburg sees all of Bayesianville's citizens get poisoned by a bad
harvest and replies to your critique: " _BOTH_ towns are caring about 'things
that did not happen.' Here in Frequentistburg, we constructed our hypothesis
test by caring about _observations_ (e.g. float/sink) that did not happen.
Your citizens in Bayesianville calculated their posterior by doing a weighted
average over _values of the parameter_ (e.g. edible/inedible) that did not
happen."

Moved by the painful experience, the neighboring towns met for a joint summit
in a neutral location and explained their desiderata to each other in terms of
the common language of decision theory and then they all lived happily ever
after.

(In my first link above, I show the same basic problems using a uniform prior
among four options.)

~~~
jules
All you've shown here is that if you optimize one loss function (average
number of people dying), you may do badly on another loss function (maximum
number of people dying). Or if you are completely wrong about your prior, then
you may do badly too. It's a classic "garbage in, garbage out" situation. This
reminds me of the Charles Babbage quote:

"On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the
machine wrong figures, will the right answers come out?' I am not able rightly
to apprehend the kind of confusion of ideas that could provoke such a
question."

Furthermore, the frequentist method doesn't do well either: its followers just
aren't eating most of the berries (and the berries that they are eating are the
wrong ones!). If eating berries isn't worth much to you, but dying has a big
negative cost, you should encode that in the Bayesian loss function, and it too
will be conservative about eating berries. I'm very surprised that
you seriously consider a method that lets you eat the more poisonous berries,
simply because they are rarer, a valid criticism of Bayesianism! If you had
given the correct loss function to the Bayesian, he would simply only let 20%
of the people eat berries, but that 20% would be eating the mostly edible
berries, and not the mostly poisonous berries of course.

Saying that Bayesians also care about things that did not happen because one
of edible/inedible is something that did not happen is a bad comparison. It is
unknown whether the berry is edible/inedible, so it makes sense that we
consider both. On the other hand, it is known that the berry is blue, so why
would we care about what if it was red?

------
eemax
Are there really people other than philosophers who are actually "Bayesians"
or "Frequentists"?

Frequentist statistics are just a special case of Bayesian statistics with
certain implicit, built-in priors - if you use those same priors in a Bayesian
analysis, you'll get the same answers.

The frequentist toolbox is just a collection of these useful special cases of
Bayesian statistics with priors that usually make sense in practice. The
advantage is that this greatly simplifies the statistical analysis for many
problems. Sometimes these methods fail, though, when the priors they implicitly
rely on are far off from the actual ones, and so a Bayesian analysis is needed.
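
One standard illustration of this correspondence (my example, not from the comment): for a binomial proportion, a flat Beta(1,1) prior makes the Bayesian posterior mode coincide with the frequentist maximum-likelihood estimate.

```python
from math import isclose

# Observe k heads in n flips of a coin with unknown bias theta.
n, k = 20, 7

# Frequentist MLE: theta_hat = k / n.
mle = k / n

# Bayesian with a flat Beta(1,1) prior: the posterior is Beta(k+1, n-k+1),
# whose mode is (a-1)/(a+b-2) = k/n -- the same number.
a, b = k + 1, n - k + 1
posterior_mode = (a - 1) / (a + b - 2)

assert isclose(mle, posterior_mode)
print(mle, posterior_mode)  # 0.35 0.35
```

With a non-flat prior (say Beta(2,2)), the two estimates diverge, which is exactly the case where the implicit-prior reading of the frequentist answer breaks down.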

Of course, sometimes it's difficult or impossible to use Bayesian methods.

It's analogous to classical vs. quantum/relativistic physics. For many cases,
you can get the right answer to a problem using classical physics. But under
certain conditions, classical physics breaks down, and you must apply quantum
or relativistic physics to get a meaningful answer. On the other hand, for
many problems it would be silly or impossible to use quantum or relativistic
physics because classical is perfectly good.

So you might get into an argument about whether a specific case can be
adequately handled by frequentist statistical methods, or whether a bayesian
analysis can/must be applied.

The philosophical debate about the different approaches to the nature of
probability is just that - a philosophical one, and one that has no real
bearing on the usefulness and correctness of bayesian or frequentist
statistics in practice.

~~~
andrea_s
Statistics researchers in academia definitely express a preference between
frequentist and Bayesian inference. So no, it's not limited to philosophers -
and I'm pretty sure only the most forgiving frequentists would agree with the
concept "frequentist = Bayesian with flat priors". Remember that you cannot
really formulate a flat distribution over an infinite range (one that is
everywhere infinitely thin) in proper mathematical terms: you have to resort to
limits of proper distributions.

While the concept "Bayesian + flat prior = frequentist" is useful to explain a
high level connection between the two worlds, there is a lot more to the topic
- and in my opinion, it's something that can hardly be scratched without a
formal education in the field.

~~~
mjw
The other problem with the "Bayes with flat prior = frequentist maximum
likelihood" idea is that, even if you ignore the issues with improper priors,
the concept of a "flat prior" is inherently dependent on arbitrary choices in
the way a model is parameterised.

It's not possible for a prior to be "flat" with respect to all re-
parameterisations of a continuous parameter in a model. E.g. a flat prior for
the variance isn't flat for its inverse (precision) or its square root (the
std. dev.), and the choice of which of these alternative parameterisations you
use to express the unknown quantity in the model is arbitrary. In the
frequentist case it doesn't affect the result of the inference; in the
Bayesian case it matters which of the parameterisations you choose your prior
to be flat with respect to.
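
A quick Monte Carlo check of that point (a sketch written for this thread, using a variance drawn uniformly from [1, 2]): a prior that is flat in the variance induces a visibly non-flat prior on the precision.

```python
import random

random.seed(0)
# Draw the VARIANCE from a flat prior on [1, 2].
variances = [random.uniform(1.0, 2.0) for _ in range(100_000)]
precisions = [1.0 / v for v in variances]  # precision = 1/variance, in [0.5, 1]

# If the induced prior on the precision were also flat, each half of
# [0.5, 1] would hold ~50% of the samples. Check the lower half:
frac_low = sum(0.5 <= p < 0.75 for p in precisions) / len(precisions)
print(frac_low)  # about 2/3, not 1/2: flat in variance is NOT flat in precision
```

(Analytically, P(precision < 0.75) = P(variance > 4/3) = 2/3 under this flat prior on the variance, so the imbalance is not a sampling fluke.)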

~~~
mjw
If this seems a bit odd (and it did to me at first!) think about it this way:

Bayesian methods work by averaging over a bunch of different models /
different values of the parameters.

What it means to compute a mean depends on the parameterisation in which you
do it: simplest example being that an arithmetic mean is not in general the
same as a geometric mean, or a harmonic mean.
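
The mean-depends-on-parameterisation point in miniature (a Python sketch for this thread): the same three numbers have three different "means", and each is just the arithmetic mean taken after a different change of variables.

```python
from math import prod, log, exp

xs = [1.0, 4.0, 16.0]

arithmetic = sum(xs) / len(xs)               # 7.0
geometric = prod(xs) ** (1 / len(xs))        # ~4.0
harmonic = len(xs) / sum(1 / x for x in xs)  # ~2.29

# The geometric mean is the arithmetic mean computed in log-space:
assert abs(geometric - exp(sum(log(x) for x in xs) / len(xs))) < 1e-9

print(arithmetic, geometric, harmonic)
```

Pick a different parameterisation to average in and you get a different answer, which is exactly the sense in which the prior (the chosen averaging scheme) is unavoidable.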

There's no "neutral" / parameterisation-independent way to specify how this
averaging is done, so if you care about the average case, you're going to have
to commit to doing it in some particular favoured parameterisation. Choosing that
parameterisation is equivalent to choosing the prior.

Frequentist methods avoid the need for this decision; the price they pay is
that without a prior they're unable to condition on the observed data. They
must consider every parameter value and its resulting sampling distribution
separately and can't average over them.

------
bayesianhorse
So if two camps of nerds argue, inevitably someone has to submit a post to
hackernews "You are both wrong! Let me show you why ..."

And maybe this post also rubs me a little wrong because on the one hand I have
an interest in bayesian methods and on the other I will never be able to get a
thorough nit-picking-able math education...

------
judk
At the end of the day, statistics is about using a rationally consistent
process to make decisions under _uncertainty_, but everyone (except the
theoreticians) tries to use it to make _correct_ decisions, conjuring
certainty from the ether, and then railing against anyone who disagrees with
their choice of articles of faith.

In applied statistics we find the bitter religious rivalries of science.

------
Double_Cast
> _You prove your understanding by the type of questions you ask._

I can't find where I paraphrased this from. Maybe it was Better Explained [0]
while explaining the intuition behind complex numbers? As I remember, the
context was "Now you understand that complex numbers are rotations through 2
dimensions, I bet a lot of you are asking 'can we extend math to rotations
through _3_ dimensions?'" (aka quaternions)

The vibe I get from this article is "the meaning isn't important, so plug and
chug away". No no no no no! Grokking the meaning is important because it
allows us to extend our understanding from a strong foundation. E.g. HN a few
days ago submitted a paper on "Half Coins"[1] which extends our conventional
notion of _probability_ to _negative numbers_. And after I read the paper, I
realized it's not as weird as it sounds.

[0]
[http://betterexplained.com/cheatsheet/](http://betterexplained.com/cheatsheet/)

[1]
[https://news.ycombinator.com/item?id=8187457](https://news.ycombinator.com/item?id=8187457)

disclaimer: I don't know anything about statistics.

------
papaf
The book mentioned in the discussion looks interesting but super expensive:

[http://www.amazon.com/Introduction-Statistical-Inference-
Spr...](http://www.amazon.com/Introduction-Statistical-Inference-Springer-
Statistics/dp/1461395801)

Has anyone read it?

~~~
raverbashing
Try your nearest university classifieds or used book store

------
song
I always think that having a better understanding of statistics and the
mathematics associated with Bayesian calculations would be useful but I don't
know where to start.

What would be a good book for me to learn?

~~~
tmoertel
_Probability Theory: The Logic of Science_ , by E.T. Jaynes, is a highly
readable definitive reference that builds the theory from the ground up. The
first three chapters are all you need for your purposes and are available
online for free (link below), but the entire book is wonderful and well worth
reading if you are serious about the subject.

[http://bayes.wustl.edu/etj/prob/book.pdf](http://bayes.wustl.edu/etj/prob/book.pdf)

------
jmpeax
...and yet another ignoramus who doesn't understand the debate and so blankets
it with the predictably safe "you're both wrong".

Almost stopped reading at the hilariously rubbish statement "All the models
have limitations which make them of useless in practice." but it's Sunday, and
I'm being entertained.

~~~
andrea_s
I think "of" meant to stand for "often".

Which is still a very strong position, but not entirely off the mark - it's
quite common (not only in statistics, of course) to encounter situations where
not every prerequisite of a certain methodology is met, and yet the obtained
results are usable in a practical environment.

