Bayesian vs. frequentist: squabbling among the ignorant (madhadron.com)
110 points by madhadron on Aug 31, 2014 | 38 comments

This article stands in contrast to the thoughts expressed in a recent interview with George Ellis[1]. The question of Bayesian vs. frequentist is fundamentally a philosophical question, and it's often discussed in these terms in the articles about the debate, including the recent one which caused this post. The philosophy that we accept affects the tools that we use to solve problems, and the way we think about problems, so you can't simply ignore the debate. Ellis' argument even extends to his example, about ψ (the quantum wave function), about which there is extensive philosophical debate.

When people identify as supporting Bayesian methods over frequentist methods, that actually changes the way they perform science. It's an argument against bad statistical interpretations that frequently arise in the social sciences and medicine[2] and elsewhere because p-values have become the altar at which all publishing researchers must prostrate themselves. It's not a case of a debate without substance, which would have no observable effects.

I can understand the frustrations of the author. Every tool has its use, and frequentist statistical methods are frequently useful. They help you get published, they are broadly supported by software packages, they are often faster and easier to perform, etc., so there are many situations in which they are the right choice. But I feel like in attempting to be agnostic about methods, the author is losing nuance in their argument. Bayesian reasoning really does represent the ground truth (assuming imperfect knowledge), and even when we are using frequentist methods, we are only doing so because the benefits outweigh the costs.

1: http://blogs.scientificamerican.com/cross-check/2014/08/21/q...

2: http://www.sciencebasedmedicine.org/prior-probability-the-di...

Don't think that's quite what the author was getting at. A "frequentist" who makes a decision based on a p-value or confidence interval, and a "Bayesian" who makes a decision based on a posterior or predictive probability distribution, can both be viewed as making a decision according to a procedure that minimizes some statistic of a loss function.

Both of these broad families of methods are diverse, but to generalize broadly, decisions informed by "frequentist" tools are often concerned with the worst-case value of the loss function within some universe, and decisions based on "Bayesian" methods can be viewed as caring about the expected value of the loss function given some prior.

Both of those families (and many others) are "the ground truth," in that they both make statements that can be proved as theorems of mathematics. A "frequentist" confidence interval will always achieve its guaranteed coverage no matter what, if the likelihood function is true. A "Bayesian" credibility interval will include the true value of the parameters at exactly the advertised rate, when averaged over each possible value of the parameters weighted according to the prior, and assuming the likelihood function is true.

The author says, and I agree, that the thing worth arguing over is the "norm we use to choose our optimal procedure" and the nature of the loss function. Whether you care about controlling the worst case or the expected value or something else depends on what matters. (The author points out that in an adversarial situation where your strategy is known to your opponent, minimizing the worst case may be advisable...)

Some days, we care about the worst-case performance of QuickSort, and some days we care about its average-case performance (given an assumption that all input orderings are uniformly likely). It's ok to care about different things depending on the application; we don't have to split into warring tribes over it.

More here: http://blog.keithw.org/2013/02/q-what-is-difference-between-...
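To make the QuickSort analogy concrete, here is a minimal sketch that counts comparisons for a naive quicksort. The first-element pivot, the input size and the number of random trials are illustrative assumptions, not anything specified in the comment above.

```python
import random

def quicksort_comparisons(items):
    """Count comparisons made by a naive quicksort that pivots on the first element."""
    if len(items) <= 1:
        return 0
    pivot, rest = items[0], items[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    # Each non-pivot element is counted once per partition step.
    return len(rest) + quicksort_comparisons(smaller) + quicksort_comparisons(larger)

n = 200

# Worst case for this pivot rule: input that is already sorted.
worst_case = quicksort_comparisons(list(range(n)))

# Average case under the assumption that all orderings are equally likely.
rng = random.Random(0)
trials = []
for _ in range(200):
    data = list(range(n))
    rng.shuffle(data)
    trials.append(quicksort_comparisons(data))
average_case = sum(trials) / len(trials)

print("worst case:", worst_case)
print("average case:", average_case)
```

The same algorithm gets two very different summaries depending on whether you report the maximum or the mean over inputs, which is the sense in which the choice of norm, not the tool, is the thing to argue about.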



Bayesian statistics gives you a posterior distribution. What you do with that distribution is up to you. If you want to find the decision that minimizes the maximum loss instead of the expected loss, that fits perfectly well within the Bayesian framework. The posterior gives you all the information that you need to make a decision, whatever your loss function.

Frequentist statistics on the other hand gives you no such thing. You can't make decisions based on frequentist statistics. Frequentist statistics is all about reasoning about things that did not happen. Things that did not happen are irrelevant for making decisions in a situation where that thing did happen. So while frequentist statistics is mathematically correct, it's also strictly speaking useless in practice. It's only useful insofar as it gives us heuristics for decision making when the Bayesian approach is impractical.

Here's an example. Suppose there are two types of berries: edible and poisonous. We devise a statistical procedure where you measure some properties of the berry (let's say we look at the color), and the procedure should help you decide whether to eat that berry. Now in frequentist statistics, you'll get a procedure that gives you the correct answer with probability at least p regardless of what your measurement was. Suppose that p=90% and we observe that the color is blue, and the procedure says: this berry is edible. Can we now eat the berry? No! This says absolutely nothing about the probability of the blue berry being poisonous or not. For example suppose 95% of berries are edible and red, and 5% of the berries are poisonous and blue. Then a valid procedure would be one that says that all berries are edible. It's correct >90% of the time, so yay! But if the berry we are holding in our hands is blue, it's incorrect 100% of the time. The fact that the procedure would have given us the right answer if the berry we found was red is irrelevant for making the decision of whether to eat the berry in the situation where the berry we found was blue. Things that did not happen are irrelevant for making decisions in a situation where that thing did happen!

tl;dr: frequentist vs bayesian is not about worst case vs average case. It's about P(measurement | true value) vs P(true value | measurement). The former is irrelevant for making decisions, the latter is exactly what you want.
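The arithmetic in the berry example can be checked with a few lines. This is only a sketch of the numbers quoted in the comment above (95% edible and red, 5% poisonous and blue); the dictionary layout and function names are made up for illustration.

```python
# Joint distribution from the example: (colour, kind) -> probability.
joint = {
    ("red", "edible"): 0.95,
    ("blue", "poisonous"): 0.05,
}

def always_edible(colour):
    """The procedure from the example: declare every berry edible."""
    return "edible"

# How often the procedure is right, averaged over all berries.
overall_accuracy = sum(
    p for (colour, kind), p in joint.items() if always_edible(colour) == kind
)

# What matters once a blue berry is in hand: P(poisonous | blue).
p_blue = sum(p for (colour, _), p in joint.items() if colour == "blue")
p_poisonous_given_blue = joint.get(("blue", "poisonous"), 0.0) / p_blue

print("overall accuracy:", overall_accuracy)
print("P(poisonous | blue):", p_poisonous_given_blue)
```

The first number is a statement about the procedure across all berries; the second conditions on the observation actually made, which is the distinction the tl;dr draws.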

This isn't just about a difference in the choice of loss function to optimise. It's a difference in what sort of guarantees you seek about that loss function.

Bayesian analysis seeks an estimator which minimises posterior expected loss, conditioning on the data and with the expectation taken over the parameters under a particular prior.

A frequentist analysis might seek an estimator for which uniform bounds on the worst-case expected loss are available, which hold in expectation over the data, given any value of the parameters.

Both approaches fit into a decision theoretic framework and there are good reasons why you might care about frequentist properties when making decisions. I agree that this isn't only about average case vs worst case -- as you point out it's also about whether you take expectations over data given params or over the params given data, and that's important too. But I think the average case vs worst case aspect of this is an important part of what this is all about and gets to the heart of what the trade-offs are when choosing between these methods.

I disagree that the sampling distribution is "irrelevant for making decisions", that's quite an extreme view which I don't think many applied Bayesian statisticians would take. Frequentist properties are something people often validly care about when deciding on a statistical procedure to use in an experimental design context, i.e. before collecting the data -- and especially if you're choosing an estimator which you intend to use many times for many experiments, even if they're not all exact replicates of each other.

That's a good example to demonstrate one of the major criticisms of the frequentist tools.

But there's no free lunch here. We can flip the example around and produce an example that demonstrates one of the criticisms of the Bayesian tools.

"Suppose there are two villages, Frequentistburg and Bayesianville, harvesting berries grown in a field between them. There are two types of berries: edible and poisonous. Suppose 86% of berries in the field are edible and float in water, 9% are edible and sink in water, 4% are poisonous and float, and 1% are poisonous and sink. Both towns are interested in devising a decision rule where a citizen measures some property of the berry (let's say we look at whether it floats or sinks) and the procedure should let them decide whether they can eat that berry with at least 80% certainty that it is edible.

"In Bayesianville, the town leaders announce this procedure in the newspaper: 'Take the berry out of the wrapper and see if it floats in water. Given this observation, calculate the posterior probability that the berry is edible, and if that number is more than 80%, eat away. If everybody follows this procedure, on average only 20% of our town will get poisoned by their morning berry.' Is this a good decision rule for the town? Not really. In practice, the town's citizens will end up eating ALL berries. (p(edible|floats) = 86 / (86 + 4) = 95.6% and p(edible|sinks) = 9 / (9 + 1) = 90%). The town's faraway enemy subscribes to their newspaper, learns the decision rule, and exploits a vulnerability: they rearrange the berry crops on the field so that the berries closest to the Bayesianville harvesters are all the poisonous crops. The next day, 100% of the citizens will do the experiment, 100% of the citizens will conclude that they have a <= 10% chance of getting poisoned by their morning berry, 100% of the citizens will eat that berry, and 100% of the citizens will get poisoned by it.

"In Frequentistburg, the town leaders announce a different procedure in the newspaper: 'We have devised a hypothesis test to reject the hypothesis that your morning berry is poisonous. If the berry sinks, then with p = 0.2, you can reject the hypothesis that the berry is poisonous. If the berry floats, then with p = 0.8, you can reject the hypothesis that the berry is poisonous.' A citizen who uses a tolerance for false positive (mistaken eating) of alpha=20% will end up eating the berry if and only if it sinks. The Bayesianvillagers regard this behavior as bizarre: it's the floating berries that have a higher posterior probability of being edible! But in this procedure, because of the minimax criterion, there is no similar vulnerability that can be exploited by an enemy town -- the procedure will preserve 80% of the citizenry even if all of their morning berries are somehow manipulated to be the poisonous kind. (Of course, the procedure also ends up discarding 90% of the berries.)"

Frequentistburg sees all of Bayesianville's citizens get poisoned by a bad harvest and replies to your critique: "BOTH towns are caring about 'things that did not happen.' Here in Frequentistburg, we constructed our hypothesis test by caring about observations (e.g. float/sink) that did not happen. Your citizens in Bayesianville calculated their posterior by doing a weighted average over values of the parameter (e.g. edible/inedible) that did not happen."

Moved by the painful experience, the neighboring towns met for a joint summit in a neutral location and explained their desiderata to each other in terms of the common language of decision theory and then they all lived happily ever after.

(In my first link above, I show the same basic problems using a uniform prior among four options.)
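For anyone who wants to check the numbers in the two-village story, here is a small sketch that reproduces both decision rules from the stated joint distribution. It follows the story's own definition of the p-value, P(observation | berry is poisonous); the variable and function names are illustrative.

```python
# Joint shares (in percent) from the story: (observation, kind) -> share of berries.
joint = {
    ("floats", "edible"): 86,
    ("sinks", "edible"): 9,
    ("floats", "poisonous"): 4,
    ("sinks", "poisonous"): 1,
}
observations = ("floats", "sinks")

def posterior_edible(obs):
    """Bayesianville: P(edible | observation)."""
    total = sum(p for (o, _), p in joint.items() if o == obs)
    return joint[(obs, "edible")] / total

def p_value(obs):
    """Frequentistburg: P(observation | poisonous), with 'poisonous' as the null."""
    total_poisonous = sum(p for (_, k), p in joint.items() if k == "poisonous")
    return joint[(obs, "poisonous")] / total_poisonous

alpha = 0.20
bayes_eats = {obs: posterior_edible(obs) > 0.80 for obs in observations}
freq_eats = {obs: p_value(obs) <= alpha for obs in observations}

for obs in observations:
    print(obs,
          "posterior edible:", round(posterior_edible(obs), 3),
          "p-value:", p_value(obs),
          "Bayesianville eats:", bayes_eats[obs],
          "Frequentistburg eats:", freq_eats[obs])

# Share of poisonous berries that pass the Frequentistburg test (the worst-case loss).
worst_case_poisoned = p_value("sinks")
print("worst-case share poisoned in Frequentistburg:", worst_case_poisoned)
```

Bayesianville eats both kinds because both posteriors exceed 80%; Frequentistburg eats only sinkers, which bounds the damage at 20% even if every berry is poisonous.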

All you've shown here is that if you optimize one loss function (average number of people dying), you may do badly on another loss function (maximum number of people dying). Or if you are completely wrong about your prior, then you may do badly too. It's a classic "garbage in, garbage out" situation. This reminds me of the Charles Babbage quote:

"On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."

Furthermore, the frequentist method doesn't do well either, they just aren't eating most of the berries (and the berries that they are eating, are the wrong ones!). If apparently eating berries isn't worth much to you, but dying has a big negative cost, you should give that to the Bayesian loss function, and it too will be conservative about eating berries. I'm very surprised that you seriously consider a method that lets you eat the more poisonous berries, simply because they are rarer, a valid criticism of Bayesianism! If you had given the correct loss function to the Bayesian, he would simply only let 20% of the people eat berries, but that 20% would be eating the mostly edible berries, and not the mostly poisonous berries of course.

Saying that Bayesians also care about things that did not happen because one of edible/inedible is something that did not happen is a bad comparison. It is unknown whether the berry is edible/inedible, so it makes sense that we consider both. On the other hand, it is known that the berry is blue, so why would we care about what if it was red?

Where do the rejection rules used in Frequentistburg come from? And what prevents the faraway enemies (from Machine Learning City?) from selecting the poisonous/sinking kind to kill all the Frequentistburg population as they did in Bayesianville?

Everything that you've written is technically correct. But it seems like in the large majority of situations where people are using/publishing statistics, the expected value is what we care about but we are using frequentist methods. So if we are going to correct the "norm we use to choose our optimal procedure," it should imply a many-fold increase in uses of Bayesian methods. A huge amount of published papers that include any statistics are reporting means and p-values in very simple statistical analyses (essentially all papers published in the social sciences, health sciences, etc.). In many of these situations researchers could provide subjective priors based on the state of the research and knowledge beforehand and it would be philosophically valid and beneficial.

Because of this mismatch, people who support Bayesian methods perhaps overstate their case. Maybe this is for the same reason that partisan political actors argue fervently for a single side of an issue, ignoring nuance. Nuance historically has not led to social change, and fixing statistics in science requires changing people's minds and "raising awareness."

I hear you, and my understanding is that this largely depends on the discipline.

In fields like medicine and psychology, the flourishing of classical methods in the 30s and 40s and 50s did lead to a sort of dogmatism about p-values and a disdain for the old "inverse probability" (what we now call Bayesian methods) of the 18th and 19th centuries. These fields seem to be still recovering from this. My impression is that that's where you often find self-styled Bayesian rebels with the faith of the converted.

In areas like radar or image processing or communications or information theory or ad placement or AI in general, I think the field has long had a more nuanced understanding of the underlying decision-theoretic concerns. When you have to win World War II and there is a cost to falsely identifying a Nazi aircraft as Allied, and a cost to falsely identifying an Allied aircraft as Nazi, you develop a notion of an ROC curve pretty quickly. Ditto when you want to talk to a Voyager probe and you're not sure if it just sent a 0 or a 1.

In my view, the Bayesian vs. frequentist "debate" has little to say to these fields. Although for a contrary view, see what Jaynes writes about Shannon in "Probability Theory: The Logic of Science" in the last chapter ("Introduction to communication theory").

Do Bayesian methods imply a choice of loss function? I think it's a framework for observational statistics, not for policy making or decision theory. You can use it as a tool during policy or decision making, where you should indeed have a debate about picking a loss function that corresponds to your goals.

If you're reporting point or interval estimates (rather than the entire posterior) then you are implicitly or explicitly optimising some kind of loss function.

Also, worth a reminder that continuous estimation problems can be viewed as decision problems too, it's not just about discrete decision-making.

I don't think it's great to take the view that, because I'm not making a decision based on this estimate myself, I don't need to worry about which loss function I'm implicitly optimising for when choosing an estimator. Someone else may need to.
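A quick sketch of the point about point estimates: given the same posterior, different loss functions pick out different summaries. The discrete posterior below is entirely made up for illustration, and the grid search is just the simplest way to find the minimiser.

```python
# A hypothetical skewed posterior over a parameter (values and weights are invented).
values = [0, 1, 2, 3, 10]
weights = [0.1, 0.3, 0.3, 0.2, 0.1]

def expected_loss(estimate, loss):
    """Posterior expected loss of reporting `estimate`."""
    return sum(w * loss(estimate, v) for v, w in zip(values, weights))

def squared(a, v):
    return (a - v) ** 2

def absolute(a, v):
    return abs(a - v)

# Candidate point estimates on a fine grid from 0 to 10.
grid = [i / 100 for i in range(0, 1001)]

best_squared = min(grid, key=lambda e: expected_loss(e, squared))
best_absolute = min(grid, key=lambda e: expected_loss(e, absolute))
posterior_mean = sum(v * w for v, w in zip(values, weights))

print("minimises squared loss:", best_squared)
print("minimises absolute loss:", best_absolute)
print("posterior mean:", posterior_mean)
```

Squared loss recovers the posterior mean and absolute loss recovers the posterior median, so reporting either one is already a choice of loss function.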

What is not what the author was getting at? It's clear from your comments here--continually scare quoting "frequentist" and "Bayesian"--and your answer on Quora, that you're contemptuous of this debate, like the author. OP is saying that there's more to frequentist vs Bayesian than simply picking the right tool for the job, which you are both suggesting. It won't be settled in this comments section.

As a bayesianist I would like to defend Frequentism here. p-value hacking IS NOT "Frequentist"; it is malpractice that has its roots in lack of education (copy-paste-analyses), publication pressure and using the p-value as a strong metric for judgement of results (which it was never intended to be).

Frequentism is a toolset of best practices and experiences about statistics, a KISS approach to statistics so to speak.

Bayesianism is a beautiful approach to probabilities and statistics, yet it often turns out to be really complex when performing calculations. It is also not fool-proof, and it requires experience to be applied correctly.

If you are restricting your tools on the grounds of philosophy you subscribe to, you are doing it wrong. Ditch the philosophy, stick to math, try everything and go with what results in models of predictive utility. That's what actual scientists do.

Incorrect. The philosophers know 'math'. The half-mathematicians that think they can escape philosophy should read.... well anything. Almost all grand-standing that people have about 'philosophy' is some weird fictional idea that people have in their mind.

Notice how the recipients of the Lakatos Award are people like Judea Pearl.

Here's a quote (Terence Tao, definitely a mathematician):


"probability theory is only “allowed” to study concepts and perform operations which are preserved with respect to extension of the underlying sample space."

Any attempt to side-step these issues leads to people wandering off into the middle of nowhere: misuse of mathematical tools in areas where they do not apply, and all sorts of other major and minor errors that could have been avoided. There is no computer science without Godel, Frege, Russell, and Turing.

How do you decide which models actually have predictive utility, under real world incomplete information? Back to square one.

How do you decide if the water in your bathtub is warm? Stick your finger into it and see for yourself.

Similarly here, you use your models to predict something, check if they predicted it right, and either try to fix them or ditch them completely.

... and test the prediction.

That is implied in a model being of predictive utility.

Are there really people other than philosophers who are actually "Bayesians" or "Frequentists"?

Frequentist statistics are just a special case of Bayesian statistics with certain implicit, built-in priors - if you pick these priors using Bayesian statistics you'll get the same answers.

The frequentist toolbox is just a collection of these useful special cases of Bayesian statistics with priors that usually make sense in practice. The advantage is this greatly simplifies the statistical analysis for many problems. Sometimes these methods fail though, when the priors they rely on implicitly are far off from the actual situation, and so a Bayesian analysis is needed.

Of course, sometimes it's difficult or impossible to use Bayesian methods.

It's analogous to classical vs. quantum/relativistic physics. For many cases, you can get the right answer to a problem using classical physics. But under certain conditions, classical physics breaks down, and you must apply quantum or relativistic physics to get a meaningful answer. On the other hand, for many problems it would be silly or impossible to use quantum or relativistic physics because classical is perfectly good.

So you might get into an argument about whether a specific case can be adequately handled by frequentist statistical methods, or whether a Bayesian analysis can/must be applied.

The philosophical debate about the different approaches to the nature of probability is just that - a philosophical one, and one that has no real bearing on the usefulness and correctness of bayesian or frequentist statistics in practice.

I'm afraid you're mistaken -- frequentist statistics are not a special case of Bayesian statistics, and there is generally no "implicit, built-in prior" that allows you to get frequentist worst-case guarantees from Bayesian techniques.

Consider the case of a confidence interval on a binary proportion given a finite number of samples.

A frequentist method will produce an interval that includes the true value of the proportion with at least x% probability, even in the worst case, for any proportion between 0 and 1. (E.g. the Blyth-Still-Casella method or the Clopper-Pearson method.)

An x% credible interval will include the true value exactly x% of the time, averaged over all values of the proportion weighted according to the prior. This will not provide the same guarantee (much less the same interval extent!), no matter what prior is used.

Which is not to say the credible interval is bad or inappropriate. It's just not the same kind of tool. It is optimizing a different penalty function of non-inclusion.

(Another example where the "Bayesian" technique is not quite as conservative as necessary to achieve a frequentist-style guarantee, even with a uniform prior: http://www.quora.com/I-have-burned-200-disks-and-I-want-to-m...)
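The coverage claim can be checked numerically. The sketch below compares the exact coverage of a Clopper-Pearson interval with that of an equal-tailed credible interval; the uniform prior, the equal-tailed construction, n = 20 and the 95% level are all assumptions chosen for illustration, and the quantiles are found by plain bisection to keep it standard-library only.

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); increasing in p."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

def solve_increasing(f, target, iters=60):
    """Bisection on [0, 1] for an increasing f; returns a bracket (lo, hi) around the root."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return lo, hi

def clopper_pearson(k, n, alpha):
    # Lower end solves P(X >= k | p) = alpha/2. Upper end solves P(X <= k | p) = alpha/2,
    # i.e. P(X >= k+1 | p) = 1 - alpha/2. Outer bracket ends keep the interval conservative.
    lower = 0.0 if k == 0 else solve_increasing(lambda p: upper_tail(k, n, p), alpha / 2)[0]
    upper = 1.0 if k == n else solve_increasing(lambda p: upper_tail(k + 1, n, p), 1 - alpha / 2)[1]
    return lower, upper

def credible_uniform_prior(k, n, alpha):
    # Posterior is Beta(k+1, n-k+1). For integer parameters its CDF at x equals
    # P(Binomial(n+1, x) >= k+1), so the same bisection finds the quantiles.
    def cdf(x):
        return upper_tail(k + 1, n + 1, x)
    return solve_increasing(cdf, alpha / 2)[0], solve_increasing(cdf, 1 - alpha / 2)[1]

n, alpha = 20, 0.05
grid = [i / 1000 for i in range(1, 1000)]  # true proportions to test

def coverage_curve(interval_fn):
    """Exact coverage at each true proportion in `grid`."""
    intervals = [interval_fn(k, n, alpha) for k in range(n + 1)]
    return [
        sum(binom_pmf(k, n, p) for k, (lo, hi) in enumerate(intervals) if lo <= p <= hi)
        for p in grid
    ]

cp = coverage_curve(clopper_pearson)
cred = coverage_curve(credible_uniform_prior)

print("Clopper-Pearson worst-case coverage:", min(cp))
print("credible interval worst-case coverage:", min(cred))
print("credible interval coverage averaged over the grid:", sum(cred) / len(cred))
```

Under these assumptions the frequentist interval never drops below its nominal level at any proportion, while the credible interval hits its level only on average over the prior and falls well short for some individual proportions.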

Statistics researchers in academia definitely express a preference between frequentist and Bayesian inference. So no, it's not limited to philosophers - and I'm pretty sure all but the most forgiving frequentists would not agree with the concept "frequentist = Bayesian with flat priors". Remember that you cannot really formulate a flat infinite (and infinitely thin) distribution in mathematical terms: you need to resort to limit calculation.

While the concept "Bayesian + flat prior = frequentist" is useful to explain a high level connection between the two worlds, there is a lot more to the topic - and in my opinion, it's something that can hardly be scratched without a formal education in the field.

The other problem with the "Bayes with flat prior = frequentist maximum likelihood" idea is that, even if you ignore the issues with improper priors, the concept of a "flat prior" is inherently dependent on arbitrary choices in the way a model is parameterised.

It's not possible for a prior to be "flat" with respect to all re-parameterisations of a continuous parameter in a model. E.g. a flat prior for the variance isn't flat for its inverse (precision) or its square root (the std. dev.), and the choice of which of these alternative parameterisations you use to express the unknown quantity in the model is arbitrary. In the frequentist case it doesn't affect the result of the inference; in the Bayesian case it matters which of the parameterisations you choose your prior to be flat with respect to.

If this seems a bit odd (and it did to me at first!) think about it this way:

Bayesian methods work by averaging over a bunch of different models / different values of the parameters.

What it means to compute a mean depends on the parameterisation in which you do it: simplest example being that an arithmetic mean is not in general the same as a geometric mean, or a harmonic mean.

There's no "neutral" / parameterisation-independent way to specify how this averaging is done, so if you care about the average case, you're going to have to commit to doing it in some particular favoured parameterisation. Choosing that parameterisation is equivalent to choosing the prior.

Frequentist methods avoid the need for this decision; the price they pay is that without a prior they're unable to condition on the observed data. They must consider every parameter value and its resulting sampling distribution separately and can't average over them.
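The reparameterisation point is easy to see by simulation. This is a minimal sketch assuming a prior that is flat on the variance over (0, 1); the range, sample size and seed are arbitrary illustrative choices.

```python
import random

rng = random.Random(0)

# Assumption for illustration: a prior that is flat on the variance over (0, 1).
variances = [rng.uniform(0.0, 1.0) for _ in range(100_000)]

# The same draws expressed as standard deviations.
std_devs = [v ** 0.5 for v in variances]

# A prior that is flat on (0, 1) puts half of its mass below 0.5.
share_var_below_half = sum(v < 0.5 for v in variances) / len(variances)
share_sd_below_half = sum(s < 0.5 for s in std_devs) / len(std_devs)

print("share of variance draws below 0.5:", share_var_below_half)
print("share of std. dev. draws below 0.5:", share_sd_below_half)
```

The variance draws split roughly evenly around 0.5, but only about a quarter of the implied standard deviations fall below 0.5, so the prior that is flat in one parameterisation is informative in the other.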

Why is "flat prior" here? A frequentist method assumes a certain prior, for example Normal.

Maybe you mean that a frequentist method assumes a certain model, for example normal. A frequentist method will sometimes give similar results to a Bayesian analysis that uses a non-informative prior. For example, if we want to estimate a value from repeated measurements with Gaussian noise the frequentist result is equivalent to the Bayesian result if a "flat" (improper) prior is used. [http://en.wikipedia.org/wiki/Jeffreys_prior#Gaussian_distrib...]
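Here is a small sketch of the Gaussian case, treating the improper flat prior as the limit of an ever wider conjugate normal prior. The measurements, the known noise level and the prior settings are hypothetical values picked for illustration.

```python
from statistics import NormalDist, mean

# Hypothetical repeated measurements with a known noise standard deviation.
data = [9.8, 10.1, 10.4, 9.9, 10.3]
sigma = 0.5
n = len(data)
xbar = mean(data)
z = NormalDist().inv_cdf(0.975)  # two-sided 95% quantile

# Frequentist 95% confidence interval for the mean (known sigma).
half_width = z * sigma / n**0.5
confidence = (xbar - half_width, xbar + half_width)

def credible_interval(prior_mean, prior_sd):
    """95% credible interval under a conjugate Normal(prior_mean, prior_sd^2) prior."""
    precision = 1 / prior_sd**2 + n / sigma**2
    post_mean = (prior_mean / prior_sd**2 + n * xbar / sigma**2) / precision
    post_sd = precision**-0.5
    return (post_mean - z * post_sd, post_mean + z * post_sd)

print("confidence interval:", confidence)
for prior_sd in (1.0, 10.0, 1e6):
    print("prior sd", prior_sd, "credible interval:", credible_interval(0.0, prior_sd))
```

As the prior gets wider the credible interval converges to the confidence interval, while a tight prior centred elsewhere gives a visibly different answer.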

This is true in some cases: maximum likelihood estimation often corresponds to MAP estimation with a certain prior (though it can be argued whether MAP estimation is truly a Bayesian concept, or just a way to transplant MLE to a Bayesian setting by using a very weird loss function). But many frequentist concepts have no Bayesian counterpart such that if you choose a particular prior you get the frequentist concept.
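As a concrete instance of the MLE/MAP correspondence, consider a binomial proportion with a Beta prior. This is only a sketch: the Beta family and the counts are assumptions for illustration, and the closed-form mode is valid only when it lies strictly inside (0, 1).

```python
def mle(k, n):
    """Maximum-likelihood estimate of a binomial proportion."""
    return k / n

def map_estimate(k, n, a, b):
    """Posterior mode under a Beta(a, b) prior (assumes the mode is interior)."""
    return (k + a - 1) / (n + a + b - 2)

k, n = 7, 10  # hypothetical data: 7 successes in 10 trials

print("MLE:", mle(k, n))
print("MAP, uniform Beta(1, 1) prior:", map_estimate(k, n, 1, 1))
print("MAP, informative Beta(5, 5) prior:", map_estimate(k, n, 5, 5))
```

With the uniform prior the two estimates coincide; with the informative prior the MAP estimate is pulled towards 0.5. This covers only the point-estimate correspondence mentioned above, not frequentist concepts that lack a Bayesian counterpart.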

So if two camps of nerds argue, inevitably someone has to submit a post to hackernews "You are both wrong! Let me show you why ..."

And maybe this post also rubs me a little wrong because on the one hand I have an interest in bayesian methods and on the other I will never be able to get a thorough nit-picking-able math education...

At the end of the day, statistics is about using a rationally consistent process for making decisions under uncertainty, but everyone (except the theoreticians) tries to use it to make correct decisions, conjuring certainty from the ether, and then railing against anyone who disagrees with their choice of articles of faith.

In applied statistics we find the bitter religious rivalries of science.

> You prove your understanding by the type of questions you ask.

I can't find where I paraphrased this from. Maybe it was Better Explained [0] while explaining the intuition behind complex numbers? As I remember, the context was "Now you understand that complex numbers are rotations through 2 dimensions, I bet a lot of you are asking 'can we extend math to rotations through 3 dimensions?'" (aka quaternions)

The vibe I get from this article is "the meaning isn't important, so plug and chug away". No no no no no! Grokking the meaning is important because it allows us to extend our understanding from a strong foundation. E.g. HN a few days ago submitted a paper on "Half Coins"[1] which extends our conventional notion of probability to negative numbers. And after I read the paper, I realized it's not as weird as it sounds.

[0] http://betterexplained.com/cheatsheet/

[1] https://news.ycombinator.com/item?id=8187457

disclaimer: I don't know anything about statistics.

The book mentioned in the discussion looks interesting but super expensive:


Has anyone read it?

Try your nearest university classifieds or used book store

looks like you can find it on http://booksprice.com for about $50

I always think that having a better understanding of statistics and the mathematics associated with Bayesian calculations would be useful but I don't know where to start.

What would be a good book for me to learn?

Probability Theory: The Logic of Science, by E.T. Jaynes, is a highly readable definitive reference that builds the theory from the ground up. The first three chapters are all you need for your purposes and are available online for free (link below), but the entire book is wonderful and well worth reading if you are serious about the subject.


Depends on your background. For me Chapter 28 about "Model Comparison and Occam's Razor" in MacKay's book was an eye-opener regarding the usefulness of Bayesian methods.


This book is good if you are interested in information theory and machine learning, but it assumes a basic math and probability background.

...and yet another ignoramus who doesn't understand the debate and so blankets it with the predictably safe "you're both wrong".

Almost stopped reading at the hilariously rubbish statement "All the models have limitations which make them of useless in practice." but it's Sunday, and I'm being entertained.

I think "of" meant to stand for "often".

Which is still a very strong position, but not entirely out of the park - it's quite common (not only in statistics, of course) to encounter situations where not every prerequisite of a certain methodology is met, and yet the obtained results are usable in a practical environment.
