When people identify as supporting Bayesian methods over frequentist ones, it actually changes the way they do science. It's a reaction against the bad statistical interpretations that frequently arise in the social sciences, medicine, and elsewhere, because p-values have become the altar at which all publishing researchers must prostrate themselves. This is not a debate without substance; a debate without substance would have no observable effects.
I can understand the frustrations of the author. Every tool has its use, and frequentist statistical methods are frequently useful. They help you get published, they are broadly supported by software packages, they are often faster and easier to perform, etc., so there are many situations in which they are the right choice. But I feel like in attempting to be agnostic about methods, the author is losing nuance in their argument. Bayesian reasoning really does represent the ground truth (assuming imperfect knowledge), and even when we are using frequentist methods, we are only doing so because the benefits outweigh the costs.
Both of these broad families of methods are diverse, but to generalize broadly, decisions informed by "frequentist" tools are often concerned with the worst-case value of the loss function within some universe, and decisions based on "Bayesian" methods can be viewed as caring about the expected value of the loss function given some prior.
Both of those families (and many others) are "the ground truth," in that they both make statements that can be proved as theorems of mathematics. A "frequentist" confidence interval will achieve its guaranteed coverage no matter what the true parameter value is, provided the likelihood function is correctly specified. A "Bayesian" credibility interval will include the true value of the parameters at exactly the advertised rate, when averaged over each possible value of the parameters weighted according to the prior, and again assuming the likelihood function is true.
The author says, and I agree, that the thing worth arguing over is the "norm we use to choose our optimal procedure" and the nature of the loss function. Whether you care about controlling the worst case or the expected value or something else depends on what matters. (The author points out that in an adversarial situation where your strategy is known to your opponent, minimizing the worst case may be advisable...)
Some days, we care about the worst-case performance of QuickSort, and some days we care about its average-case performance (given an assumption that all input orderings are uniformly likely). It's ok to care about different things depending on the application; we don't have to split into warring tribes over it.
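The QuickSort analogy can be made concrete. Here's a small sketch (mine, not the commenter's) that counts comparisons for a naive first-element-pivot QuickSort on already-sorted input (its worst case) versus shuffled input:

```python
import random

def comparisons(xs):
    """Comparison count of a naive QuickSort with the first element as pivot."""
    if len(xs) <= 1:
        return 0
    pivot, rest = xs[0], xs[1:]
    lo = [x for x in rest if x < pivot]
    hi = [x for x in rest if x >= pivot]
    return len(rest) + comparisons(lo) + comparisons(hi)

n = 300
worst = comparisons(list(range(n)))  # sorted input degenerates to n*(n-1)/2
random.seed(0)
avg = sum(comparisons(random.sample(range(n), n)) for _ in range(20)) / 20

print(worst)       # 44850 comparisons: quadratic
print(round(avg))  # a few thousand: roughly n*log(n)
```

Which of the two numbers matters depends on whether an adversary gets to choose your input ordering -- the same trade-off as minimax versus average-case risk.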
More here: http://blog.keithw.org/2013/02/q-what-is-difference-between-...
Frequentist statistics on the other hand gives you no such thing. You can't make decisions based on frequentist statistics. Frequentist statistics is all about reasoning about things that did not happen. Things that did not happen are irrelevant for making decisions in a situation where that thing did happen. So while frequentist statistics is mathematically correct, it's also strictly speaking useless in practice. It's only useful insofar as it gives us heuristics for decision making when the Bayesian approach is impractical.
Here's an example. Suppose there are two types of berries: edible and poisonous. We devise a statistical procedure where you measure some properties of the berry (let's say we look at the color), and the procedure should help you decide whether to eat that berry. Now in frequentist statistics, you'll get a procedure that gives you the correct answer with probability at least p regardless of what your measurement was. Suppose that p = 90% and we observe that the color is blue, and the procedure says: this berry is edible. Can we now eat the berry? No! This says absolutely nothing about the probability of the blue berry being poisonous or not. For example, suppose 95% of berries are edible and red, and 5% of the berries are poisonous and blue. Then a valid procedure would be one that says that all berries are edible. It's correct 95% of the time, comfortably above 90%, so yay! But if the berry we are holding in our hands is blue, it's incorrect 100% of the time. The fact that the procedure would have given us the right answer if the berry we found was red is irrelevant for making the decision of whether to eat the berry in the situation where the berry we found was blue. Things that did not happen are irrelevant for making decisions in a situation where that thing did happen!
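The numbers in this example are easy to check. A quick sketch (my code, assuming the 95%/5% split above):

```python
# Joint distribution from the example: 95% edible-and-red, 5% poisonous-and-blue.
P = {("edible", "red"): 0.95, ("poisonous", "blue"): 0.05}

# The degenerate procedure: declare every berry edible.
overall_accuracy = sum(p for (kind, _), p in P.items() if kind == "edible")

# Conditional accuracy given the berry we actually hold is blue.
p_blue = sum(p for (_, color), p in P.items() if color == "blue")
p_edible_and_blue = P.get(("edible", "blue"), 0.0)
accuracy_given_blue = p_edible_and_blue / p_blue

print(overall_accuracy)     # 0.95 -- beats the 90% guarantee
print(accuracy_given_blue)  # 0.0 -- always wrong for the berry in hand
```

The 90% guarantee is a statement averaged over berries you might have picked, not about the blue berry actually in your hand.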
tl;dr: frequentist vs bayesian is not about worst case vs average case. It's about P(measurement | true value) vs P(true value | measurement). The former is irrelevant for making decisions, the latter is exactly what you want.
Bayesian analysis seeks an estimator which minimises posterior expected loss, conditioning on the data and with the expectation taken over the parameters under a particular prior.
A frequentist analysis might seek an estimator for which uniform bounds on the worst-case expected loss are available, which hold in expectation over the data, given any value of the parameters.
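In symbols (a standard decision-theoretic formulation, not part of the original comments), the two criteria are:

```latex
% Bayes: minimise posterior expected loss, conditioning on the observed data x
\hat{a}_{\mathrm{Bayes}}(x) = \arg\min_{a}\; \mathbb{E}_{\theta \sim \pi(\theta \mid x)}\big[ L(\theta, a) \big]

% Minimax: minimise the worst-case risk over parameter values, where the
% risk is an expectation over data given the parameter
\hat{\delta}_{\mathrm{minimax}} = \arg\min_{\delta}\; \sup_{\theta}\; \mathbb{E}_{x \sim p(x \mid \theta)}\big[ L(\theta, \delta(x)) \big]
```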
Both approaches fit into a decision theoretic framework and there are good reasons why you might care about frequentist properties when making decisions. I agree that this isn't only about average case vs worst case -- as you point out it's also about whether you take expectations over data given params or over the params given data, and that's important too. But I think the average case vs worst case aspect of this is an important part of what this is all about and gets to the heart of what the trade-offs are when choosing between these methods.
I disagree that the sampling distribution is "irrelevant for making decisions", that's quite an extreme view which I don't think many applied Bayesian statisticians would take. Frequentist properties are something people often validly care about when deciding on a statistical procedure to use in an experimental design context, i.e. before collecting the data -- and especially if you're choosing an estimator which you intend to use many times for many experiments, even if they're not all exact replicates of each other.
But there's no free lunch here. We can flip the example around and produce an example that demonstrates one of the criticisms of the Bayesian tools.
"Suppose there are two villages, Frequentistburg and Bayesianville, harvesting berries grown in a field between them. There are two types of berries: edible and poisonous. Suppose 86% of berries in the field are edible and float in water, 9% are edible and sink in water, 4% are poisonous and float, and 1% are poisonous and sink. Both towns are interested in devising a decision rule where a citizen measures some property of the berry (let's say we look at whether it floats or sinks) and the procedure should let them decide whether they can eat that berry with at least 80% certainty that it is edible.
"In Bayesianville, the town leaders announce this procedure in the newspaper: 'Take the berry out of the wrapper and see if it floats in water. Given this observation, calculate the posterior probability that the berry is edible, and if that number is more than 80%, eat away. If everybody follows this procedure, on average only 20% of our town will get poisoned by their morning berry.' Is this a good decision rule for the town? Not really. In practice, the town's citizens will end up eating ALL berries. (p(edible|floats) = 86 / (86 + 4) = 95.6% and p(edible|sinks) = 9 / (9 + 1) = 90%). The town's faraway enemy subscribes to their newspaper, learns the decision rule, and exploits a vulnerability: they rearrange the berry crops on the field so that the berries closest to the Bayesianville harvesters are all the poisonous crops. The next day, 100% of the citizens will do the experiment, 100% of the citizens will conclude that they have a <= 10% chance of getting poisoned by their morning berry, 100% of the citizens will eat that berry, and 100% of the citizens will get poisoned by it.
"In Frequentistburg, the town leaders announce a different procedure in the newspaper: 'We have devised a hypothesis test to reject the hypothesis that your morning berry is poisonous. If the berry sinks, then with p = 0.2, you can reject the hypothesis that the berry is poisonous. If the berry floats, then with p = 0.8, you can reject the hypothesis that the berry is poisonous.' A citizen who uses a tolerance for false positive (mistaken eating) of alpha=20% will end up eating the berry if and only if it sinks. The Bayesianvillagers regard this behavior as bizarre: it's the floating berries that have a higher posterior probability of being edible! But in this procedure, because of the minimax criterion, there is no similar vulnerability that be exploited by an enemy town -- the procedure will preserve 80% of the citizenry even if all of their morning berries are somehow manipulated to be the poisonous kind. (Of course, the procedure also ends up discarding 90% of the berries.)
Frequentistburg sees all of Bayesianville's citizens get poisoned by a bad harvest and replies to your critique: "BOTH towns are caring about 'things that did not happen.' Here in Frequentistburg, we constructed our hypothesis test by caring about observations (e.g. float/sink) that did not happen. Your citizens in Bayesianville calculated their posterior by doing a weighted average over values of the parameter (e.g. edible/inedible) that did not happen."
Moved by the painful experience, the neighboring towns met for a joint summit in a neutral location and explained their desiderata to each other in terms of the common language of decision theory and then they all lived happily ever after.
(In my first link above, I show the same basic problems using a uniform prior among four options.)
"On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."
Furthermore, the frequentist method doesn't do well either; its followers just aren't eating most of the berries (and the berries they are eating are the wrong ones!). If eating berries isn't worth much to you, but dying has a big negative cost, you should encode that in the Bayesian loss function, and it too will be conservative about eating berries. I'm very surprised that you seriously consider a method that lets you eat the more poisonous berries, simply because they are rarer, a valid criticism of Bayesianism! If you had given the correct loss function to the Bayesian, he would simply only let 20% of the people eat berries, but that 20% would be eating the mostly edible berries, and not the mostly poisonous berries, of course.
Saying that Bayesians also care about things that did not happen because one of edible/inedible is something that did not happen is a bad comparison. It is unknown whether the berry is edible/inedible, so it makes sense that we consider both. On the other hand, it is known that the berry is blue, so why would we care about what if it was red?
Because of this mismatch, people who support Bayesian methods perhaps overstate their case. Maybe this is for the same reason that partisan political actors argue fervently for a single side of an issue, ignoring nuance. Nuance historically has not led to social change, and fixing statistics in science requires changing people's minds and "raising awareness."
In fields like medicine and psychology, the flourishing of classical methods in the 30s and 40s and 50s did lead to a sort of dogmatism about p-values and a disdain for the old "inverse probability" (what we now call Bayesian methods) of the 18th and 19th centuries. These fields seem to be still recovering from this. My impression is that that's where you often find self-styled Bayesian rebels with the faith of the converted.
In areas like radar or image processing or communications or information theory or ad placement or AI in general, I think the field has long had a more nuanced understanding of the underlying decision-theoretic concerns. When you have to win World War II and there is a cost to falsely identifying a Nazi aircraft as Allied, and a cost to falsely identifying an Allied aircraft as Nazi, you develop a notion of an ROC curve pretty quickly. Ditto when you want to talk to a Voyager probe and you're not sure if it just sent a 0 or a 1.
In my view, the Bayesian vs. frequentist "debate" has little to say to these fields. Although for a contrary view, see what Jaynes writes about Shannon in "Probability Theory: The Logic of Science" in the last chapter ("Introduction to communication theory").
Also, it's worth a reminder that continuous estimation problems can be viewed as decision problems too; it's not just about discrete decision-making.
I don't think it's great to take the view that, because I'm not making a decision based on this estimate myself, I don't need to worry about which loss function I'm implicitly optimising for when choosing an estimator. Someone else may need to.
Frequentism is a toolset of best practices and experiences about statistics, a KISS approach to statistics so to speak.
Bayesianism is a beautiful approach to probabilities and statistics, yet it often turns out to be really complex when performing calculations; it is also not fool-proof, and it too requires experience to be applied correctly.
Notice who the recipients of the Lakatos Award are: people like Judea Pearl.
Here's a quote from here (Terence Tao, definitely a mathematician).
"probability theory is only “allowed” to study concepts and perform operations which are preserved with respect to extension of the underlying sample space."
Any attempt to side-step these issues leads to people wandering off into the middle of nowhere: misuse of mathematical tools in areas where they do not apply, and all sorts of other major and minor errors that could have been avoided. There is no computer science without Gödel, Frege, Russell, and Turing.
Similarly here, you use your models to predict something, check if they predicted it right, and either try to fix them or ditch them completely.
Frequentist statistics are just a special case of Bayesian statistics with certain implicit, built-in priors - if you pick these priors using Bayesian statistics you'll get the same answers.
The frequentist toolbox is just a collection of these useful special cases of Bayesian statistics, with priors that usually make sense in practice. The advantage is that this greatly simplifies the statistical analysis for many problems. Sometimes these methods fail, though, when the priors they implicitly rely on are far off from the actual situation, and so a full Bayesian analysis is needed.
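One toy instance of that correspondence (my example, with hypothetical numbers): for a binomial proportion, maximising the likelihood and maximising the posterior under a flat prior pick out the same estimate, since the posterior is proportional to 1 x likelihood.

```python
from math import comb

def likelihood(p, k, n):
    """Binomial likelihood of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def flat_prior(p):
    return 1.0  # a flat prior over [0, 1]

k, n = 7, 10
grid = [i / 1000 for i in range(1001)]

# Frequentist point estimate: maximise the likelihood over the grid.
mle = max(grid, key=lambda p: likelihood(p, k, n))

# Bayesian point estimate: posterior mode, posterior proportional to
# prior * likelihood, which is just the likelihood here.
map_flat = max(grid, key=lambda p: flat_prior(p) * likelihood(p, k, n))

print(mle, map_flat)  # both 0.7
```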
Of course, sometimes it's difficult or impossible to use Bayesian methods.
It's analogous to classical vs. quantum/relativistic physics. For many cases, you can get the right answer to a problem using classical physics. But under certain conditions, classical physics breaks down, and you must apply quantum or relativistic physics to get a meaningful answer. On the other hand, for many problems it would be silly or impossible to use quantum or relativistic physics because classical is perfectly good.
So you might get into an argument about whether a specific case can be adequately handled by frequentist statistical methods, or whether a Bayesian analysis can/must be applied.
The philosophical debate about the different approaches to the nature of probability is just that - a philosophical one, and one that has no real bearing on the usefulness and correctness of bayesian or frequentist statistics in practice.
Consider the case of a confidence interval on a binary proportion given a finite number of samples.
A frequentist method will produce an interval that includes the true value of the proportion with at least x% probability, even in the worst case, for any proportion between 0 and 1. (E.g. the Blyth-Still-Casella method or the Clopper-Pearson method.)
An x% credible interval will include the true value exactly x% of the time, averaged over all values of the proportion weighted according to the prior. This will not provide the same guarantee (much less the same interval extent!), no matter what prior is used.
Which is not to say the credible interval is bad or inappropriate. It's just not the same kind of tool. It is optimizing a different penalty function of non-inclusion.
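A sketch of the frequentist side of this (my code, computing the Clopper-Pearson interval by bisection rather than with a library beta quantile, then its exact coverage by summing the binomial pmf):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); increasing in p."""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))

def tail_le(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); decreasing in p."""
    return sum(binom_pmf(j, n, p) for j in range(k + 1))

def solve(f, iters=60):
    """Bisection for the root of an increasing f on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    lower = 0.0 if k == 0 else solve(lambda p: tail_ge(k, n, p) - alpha / 2)
    upper = 1.0 if k == n else solve(lambda p: alpha / 2 - tail_le(k, n, p))
    return lower, upper

def coverage(n, p, alpha=0.05):
    """Exact probability that the interval covers the true p."""
    total = 0.0
    for k in range(n + 1):
        lo, hi = clopper_pearson(k, n, alpha)
        if lo <= p <= hi:
            total += binom_pmf(k, n, p)
    return total

for p in (0.01, 0.1, 0.5, 0.9):
    print(p, round(coverage(20, p), 4))  # each value is >= 0.95
```

Every printed coverage is at least 0.95 -- the worst-case guarantee; the price is that the intervals are conservative, often wider than an equivalent credible interval.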
(Another example where the "Bayesian" technique is not quite as conservative as necessary to achieve a frequentist-style guarantee, even with a uniform prior: http://www.quora.com/I-have-burned-200-disks-and-I-want-to-m...)
While the concept "Bayesian + flat prior = frequentist" is useful to explain a high-level connection between the two worlds, there is a lot more to the topic - and in my opinion, it's something whose surface can hardly be scratched without a formal education in the field.
It's not possible for a prior to be "flat" with respect to all re-parameterisations of a continuous parameter in a model. E.g. a flat prior for the variance isn't flat for its inverse (precision) or its square root (the std. dev.), and the choice of which of these alternative parameterisations you use to express the unknown quantity in the model is arbitrary. In the frequentist case it doesn't affect the result of the inference; in the Bayesian case it matters which of the parameterisations you choose your prior to be flat with respect to.
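A quick numerical illustration of that non-invariance (my code, with an assumed Uniform(0,1) prior on the variance):

```python
import random

random.seed(0)
N = 200_000

# A "flat" prior on the variance over (0, 1)...
variances = [random.random() for _ in range(N)]
# ...induces a non-flat prior on the standard deviation.
std_devs = [v ** 0.5 for v in variances]

# Fraction of prior mass in the lower half of each parameterisation
half_var = sum(v < 0.5 for v in variances) / N
half_sd = sum(s < 0.5 for s in std_devs) / N

print(round(half_var, 2))  # ~0.50: flat in the variance
print(round(half_sd, 2))   # ~0.25: far from flat in the std. dev.
```

By change of variables, s = sqrt(v) has density 2s, so P(s < 0.5) = P(v < 0.25) = 0.25: flat in one parameterisation, far from flat in the other.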
Bayesian methods work by averaging over a bunch of different models / different values of the parameters.
What it means to compute a mean depends on the parameterisation in which you do it: simplest example being that an arithmetic mean is not in general the same as a geometric mean, or a harmonic mean.
There's no "neutral" / parameterisation-independent way to specify how this averaging is done, so if you care about the average case, you're going to have commit to doing it some particular favoured parameterisation. Choosing that parameterisation is equivalent to choosing the prior.
Frequentist methods avoid the need for this decision; the price they pay is that without a prior they're unable to condition on the observed data. They must consider every parameter value and its resulting sampling distribution separately and can't average over them.
And maybe this post also rubs me a little wrong because on the one hand I have an interest in bayesian methods and on the other I will never be able to get a thorough nit-picking-able math education...
In applied statistics we find the bitter religious rivalries of science.
I can't find where I paraphrased this from. Maybe it was Better Explained  while explaining the intuition behind complex numbers? As I remember, the context was "Now you understand that complex numbers are rotations through 2 dimensions, I bet a lot of you are asking 'can we extend math to rotations through 3 dimensions?'" (aka quaternions)
The vibe I get from this article is "the meaning isn't important, so plug and chug away". No no no no no! Grokking the meaning is important because it allows us to extend our understanding from a strong foundation. E.g. HN a few days ago submitted a paper on "Half Coins" which extends our conventional notion of probability to negative numbers. And after I read the paper, I realized it's not as weird as it sounds.
disclaimer: I don't know anything about statistics.
Has anyone read it?
What would be a good book for me to learn?
This book is good if you are interested in information theory and machine learning, but it assumes a basic math and probability background.
Almost stopped reading at the hilariously rubbish statement "All the models have limitations which make them of useless in practice." but it's Sunday, and I'm being entertained.
Which is still a very strong position, but not entirely out of the park - it's quite common (not only in statistics, of course) to encounter situations where not every prerequisite of a certain methodology is met, and yet the obtained results are usable in a practical environment.