A famous Bayesian arguing for frequentist statistics?
Gelman tries to steal the concept "Frequentism" from simple-minded frequentist statisticians.
His argument seems to be:
Simple-minded frequentist statisticians perform a statistical procedure once - they do not think about performing the procedure many times.
They fall into this trap (from Gelman's paper):
"3. Researcher degrees of freedom without fishing: computing a single test based on the data,
but in an environment where a different test would have been performed given different data"
Gelman is one of the few self-proclaimed Bayesians who doesn't seem to outright hate frequentist approaches. They're complementary approaches. Bayesian methods are great for combining different sources of information. Frequentist methods are great for validating that a method is working well. (For example, Gelman often recommends running simulations to see if models give sensible predictions, but that is itself a pretty frequentist thing to do.)
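Here's a rough sketch of what that kind of simulation check looks like in practice (everything here is made up for illustration: the "observed" data, the model, and the posterior draws):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "observed" data.
observed = rng.normal(loc=5.0, scale=2.0, size=100)

# Pretend these are posterior draws of (mu, sigma) from a fitted
# Normal(mu, sigma) model; in a real analysis they would come from the fit.
mu_draws = rng.normal(5.0, 0.2, size=1000)
sigma_draws = np.abs(rng.normal(2.0, 0.15, size=1000))

# Simulate replicated datasets and compare a test statistic (here the
# sample maximum) with the observed one. If the observed value sits far
# out in the tail of the simulated values, the model is missing something.
sim_max = np.array([
    rng.normal(mu, sigma, size=observed.size).max()
    for mu, sigma in zip(mu_draws, sigma_draws)
])
print("P(simulated max >= observed max) =", np.mean(sim_max >= observed.max()))
```

Asking "how does this model behave over many simulated datasets?" is exactly the repeated-sampling flavour being pointed at.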
Frequentism is mostly about how to evaluate a methodology. It's pretty agnostic about what that methodology is. Bayesian methods are about combining different sources of information. In a situation where you only have one source of information, Bayesian and Frequentist methods usually give the same answer.
People say you might as well always use Bayesian methods then. But no matter what, you should always try to validate or poke holes in your model, and Frequentist techniques are great for that. So it's best to be familiar with both!
> running simulations to see if models give sensible predictions, but that is itself a pretty frequentist thing to do
Is looking at probability distributions “a pretty frequentist thing to do”? Even when those models and simulations include _prior_ probability distributions? Sure, one can (re)define frequentist to include Bayesian models - as Gelman seems to want to do in that post. I just don’t see how this helps to clarify anything.
>In a situation where you only have one source of information, Bayesian and Frequentist methods usually give the same answer.
Bayesian and frequentist methods always give the same answer because they represent two different ways of translating the same mathematical ideas into English.
Imagine the following question: "what's the male/female ratio in gorillas?"
A frequentist method may provide the answer "[1.1 1.3] is a 95% confidence interval" based on taking a sample of zoos and asking them about the sex of the last gorilla born there.
A Bayesian method will provide a different answer - maybe one that is difficult to reconcile with it. Because it's not "translating the same mathematical ideas into English". Not only is the translation different - the "mathematical ideas" considered are different as well.
A Bayesian may put a strong prior around the 1:1 sex ratio at birth - because, in addition to the data from a sample of births, they incorporate into the calculation knowledge about the plausible ratio coming from previous observations or biological facts about gorillas and related animals - and get a 95% credible interval (which is conceptually completely different from a 95% confidence interval) like [0.99 1.01] or whatever.
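A rough sketch of the two calculations, with made-up survey numbers, a Wald interval standing in for the frequentist method, and a Beta prior standing in for the biological knowledge:

```python
import numpy as np
from scipy import stats

# Made-up survey: 120 recent births reported by zoos, 66 of them male.
n, k = 120, 66

# Frequentist: Wald interval for the proportion of males, then
# transformed into a male/female ratio p / (1 - p).
p_hat = k / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print("95% confidence interval for the ratio:", (lo / (1 - lo), hi / (1 - hi)))

# Bayesian: a strong Beta prior centred on 0.5 (a 1:1 ratio at birth),
# standing in for prior biological knowledge, then a credible interval
# for the ratio from the posterior.
prior_a = prior_b = 200
posterior = stats.beta(prior_a + k, prior_b + (n - k))
draws = posterior.rvs(100_000, random_state=0)
ratio = draws / (1 - draws)
print("95% credible interval for the ratio:", np.percentile(ratio, [2.5, 97.5]))
```

The strong prior pulls the credible interval toward 1:1, so the two intervals are not the same object and generally not the same numbers.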
You can't just say that Bayesian and frequentist methods _always_ give the same answer without offering even a _single_ example.
What is commonly understood as 'Bayesian methods' will give answers in the form of a probability distribution. What is commonly understood as 'frequentist methods' will never do that. How can they always give the same answer then?
Thanks for adding detail. I didn't offer any examples because I want to use ones that sound representative to you.
The 95% confidence interval is in reference to a probability distribution, I'm not sure what you mean when you say that frequentist answers aren't in terms of those.
As for your bayesian answer, there is a prior that would make their result equal to the frequentist one - and in another example where priors were more obviously crucial (weak evidence) the frequentist would still use Bayes' theorem. Their ensemble would be all possible worlds in which they'd ask the question.
> The 95% confidence interval is in reference to a probability distribution, I'm not sure what you mean when you say that frequentist answers aren't in terms of those.
Maybe I should have been more explicit. Let me complete what I wrote above:
"What is commonly understood as 'Bayesian methods' will give answers in the form of a probability distribution FOR THE QUANTITY OF INTEREST. What is commonly understood as 'frequentist methods' will never do that."
In the sex ratio example, the 95% confidence interval [1.1 1.3] DOES NOT mean that the probability that the sex ratio is between 1.1 and 1.3 is 95%. What it means is that when you calculate confidence intervals using this method 95% of them will contain the true value of the parameter. It's not the same thing.
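That coverage interpretation is easy to check by simulation (numbers made up):

```python
import numpy as np

rng = np.random.default_rng(1)
true_p = 0.52            # the fixed (in practice unknown) parameter
n, reps = 120, 10_000

covered = 0
for _ in range(reps):
    k = rng.binomial(n, true_p)
    p_hat = k / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    covered += (p_hat - 1.96 * se <= true_p <= p_hat + 1.96 * se)

# About 95% of the intervals produced by the method contain the true
# value; no single interval gets a probability statement of its own.
print("coverage:", covered / reps)
```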
> As for your bayesian answer, there is a prior that would make their result equal to the frequentist one
So it doesn't _always_ give the same answer, does it? (In the best case you get the same answer by twisting what a confidence interval means and only if the "magic" prior is used.)
If you mean that frequentist methods shouldn't be used if they give a different answer I agree!
They are used because they are easy even though they give the wrong answer (or the answer to the wrong question) - not because they always give the right answer.
Two bayesians can come up with different answers if they use different priors, there are rarely clear-cut rules for choice of prior (if there is a commonly accepted value, how much weight do you put on it?). Two frequentists can come up with different answers if they define their ensembles differently. A frequentist and a bayesian can come up with the same answer if the implicit prior built in to the ensemble matches the explicit prior adopted by the bayesian.
Since bayesians understand that they can't trace back their probability updates all the way back to a single ultimate prior, they do not actually talk about distributions over the quantity of interest either. Ultimately it's the same as what a frequentist would do.
When a bayesian lets a somewhat arbitrarily adopted prior stand in for the "ultimate prior," so that they can interpret their answer as a distribution over the variable of interest, they are not doing anything differently from when a frequentist fudges the exact concepts and treats their answer about the degree to which the model would predict the measured outcomes as if it was the probability of the model being true.
There are no schools of statistics that offer an unqualified "probability that the model is true," and that may be philosophically impossible AFAIK.
We agree that they don't always give the same answer then.
And also that the answers refer to different things. Bayesian methods depend on the prior probability distribution chosen for the parameter of interest and produce a posterior probability distribution for it [1]. Frequentist methods have no concept of a probability distribution for the parameter of interest at all.
So long!
[1] "Since bayesians understand that they can't trace back their probability updates all the way back to a single ultimate prior, they do not actually talk about distributions over the quantity of interest either."
I'm having trouble understanding that, though. If priors and posteriors are not "distributions over the quantity of interest", what are they about?
The posterior can be anything depending on the choice of prior. It is not a distribution over the variable of interest any more than a frequentist's ensembles are. Like frequentist answers, the posterior is understood to be something that can stand in for the in fact unobtainable distribution over the variable of interest.
> It is not a distribution over the variable of interest any more than a frequentist's ensembles are.
It can be right or wrong but it is what it is!
The frequentist ensembles in the frequentist inference are absolutely not a distribution over the variable of interest. They are defined for a fixed value of the variable of interest.
The 95% confidence interval means that 95% of the intervals you generate in the ensemble, where the parameter always has whatever its unknown actual value is, will include it.
Wouldn't that imply that every Bayesian calculation ever done was wrong? The chances that you'd choose the right ultimate prior function R->R from that set are measure zero.
The calculation Input -> [Model] -> Output is not wrong.
I don't know what you are trying to say, but it's no longer related to whether Bayesian methods and frequentist methods give the same answers. They don't even try to represent the same questions.
PS: I didn't have time to reply to your later comment, so here is a final comment reaching for common ground.
> frequentist conceptual machinery
Just to be clear, that's not what I've been talking about. As I've tried to make clear several times, I'm talking about what is commonly understood as frequentist methods. I gave the concrete example of confidence intervals. I started the conversation with an explicit question: What do you call "frequentist methods"?
I agree that if we define "frequentist conceptual machinery" to be "probability" we can do Bayes with it.
Frequentism and bayesianism are more than separate collections of formulas, they're philosophical stances on interpreting theorems about statistics. That's where the separate language comes from. It might not be taught very often (because it's inconvenient and wordy), but it is actually possible to address anything in frequentist language, even cases where you're combining separate sources of differently-weighted evidence. The same goes for bayesian language.
The frequentist test for this attempts to see what would happen with a variety of test designs using likelihood ratio and similar statistical tests. Relaxing it, you end up with the Generalized Method of Moments family.
A Bayesian would attempt to compute the Bayes factor using approximate Bayesian computation resulting in more or less the same thing. You end up with various information criteria.
Both approaches then converge in using Monte Carlo techniques to evaluate the features of the whole experimental setup using simulated data.
All of the above approaches replace the problem of the researcher choosing the test/design based on the data with a choice made by a data-driven algorithm with known properties.
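A toy sketch of that last step: simulating the whole pipeline, including a data-dependent choice of test, to see what its error rate actually is (the decision rule here is invented purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def full_procedure(data):
    """Toy analysis pipeline in which the test applied depends on the
    data, mimicking "a different test would have been performed given
    different data". The decision rule is made up for illustration."""
    if stats.shapiro(data)[1] > 0.05:          # data look normal enough
        return stats.ttest_1samp(data, 0.0).pvalue
    return stats.wilcoxon(data).pvalue         # otherwise a rank test

# Evaluate the *whole* setup by simulation under a true null effect.
pvals = [full_procedure(rng.normal(0.0, 1.0, size=30)) for _ in range(2000)]
print("false positive rate of the full pipeline:", np.mean(np.array(pvals) < 0.05))
```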
On the view of at least some Bayesians, "frequentist" is just the special case of "Bayesian" that you get when you are computing credences based on a large number of identically prepared, independent trials over a known sample space. So on this view (with which I tend to agree), the two are certainly not incompatible.
It is of course possible to do both frequentist and Bayesian statistics badly. I would say bad frequentism comes when one fails to realize that standard frequentist methods tell you the probability of the data given a hypothesis, when what you really need to know is the probability of the hypothesis given the data. Bayesianism at least starts right out with the latter approach, so it avoids the former (unfortunately all too common) error.
Bad Bayesianism, OTOH, I would say comes when one fails to realize that Bayes' rule is not a drop-in replacement for your brain. You still need to exercise judgment and common sense, and you still need to make an honest evaluation of the information you have. You can't just blindly plug numbers into Bayes' rule and expect to get useful answers.
Long ago when I was a physics grad student, I distinctly remember that when someone introduced Bayesian statistics in a talk, it was because they were trying to justify weeding outliers from their data by hand. And they always got called to task on it.
I think that it’s right to call out scientists who think that math is reality.
We made math up, end of story. The “Unreasonable Effectiveness of Mathematics“ is obvious selection bias.
Math, statistics, is a tool. I don’t expect my shovel to be a dowsing rod, and I don’t expect my bayesian methods to predict the future. But, shovels do dig wells and probability does, on average, work out; neither is worthy of being disregarded. But, there have literally been folks since Pythagoras’s time who believe that logic, and math, are The Truth. Like, God: The Truth. Like, it works because it’s the way that nature is, and we understand and control it and it is math… a “Natural Law”.
A better scientific mind does not fall for such folly. The “outliers” are the very phenomena that science wants to study. If we can explain the outliers through an error in method, fine. But, if the outliers cannot be explained, then we would never want to gloss over them because they don’t fit our expectations of a mathematical model of reality.
There is less of a conflict than many would have you believe. In many situations, both approaches yield the same answer. There are some edge cases. For example, in A/B testing, is early peeking bad? From a frequentist perspective the answer is "yes, either use a sequential method, or don't early peek at all". From a Bayesian perspective the answer is "early peeking is fine".
It boils down to what properties you want your analysis to have. Cox and Hinkley's "Theoretical Statistics" has a great discussion (section 2.4). Basically, you might want your analysis to have a certain kind of internal consistency. But you might also want your analysis to be replicable either by yourself or by another researcher. Those both seem like pretty important things! But there are edge cases (like the early peeking example) where you can't have it both ways. So you have to pick which one you want, and use the corresponding methods.
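The early-peeking point in particular is easy to see by simulation; here's a rough sketch with no true difference between the arms (the sample size and number of looks are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def any_peek_significant(n_total=1000, n_peeks=10, alpha=0.05):
    """Simulate one A/B test with no true difference and report whether
    *any* interim look crosses the significance threshold."""
    a = rng.normal(size=n_total)
    b = rng.normal(size=n_total)
    looks = np.linspace(n_total // n_peeks, n_total, n_peeks, dtype=int)
    return any(stats.ttest_ind(a[:n], b[:n]).pvalue < alpha for n in looks)

rate = np.mean([any_peek_significant() for _ in range(2000)])
print("false positive rate with early peeking:", rate)   # well above 0.05
```

From the repeated-sampling point of view that inflated rate is exactly the problem; from the likelihood-principle point of view the stopping rule shouldn't matter.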
The likelihood principle actually supports the Bayesian perspective on these issues of experiment design, and is regarded as foundational by many frequentists.
Agreed. But as Cox and Hinkley discuss, the likelihood principle is sometimes at odds with the repeated sampling principle, so in any particular application, you need to identify if there is a conflict, and if so, which principle is more important. In my domain (simple A/B tests), you can claw the repeated sampling principle from my cold, dead hands.
There's no conflict. Nowadays anyone who takes a few courses in statistics will be exposed to Bayesian statistics. The universal reaction when exposed to Bayesian statistics is to fall in love with it, to think it's the answer to all of the world's problems (with the possible exception of war and hunger). But most people get to the point where they realize that Bayesian statistics is just a set of tools. Frequentist statistics can often get you essentially the same results, with less effort.
And by the way, what do you call MLE (maximum likelihood estimation)? Is it frequentist? Because it looks awfully close to Bayesian.
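Concretely, under a flat prior the posterior mode is the MLE, which is why they look so close. A toy binomial example (numbers made up):

```python
from scipy import stats
from scipy.optimize import minimize_scalar

# Made-up data: 7 successes out of 10 trials.
k, n = 7, 10

# MLE: maximise the binomial likelihood numerically.
mle = minimize_scalar(lambda p: -stats.binom.logpmf(k, n, p),
                      bounds=(1e-6, 1 - 1e-6), method="bounded").x

# Bayesian MAP under a flat Beta(1, 1) prior: the posterior is
# Beta(1 + k, 1 + n - k), whose mode is k / n -- the MLE again.
map_flat = k / n

print(mle, map_flat)   # both ~0.7
```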
Nothing prevents you from "going Bayesian" in figuring out how compatible a hypothesis is with reality. Bayesians have no issues with probabilities representing frequencies even though frequentists cannot understand probabilities representing uncertainty.
My point is, in Many-Worlds the parallel universes aren’t just a thought experiment, they are the actual real future of your current self. The computed probabilities do not describe potentialities then, but actualities.
Probabilities can represent uncertainty about actual things - things that are hardly going to be different even if you strongly imagine your current self branching out into a myriad of parallel universes.
There's nothing stopping the Bayesian posterior from being the degree to which you can reject the null hypothesis except for its definition. There's nothing except their definition stopping real numbers from including i, too.
The p-value's ACTUAL definition, as accepted today in the statistical sense, can incorporate bayesian priors easily by making the model-data compatibility score upon which it acts compatible with Bayesian priors. Such as, e.g., a posterior probability.
My understanding of the post is that given a small actual effect size, for a fixed experiment, you are more likely to get a significant p-value on a "large" measured effect size.
More like "make sure your test power is what you think it is". There will still be results that fail to replicate by virtue of the rejection of the null hypothesis by chance, but that should only happen 1 in 20 times at an alpha of 0.05. With all the bad practices that alter test power, such as p-hacking and the file drawer effect, that 1 in 20 blows up to 1 in 2.
A test size of 0.05 is only really 0.05 if the entire process you used was free from p-hacking. And as a reader of that paper that process includes the file drawer effect. All these things have an impact on distorting the test size away from what it is advertised to be.
In principle the robustness approach (redoing the analysis with some different parameters) tries to show good faith. This matters because observational studies have no meaningful replication, no second collection of data. In practice it just means practitioners gravitate towards methods more likely to show false positives.
Like, this discontinuity at zero approach seems geared to show large effects from noise. When the linear regression across the dataset is a flat line, splitting at the discontinuity pretty much mathematically requires that _any_ slope on one side be matched by an opposite slope on the other side.
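A rough sketch of that with pure noise (everything here is made up; the overall linear trend is removed so the full-sample regression is exactly flat):

```python
import numpy as np

rng = np.random.default_rng(5)

x = np.linspace(-1, 1, 200)
y = rng.normal(0.0, 1.0, size=x.size)          # pure noise, no real effect

# Remove the overall linear trend so the full-sample regression is flat.
y = y - np.polyval(np.polyfit(x, y, 1), x)

left = x < 0
slope_left = np.polyfit(x[left], y[left], 1)[0]
slope_right = np.polyfit(x[~left], y[~left], 1)[0]

# The two half-sample slopes tend to point in opposite directions even
# though there is nothing but noise in the data.
print(slope_left, slope_right)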