Seriously, the main point of an experiment is to gather evidence. Coupled with prior beliefs, you get a posterior belief, but the most important point is how much evidence the experiment provides.
Sure, a full-fledged posterior belief is needed to make an actual decision, like what we should test next. And if a subject is deemed important enough that we need to be certain, we can replicate until we get enough evidence to trump any reasonable prior belief. (Mind publication bias, though: some replications are going to fail, and that's relevant evidence too.)
In the meantime, it would be nice if papers just said "the experiment provides 20dB of evidence that A is wrong and B is right", instead of "B is right (at p<0.01)". No, you're not certain B is right just yet. Your evidence is significant, perhaps even decisive, but it is not certainty. A one-in-a-hundred fluke is not unheard of. Also, sharing likelihood ratios (instead of posterior beliefs) makes the whole debate a bit less heated.
Getting a double one on dice you just threw for the first time doesn't mean they are loaded to make you lose. It only provides about 15 decibels of evidence in favour of such a con job.
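For concreteness, here's a minimal sketch of that arithmetic in Python, assuming the con hypothesis is "the dice always come up double one" (so the likelihood ratio against fair dice is 36:1):

```python
import math

def evidence_db(likelihood_ratio):
    """Evidence in decibels: 10 * log10 of the likelihood ratio."""
    return 10 * math.log10(likelihood_ratio)

# P(double one | dice loaded to always roll 1-1) = 1
# P(double one | fair dice)                      = 1/36
lr = 1 / (1 / 36)
print(round(evidence_db(lr), 1))  # 15.6, i.e. "about 15 decibels"
```

Strong evidence, but nowhere near the ~30+ dB you'd want before accusing anyone of anything.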
It's a conversation between a scientist, a Bayesian, and a confused undergrad: https://arbital.com/p/likelihoods_not_pvalues/?l=4xx (warning: the site loads very slowly).
The frequentist believes that probabilities merely represent the long-term frequencies of events (for a given 'population').
When I learned Bayes' theorem in college stats class, there was no mention of beliefs. It was just a straightforward theorem related to conditional probability.
The frequentists do not "believe," they measure.
However, it's problematic because you can measure a million coin flips and get heads every time. It's not possible to actually measure an infinite number of trials - you need to imagine it.
In frequentist terms, how exactly you define the population distribution you are sampling from is itself a choice.
You are limiting the discussion to what happens after you've chosen what the population distribution consists of; that part is itself a nontrivial and subjective process.
The prior distribution is a beautiful and logical mechanism for adding regularization and domain-specific knowledge to our model.
 Stan, a platform for statistical modeling http://mc-stan.org/
 Automatic Differentiation Variational Inference
 Auto-Encoding Variational Bayes
It is very hard to validate a given choice of a prior in many applications. E.g., if I claim one prior, and another investigator claims a sharper one, it can be very difficult to decide who is right.
If the prior does not wash out due to lots of data, this indicates a serious and fundamental problem.
Both the prior and the likelihood are our model's assumptions. So the prior validation problem is similar to the likelihood validation problem. To check a Bayesian model, or any model, we need to bring it out of the formal world and into the real world for validation.
The prior predictive simulation method, which generates random data points from the prior, is a good heuristic for checking whether the prior is NOT plausible.
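As a sketch (stdlib Python, with made-up numbers, anticipating the wind-speed example below): draw a parameter from the prior, fake data from the likelihood, and see whether the fake data looks like the real world at all:

```python
import random

random.seed(0)

def prior_predictive(n_sims=1000):
    """Draw a parameter from the prior, then simulate data from the likelihood."""
    simulated = []
    for _ in range(n_sims):
        mu = random.gauss(0, 50)   # a vague prior on mean wind speed (m/s)
        y = random.gauss(mu, 5)    # one simulated observation given that mean
        simulated.append(y)
    return simulated

sims = prior_predictive()
frac_impossible = sum(y < 0 or y > 120 for y in sims) / len(sims)
# Roughly half the simulated "wind speeds" are negative or absurdly fast,
# which tells us this prior is NOT plausible before we see any real data.
print(frac_impossible)
```

The check never proves a prior is right; it only flags priors that generate obviously impossible worlds.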
But priors can be much harder.
Say I’m trying to estimate a wind speed from the blade velocity of a windmill. I can bring a more accurate wind speed sensor to calibrate the windmill against the wind speed, perhaps aided by basic physics. This is the likelihood portion.
But what should the prior be? The typical speed at that time of day? The speed in January? The speed on cloudy days? I have to have a crisp number — a full distribution actually, accurate out to the tails. I really have very little grounding for choosing that distribution.
I started out just wanting to relate the wind speed to some data in a rather concrete way, and now I’ve been roped in to choosing a crisp distribution for a rather amorphous state of nature.
This is a deep problem.
We can sharpen the problem. Say my number and yours are different. How do we tell who is right?
One can try a different tack: I’m being stubborn. The prior will mostly wash out in any well-posed problem, or else why try to solve it? But now we’re back to frequentism, just looking at the likelihood.
HN tends to invoke the Bayesian framework as a complete solution to inference — I’m just trying to demonstrate that there are problems with that approach.
What would you do when the sensor returns negative wind speeds due to noise or errors?
The wind speed cannot be negative, or greater than the speed of light. An expert in windmills can narrow down the prior distribution much more.
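A sketch of that narrowing in Python (the numbers are hypothetical): encode the physical constraint by truncating the prior at zero, here via simple rejection sampling from the stdlib:

```python
import random

random.seed(1)

def truncated_normal(mu, sigma, low=0.0, n=10_000):
    """Sample a Normal(mu, sigma) prior truncated to [low, inf) by rejection."""
    samples = []
    while len(samples) < n:
        x = random.gauss(mu, sigma)
        if x >= low:               # wind speed can't be negative
            samples.append(x)
    return samples

prior_draws = truncated_normal(mu=8.0, sigma=4.0)   # an expert's guess, made up
print(min(prior_draws) >= 0.0)   # True: the prior puts no mass on impossible speeds
```

The negative-sensor-reading problem is then handled in the likelihood (measurement noise), while the prior only covers physically possible true speeds.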
> But priors can be much harder.
Choosing a prior is hard because it requires thinking explicitly about the problem and its assumptions.
It merely exposes our lack of expertise on the problem.
When you're lazy, you can pick a uniform prior Uniform(0, c) and call it a day.
> We can sharpen the problem. Say my number and yours are different. How do we tell who is right?
Forget about the prior: say we have 2 sensors which output two slightly different wind speeds. Which wind speed is right?
The lower one, or the average?
This is a deep philosophical problem. However, it's a problem for any model.
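One conventional answer for the two-sensor case: neither reading is "right"; if each sensor has known Gaussian noise, the precision-weighted average is the standard combination (a sketch with made-up numbers):

```python
def fuse(x1, var1, x2, var2):
    """Combine two noisy Gaussian measurements of the same quantity.
    Each reading is weighted by its precision (1/variance)."""
    w1, w2 = 1.0 / var1, 1.0 / var2
    mean = (w1 * x1 + w2 * x2) / (w1 + w2)
    var = 1.0 / (w1 + w2)
    return mean, var

# Sensor A says 10.0 m/s (variance 1.0); sensor B says 10.6 m/s (variance 4.0).
mean, var = fuse(10.0, 1.0, 10.6, 4.0)
print(round(mean, 2), round(var, 2))  # 10.12 0.8
```

Note the fused variance (0.8) is smaller than either sensor's alone: disagreement between sensors is information, not a paradox.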
> The prior will mostly wash out in any well-posed problem.
I don't think so. Any well-posed problem should include the prior, or else how could we tell that 2 data points are not enough?
> HN tends to invoke the Bayesian framework as a complete solution to inference [...]
Bayesian framework is indeed a complete solution to inference in a formal/logical sense.
However, I agree that there are many problems in applying the Bayesian framework to real-world problems that require serious thinking about our assumptions about the problem.
Bradley Efron, in TFA, begs to differ:
"I wish I could report that this resolves the 250-year controversy and that it is now safe to always employ Bayes’ theorem. Sorry. My own practice is to use Bayesian analysis in the presence of genuine prior information; to use empirical Bayes methods in the parallel cases situation; and otherwise to be cautious when invoking uninformative priors. In the last case, Bayesian calculations cannot be uncritically accepted and should be checked by other methods, which usually means frequentistically."
In the windmill example, the AI can quickly collect all it has in its memory about blade speeds, and maybe spend a self-imposed X min computational time to make a best guess for the prior speed distribution.
Humans can't do this, so we have gone down a philosophical rabbit hole of figuring out this "prior problem", when the real problem is that we are just messy informal thinkers.
> How do we tell who is right?
You are fundamentally mistaken here. There is nothing right or wrong about two agents disagreeing on the prior. The different priors reflect the before-experiment knowledge of the two agents. I am a windmill engineer, so my priors will be much narrower than those of someone who has never seen a windmill outside of a Hollywood movie.
Much havoc has befallen the scientific world because of the hidden assumptions of frequentist techniques with poorly understood preconditions, even for rather basic models. And there isn't much anyone can do about that save move to ever more complicated models.
Is it a graduate level topic or is there an intuitive course that teaches it to beginners?
- approximating the posterior using a nice parametric distribution, then
- minimizing some error (typically KL Divergence) between your approximate posterior and the true posterior
A lot of recent work focuses on the Wasserstein distance as an alternative. One advantage of Wasserstein over KL is that the Wasserstein metric provides a better fit over the whole distribution instead of localizing on specific regions, thereby preventing "mode collapse". This makes it a popular metric for training Generative Adversarial Networks (GANs).
For recent work on applying Wasserstein distance to variational inference, see: https://arxiv.org/abs/1805.11284
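The two steps above can be sketched end to end for the simplest possible case, where both the true posterior and the approximating family are Gaussian, so KL(q||p) has a closed form (the target N(3, 1) is an arbitrary stand-in for a "true posterior"):

```python
import math

def kl_gauss(mq, sq, mp, sp):
    """KL(q || p) between two univariate Gaussians N(mq, sq) and N(mp, sp)."""
    return math.log(sp / sq) + (sq**2 + (mq - mp)**2) / (2 * sp**2) - 0.5

# True posterior: N(3, 1).  Approximating family: N(mu, sigma).
target_mu, target_sigma = 3.0, 1.0

# Minimize the KL over a coarse grid of (mu, sigma) candidates.
best = min(
    ((mu / 10, sigma / 10) for mu in range(0, 61) for sigma in range(1, 31)),
    key=lambda q: kl_gauss(q[0], q[1], target_mu, target_sigma),
)
print(best)  # (3.0, 1.0): the KL minimiser recovers the true posterior
```

Real variational inference replaces the grid with stochastic gradient descent on a Monte Carlo estimate of the KL (that's what ADVI automates), and the interesting cases are exactly the ones where the family can't match the posterior perfectly.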
(I haven't watched https://www.youtube.com/watch?v=ogdv_6dbvVQ but it seems like a longer version of the same talk)
Here's a more recent advance https://arxiv.org/pdf/1711.09268.pdf
'vanilla' HMC uses detailed balance to guarantee that the stationary distribution of the chain is the one you want, causing the process to behave like a random walk. So although the Hamiltonian bit of HMC lets you take these great big steps through state space, you end up retracing your steps quite a lot.
Hence NUTS (the No-U-Turn Sampler) et al.
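A minimal sketch of the 'vanilla' version, assuming a standard-normal target (so U(q) = q^2/2): leapfrog proposals plus a Metropolis accept step, which is what enforces detailed balance:

```python
import math
import random

random.seed(2)

U = lambda q: 0.5 * q * q    # potential energy: negative log density of N(0, 1)
grad_U = lambda q: q

def leapfrog(q, p, eps, steps):
    """Approximate Hamiltonian dynamics with the leapfrog integrator."""
    p -= 0.5 * eps * grad_U(q)
    for _ in range(steps - 1):
        q += eps * p
        p -= eps * grad_U(q)
    q += eps * p
    p -= 0.5 * eps * grad_U(q)
    return q, p

def hmc(n_samples, eps=0.3, steps=10):
    q, samples = 0.0, []
    for _ in range(n_samples):
        p = random.gauss(0, 1)                      # resample momentum
        q_new, p_new = leapfrog(q, p, eps, steps)
        h_old = U(q) + 0.5 * p * p                  # total energy before
        h_new = U(q_new) + 0.5 * p_new * p_new      # ... and after
        if random.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new                               # Metropolis accept
        samples.append(q)
    return samples

s = hmc(5000)
mean = sum(s) / len(s)
var = sum((x - mean) ** 2 for x in s) / len(s)
print(mean, var)   # mean near 0, variance near 1
```

The tuning knobs `eps` and `steps` are exactly what NUTS adapts automatically: too few steps and you random-walk, too many and the trajectory U-turns back on itself.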
The different people can then go on and do lots of experiments, collect lots of data, and update their priors to posteriors. And the guarantee is that as long as each person's prior was not a mathematically weird function, after enough evidence has been collected all these people will have the same posterior, i.e. they will agree.
 The famous Aumann's agreement theorem https://en.wikipedia.org/wiki/Aumann%27s_agreement_theorem is a related result that you might like to read about.
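That convergence guarantee can be sketched with conjugate Beta-Binomial updating (the two priors and the data are made up): a sceptic and an enthusiast start far apart, see the same 1000 coin flips, and end up almost identical:

```python
def posterior_mean(alpha, beta, heads, tails):
    """Beta(alpha, beta) prior + Binomial data -> Beta posterior mean."""
    return (alpha + heads) / (alpha + beta + heads + tails)

heads, tails = 600, 400            # the shared evidence: 1000 flips

sceptic    = posterior_mean(2, 8, heads, tails)   # prior mean 0.2
enthusiast = posterior_mean(8, 2, heads, tails)   # prior mean 0.8

print(round(sceptic, 3), round(enthusiast, 3))    # 0.596 0.602
print(abs(sceptic - enthusiast) < 0.01)           # True: the priors have washed out
```

Before the data the two disagreed by 0.6; after it, by about 0.006. A "mathematically weird" prior here would be one that puts zero probability on some region, which no amount of data can resurrect.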
Coming to consensus on priors is the same process for arriving at consensus that all scientific inquiry must engage in. Anyone who says frequentist methods somehow more accurately represent an underlying reality is pulling a fast one.
Bayesian research is just more honest about what's already the case.
That's precisely my point. The act of presenting and refining research IS the act of building that consensus.
My statement here is not a novel thought. It has been pretty much the modern philosophy of science for over a decade.
> So we need new research to gather additional evidence.
This is simply data gathering though. Every approach starts here. I'm not sure why you suggest that people using Bayesian approaches to analysis are somehow forbidden from being informed by data (or informing priors by data).
That's exactly the same process folks use when selecting non-bayesian models. They don't spring from absolute truth, they're selected as well.
> If every research paper tried to argue for a particular prior and posterior
Given the replication crisis that's in part due to mis-application of existing models along with a lack of rigor in data collection, having research focus more tightly on the methodology for presenting data and conclusions doesn't seem like a bad outcome at all.
Bayesian inference is great when you have to make a decision, and there are many theorems that illustrate this (for example, the arguments around coherence and the complete class theorems). In fact, Bayesian techniques are often useful for creating estimators with great frequentist properties! However, Bayesian interpretations of probability, and thereby the meaning of Bayesian statements, are inherently tied to the beliefs of an individual. That means that Bayesian statements usually aren't "true" in the objective / non-relative sense that we often expect from science. On the other hand, frequentist statements tend to have more of an objective flavor. The trick is: all our mathematical models have shortcomings and ways in which they're wrong when applied to any particular situation -- so neither really has a claim to being true.
The frequentist perspective often looks at worst-case risk and tends to give a more global understanding of a procedure, in terms of "how does this procedure shake out in all reasonably possible scenarios?". So frequentist methods tend to be a bit more risk-averse, which is often useful but can cost you for being too pessimistic. Ultimately, the real win is to know your tools well and to pick the right one for the job.
I began to research alternative approaches to modeling and conducting inference a few years ago. Discovering Bayesian Inference has had a large impact on the way I think and conduct research. There's a lot of hype and uncertainty about what "Bayesian" actually means. Here's a compact definition that I hope will attract some interest:
Bayesian Inference allows you to explicitly quantify your prior beliefs and get a more complete picture of uncertainty when modeling something.
If you'd like to learn more, the links below should be helpful.
Introduction to Bayes' Theorem (short):
Bayesian A/B testing example (short):
If you're interested in spending some time learning about applied Bayesian inference, I highly recommend Statistical Rethinking. The book doesn't assume a strong mathematical background and it's filled with practical examples.
McElreath is currently working on a second edition of that textbook, due around 2020:
P(Hypothesis|Data) = P(Hypothesis) * evidence_factor
P(Hypothesis) is the prior probability of the Hypothesis being true, in other words the probability we gave to the Hypothesis before seeing any of the data we are using in the theorem. When new data is observed, we use Bayes' theorem to update our belief in the hypothesis, which in practice means multiplying our prior probability by a number that depends on how well the new data fits our hypothesis. More precisely:
evidence_factor = P(Data|Hypothesis)/P(Data)
So it is the ratio of how likely our data is if our hypothesis is true, compared to (divided by) how likely it is in general. If the data is more likely under our hypothesis, our probability of the hypothesis being true increases; if it is more likely in general (and thus also more likely in case our hypothesis is not true -- you can prove mathematically that those two statements are the same), then our belief in the hypothesis decreases.
TLDR: Prob(Hypothesis after I have seen new data) = Prob(Hypothesis before I saw the new data) * (how likely I am to see the data if my hypothesis is true, compared to in general)
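A worked instance of that update in Python, with made-up numbers: hypothesis "this coin is double-headed" (prior 0.01), data = three heads in a row:

```python
p_h = 0.01                         # prior: P(Hypothesis)
p_data_given_h = 1.0               # a double-headed coin always shows heads
p_data_given_not_h = 0.5 ** 3      # a fair coin shows HHH with prob 1/8

# P(Data) averages over both possibilities (law of total probability)
p_data = p_h * p_data_given_h + (1 - p_h) * p_data_given_not_h
evidence_factor = p_data_given_h / p_data

posterior = p_h * evidence_factor
print(round(evidence_factor, 2), round(posterior, 3))  # 7.48 0.075
```

Three heads in a row multiplied our belief by about 7.5 -- real evidence, but a 1% prior only climbs to about 7.5%.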
P(A|B) = P(B|A)P(A)/P(B)
P(hypothesis|data) = P(data|hypothesis)P(hypothesis)/P(data)
P(hypothesis|data) proportional to P(data|hypothesis)P(hypothesis)
This is probably not what you were looking for but this is it.
"Extraordinary claims require extraordinary evidence"
Where a (naive) frequentist might assume, for instance, that after a 90% accurate test comes back positive the hypothesis is likely to be true, a Bayesianist would ask how likely it was to be true in the first place; all the test did was make it ten times more likely, which may or may not make it probable.
You may enjoy https://www.lesswrong.com/posts/XTXWPQSEgoMkAupKt/an-intuiti...
So a hypothesis's probability increases to the degree that it predicts an observation better than alternative hypotheses do.
In fact, the open-ended stop-when-you-win method is equivalent to running an enormous trial, re-analyzing the result at each point, and publishing the most favorable point as the result.
The only remedy against publication-bias-based cheating is to publish everything, including the failures. That will take care of the "wait until I get a 1-in-20 fluke" trick for getting past the p<0.05 threshold.
Is that right? At each next trial, Bayesians should feed the probability from the previous one in as the prior. If the first two trials did not bring the required results, then the prior going into the third one should be rather small.
Of course, this all comes crashing down if the first two experiments happen to provide contrary evidence (that is, evidence the drug does not work). This would cancel out the results of the final trial somewhat, and not taking this into account is clearly cheating by publication bias.
Bayesian methods aren't used much in industry compared to the frequentist approach. Likelihoodist methods are even rarer. I've learned a bit of Bayesian statistics but ended up refocusing on time series and survival analysis within the frequentist domain. There are waaaay more job postings, and the people you work under are frequentists or more comfortable doing it the old way.
Unfortunately it predates many of the modern developments in methods / computation, but if you want to dive deep, I strongly recommend it. It takes the perspective of designing a reasoning robot to make the most effective decisions.
Another resource, e.g. Stan's manual, can get you up and running on computation/inference. Your choice of computation tool should reflect the type and size of problems you're interested in, and the languages you're comfortable with. Stan has bindings for many scripting languages. R also offers Nimble, Python PyMC3 and Edward, and Julia has DynamicHMC and Turing. (EDIT: xcodevn has better Python recommendations: https://news.ycombinator.com/item?id=18213923 )
What we expect to find at the end is that I get about the same number of heads and tails in the whole meta-experiment. About 95% of the runs will have more heads than tails, but each of those runs will only have one extra head. The few runs where I did all 1000 flips will be ones where heads never had a majority, so they'll probably have lots of extra tails. The same number of heads and tails overall is the relevant result; "95% of runs had majority heads" is bullshit intended to baffle you.
Nobody would be fooled by such nonsense, right?
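Here's a quick simulation of that meta-experiment (stdlib Python): flip a fair coin until heads lead or 1000 flips are up, then compare the two summaries:

```python
import random

random.seed(3)

def run_once(max_flips=1000):
    """Flip a fair coin until heads take the lead, or max_flips is reached."""
    heads = tails = 0
    while heads <= tails and heads + tails < max_flips:
        if random.random() < 0.5:
            heads += 1
        else:
            tails += 1
    return heads, tails

results = [run_once() for _ in range(2000)]
total_heads = sum(h for h, t in results)
total_flips = sum(h + t for h, t in results)
majority_heads = sum(h > t for h, t in results) / len(results)

print(round(total_heads / total_flips, 2))  # ~0.50: the coin is fair overall
print(majority_heads > 0.9)                 # True: yet almost every run "won"
```

The stopping rule can't change the overall head fraction, but it makes the per-run summary look like nearly every run came out in favor of heads.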
bayesRule :: Eq b => Prob a -> (a -> Prob b) -> b -> Prob a
bayesRule prior likelihood observed = do
  h <- prior                  -- draw a hypothesis from the prior
  d <- likelihood h           -- simulate data under that hypothesis
  guard (d == observed)       -- reject hypotheses inconsistent with the data
  return h                    -- what remains is the (unnormalised) posterior
Looks like it was written similarly here http://www.randomhacks.net/files/build-your-own-probability-...
requires an exact match of data, doesn't seem right
Tons of write ups and YouTube videos out there on it but here is one example of an explanation:
You can certainly create a different truth table that arrives at the correct answer, but the truth table approach does not help ensure you get to the right answer like the Bayesian approach does.
Check out Scenario 2 here https://medium.com/@ProfessorF/visualizing-the-solution-to-t... for a correct tree.
I mentioned a full truth table/decision tree because a long time ago these gave me the insight why switching is the right solution in the standard formulation of the problem, and they also illustrate why the problem is a purely deductive/logical problem whose solution does not require any inductive inference.
Then, to me it was a valuable lesson to learn that the Monty Hall problem does not reveal any perceived or real fundamental problem of probability theory.
Depending on how you model Monty Hall's prior probability of revealing the prize, seeing a non-prize door can result in the probability anywhere between 0 and 2/3 of switching being advantageous.
I consider several kind of Monties there, including an "enemy" that will try to open the prize door if he can.
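For the standard Monty (who knows where the prize is and always opens a non-prize, non-chosen door), a quick simulation recovers the 2/3 answer; the variant Monties above amount to changing the rule on the `opened` line:

```python
import random

random.seed(4)

def play(switch):
    """One round of standard Monty Hall; returns True if the player wins."""
    doors = [0, 1, 2]
    prize = random.choice(doors)
    pick = random.choice(doors)
    # Standard Monty: open a door that is neither the pick nor the prize.
    opened = random.choice([d for d in doors if d != pick and d != prize])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize

n = 100_000
switch_wins = sum(play(True) for _ in range(n)) / n
stay_wins = sum(play(False) for _ in range(n)) / n
print(switch_wins, stay_wins)  # switching wins ~2/3 of the time, staying ~1/3
```

An "enemy" Monty is the same code with the host preferring the prize door when he can open it, and the switching advantage evaporates accordingly.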
With frequentism the trick is always in choosing a distribution. You can't update it according to a rule, but there is no reason why you can't simply pick a different population distribution to operate under.
To assign an (improper) uniform prior to the variance of a Gaussian distribution is to assign a non-uniform prior to its standard deviation, and vice versa. One can, in certain circumstances, assign priors to be non-informative in a particular way, but to be universally non-informative, no, it must be Jeffreys' or nothing at all.
In consideration of the aforementioned, the debate about non-informative Bayesian priors is a relic of 20th-century philosophy. The construction of hierarchical causality networks for the purposes of unsupervised learning is the future of Bayesian statistics, and priors in this context are rarely non-informative.
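The variance-vs-standard-deviation point is easy to see numerically: draw the variance uniformly and the implied standard deviation is anything but uniform (a stdlib Python sketch):

```python
import random

random.seed(5)

n = 100_000
variances = [random.uniform(0, 1) for _ in range(n)]   # "uniform" prior on variance
sds = [v ** 0.5 for v in variances]                    # implied prior on std dev

# Under a uniform prior on sd, P(sd > 0.5) would be 0.5.
# Under a uniform prior on variance, P(sd > 0.5) = P(var > 0.25) = 0.75.
frac = sum(s > 0.5 for s in sds) / n
print(round(frac, 2))  # ~0.75: the induced prior on sd is not uniform
```

So "flat" is always flat *in some parameterization*, which is exactly the non-informativeness problem Jeffreys' prior is designed to sidestep.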
On the other hand, these priors can be difficult to create in some (many?) situations and it's often more tractable to do ML.
Bayesian inference seems more principled to me in general if you allow for and use reference priors, but outside of that I think there are still reasons to prefer ML. There are two areas where I still have problems with priors.
The first is that the sequential testing paradigm (that is, prior -> posterior -> prior) doesn't always work in reality because you often have multiple experimenters operating simultaneously and independently with different priors. In one sense this is a trivial problem but in another sense it is not. E.g., if you are a meta-analyst faced with integrating such results, is prior variation akin to publication bias? What implications does that have?
The second is that there are situations in which using a prior actually might lead to unfair inequities. For example, let's say you're trying to make some inference about an individual, and know that ethnicity provides information in a statistical sense about the parameter you are making an inference about. Is it prejudicial or not to use a prior? I think using a reference prior would address this situation, but depending on the scenario you could make an argument that it is unfair (e.g., if the informative prior would suggest a positive outcome, not using it might be seen as prejudicial, but if the informative prior would suggest a negative outcome, using it might be seen as unfair). In this case, not using a prior at all actually might make sense--you might make a similar argument about non-Bayesian inference as Bayesian reference inference, but using non-prior-based inference does sidestep the issue in a sense, in that there is no longer a prior to decide about. This might be especially important in that, e.g., if you have a series of individuals, the act of choosing a prior might be seen as prejudicial in itself.
I generally consider myself as an "objective Bayesian" in the Jaynesian / reference prior sense, but there are practical and theoretical scenarios where I think people are likely to run into problems.