
Bayes’ Theorem in the 21st Century (2013) [pdf] - mikevm
http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/Science-2013-Efron.pdf
======
loup-vaillant
[http://www.overcomingbias.com/2009/02/share-likelihood-ratio...](http://www.overcomingbias.com/2009/02/share-likelihood-ratios-not-posterior-beliefs.html)

Seriously, the main point of an experiment is to _gather evidence_. Coupled
with prior beliefs, you get a posterior belief, but the most important point
is _how much evidence_ the experiment provides.

Sure, a full-fledged posterior belief is needed to make an actual decision,
like what we should test next. And if a subject is deemed important enough
that we need to be certain, we can replicate until we get enough evidence to
trump any reasonable prior belief. (Mind publication bias, though: some
replications _are_ going to fail, and that's relevant evidence too.)

In the meantime, it would be nice if papers just said "the experiment
provides 20 dB of evidence that A is wrong and B is right", instead of saying
"B is right (at p<0.01)". No, you're not certain B is right just yet. Your
evidence is significant, perhaps even decisive, but it is _not_ certain. A
one-in-a-hundred fluke is not unheard of. Also, sharing likelihood ratios
(instead of posterior beliefs) makes the whole debate a bit less heated.

Getting a double one on dice you just threw for the first time doesn't mean
they are loaded to make you lose. It only provides about 15 decibels of
evidence in favour of such a con job.
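
For a concrete back-of-the-envelope check, here is a tiny Python sketch (the assumption that loaded dice would always come up double ones is mine, as the most favourable case for the con hypothesis):

    import math

    # Hypothetical worst case: loaded dice that always throw double ones.
    p_data_given_loaded = 1.0        # P(double one | loaded)
    p_data_given_fair = 1.0 / 36.0   # P(double one | fair dice)

    likelihood_ratio = p_data_given_loaded / p_data_given_fair
    evidence_db = 10 * math.log10(likelihood_ratio)
    print(evidence_db)   # ~15.6 dB of evidence for the "loaded" hypothesis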

~~~
platz
The Bayesian believes that probability represents our beliefs about the world.

The Frequentist believes that probabilities merely represent the long term
frequency counts of events (for a given 'population').

~~~
Koshkin
> _The Frequentist believes_

The frequentists do not "believe," they _measure_.

~~~
akvadrako
No. They believe you can measure an infinite number of trials (say # of heads
vs tails) and whatever ratio you get is the probability of heads.

However, it's problematic: you could measure a million coin flips and get
heads every time. It's not possible to actually measure an infinite number of
trials; you need to imagine it.

~~~
Koshkin
This is just silly, nobody in their right mind would believe they could do
something “an infinite number” of times.

~~~
akvadrako
If you don’t do an infinite number of trials then you can’t be sure your
frequencies match the real probability.

------
xcodevn
I truly believe that Bayesian inference is the statistics of the 21st century.
Recent advances in MCMC (e.g., NUTS, Stan [1]) and variational inference
(e.g., ADVI [2], VAE [3], etc.), plus more computing power than ever, promise
a near future in which Bayesian inference is the default inference engine.

The prior distribution is a beautiful and logical mechanism for adding
regularization and domain-specific knowledge to our models.
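
As a toy illustration of the regularization point (my own sketch, not from the paper): with a Gaussian prior N(0, tau^2) on regression weights and Gaussian noise, the MAP estimate is exactly ridge regression with lambda = sigma^2 / tau^2.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    w_true = np.array([1.0, -2.0, 0.5])
    sigma, tau = 0.5, 1.0                      # noise std, prior std (assumed values)
    y = X @ w_true + sigma * rng.normal(size=100)

    # MAP under w ~ N(0, tau^2 I) and y ~ N(Xw, sigma^2 I)
    # is ridge regression with lambda = sigma^2 / tau^2.
    lam = sigma**2 / tau**2
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(w_map)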

[1] Stan, a platform for statistical modeling [http://mc-stan.org/](http://mc-stan.org/)

[2] Automatic Differentiation Variational Inference
[https://arxiv.org/abs/1603.00788](https://arxiv.org/abs/1603.00788)

[3] Auto-Encoding Variational Bayes
[https://arxiv.org/abs/1312.6114](https://arxiv.org/abs/1312.6114)

~~~
platz
I feel like variational inference has never been described very well to an
intro audience, even one with statistical basics.

Is it a graduate-level topic, or is there an intuitive course that teaches it
to beginners?

~~~
klipt
"variational inference" is perhaps an uninformative name. You can just think
of it as

\- approximating the posterior using a nice parametric distribution, then

\- minimizing some error (typically KL Divergence) between your approximate
posterior and the true posterior
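
A minimal sketch of that recipe (a toy one-dimensional example in Python, not tied to any particular library): approximate an unnormalized "posterior" p with a Gaussian q(mu, sigma) by searching for the (mu, sigma) that minimize KL(q||p) on a grid.

    import numpy as np

    # Unnormalized 1-D "posterior": a skewed mixture, standing in for p(theta | data).
    xs = np.linspace(-10, 10, 4001)
    unnorm = 0.7 * np.exp(-0.5 * (xs - 1) ** 2) + 0.3 * np.exp(-0.5 * ((xs - 4) / 2) ** 2)
    p = unnorm / np.trapz(unnorm, xs)          # normalize numerically

    def kl_q_p(mu, sigma):
        # KL(q || p) for q = Normal(mu, sigma), computed on the grid.
        q = np.exp(-0.5 * ((xs - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        mask = q > 1e-12
        return np.trapz(q[mask] * np.log(q[mask] / p[mask]), xs[mask])

    # Crude "variational inference": search the parametric family for the best fit.
    candidates = [(mu, sigma) for mu in np.linspace(0, 5, 51)
                              for sigma in np.linspace(0.5, 4, 36)]
    best = min(candidates, key=lambda ms: kl_q_p(*ms))
    print(best)   # (mu, sigma) of the Gaussian closest to p in KL(q || p)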

~~~
ced
Do you know _why_ KL divergence is minimized? I get that it gives a lower
bound on the marginal likelihood, which is cool, but is that it? What are the
alternatives?

~~~
ssivark
The KL divergence is motivated nicely from an information/coding-theory
viewpoint. It's very closely related to Shannon-von Neumann entropy [1], and
KL(P||Q) characterizes the expected number of extra bits you pay when data
that actually follows P is encoded with a code designed for a model
distribution Q.
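
A tiny numerical illustration of that coding view (the symbol frequencies below are made up): KL(P||Q) in bits is exactly the gap between the average code length under Q's code and the best achievable average code length for P.

    import numpy as np

    P = np.array([0.5, 0.25, 0.25])   # "reality": true symbol frequencies (assumed)
    Q = np.array([0.8, 0.1, 0.1])     # model the code was designed for (assumed)

    kl_bits = np.sum(P * np.log2(P / Q))        # KL(P || Q) in bits
    cross_entropy = -np.sum(P * np.log2(Q))     # average code length using Q's code
    entropy = -np.sum(P * np.log2(P))           # optimal average code length for P
    print(kl_bits, cross_entropy - entropy)     # the two numbers agree (~0.32 bits)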

A lot of recent work focuses on the Wasserstein distance [2] as an
alternative. One advantage of Wasserstein over KL is that the Wasserstein
metric encourages a fit over the whole distribution instead of localizing on
some specific regions, thereby helping to prevent "mode collapse". This makes
it a popular metric for training Generative Adversarial Networks (GANs).

For recent work on applying Wasserstein distance to variational inference,
see: [https://arxiv.org/abs/1805.11284](https://arxiv.org/abs/1805.11284)

[1]: [https://physics.stackexchange.com/questions/64574/definition...](https://physics.stackexchange.com/questions/64574/definition-of-the-entropy/64597#64597)

[2]: [https://en.wikipedia.org/wiki/Wasserstein_metric](https://en.wikipedia.org/wiki/Wasserstein_metric)

------
nickhuh
I think this article omits the most important distinction between Bayesian and
Frequentist statistics: subjective vs. frequentist interpretations of
probability. In my own opinion, neither is "true"; they're both just different
tools for different purposes.

Bayesian inference is great when you have to make a decision and there are
many theorems that illustrate this (for example, the arguments around
coherence [1] and the complete class theorems [2]). In fact, Bayesian
techniques are often useful for creating estimators with great frequentist
properties! However, Bayesian interpretations of probability, and thereby the
meaning of Bayesian statements, are inherently tied to the beliefs of an
individual. That means that Bayesian statements usually aren't "true" in the
objective / non-relative sense that we often expect from science. On the other
hand, frequentist statements tend to have more of an objective flavor. The
trick is: all our mathematical models have shortcomings and ways in which
they're wrong when applied to any particular situation, so neither really has
a claim to being true.

The frequentist perspective often looks at worst-case risk and tends to give a
more global understanding of a procedure, in terms of "how does this procedure
shake out in all reasonably possible scenarios?". So frequentist methods tend
to be a bit more risk-averse, which is often useful but can cost you for being
too pessimistic. Ultimately, the real win is to know your tools well and to
pick the right one for the job.

[1]
[https://en.wikipedia.org/wiki/Coherence_(philosophical_gambl...](https://en.wikipedia.org/wiki/Coherence_\(philosophical_gambling_strategy\))

[2]
[https://projecteuclid.org/euclid.aoms/1177730345](https://projecteuclid.org/euclid.aoms/1177730345)

------
dna_polymerase
For those who are new to Bayesian statistics and not too eager to dive into
the maths right away, I recommend Think Bayes [0]. It gives a nice
introduction for those who know programming (Python). The ebook is available
for free (see link).

[0]: [https://greenteapress.com/wp/think-
bayes/](https://greenteapress.com/wp/think-bayes/)

------
laplacesdemon48
I remember the frequentist approach taught in introductory stats classes never
making sense to me. I didn't want to shove "stats" into the back of my brain
and just focus on graduating. I genuinely wanted to understand the world a
little better.

I began to research alternative approaches to modeling and conducting
inference a few years ago. Discovering Bayesian Inference has had a large
impact on the way I think and conduct research. There's a lot of hype and
uncertainty about what "Bayesian" actually means. Here's a compact definition
that I hope will attract some interest:

Bayesian Inference allows you to explicitly quantify your prior beliefs and
get a more complete picture of uncertainty when modeling something.

If you'd like to learn more, the links below should be helpful.

Introduction to Bayes' Theorem (short):
[https://www.countbayesie.com/blog/2015/2/18/bayes-theorem-
wi...](https://www.countbayesie.com/blog/2015/2/18/bayes-theorem-with-lego)

Bayesian A/B testing example (short):
[https://www.countbayesie.com/blog/2015/4/25/bayesian-ab-
test...](https://www.countbayesie.com/blog/2015/4/25/bayesian-ab-testing)

If you're interested in spending some time learning about applied Bayesian
inference, I highly recommend Statistical Rethinking. The book doesn't assume
a strong mathematical background and it's filled with practical examples:
[https://xcelab.net/rm/statistical-rethinking/](https://xcelab.net/rm/statistical-rethinking/)

McElreath is currently working on a second edition of that textbook, due
around 2020: [http://elevanth.org/blog/2018/07/14/statistical-rethinking-e...](http://elevanth.org/blog/2018/07/14/statistical-rethinking-edition-2-eta-2020/)

------
qwerty456127
BTW, can anybody share a link to a really simple explanation of Bayes'
Theorem? I saw one once; it was the size of a tweet and would let you
understand the theorem in a matter of seconds. All the "super-duper intuitive
explanations" around are actually too long and complex.

~~~
fela
I like to split the theorem in the following way:

P(Hypothesis|Data) = P(Hypothesis) * evidence_factor

P(Hypothesis) is the prior probability of the hypothesis being true, in other
words the probability we gave to the hypothesis before seeing any of the data
we are using in the theorem. When new data is observed, we use Bayes' theorem
to update our belief in the hypothesis, which in practice means multiplying
our prior probability by a number that depends on how well the new data fits
our hypothesis. More precisely:

evidence_factor = P(Data|Hypothesis)/P(Data)

So it is the ratio of how likely our data is if our hypothesis is true,
compared to (divided by) how likely it is in general. If the data is more
likely under our hypothesis, our probability of the hypothesis being true
increases; if it is more likely in general (and thus also more likely in case
our hypothesis is _not_ true; you can prove mathematically that those two
statements are equivalent), then our belief in the hypothesis decreases.

 _TLDR_ : Prob(Hypothesis after I have seen new data) = Prob(Hypothesis before
I saw the new data) * (how likely I am to see the data if my hypothesis is
true, compared to in general)
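
Worked through in Python on a made-up disease-testing example (the 1% base rate and the test accuracies are purely illustrative numbers):

    # Hypothesis: patient has a disease. Data: a positive test result.
    p_h = 0.01                    # prior P(Hypothesis)
    p_data_given_h = 0.95         # P(positive | disease)       (assumed)
    p_data_given_not_h = 0.05     # P(positive | no disease)    (assumed)

    # P(Data) by the law of total probability
    p_data = p_data_given_h * p_h + p_data_given_not_h * (1 - p_h)

    evidence_factor = p_data_given_h / p_data
    posterior = p_h * evidence_factor
    print(posterior)   # ~0.16: the positive test multiplied our belief by ~16x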

------
gweinberg
"A Bayseian FDA regulator would b more forgiving". He absolutely would not. It
should be obvious to anyone who gives the matter any thought that it must be
more probable that you'll conclude drug A is better than drug B at the 5%
level at some point over the course of an open ended experiment than that that
will be the case at the end of a trial with a specific number of runs.

In fact, the open ended stop when you win method is the equivalent of running
an enormous trial and then re-anlyzing the result at each point and publishing
the most favorable point as the result.
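
A quick simulation sketch of that inflation effect (two "drugs" that are in fact identical, with a two-proportion z-test after every batch of 100 patients per arm; all sizes are arbitrary choices for illustration):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    n_trials, n_batches, batch = 2000, 20, 100   # assumed sizes, for illustration

    def p_value(succ_a, succ_b, n):
        # Two-sided two-proportion z-test; both drugs truly have success rate 0.5.
        pa, pb = succ_a / n, succ_b / n
        pooled = (succ_a + succ_b) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        return 2 * norm.sf(abs(pa - pb) / se)

    peeking_hits = final_hits = 0
    for _ in range(n_trials):
        a = b = 0
        significant_at_some_point = False
        for i in range(1, n_batches + 1):
            a += rng.binomial(batch, 0.5)
            b += rng.binomial(batch, 0.5)
            if p_value(a, b, i * batch) < 0.05:
                significant_at_some_point = True
        peeking_hits += significant_at_some_point
        final_hits += p_value(a, b, n_batches * batch) < 0.05

    print(peeking_hits / n_trials)   # false alarm rate with interim looks: well above 5%
    print(final_hits / n_trials)     # false alarm rate testing only at the end: ~5%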

~~~
loup-vaillant
The "open ended stops when you win" method, has to take into account contrary
evidence into account. If you make 10 experiments, and 3 of those say your new
and improve medication doesn't work, you have to take those into accounts, or
else you're just cheating.

The only remedy against publication bias based cheating is to publish
_everything_ , including the failures. That will take care of the "wait until
I get a 1 in 20 fluke" just so you can get past the p<0.05 threshold.

------
zby
""" The Bayesian-frequentist argument, unlike most philosophical disputes, has
immediate practical consequences. Consider that after a 7-year trial on human
subjects, a research team announces that drug A has proved bet- ter than drug
B at the 0.05 signifi cance level. Asked why the trial took so long, the team
leader replies “That was the first time the results reached the 0.05 level.”
Food and Drug Administration (FDA) regulators reject the team’s submission, on
the frequentist grounds that interim tests of the data, by taking repeated
0.05 chances, could raise the false alarm rate to (say) 15% from the claimed
5%. A Bayesian FDA regulator would be more forgiving. Starting from a given
prior distri- bution, the Bayesian posterior probability of drug A’s
superiority depends only on its fi nal evaluation, not whether there might
have been earlier decisions. """

Is that right? At each successive trial, Bayesians should feed in the
probability from the previous one as the prior. Assuming the first two trials
did not bring the required results, the prior going into the third one should
be rather small.

~~~
gweinberg
It's wrong. Not because the stated probability at the point you stop the
experiment is wrong, but because the (stupid) rule is that we'll approve the
drug if the probability that the drug is better than the alternative is above
some arbitrary threshold value. If you run experiments with a fixed number of
trials, you'll get a variety of conclusions with different strengths. If you
stop as soon as you are in the "barely passing" zone, you'll get a lower
number of failing results, a higher number of barely passing results, and none
at all that do better than barely passing.

------
digitalzombie
Written by the creator of the bootstrap method in statistics.

Bayesian methods aren't used much in industry compared to the frequentist
approach. Likelihoodist methods are even rarer. I've learned a bit of Bayesian
statistics but ended up refocusing on time series and survival analysis within
the frequentist domain. There are waaaay more job postings, and the people you
work under are frequentist or more comfortable doing it the old way.

------
adamnemecek
Bayes' theorem made more sense to me once I started seeing the conditional (if
statement) in conditional probability.

~~~
cultus
Yep. Bayes' theorem is essentially a generalization of contraposition (if A
implies B, then "not B" implies "not A") to stochastic settings: in the
limiting case where P(B|A) = 1, Bayes' theorem gives P(A|not B) = 0, which is
exactly the contrapositive. It's not usually taught in an intuitive way.

~~~
adamnemecek
Yeah, I had to figure some of this out on my own. Do you have a good
book/resource on this?

~~~
celrod
Probability Theory: The Logic of Science is an excellent textbook on the math
and theory. Note that the title even references cultus' point.

Unfortunately it predates many of the modern developments in methods and
computation, but if you want to dive deep, I strongly recommend it. It takes
the perspective of designing a reasoning robot to make the most effective
decisions.

Another resource, e.g. Stan's manual, can get you up and running on
computation/inference. Your choice of computation tool should reflect the type
and size of the problems you're interested in, and the languages you're
comfortable with. Stan has bindings for many scripting languages; R also
offers Nimble, Python has PyMC3 and Edward, and Julia has DynamicHMC and
Turing. (EDIT: xcodevn has better Python recommendations:
[https://news.ycombinator.com/item?id=18213923](https://news.ycombinator.com/item?id=18213923)
)

------
gweinberg
I apologize for repeating myself, I don't think I'm being clear on what I'm
trying to say. Let me give an illustrative example: Let's say I've got an 100%
fair coin, but I want to trick you into thinking it comes up heads more often
than tails. The way I do this is wit a meta-experiment: I will have 1000
trials with up to 1000 flips each, but I stop the experiment whenever I have a
majority of heads.

What we expect to find at the end is, I get about the same number of heads and
tails in the whole meta-experiment. About 95% of the runs will have more heads
than tails, but each of those runs will only have one extra head. The few runs
where I did all 1000 flips will be ones where heads never had a majority, so
they'll probably have lots of extra tails. Same number of heads and tails over
all is the relevant result, 95% of runs had majority heads is bullshit
intended to baffle you. Nobody would be fooled by such nonsense, right?
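
A small simulation sketch of that meta-experiment, with the sizes from the example above:

    import numpy as np

    rng = np.random.default_rng(2)
    runs_with_heads_majority = 0
    total_heads = total_flips = 0

    for _ in range(1000):                     # 1000 runs
        heads = 0
        for flip in range(1, 1001):           # up to 1000 flips per run
            heads += rng.integers(0, 2)       # fair coin
            if 2 * heads > flip:              # stop as soon as heads has a majority
                break
        runs_with_heads_majority += 2 * heads > flip
        total_heads += heads
        total_flips += flip

    print(runs_with_heads_majority / 1000)    # the great majority of runs end "heads ahead"
    print(total_heads / total_flips)          # overall heads fraction stays near 0.5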

------
formalsystem
Does anyone have any examples of informative priors that they used to solve
some problem at work?

~~~
xcodevn
See the discovery paper of gravitational waves [1], which uses informative
priors derived from physical evidence/constraints.

[1]:
[https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.11...](https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.119.161101)

------
laretluval
Suppose you have a probability monad, implementing enumeration or random
sampling. Like Amb but with probabilities attached.

    
    
    bayesRule :: Eq b => Prob a -> (a -> Prob b) -> b -> Prob a
    bayesRule prior likelihood observed = do
        h <- prior
        d <- likelihood h
        guard (d == observed)   -- keep only the worlds consistent with what we saw
        return h
    -- ("data" is a reserved word in Haskell, hence "observed";
    --  guard assumes Prob has an Alternative/MonadPlus instance.)
    

I don’t actually do Haskell...just thinking out loud.

Looks like it was written similarly here:
[http://www.randomhacks.net/files/build-your-own-probability-monads.pdf](http://www.randomhacks.net/files/build-your-own-probability-monads.pdf)

~~~
platz
d == observed

requires an exact match of the data, which doesn't seem right

------
zhoge
Interesting article (thanks for sharing). Indeed, a genuine prior is the key
to applying Bayes' rule. However, is there a proof that one always exists?
(Much as Feynman would question whether the laws of physics change over time.)

------
byt143
[https://scholar.google.com/scholar?cluster=26077808520337143...](https://scholar.google.com/scholar?cluster=260778085203371430&hl=en&as_sdt=0,21&sciodt=0,21#d=gs_qabs&p=&u=%23p%3DpuXAbUl4ngMJ)

------
platz
"The prior can generally only be understood in the context of the likelihood"
[https://arxiv.org/abs/1708.07487](https://arxiv.org/abs/1708.07487)

------
Vaslo
One of the best examples of the idea of Bayes is the Monty Hall problem. It is
a good example of how a prior probability (three unopened doors) can lead to a
clearer posterior: the host opens a door with a bad prize, and you are asked
whether to stay with the unopened door you chose or switch to the other
remaining unopened door. It turns out, via Bayes, that it's better to switch
doors because you now have more information.

Tons of write-ups and YouTube videos out there on it, but here is one example
of an explanation:

[http://angrystatistician.blogspot.com/2012/06/bayes-solution...](http://angrystatistician.blogspot.com/2012/06/bayes-solution-to-monty-hall.html?m=1)

~~~
13415
It's better to switch doors according to _any_ statistical method; whether
you're a frequentist or a Bayesian does not matter. You can also show that you
should switch doors by making an exhaustive truth table, by writing a computer
program, or by experimentation, if you prefer those kinds of approaches.
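
For example, a straightforward simulation sketch in Python (my own, not tied to any of the linked write-ups):

    import random

    def play(switch, doors=3):
        prize = random.randrange(doors)
        pick = random.randrange(doors)
        # Monty opens a door that is neither the pick nor the prize.
        monty = random.choice([d for d in range(doors) if d != pick and d != prize])
        if switch:
            pick = next(d for d in range(doors) if d != pick and d != monty)
        return pick == prize

    n = 100_000
    print(sum(play(switch=True) for _ in range(n)) / n)    # ~2/3
    print(sum(play(switch=False) for _ in range(n)) / n)   # ~1/3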

~~~
daveFNbuck
A truth table doesn't necessarily get you to the right answer. There are 3
doors I could pick, 3 doors Monty could pick, and 3 doors the prize could be
behind. If I make a truth table of all 27 possible combinations, there are 12
combinations where Monty doesn't choose the same door as the contestant or the
prize. Of these 12 options, exactly 6 have the contestant choosing the right
door and 6 have the contestant choosing the wrong door.
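
Enumerating those 27 combinations in Python reproduces exactly that (misleading) count:

    from itertools import product

    # All (contestant pick, Monty's door, prize door) combinations.
    combos = list(product(range(3), repeat=3))
    valid = [(p, m, z) for p, m, z in combos if m != p and m != z]

    print(len(valid))                          # 12 rows survive
    print(sum(p == z for p, m, z in valid))    # 6 where staying wins
    print(sum(p != z for p, m, z in valid))    # 6 where switching wins
    # Counting rows as if they were equally likely suggests 50/50, which is wrong:
    # the rows where pick == prize each carry half the probability of the others.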

You can certainly create a different truth table that arrives at the correct
answer, but the truth table approach does not help ensure you get to the right
answer like the Bayesian approach does.

~~~
13415
No method on earth necessarily gives you the right answer to a problem. You
first have to represent the problem correctly, of course.

Check out Scenario 2 here
[https://medium.com/@ProfessorF/visualizing-the-solution-to-t...](https://medium.com/@ProfessorF/visualizing-the-solution-to-the-monty-hall-problem-bf4a08c8ed8f)
for a correct tree.

~~~
daveFNbuck
Neither method guarantees the right answer, but it's a lot easier to get it
wrong with a truth table.

~~~
13415
Maybe you're right, I'm not fully convinced.

I mentioned a full truth table/decision tree because a long time ago these
gave me the insight _why_ switching is the right solution in the standard
formulation of the problem, and they also illustrate why the problem is a
purely deductive/logical problem whose solution does not require any inductive
inference.

Then, to me it was a valuable lesson to learn that the Monty Hall problem does
not reveal any perceived or real fundamental problem of probability theory.

------
delver
I founded my consulting business around Bayesian networks. If anyone needs a
deeper explanation or advice, feel free to reach out: manmit@dextroanalytics.com

------
mannermachine
The only sensible "non-informative" prior is Jeffreys' prior. Invariance under
reparameterization of the parameter is what I would consider to be a non-
negotiable feature of any non-informative prior belief.

To assign an (improper) uniform prior to the variance of a Gaussian
distribution is to assign a non-uniform prior to its standard deviation, and
vice versa. One can, in certain circumstances, assign priors to be non-
informative _in a particular way_, but to be universally non-informative, no,
it must be Jeffreys' or nothing at all.
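
A quick sanity check of that change-of-variables point (a sketch; the Uniform(0, 1) prior on the variance is just an example):

    import numpy as np

    rng = np.random.default_rng(3)
    variance = rng.uniform(0.0, 1.0, size=1_000_000)   # "uniform prior" on the variance
    std = np.sqrt(variance)                            # the implied prior on the std dev

    # If the std dev were also uniform on (0, 1), about half the samples would fall below 0.5.
    print(np.mean(std < 0.5))   # ~0.25, not 0.5: the implied prior on sigma is far from flat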

In consideration of the aforementioned, the debate about non-informative
Bayesian priors is a relic of 20th-century philosophy. The construction of
hierarchical causality networks for the purposes of unsupervised learning is
the future of Bayesian statistics, and priors in this context are rarely non-
informative.

~~~
lellotope
I agree with a tiny caveat, in that I'd change Jeffreys prior to reference
prior.

On the other hand, these priors can be difficult to create in some (many?)
situations, and it's often more tractable to do ML (maximum likelihood).

Bayesian inference seems more principled to me in general if you allow for and
use reference priors, but outside of that I think there are still reasons to
prefer ML. There are two areas where I still have problems with priors.

The first is that the sequential testing paradigm (that is, prior -> posterior
-> prior) doesn't always work in reality because you often have multiple
experimenters operating simultaneously and independently with different
priors. In one sense this is a trivial problem but in another sense it is not.
E.g., if you are a meta-analyst faced with integrating such results, is prior
variation akin to publication bias? What implications does that have?

The second is that there are situations in which using a prior actually might
lead to unfair inequities. For example, let's say you're trying to make some
inference about an individual, and know that ethnicity provides information in
a statistical sense about the parameter you are making an inference about. Is
it prejudicial or not to use a prior? I think using a reference prior would
address this situation, but depending on the scenario you could make an
argument that it is unfair (e.g., if the informative prior would suggest a
positive outcome, not using it might be seen as prejudicial, but if the
informative prior would suggest a negative outcome, using it might be seen as
unfair). In this case, not using a prior at all actually might make sense--you
might make a similar argument about non-Bayesian inference as Bayesian
reference inference, but using non-prior-based inference does sidestep the
issue in a sense, in that there is no longer a prior to decide about. This
might be especially important in that, e.g., if you have a series of
individuals, the act of choosing a prior might be seen as prejudicial in
itself.

I generally consider myself as an "objective Bayesian" in the Jaynesian /
reference prior sense, but there are practical and theoretical scenarios where
I think people are likely to run into problems.

~~~
improbable22
Jeffreys / reference priors also have some weird behaviour in high dimensions.
You may enjoy this attempt to do better, without giving up reparameterization
invariance:

[https://arxiv.org/abs/1705.01166](https://arxiv.org/abs/1705.01166)

------
Koshkin
But it’s worth noting that the most important branch of physical science
today, quantum theory, is manifestly non-Bayesian.

~~~
kovrik
[https://en.wikipedia.org/wiki/Quantum_Bayesianism](https://en.wikipedia.org/wiki/Quantum_Bayesianism)

~~~
Koshkin
QBism is an _idealist_ interpretation of QM. (Which is to say, non-
scientific.)

------
platz
An interesting alternative to the frequentist and Bayesian approaches is that
of the 'likelihoodist':

[http://gandenberger.org/wp-content/uploads/2014/07/Statistic...](http://gandenberger.org/wp-content/uploads/2014/07/Statistical_Methods.png)

[http://gandenberger.org/2014/07/21/intro-to-statistical-
meth...](http://gandenberger.org/2014/07/21/intro-to-statistical-methods/)

