
An Intuitive Explanation of Bayes' Theorem - pizza
http://yudkowsky.net/rational/bayes?repost3yearslater
======
b0sk
The problem with many explanations is that the people who write them have
forgotten what it is like to not know Bayes' theorem. Here is my humble
attempt at a starting point for people who want to grok it:

Imagine you have been locked inside a room. You are asked what the weather
outside is. Assume you have to pick one of sunny, cloudy, and rainy. You know
it is July, so you predict it is most likely sunny. (Because intuitively you
know that the odds of "sunny" are higher than the other two.) [1]

Now you see that someone has entered the room carrying an umbrella. You are
asked that question again. Of course you are going to update your answer based
on the new information. Your previous odds calculation, while good, has to
accommodate the umbrella factor -- a very solid piece of additional
information. [2]

Before the umbrella appeared, all you knew was [1]; this is called the prior.
After you see the umbrella, the odds are conditioned on the umbrella; this is
called the posterior [2], and it is what you want. Your intuition about the
weather [1] still matters: it goes on the right-hand side of the equation.

So

Posterior = ( ) x Prior

This blank is called the likelihood: it is the probability of the umbrella
given the weather.

Imagine you live in a place where an umbrella is used only during rain. Then
the likelihood of an umbrella given that it is sunny is very small, which
makes the product of prior x likelihood very small.

Whereas the likelihood of an umbrella given that it is rainy is almost
certain. Even if your initial guess (the prior) gives lower odds to rain, the
product of likelihood x prior for rain is larger than the product in the
previous paragraph. You answer "rainy".

Of course there is more to it and I suggest reading other great explanations
in this thread.

------
InclinedPlane
Still a pretty long-winded explanation, though a good one.

I like to think of Bayes' Theorem as akin to scaling "universes". For example,
consider trying to determine the probability of someone over 6 feet tall being
a woman. First you take the probability of a woman being over 6 feet tall (say
1%) and then you multiply that by the probability of being a woman (say, 50%).
Now you have the probability of being a woman AND being over six feet tall
scaled to the overall size of the total "universe" (being a human being).
Then, if you divide this by the probability of being over 6 feet tall (say,
4%) you'll get the probability of being a woman AND being over six feet tall
scaled to the universe where everyone is over six feet tall (12.5%).
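Plugging the comment's illustrative figures into the theorem (all three
percentages are the commenter's assumptions, not real statistics):

```python
p_tall_given_woman = 0.01  # P(over 6 ft | woman), assumed
p_woman = 0.50             # P(woman), assumed
p_tall = 0.04              # P(over 6 ft), assumed

# Scale the joint event "woman AND tall" to the universe where everyone is tall.
p_woman_given_tall = p_tall_given_woman * p_woman / p_tall
print(round(p_woman_given_tall, 3))  # 0.125
```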

~~~
otoburb
Agreed. I didn't really get Bayes' Theorem intuitively until I saw the
pictures in this short blog post which emphasizes the idea you pointed out of
'scaling "universes"':

<http://oscarbonilla.com/2009/05/visualizing-bayes-theorem/>

~~~
drostie
I actually got a really good question when I was using set diagrams to explain
ideas from logic.

So, I was explaining some logic ideas to a guy on IRC, and he was struggling
with the contrapositive rule, which says "if A implies B, then not-B implies
not-A." He asked me what this looked like as a set diagram.

I began to teach him how "A implies B" means that whenever you have A, you
know you have B. The picture is that B can be bigger, but the situations with
A are contained within it, so the picture looks like this: "A is a subset of
B." (In symbols, A → B becomes A ⊆ B.)

Now you need to understand that "not A" is the entire region outside the
circle of A, and "not B" is the entire region outside the circle of B. And
then you have to understand a crazy perspective: that the entire space outside
B is now a subset of the entire space outside A.

He was very confused about this, so I explained it this way: There is a story
about a physicist, an engineer, a mathematician, and a farmer. The farmer asks
all three for a fence containing the largest space for his sheep.

The engineer is first up, builds a square fence using one of the walls of the
barn to get a little extra space fenced in. The farmer seems pretty satisfied
with this, so they both go to meet the physicist, who has tethered a cord to a
peg and is drawing a large circle. "Circles," he points out, "minimize their
perimeter-to-area ratio. Actually, I could probably do the same with your barn
there, get you a little more space by chopping a chord out of the circle." The
farmer says "no, this looks like it will take too long."

They both come over to see the mathematician, who has apparently gotten
tangled up in the fence! They start to work to get him out of there -- the
farmer asks, "what were you thinking, why did you bend the fence this way,
what is wrong with you?!" The mathematician says, "you don't understand --
this is the _outside_ of this fence!"

Suddenly, the idea of flipping "inside" and "outside" seemed to make sense,
and I was able to show him that yes, if you take this perspective, the
corresponding rule is Bᶜ ⊆ Aᶜ, thus not-B → not-A.
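The inside/outside flip can be checked mechanically with sets (a toy universe;
the particular elements are arbitrary):

```python
universe = set(range(10))
A = {1, 2}            # situations where A holds
B = {1, 2, 3, 4}      # "A implies B" drawn as A ⊆ B

not_A = universe - A  # everything outside the A circle
not_B = universe - B  # everything outside the B circle

# The contrapositive: the outside of B sits inside the outside of A.
assert A <= B and not_B <= not_A
print("contrapositive holds")
```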

(Another strategy which often works is to leverage moral intuitions, but
'permissions' and 'implications' are opposite arrows. So A → B means "if I
know A, then I also know B." Apply this to "Santa only gives presents to good
children". Your intuitions all will work much better, save one: you probably
would write the above statement as "good child => get presents", following
what is permissible for good children, but in logic it actually states "got
present → good child", if Santa gave a present then you know that the child
was good, but Santa might not give a good child a present, especially if that
child is, say, Jewish.)

------
dmvaldman
more like a verbose explanation of Bayes' Theorem...

I'll give my succinct spin: Throw a coin in the air 10 times. Say 6 times the
coin lands heads up, and 4 times tails up.

A Bayesian would say: the most likely explanation is that the coin is biased
in favor of heads with Prob(heads) = .6 and Prob(tails) = .4. Because given
such a model, the chance of this outcome is 25% (10 choose 6 * .6^6 * .4^4).
Less likely is the explanation that the coin is fair.

A Frequentist would begin by assuming a model, likely that the coin is fair.
She would then argue that the probability of getting 6 heads in 10 tosses is
about 21% (10 choose 6 * 1/2^10), and she would have tools (number of standard
deviations from the mean, etc.) to test how much of an outlier this observation was, in
order to validate her model.

In summary, the Bayesian considers many models at once, and asks how likely
are they given the observations. A Frequentist starts with one model, and asks
how likely is the observation.

The Frequentist uses the scientific method: develop a hypothesis, then test
it. The Bayesian does something else, which I think is better, but ill-
defined. She somehow tests all hypotheses at once, and puts different
confidences in each.
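The two numbers in the example can be checked directly (a small sketch;
`math.comb` needs Python 3.8+):

```python
from math import comb

heads, n = 6, 10

def likelihood(p):
    """Probability of seeing `heads` heads in `n` flips with bias p."""
    return comb(n, heads) * p**heads * (1 - p)**(n - heads)

print(round(likelihood(0.6), 3))  # biased model: ~0.251, the "25%" above
print(round(likelihood(0.5), 3))  # fair coin:    ~0.205, the "21%" above
```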

~~~
Strilanc
The example is not quite right. The Bayesian starts with a prior, some
weighting of likelihoods. In the case of a coin a reasonable prior would place
a lot of initial probability on the coin being fair or almost fair. You'd need
to flip a biased coin more times in order to overcome the prior and make
'biased' the dominant hypothesis.
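A minimal sketch of that point, using a made-up discrete prior that favours a
fair coin:

```python
from math import comb

heads, n = 6, 10
prior = {0.4: 0.1, 0.5: 0.8, 0.6: 0.1}   # P(heads) -> prior weight (hypothetical)

def likelihood(p):
    return comb(n, heads) * p**heads * (1 - p)**(n - heads)

unnorm = {p: w * likelihood(p) for p, w in prior.items()}
total = sum(unnorm.values())
posterior = {p: u / total for p, u in unnorm.items()}

# 6 heads in 10 is not enough evidence to overcome the prior: 'fair' still wins.
print(max(posterior, key=posterior.get))
```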

~~~
dmvaldman
That is correct. I tried to allude to that in the end when I said how a
Bayesian mysteriously gives a certain confidence to the space of models. In
the case of a uniform prior I believe my example is correct.

When Bayesian probability is explained, it often comes across as "duh, that's
what anyone would do", just made rigorous with equations. I tried to give an
example contrasting the Frequentist and Bayesian approaches.

------
olalonde
Previous discussion on HN: <http://news.ycombinator.com/item?id=376631>

Some other interesting explanations:

<http://yudkowsky.net/rational/technical>

<http://wiki.lesswrong.com/wiki/Bayes%27_theorem>

<http://lesswrong.com/lw/2b0/bayes_theorem_illustrated_my_way>

<http://commonsenseatheism.com/?p=13156>

------
shriphani
The best explanation I found of the theorem was from the book by Bertsekas and
Tsitsiklis:

-> You have a resulting event B in front of you. Any of n mutually disjoint
causes {A1,...,An} could have led to B. You know how likely each Ai is to
produce B (this is P(B|Ai)). Bayes' theorem allows you to use this information
to deduce which Ai is most likely to have been the cause, given that B
occurred (this is P(Ai|B)).

i.e., you use Bayes' theorem to reverse the conditional-probability
relationships given in the problem. The exact expression is then easily
derivable from this idea plus the total probability theorem.
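A sketch of that recipe with two made-up disjoint causes:

```python
# Hypothetical disjoint causes A1, A2 and their chances of producing B.
p_cause = {"A1": 0.3, "A2": 0.7}    # P(Ai)
p_B_given = {"A1": 0.8, "A2": 0.1}  # P(B | Ai)

# Total probability theorem: P(B) = sum_i P(B | Ai) * P(Ai)
p_B = sum(p_B_given[a] * p_cause[a] for a in p_cause)

# Bayes' theorem reverses the conditioning: P(Ai | B) = P(B | Ai) P(Ai) / P(B)
p_cause_given_B = {a: p_B_given[a] * p_cause[a] / p_B for a in p_cause}

print(max(p_cause_given_B, key=p_cause_given_B.get))  # the most likely cause
```

With these numbers the a-priori-rarer cause A1 wins, because it produces B so
much more reliably.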

------
bluekeybox
None of these explanations are intuitive. Here is a visual explanation in two
short paragraphs. Imagine a Venn diagram of two partly overlapping circles.
The probability space is your sheet of paper on which the diagram is drawn.
The area of the yellow circle with respect to the entire sheet is the
probability of event Y. Same measure of the cyan circle is the probability of
event C. The overlapping area is green (yellow and cyan pigments, when mixed,
result in green), and corresponds to both events occurring.

To simplify calculations, we don't scale circle areas by the area of the
entire sheet (which incidentally turns our probabilities into frequencies).
This way, P(C) and P(Y) are simply areas of the corresponding circles. Now,
P(Y|C) is the probability of event Y occurring, conditional on event C
occurring also. This is visualized simply by taking the green area (because
both events have to happen) and dividing it by the area of the entire circle C
(because we are interested in conditional probability on C); the resulting
ratio is P(Y|C). Now it should be clear as day that P(Y|C) * P(C) = green area
= P(C|Y) * P(Y), which is another way to write Bayes' theorem.
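The picture can be turned into counting over a finite "sheet" (the point sets
below are arbitrary stand-ins for the two circles):

```python
from fractions import Fraction

sheet = range(100)      # the sheet of paper: 100 equally likely points
C = set(range(0, 40))   # cyan circle
Y = set(range(30, 60))  # yellow circle, overlapping C

P = lambda s: Fraction(len(s), len(sheet))
green = P(C & Y)        # the green overlap: both events occur

p_Y_given_C = green / P(C)               # green area over circle C
p_C_given_Y = p_Y_given_C * P(C) / P(Y)  # Bayes' theorem
assert p_C_given_Y == green / P(Y)       # matches the direct count
print(p_C_given_Y)
```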

~~~
dmvaldman
This explains the equation, but it doesn't explain the "philosophy." Being
Bayesian is in direct conflict with the "scientific method." The scientific
method says: make a theory, then test it against observation. To a Bayesian,
first comes the observation, and they pick the theory that best explains it.

If T = theory, and O = observation

P(T|O) = P(O|T) * P(T) / P(O)

tells you: given my observation what is the probability of my theory, P(T|O).
And the equation gives you how this quantity depends on how your theory
predicts the observation, P(O|T).

~~~
david_ar
Not true. Maximum likelihood picks the hypothesis/theory that best "explains"
the data, hence you run into issues with overfitting, etc. A Bayesian starts
with a prior over a set of hypotheses (the P(T) term), and then uses the data
to update their confidence in the various hypotheses. Assuming you have a sane
prior, you end up with a simple theory with a record of making decent
predictions (i.e. the kind of thing you look for in science, cf. Occam's
razor, etc).

I don't see any conflict with the scientific method (at least in a fundamental
sense). Scientists aren't oracles - hypotheses need to come from somewhere.
Some call it intuition - I would call it an implicit application of Bayesian
reasoning, where the data is experience, and the prior is governed by genetic
constraints of the brain. From this you obtain a set of intuitive hypotheses
to be (further) tested (i.e. those with large posterior P(T|O)).

Testing a set of hypotheses then just involves collecting observations that
differentiate them (i.e. where they make conflicting predictions). This can
still be considered an application of Bayes' rule, but usually one tries to
collect enough data that it's quite obvious which is consistently making the
best predictions (in other words has posterior close to 1), in which case it
becomes a theory.

~~~
Dn_Ab
His equation is that for Universal Inductive Inference. It is correct: fully
Bayesian, but incomputable.

~~~
david_ar
I never disagreed with the equation (of course it's correct). My point is that
the prior always comes first, even in UII. You're not simply picking the
hypothesis that best explains the data (assuming by best explains you mean has
the greatest likelihood P(O|T)), otherwise you just end up with the hypothesis
containing a lookup table of all previous data. You need to take into account
your confidence in the hypothesis before the data arrived (e.g. based on the
complexity/size of programs expressing that hypothesis for UII).

~~~
Dn_Ab
Ah. Your use of "not true" made it look like an outright dismissal of his
whole statement. As for the order of when to pick the prior, I think what is
more important is that the data not influence your choice of prior. If you
were some oracular machine, you could see the data, generate hypotheses and
priors for them independently of the data, and still not fall for the problem
you state.

And then there is the problem of how you form sensible hypotheses without at
least knowing the shape of the data first. The form of these hypotheses is
itself a restriction on the possible space. I think that is what the GGP was
getting at.

------
niels_olson
At the risk of being self-congratulatory, I submit an independently arrived-at
first conception of Bayesian thinking.

I arrived at a Bayesian solution to my own problem one night, independent of
ever learning Bayes. My undergrad was in physics, but I don't recall covering
Bayes before that night. The first time I heard about Bayes, that I recall,
was in pg's bio, which had to be after Reddit was founded.

Anyway, much like tonight, I awoke in the middle of the night realizing I
could calculate my odds of getting into medical school. The illustration is
the key:

<http://nielsolson.us/MedSchool/#Odds>

This author's barrel of eggs is way too hard for me to understand. A square
with a side of 1 is much easier.

However, his method is how I calculated the answer to his problem: just think
about 10,000 women (a cheap trick to avoid fractions until the very end).
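The 10,000-women trick, sketched with what I believe are the article's
mammography figures (1% prevalence, 80% sensitivity, 9.6% false-positive rate;
treat these as assumptions):

```python
women = 10_000
with_cancer = women * 0.01            # 100 women have cancer
true_pos = with_cancer * 0.80         # 80 of them test positive
without_cancer = women - with_cancer  # 9,900 women are healthy
false_pos = without_cancer * 0.096    # 950.4 of them still test positive

# Of everyone who tests positive, what fraction actually has cancer?
p = true_pos / (true_pos + false_pos)
print(round(p, 3))  # ~0.078
```

The fractions only show up in the final division, which is the whole point of
the trick.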

------
shokwave
Explanations are great, but I've found I got a lot of use out of an intuitive
_method_ for applying Bayes' Theorem.

<http://news.ycombinator.com/item?id=4305144>

------
drucken
Bayes' Theorem trivially flows from,

Mnemonic: P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)

Visualise the above concept once as a Venn diagram (i.e. the frequentist
interpretation) and you'll never forget it.
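Spelled out, dividing the mnemonic through by P(B) gives the theorem in its
usual form:

```
P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A)
⇒  P(A|B) = P(B|A) P(A) / P(B)      (divide both sides by P(B))
```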

------
sprobertson
Excellent, I was looking for this exact article last night. Now where's the
intuitive explanation of synchronicity?

------
dbbolton
Google cache:
[http://webcache.googleusercontent.com/search?q=cache:a5o1K7r...](http://webcache.googleusercontent.com/search?q=cache:a5o1K7rF9kEJ:yudkowsky.net/rational/bayes/+&cd=1&hl=en&ct=clnk&gl=us)

(at least I think that's the right page)

------
santigepigon
Has anything been added to the article? Or is this simply a repost--albeit, a
welcomed one--as the URL's query string suggests? [1]

[1] <http://yudkowsky.net/rational/bayes?repost3yearslater>

------
juanre
Can't resist posting my own visual explanation:

<http://juanreyero.com/article/math/bayes.html>

------
lazugod
Are you reposting this because the website is down?

------
malkia
Can't open the page...

~~~
pizza
Cached:
[http://webcache.googleusercontent.com/search?q=cache:a5o1K7r...](http://webcache.googleusercontent.com/search?q=cache:a5o1K7rF9kEJ:yudkowsky.net/rational/bayes/+&cd=1&hl=en&ct=clnk&gl=us)

