A Gentle Introduction to Bayes’ Theorem for Machine Learning (machinelearningmastery.com)
305 points by headalgorithm on Oct 3, 2019 | 35 comments

This is a far gentler introduction, and the rest of the blog is pretty good too: https://www.countbayesie.com/blog/2016/5/1/a-guide-to-bayesi...

Edit: this is a different take on the subject but an enjoyable and accessible read too: http://mbmlbook.com/toc.html

Thank you for the second link. Apart from great content, it is really well presented for web.

Thank you for this.

The second link looks very nice. Thanks.

I appreciate the lack of math notation. For many people with a weak mathematics background, notation feels like a huge wall in the way of getting into interesting and useful theories.

I agree. I'm working on a Google Translate idea for math. I think notation could be more readable!

Bayes theorem is very well suited to this. Frankly, it's one of those rare cases where those without much math might find it easier to read the original paper than many of the introductions...

When I finally got Bayes' theorem, I thought it said something obvious in unfamiliar terms.

What made it click for me was realizing that Bayesian networks are a mini-language and that Bayes' theorem is much more easily explained visually than with formulas. I think teachers should start by explaining the correspondence between them and probability terminology.

Here's how I would explain it.


In a Bayesian network, nodes are events and arrows are probabilities.

When you traverse a path made of successive arrows you multiply the probabilities of the arrows you encounter along the path.

When there is more than one path to get from A to B and you want to know the probability of getting from the former to the latter, you sum the probabilities obtained from the various paths.

When you say "probability of A" it's like saying: sum of the paths that get to A.

When you say "probability of A and B" it's like saying: sum of the paths that include both A and B.

When you say "conditional probability of B given A" it's like saying: starting from A, sum of the paths that lead to B.


Let's do a simple application. This is a tree that doctors should find familiar, and it's the one from which I understood it.

        /T+
      D+
     /  \T-
  root
     \  /T+
      D-
        \T-

Starting from the root, at the first bifurcation we have the probability of having the disease or not. At the second bifurcation we have the probability that a diagnostic test says either "positive" or "negative".

Usually doctors can estimate the values of the single arrows of this tree.

Let's say I asked you: what's the conditional probability of a positive test given that the patient has the disease? Given what we said, you just put your pencil on D+ and follow the path to T+: just 1 arrow, no need to multiply (it's called the "sensitivity" of the test).

What's the probability of a positive test when picking a person at random from the population? Since we don't start with a patient who is known to have the disease or not, we put our pencil on the root. There are 2 ways of getting to a T+: root-->D+-->T+ and root-->D- -->T+. As we said above, while following each path we multiply the arrows we encounter, and then we sum the results of the 2 paths.

And finally: what's the probability of our patient having the disease given that the test says "positive"? We said we have 2 ways to get a positive test, but in only one of them does our patient really have the disease, so we just divide the probability given by the only path that contains both D+ and T+ by the probability given by all paths that lead to T+. We are just saying that true positives are a fraction of all positives (seems obvious to me?). The numerator is the only "test is positive and it's true" path. The denominator is the sum of all "test is positive" paths.

Well, guess what we just did:

P(D+|T+) = ( P(T+|D+) P(D+) ) / P(T+)

(Additional intuition: another way to see it is that what we did corresponds to mapping the tree we started from to a flipped one in which the first bifurcation is T+/T- and the second one is D+/D-)
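
The tree walk above can be sketched in a few lines of Python. The three arrow values below are hypothetical numbers picked only for illustration; they are not from the comment or the article.

```python
# Hypothetical arrow values, chosen only for illustration.
prevalence = 0.01        # P(D+): arrow root --> D+
sensitivity = 0.90       # P(T+|D+): arrow D+ --> T+
false_positive = 0.05    # P(T+|D-): arrow D- --> T+

# Multiply the arrows along each path that ends in T+.
path_sick_positive = prevalence * sensitivity
path_healthy_positive = (1 - prevalence) * false_positive

# Sum the two paths to get P(T+).
p_positive = path_sick_positive + path_healthy_positive

# True positives as a fraction of all positives: P(D+|T+).
p_sick_given_positive = path_sick_positive / p_positive

print(f"P(T+)    = {p_positive:.4f}")
print(f"P(D+|T+) = {p_sick_given_positive:.4f}")
```

With these made-up values, the positive test leaves the patient far more likely healthy than sick, because the root-->D- -->T+ path carries most of the probability of landing on T+.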

I think a better way to describe it is with Venn diagrams:

Conditional probability is just: what proportion does A represent, given that B has already happened?

A might be small in the Venn diagram box, but take up a larger area when constrained to only the part that B is in.
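
A rough sketch of the Venn diagram picture, assuming a made-up box of 20 equally likely outcomes. The sets `A` and `B` are hypothetical and exist only to show the "share of the box" versus "share of B" comparison.

```python
# Hypothetical universe of 20 equally likely outcomes (the Venn diagram box).
universe = set(range(20))
A = {0, 1, 2}        # a small region of the box
B = {1, 2, 3, 4}     # the region we condition on

p_a = len(A) / len(universe)        # A as a share of the whole box
p_a_given_b = len(A & B) / len(B)   # A's share of the part that lies inside B

print(f"P(A)   = {p_a}")
print(f"P(A|B) = {p_a_given_b}")
```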

Bayes Theorem hardly requires any math notation at all. It would literally take you less than a minute to understand conditional probability.


Possibly true, but just looking at the Wikipedia page for Bayes Theorem, more than half the text on the page is math notation: https://en.wikipedia.org/wiki/Bayes%27_theorem

It doesn't matter how simple the math actually is, if someone is unfamiliar with mathematical notation it's going to be overwhelming to read.

Then learn the notation or find another source.

This is a bit like complaining about the existence of books because you never learned how to read.

I guess some people interested in machine learning don't know about multiplication and division, but I wouldn't want to depend on their models...

Oh yeah, and the first actually usable form of Bayes' theorem would be probabilistic graphical models with the max-sum algorithm. Good luck mastering that quickly, or at all!

That is far from the first usable form of Bayes. I have no idea what point you are making.

Bayes Theorem is easily derived algebraically using conditional probability and the chain rule. You can also derive it easily with a Venn diagram. There is barely any notation needed at all here to understand it.
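
For reference, the algebra being described is roughly the following, assuming "chain rule" here means the product rule of probability and that P(B) > 0. The event names A and B are generic placeholders.

```latex
P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)
\quad\Longrightarrow\quad
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
```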

If you're struggling with things at that level, it is more likely due to your own laziness, not because the math is hard. Because it is very easy to reason about.

Chain rule???

Always nice to see a little piece of Python code show up whenever one of these ML blogs gets posted to HN. It's becoming, or has become, the lingua franca of this field.

I'm very excited for Swift for TensorFlow, personally. The type system will make development so much more pleasant IMO. Judging by the way things have been lately though, I imagine that by the time Swift for TensorFlow is ready I'll be reaching for PyTorch at every opportunity anyway...

I think the fact that Swift has first-class automatic differentiation is an even bigger deal than the fact that it is strongly typed. Here is an interesting write-up about it...


I've used Bayes for classification before. Is it still the go-to for that?

Naive Bayes is a good baseline since it's both very fast and quite efficient, and it doesn't need a big training set. But it's not often the best model you can find, and it only works if your classes are linearly separable, i.e., it can't model an XOR.
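
The XOR claim can be checked with a small from-scratch sketch. This is a hand-rolled Bernoulli naive Bayes with Laplace smoothing, not any particular library's implementation, and the function names `fit` and `posterior` are made up for this example.

```python
# XOR truth table: the label is 1 exactly when the two binary features differ.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]

def fit(X, y, alpha=1.0):
    """Estimate class priors and per-feature P(x_j = 1 | class), Laplace-smoothed."""
    priors, likelihoods = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        priors[c] = len(rows) / len(X)
        likelihoods[c] = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(len(X[0]))
        ]
    return priors, likelihoods

def posterior(x, priors, likelihoods):
    """Return P(class | x) under the naive feature-independence assumption."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for xj, p in zip(x, likelihoods[c]):
            score *= p if xj == 1 else (1 - p)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

priors, likelihoods = fit(X, y)
for x in X:
    print(x, posterior(x, priors, likelihoods))
```

Each feature on its own is equally likely to be 0 or 1 in both classes, so the per-feature likelihoods carry no information and every input gets the same posterior for both classes.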

Shameless plug, but if anybody wants to look at Naive Bayes classifiers for continuous and categorical data, I wrote a library here:


I remember the first time I heard of Bayes' theorem; it took me a while to grasp it.

His alternative formulation of P(A|B) is really confusing me.

Is there a typo in there?

change my mind: bayes in practice is just a way to regularize your model and the language of bayes makes it seem principled but really you could use literally any regularizer and it would work almost just as well. i believe this because ultimately you're always going to minimize loglikelihood anyway (and so the prior becomes the regularization term).

Counterpoint: regularization is just a way of specifying a Bayesian prior for maximum a posteriori estimation.
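
To make the counterpoint concrete, here is a minimal sketch, assuming a one-parameter linear model with Gaussian noise and a zero-mean Gaussian prior on the weight. The data, `sigma2`, `tau2`, and the function names are all made up for illustration. Under those assumptions the MAP estimate should coincide with the ridge (L2-regularized) estimate when the penalty strength is `sigma2 / tau2`.

```python
# Hypothetical data for a one-parameter model y = w * x + noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 3.2, 3.8]

sigma2 = 0.5          # assumed noise variance
tau2 = 2.0            # assumed variance of the zero-mean Gaussian prior on w
lam = sigma2 / tau2   # the L2 penalty strength implied by that prior

def ridge_estimate(xs, ys, lam):
    """Closed-form minimizer of sum((y - w*x)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def neg_log_posterior(w):
    """Negative log posterior of w, up to an additive constant."""
    neg_log_lik = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma2)
    neg_log_prior = w ** 2 / (2 * tau2)
    return neg_log_lik + neg_log_prior

def map_estimate(lo=-10.0, hi=10.0, iters=200):
    """Ternary search for the minimum of the convex negative log posterior."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if neg_log_posterior(m1) < neg_log_posterior(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

w_ridge = ridge_estimate(xs, ys, lam)
w_map = map_estimate()
print(f"ridge: {w_ridge:.6f}  MAP: {w_map:.6f}")
```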

but what value is that perspective? how do i use this to actually fit a model?

Well, you could draw samples of the parameters in a Bayesian setup via MCMC, or get a distribution over them via a variational approximation, rather than getting a single point estimate (maximum likelihood, MAP, whatever) for the parameters of the model by solving an optimization problem. This seems much more general (and practically useful). So I think it is the other way around: regularizers are just priors (that you arrived at somehow).

A bottling company is interested in determining the accuracy with which their equipment is filling bottles of water. One answer would be "95% of the bottles contain between 11.9 and 12.1 ounces". A different way of answering the question would be to estimate the actual distribution of water amounts.

The difference here is that knowing a distribution is often more useful than just knowing the mean, or the variance, or some confidence intervals. Bayesian methods tend to be useful when you want this sort of information, which is often the case when you are using it for decision making (or something like game theory).

Another use case is when you are making decisions that require multiple pieces of information that don't neatly fit together. A simple example is cancer screening. A rational decision about the proper threshold requires you to combine information about (1) the accuracy of your test and (2) the prevalence of the cancer in the population.

I will also add that the formula presented in the article is the simple case with discrete distributions. The more general version of the formula can also handle continuous distributions.

lol is this copypasta? i'm quite familiar with all of these toy examples of inference instead of point estimation. i'm talking about fitting models rather than descriptive statistics (or decision theory).

Commonly a model is being used primarily to make better decisions. Specifically in the context of fitting models, Bayesian methods are really popular for hyperparameter tuning.

I guess my main point is that at least one reason people are using Bayesian methods is because they are dealing with problems that are qualitatively different than more prototypical prediction problems.

You cannot use any prior, let alone literally any regularizer, and say it would work almost just as well.

A unit-variance normal prior centered at 0 and one centered at 42 can give very different results.
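
A quick sketch of that effect, assuming a Gaussian likelihood with known noise variance so that the conjugate normal update applies. The observations and the helper name `posterior_mean` are hypothetical.

```python
def posterior_mean(data, prior_mean, prior_var=1.0, noise_var=1.0):
    """Posterior mean of a Gaussian mean with known noise variance (conjugate update)."""
    n = len(data)
    precision = 1.0 / prior_var + n / noise_var
    return (prior_mean / prior_var + sum(data) / noise_var) / precision

data = [9.5, 10.2, 10.1, 9.9, 10.3]   # hypothetical observations

mean_prior_0 = posterior_mean(data, prior_mean=0.0)
mean_prior_42 = posterior_mean(data, prior_mean=42.0)
print(f"prior centered at 0:  {mean_prior_0:.3f}")
print(f"prior centered at 42: {mean_prior_42:.3f}")
```

With only five observations and a unit-variance prior, the two posterior means land well apart; the gap shrinks as more data comes in or as the prior variance grows.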

i said almost - that's code for "obviously i'm not talking about pathological regularizers"

Well, in that case minimizing the (negative) loglikelihood seems principled but you could minimize literally any loss function and it would work almost just as well.

Lol agreed!

