Edit: this is a different take on the subject but an enjoyable and accessible read too:
What made it click for me was realizing that Bayesian networks are a mini-language and that Bayes' theorem is much more easily explained visually than with formulas. I think teachers should start by spelling out the correspondence between the diagrams and probability terminology.
Here's how I would explain it.
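In a Bayesian network, nodes are events and arrows are probabilities (each arrow is the probability of reaching the child event given that you are already at the parent).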
When you traverse a path made of successive arrows you multiply the probabilities of the arrows you encounter along the path.
When there is more than one path to get from A to B and you want to know the probability of getting from the former to the latter, you sum the probabilities obtained from the various paths.
When you say "probability of A" it's like saying: sum of the paths that get to A.
When you say "probability of A and B" it's like saying: sum of the paths that include both A and B.
When you say "conditional probability of B given A" it's like saying: starting from A, sum of the paths that lead to B.
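If it helps, here is a minimal sketch of those rules in Python. The tree, the event names and every number in it are made up for illustration; the only point is "multiply along a path, sum across paths".

```python
# Hypothetical tree; every number here is made up for illustration.
# Each entry maps an event to (probability on the arrow, subtree below it).
TREE = {
    "A": (0.3, {"B": (0.8, {}), "not B": (0.2, {})}),
    "not A": (0.7, {"B": (0.1, {}), "not B": (0.9, {})}),
}

def paths(subtree, events=(), prob=1.0):
    """Yield (events on the path, product of its arrows) for each root-to-leaf path."""
    if not subtree:
        yield events, prob
        return
    for event, (p, children) in subtree.items():
        yield from paths(children, events + (event,), prob * p)

def probability(tree, *events):
    """Sum the products of all paths that pass through every given event."""
    return sum(p for path, p in paths(tree) if all(e in path for e in events))

p_a = probability(TREE, "A")             # all paths that get to A
p_b = probability(TREE, "B")             # two paths: via A and via not A
p_a_and_b = probability(TREE, "A", "B")  # the one path containing both

# "Starting from A": put the pencil on A and only walk the subtree below it.
p_b_given_a = probability(TREE["A"][1], "B")
```

Starting from A gives the same number as dividing `p_a_and_b` by `p_a`, which is the usual definition of conditional probability.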
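Let's do a simple application. This is a tree that doctors should find familiar, and it is the one that made me understand it: the root splits into D+ and D- (disease present or absent), and each of those splits into T+ and T- (test positive or negative).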
Usually doctors can estimate the values of the single arrows of this tree.
Let's say I asked you: what's the conditional probability of having a positive test given the patient has the disease? Given what we said, you just put your pencil on D+ and follow the path to T+: just 1 arrow, no need to multiply (it's called the "sensitivity" of the test).
What's the probability of a positive test for a person drawn at random from the population? Since we don't start with a patient who is known to have or not have the disease, we put our pencil on the root. There are 2 ways of getting to T+: root-->D+-->T+ and root-->D- -->T+. As we said above, while following each of the paths we multiply the arrows we encounter, and then we sum the results of the 2 paths.
And finally: what's the probability of our patient having the disease given that the test says "positive"?
We said we have 2 ways to get a positive test, but in only one of these ways does our patient really have the disease, so we just divide the probability given by the only path that contains both D+ and T+ by the probability given by all paths that lead to T+. We are just saying that true positives are a fraction of all positives (seems obvious to me?). The numerator is the only "test is positive and it's true" path. The denominator is the sum of all "test is positive" paths.
Well, guess what we just did:
P(D+|T+) = ( P(T+|D+) P(D+) ) / P(T+)
(Additional intuition: another way to see it is that what we did corresponds to mapping the tree we started from to a flipped one in which the first bifurcation is T+/T- and the second one is D+/D-)
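To put numbers on it, here is a small sketch with hypothetical values for the three arrows (1% prevalence, 90% sensitivity, 5% false positive rate). None of these numbers come from a real test; they are only there to make the pencil-on-the-tree steps concrete.

```python
# Hypothetical arrow values, chosen only for illustration.
prevalence = 0.01       # root -> D+
sensitivity = 0.90      # D+ -> T+
false_pos_rate = 0.05   # D- -> T+ (that is, 1 - specificity)

# Multiply along each path that ends in T+.
true_pos_path = prevalence * sensitivity             # root -> D+ -> T+
false_pos_path = (1 - prevalence) * false_pos_rate   # root -> D- -> T+

# Sum the paths to get P(T+), then take the true positives as a fraction of it.
p_t_pos = true_pos_path + false_pos_path
p_d_given_t_pos = true_pos_path / p_t_pos

# Arrows of the flipped tree (first split T+/T-, second split D+/D-).
p_t_neg = 1 - p_t_pos
p_d_given_t_neg = prevalence * (1 - sensitivity) / p_t_neg

print(f"P(T+)    = {p_t_pos:.4f}")
print(f"P(D+|T+) = {p_d_given_t_pos:.4f}")
print(f"P(D+|T-) = {p_d_given_t_neg:.4f}")
```

With these made-up inputs, only about 15% of positives are true positives, even though the test catches 90% of the sick. Walking the flipped tree root-->T+-->D+ multiplies back to the same value as root-->D+-->T+ in the original tree.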
Conditional probability is just: what proportion does A represent, given that B has already happened?
A might be small in the Venn diagram box, but take up a larger area when constrained to only the part that B is in.
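A toy version of that picture, as a sketch: the "box" is 100 equally likely outcomes, and the two regions are arbitrary sets I picked so that A sits inside B.

```python
from fractions import Fraction

# Hypothetical Venn diagram: 100 equally likely outcomes in the box.
box = set(range(1, 101))
A = {n for n in box if n <= 10}   # a small region of the box
B = {n for n in box if n <= 20}   # a region that happens to contain A

# Unconditional: A as a share of the whole box.
p_a = Fraction(len(A), len(box))

# Conditional: A as a share of only the part that B is in.
p_a_given_b = Fraction(len(A & B), len(B))

print(p_a, p_a_given_b)
```

A is a tenth of the box but half of B, which is all "given B" means here.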
It doesn't matter how simple the math actually is, if someone is unfamiliar with mathematical notation it's going to be overwhelming to read.
This is a bit like complaining about the existence of books because you never learned how to read.
Bayes Theorem is easily derived algebraically using conditional probability and the chain rule. You can also derive it easily with a Venn diagram. There is barely any notation needed at all here to understand it.
If you're struggling with things at that level, it is more likely due to your own laziness, not because the math is hard. Because it is very easy to reason about.
The difference here is that knowing a distribution is often more useful than just knowing the mean, or the variance, or some confidence intervals. Bayesian methods tend to be useful when you want this sort of information, which is often the case when you are using it for decision making (or something like game theory).
Another use case is when you are making decisions requiring multiple pieces of information that don't neatly fit together. A simple example is cancer screening. A rational decision about the proper threshold requires you to combine information about (1) the accuracy of your test and (2) the prevalence of the cancer in the population.
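As a rough sketch of why both pieces matter in the screening example: the helper below (`prob_cancer_given_positive` is my own name, and the sensitivity, specificity and prevalence values are invented) shows the same test giving very different answers in populations with different prevalence.

```python
def prob_cancer_given_positive(sensitivity, specificity, prevalence):
    """P(cancer | positive test) from test accuracy and population prevalence."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Hypothetical test: 90% sensitivity, 95% specificity.
for prevalence in (0.001, 0.01, 0.1):
    p = prob_cancer_given_positive(0.90, 0.95, prevalence)
    print(f"prevalence {prevalence:>5}: P(cancer | positive) = {p:.3f}")
```

Neither number alone tells you what a positive result means; where to set a threshold depends on how they combine.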
I will also add that the formula presented in the article is the simple case with discrete distributions. The more general version of the formula can also handle continuous distributions.
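For reference, the continuous version is usually written with densities, and the sum in the denominator becomes an integral. The symbols here are my own choice (θ for the unknown quantity, x for the observed data), not the article's:

```latex
p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta')\, p(\theta')\, d\theta'}
```

It has the same shape as the discrete formula: likelihood times prior on top, and the total probability of the data over every possible θ on the bottom.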
I guess my main point is that at least one reason people are using Bayesian methods is because they are dealing with problems that are qualitatively different than more prototypical prediction problems.
A standard normal prior centered at 0 and one centered at 42 can give very different results.
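A quick sketch of how much the center of the prior can matter, using the textbook conjugate update for a normal mean with known noise variance. The data, the unit variances and the function name `normal_posterior` are all assumptions of mine for illustration.

```python
def normal_posterior(prior_mean, prior_var, data, noise_var):
    """Posterior (mean, variance) for a normal mean with known noise variance."""
    n = len(data)
    precision = 1 / prior_var + n / noise_var
    mean = (prior_mean / prior_var + sum(data) / noise_var) / precision
    return mean, 1 / precision

# Hypothetical observations with sample mean 10.
data = [10.0, 11.0, 9.0, 10.0]

# Same data, same unit prior variance, only the prior's center differs.
mean_from_0, var_from_0 = normal_posterior(0.0, 1.0, data, 1.0)
mean_from_42, var_from_42 = normal_posterior(42.0, 1.0, data, 1.0)

print(mean_from_0, mean_from_42)
```

With only four observations, the two posterior means land far apart (roughly 8 versus 16.4 here) while the posterior variances are identical. More data would pull both toward the sample mean.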
Is there a typo in there?