I appreciate the lack of math notation. For many people with a weak mathematics background, notation feels like a huge wall standing in the way of interesting and useful theories.
Bayes theorem is very well suited to this. Frankly, it's one of those rare cases where those without much math might find it easier to read the original paper than many of the introductions...
When I finally got Bayes' theorem, I thought it said something obvious in unfamiliar terms.
What made it click for me was realizing that Bayesian networks are a mini-language, and that Bayes' theorem is much more easily explained visually than with formulas. I think teachers should start by explaining the correspondence between the diagrams and probability terminology.
Here's how I would explain it.
----
In a Bayesian network, nodes are events and arrows are probabilities.
When you traverse a path made of successive arrows you multiply the probabilities of the arrows you encounter along the path.
When there is more than one path to get from A to B and you want to know the probability of getting from the former to the latter, you sum the probabilities obtained from the various paths.
When you say "probability of A" it's like saying: sum of the paths that get to A.
When you say "probability of A and B" it's like saying: sum of the paths that include both A and B.
When you say "conditional probability of B given A" it's like saying: starting from A, sum of the paths that lead to B.
----
Let's do a simple application. This is a tree that doctors should find familiar, and it's the one that made me understand the theorem.
       /T+
    D+/
    /\
   /  \T-
  /
  \
   \  /T+
    \/
   D-\
      \T-
Starting from root, at the first bifurcation we have: probability of having a disease or not. At the second bifurcation we have: probability that a diagnostic test tells either "positive" or "negative".
Usually doctors can estimate the values of the single arrows of this tree.
Let's say I asked you: what's the conditional probability of having a positive test given the patient has the disease? Given what we said, you just put your pencil on D+ and follow the path to T+: just 1 arrow, no need to multiply (it's called the "sensitivity" of the test).
What's the probability of getting a positive test when picking a person at random from the population? Since we don't start with a patient who is known to have or not have the disease, we put our pencil on root. There are 2 ways of getting to a T+: root-->D+-->T+ and root-->D- -->T+. As we said above, while following each of the paths we multiply the arrows we encounter, and then we sum the results of the 2 paths.
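To make the "multiply along a path, sum across paths" rule concrete, here is a minimal sketch in Python. The arrow values (1% prevalence, 90% sensitivity, 5% false-positive rate) are hypothetical numbers picked only for illustration, and the tree encoding and the `prob_reaching` helper are my own, not something from a library.

```python
# Hypothetical arrow values, chosen only for illustration.
# Each arrow is stored as (child_label, probability_of_arrow, subtree).
TREE = [
    ("D+", 0.01, [("T+", 0.90, []), ("T-", 0.10, [])]),
    ("D-", 0.99, [("T+", 0.05, []), ("T-", 0.95, [])]),
]

def prob_reaching(branches, target, so_far=1.0):
    """Sum, over every path that reaches `target`, the product of its arrows."""
    total = 0.0
    for label, p, children in branches:
        along_path = so_far * p              # multiply while walking down a path
        if label == target:
            total += along_path              # sum across the different paths
        else:
            total += prob_reaching(children, target, along_path)
    return total

p_positive = prob_reaching(TREE, "T+")
print(round(p_positive, 4))  # 0.01*0.90 + 0.99*0.05 = 0.0585
```

With these made-up numbers, most positive tests come from the root-->D- -->T+ path, simply because far more people start down the D- branch.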
And finally: what's the probability of our patient having the disease given that the test says "positive"?
We said we have 2 ways to get a positive test, but in only one of them does our patient really have the disease. So we just divide the probability of the only path that contains both D+ and T+ by the probability of all paths that lead to T+. We are just saying that true positives are a fraction of all positives (seems obvious to me?). The numerator is the only "test is positive and it's true" path. The denominator is the sum of all "test is positive" paths.
Well, guess what we just did:
P(D+|T+) = ( P(T+|D+) P(D+) ) / P(T+)
(Additional intuition: another way to see it is that what we did corresponds to mapping the tree we started from to a flipped one in which the first bifurcation is T+/T- and the second one is D+/D-)
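Here is a small numeric sketch of that last step in Python. The three input values are hypothetical (they are not from any real test), and the variable names are mine. It divides the one "true positive" path by the sum of all paths to T+, then checks the flipped-tree reading: walking root-->T+-->D+ has to give the same joint probability as root-->D+-->T+.

```python
# Hypothetical arrow values, for illustration only.
p_d = 0.01                  # P(D+): prevalence
p_pos_given_d = 0.90        # P(T+|D+): sensitivity
p_pos_given_not_d = 0.05    # P(T+|D-): false-positive rate

true_positive_path = p_d * p_pos_given_d               # root --> D+ --> T+
false_positive_path = (1 - p_d) * p_pos_given_not_d    # root --> D- --> T+
p_pos = true_positive_path + false_positive_path       # all paths to T+

# Bayes' theorem: true positives as a fraction of all positives.
p_d_given_pos = true_positive_path / p_pos             # P(D+|T+)

# Flipped tree: first split T+/T-, second split D+/D-.
# The path root --> T+ --> D+ must carry the same joint probability.
assert abs(p_pos * p_d_given_pos - true_positive_path) < 1e-12

print(round(p_d_given_pos, 3))  # 0.009 / 0.0585, about 0.154
```

Even with a fairly accurate test, the posterior stays low in this made-up setting because the disease is rare.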
Oh yeah, and the first actually usable form of Bayes' theorem would be probabilistic graphical models with the max-sum algorithm. Good luck mastering that quickly or at all!
That is far from the first usable form of Bayes. I have no idea what point you are making.
Bayes Theorem is easily derived algebraically using conditional probability and the chain rule. You can also derive it easily with a Venn diagram. There is barely any notation needed at all here to understand it.
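For reference, the algebraic route looks roughly like this. It is a standard sketch, assuming two events A and B with P(A) > 0 and P(B) > 0: write the joint probability both ways using the definition of conditional probability, then divide by P(B).

```latex
\begin{align*}
P(A \cap B) &= P(A \mid B)\, P(B) && \text{(definition of conditional probability)} \\
P(A \cap B) &= P(B \mid A)\, P(A) && \text{(same definition, roles swapped)} \\
\Rightarrow\quad P(A \mid B) &= \frac{P(B \mid A)\, P(A)}{P(B)} && \text{(equate and divide by } P(B) > 0\text{)}
\end{align*}
```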
If you're struggling with things at that level, it is more likely due to your own laziness, not because the math is hard. Because it is very easy to reason about.
Always nice to see a little piece of Python code show up whenever one of these ML blogs gets posted to HN. It's becoming, or has become, the lingua franca of this field.
I'm very excited for TensorFlow for Swift, personally. The type system will make development so much more pleasant IMO. Judging by the way things have been going lately though, I imagine that by the time TF for Swift is ready I'll be reaching for PyTorch at every opportunity anyway...
I think the fact that Swift has first-class automatic differentiation is an even bigger deal than the fact that it is strongly-typed. Here is an interesting write-up about it...
Naive Bayes is a good baseline since it's very fast, fairly accurate, and doesn't need a big training set. But it's often not the best model you can find, and it only works well if your classes are linearly separable, i.e. it can't model an XOR.
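The XOR point is easy to check by hand. Below is a minimal from-scratch Bernoulli naive Bayes in Python (my own sketch with hypothetical helper names, not any library's implementation, and with no smoothing) fitted on the four rows of the XOR truth table. Each feature on its own is equally likely to be 0 or 1 in both classes, so the model cannot tell the classes apart.

```python
from collections import defaultdict

# XOR truth table: the label is 1 exactly when the two binary features differ.
DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def fit_bernoulli_nb(data):
    """Estimate P(y) and P(x_i = 1 | y) by plain counting (no smoothing)."""
    by_class = defaultdict(list)
    for x, y in data:
        by_class[y].append(x)
    prior = {y: len(rows) / len(data) for y, rows in by_class.items()}
    n_features = len(data[0][0])
    likelihood = {
        y: [sum(row[i] for row in rows) / len(rows) for i in range(n_features)]
        for y, rows in by_class.items()
    }
    return prior, likelihood

def posterior(x, prior, likelihood):
    """P(y | x), assuming the features are independent given the class."""
    scores = {}
    for y in prior:
        score = prior[y]
        for xi, p1 in zip(x, likelihood[y]):
            score *= p1 if xi == 1 else 1 - p1
        scores[y] = score
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

prior, likelihood = fit_bernoulli_nb(DATA)
for x, y in DATA:
    # Every row gets a 50/50 posterior, regardless of its true label.
    print(x, y, posterior(x, prior, likelihood))
```

All four inputs get the same 50/50 posterior, so any predicted label is a coin flip on this data.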
change my mind: bayes in practice is just a way to regularize your model and the language of bayes makes it seem principled but really you could use literally any regularizer and it would work almost just as well. i believe this because ultimately you're always going to minimize loglikelihood anyway (and so the prior becomes the regularization term).
Well, with a Bayesian setup you could draw samples of the parameters via MCMC, or get a distribution over them via a variational approximation, rather than getting a single point estimate (maximum likelihood, MAP, whatever) for the parameters of the model by solving an optimization problem. This seems much more general (and practically useful). So I think it is the other way around: regularizers are just priors (that you arrived at somehow).
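A tiny sketch of both sides of this exchange, in Python, under assumptions I am choosing for illustration: a Gaussian likelihood with known variance, a zero-mean Gaussian prior on the mean, and made-up data. In this conjugate case the MAP estimate is exactly the minimizer of a squared-error loss with an L2 penalty (the prior acting as a regularizer), while the Bayesian answer also comes with a posterior variance, i.e. a whole distribution rather than a point.

```python
# Hypothetical data and hyperparameters, for illustration only.
xs = [2.1, 1.9, 2.4, 2.0]
sigma2 = 1.0    # known noise variance of the likelihood N(mu, sigma2)
tau2 = 0.5      # variance of the prior N(0, tau2) placed on mu

n = len(xs)
lam = sigma2 / tau2    # L2 penalty strength implied by the prior

def penalized_loss(mu):
    """Negative log-posterior, up to a constant and a factor of 2*sigma2:
    squared error plus an L2 penalty on mu."""
    return sum((x - mu) ** 2 for x in xs) + lam * mu ** 2

# Closed-form minimizer of the penalized loss (the MAP estimate).
mu_map = sum(xs) / (n + lam)

# Full Bayesian answer: the posterior over mu is Gaussian with this
# mean and variance, so it describes uncertainty as well as location.
post_var = 1.0 / (n / sigma2 + 1.0 / tau2)
post_mean = post_var * sum(xs) / sigma2

print(mu_map, post_mean, post_var)  # the first two values agree
```

The point estimate and the posterior mean coincide here only because everything is Gaussian; the posterior variance is the extra information that no regularizer alone gives you.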
A bottling company is interested in determining the accuracy with which their equipment is filling bottles of water. One answer would be "95% of the bottles contain between 11.9 and 12.1 ounces". A different way of answering the question would be to estimate the actual distribution of water amounts.
The difference here is that knowing a distribution is often more useful than just knowing the mean, or the variance, or some confidence intervals. Bayesian methods tend to be useful when you want this sort of information, which is often the case when you are using it for decision making (or something like game theory).
Another use case is when you are making decisions that require multiple pieces of information that don't neatly fit together. A simple example is cancer screening. A rational decision about the proper threshold requires you to combine information about (1) the accuracy of your test and (2) the prevalence of the cancer in the population.
I will also add that the formula presented in the article is the simple case with discrete distributions. The more general version of the formula can also handle continuous distributions.
lol is this copypasta? i'm quite familiar with all of these toy examples of inference instead of point estimation. i'm talking about fitting models rather than descriptive statistics (or decision theory).
Commonly, a model is used primarily to make better decisions. Specifically in the context of fitting models, Bayesian methods are really popular for hyperparameter tuning.
I guess my main point is that at least one reason people use Bayesian methods is that they are dealing with problems that are qualitatively different from more prototypical prediction problems.
Well, in that case minimizing the (negative) loglikelihood seems principled but you could minimize literally any loss function and it would work almost just as well.
Edit: this is a different take on the subject but an enjoyable and accessible read too: http://mbmlbook.com/toc.html