
Stanford Lecture Notes on Probabilistic Graphical Models - volodia
https://ermongroup.github.io/cs228-notes/
======
georgeek
An amazing text on this topic is Martin Wainwright/Michael Jordan's Graphical
Models, Exponential Families, and Variational Inference:
[https://people.eecs.berkeley.edu/~wainwrig/Papers/WaiJor08_F...](https://people.eecs.berkeley.edu/~wainwrig/Papers/WaiJor08_FTML.pdf)

~~~
signa11
Wow, thank you :) This should be required reading/prereq, ideally, before
taking the PGM course offered via Coursera.

~~~
georgeek
This is more of a graduate-level text though. It's as close as you can
currently get to Michael Jordan's forthcoming book on graphical models and ML.

~~~
chasely
Is this the book that's been in development for the past decade with bits and
pieces available online?

~~~
georgeek
That's the one :)

------
refrigerator
For anyone interested, here are the materials for the Graphical Models course
at Oxford:
[http://www.stats.ox.ac.uk/~evans/gms/index.htm](http://www.stats.ox.ac.uk/~evans/gms/index.htm)

------
philipov
OpenCourseOnline, different Stanford professor:
[https://www.youtube.com/watch?v=WPSQfOkb1M8&list=PL50E6E80E8...](https://www.youtube.com/watch?v=WPSQfOkb1M8&list=PL50E6E80E8525B59C&index=1)

Carnegie Mellon:
[https://www.youtube.com/watch?v=lcVJ_zsynMc&list=PLI3nIOD-p5...](https://www.youtube.com/watch?v=lcVJ_zsynMc&list=PLI3nIOD-p5aoXrOzTd1P6CcLavu9rNtC-)

------
beambot
For a practical application of using graphical models to "solve" a Bayesian
problem, I recommend Frank Dellaert's whitepaper, which covers Simultaneous
Localization and Mapping (SLAM, a core robotics problem) using similar
techniques:
[https://research.cc.gatech.edu/borg/sites/edu.borg/files/dow...](https://research.cc.gatech.edu/borg/sites/edu.borg/files/downloads/gtsam.pdf)

~~~
rsp1984
I'm not an expert in Probabilistic Graphical Models, but I do know factor
graphs well. I've also read quite a bit of the recent SLAM and VO (visual
odometry) work that came out of Dellaert's group.

Here's the thing: Dellaert loves to write about SLAM from a viewpoint of
probabilities and densities. I think he likes to see himself as a
mathematician. However at the end of the day he's working with Gaussian noise
assumptions, just like everyone else, and the solution is obtained by some
form of Least Squares, again, like everyone else.

I would really like it if he cut through all the probabilistic thicket and
went straight to the core of how his methods improve the SOTA, purely in terms
of how the friggin Least Squares problem is set up and solved (e.g. like here [1]).
But I guess that would probably take away most of the "magic".

[1]
[http://grail.cs.washington.edu/projects/mcba/pba.pdf](http://grail.cs.washington.edu/projects/mcba/pba.pdf)
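
(To make the least-squares point concrete: a minimal numpy sketch of a made-up
1D pose chain, not anything from Dellaert's code. Under Gaussian noise
assumptions, maximizing the posterior is exactly an information-weighted
linear least-squares problem.)

    import numpy as np

    # Toy 1D "pose graph": unknowns x1, x2, x3 on a line.
    # Measurements: a prior on x1, odometry between consecutive poses, and a
    # loop closure between x1 and x3, each with independent Gaussian noise.
    A = np.array([
        [ 1.0,  0.0, 0.0],   # prior:        x1      ~ 0.0
        [-1.0,  1.0, 0.0],   # odometry:     x2 - x1 ~ 1.0
        [ 0.0, -1.0, 1.0],   # odometry:     x3 - x2 ~ 1.0
        [-1.0,  0.0, 1.0],   # loop closure: x3 - x1 ~ 2.1
    ])
    b = np.array([0.0, 1.0, 1.0, 2.1])
    sigmas = np.array([0.01, 0.1, 0.1, 0.05])   # measurement std devs

    W = np.diag(1.0 / sigmas)                   # whiten by the noise model
    x, *_ = np.linalg.lstsq(W @ A, W @ b, rcond=None)
    print(x)                                    # MAP estimate of the three poses

Real SLAM differs mainly in that the measurement functions are nonlinear, so
the same weighted least-squares problem is solved by iterated linearization
(Gauss-Newton / Levenberg-Marquardt).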

~~~
Xcelerate
> However at the end of the day he's working with Gaussian noise assumptions,
> just like everyone else, and the solution is obtained by some form of Least
> Squares, again, like everyone else.

Well, you need _some_ kind of model to optimize, and finding the perfect one
is a _vastly_ more difficult problem that probably begins to incorporate ideas
from information theory (which starts to become a rather different field).

I don't actually see the issue with researchers working on algorithms to solve
the Gaussian noise versions of these problems — they can usually be extended
to some other model relatively easily (for instance, rotation synchronization
via Riemannian manifolds can incorporate a Pseudo-Huber loss function without
much difficulty).
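
(For instance, a tiny sketch of the Pseudo-Huber loss, which behaves like
squared error for small residuals but only linearly for large ones, so it can
be swapped in where a plain squared-error term would otherwise be used:)

    import numpy as np

    def pseudo_huber(r, delta=1.0):
        # ~ r^2 / 2 for |r| << delta, ~ delta * |r| for |r| >> delta
        return delta**2 * (np.sqrt(1.0 + (r / delta)**2) - 1.0)

    residuals = np.array([0.1, 0.5, 5.0, 50.0])
    print(pseudo_huber(residuals))    # grows roughly linearly on the outliers
    print(0.5 * residuals**2)         # squared error blows up on the outliers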

Also, what is SOTA?

~~~
gadjo95
> Also, what is SOTA?

State Of The Art.

------
dirtyaura
For a novice, it's hard to assess how important PGMs are currently.

If I'm investing my time in learning Deep Learning (CNNs, RNNs), Random
Forests, PGMs, or Reinforcement Learning well enough to actually apply the
chosen approach, it seems that PGMs are not high on the list. Is that correct?

Are there Kaggle competitions, in which PGMs have been the best approach?

What are the real-world problem areas that PGMs currently excel at compared
to other methods?

~~~
eachro
Random Forests are not really in the same class of depth (no pun intended) as
the other models you've mentioned; if you understand decision trees, a random
forest is just a way of combining independently trained decision trees that is
less prone to overfitting.
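
(A rough sketch of that idea, using sklearn decision trees purely for
illustration rather than any particular RandomForest implementation:)

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # A random forest, roughly: many decision trees, each fit on a bootstrap
    # resample of the data with random feature subsets at each split, with
    # predictions combined by majority vote.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    rng = np.random.default_rng(0)

    trees = []
    for _ in range(100):
        idx = rng.integers(0, len(X_tr), size=len(X_tr))    # bootstrap sample
        trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X_tr[idx], y_tr[idx]))

    votes = np.mean([t.predict(X_te) for t in trees], axis=0)
    print(((votes > 0.5) == y_te).mean())                   # held-out accuracy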

I'd probably rank these areas in order of importance as follows: deep
learning, PGMs, reinforcement learning. Deep learning as a framework is pretty
general. PGMs, as I have seen them, don't really have any one killer domain
area - maybe robotics and areas where you want to explicitly model causality?
Applications for reinforcement learning seem more niche, but maybe that's
because they haven't been explored to the extent that DL/CNNs/RNNs/PGMs have
been.

~~~
digitalzombie
Tree-based algorithms have their place. I don't know what your definition of
depth is, but it all depends on your data.

I'm doing a thesis on tree-based algorithms and they work great for medical
data.

Granted, I have little exposure to NNs, but you can't use them when clinical
trial data is as small as it is.

It all comes down to the type of data, your resources, and your criteria.

NNs are really hyped and people tend to overlook other algorithms, but NNs are
not a silver bullet. I don't get how you rank those by importance either; it
could be bias.

~~~
eachro
Oh, there's no question that tree-based methods are effective - random
forests/gradient-boosted trees routinely win Kaggle competitions. But I was
more referring to how random forests are learnable in a day, whereas deep
learning would probably take at least a few weeks to learn properly.

------
likelynew
We have a professor at our college who teaches this subject with great
passion; I took two related courses under him. The problem is that he uses his
own mental images and notation in lectures and exams. Even the internet seems
to be highly affected by this problem. I think many of the concepts are not
hard, but it takes time to get a feel for them. See the "monads are not
burritos" essay, which describes the problem of using analogies to explain
monads. These notes seem great in the sense that they use a minimum of the
confusing analogies that are common in most resources on Bayes' rule.

------
graycat
In their "Probability review" at

[http://ermongroup.github.io/cs228-notes/preliminaries/probab...](http://ermongroup.github.io/cs228-notes/preliminaries/probabilityreview/)

I see two problems:

(1) First Problem -- Sample Space

Their definition of a _sample space_ is

"The set of all the outcomes of a random experiment. Here, each outcome ω can
be thought of as a complete description of the state of the real world at the
end of the experiment."

The "complete description" part is not needed and even if included has meaning
that is not clear.

Instead, each possible _experiment_ is one _trial_ and one element in the set
of all trials Ω. That's it: Ω is just a set of trials, and each trial is just
an element of that set. There is nothing there about the outcomes of the
trials.

Next the text has

"The sample space is Ω = {1, 2, 3, 4, 5, 6}."

That won't work: too soon one finds that an uncountably infinite sample space
is needed. Indeed, an early exercise is to show that the set of all events
cannot be countably infinite.

Indeed, a big question was: can there be a sample space big enough to discuss
random variables as desired? The answer is yes, and it is given by the famous
Kolmogorov extension theorem.

(2) Second Problem -- Notation

An _event_ A is an element of the set of all events F and a subset of the
sample space Ω.

Then a _probability measure_ P, or just a _probability_, is a function
P: F --> [0,1], that is, into the closed interval [0,1].

So, we can write the probability of event A by P(A). Fine.

Or, given events A and B, we can consider the event C = A U B and, thus, write
P(C) = P(A U B). Fine.

But the notes have P(1,2,3,4), and that is undefined in the notes and, really,
in the rest of probability. Why? Because

1,2,3,4,

is not an event.

For the set of real numbers R, a real _random variable_ is a function
X: Ω --> R that is _measurable_ with respect to the sigma algebra F and a
specified sigma algebra on R, usually the Borel sets (the smallest sigma
algebra containing the open sets) or the Lebesgue measurable sets.

Then an event would be "X in {1,2,3,4}", with {1,2,3,4} a subset of R, that
is, the set of all ω in Ω such that X(ω) is in {1,2,3,4}, or

{ω | X(ω) in {1,2,3,4}}

which is the inverse image of {1,2,3,4} under X (one could write all this more
clearly with all of D. Knuth's TeX), in which case we could write

P(X in {1,2,3,4})

When the elementary notation is bad, it is a bit tough to take the more
advanced parts seriously.
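
(To illustrate the well-formed version, a toy sketch with a fair die, where
the argument of P is always an event, that is, a subset of Ω obtained as an
inverse image under X:)

    from fractions import Fraction

    Omega = {1, 2, 3, 4, 5, 6}      # sample space for one roll of a fair die
    X = lambda w: w                 # a random variable X: Omega -> R (the identity here)

    def P(event):
        # probability measure on events, i.e., subsets of Omega (uniform here)
        return Fraction(len(event), len(Omega))

    # The event "X in {1,2,3,4}" is the inverse image of {1,2,3,4} under X:
    A = {w for w in Omega if X(w) in {1, 2, 3, 4}}
    print(P(A))                     # 2/3, i.e., P(X in {1,2,3,4})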

A polished, elegant treatment of these basics is early in

Jacques Neveu, _Mathematical Foundations of the Calculus of Probability_,
Holden-Day, San Francisco, 1965.

Neveu was a student of M. Loève at Berkeley; one can also see Loève,
_Probability Theory_, I and II, Springer-Verlag. A fellow student of Neveu at
Berkeley under Loève was L. Breiman, so one can also see Breiman,
_Probability_, SIAM.

These notes are from Stanford. But there have long been people at Stanford,
e.g., K. Chung, who have presented these basics in very clear, solid, and
polished terms, e.g.,

Kai Lai Chung, _A Course in Probability Theory, Second Edition_, ISBN
0-12-174650-X, Academic Press, New York, 1974.

K. L. Chung and R. J. Williams, _Introduction to Stochastic Integration,
Second Edition_, ISBN 0-8176-3386-3, Birkhäuser, Boston, 1990.

Kai Lai Chung, _Lectures from Markov Processes to Brownian Motion_, ISBN
0-387-90618-5, Springer-Verlag, New York, 1982.

~~~
ilzmastr
If the notes only discuss discrete-valued RVs, there is no problem. This looks
like the case, since grepping for "continuous" turns up few results:
[https://github.com/ermongroup/cs228-notes/search?l=Markdown&...](https://github.com/ermongroup/cs228-notes/search?l=Markdown&q=continuous&type=&utf8=)

~~~
graycat
Hmm, discrete only? So, one can't take averages, state the law of large
numbers or the central limit theorem, discuss convergence or completeness, do
linear regression, etc. One can't multiply a random variable by a real number.
Hmm ....

One will be in trouble when considering a sequence of, say, independent random
variables, as in coin flipping.
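
(For concreteness, a quick simulation of the kind of sequence meant here:
independent fair coin flips and their running average.)

    import numpy as np

    rng = np.random.default_rng(0)
    flips = rng.integers(0, 2, size=100_000)     # X_1, X_2, ... taking values in {0, 1}
    running_mean = np.cumsum(flips) / np.arange(1, len(flips) + 1)
    print(running_mean[[9, 99, 9_999, 99_999]])  # settles toward 1/2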

One will have trouble with P(A|B) and E[Y|X].

And the notation is still wrong.

~~~
ilzmastr
On my planet all of those concepts work whether your RV takes values on a
countable set (what a discrete RV means) or a continuum.

~~~
graycat
Right. And if one doesn't want to get involved in what really are the
measure-theory foundations of probability, then fine. Nearly all of statistics
has been done this way.

But if one does try to give the measure-theory foundations, as the OP did,
then at least don't make a mess of it.

If one doesn't want to get the measure theory right, then, sure, just leave
out the foundations and start with events and random variables. The
measure-theory foundations are so powerful, so robust, and so general that in
practice it's tough to get into trouble. One place to get into trouble: in
stochastic processes, take the union of uncountably infinitely many points,
call that an event, and ask for its probability. Okay, then: don't do that.

------
mrcactu5
Courses like these make me wonder ... how much can one dress up basic
probability? I think the answer is A LOT.

~~~
adamnemecek
If you can derive all this just from probability, my hat is off to you.

