
Understand the Math Behind it All: Bayesian Statistics - romil
http://blogs.adobe.com/digitalmarketing/personalization/conversion-optimization/understand-the-math-behind-it-all-bayesian-statistics/
======
antics
There has been a lot of rich discussion about the relative merits of the
Bayesian and frequentist perspectives on statistical inference. If you are
thinking about applying inference techniques to some problem, then it is well
worth your time to make sure that you really, really understand this debate,
because picking the correct tool for your job is likely to make your life a
_lot_ easier (and, yes, there are both situations where the Bayesian
perspective is _clearly_ better and where it is _clearly_ worse).

Unfortunately this post completely and totally ignores the things that will
enable you to make this decision. This post is called "Understand the Math
Behind it All", but you will learn nothing about math, or really, Bayesian
statistics, at all. You will not learn, for example, how to apply Bayesian
inference to problems, what it means to do basic Bayesian tasks like "use an
expert" or "condition on evidence", or even what a Bayesian statistic is. In
fact, Bayes' theorem is never even mentioned. There's just a hand-wavy
collection of statements like "The Bayesian approach is to rely on past
knowledge and then adjust accordingly". That is so vague that it is not even
clear that they are talking about Bayesian analysis. This is the sort of
statement that fools people into believing they understand something that they
really don't.

If you really want to understand this material, you should watch [1], a talk by
Mike Jordan called "Are you a Bayesian or a Frequentist?". It's a bit much for
beginners, but if you are willing to look up some of the math, it is entirely
digestible, and it is _by far_ the best comparison of the two communities I
have found. I say "by far" because it is (1) a more or less complete
representation of both communities, (2) a pretty much unbiased account of both
communities' strengths and weaknesses, and (3) as direct as it can get, meaning
that it is not tied up in a lot of external knowledge and is intent on
delivering this message, rather than delivering it in an off-hand way as a
means of getting to something else.

[1] <http://videolectures.net/mlss09uk_jordan_bfway/>

~~~
ced
When are Bayesian methods "clearly worse" than frequentist methods, apart from
computationally?

~~~
antics
My Bayesian theory is a bit rusty, but here we go.

Say we have data X, and some non-finite dimensional index into the family of
functions that describe the data, called \theta.

The Bayesian perspective classically holds \theta constant and optimizes the
expected loss, _conditioned on the data X_. The frequentist perspective, on
the other hand, classically optimizes \theta, that is, it picks the best
\theta over the data X, _unconditionally_.

This has two impacts. First, all things being equal, frequentist statistics
will tend to be more stable and more calibrated, but less coherent. It is
commonly said that frequentist statistics will "isolate" one from poor decision
making, and, all things being equal, that will be true.

Specific, clear wins for frequentists are bootstrapping procedures (e.g.,
Efron's bootstrap, the b-of-n bootstrap, Jordan's own scalable "bag of little
bootstraps" from NIPS 2011), which are methods for building what are called
"quantifiers" for estimators. In short, this means that if you have some
estimator (e.g., a classifier, or a mean, or whatever), you want to be able to
quantify the certainty of that estimator -- so if you've only seen 5 examples,
you want to express that you're less certain about the estimate. This is
clearly a frequentist application, not a Bayesian application, and in general,
it points to the fact that pure frequentist tools not only have a place in
inference, but they fill a niche that Bayesian tools necessarily will not, and
in some cases cannot, fill.
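
As a rough illustration of what that quantification looks like in practice,
here is a minimal sketch of the plain (Efron) bootstrap for a sample mean --
made-up data, standard library only, not anything from the talk:

    import random

    random.seed(0)
    data = [2.3, 1.7, 3.1, 2.8, 2.0]   # tiny made-up sample: only 5 observations

    def mean(xs):
        return sum(xs) / len(xs)

    # Bootstrap: resample the data with replacement many times and look at how
    # much the estimator (here, the sample mean) varies across resamples.
    boot_means = sorted(
        mean([random.choice(data) for _ in data]) for _ in range(10000)
    )

    lo = boot_means[int(0.025 * len(boot_means))]
    hi = boot_means[int(0.975 * len(boot_means))]
    print("estimate:", mean(data))
    print("rough 95% bootstrap interval:", (lo, hi))

The width of that interval is exactly the kind of "how certain should I be,
given only 5 examples?" statement described above.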

~~~
equark
It sounds like you're just saying that if you want to know the frequentist
properties of an estimator you have to be frequentist. That's a tautology.

The harder question is whether there are any decisions you'd prefer to make
using a non-Bayesian procedure. That's basically a tautology in the other
direction though.

~~~
antics
As I said, my Bayesian theory is rusty, but there are no "frequentist
properties" of an estimator. Frequentist inference is inference -- it doesn't
make guarantees about the underlying thing it's approximating; it provides
guarantees about its approximation.

The key here is that Bayesian and frequentist procedures provide different
sorts of guarantees. Frequentists optimize for \theta, the possible set of
things that could describe all of the data X, while Bayesians will assume a
single describing function \theta (this might come from an "expert") and
simply optimize the expectation conditioned on the data. Neither is "wrong"
but in the case of the bootstrap, the result is calibrated in a way that
Bayesian inference simply never will be (if it were, it would be frequentist).

EDIT: As for your second question, I actually don't think it's more
interesting. A classifier is a type of estimator, so all of the general
frequentist guarantees still apply to decision making.

~~~
equark
Frequentist statistics is about determining the repeated sampling properties
of a procedure/statistic/estimator. It's about evaluation not estimation.
"Optimizing \theta" or whatever you're envisioning is just one possible
procedure you might be interested in evaluating. You can use the repeated
sampling properties of your procedure to do frequentist inference or to
evaluate other properties like unbiasedness, consistency, risk, etc. Typically
the goal is to find procedures that have "good" frequentist (repeated
sampling) properties. Most Bayesian-inclined statisticians would tend to argue
that many frequentist properties are not important to applied data analysis or
optimal decision making.

------
tmoertel
If you actually want to understand the "math behind it all," do yourself a
favor and read the first three chapters of _Probability Theory: The Logic of
Science_ by E. T. Jaynes [1]. Jaynes builds, from the ground up, probability
theory as an extended logic that allows you to draw inferences from incomplete
and uncertain information. In this logic, a probability represents a degree of
belief, and Bayes' theorem becomes a rule for updating prior beliefs in light
of new evidence. From this basis, Jaynes recreates the classical probability
theorems (e.g., the sum and product rules), giving them clear interpretations.

If you want to understand this stuff for real, it's hard to beat Jaynes.
(Plus, he uses a robot mind (!) as a recurring expository device. That alone
is worth the price of admission.)
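
A minimal numeric sketch of that updating rule, in case it helps (the numbers
are made up for illustration and are not from the book):

    # Prior degree of belief in a hypothesis H, e.g. "this email is spam".
    p_h = 0.01

    # How probable the evidence E (say, a trigger word appears) is under each
    # hypothesis.
    p_e_given_h = 0.80        # P(E | H)
    p_e_given_not_h = 0.05    # P(E | not H)

    # Bayes' theorem: P(H | E) = P(E | H) * P(H) / P(E)
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
    p_h_given_e = p_e_given_h * p_h / p_e

    print(p_h_given_e)   # ~0.139: the 1% prior is revised upward by the evidence

Seeing the evidence doesn't make the hypothesis "true"; it just moves the
degree of belief from 1% to roughly 14%, which is the kind of updating Jaynes
formalizes.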

[1] <http://bayes.wustl.edu/etj/prob/book.pdf>

------
tokenadult
I like the online article "An Intuitive Explanation of Bayes' Theorem" by HN
participant Eliezer S. Yudkowsky

<http://yudkowsky.net/rational/bayes>

as an understandable overview of why a Bayesian perspective on statistics is
important. His technical explanation article

<http://yudkowsky.net/rational/technical>

is a follow-up to that article.

------
lobo_tuerto
I really liked and easily understood this explanation of Bayes:

<http://oscarbonilla.com/2009/05/visualizing-bayes-theorem/>

------
ot

      Whereas a frequentist model looks at an absolute basis for
      chances, something like the population of females is 52%, so that
      means that if I select someone at random from my office, I have a
      52% chance of picking a female. The chances are purely based on
      the total probability. The Bayesian approach is to rely on past
      knowledge and then adjust accordingly. If I know that 75% of my
      office is male, and I grab a person, then I know that I have a
      25% chance of picking a female.
    

This is a terrible example: it makes it look like frequentist statistics don't
know about conditional probabilities.

My understanding is that frequentist statistics are all about _point
estimates_ (such as maximum likelihood) or sometimes _confidence intervals_.
Say you want to run an election poll: you sample a bunch of people and ask
whom they voted for.

Frequentists will average the data and say "Party A is at 51%" (point
estimate) or "Party A is between 49% and 52%" (confidence interval).
Asymptotically, and under certain assumptions, this value will converge to the
"real" value, and often you can also estimate the speed of convergence (with
variance bounds).

Bayesians will instead start with a "prior", which is a probability
distribution p(x) on "Party A is at x%". You can start from a uniform, non-
informative prior, or if you have some information you can factor it into the
prior. Then you take your poll data D and compute the conditional probability
p(x | D), called the "posterior", which is the probability that "Party A is at
x% _given the data_". So you don't get a single number or a single interval;
you get a probability _for each possible electoral outcome_. Again, if you
have infinite data this will converge to a distribution where all the mass is
on a single point, which is usually the same as the one given by frequentist
statistics.

The problem with Bayesian statistics is that you have to handle probability
distributions instead of single numbers (as with point estimates), so
inference gets much harder. It has become practical only relatively recently,
thanks to both algorithmic and hardware advances. On the other hand, the main
advantage is that you get to know how uncertain your estimate is, which can
make a huge difference when you have little data.

Last note: another thing you can do when you have a prior is factor it into
your estimator and take the mode of the posterior as a point estimate. This is
called maximum a posteriori (MAP) estimation, and some people call it
"Bayesian", but I don't think Bayesians agree with that.

~~~
equark
I don't think this is a worthwhile distinction, even if it's historically
accurate. Both Bayesians and Frequentists focus on point estimation and
distributions. EAP and MAP are just as Bayesian as the full posterior
distribution. And the sampling distribution is just as important to
Frequentists as the posterior distribution is to a Bayesian.

The key difference is whether inference is based on the sampling distribution
or the posterior distribution.

~~~
ot
Very interesting, can you expand? How do frequentists use the sampling
distribution, other than for the classical MLE, etc.?

~~~
equark
The sampling distribution is not the likelihood. It's the fundamental basis of
all frequentist inference. Amazingly, even though this is typically taught in
introductory statistics, virtually no students actually digest its importance.
You literally cannot understand frequentist statistics without understanding
the idea of a sampling distribution.

The sampling distribution is the distribution of your statistic (MLE estimate,
mean, EAP, MAP, or whatever you want) under repeated sampling from the
population distribution. Frequentism is an evaluation procedure, which can be
applied to any estimator whether it be Bayesian or something like MLE.
Frequentists are interested in whether this distribution has "good"
properties. Supposedly good properties include things like unbiasedness,
consistency, minimum variance, etc. Inference is typically expressed as a
function of this distribution (confidence intervals) or by comparing the
sampling distribution under some restriction (the null hypothesis) to the
actual value of the statistic in the observed sample.

Given that you can't typically sample from the population distribution, the
practical question becomes how do you approximate the sampling distribution.
Typically this is done by appealing to a central limit theorem. Bootstrapping
provides another intuitive approximation.

There are all sorts of problems with this approach to statistics, despite its
success.
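
A minimal sketch of that idea, in case it helps (made-up population, standard
library only): treat the population as known, draw many samples of size n from
it, compute the statistic on each, and compare the spread of those statistics
to what the central limit theorem predicts.

    import math
    import random

    random.seed(1)
    # Pretend we can see the whole population (in real problems we cannot):
    # a skewed, roughly exponential population.
    population = [random.expovariate(1.0) for _ in range(100000)]
    n = 30   # the sample size our procedure gets to use

    def mean(xs):
        return sum(xs) / len(xs)

    def sd(xs):
        m = mean(xs)
        return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

    # Sampling distribution of the sample mean: the distribution of the statistic
    # over many repeated samples of size n drawn from the population.
    sampling_dist = [mean(random.sample(population, n)) for _ in range(5000)]

    # The CLT approximation says this distribution is roughly normal with
    # standard deviation sigma / sqrt(n).
    print("sd of simulated sampling distribution:", sd(sampling_dist))
    print("CLT prediction, sigma / sqrt(n):       ", sd(population) / math.sqrt(n))

In practice you only get one sample and cannot redo the draw, which is why the
CLT (or the bootstrap, resampling from the one sample you do have) is used to
approximate this distribution.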

------
is74
The problem with statistics is that it's complicated, both Bayesian and
frequentist. Specifically, all statistical methods make assumptions about the
data, some of which are quite subtle and take effort to understand. Their
intricacy is the reason why so many scientists use them incorrectly. It's much
less about whether a method is Bayesian or frequentist than about whether the
specific assumptions made by a method are suitable for the data. This requires
a judgement call. One of the advantages of Bayesian methods over frequentist
methods is that, in principle, it's easier to incorporate what we know about
the data into a Bayesian model through the prior; in practice, doing a good
job of that is pretty tricky.

------
rockmeamedee
Wait, I was looking to Understand the Math Behind it All, instead of a light
overview of the definitions. Does anybody have a more in-depth reference?

~~~
tmoertel
See also: <http://news.ycombinator.com/item?id=4031451>

------
chris_wot
This is a little frustrating. It gives me an intro to Bayesian statistics, but
not much hard math. Not a formula in sight!

------
pm90
A few days ago I found this gem of a book that I think many here would find
interesting (if you don't already know it):
<http://www.amazon.com/All-Statistics-Statistical-Inference-Springer/dp/0387402721>

------
Dn_Ab
Someday I will fundamentally understand Bayesian probability.

By understand I mean to grasp the links between thermodynamics, learning,
black holes, cosmology, optimization, probability, Bayesianism & quantum
mechanics. Stuff like why techniques from thermodynamics and energy-based
models are so useful in machine learning, the well-known relationship between
Shannon entropy and thermodynamic entropy, between entropy and decoherence in
QM, the duality of optimization and probability, the complex Bayesian
probability interpretations of quantum mechanics, the Bekenstein bound and the
holographic principle. I could ramble on at length, but fortunately I have
much to do.

~~~
kylebrown
Someday, I hope to do the same! But that's a big chunk to chew. Lately, I've
been nibbling, starting with covariance matrices, Mahalanobis distance (a
"moment measure" if I'm not mistaken, therefore a connection to... dundudun)
and Fisher information.

~~~
Dn_Ab
Fisher information is key and turns up in a lot of fundamental places. I'm
currently slowly working my way through a text on Information Geometry and
another on Ideals and varieties. There is only a limited time one can devote
to constant learning so I try to learn things that cut through as much
territory as possible. I feel strong discomfort when reading about subjects
like say machine learning where a lot of stuff is seemingly arbitrary rules of
thumb* .

It turns out that a bunch of geometric ideas that are useful in physics also
unify ML concepts: the idea of information geometry. There are three ways I
have seen the concept used. One is based on differential geometry and treats
sets of probability distributions as manifolds and their parameters as
coordinates. Many concepts are unified, and tricky ideas become tautologies
within a solid framework (Fisher information as a metric):
<http://www.cscs.umich.edu/~crshalizi/notabene/info-geo.html>

The second approach is in terms of varieties from algebraic geometry. Here,
statistical models of discrete random variables are the zero sets of certain
collections of polynomials (which describe hypertetrahedrons). Graphical
models (hidden Markov models, neural nets, Bayes nets) are all treated on one
footing.

The final approach is an interesting set of techniques where a researcher
abstracts information retrieval using methods from quantum mechanics. The
benefit is that you get a basic education in the math of QM as well.

* Arbitrary in the sense that you just have to accept a lot of things that only become less fuzzy with time, whereas a proper framework provides handholds that reward effort with proportional amounts of understanding. The last time I felt this way was when I was first learning functional programming 7 years ago. The terminology was different and heavy going coming from imperative programming, but I knew the rewards in understanding, expressiveness and flexibility would be well worth the effort. Confusion dissipated linearly with effort (unlike C++'s nonlinear relationship), and I knew that I was picking up a bunch of CS theory at the same time that would make learning programming (and C++) much easier.

------
tel
Seems like everyone wants to write an intro to Bayesian stats (or, usually, to
the subjective interpretation of probabilities and Bayes' theorem). It's like
the new Monad tutorial.

