
Introduction to Probability at an advanced level (2018) [pdf] - Anon84
https://www.stat.berkeley.edu/~aditya/resources/AllLectures2018Fall201A.pdf
======
riskneutral
This is nice but I would have expected “probability at an advanced level” to
mean the measure theoretic foundations of probability.

~~~
j1vms
> This is nice but I would have expected “probability at an advanced level” to
> mean the measure theoretic foundations of probability.

Fair point, though with no intention of being too literal, might this qualify
as the difference between "Introduction to X" vs. "X"?

In case anyone else is interested: measure-theoretic probability theory
"unifies the discrete and the continuous cases, and makes the difference a
question of which measure is used. Furthermore, it covers distributions that
are neither discrete nor continuous nor mixtures of the two." [0]

[0] https://en.wikipedia.org/wiki/Probability_theory#Measure-theoretic_probability_theory

~~~
LeanderK
There's usually a difference between statistics and probability theory, so I
was also expecting the measure-theoretic foundations of probability (the
advanced point of view). Things like "Covariance, Correlation and Regression"
are not about (advanced) probability in a literal sense.

Or, to provide another perspective: The field of stochastics is bigger than
just statistics. It includes probability theory as its foundation and then
applies it to different domains, including but not limited to statistics.

------
tepitoperrito
Whelp, first and foremost, thanks for the push I needed to commit to learning
stats through and through.

Second, this is more of an Ask HN than a comment: with a background in math up
to ~1st semester calculus (i.e., differential and integral calculus), what are
the killer resources for getting up to speed with Statistics & Probability?

For the record: I found the discussion and content discussed last week (1) to
be a great refresher, and the document linked above is even denser as far as
topics to explore go.

Many thanks...

(1) The Little Handbook of Statistical Practice (2012):
https://news.ycombinator.com/item?id=21382470

~~~
graycat
> what are the killer resources for getting up to speed with Statistics &
> Probability?

(1) Linear algebra. Much of applied statistics is _multi-variate_ , and nearly
all of that is done with matrix notation and some of the main results of
linear algebra.

The central linear algebra result you need is the polar decomposition: every
square matrix A can be written as UH where U is unitary and H is positive
semi-definite Hermitian.

So, for any vector x, the lengths of Ux and x are the same. So, intuitively
unitary is a rigid motion, essentially rotation and/or reflection.

Hermitian takes a sphere and turns it into an ellipsoid. The axes of the
ellipsoid are mutually perpendicular ( _orthogonal_ ) and are _eigenvectors_.
The sphere stretches/shrinks along each eigenvector axis according to its
eigenvalue.

U and H can be complex valued (have complex numbers), but the real valued case
is similar, A = QS where Q is orthogonal and S symmetric non-negative
definite. The normal equations of regression have a symmetric non-negative
definite matrix. Factor analysis and singular values are based on the polar
decomposition.
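To make the real-valued case concrete, here is a minimal numpy sketch (not
from the course notes; the example matrix is arbitrary) that computes A = QS
from the SVD and checks the claimed properties:

```python
import numpy as np

def polar_decomposition(A):
    """Factor a real square matrix as A = Q @ S with Q orthogonal (a rigid
    rotation/reflection) and S symmetric non-negative definite (the stretch
    along orthogonal eigenvector axes), computed from the SVD."""
    U, sigma, Vt = np.linalg.svd(A)
    Q = U @ Vt                       # orthogonal factor
    S = Vt.T @ np.diag(sigma) @ Vt   # symmetric non-negative definite factor
    return Q, S

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])  # arbitrary example matrix
Q, S = polar_decomposition(A)
```

The eigenvalues of S are the singular values of A, which is one way to see
the connection to factor analysis mentioned above.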

You get more than you paid for: A relatively good treatment of the polar
decomposition has long been regarded as a good, baby introduction to Hilbert
space theory for quantum mechanics. The unitary matrices are important in
group representations used in quantum mechanics for molecular spectroscopy.

Matrix theory is at first just some notation, how to write systems of linear
equations. But the notation is so good and has so many nice properties (e.g.,
matrix multiplication is associative, a bit amazing) that from the basic
properties we get _matrix theory_. Then linear algebra is, first, just about
systems of linear equations, and then moves on with matrix theory. E.g., one
could write out the polar decomposition without matrix theory, but it would be
a mess.

(2) Need some basic probability theory. So, we have _events_ -- e.g., let H be
the event that the coin came up heads. We can have the _probability_ of H,
written P(H), which is a number in the interval [0,1]. P(H) = 0 means we
essentially never get heads; P(H) = 1 means we essentially always get heads.

Can have event W -- it's winter outside. Then can ask for P(H and W). That
works like set intersection in Venn diagrams. Or can have P(H or W), and that
works like set union in Venn diagrams.

Go measure a number. Call that the value of real valued _random variable_ X.
We don't say what _random_ means; it does not necessarily mean unpredictable,
_truly random_ , unknowable, etc. The intuitive "truly random" is essentially
just _independence_ as below, and we certainly do not always assume
independence. For real number x and _cumulative distribution_ F_X, can ask for

F_X(x) = P(X <= x)

Often in practice F_X has a derivative, as in calculus, and in that case can
ask for the _probability density_ f_X of X

f_X(x) = d/dx F_X(x)

Popular densities include uniform, Gaussian, and exponential.
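As a small illustration of the relation f_X = d/dx F_X, here is a sketch for
the exponential case (the rate parameter lam = 1 is an arbitrary choice),
checking the density against a numerical derivative of the CDF:

```python
import math

def exp_cdf(x, lam=1.0):
    """F_X(x) = P(X <= x) for an exponential random variable with rate lam."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def exp_density(x, lam=1.0):
    """f_X(x) = d/dx F_X(x)."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

# The numerical derivative of the CDF should match the density
x, h = 1.5, 1e-6
numeric_derivative = (exp_cdf(x + h) - exp_cdf(x - h)) / (2 * h)
```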

We can extend these definitions of (cumulative) distribution and density to
the case of the _joint_ distribution of several random variables, e.g., X, Y,
and Z. We can visualize the joint distribution or density of X and Y on a 3D
graph. The density can look like some wrinkled blanket or pizza crust with
bubbles.

In statistics, we work with random variables, from our data and from the
results of manipulating random variables. E.g., for random variables X and Y
and real numbers a and b, we can ask for the random variable

Z = aX + bY

That is, we can manipulate random variables.

Events A and B are _independent_ provided

P(A and B) = P(A)P(B)

In practice, we look for independence because it is a powerful case of
_decomposition_ with powerful, even shockingly surprising, consequences. A lot
of statistics comes from an assumption of independence.

Can also extend independence to the case of random variables X and Y being
independent. For any _index_ set I and random variables X_i for i in I, can
extend to the case of the set of all X_i, i in I, being independent. The big
deal here is that the index set I can be uncountably infinite -- a bit amazing.

For the events heads H and winter W above, we just _believe_ that they are
independent just from common sense. To be blunt, in practice the main way we
justify an independence assumption is just common sense -- sorry 'bout that.
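A quick simulation sketch (two simulated fair coins, an arbitrary setup)
showing that for independent events the product rule P(A and B) = P(A)P(B)
holds approximately in the empirical frequencies:

```python
import random

random.seed(0)
N = 200_000

a_count = b_count = both = 0
for _ in range(N):
    A = random.random() < 0.5  # event A: first fair coin is heads
    B = random.random() < 0.5  # event B: second fair coin is heads
    a_count += A
    b_count += B
    both += A and B

p_a, p_b, p_ab = a_count / N, b_count / N, both / N
# For independent events, P(A and B) should be close to P(A) * P(B)
```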

Can define some really nice cases of an infinite sequence of random variables
converging (more than one way) to another random variable. An important
fraction of statistics looks for such sequences as approximations to what we
really want.

Since statistics books commonly describe 1-2 dozen popular densities, one
might guess that in practice we can find the (distribution or) density of our
random variables. Sadly or not, mostly no: usually in practice we won't have
enough data to know the density.

When we do know the density, it is usually from more information, e.g., the
renewal theorem justifies an assumption of a Poisson process, from which we
can argue for an exponential density, or the central limit theorem justifies a
Gaussian density.

(3) A _statistic_ works like this: take the values of some random variables,
manipulate them, and get a number; that number is itself a random variable and
a _statistic_. Some such statistics are powerful.

(4) Statistical Hypothesis Testing. Here is one of the central topics in
statistics; e.g., it is the source of the much-debated "p-values".

Suppose we believe that in an experiment on average the _true value_ of the
results is 0. We do the experiment and want to do a _statistical hypothesis
test_ that the true value is 0.

To do this test, we need to be able to calculate some probabilities. So, we
make an assumption that gives us enough to make this calculation. This
assumption is the _null hypothesis_ , _null_ as in no _effect_ , that is, we
still got 0 and not something else, say, from a thunderstorm outside.

So, we ASSUME the true value is 0 and with that assumption and our data
calculate a statistic X. We find the (distribution or) density of X with our
assumption of true value 0. Then we evaluate the value of X we did find.

Typically X won't be exactly 0. So we have two cases:

(I) The value of X is so close to zero that, from our density calculation, X
is as close to zero as in 99% of the cases. So, we _fail to reject_, really in
practice essentially _accept_, that the true value is 0. _Accept_ is a bit
fishy since we can _accept_ easily enough just by using a really _weak_ test
-- that is, we don't notice the pimple simply because we used a fuzzy
photograph! Watch out for that in the news or "how to lie with statistics".

(II) X is so far from 0 that either (i) the true value is not zero and we
_reject_ that the true value is 0 or (ii) the true value really is 0 and we
have observed something rare, say, 1% rare. Since the 1% is so small, maybe we
_reject_ (ii) and conclude (i).
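The two cases can be sketched in a few lines of simulation (all numbers here
-- sample size, observed values, cutoff -- are made up for illustration):
assume the null, simulate the null distribution of the sample-mean statistic,
and see how rare the observed value is:

```python
import random
import statistics

random.seed(2)

def null_statistic(n):
    """Sample mean of n draws under the null: noise with true value 0."""
    return statistics.mean(random.gauss(0.0, 1.0) for _ in range(n))

n = 25
null_draws = [abs(null_statistic(n)) for _ in range(5000)]

def p_value(observed):
    """Fraction of null draws at least as extreme as the observed statistic."""
    return sum(d >= abs(observed) for d in null_draws) / len(null_draws)

# Case (II): 0.9 is far from 0 relative to the null spread, so we reject.
# Case (I): 0.05 is unremarkable under the null, so we fail to reject.
p_far, p_near = p_value(0.9), p_value(0.05)
```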

So, for our _methodology_ we can test and test and test and what we are left
with is what didn't get rejected. Some people call this all _science_ ,
others, fishy science, others just fishy. IMHO, done honestly or at least with
full documentation, it's from okay to often quite good.

I mentioned _weak_ tests; intuitively that's okay, but there is a concept of
_power_ of a test. Basically we want more power.

Here's an example of _power_ : Server farm anomaly detection is, or should be,
essentially a statistical hypothesis test. So, then, the usual null hypothesis
is that the farm is healthy. We do a test. There is a rate (over time,
essentially probability) of _false alarms_ (where we reject that the farm is
healthy, conclude that it is sick, when it is healthy) and rate of _missed
detections_ of actual problems (fail to reject the null hypothesis and
conclude that the farm is healthy when it is sick).

Typically in hypothesis testing we get to adjust the false alarm rate. Then
for a given false alarm rate, a more _powerful_ test has a higher detection
rate of real problems.

It's easy to get a low rate of false alarm; just turn off the detectors. It's
easy to get 100% detection rate; just sound the alarm all the time.

So, a question is, what is the most powerful test? The answer is in the
classic (late 1940s) Neyman-Pearson lemma. It's like investing in real estate:
Allocate the first money to the highest ROI property; the next money to the
next highest, etc. Yes, the discrete version is a knapsack problem and is NP-
complete, but this is nearly never a consideration in practice. There have
been claims that some high end US military target detection radar achieves the
Neyman-Pearson most powerful detector.
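A tiny sketch of the false alarm / detection trade-off (the two Gaussian
distributions for "healthy" and "sick" are made up; with equal variances the
Neyman-Pearson likelihood-ratio test reduces to thresholding the observed
metric itself):

```python
import math

def gauss_tail(t, mu=0.0, sigma=1.0):
    """P(X > t) for X ~ Normal(mu, sigma^2), via the complementary error function."""
    return 0.5 * math.erfc((t - mu) / (sigma * math.sqrt(2.0)))

# Healthy farm: metric ~ Normal(0,1). Sick farm: metric ~ Normal(2,1).
threshold = 1.645  # set so the false alarm rate is about 5%
false_alarm = gauss_tail(threshold, mu=0.0)  # alarm when healthy
detection = gauss_tail(threshold, mu=2.0)    # alarm when sick
```

Sweeping the threshold traces out the whole trade-off curve: threshold at
+infinity is "turn off the detectors", threshold at -infinity is "sound the
alarm all the time".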

There's more, but that is a start.

~~~
sriram_malhar
Thank you. This is an excellent list. Can you also point us to resources that
have examples that are relevant to engineering and machine learning? I can't
take one more cancer example!

~~~
graycat
The world is awash in texts and papers on applied statistics. My notes were an
introduction to provide a start on what is needed to read more in applied
statistics.

For much of engineering, e.g., quality control, the statistical hypothesis
testing I outlined is common.

There are now classic texts on hypothesis testing:

E. L. Lehmann, _Testing Statistical Hypotheses_ , John Wiley, New York, 1959.

E. L. Lehmann, _Nonparametrics: Statistical Methods Based on Ranks_ , ISBN
0-8162-4994-6, Holden-Day, San Francisco, 1975.

Sidney Siegel, _Nonparametric Statistics for the Behavioral Sciences_ ,
McGraw-Hill, New York, 1956.

The last is also just hypothesis testing; apparently the book is still
referenced.

From some statistical consultants inside GE, there is

Gerald J. Hahn and Samuel S. Shapiro, _Statistical Models in Engineering_ ,
John Wiley & Sons, New York, 1967.

For machine learning, so far the approach appears to be cases of empirical
_curve fitting_.

The linear case, versions of regression analysis, remains of interest. There
the matrix theory I outlined is central. And regression analysis goes way
back, 100 or so years, and is awash in uses of statistical hypothesis tests
for the regression (F-ratio), individual coefficients (t-tests), and
confidence intervals on predicted values.
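A minimal numpy sketch of that linear case (synthetic data from a made-up
model; this is ordinary least squares with per-coefficient t-statistics, not
anything specific to the texts above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a made-up model: y = 1 + 2*x + Gaussian noise
n = 100
x = rng.uniform(0.0, 10.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)

X = np.column_stack([np.ones(n), x])     # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta
s2 = residuals @ residuals / (n - 2)     # residual variance estimate
cov = s2 * np.linalg.inv(X.T @ X)        # covariance of the coefficient estimates
t_stats = beta / np.sqrt(np.diag(cov))   # t-statistic per coefficient
```

The normal equations behind `lstsq` involve the symmetric non-negative
definite matrix X'X, which is where the matrix theory above comes in.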

For using probabilistic and statistical methods in the non-linear fitting,
that is more advanced and likely an active field of research.

In my own work, I use and/or create new statistical methods when and as I need
them. My current work has some original work in, really, applied probability
that also can be considered statistics. And when I was doing AI at IBM, I
created some new mathematical statistics hypothesis tests to improve on what
we were doing in AI. Later I published the work. Parts of the machine learning
community might call that paper a case of supervised learning, but the paper
is theorems and proofs in applied probability that give results on false alarm
rates and detection rates, that is, statistical hypothesis tests (both multi-
dimensional and distribution-free). So, that paper can be regarded as relevant
to machine learning, but the prerequisites are an undergrad pure math major,
about two years of graduate pure math with measure theory and functional
analysis, a course in what Leo Breiman called _graduate probability_, and some
work in stochastic processes. At one point I use a classic result of S. Ulam
that the French probabilist Le Cam called _tightness_ -- there is a nice
presentation in

Patrick Billingsley, _Convergence of Probability Measures_ , John Wiley and
Sons, New York, 1968.

The work was for anomaly detection in complex systems so could be regarded as
for both engineering and machine learning. But I can't give all the background
here.

------
faizshah
Another of his lecture notes on Theoretical Stats was posted today:
[https://news.ycombinator.com/item?id=21468493](https://news.ycombinator.com/item?id=21468493)

------
joker3
This looks like it's roughly at the level of Allan Gut's intermediate
probability book
(https://www.amazon.com/Intermediate-Course-Probability-Springer-Statistics/dp/1441901612).
There's a need for more exposition at that level,
but I'm not sure that the course notes themselves are a good source.

------
Ragib_Zaman
I wish I had searched for something like this when I was revising for some
interviews last month! This is a great quick overview of intro probability.
Thanks for sharing OP.

~~~
Ragib_Zaman
Homework problems and notes split into individual lectures available here:

https://www.coursehero.com/sitemap/schools/234-University-of-California-Berkeley/courses/7342063-STAT201A/

