Suppose some dark night a policeman walks down a street, apparently deserted; but suddenly he hears a burglar alarm, looks across the street, and sees a jewelry store with a broken window. Then a gentleman wearing a mask comes crawling out through the broken window, carrying a bag which turns out to be full of expensive jewelry. The policeman doesn't hesitate at all in deciding that this gentleman is dishonest. But by what reasoning process does he arrive at this conclusion? Let us first take a leisurely look at the general nature of such problems.
A moment's thought makes it clear that our policeman's conclusion was not a logical deduction from the evidence; for there may have been a perfectly innocent explanation for everything. It might be, for example, that this gentleman was the owner of the jewelry store and he was coming home from a masquerade party, and didn't have the key with him. But just as he walked by his store a passing truck threw a stone through the window; and he was only protecting his own property. Now while the policeman's reasoning process was not logical deduction, we will grant that it had a certain degree of validity. The evidence did not make the gentleman's dishonesty certain, but it did make it extremely plausible. This is an example of a kind of reasoning in which we have all become more or less proficient, necessarily, long before studying mathematical theories. We are hardly able to get through one waking hour without facing some situation (e.g. will it rain or won't it?) where we do not have enough information to permit deductive reasoning; but still we must decide immediately what to do.
I sometimes joke that Sherlock Holmes is really all about "The Science of Abduction", not deduction; a rather more threatening tagline.
(Representation) Degrees of plausibility pl(A|B) can be identified as unique real numbers p ∈ R; and
(Qualitative Correspondence With Common Sense; Logical Internal Consistency) This has three sub-Principles:
– If there is more than one path to a correct conclusion, all such paths must lead to the same plausibility result;
– In assessing pl(A|B), You must always use all of the available information that You regard as relevant to the assessment; and
– Equivalent states of information about (A|B) always lead to the same pl(A|B).
Apart from minor typos that make him a "sloppy mathematician", he is a good educator based on my personal experience taking grad. courses from him. He's not the passionate high school STEM teacher type, but he offers great insights. I think which textbook he used is somewhat secondary. For courses he had taught before, he usually picks a standard text and teaches based on the material he wrote in his blog posts, including exercises.
So for example, given the question: what is the probability that, when throwing 4 coins, exactly 2 of them will be heads? I could write a function that generates all possible outcomes "TTTT", "TTTH", etc. And I could write a filter function that returns true for "TTHH", "THTH", "THHT", etc.
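The brute-force enumeration described above can be sketched in a few lines of Python (a toy illustration):

```python
from itertools import product

# Enumerate all 2^4 = 16 equally likely outcomes of four coin flips.
outcomes = ["".join(flips) for flips in product("HT", repeat=4)]

# "Filter function": keep the outcomes with exactly two heads.
two_heads = [o for o in outcomes if o.count("H") == 2]

probability = len(two_heads) / len(outcomes)
print(probability)  # 6/16 = 0.375
```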
The point is that even for fairly simple real use cases, the computational complexity is so huge that, if you stick to naive algorithms and don't employ some approximation or optimization, all the computers in the world couldn't finish the computation in your lifetime.
So, that is what the whole field of machine learning is about: finding some clever ways to deal with random variables in a computationally feasible way...
It doesn't produce a "closed-form formula", but, well, those tend to scale poorly to programming-sized tasks anyhow.
let P = probability of 2 heads given 4 coins
let omega = (x + y)^n    # expansion over the outcome space
let x = 50%              # P(heads)
let y = 50%              # P(tails)
let (n, k) = (4, 2)      # 4 coins, 2 heads
P = C(n, k) * x^k * y^(n - k)
  = 6 * 25% * 25%        # C(4, 2) = 6; see link
  = 37.5%
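The same closed-form calculation, as a minimal Python sketch using the standard library's math.comb for the binomial coefficient:

```python
from math import comb

n, k = 4, 2              # 4 coins, want 2 heads
p_heads = p_tails = 0.5  # fair coin

# P(k heads in n flips) = C(n, k) * p^k * (1 - p)^(n - k)
P = comb(n, k) * p_heads**k * p_tails**(n - k)
print(P)  # 6 * 0.25 * 0.25 = 0.375
```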
> “Information theory must precede probability theory and not be based on it.” A. N. Kolmogorov, in [Kolmogorov, 1983].
and it was a combinatorics paper
As for what Tao wrote on that page about determinism or whatever -- just f'get about that. And where he wrote something confusing a sample space and a probability space, just say that he had a bad day that day.
In particular, sending people to Loeve for their first course in measure theoretic probability would be really cruel.
Personally, I learned from Feller, and Billingsley, and finally Durrett.
On most of the topics, Durrett seems to have less than any of the four texts I mentioned.
> In particular, sending people to Loeve for their first course in measure theoretic probability would be really cruel.
I listed the authors in reverse order in which to read them! The easiest start is the last, Breiman. But some things in Loeve are good, e.g., sufficient statistics. Neveu is my favorite as the most elegant, but his exercises are the most difficult. Breiman and Chung are fine. IIRC, both Breiman and Neveu were Loeve students.
> Personally, I learned from Feller, and Billingsley, and finally Durrett.
I used Feller I and/or II for reference. Otherwise his writing seemed to lack an overall unification.
The only Billingsley text I used, and then only for a small topic, was Convergence of Probability Measures.
At this point I see a blog post written by a well-respected mathematician whom I feel comfortable trusting, and it's being brushed aside by I don't know who.
The Amazon UK reviews of Chung's book lead me to A Probability Path by Sidney Resnick, which appears to be aimed at non-mathematicians and in which I have invested (speaking as a renegade physicist lacking a systematic exploration of measure theory).
Some googling around suggests his students are quite pleased with him. He runs, according to Wikipedia's intro for him, "the undisputed king of math blogs." Both of these point to him being at least a good or above-average educator.
I learned from a star student of
E. Cinlar, long at Princeton.
We used Royden, Rudin,
Neveu, and Chung, and there
were some nice topics in the
course not in any of those
texts, e.g., the Lindeberg-Feller
version of the central limit
theorem, a really nice, astounding,
result on an envelope for
Brownian motion, more on ergodic
theory, some on additive processes,
and more. Super nice course.
Could you expand on this? I'm reading his post now, but from the outside, it seems unlikely that Terry got the concepts completely wrong, and more likely that he's using slightly different definitions for the concepts than you are expecting.
That can be a real pain for
students trying to learn.
But, be warned: A lot of
people work with probability,
but only a tiny fraction ever
had a course in graduate
probability. So, eventually you have to learn to put up with, and sometimes rewrite, some of what is written that is not quite right.
Here are the accepted definitions:
Take a non-empty set, usually
denoted by Omega, and call it the
sample space or the set of trials.
Take a collection of subsets of
Omega, usually denoted by script F,
and call it the set of events.
Have at least enough subsets of
Omega in the set of events script F
so that script F will be a sigma algebra.
So, script F has to have as an element
the empty set,
be closed under relative complements,
and be closed under countable unions.
We want all that for probability
theory. We want countably
infinite unions so that we can discuss,
say, the event that a coin never
comes up heads. We don't want
uncountably infinite unions because
they would create a big mess
in the theory.
Then the ordered pair (Omega, script F)
is called a measurable space -- it
doesn't have a measure yet, but it is ready for one.
On that measurable space, define
a positive measure P so that
P( Omega ) = 1.
Then P is a
probability measure on the
measurable space (Omega, script F).
A probability space is
the triple ( Omega, script F, P ).
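For a finite sample space, the definitions above can be made concrete in code; the example below (two coin flips, power-set sigma algebra, uniform measure) is an illustrative toy, not the general theory:

```python
from itertools import chain, combinations

# Finite sample space: the four outcomes of two coin flips.
Omega = {"HH", "HT", "TH", "TT"}

# Script F: the power set of Omega, which is always a sigma algebra.
F = [frozenset(s) for s in chain.from_iterable(
    combinations(sorted(Omega), r) for r in range(len(Omega) + 1))]

# Sigma-algebra axioms (finite unions suffice on a finite Omega).
assert frozenset() in F                           # has the empty set
assert all(frozenset(Omega - A) in F for A in F)  # complements
assert all(A | B in F for A in F for B in F)      # unions

# A probability measure: each outcome equally likely.
def P(A):
    return len(A) / len(Omega)

assert P(frozenset(Omega)) == 1.0
print(P(frozenset({"HH", "HT"})))  # P(first flip heads) = 0.5
```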
A measure assigns to each
event A in script F a
non-negative real number P(A).
A signed measure permits
negative values. And
of course at times we want
a measure that yields complex values.
Measure theory is in, say,
P. Halmos, Measure Theory.
The back of Halmos has a
really nice introduction to
probability theory, the
Kolmogorov three series
theorem in stochastic processes,
etc. For more in stochastic
processes there is, of course,
the classic J. Doob, Stochastic
Processes. Halmos was a Doob
student and then served as
an assistant to von Neumann
at the Institute for Advanced
Study. From that he wrote
Halmos, Finite Dimensional
Vector Spaces, a finite
dimensional introduction to
Hilbert space. Later at
University of Chicago
Halmos worked in mathematical
statistics and made the
fundamental contribution to
sufficient statistics.
IIRC, it was at Chicago that
Halmos wrote Measure Theory.
Great details on measure theory
and much more
are in, say, H. Royden, Real
Analysis and W. Rudin, Real
and Complex Analysis.
To narrow measure theory
to probability theory,
use the famous texts by the
authors I listed, Loeve, ....
Royden and Rudin are
terrific prerequisites to those texts.
Commonly the sigma algebra of
events script F is the smallest
sigma algebra that contains
as a subset a topology, that is,
a collection of open sets,
on the sample space Omega.
That's all standard, beginning
stuff in graduate probability --
ah, the only kind one should bother with.
Who should I read for an introduction to that "big mess"?
So many of the cases I'm actually interested in require real-valued (continuous) inputs and outcomes. While one can quantize these to create an approximation of something countable, it seems like a much simpler theory would be possible if it were built from the ground up to handle these common non-discrete real-world cases, rather than trying to shoe-horn them into standard probability theory. I was hoping this might be the direction that Terry was headed, with the emphasis on probabilistic methods.
Here's where it gets less basic: say you want to look at temperature over time, but you don't want to model temperature as a variable that's measured daily, or hourly, or even every second (secondly?), but you want to model it as a process that evolves in continuous time. That's where the theory gets messy. Not necessarily at the level of a user of this theory, but definitely at the level of proving that the math you want to use is allowed.
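A toy sketch of the discretization issue just described: a random walk with steps scaled by sqrt(dt) approximates Brownian motion, and that scaling is exactly what keeps the variance at time 1 fixed as the time grid is refined (walk_endpoint is a made-up name for illustration):

```python
import random

# Approximate a continuous-time process (Brownian motion) by a random
# walk on a grid of n_steps points over [0, 1].  Steps of size
# sqrt(dt) keep Var(W(1)) = 1 no matter how fine the grid -- the
# scaling that makes a continuous-time limit exist at all.
def walk_endpoint(n_steps):
    dt = 1.0 / n_steps
    w = 0.0
    for _ in range(n_steps):
        w += random.choice((-1.0, 1.0)) * dt ** 0.5
    return w

random.seed(0)
for n in (10, 1000, 100000):
    print(n, walk_endpoint(n))
```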
If you actually need an introduction to that sort of probability, Lawrence Evans (Berkeley) has some old lecture notes aimed at undergrads that he turned into a book. If (more likely) you want standard measure-theoretic probability theory (as opposed to what's taught to undergraduates), David Pollard's book is pretty good.
I'm sure that @graycat will scoff at those recommendations, but his reading list would be considered excessively hardcore and time consuming even for a graduate student in math, which I'm assuming you're not.
ps: after looking at it again, the intro in Evans's notes is as gentle as it's going to get, so start there. And (as you'll find out, unless you're some sort of savant) this shit's hard. If you actually want to understand this stuff, graduate coursework is probably the only practical way to do it.
Pr(A | B) = Pr(A and B) / Pr(B)
instead of just telling it to students and drawing a few diagrams that drive the point home.
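That identity is easy to check by brute force; here is a small sketch with two dice (the choice of events A and B is arbitrary):

```python
from itertools import product

# Check Pr(A | B) = Pr(A and B) / Pr(B) by enumerating two dice.
# A: the two dice sum to 8; B: the first die shows 6.
rolls = list(product(range(1, 7), repeat=2))
B = [r for r in rolls if r[0] == 6]
A_and_B = [r for r in B if sum(r) == 8]

pr_B = len(B) / len(rolls)              # 6/36
pr_A_and_B = len(A_and_B) / len(rolls)  # 1/36, only (6, 2)
print(pr_A_and_B / pr_B)                # 1/6 ≈ 0.1667
```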
But I think it's more pragmatic than ad hoc. Any deep theory of probability that doesn't deliver the standard rules is basically useless, since that's how random phenomena seem to behave in real life. Having a deeper theory is useful because it allows you to derive other implications of that theory and makes certain calculations much easier. But if the theory disagrees with phenomena that we want to model, that can be a problem.
Probability and stochastic
processes based on measure
theory are not very popular
in the US, even in graduate school.
Uh, scoff, scoff. Okay?
The full measure theoretic details
of stochastic processes in
continuous time can be a bit
of a challenge. That topic can
be important, e.g., for Brownian
motion and stochastic differential
equations used in mathematical
finance. Of course, there is
Karatzas and Shreve,
Brownian Motion and Stochastic Calculus
and Chung and Williams,
Introduction to Stochastic Integration.
And there's much more, especially from
Russia and France.
But, otherwise, usually in practice,
what people are interested in is either
(1) second order stationary stochastic
processes, e.g., as in electronic or
acoustical signals and noise. There
people are commonly interested in power spectral
estimation, digital filtering, maybe
Wiener filtering, the fast Fourier transform,
etc. or (2) what is in, say, Cinlar,
Introduction to Stochastic Processes.
There, for the continuous time case,
you get a good introduction to
the Poisson process (the vanilla
arrival process, e.g.,
like clicks at a Geiger counter,
at a Web site, and much more).
What else people are mostly interested
in in practice: Markov processes
in discrete time with a
discrete state space (that is, the
values are discrete).
The case of Markov processes in
continuous time and discrete state
space is not so tough if the
jumps are driven by just a Poisson
process. But there is still more to it.
And there are other good texts on the subject.
For (1), look at some of the
texts used by EEs. The measure
theory approach is in
Doob, Stochastic Processes,
Loeve, Probability Theory,
and several more texts by quite good authors.
E.g., without measure theory,
you can just dive in via
Blackman and Tukey, The Measurement
of Power Spectra ....
With all these sources,
you are able to get by without measure theory.
Yes, without measure theory,
at some places you will have to
not ask to understand too much
and just skip over some details
to get back to the applied stuff.
But for measure theory, the
Durrett text seems to get
a student to that unusually quickly.
For more, at the MIT Web site,
there is an on-line course
in mathematical finance that
avoids measure theory.
They want to use Ito integration but
still avoid measure theory.
Uh, the Radon-Nikodym theorem
is a generalization of the
fundamental theorem of calculus.
Once you see it, it's dirt simple,
but a good proof takes a bit of work;
or follow von Neumann's proof
that knocks it all off in one
stroke (it's in
Rudin, Real and Complex Analysis).
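On a finite space the Radon-Nikodym idea reduces to something very concrete: if Q assigns zero probability wherever P does, the "derivative" dQ/dP is just a pointwise ratio. A toy sketch with made-up numbers:

```python
# Finite-space sketch of the Radon-Nikodym idea: Q has a "density"
# dQ/dP with respect to P, and Q(A) is recovered by summing that
# density against P -- a discrete analogue of recovering a function
# from its derivative.
X = ["a", "b", "c"]
P = {"a": 0.5, "b": 0.3, "c": 0.2}
Q = {"a": 0.25, "b": 0.45, "c": 0.3}

density = {x: Q[x] / P[x] for x in X}  # dQ/dP, a pointwise ratio

A = {"b", "c"}
q_of_A = sum(density[x] * P[x] for x in A)
print(round(q_of_A, 10))  # 0.75, i.e., Q("b") + Q("c")
```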
It's just that you can't take uncountably
many events, take their union,
and assume that the result is
also an event.
And you can have each point on the
real line an event. Fine.
But you just can't take
the uncountable union of
any set of such events and assume
that the result is also an event.
As for the event a random variable
takes a value >= 0? Fine.
Or, let the Borel subsets of the
real line be the smallest sigma algebra
that contains all the open sets,
e.g., all the open intervals.
Then for Borel set A and
real valued random variable X,
we can ask for the probability
that X is in A.
I believe you will find that
you will have a solid foundation
for what you want.
To see all this stuff, you need more
than just the sparse definitions
and, instead, need an actual
text and maybe a course. Recently
I looked at the on-line materials
from MIT and didn't see such a course.
Graduate probability is not
all that popular in the US;
stochastic processes in continuous
time is still less popular.
To study graduate probability,
I'd recommend a good
undergraduate major in pure math
with good coverage of, say,
W. Rudin, Principles of Mathematical
Analysis. Then good coverage of
linear algebra from more than
one of the best known texts.
Likely also spend as much time
as you can in Halmos, Finite
Dimensional Vector Spaces.
E.g., at one time, Halmos,
Rudin, and Spivak, Calculus
on Manifolds were the three
main texts for Harvard's
famous Math 55.
Get good with proving the theorems.
I also recommend Fleming,
Functions of Several Variables.
Then, sure, Royden, Real Analysis.
Couldn't be prettier.
If not in a hurry, then
the real half of Rudin's
Real and Complex Analysis.
Especially if you like Fourier analysis.
Then of the probability books,
I believe that the nicest, first
book is L. Breiman, Probability.
He wrote that before he went
consulting and came back and
did CART and random forests.
Next, K. Chung,
A Course in Probability Theory.
Next, J. Neveu, Mathematical
Foundations of the Calculus of Probability.
Then, Loeve, Probability Theory.
Loeve is huge -- mostly just
use it for reference or browse.
E.g., it has sufficient statistics
and stationary stochastic processes
(the EEs love that) that IIRC
are not in the other texts.
IIRC, both Breiman and Neveu
were Loeve students at Berkeley.
If you do well with Breiman, then
for graduate probability,
you likely can stop there.
Else, Chung will then be
fast and easy reading and
reinforce what you learned in
Breiman. Neveu is elegant;
my favorite, but you deserve extra
credit for each workable exercise
you can find, not actually work,
you understand, just find! Sure,
some of the exercises are terrific,
half a course in a few lines of
an exercise. E.g., he has one
of those on statistical decision
theory or some such. And see the
Tulcea material in the back.
Then there's more that you can
do on stochastic processes,
potential theory via Brownian motion,
e.g., for mathematical finance,
stochastic optimal control, and more.
I hope all this studying will pay off, then. I have Fremlin vol. 1 and vol. 2 sitting underneath my text while I handle the prerequisite material; hopefully the journey will pay off.
Measure theory can be used with a lot of stuff I guess.
Here measure is essentially
just a synonym for area, essentially
just ordinary area.
Measure theory addresses both
differentiation and integration
of freshman calculus,
but most of the focus is on integration.
So, in freshman calculus, you
are given, say, a real valued function
of a real variable, say, function
f where, for real numbers x, we have,
f(x) = x^2 + 1
Then, say, we graph f and
want the area under the curve for
x in the interval [0,10]. Okay,
you saw that more than 10,000 times
in freshman calculus.
Well, in this case, measure theory
will give the same answer for the
area under the curve. The difference
is how that area is calculated.
Here is the shortest description
of where measure theory is different:
As you recall, in freshman calculus
you found the area under the curve
by picking points on the X axis,
that is, partitioning the X axis
into little intervals, on each interval
building a rectangle that approximated
the area under the curve over that
little interval, inserting more points in the
partition so that the longest of
the little intervals had its length
get as small as we pleased, adding up
the areas of the rectangles, and
taking the limit. That was it. That
was the definition of what the
integral was. Of course, to integrate
x^2 + 1 you learned about, for any
constant C, the anti-derivative
(1/3)x^3 + x + C
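That freshman-calculus recipe is easy to carry out numerically; here is a midpoint Riemann sum for f(x) = x^2 + 1 on [0, 10], whose exact value from the antiderivative is 1000/3 + 10:

```python
def f(x):
    return x * x + 1

# Freshman-calculus style: partition the X axis, build rectangles
# (midpoints here), and add up their areas.
def riemann(f, a, b, n):
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) * dx for i in range(n))

exact = 10**3 / 3 + 10  # from the antiderivative (1/3)x^3 + x
approx = riemann(f, 0.0, 10.0, 100_000)
print(exact, approx)
```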
So, here's what measure theory does:
Yup, it also works with a partition
but, in this case, the partition is
on the Y axis instead of the X axis.
Then for each little interval on the
Y axis, we get a horizontal bar and
look at the parts that are under the
curve. Then as we add points to the
partition, we add up the ordinary
areas of the relevant parts of the
horizontal bars. The picture
is less nice than in freshman calculus
but, still, no biggie.
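The horizontal-bar picture can also be carried out numerically: the area is the integral over levels t of the ordinary length of the set where f exceeds t. For this f that length has a closed form (specific to f(x) = x^2 + 1 on [0, 10], an assumption of this sketch, not a general recipe):

```python
from math import sqrt

# "Horizontal slicing": area = integral over t of
# length({x in [0, 10] : f(x) > t}) dt.
def level_measure(t):
    if t < 1:
        return 10.0              # f > t on all of [0, 10]
    if t > 101:
        return 0.0               # f never exceeds 101 there
    return 10.0 - sqrt(t - 1.0)  # f(x) > t iff x > sqrt(t - 1)

n = 200_000
dt = 101.0 / n
area = sum(level_measure((j + 0.5) * dt) * dt for j in range(n))
print(area)  # close to 1000/3 + 10 ≈ 343.33
```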
Now, how would one do, say, numerical
integration? Sure: The same
as in freshman calculus, say,
the trapezoid rule, where we work with
little trapezoids. Nope, measure
theory is not suggesting that we
change how numerical integration is done.
Here are four features of the
approach of measure theory:
(1) There are some goofy, pathological
functions that get the integration
theory (the Riemann integral) of
freshman calculus all confused.
E.g., consider the function 0 on
the rational numbers and 1 otherwise.
Then the upper and lower
Riemann sums of those little
rectangles never converge to what
we want. Bummer. Well, it follows
from some theorems that the integral
of measure theory rarely gets confused --
of course the theorems are very precise,
but the generality is mind blowing, so
much that it's darned tricky to
construct a case where the measure
theory integral fails.
(2) A big deal is what happens
when a sequence of functions
is used to approximate and converge
to another function. A leading
example is Fourier series. Well,
on taking limits during this
approximation, the Riemann integral
can get confused when the measure
theory integral (Lebesgue) does just fine.
H. Lebesgue was a student of
E. Borel in France and did
his work near 1900.
(3) Often we want to pass some
limits under the integral sign.
Again, Lebesgue does much better
here than Riemann. Indeed,
the Lebesgue integral has
a super nice theorem on
differentiation under the
integral sign (from the
TOC of Durrett, that theorem
may be the last topic in that
book -- it was a really fun
exercise when I was studying).
(4) Notice a biggie: With
Lebesgue, actually we used
next to nothing about
the X axis, that is, about
the domain of the function
we are integrating. In this
way, right away, presto, bingo,
we get a theory of integration
that works on domains
with much, much less in assumptions
and properties than the real
numbers or the usual finite
dimensional real Euclidean
vector space. In particular,
we are now GO for doing
probability theory --
the Lebesgue integral is
used to define both probability
of events and expectation of
random variables. It was
A. Kolmogorov in 1933 who
noticed, and wrote a monster paper,
on how to use measure theory
and the Lebesgue integral
for a solid foundation for
probability theory. Since then
essentially all serious
research in probability and
stochastic processes, and
nearly all of stochastic optimal
control, has been based solidly
on the Kolmogorov,
i.e., measure theory, foundations.
So, to a mathematician not
much interested in probability,
probability theory is just
a special case of measure theory
where the total area (measure)
is just 1. That's not literally
or logically wrong
but it does discard a fantastic
baby with the bathwater.
Some of the results in
probability are just astounding
and powerful.
So, in measure theory, here
is what a measure (a definition
of area) is:
We start with a non-empty set,
a space, say, X. From
X we have some subsets we
call measurable sets.
So, measurable set A is a
subset of X. In the special
case of probability,
the measurable sets are
the events, that is, e.g.,
all the trials where
our coin comes up heads.
We ask to have enough measurable
sets so that all of them
form a sigma algebra. Why?
Otherwise we don't have much.
A sigma algebra doesn't ask
for much. The sigma part is
supposed to suggest
finite or countably infinite
adding up, as in the usual use of
the capital Greek sigma for sums.
Say our sigma algebra of
subsets of our measurable
space X is S (probability
theory usually uses script F).
Then we want the empty subset
of X to be an element of S.
For A in S we want X - A (the
relative complement) to be
an element of S. And for
B(i) in S for i = 1, 2, ...,
we want the union of
all the B(i) to be an element of
S. These conditions ensure that
we will have enough sets
in S to have a decently strong theory.
In what we commonly
do in applied probability,
we wouldn't settle for less.
E.g., the empty set is the
event it never happens.
If H is the event of heads,
then T = X - H, the relative
complement, is the event tails.
If H(i) is the event that
flip i comes up heads, then
the union of all the H(i)
is the event the coin comes
up heads at least once or
doesn't always come up tails.
In probability, those are all
simple situations, and
just from those we need
a sigma algebra of events.
And it turns out, that's enough,
and has been since 1933.
So, a measure, say,
m, assigns to each measurable set A
a real number m(A),
the measure (think area
or, in probability theory, the
probability) of A. Of course
in probability we call the
measure P instead of m
and write P(A) for the probability
of event A.
You can begin to see that we
are essentially forced into
how Kolmogorov applied measure
theory whether we like it or not.
Sorry 'bout that!
Well, for a measure
m, we want countable additivity:
for disjoint measurable
sets B(i), i = 1, 2, ...,
m(union B(i)) = sum m(B(i))
(forgive some sloppy math notation
since I can't type TeX here!)
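On a finite space, countable additivity reduces to finite additivity, which is easy to check directly (uniform measure on a 12-point space; the disjoint sets are chosen arbitrarily):

```python
from fractions import Fraction

# For pairwise disjoint events, the measure of the union equals the
# sum of the measures.  Fraction keeps the arithmetic exact.
X = set(range(1, 13))

def m(A):
    return Fraction(len(A), len(X))

B1, B2, B3 = {1, 2, 3}, {7, 8}, {12}
assert not (B1 & B2 or B2 & B3 or B1 & B3)  # pairwise disjoint
assert m(B1 | B2 | B3) == m(B1) + m(B2) + m(B3)
print(m(B1 | B2 | B3))  # 1/2
```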
Usually m(A) is real with
m(A) >= 0, and commonly we
need to allow m(A) = infinity,
e.g., m(X) = infinity for
Lebesgue measure on the real line.
We can also extend to let m(A) be
any real number or any complex number.
Measure theory is the
total cat's meow for such generality.
To get a sigma algebra
of measurable sets we want,
commonly we start with
a topology, that is,
its collection of open
sets, and ask for the
unique smallest sigma
algebra for which each open
set is also a measurable
set in the sigma algebra.
When we do this on the real
line and assign
the measure of intervals
their ordinary length
and extend that to
as many subsets of the
reals as we can, we get
Lebesgue measure for the
real line. We get a lot
of sets! It's a tricky
exercise, that uses the
axiom of choice, even to
construct a subset of the
reals that is not Lebesgue
measurable. Powerful theory!
Suppose we have spaces X
and Y, each with a sigma
algebra of subsets and
f: X --> Y
Then f is measurable
if for each measurable
subset B of Y
f^(-1)(B) is also
a measurable subset
of X. In measure
theory, when we integrate
a function, we ask that
it be measurable.
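On finite spaces with power-set sigma algebras, the preimage condition is easy to state and check in code; a toy sketch where the function counts heads in two coin flips:

```python
# f is measurable when the preimage of every measurable subset of Y
# is a measurable subset of X.  With power-set sigma algebras, every
# subset is measurable and any f qualifies; the point here is just
# to see the preimage condition itself.
X = {"HH", "HT", "TH", "TT"}

def f(omega):        # number of heads: a random-variable-like map
    return omega.count("H")

def preimage(B):
    return {x for x in X if f(x) in B}

print(sorted(preimage({2})))     # ['HH']
print(sorted(preimage({0, 1})))  # ['HT', 'TH', 'TT']
```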
In the usual cases, it's
even tough to construct
a function that is not
measurable. And nearly any limit of
measurable functions is
also measurable -- super convenient.
In probability theory,
a random variable
is just a measurable
function whose domain is
a probability space,
that is, a sample space
Omega with a sigma
algebra of subsets
script F and
a probability measure P.
Of course, there's much
more, stacks, shelves,
racks of books
as difficult as you wish,
but the above is
a simple, intuitive view
from 10,000 feet up.
Or, measure theory is
a nicer theory of
area and area under a curve
that in all the simple cases
gives you just what
you have been used to!
This matches the definitions of Wikipedia pretty closely: "A probability space consists of three parts: A sample space [...], A set of events [...], The assignment of probabilities to the events".
So either you misunderstood that sentence or both Wikipedia and Terence Tao are wrong.
is, in a word, wrong. The sample
space will be part of a probability
space but will not "be a probability space".
Some intuitive overviews, clearly
labeled as such, are fine and can
be helpful, but "idiomatic language"
just is not. You won't find such in
the writings of W. Rudin, P. Halmos,
J. Neveu, or any of a long list
of authors of some of the best
math books. In a good math text or
journal, a reviewer or the editor
would likely reject "idiomatic"
language. The definitions I gave are the accepted ones.