
Probability Theory for Scientists and Engineers - kawera
https://betanalpha.github.io/assets/case_studies/probability_theory.html
======
dbranes
This is excellent. Amazing (equation + picture)/text ratio.

Complaint: you don't define your notion of "space". In chapter 1 it's some
informal notion that you use to motivate the definition of a set (??); in 1.3
and 1.4 it becomes clear that by "space" you mean "set". Then later you start
talking about the dimension of spaces, implying that not only do they come
with a topology, they have a well defined dimension -- so a locally Euclidean
Hausdorff space or something -- but maybe you just mean R^n.

Comment for other commentators in this thread: not all exposition is tailored
to the masses. That a piece of pedagogical literature doesn't appeal to your
background doesn't mean it's not good. There's a very clear need for
exposition on the basic structures of probability theory, and this fits there.

~~~
Eldandan
It's like the definition of "set" given here should really be a subset, and
the definition of "space" should be a set. Maybe just say a set is any
collection of objects?

~~~
alehul
Agreed!

A sample space would be relevant in probability theory, and it's often what
you calculate for your denominator, but this definition is rather vague.

------
joker3
I'm not really sure who the intended audience is here. There's a lot of
material covered very briefly in a very short space, and not enough details
that anyone who doesn't already know it would be able to pick up anything
substantive.

~~~
vecter
Seriously. Content like this is only useful for people who already know
probability.

~~~
tempay
Do you know of anything which can help people get over the hurdle to know
enough to use this content?

For me it's only worked when colleagues explained concepts to me as they were
needed; after several occurrences of this, everything finally started to make
sense and I could then make use of material like this.

~~~
vecter
There are no shortcuts with math. If you really want to learn it, you must be
willing to put in a large number of hours over a long period of time in order
to master it. Are you willing to do that?

------
nafizh
This is an excellent book by L. V. Tarasov on probability.

[https://archive.org/details/TheWorldIsBuiltOnProbability](https://archive.org/details/TheWorldIsBuiltOnProbability)

I have always found Russian math writers to be on point, not going too far
over your head while still respecting the reader's intelligence. If you like
it, then you will love his calculus book; that one is also a real gem.

~~~
sAbakumoff
Wow, that's a pretty expensive book: [https://www.amazon.com/World-Built-Probability-L-Tarasov/dp/...](https://www.amazon.com/World-Built-Probability-L-Tarasov/dp/5030011242)

~~~
nafizh
Yeah, his books are out of print. I have been looking for a hard copy of his
calculus book. Really expensive.

~~~
sAbakumoff
Thanks for the reference anyway! I downloaded the Russian version as a PDF
and am enjoying reading it :-) What was great about the USSR was its level of
popularization of science. All those books written for kids or high school
students were amazing.

~~~
cema
Yes, and a number of them are now available for free over the internet. In
Russian...

------
egonschiele
Echoing other comments here, this seems like a hard way to start learning
probability. It sounds like the goal is to make probability easier to
understand based on what you say here
([https://betanalpha.github.io/writing](https://betanalpha.github.io/writing))

> In this case study I attempt to untangle this pedagogical knot to illuminate
> the basic concepts and manipulations of probability theory and how they can
> be implemented in practice

But I think this is too hard. I really loved "Probability For The Enthusiastic
Beginner" [http://a.co/2kp5PZd](http://a.co/2kp5PZd)

~~~
gh02t
"For Scientists and Engineers" sounds to me like it's targeting people who
already have a strong background in more advanced mathematics, but not
necessarily probability and measure theory. If so I think this is a decent way
to go about it.

I'm an engineer and sometimes mathematician who works with fairly in-depth
probability theory, and this looks to be a condensed version of a lot of the
basic stuff I had to self-learn when I was getting into what I work on now.
I'm in a niche area though, and I do wonder if this really is that useful to
most scientists and engineers.

~~~
soVeryTired
Reading 'All of Statistics' by Wasserman would be a much better bet than
diving deep into measure-theoretic probability, though.

~~~
gh02t
Depends on your goal; statistics and probability theory are separate (though
of course related) fields with different applications. For me, I really
needed the measure-theoretic bits because I was (am) working on modeling
ergodic processes. This article honestly doesn't go into enough detail to be
especially useful, but I like the direction the author approaches it from.

I'm familiar with the text you mentioned; it's certainly good and would be
better than this article for most, but comparing a textbook to a short web
article isn't exactly fair.

------
baryphonic
The goal is worthy, but the product is inadequate, to say the least. This
thing is littered with typos, and enough of the exposition is irrelevant or
incorrect that it ends up unintuitive. That said, I like the graphics and
layout.
layout.

For example, when he discusses power sets in order to introduce sigma
algebras, he implies that a sigma algebra is a better-behaved alternative to a
power set. However, a power set is always itself a sigma algebra (after all,
even the power set of an uncountable set is still closed under complements
and countable unions).
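
That closure is easy to check mechanically for a small finite space. A
throwaway Python sketch of my own (purely illustrative; the interesting cases
are uncountable, but these are the same closure properties):

    from itertools import chain, combinations

    def power_set(xs):
        """All subsets of xs, as frozensets."""
        xs = list(xs)
        return {frozenset(c) for c in chain.from_iterable(
            combinations(xs, r) for r in range(len(xs) + 1))}

    X = frozenset({1, 2, 3})
    P = power_set(X)

    assert frozenset() in P and X in P            # contains empty set and X
    assert all(X - A in P for A in P)             # closed under complements
    assert all(A | B in P for A in P for B in P)  # closed under (finite) unions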

Later, when discussing probability distributions, he writes:

> [W]e want this allocation [of a conserved quantity] to be self-consistent –
> the allocation to any collection of disjoint sets, A_n ∩ A_m = 0, n ≠ m,
> should be the same as the allocation to the union of those sets,
> ℙ_π[∪_{n=1}^{N} A_n] = ∑_{n=1}^{N} ℙ_π[A_n].

The condition `A_n ∩ A_m=0, n≠m` is actually incorrect, since A_n and A_m are
sets and 0 is an integer. The author means the empty set, but typo'd.

He frequently uses words like "conserved" or "well-defined" without giving us
a clue as to what they mean. In what context are probabilities "conserved"?
What distinguishes "well-defined" from "not well-defined"?

I'm a software engineer. A non-trivial amount of my time is devoted to reading
code and finding bugs. Sloppy reasoning, inconsistencies and outright errors
like that are big red flags to me. It doesn't help that the whole section on
sigma algebras is somewhat irrelevant, since he doesn't really explore measure
theory as the basis for modern probability.

IMO a better resource is the series of "Probability Primer" videos from
mathematicalmonk on YouTube[1]. He does an excellent job (IMO) of covering all
pertinent pre-requisites and being mostly rigorous without necessarily proving
every single fact or exhaustively covering all edge and corner cases. He also
makes a good effort to recommend advanced (and rigorous) treatments of the
subject (and ancillary ones like measure theory). A readable version of this
YouTube series would be a great resource, and if Michael Betancourt is
reading, I'd encourage him to pursue that in his next iteration of this
product.

[1] [https://www.youtube.com/playlist?list=PL17567A1A3F5DB5E4](https://www.youtube.com/playlist?list=PL17567A1A3F5DB5E4)

~~~
soVeryTired
> It doesn't help that the whole section on sigma algebras is somewhat
> irrelevant, since he doesn't really explore measure theory as the basis for
> modern probability.

Christ, practically all of measure theory is irrelevant for applied work, in
much the same way that an engineer shouldn't care about the definition of a
real number.

There's a model of real analysis due to Solovay that uses the axiom of
dependent choice instead of the full axiom of choice. In the Solovay model,
_all sets of reals are measurable_. Thus any result that _requires_
non-measurable sets inherently depends on the axiom of choice.

I'd be worried if I were relying on Choice as an applied scientist.

Edit: same goes for Lebesgue vs. Riemann integration. To quote Richard
Hamming: _Does anyone believe that the difference between the Lebesgue and
Riemann integrals can have physical significance, and that whether say, an
airplane would or would not fly could depend on this difference? If such were
claimed, I should not care to fly in that plane._

~~~
psoy
It matters very much in computational / quant finance

~~~
soVeryTired
It matters by convention, because the textbooks are written that way.

My point is that _you don't need that level of formal rigour_ to do applied
work. You can derive the Feynman-Kac formula via a scaling limit of discrete-
time Markov chains. Add some Lévy processes (e.g. compound Poisson processes)
and you're basically done.
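
To make the scaling-limit point concrete, here's a minimal sketch of my own
(not from any text): a simple random walk with n steps, scaled by 1/sqrt(n),
converges in distribution to W_1, so plain walk-based Monte Carlo already
reproduces Brownian expectations like E[max(W_1, 0)] = 1/sqrt(2*pi):

    import math
    import random

    def scaled_walk_endpoint(n_steps):
        # Endpoint of a simple +-1 random walk, scaled by 1/sqrt(n) so it
        # converges in distribution to W_1 (standard Brownian motion at t = 1).
        return sum(random.choice((-1, 1)) for _ in range(n_steps)) / math.sqrt(n_steps)

    random.seed(0)
    reps = 50_000
    est = sum(max(scaled_walk_endpoint(100), 0.0) for _ in range(reps)) / reps
    print(est, 1 / math.sqrt(2 * math.pi))  # Monte Carlo vs. closed form ~0.3989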

If you want to be ultra-rigorous in your definitions, then you need measure
theory, yes. But even Einstein didn't need that for his description of
Brownian motion. If a scaling limit was good enough for him, it's good enough
for me.

------
mayankkaizen
I confused this with the book "Probability and Statistics for Engineers and
Scientists" by Anthony Hayter and I got excited.

I am kind of a beginner in Machine Learning and was struggling badly with
basic probability and statistics concepts. I went through so many resources
and somehow none of them clicked. Then I stumbled upon this book and realized
it was exactly the kind of book I needed. It assumes no prior knowledge and
is very heavy on examples. Other books just dive into jargon- and symbol-
laden theory without giving simple examples or building concepts from the
ground up.

I mentioned this because I feel someone might benefit from this suggestion.

------
nl
Wow, this seems like a particularly hard way to learn probability.

One thing I noticed about myself as I did more and more work with probability
is that I started thinking in terms of distributions a lot more.

These days I find it very difficult to think without using them. In just about
everything I do now I tend to think about moving probability mass around.

------
graycat
Here's my non-standard, nutshell, IMHO advice on using probability theory:

(1) Random Variables. Go outside. Observe a number. Then that is the value of
a _random variable_. For something to be a random variable, the number does
not need to be _random_ in the sense of unpredictable. As for the phrase
and/or criterion "truly random", mostly f'get about it, but we return to that
for the subject of random number generation below. So, net, your data, all
your data, are the values of random variables.

(2) Distributions. Sure, each random variable has a distribution. And there is
the Gaussian, uniform, binomial, exponential, Poisson, etc. distributions.

Sometimes in practice you can use some assumptions to conclude that a random
variable has such a known distribution; this is commonly the case for
exercises about flipping coins, rolling dice, and shuffling cards.

For another example, suppose customers are arriving at your Web site. Well,
maybe the number of arrivals since noon has stationary (over time)
independent increments -- maybe you can confirm this just intuitively. Then,
presto, bingo, the arrivals form a Poisson process, and the times between
arrivals are independent, identically distributed exponential random variables
-- see E. Cinlar, _Introduction to Stochastic Processes_. Further, since you
might be willing to assume that the arrivals come from many users acting
independently, the _renewal_ theorem says that the arrivals will be
approximately Poisson, more accurately for more users -- see W. Feller's
second volume.
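
A quick simulation of that superposition/renewal claim (my own sketch, with
made-up numbers): many users who each show up once, independently and
uniformly over a long window, produce arrivals whose gaps behave like i.i.d.
exponentials -- and for an exponential, the mean equals the standard
deviation:

    import random
    import statistics

    random.seed(1)
    n_users, horizon = 10_000, 1000.0
    # Each user arrives once, independently, uniformly over the window.
    arrivals = sorted(random.uniform(0, horizon) for _ in range(n_users))
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]

    # For an exponential distribution, mean == standard deviation.
    print(statistics.mean(gaps), statistics.stdev(gaps))  # both ~ 0.1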

Sometimes the central limit theorem can be used to justify a Gaussian
assumption.

Still, net, in practice, mostly we don't and can't know the distribution. To
have much detail on the distribution of one variable takes a lot of data; the
joint distribution of several variables takes much more data; the amount of
data needed explodes exponentially with the number of joint variables. So,
net, don't expect to know or find the distribution.

Often you will be able to estimate the mean and variance, etc., but not the
whole distribution. So you usually need to proceed without knowing
distributions. In simple terms: Distributions -- do they exist? Yup. Can we
find them? Nope!

(3) Independence. Probability theory is, sure, part of math, but, really, the
hugely important, unique feature is the concept of independence.

One of the main techniques in applied math is divide and conquer. Well, an
independence assumption is exactly what lets you so divide.

Independence? A simple criterion for practice: suppose you are given random
variables X and Y. You are even given their probability distributions (but
NOT their _joint_ probability distribution). Then X and Y are independent if
and only if knowing the value of one of them tells you nothing more than you
already know about the value of the other one.

The hope here is that often in practice you can check this criterion just
intuitively from what you know about the real situation. E.g., does a
butterfly flapping its wings in Tokyo tell you more about tomorrow's weather
in NYC? My intuitive guess is that this is a case of independence, which
means that for predicting NYC's weather tomorrow, we can just f'get about
that butterfly.

(4) Conditioning. For random variables X and Y, we can form the conditional
expectation of Y given X, E[Y|X]. Such conditioning is the main way X tells
you about Y. There is a function f with f(X) = E[Y|X], and f(X) is the best
(generally non-linear) least squares estimate of Y given X. Note that
E[E[Y|X]] = E[Y], which means that E[Y|X] is an _unbiased_ estimate of Y.
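
A crude numerical illustration (my own sketch; the binning is just the
quickest stand-in for a proper regression): estimate f(X) = E[Y|X] by
averaging Y within narrow bins of X, then check the tower property
E[E[Y|X]] = E[Y]:

    import random
    import statistics
    from collections import defaultdict

    random.seed(2)
    n = 100_000
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [x * x + random.gauss(0, 0.5) for x in xs]  # here E[Y|X] = X^2

    # Crude estimate of f(X) = E[Y|X]: average Y within narrow bins of X.
    bins = defaultdict(list)
    for x, y in zip(xs, ys):
        bins[round(x, 1)].append(y)
    f = {b: statistics.mean(v) for b, v in bins.items()}

    # Tower property E[E[Y|X]] = E[Y]: averaging f(X) recovers the mean of Y.
    print(sum(f[round(x, 1)] for x in xs) / n, statistics.mean(ys))  # both ~ 1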

(5) Correlation. If you don't have independence, then you will likely use the
Pearson correlation -- it works like the cosine of an angle. If random
variables X and Y are independent, then their Pearson correlation coefficient
is 0 -- the proof is an easy exercise straight from the basic definition and
properties of independence.
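
One caution worth a sketch: the converse fails, i.e., zero Pearson
correlation does not imply independence. E.g., with X standard Gaussian and
Y = X^2, Y is completely determined by X yet their correlation is ~0 (my own
throwaway example):

    import random

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
        vx = sum((x - mx) ** 2 for x in xs) / n
        vy = sum((y - my) ** 2 for y in ys) / n
        return cov / (vx * vy) ** 0.5

    random.seed(6)
    xs = [random.gauss(0, 1) for _ in range(100_000)]
    ys = [x * x for x in xs]  # Y is a deterministic function of X
    print(pearson(xs, ys))    # ~0, even though X and Y are totally dependent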

(6) The Classic Limit Theorems. Pay close attention to the central limit
theorem (CLT) and the weak and strong laws of large numbers (LLN). The CLT is
the main reason we get a Gaussian distribution, and the LLN is the main reason
we take averages.
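
Both theorems are easy to watch in a simulation (a minimal sketch of mine,
using uniforms, whose variance is 1/12):

    import math
    import random
    import statistics

    random.seed(3)

    def mean_of_uniforms(n):
        return sum(random.random() for _ in range(n)) / n

    # LLN: the sample mean settles down to E[U] = 0.5 as n grows.
    for n in (10, 1000, 100_000):
        print(n, mean_of_uniforms(n))

    # CLT: the standardized mean of n uniforms looks Gaussian.
    n, reps = 100, 20_000
    sd = math.sqrt(1 / 12) / math.sqrt(n)             # Var(uniform) = 1/12
    zs = [(mean_of_uniforms(n) - 0.5) / sd for _ in range(reps)]
    print(statistics.mean(zs), statistics.stdev(zs))  # ~0 and ~1
    print(sum(abs(z) < 1 for z in zs) / reps)         # ~0.683, the Gaussian value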

(7) Random Number Generation. A sequence of random numbers should look, for
some practical purposes, like a sequence of random variables that are all
independent and uniformly distributed on [0,1]. Are they "truly random"?
Maybe not. But if they are, then they are independent and identically
distributed (i.i.d.) uniform on [0,1] -- and that's all there is to it; you
don't have to struggle to say or understand more.

~~~
pacala
Probability theory expositions, especially for [software] engineers, would be
better served if they were well typed. What is the type of a random variable,
of E[Y|X], of E[E[Y|X]]? Hint: a random variable is not a scalar, but rather
a function, the probability distribution.

~~~
mturmon
Hmm, a random variable (in the sense of measure theory, as in OP) is indeed a
function - but it's not a probability distribution.

An R.V. is a measurable function from the sample space into the reals. A
probability distribution is a function assigning probabilities to measurable
sets, formally, a function from the sigma-algebra into [0,1].

So in particular, an R.V. (like a Gaussian one) can take on negative values.
A probability distribution cannot.

Also, the domain of the R.V. is the sample space. But the domain of the
probability distribution is the sigma-algebra over that sample space.

~~~
graycat
A distribution is a real valued function of a real variable. The domain of the
function is the whole real line.

Note: Below, borrowing from D. Knuth's TeX, we use the underscore character
'_' to denote the start of a subscript.

Details: For real valued random variable X, probability measure P, and the set
of real numbers R, the _cumulative distribution_ of X is the function F_X: R
--> R where, for x in R, F_X(x) = P(X <= x).

If F_X is differentiable, then the _probability density_ of X is the real
valued function of a real variable f_X: R --> R where, for all x in R,
f_X(x) = d/dx F_X(x), where d/dx is the calculus derivative.
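
For a concrete instance, take X standard Gaussian: F_X is the normal CDF and
f_X its derivative, the normal density. A quick finite-difference check (my
own sketch):

    import math

    def F(x):  # standard normal CDF, via the error function
        return 0.5 * (1 + math.erf(x / math.sqrt(2)))

    def f(x):  # standard normal density
        return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

    # f_X(x) = d/dx F_X(x): check with a central finite difference.
    h = 1e-5
    for x in (-1.0, 0.0, 2.0):
        print((F(x + h) - F(x - h)) / (2 * h), f(x))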

For the connections with sigma algebras, that is more advanced than most
engineers care about, but here are some of the details:

For real numbers a and b with a < b, there is the _open interval_

(a,b) = {x|a < x < b}

A topology on R is a collection of subsets regarded as _open_ that satisfies
the axioms for a topology -- the sets in a _topology_ are closed under finite
intersections and arbitrary unions, and both R and the empty set are open.
The _usual topology_ on R is the smallest (a short argument shows that this
"smallest" is well defined) topology that has each open interval as an
element -- right, the topology regards the open intervals as open.

The usual reason to discuss a topology is to have a means of defining
continuous functions, one more general than the usual "for each epsilon
greater than zero, there exists a delta greater than zero such that ..." or
limits of sequences. Indeed, there are advanced situations where we can use
topologies to define continuous functions where epsilon-delta and converging
sequences don't work. If curious, look up Moore-Smith convergence, nets, and
filters, or just Kelley, _General Topology_.

Well, a _sigma algebra_ is like a topology in that it is a collection of
subsets: a sigma algebra is closed under countable unions and complements.
Right, we avoid uncountable unions because otherwise we would get stuck in a
big mud hole. It is an early exercise that there are no countably infinite
sigma algebras.

The reason for sigma algebras is to permit defining a _measurable_ function,
that is, one where we can apply the Lebesgue integration theory. The integral
of calculus is due to B. Riemann and is the _Riemann_ integral. W. Rudin,
_Principles of Mathematical Analysis_, shows that a continuous real valued
function with domain a _compact set_ (closed and bounded, where a _closed_
set is the complement of an open set) has a Riemann integral. Well, in this
case, the
Lebesgue integral gives the same numerical answer -- same thing. The advantage
of the Lebesgue approach is that the function can be even bizarre and its
domain can be much more general. Indeed, in probability theory, expectation is
just the Lebesgue integral. In simple terms, Riemann partitioned on the X
axis, and Lebesgue partitioned on the Y axis.

Well, given the usual topology on R, we can ask for the smallest sigma algebra
on R that has the topology as a subset. That sigma algebra is the _Borel_ sets
of R. Uh, Lebesgue was a student of E. Borel. In Rudin will find the Heine-
Borel theorem.

So, in probability theory, we have a _sample space_. Each point in the sample
space is a _trial_, i.e., essentially a real world experimental trial (note:
really our attitude is that in all the universe we see only one such trial --
if this seems far out, then blame the Russians, e.g., A. Kolmogorov, E.
Dynkin, etc.!). Well, an _event_ is a subset of the sample space, that is, a
set of trials. So, flip a coin. Let H be the event that the coin comes up
heads, that is, the set of all trials where the coin comes up heads.

Well, to apply Lebesgue's theory of integration, we want the set of all events
to be a sigma algebra.

Then a _probability_ measure is a measure in the sense of Lebesgue's measure
theory, that is, a real valued function, in the case of probability taking
values in [0,1], with domain the sigma algebra of events. So, for the
event H, we can ask for the probability of H, that is, P(H), which is a number
in [0,1]. For a _fair coin_ tossed by an honest member of the FBI we have P(H)
= 1/2.

Then a real valued _random variable_ X is just a real valued function with
domain the sample space and also _measurable_ : This part about being
_measurable_ is that for each Borel set A, a subset of R, the set of all
trials w so that X(w) is in A is an event, that is, an element of the sigma
algebra on the set of trials. That is, the inverse image under X of the Borel
sets are events, elements of the sigma algebra on the sample space.

So, with X measurable in this way, we have a near perfect shot at defining the
expectation of X, E[X]. For this we have a little two step dance:

First we look at X^+ ('^' denotes a superscript), the _positive_ part of X:
X^+ is X where X >= 0 and 0 otherwise, i.e., X^+ = max(X, 0). Similarly the
_negative_ part is X^- = max(-X, 0), that is, -X where X < 0 and 0 otherwise.
So both X^+ and X^- are >= 0, and X = X^+ - X^-.
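
The decomposition, and the E[X] = E[X^+] - E[X^-] formula below, are easy to
sanity check numerically (a throwaway sketch of mine):

    import random

    def pos(x): return max(x, 0.0)   # X^+
    def neg(x): return max(-x, 0.0)  # X^-

    random.seed(4)
    xs = [random.gauss(0, 1) for _ in range(100_000)]

    # X = X^+ - X^- pointwise ...
    assert all(abs(x - (pos(x) - neg(x))) < 1e-12 for x in xs)

    # ... and E[X] = E[X^+] - E[X^-] when both pieces are finite.
    n = len(xs)
    print(sum(xs) / n, sum(map(pos, xs)) / n - sum(map(neg, xs)) / n)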

Well, we can use Lebesgue's theory to integrate X^+ and X^-. Biggie stuff: X
need only be measurable, and that admits lots of really wildly bizarre
functions. We've got great generality, and that's good to have in various
limiting arguments. Uh, we like limiting arguments because they are our main
way to approximate, which is our main way to being healthy, wealthy, and
wise!

So the Lebesgue integral of X^+ we write as E[X^+]. Similarly for X^-. Now no
way do we want to be subtracting one infinity from another since permitting
that would trash the usual laws of arithmetic.

So, for our second step, if at least one of E[X^+] and E[X^-] is finite, then
we define E[X] = E[X^+] - E[X^-].

Now we've defined expectation ("average") of a real random variable X. Our
definition is just the Lebesgue integral. For the Lebesgue integral, we wanted
the sigma algebras.

On the real line we can consider the sigma algebra of Lebesgue measurable
sets; that's larger than the Borel sets. Then we just ask, assume, assert,
believe, ..., that our random variables are measurable from the sigma algebra
of events to the sigma algebra of Borel (or Lebesgue) sets. Uh, right,
Lebesgue measure on R assigns measure b - a to the interval (a,b) and extends
from there. Fine details are in various texts by Rudin, Royden, etc.

That's the beginning of the role of sigma algebras in advanced approaches to
probability, statistics, and stochastic processes. It turns out the sigma
algebra approach is, for several parts of what we want in probability, much
nicer, e.g., for defining independence and conditional expectation. E.g., if
we want to know that some set of uncountably infinitely many random variables
is independent, we can. Same for conditioning on uncountably infinitely many
random variables, e.g., the past history of a stochastic process.

------
ginnungagap
I don't see the point of introducing sigma-algebras if you're not doing
probability based on measure theory.

As others have said I wouldn't suggest this exposition to someone learning
probability for the first time, but it's not as bad if you're familiar with
the material and need a quick review.

------
nicbou
> The set of all sets in a space, X, is called the power set, P(X). The power
> set is massive and, even if the space X is well-behaved, the corresponding
> power set can often contain some less mathematically savory elements.
> Consequently when dealing with sets we often want to consider a restriction
> of the power set that removes unwanted sets.

I wish people could teach math in plain English. I don't know why the math and
physics world refuses to write for the reader. I took this class before, and I
still don't know what the author means by "less mathematically savory
elements".

Here's how you explain things to humans:

> There is a set called the power set that contains all the sets in a space.
> This set is huge, and it contains [less mathematically savory elements].
> This is why we usually use a restricted version that removes the unwanted
> sets.

Seriously, there's no point to this sort of fancy language. Math is already
hard. No need to make it harder.

~~~
aje403
"There is a set called the power set that contains all the sets in a space"

I don't think he's the best expositor and some of his terminology is crappy,
but I understood the author from what I've read so far. I literally have no
idea what you're trying to say; this has no meaning

~~~
nicbou
There is a set. It's called the power set. It contains all the sets in a
space.

~~~
aje403
The terms space and set are not synonymous. A space is defined on a set.

The collection of all subsets of a set X is called the power set and is
typically denoted as P(X).

Replacing something confusing but loosely understandable with something even
more confusing isn't improving anything.

------
tree_of_item
I thought this was really nice, strange to see that so many people dislike it.

------
ak_yo
I find that this guide unhelpfully conflates probability and inference in a
few places. Probability theory on its own is interesting but not terribly
useful without the infrastructure of estimation.

------
mlevental
I think these are mistakes:

>2.5 Conditional Probability Distributions As we saw in Section 3.4,

>It turns out that in this case a σ-algebra on Z naturally defines a σ-algebra

------
madengr
NO NO NO!!! Don't start with Venn diagrams, sets, and other such fluff.
Reminds me of the thin little book they tried sticking on us in my
probability class, undergrad EE. It was meant for math majors.

There is a book, "Probability and Statistics for Engineers and Scientists" by
Raymond Walpole. That book is excellent. Rolling dice and pulling colored
marbles from jars is how you teach probability.

~~~
noufalibrahim
I studied probability during my undergrad (and high school) using dice, coins
and other such things. It made sense to me but there was a dark area in my
understanding. It felt like a blind spot and I could never get into it. In the
final year of engineering, we had someone do a quick refresher on probability
as a prelude to a longer course on pattern recognition and he described the
whole thing using set theory (Venn diagrams, functions mapping from one space
to another, etc.) and I felt that the blind spot was illuminated. So, I don't
know if starting from there would make sense, but I do think it's useful, at
least at some point in your studies, to look at the whole system through this
lens.

I've been working through
[http://www.greenteapress.com/thinkbayes/](http://www.greenteapress.com/thinkbayes/)
and am quite enjoying it. My only complaint is that he, as intended, teaches
using programs and a computer, and I learn better by doing stuff by hand. He
also has a Think Stats book at
[http://www.greenteapress.com/thinkstats/](http://www.greenteapress.com/thinkstats/)
which people might find interesting.

~~~
graycat
There is a good connection between probability and Venn diagrams: Both are
about area. Probability is about area where the area of everything under
consideration is 1. So, there is a set of _trials_. It has area 1. Each subset
of the set of trials is an _event_ and has an area, its _probability_. Then we
can move on to random variables, distributions of random variables,
independence of events and random variables, the event that a random variable
has value <= some real number x, etc.
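
That "probability is area, with total area 1" picture is easy to play with
numerically. A minimal Monte Carlo sketch of my own: take the unit square as
the set of trials and the quarter disk as an event; the event's probability
is its area, pi/4:

    import math
    import random

    random.seed(5)
    n = 1_000_000
    hits = 0
    for _ in range(n):
        # A "trial" is a point in the unit square, which has total area 1.
        u, v = random.random(), random.random()
        # The "event" is the subset of trials landing in the quarter disk.
        if u * u + v * v < 1:
            hits += 1
    print(hits / n, math.pi / 4)  # probability of the event = its area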

In pure math, since H. Lebesgue in about 1900, the usual good theory of area
is Lebesgue's _measure theory_. The ordinary ideas of area we learned in grade
school, plane geometry, and calculus are all special cases. But Lebesgue's
theory of area handles some bizarre, pathological, extreme cases. And we can
show that there can be no really _perfect_ theory of area -- e.g., there have
to be some bizarre subsets of the real line to which no nice theory of area
can assign a length. But, once we have the Lebesgue theory, the usual way to
show that there is a subset of the real line without an area uses the axiom of
choice.

Well, in 1933, A. Kolmogorov wrote a paper showing how Lebesgue's theory of
area would make a solid foundation for probability, and that approach is the
standard one for advanced work in probability, statistics, and stochastic
processes.

