I took calc BC back in high school and now I'm 31. I realize to a certain extent that many things I believe today are based on a rudimentary understanding of probability theory. Aren't all clinical trials, scientific studies, etc., validated by that 95% confidence interval? And I notice that whenever something happens and my human nature wants to draw a conclusion of causality, that cold, rational part of my brain tells me to wait until I have more data.
Anyway I'd love to get into what I feel might be the most fundamental mathematical concept. Probability theory might be our modern day religion.
I especially liked the chapter on sufficient statistics -- the best I've seen, and in my experience not all professors of statistics know this material well. IIRC there is a paper of E. Dynkin that shows that sufficient statistics are not very stable -- I'm not sure yet that the OP covers this.
For your question, it's a good introduction and foundation for work in or that uses probability, statistics, and stochastic processes. Of course, in each case, especially the last two, there is more, e.g., stochastic optimal control. And maybe not all the more recent work in resampling, what Leo Breiman did (used in machine learning, etc.), stochastic differential equations, connections with potential theory, etc. are covered.
But, to answer your question, you sort of need to know what's going on, what the lay of the land is, and that's not so easy to see from the common discussions. I'll try here:
Random Variables: A key, core idea is that of a random variable. So, go out, observe something, get a number, go back. You now have the value of a random variable, call it X. Then X will have a cumulative distribution: For real number x, function F_X(x) = P(X <= x). Here F_X is supposed to be F with a subscript X. So, we use F for cumulative distribution and put the subscript X on it to indicate we're talking about the cumulative distribution of random variable X. The cumulative distribution is simple -- just look at the P(X <= x) part and see that as x increases, that thing grows, cumulatively. So, right, as x increases from -infinity to infinity, F_X grows from 0 to 1, the 1 of certainty.
In the usual cases, X will have an average or expectation, E[X], sometimes -infinity or infinity but usually a finite number. Not all random variables have an expectation -- some goofy, pathological cases don't, but usually don't encounter those in applications.
In nice cases, can take the calculus first derivative (slope) f_X(x) = d/dx F_X(x), and that is the probability density of random variable X. So, the Gaussian bell curve is such a density.
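To make F_X and f_X concrete, here is a minimal sketch in plain Python. It assumes a standard Gaussian X simulated with the standard library; the helper name empirical_cdf, the sample size, and the step h are all just choices for illustration, not anything from the OP.

```python
import bisect
import math
import random


def empirical_cdf(samples):
    """Return a function F with F(x) = fraction of samples <= x."""
    ordered = sorted(samples)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted samples are <= x
        return bisect.bisect_right(ordered, x) / n

    return F


random.seed(0)  # fixed seed so the run is repeatable
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]
F = empirical_cdf(samples)

# The slope of F near 0 approximates the density f_X(0).
h = 0.1
density_at_0 = (F(h) - F(-h)) / (2 * h)

# For a standard Gaussian the exact density at 0 is 1 / sqrt(2 * pi).
exact_density_at_0 = 1.0 / math.sqrt(2.0 * math.pi)
```

The estimate density_at_0 should land close to exact_density_at_0, and F climbs from 0 to 1 as x goes from far left to far right.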
Random variables -- that's the data you work with in probability, statistics, and stochastic processes.
Foundations: For many decades, people had lots of heartburn over the mathematical foundations of probability theory. That was cleared up in 1933 by a paper by A. Kolmogorov, right "father of modern probability theory". Here Kolmogorov used the more fundamental mathematics of measure theory as the mathematical foundations of probability theory. For some decades now, nearly all the more serious work in probability, statistics, and stochastic processes (call those PSSP) has been done using the measure theory foundations. But, often don't need to see the foundations so don't need to confront measure theory.
Measure Theory: You remember calculus, especially the integration part where you find the area under a curve. You did this by partitioning the X axis and getting tall, thin rectangles that under and over approximated the curve. Then you let the width of the widest rectangle go small to zero and took the limit of the areas of the rectangles, the common limit of the over estimate and the under estimate, as the definition of the integral of the curve you started with. Fine. Has worked great quite broadly in pure and applied math, science, engineering, etc. Was invented by Newton and Leibniz but made more precise by B. Riemann and others.
By about 1900, E. Borel and others saw some rough edges with the Riemann integral and cleaned them up. The result is measure theory. Here really measure is just another name for simple old area (length, volume, etc.). The main guy involved was H. Lebesgue, a student of Borel in France. The clean up was important for some math theorem proving, e.g., about sines and cosines in Fourier theory, now used in analyzing signals. The integral in measure theory gives the same numerical values (answers) as the Riemann integral when both integrals exist. But Lebesgue's work has some nicer theorems about convergence. "Exist"? Right, it's easy enough to cook up pathological functions where the Riemann integral does not exist, usually because the areas of the outer and inner rectangles don't converge to the same number. Okay, the first, obvious example is the function that is 1 at each rational number and 0 otherwise. Right, it's pathological.
In simplest terms, what Lebesgue did was do the partitions on the Y axis instead of the X axis. So, right, Lebesgue's partitions resulted in, first cut, horizontal rectangles instead of vertical ones. For a curve that is positive, Lebesgue's rectangles only underestimate -- he drops consideration of the over estimation. So, as you will notice, Lebesgue's rectangles get chopped up as the curve goes below some of the horizontal rectangles. For when a curve goes negative, Lebesgue treats that separately and, then, subtracts from the positive part. Still, not every curve has an integral -- if both the positive and negative parts have infinite area, then, nope, Lebesgue defines no answer. Still, Lebesgue's approach is better. Really, if we avoid some absurd uses of infinity, it's darned tricky to come up with a function that does not have a Lebesgue integral. And what Lebesgue did also assigns area in a consistent way to more subsets of the real line; a subset of the real line that does not have a Lebesgue measure is really tricky, e.g., the usual examples need the axiom of choice. Net, Lebesgue's stuff is powerful, nicely better than what Riemann did. But for where the Riemann integral is working well, no reason to bother changing to Lebesgue.
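Here is a rough numerical sketch of the two ways of slicing, for a positive function on an interval. The function names riemann_integral and lebesgue_integral are made up for this illustration, and the "measure" of each level set is only approximated by counting grid points, which is an assumption of the sketch (the real Lebesgue measure needs no grid).

```python
import bisect


def riemann_integral(f, a, b, n=10_000):
    """Partition the X axis: sum tall, thin rectangles (midpoint rule)."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


def lebesgue_integral(f, a, b, levels=1_000, n=10_000):
    """Partition the Y axis for f >= 0: stack horizontal slabs of height dt.

    The slab just below level t has width measure{x : f(x) >= t}, so every
    slab lies under the curve and the sum only underestimates.
    The measure of each level set is approximated by counting grid points.
    """
    dx = (b - a) / n
    values = sorted(f(a + (i + 0.5) * dx) for i in range(n))
    dt = values[-1] / levels
    total = 0.0
    for k in range(1, levels + 1):
        t = k * dt
        # number of grid points where f(x) >= t, times dx, approximates the measure
        count = n - bisect.bisect_left(values, t)
        total += dt * count * dx
    return total


def square(x):
    return x * x


riemann_value = riemann_integral(square, 0.0, 1.0)
lebesgue_value = lebesgue_integral(square, 0.0, 1.0)
```

For x^2 on [0, 1] both values come out near 1/3, with the horizontal-slab version slightly lower, as an underestimate should be.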
PSSP: Well, since Lebesgue is partitioning on only the Y axis, he doesn't partition on the X axis and the X axis, that is, the domain of the function, can be something really abstract, very general, where would have no way to partition the darned thing. Okay, presto, bingo, call the X axis part, the domain of the function, an abstract measure space, and use Lebesgue's integral to integrate the function.
For the abstract measure space, just need a definition of area (measure). On the real line, Lebesgue usually used just ordinary length as the measure (or the start of his definition of measure).
Okay, for PSSP and random variables: Have a set of trials. One of these trials is when you do an experiment and observe values of random variables. A set of these trials, an event, has an area, a measure, a probability. So, right, probability is just an area or Lebesgue's idea of area. Then apply Lebesgue's work and get the integral of a random variable, and that's its expectation, E[X], its average.
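Written out, the expectation as a Lebesgue integral looks like the following. The symbols \Omega (the set of trials) and \omega (one trial) are my notation here, not necessarily the OP's; F_X and f_X are the cumulative distribution and density from above, and the last form applies only when the density exists.

```latex
E[X] \;=\; \int_{\Omega} X(\omega)\, dP(\omega)
     \;=\; \int_{-\infty}^{\infty} x \, dF_X(x)
     \;=\; \int_{-\infty}^{\infty} x \, f_X(x)\, dx
```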
As you will see in the OP, the set of all events is assumed to be a sigma algebra -- that's so that you can also consider, as you want to, for events A and B, event A and B, event A or B, event not A, etc.
So, net, you went for a walk in the basement with Kolmogorov and Lebesgue to come up with a mathematical foundation for events, probabilities, random variables, and expectation E[X].
Now you know.
The rest of the OP is just ordinary PSSP, and a nice treatment. So, you can skim the measure theory foundations, or study them carefully if you wish, and then move on to the more ordinary parts.
Yes, some of the more advanced parts touch on the measure theory foundations. The measure theory stuff is good; we wouldn't want to be without it; sometimes even for applications it's good; but for day to day in PSSP mostly we don't see the measure theory part.
Secret Situation: For that trial, it turns out that we are assuming that in all the universe we see only one trial. Most of the elementary approaches to PSSP like to regard each observation as a trial -- if think a little too much, then that approach doesn't work very well.
For the measure theory foundations, for a positive integer n, a sample of size n, that is, what we usually average to estimate the expected value, is not some n trials but n random variables X_1, X_2, ..., X_n that, maybe, are independent and have the same distribution (independent and identically distributed, i.i.d.).
Uh, should insert here, the sigma algebra stuff gives us a super nice generalization of independence -- we can define what it means for infinitely many random variables to be independent. How? Briefly, using the inverse functions of the random variables, get some sigma algebras of events, and then work with the elementary definition of independent events, e.g.,
P(A and B) = P(A)P(B)
This generalization of independence gets to be crucial for stochastic processes.
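A tiny exact check of the elementary definition, as a sketch: the sample space is two fair dice (my choice of example), and the events are inverse images of sets under the random variables "first die", "second die", and "sum". Names like omega and prob are just for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
omega = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event, given as a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))


def A(w):
    return w[0] % 2 == 0  # first die is even


def B(w):
    return w[1] >= 5  # second die shows 5 or 6


def C(w):
    return w[0] + w[1] >= 10  # sum is at least 10


p_a_and_b = prob(lambda w: A(w) and B(w))
p_a_and_c = prob(lambda w: A(w) and C(w))

# A and B depend on different dice, so the product rule holds exactly.
a_b_independent = p_a_and_b == prob(A) * prob(B)
# A and C both involve the first die, so the product rule fails.
a_c_independent = p_a_and_c == prob(A) * prob(C)
```

Here P(A and B) = 1/6 = (1/2)(1/3), while P(A and C) = 1/9 is not (1/2)(1/6) = 1/12.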
The measure theory approach gets more serious when considering sufficient statistics, a neglected subject, and stochastic processes.
When you get a little farther into PSSP, you will find that there are two biggies -- independence and correlation (really a cosine and much the same as inner product, that is, in physics, dot product, and co-variance). If random variables X and Y are independent, then they have correlation (if it exists) and co-variance 0, and so is their inner product once the means have been subtracted. Those concepts are biggies because typically what we do is observe some random variable X and try to use it to say something about some random variable Y we don't have (e.g., what Google's stock will be selling for tomorrow). If X and Y are independent, then X is never any help at all, ever. Otherwise we have a shot.
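The "correlation is a cosine" remark can be seen in a few lines. This is a sketch on small made-up data vectors (x, y, z are hypothetical): subtract the means, then take the dot product over the product of the lengths, exactly as for the cosine of the angle between two vectors.

```python
import math


def mean(v):
    return sum(v) / len(v)


def correlation(x, y):
    """Sample correlation = cosine of the angle between the centered vectors."""
    mx, my = mean(x), mean(y)
    cx = [a - mx for a in x]  # subtract the means first
    cy = [b - my for b in y]
    dot = sum(a * b for a, b in zip(cx, cy))  # inner (dot) product
    norm_x = math.sqrt(sum(a * a for a in cx))
    norm_y = math.sqrt(sum(b * b for b in cy))
    return dot / (norm_x * norm_y)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]    # y = 2x, so the centered vectors are parallel
z = [1.0, -1.0, -1.0, 1.0]  # centered z is perpendicular to centered x

corr_xy = correlation(x, y)
corr_xz = correlation(x, z)
```

Parallel centered vectors give correlation 1 and perpendicular ones give 0. Note the direction of the implication: independence forces correlation 0, but correlation 0 alone does not prove independence.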
Generally we are trying to use a sequence of random variables to approximate what we want. So, we care about how a sequence can converge to what we want. The OP discusses the important cases of convergence -- the most important case is convergence in L^2 or co-variance or mean-square or least squares. Right, in that case often get a generalization of the Pythagorean theorem.
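One standard form of that Pythagorean theorem, as a sketch: take Y in L^2 and let \hat{Y} be its best mean-square approximation within some closed linear subspace of L^2 (the symbol \hat{Y} and this setup are my notation, assumed for illustration). The error Y - \hat{Y} is perpendicular to \hat{Y}, and the squared lengths add:

```latex
E\big[\hat{Y}\,(Y - \hat{Y})\big] = 0
\quad\Longrightarrow\quad
E\big[Y^2\big] \;=\; E\big[\hat{Y}^2\big] \;+\; E\big[(Y - \hat{Y})^2\big]
```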
Statistics: The shortest description of statistics is that we observe some random variable X, manipulate it with some function u, and get result random variable U = u(X) which hopefully approximates something we want to know. If E[U] is exactly the right value of what we want to know, then the statistical estimator u(X) is unbiased. Can also consider minimum variance, maximum likelihood, etc. So, here consider the quality of our statistical estimation.
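To see "unbiased" without any simulation noise, here is a small exact sketch. It assumes a made-up distribution (each of the values 1, 2, 6 with probability 1/3) and i.i.d. samples of size 2, then enumerates every possible sample and averages the estimator over all of them. The names var_estimate and ddof are only illustrative.

```python
from fractions import Fraction
from itertools import product

values = [1, 2, 6]  # hypothetical distribution: each value has probability 1/3
n = 2               # sample size


def mean(xs):
    return sum(Fraction(x) for x in xs) / len(xs)


mu = mean(values)                                # true expectation
sigma2 = mean([(v - mu) ** 2 for v in values])   # true variance


def var_estimate(xs, ddof):
    """Variance estimator u(X): divide by len(xs) - ddof."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)


# All 9 equally likely i.i.d. samples (X_1, X_2).
samples = list(product(values, repeat=n))

expected_sample_mean = mean([mean(s) for s in samples])
expected_var_unbiased = mean([var_estimate(s, ddof=1) for s in samples])
expected_var_biased = mean([var_estimate(s, ddof=0) for s in samples])
```

The sample mean and the divide-by-(n-1) variance hit the true values exactly on average; dividing by n instead gives (n-1)/n times the true variance, so that estimator is biased.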
Hopefully this introduction will let you make use of the OP. The elementary stuff there will be fast and easy for you. Full understanding of all of that material would be about three semesters of a graduate course taken three times -- no joke. Take the course just once and can say that you have "seen it".
Nearly all the OP's important references are quite old -- this material has not changed much in decades. Then, other treatments, also decades old, that should be helpful include the famous texts by Neveu, Breiman, Chung, and Loeve. For the measure theory background, texts by Rudin and Royden are standard. So, you can learn a lot of analysis, functional analysis, Banach and Hilbert spaces, Fourier theory, etc. Biggie connection: The set of all L^2 random variables forms a Hilbert space -- amazing, astounding, powerful, valuable, and true.
I started that course a few years ago, but never finished, but really liked it. Wondering if they cover the same, if I should do both, or one instead of the other.
Hopefully this means I can pass...
(Didn't look too carefully at the content, but it looks good too.)
"Display of mathematical notation is handled by the open source MathJax project."
I've also tried the interactive examples in Safari Technology Preview and Firefox Developer Edition, and they work OK. :)