Fair point, though with no intention of being too literal, might this qualify as the difference between "Introduction to X" vs. "X"?
In case anyone else is interested: measure-theoretic probability theory "unifies the discrete and the continuous cases, and makes the difference a question of which measure is used. Furthermore, it covers distributions that are neither discrete nor continuous nor mixtures of the two." 
Or, to provide another perspective: The field of stochastics is bigger than just statistics. It includes probability theory as its foundation and then applies it to different domains, including but not limited to statistics.
Second, more of an Ask HN than a comment: with a background in math up to ~1st semester calculus (i.e., differential and integral calculus), what are the killer resources for getting up to speed with statistics and probability?
For the record: I found the discussion and content discussed last week(1) to be a great refresher and the document linked above even denser as far as topics to explore go.
(1) The Little Handbook of Statistical Practice (2012): https://news.ycombinator.com/item?id=21382470
(1) Linear algebra. Much of applied statistics is multi-variate, and nearly all of that is done with matrix notation and some of the main results of linear algebra.
The central linear algebra result you need is the polar decomposition that every square matrix A can be written as UH where U is unitary and H is Hermitian.
So, for any vector x, the lengths of Ux and x are the same. So, intuitively, a unitary matrix is a rigid motion, essentially a rotation and/or reflection.
A Hermitian matrix takes a sphere and turns it into an ellipsoid. The axes of the ellipsoid are mutually perpendicular (orthogonal) and are eigenvectors. The sphere stretches/shrinks along each eigenvector axis according to its eigenvalue.
U and H can be complex valued (have complex numbers), but the real valued case is similar, A = QS where Q is orthogonal and S symmetric non-negative definite. The normal equations of regression have a symmetric non-negative definite matrix. Factor analysis and singular values are based on the polar decomposition.
You get more than you paid for: A relatively good treatment of the polar decomposition has long been regarded as a good, baby introduction to Hilbert space theory for quantum mechanics. The unitary matrices are important in group representations used in quantum mechanics for molecular spectroscopy.
Matrix theory is at first just some notation, a way to write systems of linear equations. But the notation is so good and has so many nice properties (e.g., matrix multiplication is associative, a bit amazing) that, together with those basic properties, it becomes matrix theory. Then linear algebra is, first, just about systems of linear equations, and then moves on with matrix theory. E.g., we could write out the polar decomposition without matrix notation, but it would be a mess.
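To make the polar decomposition concrete, here is a minimal numerical sketch (numpy only, computing the real factorization A = QS from the SVD; the matrix is random, just for illustration):

```python
# Real polar decomposition A = Q S via the SVD, using numpy only.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))  # any real square matrix

# SVD: A = U diag(s) Vt
U, s, Vt = np.linalg.svd(A)

Q = U @ Vt                   # orthogonal factor (the rigid motion)
S = Vt.T @ np.diag(s) @ Vt   # symmetric non-negative definite factor

assert np.allclose(Q @ S, A)                     # A = Q S
assert np.allclose(Q.T @ Q, np.eye(3))           # Q orthogonal
assert np.allclose(S, S.T)                       # S symmetric
assert np.all(np.linalg.eigvalsh(S) >= -1e-12)   # eigenvalues >= 0
```

Here Q = U V^T is the rigid motion and S = V diag(s) V^T carries the perpendicular stretching axes described above.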
(2) Need some basic probability theory. So, have events -- e.g., let H be the event that the coin came up heads. Can have the probability of H, written P(H), which is a number in the interval [0,1]. P(H) = 0 means that essentially never get heads; P(H) = 1 means that essentially always get heads.
Can have event W -- it's winter outside. Then can ask for P(H and W). That works like set intersection in Venn diagrams. Or can have P(H or W), and that works like set union in Venn diagrams.
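A tiny simulation can illustrate this event algebra; the events and probabilities below are made up for illustration:

```python
# Events as conditions on simulated outcomes: "and" behaves like set
# intersection, "or" like set union (Venn diagram style).
import random

random.seed(1)
trials = [(random.choice("HT"),
           random.choice(["winter", "spring", "summer", "fall"]))
          for _ in range(100_000)]

n = len(trials)
p_heads  = sum(1 for c, s in trials if c == "H") / n
p_winter = sum(1 for c, s in trials if s == "winter") / n
p_and    = sum(1 for c, s in trials if c == "H" and s == "winter") / n
p_or     = sum(1 for c, s in trials if c == "H" or s == "winter") / n

# Inclusion-exclusion holds exactly for empirical frequencies:
# P(H or W) = P(H) + P(W) - P(H and W).
assert abs(p_or - (p_heads + p_winter - p_and)) < 1e-12
```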
Go measure a number. Call that the value of real valued random variable X. We don't say what random means; it does not necessarily mean unpredictable, truly random, unknowable, etc. The intuitive "truly random" is essentially just independence as below, and we certainly do not always assume independence. For real number x and cumulative distribution F_X, can ask for
F_X(x) = P(X <= x)
Often in practice F_X has a derivative, as in calculus, and in that case can ask for the probability density f_X of X
f_X(x) = d/dx F_X(x)
Popular densities include uniform, Gaussian, and exponential.
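As a quick sketch of these definitions, the exponential case can be checked against simulated data (the evaluation point and sample size below are arbitrary):

```python
# For a rate-1 exponential random variable, F_X(x) = 1 - exp(-x) and
# f_X(x) = exp(-x). Check both empirically.
import math
import random

random.seed(2)
n = 200_000
xs = [random.expovariate(1.0) for _ in range(n)]

x = 1.5
empirical_cdf = sum(1 for v in xs if v <= x) / n   # P(X <= x) estimate
exact_cdf = 1 - math.exp(-x)                       # F_X(x)
assert abs(empirical_cdf - exact_cdf) < 0.01

# The density is the derivative of the CDF; approximate it with a
# difference quotient of the empirical CDF.
h = 0.05
emp_density = (sum(1 for v in xs if x < v <= x + h) / n) / h
assert abs(emp_density - math.exp(-x)) < 0.05
```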
We can extend these definitions of (cumulative) distribution and density to the case of the joint distribution of several random variables, e.g., X, Y, and Z. We can visualize the joint distribution or density of X and Y on a 3D graph. The density can look like some wrinkled blanket or pizza crust with bubbles.
In statistics, we work with random variables, from our data and from the results of manipulating random variables. E.g., for random variables X and Y and real numbers a and b, we can ask for the random variable
Z = aX + bY
That is, we can manipulate random variables.
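A small simulation can illustrate such manipulation; for independent X and Y, Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) (the constants below are arbitrary):

```python
# Z = aX + bY is itself a random variable; verify the variance formula
# for independent Gaussians by simulation.
import random
from statistics import pvariance

random.seed(3)
n = 200_000
X = [random.gauss(0, 1) for _ in range(n)]   # Var X = 1
Y = [random.gauss(0, 2) for _ in range(n)]   # Var Y = 4
a, b = 2.0, 3.0
Z = [a * x + b * y for x, y in zip(X, Y)]

# Theory: Var Z = a^2 * 1 + b^2 * 4 = 4 + 36 = 40.
assert abs(pvariance(Z) - 40.0) < 1.0
```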
Events A and B are independent provided
P(A and B) = P(A)P(B)
In practice, we look for independence because it is a powerful form of decomposition, with powerful, even shockingly surprising, consequences. A lot of statistics rests on an assumption of independence.
Can also extend independence to the case of random variables X and Y being independent. For any index set I and random variables X_i for i in I, can extend to the case of the set of all X_i, i in I, being independent. The big deal here is that the index set I can be uncountably infinite -- a bit amazing.
For the events heads H and winter W above, we just believe that they are independent just from common sense. To be blunt, in practice the main way we justify an independence assumption is just common sense -- sorry 'bout that.
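A quick sketch of the product rule, with two events driven by independent draws (the probabilities 0.5 and 0.3 are made up); this is a simulation check, not a proof:

```python
# For independent events, P(A and B) = P(A) P(B), up to sampling noise.
import random

random.seed(4)
n = 100_000
flips = [(random.random() < 0.5, random.random() < 0.3)
         for _ in range(n)]

pA  = sum(1 for a, b in flips if a) / n
pB  = sum(1 for a, b in flips if b) / n
pAB = sum(1 for a, b in flips if a and b) / n

assert abs(pAB - pA * pB) < 0.01
```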
Can define some really nice cases of an infinite sequence of random variables converging (more than one way) to another random variable. An important fraction of statistics looks for such sequences as approximations to what we really want.
Since statistics books commonly describe a dozen or two popular densities, one might guess that in practice we can find the (distribution or) density of our random variables. Sadly or not, mostly no: usually in practice we won't have enough data to know the density.
When we do know the density, it is usually from more information, e.g., the renewal theorem, which justifies an assumption of a Poisson process, from which we can argue for an exponential density; or the central limit theorem, from which we can justify a Gaussian density.
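A minimal sketch of the central limit theorem at work, using standardized sums of independent uniforms (the choice of 12 summands is arbitrary):

```python
# Standardized sums of independent Uniform(0,1) variables look Gaussian,
# which is one way an assumed Gaussian density gets justified.
import math
import random

random.seed(5)

def standardized_uniform_sum(k):
    # Uniform(0,1) has mean 1/2 and variance 1/12.
    s = sum(random.random() for _ in range(k))
    return (s - k / 2) / math.sqrt(k / 12)

n = 50_000
zs = [standardized_uniform_sum(12) for _ in range(n)]

# For a standard Gaussian, P(Z <= 1) is about 0.8413.
frac = sum(1 for z in zs if z <= 1.0) / n
assert abs(frac - 0.8413) < 0.01
```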
(3) A statistic is: give me the values of some random variables; I manipulate them and get a number; that number is both a random variable and a statistic. Some such statistics are powerful.
(4) Statistical Hypothesis Testing. Here is one of the central topics in statistics; e.g., it is the source of the much debated "p-values".
Suppose we believe that in an experiment on average the true value of the results is 0. We do the experiment and want to do a statistical hypothesis test that the true value is 0.
To do this test, we need to be able to calculate some probabilities. So, we make an assumption that gives us enough to make this calculation. This assumption is the null hypothesis, null as in no effect, that is, we still got 0 and not something else, say, from a thunderstorm outside.
So, we ASSUME the true value is 0 and with that assumption and our data calculate a statistic X. We find the (distribution or) density of X with our assumption of true value 0. Then we evaluate the value of X we did find.
Typically X won't be exactly 0. So we have two cases:
(I) The value of X is so close to zero that, from our density calculation, X is at least as close as 99% of the cases would be. So, we fail to reject -- really, in practice, essentially accept -- that the true value is 0. Accept is a bit fishy, since we can accept easily enough, that is, not notice the pimple simply by using a fuzzy photograph, just by using a really weak test! Watch out for that in the news, or "how to lie with statistics".
(II) X is so far from 0 that either (i) the true value is not zero and we reject that the true value is 0 or (ii) the true value really is 0 and we have observed something rare, say, 1% rare. Since the 1% is so small, maybe we reject (ii) and conclude (i).
So, for our methodology we can test and test and test and what we are left with is what didn't get rejected. Some people call this all science, others, fishy science, others just fishy. IMHO, done honestly or at least with full documentation, it's from okay to often quite good.
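As a sketch of the procedure above, here is a simple known-variance z-test of the null hypothesis "true mean is 0" at the 1% level (sample sizes, effect size, and the seed are made up for illustration):

```python
# Assume the true mean is 0 (the null hypothesis), compute a
# standardized statistic from the data, and reject if it falls in the
# far tails. The two-sided 1% cutoff for a standard Gaussian is ~2.576.
import math
import random

def z_test_mean_zero(data, sigma=1.0, cutoff=2.576):
    n = len(data)
    z = sum(data) / (sigma * math.sqrt(n))  # standardized sample sum
    return abs(z) > cutoff                   # True means "reject"

random.seed(6)

# Under the null (true mean 0), the test rejects about 1% of the time.
trials = 2000
rejections = sum(
    z_test_mean_zero([random.gauss(0, 1) for _ in range(100)])
    for _ in range(trials))
assert rejections / trials < 0.03  # near the nominal 1% rate

# With a real effect (true mean 1), the test essentially always rejects.
shift_data = [random.gauss(1, 1) for _ in range(100)]
assert z_test_mean_zero(shift_data) is True
```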
I mentioned weak tests; intuitively that's okay, but there is a concept of power of a test. Basically we want more power.
Here's an example of power:
Server farm anomaly detection is, or should be, essentially a statistical hypothesis test. So, then, the usual null hypothesis is that the farm is healthy. We run a test. There is a rate (over time, essentially a probability) of false alarms (we reject that the farm is healthy and conclude that it is sick when it is actually healthy) and a rate of missed detections of actual problems (we fail to reject the null hypothesis and conclude that the farm is healthy when it is actually sick).
Typically in hypothesis testing we get to adjust the false alarm rate. Then for a given false alarm rate, a more powerful test has a higher detection rate of real problems.
It's easy to get a low rate of false alarm; just turn off the detectors. It's easy to get 100% detection rate; just sound the alarm all the time.
So, a question is, what is the most powerful test? The answer is in the classic (1933) Neyman-Pearson lemma. It's like investing in real estate: allocate the first money to the highest ROI property, the next money to the next highest, etc. Yes, the discrete version is a knapsack problem and is NP-complete, but this is nearly never a consideration in practice. There have been claims that some high-end US military target detection radar achieves the Neyman-Pearson most powerful detector.
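Here is a minimal sketch of false-alarm rate versus detection rate; the "healthy" and "sick" distributions are invented, and for this Gaussian mean-shift case the Neyman-Pearson likelihood-ratio test reduces to simply thresholding the reading itself:

```python
# Healthy readings ~ N(0,1), sick readings ~ N(2,1) (numbers made up).
# Fix the false-alarm rate at ~1% and see what detection rate we get.
import random

random.seed(7)
threshold = 2.326  # one-sided 1% point of the standard Gaussian

n = 100_000
healthy = [random.gauss(0, 1) for _ in range(n)]
sick    = [random.gauss(2, 1) for _ in range(n)]

false_alarm = sum(1 for x in healthy if x > threshold) / n
detection   = sum(1 for x in sick if x > threshold) / n

assert false_alarm < 0.02   # near the nominal 1%
assert detection > 0.30     # far better than the 1% base rate
```

Turning off the detectors would drive false_alarm to 0, and always alarming would drive detection to 1; the point of power is doing well on detection at a fixed, small false-alarm rate.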
There's more, but that is a start.
For much of engineering, e.g., quality control, the statistical hypothesis testing I outlined is common.
There are now classic texts on hypothesis testing:

E. L. Lehmann, Testing Statistical Hypotheses, John Wiley, New York.

E. L. Lehmann, Nonparametrics: Statistical Methods Based on Ranks.

Sidney Siegel, Nonparametric Statistics for the Behavioral Sciences, McGraw-Hill, New York.

The last is also just hypothesis testing; apparently the book is still referenced.
From some statistical consultants inside GE, there is

Gerald J. Hahn and Samuel S. Shapiro, Statistical Models in Engineering, John Wiley & Sons.
For machine learning, so far the approach appears to be cases of empirical curve fitting.
The linear case, versions of regression analysis, remains of interest. There the matrix theory I outlined is central. And regression analysis goes way back, 100 or so years, and is awash in uses of statistical hypothesis tests for the regression as a whole (F-ratio), for individual coefficients (t-tests), and for confidence intervals on predicted values.
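A minimal sketch of regression via the normal equations, showing the symmetric non-negative definite matrix X^T X mentioned above (the data are simulated, with made-up true coefficients):

```python
# Least-squares regression via the normal equations (X^T X) b = X^T y.
import numpy as np

rng = np.random.default_rng(8)
n = 500
x = rng.uniform(0, 10, n)
y = 3.0 + 2.0 * x + rng.normal(0, 1, n)   # true intercept 3, slope 2

X = np.column_stack([np.ones(n), x])      # design matrix with intercept
XtX = X.T @ X
assert np.allclose(XtX, XtX.T)                 # symmetric
assert np.all(np.linalg.eigvalsh(XtX) >= 0)    # non-negative definite

b = np.linalg.solve(XtX, X.T @ y)         # solve the normal equations
assert abs(b[0] - 3.0) < 0.5 and abs(b[1] - 2.0) < 0.1
```

In practice one would use a numerically safer solver (e.g., QR or least-squares routines) rather than forming X^T X directly, but the normal equations are the classical starting point.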
Using probabilistic and statistical methods in the non-linear fitting is more advanced and likely an active field of research.
In my own work, I use and/or create new statistical methods when and as I need them. My current work has some original work in, really, applied probability that can also be considered statistics. And when I was doing AI at IBM, I created some new mathematical statistics hypothesis tests to improve on what we were doing in AI. Later I published the work.

Parts of the machine learning community might call that paper a case of supervised learning, but the paper is theorems and proofs in applied probability that give results on false alarm rates and detection rates, that is, statistical hypothesis tests (both multi-dimensional and distribution-free). So, that paper can be regarded as relevant to machine learning, but the prerequisites are an undergrad pure math major, about two years of graduate pure math with measure theory and functional analysis, a course in what Leo Breiman called graduate probability, and some work in stochastic processes.

At one point I use a classic result of S. Ulam that the French probabilist Le Cam called tightness; there is a nice presentation in

Patrick Billingsley, Convergence of Probability Measures, John Wiley and Sons, New York, 1968.
The work was for anomaly detection in complex systems so could be regarded as for both engineering and machine learning. But I can't give all the background here.
For linear algebra, Gilbert Strang's course is a revelation.