Why L2 regularization? Same reason: a closed-form solution exists, straight from linear algebra. But at the end of the day, you are most interested in the expected value of the coefficients, and minimizing the squared error gives you E[coeffs], the mean of the coefficients.
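(A minimal numpy sketch of that closed form for the curious -- the standard ridge estimate beta = (X^T X + lambda I)^{-1} X^T y. The data and the lambda value here are made up purely for illustration.)

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    true_beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
    y = X @ true_beta + rng.normal(scale=0.1, size=100)

    lam = 0.1  # regularization strength, arbitrary for this sketch
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    print(beta_hat)  # close to true_beta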
Gauss quite openly admitted that the choice was born of convenience. The justification using the Normal or Gaussian distribution came later, and the Gauss-Markov result came even later.
Even at the time Gauss proposed the loss, it was noted by many of Gauss' peers (and perhaps by Gauss himself) that other loss functions seemed more appropriate if one goes by empirical performance, in particular the L1 distance.
Now that we have the compute power to deal with L1, it has come back with a vengeance, and people have been researching its properties with renewed earnestness. In fact there is a veritable revolution going on right now in the ML and stats world around it.
Just as minimizing the squared loss gives you the conditional expectation, minimizing the L1 error gives you the conditional median. The latter is to be preferred when the distribution has a fat tail, or is corrupted by outliers. This knowledge is nowhere close to being new; Gauss's peers knew it.
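A quick numerical sketch of that point (the data, including the 100.0 standing in as an outlier, are made up): minimizing squared error over a constant recovers the mean, minimizing absolute error recovers the median, and only the latter shrugs off the outlier.

    import numpy as np
    from scipy.optimize import minimize_scalar

    data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 plays the outlier

    c_sq = minimize_scalar(lambda c: np.sum((data - c) ** 2)).x
    c_ab = minimize_scalar(lambda c: np.sum(np.abs(data - c))).x

    print(c_sq, np.mean(data))    # both ~22.0, dragged by the outlier
    print(c_ab, np.median(data))  # both ~3.0, unaffected by it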
I work in chemoinformatics. The main methods used by academics to regress parameters have not changed in the past 40 years, even though we have gone from small, carefully assessed data sets (think 200 experimental points) to larger ones (10,000, sometimes millions of points) with a lot of outliers from data entry errors, experimental errors, etc.
The end result is that when I see models of interest without the raw data, I re-regress the parameters using my own datasets, because most of the time you can barely trust them (even if they come from well known research centres).
That's quite interesting. Do you have a reference for that?
From my understanding, the popularity of the least squares method came (at least in part) from Gauss' successful prediction of the position of Ceres. Was this just because people not using least squares were not able to calculate it?
The other useful resource is "The Unicorn, The Normal Curve, And Other Improbable Creatures"
However it is useful to have a closed form solution because it guarantees you actually minimized it. Other strategies to minimize functions don't guarantee that but they're still extremely useful.
Exactly right. It has nothing to do with probability distributions.
As touched upon in the article, the objective not being differentiable is a big deal for modern machine learning methods.
Differentiability is important if you want to have a closed-form formula and derive it in front of undergraduates.
I'm not sure the absolute value is a big problem here. You still get a convex optimization problem. In neural networks a lot of people use ReLU or step activation functions, which are no more differentiable than the absolute value.
And aren't exact zeroes an error scenario for most machine learning models anyway?
Why do we assume Gaussian errors? There is seldom a Gaussian distribution in the real world, usually because the probability of large error values doesn't decay that fast. We use it because the math is easy and we can actually solve the problem under that assumption.
I left out some detail I should have included, namely what is so special about a Gaussian that makes the math easy. So I will say it now.
From a measurement we can infer a probability distribution for what the measured quantity is. A second measurement, on its own, also gives some probability distribution for what the measured quantity is. If we consider both measurements together, we get yet another probability distribution for what the measured quantity is. The magic is that if the measurements have Gaussian distributions, then the distribution from the combined measurements is also Gaussian. This is not true in general. As long as we have Gaussian distributions we can do all the operations we want, and the probability distributions stay Gaussian and can be fully described by a center point and a width. (Forgive me for the liberties I am taking here.) The basic alternative to exactly solving the problem is to actually carry around the probability distribution functions, which is not practical even with very powerful computers.
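A small numerical sketch of that "two Gaussian measurements combine into a Gaussian" point (the measurement values and standard deviations are made up): the product of the two likelihoods, renormalized, has the precision-weighted mean and combined width given by the closed form.

    import numpy as np

    mu1, s1 = 10.0, 2.0   # first measurement and its standard deviation
    mu2, s2 = 12.0, 1.0   # second measurement and its standard deviation

    # Closed-form fused estimate (precision-weighted average)
    prec = 1 / s1**2 + 1 / s2**2
    mu_fused = (mu1 / s1**2 + mu2 / s2**2) / prec
    s_fused = (1 / prec) ** 0.5

    # Numerical check: multiply the two densities and renormalize
    x = np.linspace(0.0, 20.0, 20001)
    dx = x[1] - x[0]
    p = np.exp(-(x - mu1)**2 / (2 * s1**2)) * np.exp(-(x - mu2)**2 / (2 * s2**2))
    p /= p.sum() * dx
    mean_num = (x * p).sum() * dx
    std_num = (((x - mean_num)**2 * p).sum() * dx) ** 0.5

    print(mu_fused, mean_num)  # both ~11.6
    print(s_fused, std_num)    # both ~0.89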
You're talking about fat tails?
Things like the fact that squared error is differentiable are actually irrelevant - if the best model is not differentiable, you should still use it.
I'm not sure I would say that -- neural nets are "almost everywhere differentiable", for example. Without differentiability we're stuck with, for example, discrete GAs for optimization, and you can throw all your intuition out the window (not to mention training/learning efficiency).
- There is plenty of existing technology for handling non-differentiable functions. Functions like the absolute value, the 2-norm, and so on have a generalization of the gradient (the subgradient) which can be used in lieu of the gradient (see the sketch after this list).
- The fact that such functions are "almost everywhere differentiable" (i.e. the non-differentiability lies in a manifold of zero measure) might suggest they behave pretty much like smooth ones. This is often not the case, as optima often conspire to lie exactly on these nonsmooth manifolds.
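As a sketch of the subgradient point (everything here -- data, step-size schedule, iteration count -- is an arbitrary choice for illustration): a least-absolute-deviations line fit, minimized with subgradient steps even though the objective is not differentiable where residuals hit zero.

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0.0, 1.0, 50)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.05, size=50)
    y[::10] += 5.0  # a few gross outliers

    A = np.column_stack([x, np.ones_like(x)])
    w = np.zeros(2)
    for t in range(1, 5001):
        r = A @ w - y
        g = A.T @ np.sign(r)            # a subgradient of sum |A w - y|
        w -= (0.01 / np.sqrt(t)) * g    # diminishing step size
    print(w)  # should end up near [2.0, 1.0] despite the outliers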
I don't think there was any misconception.
We want a metric essentially because if we converge or have a good approximation in the metric, then we are close in some important sense. Squared error, then, gives one such metric.
But for some given data, usually there are several metrics we might use, e.g., absolute error (L^1), worst case error (L^infinity), L^p for positive integer p, etc.
From 50,000 feet up, the reason for using squared error is that we get to have the Pythagorean theorem and, more generally, get to work in a Hilbert space, a relatively nice place to be. E.g., we also get to work with angles from inner products, correlations, and covariances -- we get cosines and a version of the law of cosines. E.g., we get to do orthogonal projections, which give us minimum-error approximations.

With Hilbert space, commonly we can write the total error as a sum of contributions from components, that is, decompose the error into contributions from those components -- nice.

The Hilbert space we get from squared error gives us the nicest version of Fourier theory, that is, orthogonal representation and decomposition, and best squared error approximation.
We also like Fourier theory because of how it gives us the Heisenberg uncertainty principle.
Under meager assumptions, for real valued random variables X and Y, E[Y|X], a function of X, is the best squared error approximation of Y by a function of X.
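A minimal Monte Carlo sketch of that claim (the model Y = X^2 + noise is made up for illustration): the conditional expectation E[Y|X] = X^2 attains the smallest mean squared error among functions of X.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=1_000_000)
    Y = X**2 + rng.normal(size=X.size)  # so E[Y|X] = X^2

    print(np.mean((Y - X**2) ** 2))          # ~1.0, the noise variance
    print(np.mean((Y - (X**2 + 0.3)) ** 2))  # larger
    print(np.mean((Y - np.abs(X)) ** 2))     # larger still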
Squared error gives us variance, and in statistics the sample mean and variance are sufficient statistics for the Gaussian; that is, for Gaussian data, we can take the sample mean and sample variance, throw away the rest of the data, and do just as well.
For more, convergence in squared error can imply convergence almost surely at least for a subsequence.
Then there is the Hilbert space result, every nonempty, closed, convex subset has a unique element of minimum norm (from squared error) -- nice.
Many nice properties of the square loss (in fact un-fucking-believably nice properties) stem not from the fact that its square root is a metric but from the fact that it is a Bregman divergence. Another oft used 'divergence' in this class is KL divergence or cross-entropy.
Bregman introduced this class purely as machinery to solve convex optimization problems. His motivation was to generalize the method of alternating projections to spaces other than a Hilbert space. But it so turned out that Bregman divergences are intimately connected with the exponential family class of distributions, also called the Pitman-Darmois-Koopman class of distributions. It takes some wracking of the brain to come up with a parametric family that does not belong in this class if one is caught unprepared; almost all parametric families used in stats (barring a few) belong to this class.
One may again ask why this class is so popular in probability and statistics. The answer is again convenience: they are almost as easy as Gaussians to work with, they have well behaved sufficient statistics, and their stochastic completion gives you the entire space of "regular enough" distributions with finite dimensional parameterizations.
You mentioned conditional expectation. So one may ask which loss functions are minimized by the conditional expectation. Bregman divergences are exactly that class. Of course square loss satisfies it too (more importantly, the L2 metric on its own does not; it is the act of squaring it that does this).
Very interesting stuff (at least to me)
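For readers who haven't met Bregman divergences, here is a tiny sketch (the helper function is illustrative, not from any library): D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>, which recovers squared distance for F(x) = ||x||^2 and KL divergence for F(x) = sum x log x on probability vectors.

    import numpy as np

    def bregman(F, gradF, p, q):
        # D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>
        return F(p) - F(q) - np.dot(gradF(q), p - q)

    p = np.array([0.2, 0.3, 0.5])
    q = np.array([0.4, 0.4, 0.2])

    # F = squared norm  ->  squared Euclidean distance
    print(bregman(lambda v: np.dot(v, v), lambda v: 2 * v, p, q),
          np.sum((p - q) ** 2))

    # F = negative entropy  ->  KL(p || q) for probability vectors
    print(bregman(lambda v: np.sum(v * np.log(v)), lambda v: np.log(v) + 1, p, q),
          np.sum(p * np.log(p / q)))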
Yes, I was using "squared error" because the OP was. What I wrote was modulo a square root missing here and there!
What book would you recommend for this discussion?
But the biggie points about the Hilbert space definition are (A) the other, somewhat astounding (vector space) examples and (B) how much we can do in Hilbert space that is close to good old high school plane (2D) and solid (3D) geometry and close to good, old freshman calculus with limits, convergence, etc.

So, in particular, in Hilbert space we get the Pythagorean theorem of plane geometry and perpendicular projections as the shortest distance from a point to a plane in solid geometry -- two biggies for pure/applied math. We get the triangle inequality. With the inner product, we get angles and orthogonality. And the core data we need for projections are just some inner products.
The most important examples of a Hilbert space are, for positive integer n and the set of real numbers R or the set of complex numbers C, the vector spaces of linear algebra R^n and C^n -- e.g., C^n is the set of all n-tuples of complex numbers. Then, as in whatever first course you had that talked about vectors and dot (inner, scalar) products, with those inner products R^n and C^n are Hilbert spaces.
The part about complete is a generalization of completeness in the real numbers, that is, the biggie way the real numbers are better than just the rational numbers. In short, intuitively, in the real numbers, if a sequence appears to converge, then there really is a real number there for it to converge to. Of course, that's not true in the rational numbers since, e.g., we can have a sequence of rational numbers that converges to the square root of 2 but doesn't really converge in the rational numbers because the square root of 2 is not a rational number. This stuff about "appears to converge" is called Cauchy convergence and is a weak definition of convergence. The point about completeness is that Cauchy convergent sequences really are convergent, that is, have something to converge to and do converge to it (essentially in the sense of limits you saw in calculus or high school algebra). If you are taking limits to define or approximate what you really want, then you also really want completeness so that what you converge to exists. So, that's completeness -- for a Hilbert space, we insist on that.
Of course, you need a background in linear algebra. So, for a first book, get any of the popular ones. If you wish, concentrate on the more geometrical parts and do less on the algebraic parts -- e.g., if there is a chapter on group representation theory, Galois theory, linear algebra over finite fields, where things go wrong when the field is the rationals, or algebraic coding theory, then feel free to leave that material for later. Likely you should pay attention to dual spaces, but if you wish you can go light on adjoint transformations since they are less interesting when you have an inner product. Curiously, you can also go light on change of basis for linear transformations, that is, the difference between vectors and coordinate vectors. Concentrate on dimension, linear independence, linear transformations, maybe touch on quotient spaces, use Gauss elimination as a good example of such things, eigenvalues and eigenvectors, orthogonality, and the polar decomposition (the core of factor analysis, singular value decomposition, matrix condition number, and more). If there is a little on the associated numerical analysis, then go ahead -- e.g., learn to accumulate inner products in double precision -- sure, take 10 minutes to see how to add iterative improvement to Gauss elimination and matrix inversion. Pseudo-inverses are cute material, but, especially for your question, you likely won't see them again and can skip them.
Then, if you have some time, take a fast pass through the classic, Halmos, Finite Dimensional Vector Spaces. It was written in 1942 when Halmos had just gotten his Ph.D. under J. Doob (e.g., as in Stochastic Processes and, more recently, Classical Potential Theory and Its Probabilistic Counterpart) and showed up at the Institute for Advanced Study in Princeton and asked to be an assistant to John von Neumann, likely the inventor of Hilbert space. Well, Halmos wrote his book to be a finite dimensional version of linear algebra as if it were Hilbert space, which is commonly infinite dimensional. So you get a gentle introduction to Hilbert space. You get good at eigenvalues and eigenvectors, the polar decomposition, orthogonality, transformations that preserve distances and angles, etc. You get a lot out of it.

BTW, at one time, Harvard's Math 55 used Halmos, Baby Rudin (below), and a book by Spivak as the three main references.
Get a start on probability and statistics. A college junior level course should be sufficient. Don't take the course too seriously since you will redo all the good parts from a much better foundation soon! Note: elementary stat courses commonly get all wound up about probability distributions. Well, they do exist, are at the core of probability theory, and are very important, both in theory and conceptually, but, actually, in practice they usually require more data than you will likely have, especially in dimensions above 1. So, in practice, you mostly can't actually see the actual data of the distributions you are working with! You should hear about the uniform, exponential, chi-squared, and Gaussian and go light on the rest. The Gaussian is profound and won't go away, even in practice, although it is less important in practice than long assumed.
Then take a good pass through at least the first parts of Rudin, Principles of Mathematical Analysis, AKA Baby Rudin. For the exterior algebra in the back of the more recent editions, well, you can likely get that elsewhere, say, now in English, directly from Cartan. The first parts of Baby Rudin cover metric spaces well enough. So, in Baby Rudin, get good at working with the limits, completeness property, compactness, etc. of mathematical analysis -- that is, not algebra, geometry, topology, or foundations, although the metric space material is the same as part of the part of topology called point set topology.
Then learn more in mathematical analysis, in particular, measure theory. Measure theory essentially replaces the Riemann integral you learned in calculus and Baby Rudin. 'Bout time! Net, measure theory is a slightly different way to use limits to define the integral (areas, volumes, etc.) -- the first, biggie difference is that you do the partitioning on the Y axis instead of on the X axis. The biggie reason: the resulting integral easily handles the pathological cases, especially involving limits, that the Riemann integral struggles with. Don't worry: the integral of, say, x^2 over [0,1] is still the same, 1/3, right? But consider the function f: [0,1] --> R where f(x) = 1 if x is rational and 0 otherwise. Then the Riemann integral of f over [0,1] does not exist, but the measure theory integral does and gives 0 for the answer.
There are at least two now classic books, Royden, Real Analysis and Rudin, Real and Complex Analysis. But there are more, and likely more can be written. For Rudin, you can f'get about the last half on functions of a complex variable. Royden is easier to read. Rudin has the math more succinctly presented. Some people believe that Rudin is a bit too severe for a first book; but if you get used to how Rudin writes, he's really good.
There in Rudin you get good introductions to Banach space (a complete, normed linear space, that is, assume a little less than for a Hilbert space) with a few really surprising theorems, Hilbert space with an isomorphism argument that they are really all the same, a really nice chapter on Fourier theory (Baby Rudin does Fourier series; R&CA does the Fourier integral), and some nice applications.
With that background in measure theory, then take your good pass through probability. So, probability becomes a measure as in measure theory, except the values are always real and in [0,1]. A random variable finally gets a solid, mathematical definition -- it's just a measurable function (measurable is very general; in practice and even in nearly all of theory, essentially every function is measurable; in the usual contexts, it takes some cleverness to think of a function that is not measurable). And expectation is just the measure theory integral (with meager assumptions).

And in probability you get some assumptions you don't see in measure theory -- independence and conditional independence -- and these two yield wonders nothing like anything in just measure theory.
Go somewhere; look at something; get a number; then that's the value of a random variable. In practice, suppose there are 20 random variables, you have numerical values for 19 of them, and you can argue that those 19 are not independent of the 20th and that you have some data on how they are dependent; then you have a shot at estimating the 20th. Presto, bingo, get as rich as James Simons, do machine learning, get big houses, Cadillacs, Ferraris, a yacht, a private jet -- maybe!

Also, you get to use the Radon-Nikodym theorem (proved in both Royden and Rudin, and in Rudin with von Neumann's cute proof) for the grown up version of conditioning (i.e., Bayesian) and, from there, in stochastic processes, Markov processes and martingales (astounding results).
Books include L. Breiman, Probability; K. Chung, A Course in Probability Theory; and J. Neveu, Mathematical Foundations of the Calculus of Probability. And there are more. IIRC Breiman and Neveu were both students of M. Loeve at Berkeley -- sure, you can also get Loeve's Probability in two volumes. Of these, Neveu is my favorite; it's elegant; but for most readers it is too succinct.

Hilbert space again? It turns out, the set of all real valued random variables X such that E[X^2] is finite is a Hilbert space. Yes, completeness holds; with some thought, that result seems astounding, like there is no way it could be true, but it is.
Now just derive grown up versions of most of the main results of elementary statistics for yourself from what you have learned. E.g., for the Neyman-Pearson result on most powerful hypothesis testing, just use the Hahn decomposition from the Radon-Nikodym theorem. And, with the Radon-Nikodym theorem, you get a grown up version of sufficient statistics, right, based on a classic paper by Halmos and Savage.
Along the way you will notice that, the last time I looked, Baby Rudin defined the Riemann integral on closed intervals of finite length, but right away probability and statistics want to integrate on the whole real line, the whole plane, etc., and do change of variable manipulations with such integrals, etc. Well, for the prerequisites, those are in measure theory, where the first version of its integral applies also on the whole real line, the whole plane, and much more. Measure theory also gives you the clean, powerful versions of differentiation under the integral sign and interchange of order of integration.

Ah, why bother to teach the Riemann integral at all? :-)
I have a utilitarian understanding of mathematics (vector spaces, SVD, orthogonality, invariances, etc.) and over time have come to appreciate the underlying characteristics/relationships, of which I recently got a taste from T. Wickens, The Geometry of Multivariate Statistics.
I look forward to understanding measure theory and related math.
This would be a good value-add to a discussion board or a question-and-answer community. It might fly better than the likes of Coursera. The key is that the community forms first.
> A Hilbert space is a complete inner product space, which is usually what you want
Unless you have outliers, in which case it's what you don't want. So you add e.g. a Huber loss function to reach a compromise.
But squared error is easier to compute. So, in practice, what you do is you remove outliers (e.g. cap the data at +-3sigma) then use squared error.
But if you are say fitting a function to the data, you can't tell beforehand which data-points are the outliers. So in that case perhaps you need an iterative approach of removing them (?)
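One simple version of that iterative idea, as a sketch (the 3-sigma cutoff, the loop cap, and the function name are all arbitrary choices here, and this is only one of many robust-fitting strategies): fit by least squares, drop points with residuals beyond 3 sigma, refit, and repeat until nothing more is dropped.

    import numpy as np

    def trimmed_least_squares(A, y, k=3.0, max_iter=10):
        """Least squares with iterative removal of large-residual points."""
        keep = np.ones(len(y), dtype=bool)
        for _ in range(max_iter):
            w, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
            r = y - A @ w
            new_keep = np.abs(r) <= k * r[keep].std()
            if np.array_equal(new_keep, keep):
                break
            keep = new_keep
        return w, keep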
Just look at the success of compressed sensing, based on taking the absolute value error seriously.
However: if you look at the shape of the square root of a sum of squares, its unit ball is a circle, so you can rotate it. If you take absolute values, it's a square (a diamond, really), so that cannot be rotated; the cube root of cubes, fourth root of fourths, etc., look like rounded-edge squares, and those cannot be rotated either. So if you have a change of vector basis, you're out of luck.
And of the Gaussian-like forms you get from other powers, none have the central limit property.
If you plot on the plane the distance = 1 line, then L_1 gives you a diamond, L_2 a circle, L_inf a square. [More precisely, the unit circle under the related metric (distance function) looks like those euclidean shapes]
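A quick numerical illustration of the rotation point above (the angle and vector are arbitrary): rotating the coordinates leaves the L2 norm alone but changes the L1 and L-infinity norms.

    import numpy as np

    theta = 0.7  # arbitrary rotation angle
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    v = np.array([3.0, 1.0])

    for p in (1, 2, np.inf):
        print(p, np.linalg.norm(v, p), np.linalg.norm(R @ v, p))
    # Only the p = 2 line prints the same number twice.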
(Per the old math joke - you can make a line passing through any three points on a plane if you make it thick enough.)
Normally distributed variables, because of the central limit theorem.
It isn't all that complicated.
Oh, and let's not forget that for a lot of problems minimizing the KL-divergence is the exact same operation as maximizing the likelihood function.
it is also extremely poorly behaved numerically and in convergence
To give just a taste of the nice properties of KL: if you are using a 1-layer NN with the sigmoid function as the transform, using square loss gives you an explosion of local minima, whereas using KL in its place would have given you none. Numerical accuracy is pretty much a non-issue; people have known how to handle KL numerically for the last 40 or so years.
BTW, using KL on equal-variance Gaussians gives you square loss, apparently the loss you prefer.
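A small numerical check of that remark (the means, variance, and grid are made up): for two Gaussians with the same variance, KL( N(mu1, s^2) || N(mu2, s^2) ) equals (mu1 - mu2)^2 / (2 s^2), i.e. a scaled squared error.

    import numpy as np

    mu1, mu2, s = 0.3, 1.1, 0.7
    x = np.linspace(-10.0, 12.0, 200001)
    dx = x[1] - x[0]

    norm = s * np.sqrt(2 * np.pi)
    p = np.exp(-(x - mu1)**2 / (2 * s**2)) / norm
    q = np.exp(-(x - mu2)**2 / (2 * s**2)) / norm

    kl = np.sum(p * np.log(p / q)) * dx
    print(kl, (mu1 - mu2)**2 / (2 * s**2))  # both ~0.65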
Square often corresponds to power/energy in systems AND energy (integral of power) is preserved. That relationship between physics and math allows a lot of useful transformations.
Different problems, different tools. You can't ask "why geometric mean?" without referring to a specific problem you're trying to solve.
When people ask "why machine learning?" the answers are "machine learning can do these things blablabla", not "you must specify the problem you're trying to solve".
It gives you a way to average together two things that have units that have nothing to do with each other and then compare two such averages and have the comparison make sense, as long as your units were consistent.
As a silly example, say you want to average 1kg and 1m and compare that average to the average 2kg and 0.5m. With arithmetic mean, ignoring the fact that it's nonsense to add different units, you could get numbers like (1+1)/2 = 1 and (2+0.5)/2 = 1.25 if you use kg and m, but numbers like (1 + 100)/2 = 50.5 and (2 + 50)/2 = 26 if you use kg and cm. Notice that which one is bigger depends on your choice of units. On the other hand, the geometric mean of the two examples is always the same as long as you use consistent units: 1 for both if you use kg and m, and 10 for both if you use kg and cm.
In practice, this sort of operation is only useful if you have multiple measures of some sort along different axes (think performance on 3 different performance tests) and you're being forced to produce a single average number. Again, a fairly silly thing to do, but _very_ common: just about every single performance benchmark does this.
I have a stock position which changed in value by a factor of 1.10 in 2007, by a factor of 0.80 in 2008, and by a factor of 1.15 in 2009. Is there any sort of representative number x for how much the value changed per year?
The final value of my stock position should come out the same when using x for every year, i.e. x * x * x = 1.10 * 0.80 * 1.15
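That representative x is the geometric mean of the yearly factors; a trivial check:

    factors = [1.10, 0.80, 1.15]
    prod = 1.0
    for f in factors:
        prod *= f
    x = prod ** (1 / len(factors))
    print(x)             # ~1.004, i.e. roughly +0.4% per year
    print(x ** 3, prod)  # both ~1.012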
BTW, "error" is a misleading term - it communicates some fault, at least in the common sense. Distance would be much better term.
So, "squared distance" makes much more sense, because negative distance is nonsense.
Consider the values 1/2 and 1/4: in the original space one is double the other, but in the squared space they become 1/4 and 1/16, so the difference is 4x. Also relevant: if you compare, e.g., 0.9 and 1, the gap between them is amplified after squaring (0.1 becomes 0.19).
Just look at bog-standard linear regressions, say Y_i = m X_i + b + ε_i. It makes no sense to call the ε_i terms "distance".
In the sense I think you're using it, "statistics" are really methods for dimensionality reduction - we take means, and medians and standard deviations with the hopes that they will capture the parts of the data we care about. This is important for two reasons - for one, for anything even moderately high dimension we'll never have enough data to be able to forego some means of aggregation due to the "curse of dimensionality". Secondly, the human-machine interaction information bandwidth is annoyingly low, so we need some way to compress any information for human consumption. "Statistics" are one way we do so.
"Statistics" is also a field of study based around understanding how multiple data points relate to each other - that is of course critical to machine learning, and I think the terminology collision is why you're getting downvoted.
Statistics, as a field, already used general-purpose optimization algorithms before modern ML techniques came about, so in that sense, ML just fits into an existing position in the statistical toolbox (like replacing a chisel with a 3D printer). In the other direction, statistical techniques like cross-validation are necessary for you to get your ML correct.
I would like to think that statistics comes more from a pure math approach, loosely, while ML comes from an applied math approach, loosely. ML works spectacularly well on a class of problems. Why it does what it does is (in)conveniently brushed under the rug. How you treat that (in)convenience is left to you.
 - https://www.quantamagazine.org/20151203-big-datas-mathematic...
very curious to hear about your point of view. Statistics is not linear regression and ANOVA, or whatever catalogue of techniques in a freshman book, not even the library of techniques available in R.
Statistics is the application of probability (or more broadly, math) to data.
That said, statisticians did miss the neural net wave because of their flippant reaction to it. They said, "oh well, yet another non-parametric function approximator; we worked out the asymptotics 30 years ago."
To paraphrase someone wise: asymptotically we are all dead. Not enough heed was paid to that. Among the things they lacked was expertise in algorithms and optimization. Mind you, optimization has been at the core of their craft from their very genesis; it's just that they did not feel it important enough to ride the cutting edge of research on optimization. Note: you cannot do maximum likelihood without solving an optimization problem. Gauss was doing it for statistics a couple of hundred years ago. If I go on a bit further with my rant, they got a bit carried away with their fetish for bias and asymptotic normality. They missed the wave, sure.
But all said and done, by any accepted definition of statistics, NN is very much also statistics.
And who says frequentist and Bayesian are the only two views? Where would you shelve prequential statistics, then? Or nonparametric regression?
Prediction has definitely been a part of statistics but often, as you rightly claim, as a byproduct. And yes, I would characterize the focus of stats and ML exactly as you did.
Statistics works the other way. You basically always start with some kind of probabilistic model. And then, if you even bother with prediction, you work towards prediction from the probabilistic model. With stats you don't need to interpret or add probability after the fact, it's already there.
Obviously stats and ML are enormous fields, with quite some overlap. And people tend to go after low hanging fruit; if many people who studied neural nets have formal probability backgrounds it simply makes sense that someone will write a paper on it. And I'm generalizing here (same goes with frequentist & Bayesian comment). But there absolutely is justification for saying "neural nets are not really a statistical technique".
NNs are in no way less fundamentally associated with probability than, say, linear regression. In both cases you can start with a probability model and derive the final form as a logical consequence, or you can start with the final form and slap a consistent probabilistic model on top.
The main thing is that any test you come up with that carves NNs away from stats is going to carve away a whole lot of other things that people have no trouble calling stats. This controversy essentially stems from the need to claim a technique for one's own tribe and not concede it to another. I am making no moral claim here, just an observation, and neither is it a novel one.
BTW, I am firmly from the ML camp and not stats. I enjoy poking a little bit of fun at statisticians and try to be gracious about their criticism of ML. That said, I feel no need to make a groundless claim to a technique when they have no less right to it in terms of objective claims. Rights steeped in culture, fiat, and history are a different matter.
I would say applied statistics draws a line just prior to implementation concerns (say, real-world resource usage measured in time, space and energy) whereas these would be fully within scope and of interest in machine learning.
As an example, applied statistics could provide a useful approach to a vision/image recognition problem, and this approach might be provably unrealizable in practice using real-world execution units (e.g. CUDA cores). Nonetheless, it might still be a very worthwhile theoretical result in applied statistics, although of no immediate interest within ML except to hint at a potential new area of research.
To do anything beyond using tools other people have made (and never being sure whether the results are meaningful or not), statistics is required.
Of course, to make money from the ML boom you can probably get away with coincidence and correlation.
Statistics: A mathematical science concerned with data collection, presentation, analysis, and interpretation.
Statistic: A quantity calculated from the data in a sample, which characterises an important aspect in the sample (such as mean or standard deviation).
If "statistics" is the term for taking a mathematical approach to understanding data, then "machine learning" is basically an applied subset of that. But you seem to be specifically using the "a statistic" definition to describe what you think the "study of statistics" is entirely concerned with.