That is the core reason to use squares: if the error is not normal, least squares loses that justification. This is why, e.g., a small number of outliers can wildly skew a least squares fit; if the error were normal, such outliers would essentially never occur.
I wanted to add a note on it being BLUE (the best linear unbiased estimator):
"The Gauss–Markov theorem states that in a linear regression model in which the errors have expectation zero and are uncorrelated and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares (OLS) estimator."
The benefit of a linear estimator is it is a lot easier (both computationally and algorithmically) to fit/solve a linear model than a non-linear model. Thus, there are advantages to linear models. Further, despite it being called linear, the model can actually be quite complex. It only needs to be linear in the model coefficients, but the regressors/independent variables don't have to be linear. Thus, you can make one variable the square of another variable or add a variable that is the interaction of two other variables. By doing so, you can add non-linearity while still using linear methods to solve the equation.
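As a minimal sketch of that last point (the data, the coefficient values, and the names here are made up for illustration), the model below contains a squared term and an interaction term, yet it is still fit with ordinary linear least squares because it is linear in the coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-1.0, 1.0, size=50)
x2 = rng.uniform(-1.0, 1.0, size=50)

# Hypothetical "true" coefficients for: 1, x1, x2, x1^2, x1*x2
true_coef = np.array([1.0, 2.0, -3.0, 0.5, 4.0])


def design_matrix(x1, x2):
    """Columns are non-linear in the inputs but enter the model linearly."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2])


X = design_matrix(x1, x2)
y = X @ true_coef  # noise-free on purpose, so the fit should recover true_coef

# Plain linear least squares on the expanded design matrix
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)
```

The only "non-linear" work happens when the columns are built; the solver never sees anything but a linear system.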
Ok, so then once you see the value of a linear model, then the next question is why we would want the best and most unbiased. I think those are rather obvious.
It's not obvious to me why we want an unbiased estimator actually. A Maximum-Likelihood estimate, for example, seems just as reasonable. Do you have an explanation?
Not just that, but there's also the question of why we're looking for point estimates in the first place. It isn't the correct thing to do, but no one ever gives an explanation.
I.e., it's great if you dropped your key under the lamppost, but the existence of the lamppost doesn't make it a good place to search.
It's not optimal for any other distribution. For a general error distribution g(x), you want to maximize sum(log(g(x_i-x0))) over the data points x_i (or equivalently prod(g(x_i-x0))); this only turns into least squares if log(g(x)) = C - D*x^2 with D > 0.
I.e., set up maximum likelihood, take logs, and you immediately get least squares for g(x) a Gaussian. If g(x)=exp(-|x|), you get L1 minimization. Other distributions give you other things.
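Spelled out as a sketch, assuming i.i.d. errors with density g and a single location parameter x_0 (the symbols x_i, n, \sigma and b are my own notation, not the commenter's):

```latex
\begin{align*}
\hat{x}_0 &= \arg\max_{x_0} \prod_{i=1}^{n} g(x_i - x_0)
           = \arg\max_{x_0} \sum_{i=1}^{n} \log g(x_i - x_0). \\
\text{Gaussian: } g(e) \propto \exp\!\left(-\frac{e^2}{2\sigma^2}\right)
  &\;\Longrightarrow\;
  \hat{x}_0 = \arg\min_{x_0} \sum_{i=1}^{n} (x_i - x_0)^2 . \\
\text{Laplace: } g(e) \propto \exp\!\left(-\frac{|e|}{b}\right)
  &\;\Longrightarrow\;
  \hat{x}_0 = \arg\min_{x_0} \sum_{i=1}^{n} |x_i - x_0| .
\end{align*}
```

The first minimization is solved by the mean of the x_i and the second by their median.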
The general topic to investigate is robust regression: https://en.wikipedia.org/wiki/Robust_regression
I was telling my 8-year-old daughter about that the other day, and it occurred to me that the other "measure of central tendency" children get taught about, the mode, also fits into this scheme, with a tiny bit of fudging. Define the L0 error to be the sum of the 0th powers of the errors, with the (unusual) convention that 0^0=0. In other words, the L0 error is just the number of data points you don't get exactly right. Then the mode is the value that minimizes the L0 error. So: L0 : L1 : L2 :: mode : median : mean.
(Note 1. The obvious other Lp norm to consider is with p=infinity, in which case you're minimizing the maximum error. That gives you the average of the min and max values in your dataset. Not useful all that often.)
(Note 2. What about the rest of the "L0 column" of Ben's table? Your measure of spread is the number of values not equal to the mode. Your implied probability distribution is an improper one where there's some nonzero probability of getting the modal value, and all other values are "equally unlikely". (I repeat: not actually possible; there is no uniform distribution on the whole real line.) Your regression technique is one a little like RANSAC, where you try to fit as many points exactly as you can, and all nonzero errors are equally bad. I doubt there's any analogue of PCA, but I haven't thought about it. Your regularized-regression technique is "best subset selection", where you simply penalize for the number of nonzero coefficients.)
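The L0 : L1 : L2 : L-infinity correspondence can be checked by brute force. This is only a sketch: the data set and the candidate grid are made up, and the grid is chosen so that every value involved is exactly representable as a float.

```python
import numpy as np

data = np.array([1.0, 1.0, 2.0, 4.0, 12.0])

# Candidate "central values": multiples of 0.5, all exact in floating point
candidates = np.arange(0.0, 13.5, 0.5)


def best(cost):
    """Return the candidate that minimizes the given cost function."""
    costs = [cost(c) for c in candidates]
    return candidates[int(np.argmin(costs))]


l0_best = best(lambda c: np.sum(data != c))            # count of misses
l1_best = best(lambda c: np.sum(np.abs(data - c)))     # sum of absolute errors
l2_best = best(lambda c: np.sum((data - c) ** 2))      # sum of squared errors
linf_best = best(lambda c: np.max(np.abs(data - c)))   # worst-case error

print(l0_best, l1_best, l2_best, linf_best)
print(np.median(data), np.mean(data), (data.min() + data.max()) / 2)
```

For this data set the four minimizers land on the mode, the median, the mean, and the midrange respectively.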
The median minimising L1 is also very nice, and I did not know that. It also means that the concept of median generalises easily to higher dimensions.
Laplace has recently shown that the integral of e to the minus t squared, where t goes from minus to plus infinity, has a definite answer.
Gauss starts a paper, basically says, let's just run with the arithmetic average, the usual mean, as the go-to best estimate, from a big series of measurements of a single thing, each of which has errors, because, yeah, why not.
Moments later the guy shows that squaring the errors makes it equivalent to that thing Laplace did, and we now have a general rule. His argument follows by supposing each repeated error due to a specific cause has an expected error of zero, with no particular rule of distribution, but that as the number of specific causes goes to infinity, the Law of large numbers applies (go Euler, fair play). Hence, wiggley wiggley, Laplace's thing.
Beautiful piece of work, and resting on Laplace and Euler too, two other heroic oddballs.
The best thing is the underlying explanation: Take a nearly-infinite number of different measurements of a thing, each measurement with its own pattern of rubbishness, and the average of those measurements will approach having this distribution. That bit is what makes it so broadly applicable.
The author mostly describes nice mathematical properties, so I'd recommend also considering absolute deviation in practical analysis in many cases. Then your error is less dominated by a few outliers. Specifically, least squares is best when you expect your error to be Gaussian distributed, which isn't always the case.
See for example:
You might say the example is contrived, but if the outliers were gone both estimates would give nearly the same result. The main time to prefer L_2 would be when your error is gaussian and your variance is huge, but that's going to be an uphill battle regardless.
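As a stand-in illustration with made-up data (not the commenter's own example), the sketch below fits a line to ten points, nine lying exactly on y = 2x + 1 and one gross outlier, once by least squares and once by least absolute deviation. The L1 fit uses iteratively reweighted least squares, chosen here only for brevity; the function names and the iteration settings are my own assumptions.

```python
import numpy as np


def fit_line_l2(x, y):
    """Ordinary least squares fit; returns (slope, intercept)."""
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def fit_line_l1(x, y, iters=100, eps=1e-8):
    """Approximate least-absolute-deviation fit via reweighted least squares."""
    A = np.column_stack([x, np.ones_like(x)])
    coef = fit_line_l2(x, y)  # start from the L2 solution
    for _ in range(iters):
        resid = np.abs(y - A @ coef)
        # Weight 1/|r| turns r^2 into |r|; eps guards against division by zero
        sqrt_w = 1.0 / np.sqrt(np.maximum(resid, eps))
        coef, *_ = np.linalg.lstsq(A * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return coef


x = np.arange(10.0)
y = 2.0 * x + 1.0
y[9] = 100.0  # one gross outlier (the clean value would be 19)

l2_slope, l2_intercept = fit_line_l2(x, y)
l1_slope, l1_intercept = fit_line_l1(x, y)
print("L2 fit:", l2_slope, l2_intercept)
print("L1 fit:", l1_slope, l1_intercept)
```

With the outlier removed, both fits return the same line, which is the point made above.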
Well, that's not quite accurate -- Taleb doesn't argue that we should do away with squared error, he says we should do away with the standard deviation in favour of expected deviation from the mean.
(Hrm, I'm going back and forth on whether there's a substantive difference between the concepts of variance/standard deviation as measures of a probability distribution and squared/RMS error measures... Obviously they coincide when the "prediction" is the mean of the data, but I don't think that's terribly convincing. I think it'd be perfectly reasonable to use squared error in your model fit and parametrise your Gaussian distributions by their average absolute deviation from the mean.)
Discussed on HN https://news.ycombinator.com/item?id=7064435
The three norms people think of
first are L1, L2, and L-infinity:
L1 is from absolute values.
L2 is from squares.
And L3 is from the absolute value
of the largest value (i.e.,
the worst case).
But in addition it would be nice
if the vector space with a norm
was also an inner product space
and the norm from the inner product.
Then, right, bingo, presto, for
the standard inner product the norm
we get is L2.
Why an inner product space? Well,
with a lot of generality and
meager assumptions, we have a
Hilbert space, that is,
a complete inner product space.
The core of the proof of completeness
is just the Minkowski inequality.
Being in a Hilbert space has a
lot of advantages: E.g., we
get orthogonality and
can take projections and, thus,
get as close as possible in our L2
like projections, e.g., the
Pythagorean theorem. E.g., in
regression in statistics,
we like that the
total sum of squares is the
sum of the regression sum of
squares and the error sum
of squares -- right, the Pythagorean theorem.
We have some nice separation
results. We can use Fourier
theory. And there's more.
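The sum-of-squares identity can be checked numerically. A minimal sketch with synthetic data (the slope, intercept, and noise are arbitrary choices of mine), assuming a regression that includes an intercept:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=40)
y = 3.0 * x + 1.0 + rng.normal(size=40)

# Least squares fit with an intercept column
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef  # projection of y onto the column space of A

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)         # error sum of squares

# The two legs of the right triangle are orthogonal
inner = (y_hat - y.mean()) @ (y - y_hat)

print(sst, ssr + sse, inner)
```

The inner product of the fitted part and the residual is zero up to rounding, which is exactly why the squared lengths add.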
And there are some good convergence
results: If we converge in L2,
then we also converge in other
One reason for liking a Hilbert
space is that the L2 real valued
random variables form a Hilbert
space, and there convergence in L2
means almost sure convergence
(the best kind) of at least
a subsequence and, often in practice,
the whole sequence. So, we
connect nicely with measure theory.
We have some representation
results: A continuous linear functional
on a Hilbert space is just
a point in the space
applied with the inner product.
We like linear functionals and
like knowing that on a Hilbert
space they are so simple.
Working with L1 and L-infinity
is generally much less pleasant.
That is, we really like Hilbert space.
Net, we rush to a Hilbert space
and its L2 norm from its
inner product whenever we can.
You mean L-infinity.
Right! I wrote "L3" but
never defined an L3. So,
yes, I meant L-infinity.
Sorry 'bout that!
Not the first time I typed too fast.
I did omit the other L^p
The coefficients you need in the
projections are just the
values of some inner products.
With random variables, those
coefficients are covariances,
that is, much the same as
correlations, that commonly
can estimate from data.
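A small sketch of that claim with synthetic data (the coefficient 0.7 and the sample size are arbitrary): the coefficient for projecting a centered y onto a centered x is a ratio of inner products, which is the same number as covariance over variance and as the least squares slope.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
y = 0.7 * x + rng.normal(size=1000)

# Center both variables, then project y onto x
xc = x - x.mean()
yc = y - y.mean()
coef_inner = (xc @ yc) / (xc @ xc)  # <x, y> / <x, x>

# Same quantity written with sample covariance and variance
coef_cov = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Same quantity again as the slope of an ordinary least squares line
slope = np.polyfit(x, y, 1)[0]

print(coef_inner, coef_cov, slope)
```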
In the multivariate Gaussian
case, uncorrelated implies
independent.
Fourier theory is easier in
L2 than in L1. E.g., in
classic Fourier series, the
error in the approximation
is in L2 and is from the
L2 orthogonality of the
Yes, L-infinity can
also be nice: The
uniform limit of a sequence
of continuous functions
is continuous.
Or, with L2, often get a
Hilbert space but with
L1 or L-infinity usually
get at best just a Banach space --
that is, a complete, normed
vector space. Then, yes, can
get the Hahn-Banach theorem,
but the same thing in Hilbert
space is easier.
There is a sense in which L1 and
L-infinity are duals of
each other, but L2 is self-dual
which is nicer.
Filling in all these details and
more is part of functional analysis
101. There it's tough to miss at least
three books of W. Rudin:
Principles of Mathematical Analysis,
Real and Complex Analysis,
and Functional Analysis.
There's more, but I've
got some bugs to get out
of the software of my
I like the question -- asked
it myself at the NIST early in my career.
The answer I gave here is better
than what people told me then.
I've indicated likely most of the
main points, but my answer here is
rough and ready (I typed too fast),
and a quite polished answer is
also possible -- I just don't have time
to dig out my grad school
course notes, scan through Rudin,
Dunford and Schwartz,
Kolmogorov and Fomin, much of
digital filtering, much of
multi-variate statistics, etc.
Time to dig out Rudin.
But, in practice, the usual situation,
e.g., signal processing, multi-variate
statistics, there's no good reason
not to use L2 and many biggie reasons
to use it. E.g., for a given box
of data, commonly the better tools in
L2 just let you do better.
to the customer: "If you will go for
a good L2 approximation, then we
are in good shape. If you insist
on L1 or L-infinity, then we will
need a lot more data and still
won't do as well.".
Again, a biggie example is just
classic Fourier series. Sure,
if you are really concerned about
the Gibbs phenomenon, then maybe
work on that. Otherwise, L2 is the
place to be.
E.g., L1 and L-infinity can commonly
take you into linear programming.
Generally you will be much happier
with the tools available to you
A really good explanation would
require much of a good ugrad
and Master's in math, with
concentration on analysis and
a wide range of applications.
I've been there, done that but
just don't have time to
write out even a good summary of
all that material here.
No doubt the full literature
is enormous -- I don't know all
But Rudin is a good author, and
as a writer got better, less
severe in style and, thus,
easier to read,
later in his career.
Cost (in the real world) is generally a quadratic quantity.
If this isn't obvious, think of all the formulas you've seen for work (or energy), which is the ultimate notion of "cost" in the real world: W = mv^2/2, W = CV^2/2, W = ɛE^2/2, etc.
Furthermore, the nice thing about energy (a frequent notion of cost) is that it is independent of the basis you measure it in. Put another way, since energy is generally the squared length of a vector, its value won't change no matter how you look at the vector.
Obviously, we're not always trying to minimize work or energy, and our variables of interest aren't always the linear counterparts, but these are true surprisingly often, and so this is a nice motivation for defining cost to be quadratic.
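The basis-independence claim is easy to check numerically. In this sketch (the vector and the rotation angle are arbitrary), rotating the coordinate system leaves the sum of squares unchanged, while the sum of absolute values changes:

```python
import numpy as np

v = np.array([3.0, 4.0])

# Rotate the coordinate system by an arbitrary angle (45 degrees here)
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
w = R @ v  # same vector expressed in a rotated basis

squared_before = np.sum(v ** 2)
squared_after = np.sum(w ** 2)
abs_before = np.sum(np.abs(v))
abs_after = np.sum(np.abs(w))

print(squared_before, squared_after)  # equal up to rounding
print(abs_before, abs_after)          # not equal
```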
(Where is the 'best' place to stand to minimize the average distance required to answer three phones placed on a wall?)
As the article notes, we can often use more sophisticated optimization techniques to make use of objective functions like absolute error which are less "nice" mathematically, but closer to our end goal.
There's a related idea in speech recognition that goes under the name "discriminative training." The idea is that first you train a model using a standard fast objective function like maximum likelihood to get a good "seed" model. Then, you can retrain using a more expensive objective function (maximum mutual information) which corresponds to what you care about--whether the speech recognizer got the whole sentence right. Since that function is discontinuous, MMI represents a smooth version of it with a "randomized" decoder.
Perhaps you are thinking of probabilistic regression problems, for which mean squared error is indeed not strictly proper. My favorite scoring rule for continuous variables is the continuous ranked probability score, which is actually a generalization of mean absolute error (for non-probabilistic regression).
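To make the "generalization of mean absolute error" remark concrete, here is a sketch using the sample-based form of the CRPS, E|X - y| - 0.5 E|X - X'|, for a forecast given as a set of samples. The function name `crps_ensemble` and the numbers are hypothetical choices of mine. When the forecast is a single point, the second term vanishes and the score is just the absolute error.

```python
import numpy as np


def crps_ensemble(samples, y):
    """Sample-based CRPS of an ensemble forecast against the observation y."""
    samples = np.asarray(samples, dtype=float)
    # Average distance between forecast samples and the observation
    term1 = np.mean(np.abs(samples - y))
    # Half the average distance between pairs of forecast samples
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2


# A point forecast of 2.0 against an observation of 5.0
point_score = crps_ensemble([2.0], 5.0)

# A two-member ensemble centered on the observation
spread_score = crps_ensemble([4.0, 6.0], 5.0)

print(point_score, abs(2.0 - 5.0))
print(spread_score)
```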
On the contrary, squared errors are part of the formulas taken as axioms from which to derive the theorem.
I wouldn't be surprised if we could derive a theorem whose essence is tantamount to the essence of the famous Central Limit Theorem, using some other arbitrary formula. Like error to the fourth power.
For instance, actuaries, statisticians, and "neural network guys" (a combination of computer scientists, applied mathematicians & biologists before it really concretized into a discipline) all independently invented logistic regressions (and within the discipline, they usually got invented in a couple of different contexts before they were unified within the same framework). They are all "trying" to do different things in terms of how they were thinking about the problem, but they end up with the same model structure.