
Why squared error? - luu
http://www.benkuhn.net/squared
======
yummyfajitas
I don't like this article. It mentions, but glosses over, the real reason
squared error is used. Assuming a Gaussian error term, squared error yields
the maximum likelihood estimator.

That is the core reason to use squares - if the error is not normal, it just
won't work. This is why, e.g., a small number of outliers can wildly skew a
least squares fit - if the error were normal, they simply wouldn't exist.
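
For readers who haven't seen the derivation, here's a minimal sketch: assume observations \(y_i = f(x_i) + \epsilon_i\) with i.i.d. noise \(\epsilon_i \sim N(0, \sigma^2)\). Then the log-likelihood is

    \[ \log L(f) = \sum_i \log\left[ \frac{1}{\sqrt{2\pi\sigma^2}}
       \exp\left( -\frac{(y_i - f(x_i))^2}{2\sigma^2} \right) \right]
       = -\frac{1}{2\sigma^2} \sum_i (y_i - f(x_i))^2 + \text{const}, \]

so maximizing the likelihood over f is exactly minimizing the sum of squared errors.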

~~~
ryan_j_naughton
Absolutely true.

I wanted to add a note on it being BLUE (the best linear unbiased estimator):

"The Gauss–Markov theorem states that in a linear regression model in which
the errors have expectation zero and are uncorrelated and have equal
variances, the best linear unbiased estimator (BLUE) of the coefficients is
given by the ordinary least squares (OLS) estimator."[1]

[1]
[http://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem](http://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem)
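
For reference, the OLS estimator the theorem is about, in the usual matrix notation (response vector y, design matrix X):

    \[ \hat{\beta}_{OLS} = (X^\top X)^{-1} X^\top y, \]

which, under those assumptions, has the smallest variance among all linear unbiased estimators of \(\beta\).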

~~~
wfunction
This is precisely what I _never_ found convincing: it never explains why we
should be looking for a linear unbiased estimator in the first place.

~~~
ryan_j_naughton
Very good point. We shouldn't necessarily.

The benefit of a linear estimator is that it is a lot easier (both
computationally and algorithmically) to fit/solve a linear model than a
non-linear model. Thus, there are advantages to linear models. Further,
despite being called linear, the model can actually be quite complex: it only
needs to be linear in the model coefficients; the regressors/independent
variables don't have to enter linearly. Thus, you can make one variable the
square of another variable or add a variable that is the interaction of two
other variables. By doing so, you can add non-linearity while still using
linear methods to solve the equation.
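
Here's a minimal sketch of that trick (synthetic data, hypothetical names): the relationship below is nonlinear in the inputs, but because the model is linear in its coefficients, plain least squares recovers them.

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.uniform(-2, 2, 200)
    x2 = rng.uniform(-2, 2, 200)
    # The true relationship is nonlinear in x1 and x2...
    y = 1.0 + 2.0 * x1 - 3.0 * x1**2 + 0.5 * x1 * x2 + rng.normal(0, 0.1, 200)

    # ...but linear in the coefficients once we add squared and interaction columns.
    X = np.column_stack([np.ones_like(x1), x1, x1**2, x1 * x2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta)  # approximately [1.0, 2.0, -3.0, 0.5]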

OK, so once you see the value of a linear model, the next question is why we
would want the best and most unbiased. I think those are rather obvious.

~~~
wfunction
> The next question is why we would want the best and most unbiased. I think
> those are rather obvious.

It's not obvious to me why we want an unbiased estimator, actually. A
maximum-likelihood estimate, for example, seems just as reasonable. Do you
have an explanation?

Not just that, but there's also the question of why we're looking for point
estimates in the first place. It isn't the correct thing to do, but no one
ever gives an explanation.

------
gjm11
The article mentions the fairly familiar fact that "median is to L1 as mean is
to L2" \-- i.e., the mean is the point from which the total squared error is
minimized, and the median is a point from which the total absolute error is
minimized.

I was telling my 8-year-old daughter about that the other day, and it occurred
to me that the other "measure of central tendency" children get taught about,
the mode, _also_ fits into this scheme, with a tiny bit of fudging. Define the
L0 error to be the sum of the 0th powers of the errors, with the (unusual)
convention that 0^0=0. In other words, the L0 error is just the _number of
data points you don't get exactly right_. Then the mode is the value that
minimizes the L0 error. So: L0 : L1 : L2 :: mode : median : mean.
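
A quick brute-force check of the correspondence (a sketch; the grid is chosen so it contains the data values exactly):

    import numpy as np

    data = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 9.0])
    candidates = np.arange(0, 1000) / 100.0  # 0.00, 0.01, ..., 9.99

    def minimizer(p):
        errs = np.abs(data[None, :] - candidates[:, None])
        if p == 0:
            # 0^0 := 0 convention: just count the points we miss.
            loss = (errs > 1e-9).sum(axis=1)
        else:
            loss = (errs ** p).sum(axis=1)
        return candidates[np.argmin(loss)]

    print(minimizer(0))  # 3.0  -- the mode
    print(minimizer(1))  # 3.0  -- the median
    print(minimizer(2))  # 3.29 -- the mean (23/7 = 3.2857...), up to grid resolution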

(Note 1. The obvious other Lp norm to consider is with p=infinity, in which
case you're minimizing the _maximum_ error. That gives you the average of the
min and max values in your dataset. Not useful all that often.)

(Note 2. What about the rest of the "L0 column" of Ben's table? Your measure
of spread is the number of values not equal to the mode. Your implied
probability distribution is an improper one where there's some nonzero
probability of getting the modal value, and all other values are "equally
unlikely". (I repeat: not actually possible; there is no uniform distribution
on the whole real line.) Your regression technique is one a little like
RANSAC, where you try to fit as many points _exactly_ as you can, and all
nonzero errors are equally bad. I doubt there's any analogue of PCA, but I
haven't thought about it. Your regularized-regression technique is "best
subset selection", where you simply penalize for the number of nonzero
coefficients.)

~~~
TheLoneWolfling
Define it as the limit as n goes to 0 from above of the sum of abs(act -
expected)^(n), i.e. of Ln. No need to muck about with zero to the zeroth power
directly.
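
Spelled out: for errors \(e_i\), each term \(|e_i|^n \to 1\) as \(n \to 0^+\) when \(e_i \neq 0\), and equals 0 for every n > 0 when \(e_i = 0\), so

    \[ \lim_{n \to 0^+} \sum_i |e_i|^n = \#\{ i : e_i \neq 0 \}, \]

which is exactly the L0 count above, without ever evaluating 0^0.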

------
hessenwolf
Laplace and Gauss were in mad competition to find a general rule for the
pattern of errors for measurements in astronomy. It's around 1800, and the
medium of exchange is letters.

Laplace has recently shown that the integral of the exponential of minus t
squared, where t goes from minus to plus infinity, has a definite answer (the
square root of pi).

Gauss starts a paper, basically says, let's just run with the arithmetic
average, the usual mean, as the go-to best estimate, from a big series of
measurements of a single thing, each of which has errors, because, yeah, why
not.

Moments later the guy shows that squaring the errors makes it equivalent to
that thing Laplace did, and we now have a general rule. His argument follows
by supposing each repeated error due to a specific cause has an expected error
of zero, with no particular rule of distribution, but that as the number of
specific causes goes to infinity, the central limit theorem applies (go Euler,
fair play). Hence, wiggley wiggley, Laplace's thing.

Beautiful piece of work, and resting on Laplace and Euler too, two other
heroic oddballs.

The best thing is the underlying explanation: Take a nearly-infinite number of
different measurements of a thing, each measurement with its own pattern of
rubbishness, and the average of those measurements will approach this
distribution. That bit is what makes it so broadly applicable.

------
TTPrograms
This is a terrific explanation for the uninitiated - covers pretty much all
the reasons I would think of.

The author mostly describes nice mathematical properties, so I'd recommend
considering absolute deviation in practical analysis in many cases as well.
Then your error is dominated less by a few outliers in your model.
Specifically, least-squares is best when you expect your error to be Gaussian
distributed, which isn't always the case.

See for example: [http://matlabdatamining.blogspot.com/2007/10/l-1-linear-
regr...](http://matlabdatamining.blogspot.com/2007/10/l-1-linear-
regression.html)

You might say the example is contrived, but if the outliers were gone both
estimates would give nearly the same result. The main time to prefer L_2 would
be when your error is Gaussian and your variance is huge, but that's going to
be an uphill battle regardless.
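
Here's a small sketch of the contrast (synthetic data; the L1 fit uses a generic optimizer rather than a dedicated quantile-regression routine):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    x = np.linspace(0, 10, 50)
    y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 50)
    y[-3:] += 30.0  # a few gross outliers

    X = np.column_stack([np.ones_like(x), x])

    # L2: closed-form least squares, dragged toward the outliers.
    beta_l2, *_ = np.linalg.lstsq(X, y, rcond=None)

    # L1: numerically minimize the sum of absolute residuals.
    beta_l1 = minimize(lambda b: np.abs(y - X @ b).sum(),
                       x0=beta_l2, method="Nelder-Mead").x

    print(beta_l2)  # noticeably pulled away from the true [1.0, 2.0]
    print(beta_l1)  # close to [1.0, 2.0]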

~~~
benkuhn
Thanks for the kind words! I actually agree that L1 error is often more
practical, and didn't mean the post to come off as prescriptive. (In fact,
part of my motivation for writing it was that L1 error seemed to have much
better practical properties, and I was curious why people used L2 _despite_
this!)

~~~
repsilat
It's a good post. If you're interested, Nassim Taleb ( _shudders_ ) wrote a
counterpoint for Edge.org that seems to have disappeared. It exists on the
Internet Archive, though:
[https://web.archive.org/web/20140606195420/http://edge.org/r...](https://web.archive.org/web/20140606195420/http://edge.org/response-
detail/25401)

Well, that's not quite accurate -- Taleb doesn't argue that we should do away
with squared error; he says we should do away with the standard deviation in
favour of expected deviation from the mean.

(Hrm, I'm going back and forth on whether there's a substantive difference
between the concepts of variance/standard deviation as measures of a
probability distribution and squared/RMS error measures... Obviously they
coincide when the "prediction" is the mean of the data, but I don't think
that's terribly convincing. I think it'd be perfectly reasonable to use
squared error in your model fit and parametrise your Gaussian distributions by
their average absolute deviation from the mean.)

Discussed on HN
[https://news.ycombinator.com/item?id=7064435](https://news.ycombinator.com/item?id=7064435)

~~~
stdbrouw
When talking about deviation, you're describing the spread of a distribution.
It's perfectly fine to describe your data using mean (or median) absolute
deviation, but then fit a model on top of that which minimizes the sum of
squared errors. Taleb's point is that squared deviations make for a lousy
descriptive statistic, as they lead you to overestimate the typical distance
of observations from the mean.
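
A quick illustration of that mismatch (a sketch with synthetic heavy-tailed data): the standard deviation always comes out at least as large as the mean absolute deviation, and on heavy tails the gap is big.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.standard_t(df=3, size=100_000)  # heavy-tailed sample

    std = x.std()                      # root of the mean *squared* deviation
    mad = np.abs(x - x.mean()).mean()  # mean absolute deviation

    print(std, mad)  # std is well above mad here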

------
graycat
With meager assumptions and a standard set up, we are in a vector space and
want a meaning of _distance_ , that is, a vector space _norm_.

The three norms people think of first are L1, L2, and L-infinity: L1 is from
absolute values. L2 is from squares. And L3 is from the absolute value of the
largest value (i.e., the _worst case_ ).

But in addition it would be nice if the vector space with a norm were also an
inner product space, with the norm coming from the inner product. Then, right,
bingo, presto, for the standard inner product the norm we get is L2.

Why an inner product space? Well, with a lot of generality and meager
assumptions, we have a Hilbert space, that is, a complete inner product space.
The core of the proof of completeness is just the Minkowski inequality.

Being in a Hilbert space has a lot of advantages: E.g., we get orthogonality
and can take projections and, thus, get as close as possible in our L2 norm.
We like projections, e.g., the Pythagorean theorem. E.g., in regression in
statistics, we like that the total sum of squares is the sum of the regression
sum of squares and the error sum of squares -- right, the Pythagorean theorem.
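
In symbols, with fitted values \(\hat{y}_i\) and data mean \(\bar{y}\), and assuming the model includes an intercept so the residual vector is orthogonal to the fitted values:

    \[ \underbrace{\sum_i (y_i - \bar{y})^2}_{\text{total}}
       = \underbrace{\sum_i (\hat{y}_i - \bar{y})^2}_{\text{regression}}
       + \underbrace{\sum_i (y_i - \hat{y}_i)^2}_{\text{error}} \]

The right triangle lives in R^n, with legs \(\hat{y} - \bar{y}\) and \(y - \hat{y}\).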

We have some nice separation results. We can use Fourier theory. And there's
more.

And there are some good convergence results: If we converge in L2, then we
also converge in other good ways.

One reason for liking a Hilbert space is that the L2 real-valued random
variables form a Hilbert space, and there convergence in L2 implies almost
sure convergence (the best kind) of at least a subsequence and, often in
practice, of the whole sequence. So, we connect nicely with measure theory.

We have some representation results: a continuous linear functional on a
Hilbert space is just the inner product with a fixed point in the space (the
Riesz representation theorem). We like linear functionals and like knowing
that on a Hilbert space they are so simple.

Working with L1 and L-infinity is generally much less pleasant. That is, we
really like Hilbert space.

Net, we rush to a Hilbert space and its L2 norm from its inner product
whenever we can.

~~~
tome
> And L3 is from the absolute value of the largest value

You mean L-infinity.

~~~
graycat
Thanks!

Right! I wrote "L3" but never defined an _L3_. So, yes, I meant L-infinity.
Sorry 'bout that!

Not the first time I typed too fast!

I did omit the other L^p spaces.

------
wfunction
I've struggled with the motivation for this too, and over time I've come to
the conclusion that there is a much better 1-sentence explanation:

 _Cost (in the real world) is generally a quadratic quantity._

If this isn't obvious, think of all the formulas you've seen for work (or
energy), which is the ultimate notion of "cost" in the real world: W = mv^2/2,
W = CV^2/2, W = ɛE^2/2, etc.

Furthermore, the nice thing about energy (a frequent notion of cost) is that
it is _independent of the basis you measure it in_. Put another way, since
energy is generally the squared length of a vector, its value won't change
under any rotation of the coordinates in which you look at the vector.
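
A quick check of that claim (a sketch): the squared length of a vector survives any orthogonal change of basis (a rotation), while, say, the L1 norm does not.

    import numpy as np

    rng = np.random.default_rng(3)
    v = rng.normal(size=5)

    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))

    print(np.sum(v**2), np.sum((Q @ v)**2))      # equal: L2 is rotation-invariant
    print(np.abs(v).sum(), np.abs(Q @ v).sum())  # generally differ: L1 is not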

Obviously, we're not _always_ trying to minimize work or energy, and our
variables of interest aren't _always_ the linear counterparts, but these are
true surprisingly often, and so this is a nice motivation for defining cost to
be quadratic.

~~~
nwatson
Squaring the error makes sense to me ... having a sample that is 2 units off
is 4 times worse than having a sample that is 1 unit off. Building your
approximation to minimize the square of the error makes more intuitive sense
to me.

~~~
sukilot
You paraphrased the definition of squared error but didn't say why it makes
sense.

------
squeakynick
Vaguely related:
[http://datagenetics.com/blog/november12012/index.html](http://datagenetics.com/blog/november12012/index.html)

(Where is the 'best' place to stand to minimize the average distance required
to answer three phones placed on a wall?)

------
brandonb
This is a great article! It's always surprised me that, in machine learning
and statistics, we take an objective function like squared error as a "given"
rather than trying to design an objective function that's tailored to what we
really care about.

As the article notes, we can often use more sophisticated optimization
techniques to make use of objective functions like absolute error which are
less "nice" mathematically, but closer to our end goal.

There's a related idea in speech recognition that goes under the name
"discriminative training." The idea is that you first train a model using a
standard, fast objective function like maximum likelihood to get a good "seed"
model. Then, you can retrain using a more expensive objective function
(maximum mutual information) which corresponds to what you care about --
whether the speech recognizer got the whole sentence right. Since that
function is discontinuous, MMI represents a smooth version of it with a
"randomized" decoder.

------
deng
The squared error is indeed a proper scoring rule; it is, however, important
to note that it is not strictly proper, meaning that if you have two different
probabilistic forecasts generating the same mean, they will perform the same.
Outside the linear context, this is often not what you want, and you should
look into alternatives like the logarithmic scoring rule.

~~~
shoyer
No, squared error (the Brier score) is absolutely a strictly proper scoring
rule for classification, which is the claim made in the article. See the
linked Wikipedia article:
[https://en.wikipedia.org/wiki/Scoring_rule#Brier.2Fquadratic...](https://en.wikipedia.org/wiki/Scoring_rule#Brier.2Fquadratic_scoring_rule)

Perhaps you are thinking of probabilistic regression problems, for which mean
squared error is indeed not strictly proper. My favorite scoring rule for
continuous variables is the continuous ranked probability score, which is
actually a generalization of mean absolute error (for non-probabilistic
regression).
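
A quick numerical check of strict propriety in the binary case (a sketch): if the event truly occurs with probability p, the expected Brier score of a forecast q is p(1-q)^2 + (1-p)q^2, minimized only at q = p.

    import numpy as np

    p = 0.3                      # true event probability (hypothetical)
    q = np.linspace(0, 1, 1001)  # candidate forecasts

    expected_brier = p * (1 - q) ** 2 + (1 - p) * q ** 2
    print(q[np.argmin(expected_brier)])  # 0.3 -- honest reporting is optimal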

~~~
deng
You are right, I was thinking of probabilistic problems.

------
cafebeen
A nice and interesting article, but describing the squared error as "merely
convenient" is a bit misleading, given that most noise is Gaussian, due to the
central limit theorem.

~~~
hnmcs
Nah. That theorem doesn't imply that we must square errors based only on
axioms which are facts of the natural world.

On the contrary, squared errors are part of the formulas taken as axioms from
which the theorem is derived.

I wouldn't be surprised if we could derive a theorem whose essence is
tantamount to that of the famous Central Limit Theorem using some other
arbitrary formula, like error to the fourth power.

~~~
cafebeen
My point was that many things are normally distributed (likely due to the CLT)
and the square error is the right choice in those cases. This is an empirical
thing as well, in that measurement noise is often Gaussian.

------
fiatmoney
The really cool thing about a lot of machine learning algorithms is their
synchronicity. There are usually multiple, independent rationales for choosing
or developing a particular algorithm that end up saying very deep things about
information theory.

For instance, actuaries, statisticians, and "neural network guys" (a
combination of computer scientists, applied mathematicians & biologists,
before it really concretized into a discipline) all independently invented
logistic regression (and within each discipline, it usually got invented in a
couple of different contexts before being unified within the same framework).
They were all "trying" to do different things in terms of how they thought
about the problem, but they ended up with the same model structure.

------
dsjoerg
Thanks for the article; I wish I had read it before competing in a Kaggle
contest where the score was based on absolute deviation. I ended up finding
quantile regression and its relationship to the median eventually, but this
article connects all the ideas very nicely.

------
jmount
The author is right: squared error isn't completely natural. However, one
great property is that it gets expected values right (so using it helps you
develop unbiased estimators). I have some notes on this here: [http://www.win-
vector.com/blog/2014/01/use-standard-deviatio...](http://www.win-
vector.com/blog/2014/01/use-standard-deviation-not-mad-about-mad/)
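
The one-line version of that property: for any constant c,

    \[ E[(X - c)^2] = \mathrm{Var}(X) + (E[X] - c)^2, \]

which is minimized exactly at c = E[X]. The corresponding minimizer of E[|X - c|] is a median, which is why absolute error "gets medians right" instead.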

------
wtbob
Man, that MathJax (I assume) is ugly without JavaScript: '\\((P_{predicted} -
P_{actual})\\)' instead of 'P_predicted - P_actual', which would look fine.
Heck, with <sub> one could even Do the Right Thing in HTML itself.

------
bandrami
So that it's always positive, my prof used to say. I kind of like the
simplicity of that.

------
enupten
I'm surprised the author didn't mention PCA and its invariance to orthogonal
transforms - and its complete breakdown under those that aren't.

