 Why squared error? 124 points by luu on May 16, 2015 | hide | past | favorite | 51 comments I dont like this article. It mentions, but glosses over, the real reason squared error is used. Assuming a Gaussian error term, squared error yields the maximum likelihood estimator.That is the core reason to use squares - if the error is not normal, it just won't work. This is why, e.g., a small number of outliers can wildly skew a least squares fit - if error was normal, they simply wouldn't exist. Absolutely true.I wanted add a note on it being BLUE (the best linear unbiased estimator)"The Gauss–Markov theorem states that in a linear regression model in which the errors have expectation zero and are uncorrelated and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares (OLS) estimator." This is precisely what I never found convincing: it never explains why we should be looking for a linear unbiased estimator in the first place. Very good point. We shouldn't necessarily.The benefit of a linear estimator is it is a lot easier (both computationally and algorithmically) to fit/solve a linear model than a non-linear model. Thus, there are advantages to linear models. Further, despite it being called linear, the model can actually be quite complex. It only needs to be linear in the model coefficients, but the regressors/independent variables don't have to be linear. Thus, you can make one variable the square of another variable or add a variable that is the interaction of two other variables. By doing so, you can add non-linearity while still using linear methods to solve the equation.Ok, so then once you see the value of a linear model, then the next question is why we would want the best and most unbiased. I think those are rather obvious. > The next question is why we would want the best and most unbiased. I think those are rather obvious.It's not obvious to me why we want an unbiased estimator actually. A Maximum-Likelihood estimate, for example, seems just as reasonable. Do you have an explanation?Not just that, but there's also the question of why we're looking for point estimates in the first place. It isn't the correct thing to do, but no one ever gives an explanation. Yes, all mathematical niceties don't matter if MMSE estimation is not a good estimation in practice. This is the reason Gauss invented MMSE in the first place along with the eponymous distribution, while working with astronomical observations iirc. I'm not sure why you feel that is the "real" reason. It is a strong one, of course, but the fact that the squared error has a derivative - and a nice and simple one - is equally strong in my opinion. Having a derivative is computationally nice. But if your error term is not normal, that only means you can easily compute the wrong answer.I.e. its great if you dropped your key under the lamppost, but the existence of the lamppost doesn't make it a good place to search. Good point, that really is the key. However, itjust turns it into the almost equivalent problem of: why assume Gaussian error terms, and the rest of the article does a good job at motivating this. Gaussian is sometimes a good approximation if the error is a sum of independent thin-tailed variables (due to the c.l.t.). That is all. The article does not, in my opinion, motivate this very well. Nah. Least squares is necessarily optimal under a normal distribution, but that does not imply that it does not work if the distribution is not normal. It might even be optimal for some other distributions, I just cannot remember, and I have to watch the frying pan. Least squares is disastrously bad for anything with a fat tail (e.g., power law decay). The reason is that in these cases one or two outliers will dominate the sum of the tails.It's not optimal for any other distribution. For a general error distribution g(x), you want to maximize sum(log(g(x-x0))) (or equivalently prod(g(x-x0))) - this only turns to least squares if log(g(x)) = C + D(x-x0)^2. That sounds neat. Have you got a reference to it handy? I don't know of any references, most of this is just stuff that falls out pretty easily once you try and do the math, and that's probably faster than reading a book.I.e., set up maximal likelihood, take logs, and you immediately get least squares for g(x) a gaussian. If g(x)=exp(-|x|), you get l1 minimization. Other distributions give you other things.The general topic to investigate is robust regression: https://en.wikipedia.org/wiki/Robust_regression The article mentions the fairly familiar fact that "median is to L1 as mean is to L2" -- i.e., the mean is the point from which the total squared error is minimized, and the median is a point from which the total absolute error is minimized.I was telling my 8-year-old daughter about that the other day, and it occurred to me that the other "measure of central tendency" children get taught about, the mode, also fits into this scheme, with a tiny bit of fudging. Define the L0 error to be the sum of the 0th powers of the errors, with the (unusual) convention that 0^0=0. In other words, the L0 error is just the number of data points you don't get exactly right. Then the mode is the value that minimizes the L0 error. So: L0 : L1 : L2 :: mode : median : mean.(Note 1. The obvious other Lp norm to consider is with p=infinity, in which case you're minimizing the maximum error. That gives you the average of the min and max values in your dataset. Not useful all that often.)(Note 2. What about the rest of the "L0 column" of Ben's table? Your measure of spread is the number of values not equal to the mode. Your implied probability distribution is an improper one where there's some nonzero probability of getting the modal value, and all other values are "equally unlikely". (I repeat: not actually possible; there is no uniform distribution on the whole real line.) Your regression technique is one a little like RANSAC, where you try to fit as many points exactly as you can, and all nonzero errors are equally bad. I doubt there's any analogue of PCA, but I haven't thought about it. Your regularized-regression technique is "best subset selection", where you simply penalize for the number of nonzero coefficients.) Define it as the limit as n goes to 0 from above of the sum of abs(act - expected)^(n), i.e. of Ln. No need to muck about with zero to the zeroth power directly. The L0/L1 norm can have surprising benefits under certain assumptions. If you know (or assume) your signal is sparse (or mostly zero) in some domain, you're just trying to minimize the nonzero terms with the L0 norm.... But that's hard so we use the L1 norm (which still guarantees exact recovery even under Nyquist). Look up the 2004 paper by Candes and Tao for more info (titled "Exact recovery ... undersampled ...") Wow, despite studying probability for years I never quite realised that the mean minimised the L2 norm. This is despite knowing full-well the essentially equivalent fact that the mean is the orthogonal projection in L2!The median minimising L1 is also very nice, and I did not know that. It also means that the concept of median generalises easily to higher dimensions. Laplace and Gauss were in mad competition to find a general rule for the pattern of errors for measurements in astronomy. It's around 1800, and the medium of exchange is letters.Laplace has recently shown that the integral of minus t squared, where t goes from minus to plus infinity has a definite answer.Gauss starts a paper, basically says, let's just run with the arithmetic average, the usual mean, as the go-to best estimate, from a big series of measurements of a single thing, each of which has errors, because, yeah, why not.Moments later the guy shows that squaring the errors makes it equivalent to that thing Laplace did, and we now have a general rule. His argument follows by supposing each repeated error due to a specific cause has an expected error of zero, with no particular rule of distribution, but that as the number of specific causes goes to infinity, the Law of large numbers applies (go Euler, fair play). Hence, wiggley wiggley, Laplace's thing.Beautiful piece of work, and resting on Laplace and Euler too, two other heroic oddballs.The best thing is the underlying explanation: Take a nearly-infinite number of different measurements of a thing, each measurement with its own pattern of rubbishness, and the average of those measurement will approach having this distribution. That bit is what makes it so broadly applicable. This is a terrific explanation for the uninitiated - covers pretty much all the reasons I would think of.The author mostly describes nice mathematical properties, so I'd recommend considering absolute deviation in practical analysis in many cases as well. Then your error is dominated less by few outliers in your model. Specifically, least-squares is best when you expect your error to be gaussian distributed, which isn't always the case.See for example: http://matlabdatamining.blogspot.com/2007/10/l-1-linear-regr...You might say the example is contrived, but if the outliers were gone both estimates would give nearly the same result. The main time to prefer L_2 would be when your error is gaussian and your variance is huge, but that's going to be an uphill battle regardless. Thanks for the kind words! I actually agree that L1 error is often more practical, and didn't mean the post to come off as prescriptive. (In fact, part of my motivation for writing it was that L1 error seemed to have much better practical properties, and I was curious why people used L2 despite this!) It's a good post. If you're interested, Nassim Taleb (shudders) wrote a counterpoint for Edge.org that seems to have disappeared. It exists on the Internet Archive, though: https://web.archive.org/web/20140606195420/http://edge.org/r...Well, that's not quite accurate -- Taleb doesn't argue that we should do away with squared error, he says we should do away with the standard deviation in favour of expected deviation from the mean.(Hrm, I'm going back and forth on whether there's a substantive difference between the concepts of variance/standard deviation as measures of a probability distribution and squared/RMS error measures... Obviously they coincide when the "prediction" is the mean of the data, but I don't think that's terribly convincing. I think it'd be perfectly reasonable to use squared error in your model fit and parametrise your Gaussian distributions by their average absolute deviation from the mean.)Discussed on HN https://news.ycombinator.com/item?id=7064435 When talking about deviation, you're describing the spread of a distribution. It's perfectly fine to describe your data using mean (or median) average deviation, but then fit a model on top of that minimizes the sum of squared errors. Taleb's point is that squared deviations make for a lousy descriptive statistic as it leads you to underestimate the average distance of observations from the mean. With meager assumptions and a standard set up, we are in a vector space and want a meaning of distance, that is, a vector space norm.The three norms people think of first are L1, L2, and L-infinity: L1 is from absolute values. L2 is from squares. And L3 is from the absolute value of the largest value (i.e., the worst case).But in addition it would be nice if the vector space with a norm was also an inner product space and the norm from the inner product. Then, right, bingo, presto, for the standard inner product the norm we get is L2.Why an inner product space? Well, with a lot of generality and meager assumptions, we have a Hilbert space, that is, a complete inner product space. The core of the proof of completeness is just the Minkowski inequality.Being in a Hilbert space has a lot of advantages: E.g., we get orthogonality and can take projections and, thus, get as close as possible in our L2 norm. We like projections, e.g., the Pythagorean theorem. E.g., in regression in statistics, we like that thetotal sum of squares is the sum of the regression sum of squares and the error sum of squaresright, the Pythagorean theorem.We have some nice separation results. We can use Fourier theory. And there's more.And there are some good convergence results: If we converge in L2, then we also converge in other good ways.One reason for liking a Hilbert space is that the L2 real valued random variables form a Hilbert space, and there convergence in L2 means almost sure convergence (the best kind) of at least a subsequence and, often in practice, the whole sequence. So, we connect nicely with measure theory.We have some representation results: A linear operator on a Hilbert space is just a point in the space applied with the inner product. We like linear operators and like knowing that on a Hilbert space they are so simple.Working with L1 and L-infinity is generally much less pleasant. That is, we really like Hilbert space.Net, we rush to a Hilbert space and its L2 norm from its inner product whenever we can. > And L3 is from the absolute value of the largest valueYou mean L-infinity. Thanks!Right! I wrote "L3" but never defined an L3. So, yes, I meant L-infinity. Sorry 'bout that!Not the first time I typed too fast!I did omit the other L^p spaces. Excellent points, though it's not completely clear why Euclidian geometry is the superior choice to me. You get to do projections as in the Pythagorean theorem.The coefficients you need in the projections are just the values of some inner products. With random variables, those coefficients are covariances, that is, much the same as correlations, that commonly can estimate from data.In the multivariate Gaussian case, uncorrelated implies independence.Fourier theory is easier in L2 than in L1. E.g., in classic Fourier series, the error in the approximation is in L2 and is from the L2 orthogonality of the harmonics.Yes, L-infinity can also be nice: The uniform limit of a sequence of continuous functions is continuous.Or, with L2, often get a Hilbert space but with L1 or L-infinity usually get at best just a Banach space -- that is, a complete, normed vector space. Then, yes, can get the Hahn-Banach theorem, but the same thing in Hilbert space is easier.There is a sense in which L1 and L-infinity are duals of each other, but L2 is self-dual which is nicer.Filling in all these details and more is part of functional analysis 101. There tough to miss at least three books of W. Rudin: Principles of Mathematical Analysis, Real and Complex Analysis, and Functional Analysis.There's more, but I've got some bugs to get out of the software of my Web pages!I like the question -- asked it myself at the NIST early in my career. The answer I gave here is better than what people told me then.I've indicated likely most of the main points, but my answer here is rough and ready (I typed too fast), and a quite polished answer is also possible -- I just don't have time today to dig out my grad school course notes, scan through Rudin, Dunford and Schwartz, Kolmogorov and Fomin, much of digital filtering, much of multi-variate statistics, etc. I intuit that you're getting at the real answer with the self-dual. That makes a lot of sense. Also, from a practical perspective L2 is very nice because it causes the problem of error reduction to be quadratic, so it scales well.Time to dig out Rudin. Thia sounds like lampposting -- closing a model because the math is nice, not because it is an accurate model. No: If what you want is the L-infinity norm, then go for it. A standard place for that is numerical approximations of special functions -- want guarantees on the worst case error. And there is some math to help achieve that. It's sometimes called Chebyshev approximation.But, in practice, the usual situation, e.g., signal processing, multi-variate statistics, there's no good reason not to use L2 and many biggie reasons to use it. E.g., for a given box of data, commonly the better tools in L2 just let you do better.Or to the customer: "If you will go for a good L2 approximation, then we are in good shape. If you insist on L1 or L-infinity, then we will need a lot more data and still won't do as well.".Again, a biggie example is just classic Fourier series. Sure, if you are really concerned about the Gibbs phenomenon, then maybe work on that. Otherwise, L2 is the place to be.E.g., L1 and L-infinity can commonly take you into linear programming.Generally you will be much happier with the tools available to you in L2. Again, just now I just don't have time for a more full, complete, and polished explanation.A really good explanation would require much of a good ugrad and Master's in math, with concentration on analysis and a wide range of applications. I've been there, done that but just don't have time to write out even a good summary of all that material here. putting a shout in for Bollobas here, surely all roads to functional analysis don't lead through Rudin? Right. Also, say, Kolmogorov and Fomin. And Dunford and Schwartz.No doubt the full literature is enormous -- I don't know all of it!But Rudin is a good author, and as a writer got better, less severe in style and, thus, easier to read, over time in his career. I've struggled with the motivation for this too, and over time I've come to the conclusion that there is a much better 1-sentence explanation:Cost (in the real world) is generally a quadratic quantity.If this isn't obvious, think of all the formulas you've seen for work (or energy), which is the ultimate notion of "cost" in the real world: W = mv^2/2, W = CV^2/2, W = ɛE^2/2, etc.Furthermore, the nice thing about energy (a frequent notion of cost) is that it is independent of the basis you measure it in. Put another way, since energy is generally the squared length of a vector, its value won't change no matter how you look at the vector.Obviously, we're not always trying to minimize work or energy, and our variables of interest aren't always the linear counterparts, but these are true surprisingly often, and so this is a nice motivation for defining cost to be quadratic. Squaring the error makes sense to me ... having an approximation such that a sample is 2 units off is 4 times worse than having a sample 1 unit off. Building your approximation to minimize the square of the error makes more intuitive sense to me. You paraprased the definition of squared error but didn't say why it makes sense. Vaguely related: http://datagenetics.com/blog/november12012/index.html(Where is the 'best' place to stand to minimize the average distance required to answer three phones place on a wall). This is a great article! It's always surprised me that, in machine learning and statistics, we take an objective function like squared error as a "given" rather than trying to design an objective function that's tailored to what we really care about.As the article notes, we can often use more sophisticated optimization techniques to make use of objective functions like absolute error which are less "nice" mathematically, but closer to our end goal.There's a related idea in speech recognition that goes under the name "discriminative training." The idea is that first you train a model using a standard fast objective function like maximum likelihood to get a good "seed" model. Then, you can retrain using a more expensive objective function (maximum mutual information) which corresponds to what you care about--whether the speech recognizer got the whole sentence right. Since that function is discontinuous, MMI represents a smooth version of it with a "randomized" decoder. The squared error is indeed a proper scoring rule, it is however important to note that it is not strictly proper, meaning that if you have two different probabilistic forecasts generating the same mean, they will perform the same. Outside the linear context, this is often not what you want, and you should look into alternatives like the logarithmic scoring rule. No, squared error (the Brier score) is absolutely a strictly proper scoring rule for classification, which is the claim made in the article. See the linked Wikipedia article: https://en.wikipedia.org/wiki/Scoring_rule#Brier.2Fquadratic...Perhaps you are thinking of probabilistic regression problems, for which mean squared error is indeed not strictly proper. My favorite scoring rule for continuous variables is the continuous ranked probability score, which is actually a generalization of mean absolute error (for non-probabilistic regression). You are right, I was thinking of probabilistic problems. A nice and interesting article, but describing the squared error as "merely convenient" is a bit misleading, given that most noise is Gaussian, due to the central limit theorem. Nah. It's not like that theorem implies that we must square errors, based only on axioms which are facts of the natural world.On the contrary, squared errors are part of the formulas taken as axioms from which to derive the theorem.I wouldn't be surprised if we could derive a theorem whose essence is tantamount to the essence of the famous Central Limit Theorem, using some other arbitrary formula. Like error to the fourth power. My point was that many things are normally distributed (likely due to the CLT) and the square error is the right choice in those cases. This is an empirical thing as well, in that measurement noise is often Gaussian. The really cool thing about a lot of machine learning algorithms is their synchronicity. There are usually multiple, independent rationales for choosing or developing a particular algorithm that end up saying very deep things about information theory.For instance, actuaries, statisticians, and "neural network guys" (a combination of computer scientists, applied mathematicians & biologists before it really concretized into a discipline) all independently invented logistic regressions (and within the discipline, they usually got invented in a couple of different contexts before they were unified within the same framework). They are all "trying" to do different things in terms of how they were thinking about the problem, but they end up with the same model structure. Thanks for the article; I wish I had read it before competing in a Kaggle contest where the score was based on absolute deviation. I ended up finding quantile regression and its relationship to the median eventually, but this article connects all the ideas very nicely. The author is right- square error isn't completely natural. However one great property is that it gets expected values right (so using it helps you develop unbiased estimators). I have some notes on this here: http://www.win-vector.com/blog/2014/01/use-standard-deviatio... Man, that MathJAX (I assume) is ugly without JavaScript. '$$(P_{predicted} - P_{actual})$$' instead of 'P_predicted - P_actual,' which would look fine. Heck, with one could even Do the Right Thing in HTML itself. So that it's always positive, my prof used to say. I kind of like the simplicity of that. I'm surprised the author didn't mention PCA and its invariance to orthogonal transforms - and its complete breakdown to those that aren't. Search: