
Why squared error? (2014) - rpbertp13
http://www.benkuhn.net/squared
======
eanzenberg
Why squared error? Because you can solve the equation to minimize squared
error using linear algebra in closed form.

Why L2 regularization? Same reason. A closed form solution exists from linear
algebra.

But at the end of the day, you are most interested in the expected value of
the coefficients, and minimizing the squared error gives you E[coeffs], the
mean of the coefficients.

~~~
bo1024
I don't think this is any more convincing than the article's reasons. There
are closed forms to lots of things that aren't interesting.

~~~
srean
I cannot speak for eanzenberg but I think his comment was less about his
personal justification and more about the rationalizations that have been used
in the history of stats.

Gauss quite openly admitted that the choice was borne out of convenience. The
justification using the Normal or Gaussian distribution came later, and the
Gauss-Markov result on the conditional distribution came even later.

Even at the time Gauss proposed the loss, it was noted by many of Gauss'
peers (and perhaps by Gauss himself) that other loss functions seemed more
appropriate if one goes by empirical performance, in particular the L1
distance.

Now that we have the compute power to deal with L1 it has come back with a
vengeance, and people have been researching its properties with renewed
earnestness. In fact there is a veritable revolution going on right now in
the ML and stats world around it.

Just as optimizing the squared loss gives you conditional expectation,
minimizing the L1 error gives you conditional median. The latter is to be
preferred when the distribution has a fat tail, or is corrupted by outliers.
This knowledge is nowhere close to being new. Gauss's peers knew this.
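
A quick numerical illustration of that mean/median fact (my sketch, not from
the thread): on data with one gross outlier, the minimizer of the squared
error is the mean, dragged far out, while the minimizer of the absolute error
is the median, which barely moves.

    # Brute-force check: the mean minimizes total squared error,
    # the median minimizes total absolute error.
    import numpy as np

    data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])    # one gross outlier
    grid = np.linspace(0.0, 110.0, 100001)          # candidate estimates

    sq_loss = ((data[:, None] - grid) ** 2).sum(axis=0)
    abs_loss = np.abs(data[:, None] - grid).sum(axis=0)

    print(grid[sq_loss.argmin()], data.mean())       # 22.0 == mean
    print(grid[abs_loss.argmin()], np.median(data))  # ~3.0 == median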

~~~
mtzet
> Gauss quite openly admitted that the choice was borne out of convenience.

That's quite interesting. Do you have a reference for that?

From my understanding, the popularity of the least squares method came (at
least in part) from Gauss' successful prediction of the position of Ceres. Was
this just because people not using least squares were not able to calculate
it?

~~~
riskneural
It's in the original paper in which he derives the normal distribution. Well
worth a read. I last had a copy of it in the fourth basement down in the
university library about fifteen years ago - it might still be there.

------
throw_away_777
There is a Kaggle competition right now that uses mean absolute error, and
this makes the problem substantially harder. For a practical discussion of
techniques used to solve machine learning problems that use MAE, see the
forums in: [https://www.kaggle.com/c/allstate-claims-severity/forums](https://www.kaggle.com/c/allstate-claims-severity/forums)

As touched upon in the article, the objective not being differentiable is a
big deal for modern machine learning methods.

~~~
thanatropism
No it isn't.

Differentiability is important if you want to have a closed-form formula
_and_ derive it in front of undergraduates.

~~~
throw_away_777
This is the difference between practice and theory. In theory differentiable
objectives don't matter; in practice, for medium to large datasets, they make
machine learning a lot faster. Speed is critical, as you need to be able to
iterate quickly. The solution most commonly used on Kaggle is to transform the
target feature and then minimize mean squared error (sketched below), but
there is some systematic uncertainty introduced by this.
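
A minimal sketch of that transform trick, assuming a log transform (a common
choice for skewed, positive targets; the exact transform competitors use
varies):

    # Hedged sketch: approximate a MAE-friendly fit by minimizing squared
    # error on a log-transformed target, then mapping predictions back.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = np.exp(X @ np.array([0.5, -0.2, 0.1, 0.0, 0.3])
               + rng.normal(scale=0.5, size=1000))   # skewed, positive target

    model = LinearRegression().fit(X, np.log1p(y))   # squared error in log space
    pred = np.expm1(model.predict(X))                # back-transform
    print("MAE:", np.abs(pred - y).mean())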

------
gpsx
For minimizing the square of the errors I think the good reason is because,
assuming your data has gaussian probability distribution, minimizing the
square error corresponds to maximizing the likelihood of the measurement, as
you and others have said.

Why do we assume gaussian errors? There is seldom a true gaussian distribution
in the real world, usually because the probability of large error values
doesn't decay that fast. We use it because the math is easy and we can
actually solve the problem under that assumption.
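
Spelling out the correspondence (a standard derivation, not in the comment):
for independent Gaussian errors with fixed variance,

    L(\theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}
                \exp\!\left(-\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2}\right)
    \quad\Longrightarrow\quad
    -\log L(\theta) = \frac{1}{2\sigma^2}\sum_{i=1}^n
                      \bigl(y_i - f_\theta(x_i)\bigr)^2 + \text{const},

so maximizing the likelihood over theta is exactly minimizing the sum of
squared errors.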

~~~
klodolph
That's a summary of the article.

~~~
gpsx
Yes, sort of. But I think he says a lot of unnecessary things without getting
at the root of the issue.

I left out some detail I should have said, like what is so special about a
gaussian that makes the math easy. So I will say it.

A measurement can infer a probability distribution for what the measured
quantity is. A second measurement, on its own, also infers some probability
distribution for what the measured quantity is. If we consider both
measurements together, we get yet another probability distribution for what
the measured quantity is. The magic is that if we had a gaussian distribution
for the measurements, then the distribution for the combined measurements is
also a gaussian. This is not true in general. As long as we have gaussian
distributions we can do all the operations we want and the probability
distributions are gaussian and can be fully described by a center point and a
width. (Forgive me for the liberties I am taking here.) The basic alternative
to exactly solving the problem is to actually try to carry around the
probability distribution functions, which is not practical even with very
powerful computers.
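
The closure property in formulas (the standard results, filling in the detail
alluded to above): combining two independent Gaussian measurements with means
mu_1, mu_2 and widths sigma_1, sigma_2 gives another Gaussian with

    \mu_{12} = \frac{\mu_1/\sigma_1^2 + \mu_2/\sigma_2^2}
                    {1/\sigma_1^2 + 1/\sigma_2^2},
    \qquad
    \frac{1}{\sigma_{12}^2} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2},

a precision-weighted average: the product of two Gaussian densities is, up to
normalization, again a Gaussian, so one never has to carry the full
distribution function around.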

------
tvural
The best explanation is probably that squared error gives you the best fit
when you assume your errors are normally distributed.

Things like the fact that squared error is differentiable are actually
irrelevant - if the best model is not differentiable, you should still use it.

~~~
highd
"if the best model is not differentiable, you should still use it."

I'm not sure I would say that - neural nets are "almost everywhere
differentiable", for example. Without differentiability we're stuck with, for
example, discrete GAs for optimization, and you can throw all your intuition
out the window (not to mention training/learning efficiency).

~~~
gabrielgoh
A few misconceptions in this comment that I should correct:

- There is plenty of existing technology for handling non-differentiable
functions. Functions like the absolute value, the 2-norm, and so on have a
generalization of the gradient (the subgradient) which can be used in lieu of
the gradient (see the sketch after this list).

- The idea that functions which are "almost everywhere differentiable" (i.e.
the non-differentiability lies in a manifold of measure zero) behave pretty
much like smooth ones. This is often not the case, as optima often conspire
to lie exactly on these nonsmooth manifolds.
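
The sketch promised above (illustrative only): subgradient descent on the
absolute-value loss, where sign() picks a valid subgradient at the kink.

    # Subgradient descent on f(x) = |x - 3|. At x = 3 the gradient is
    # undefined, but any value in [-1, 1] is a valid subgradient.
    import numpy as np

    def subgrad(x, target=3.0):
        return np.sign(x - target)   # a subgradient of |x - target|

    x = 10.0
    for t in range(1, 5000):
        x -= (1.0 / t) * subgrad(x)  # diminishing steps, standard here
    print(x)                         # close to 3.0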

~~~
kkylin
And error measures involving sum of absolute values (i.e., L1 norm) are
central to methods like lasso
([https://en.wikipedia.org/wiki/Lasso_(statistics)](https://en.wikipedia.org/wiki/Lasso_\(statistics\)))
and their cousins.

------
graycat
I asked that early in my career.

We want a metric essentially because if we converge or have a good
approximation in the metric then we are close in some important respects.

Squared error, then, gives one such metric.

But for some given data, usually there are several metrics we might use, e.g.,
absolute error (L^1), worst case error (L^infinity), L^p for positive integer
p, etc.

From 50,000 feet up, the reason for using squared error is that we get to have
the Pythagorean theorem and, more generally, get to work in a Hilbert space,
a relatively nice place to be, e.g., we also get to work with angles from
inner products, correlations, and covariances -- we get cosines and a version
of the law of cosines. E.g., we get to do orthogonal projections which give us
minimum squared error.

With Hilbert space, commonly we can write the total error as a sum of
contributions from orthogonal components, that is, decompose the error into
contributions from those components -- nice.

The Hilbert space we get from squared error gives us the nicest version of
Fourier theory, that is, orthogonal representation and decomposition, best
squared error approximation.

We also like Fourier theory with squared error because of how it gives us the
Heisenberg uncertainty principle.

Under meager assumptions, for real valued random variables X and Y, E[Y|X], a
function of X, is the best squared error approximation of Y by a function of
X.
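
To make the projection picture concrete (a small sketch of my own): the
least-squares solution is the orthogonal projection of the data vector onto
the column space of the design matrix, so the residual is orthogonal to every
column.

    # Least squares as orthogonal projection: the residual is orthogonal to
    # the column space of A -- the Pythagorean / Hilbert-space picture above.
    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(50, 3))               # design matrix
    b = rng.normal(size=50)                    # data vector

    x, *_ = np.linalg.lstsq(A, b, rcond=None)  # minimizes ||Ax - b||^2
    residual = b - A @ x
    print(A.T @ residual)                      # ~ zeros: residual is
                                               # orthogonal to the columns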

Squared error gives us variance, and in statistics the sample mean and
variance are _sufficient statistics_ for the Gaussian; that is, for Gaussian
data, we can take the sample mean and sample variance, throw away the rest of
the data, and do just as well.

For more, convergence in squared error can imply convergence almost surely at
least for a subsequence.

Then there is the Hilbert space result, every nonempty, closed, convex subset
has a unique element of minimum norm (from squared error) -- nice.

~~~
srean
Ah, but squared error is not a metric; its square root is one.

Many nice properties of the square loss (in fact un-fucking-believably nice
properties) stem not from the fact that its square root is a metric but from
the fact that it is a Bregman divergence. Another oft used 'divergence' in
this class is KL divergence or cross-entropy.

Bregman introduced this class purely as a machinery to solve convex
optimization problems. His motivation was to generalize the method of
alternating projection to spaces other than a Hilbert space. But it so turned
out that Bregman divergences are intimately connected with the exponential
family class of distributions, also called the Pitman-Koopman-Darmois class
of distributions. It takes some wracking of the brain to come up with a
parametric family that does not belong in this class if one is caught
unprepared; almost all parametric families used in stats (barring a few)
belong to it.

One may again ask why this class is so popular in probability and statistics;
the answer is again convenience. They are almost as easy as Gaussians to work
with, they have well-behaved sufficient statistics, and their stochastic
completion gives you the entire space of 'regular' enough distributions with
finite-dimensional parameterizations.

You mentioned conditional expectation. So one may ask: what are the loss
functions that are minimized by the conditional expectation? Bregman
divergences are that _entire_ class. Of course square loss satisfies it too
(more importantly, the L2 metric on its own does not; it is the act of
squaring it which does this).
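
For reference, the standard definition (not spelled out above): for a
strictly convex, differentiable F, the Bregman divergence is

    D_F(x, y) = F(x) - F(y) - \langle \nabla F(y),\, x - y \rangle .

Taking F(x) = \|x\|_2^2 recovers the squared error \|x - y\|_2^2, and taking
F(x) = \sum_i x_i \log x_i on the simplex gives the KL divergence -- the two
examples mentioned above.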

Very interesting stuff (at least to me)

~~~
neutralid
This is very interesting. Thank you.

What book would you recommend for this discussion?

~~~
graycat
A Hilbert space is a complete inner product space. Two examples are the real
line and the 3-space we live in (at least locally, ignoring general
relativity).

But the biggie points about the Hilbert space definition are (A) the other,
somewhat astounding (vector space), examples and (B) how much can do in
Hilbert space that is close to good old high school plane (2D) and solid (3D)
geometry and close to good, old freshman calculus with limits, convergence,
etc. So, in particular, in Hilbert space we get the Pythagorean theorem of
plane geometry and perpendicular projections as shortest distance from a point
to a plane in solid geometry -- two biggies for pure/applied math. We get the
triangle inequality. With the inner product, we get angles and orthogonality.
And the core data we need for projections are just some inner products.

The most important examples of a Hilbert space are, for positive integer n,
the vector spaces of linear algebra R^n and C^n over the set of real numbers
R and the set of complex numbers C -- e.g., R^n is the set of all n-tuples of
real numbers. Then as in whatever first course you had that talked about
vectors and dot (inner, scalar) products, with those inner products R^n and
C^n are Hilbert spaces.

The part about _complete_ is a generalization of completeness in the real
numbers, that is, the biggie way the real numbers are better than just the
rational numbers. In short, for example, intuitively, in the real numbers, if
a sequence appears to converge, then there really is a real number there for
it to converge to. Of course, that's not true in the rational numbers since,
e.g., can have a sequence of rational numbers that converges to the square
root of 2 but doesn't really converge in the rational numbers because square
of 2 is not a rational number. This stuff about "appears to converge" is
called _Cauchy convergence_ and is a weak definition of convergence. The point
about completeness is that Cauchy convergent sequences really are convergent,
that is, have something to converge to and do converge to it (essentially in
the sense of limits you saw in calculus or high school algebra). If are taking
limits to define or approximate what really want, then also really want
completeness so that what converge to exists. So, that's completeness -- for a
Hilbert space, we insist on that.

Of course, need a background in linear algebra. So, for a first book, get any
of the popular ones. If you wish, concentrate on the more _geometrical_ parts
and do less on the _algebraic_ parts -- e.g., if there is a chapter on group
representation theory, Galois theory, linear algebra over finite fields, where
things go wrong when the field is the rationals, or algebraic coding theory,
then feel free to leave that material for later. Likely should pay attention
to dual spaces, but if wish can go light on adjoint transformations since they
are less interesting when have an inner product. Curiously, can go light on
change of basis for linear transformations, that is, the difference between
vectors and coordinate vectors. Concentrate on dimension, linear independence,
linear transformations, maybe touch on quotient spaces, use Gauss elimination
as a good example of such things, eigenvalues and eigenvectors, orthogonality,
and the polar decomposition (the core of factor analysis, singular value
decomposition, matrix condition number, and more). If there is a little on the
associated numerical analysis, then go ahead -- e.g., learn to accumulate
inner products in double precision -- sure, take 10 minutes to see how to add
iterative improvement to Gauss elimination and matrix inversion. For pseudo-
inverses, cute material, but, especially for your question, likely won't see
it again and can skip it.

Then if have some time, take a fast pass through the classic, Halmos, _Finite
Dimensional Vector Spaces_. It was written in 1942 when Halmos had just gotten
his Ph.D. under J. Doob (e.g., as in _Stochastic Processes_ and, more
recently, _Classical Potential Theory and Its Probabilistic Counterpart_ ) and
showed up at the Princeton Institute of Advanced Study and asked to be an
assistant to John von Neumann, likely the inventor of Hilbert space. Well,
Halmos wrote his book to be a finite dimensional version of linear algebra as
if it were Hilbert space, which is commonly infinite dimensional. So you get a
gentle introduction to Hilbert space. You get good at eigenvalues and
eigenvectors, the polar decomposition, orthogonality, transformations that
preserve distances and angles, etc. You get a lot of geometric intuition.

BTW, at one time, Harvard's Math 55 used Halmos, Baby Rudin (below), and a
book by Spivak as the three main references.

Get a start on probability and statistics. A college junior level course
should be sufficient. Don't take the course too seriously since will redo all
the good parts from a much better foundation soon! Note: Elementary stat
courses commonly get all wound up about probability distributions. Well, they do
exist, are at the core of probability theory, and are very important, both in
theory and conceptually, but, actually, in practice usually they require more
data than you will likely have, especially in dimensions above 1. So, in
practice, you mostly can't actually see, in the actual data, the
_distributions_ you are working with! You should hear about the uniform,
exponential, chi-squared,
and Gaussian and go light on the rest. The Gaussian is profound and won't go
away even in practice, although it is less important in practice than long assumed
in, say, educational statistics.

Then take a good pass through at least the first parts of Rudin, _Principles
of Mathematical Analysis,_ AKA _Baby Rudin_. For the exterior algebra in the
back of the more recent editions, well, likely get that elsewhere, say, now in
English, directly from Cartan. The first parts of Baby Rudin cover metric
spaces well enough. So, in Baby Rudin, get good at working with the limits,
completeness property, compactness, etc. of mathematical _analysis_ , that is,
not algebra, geometry, topology, or foundations, although the metric space
material is the same as part of the part of topology called _point set_
topology.

Then learn more in mathematical analysis, in particular, measure theory.
Measure theory essentially replaces the Riemann integral you learned in
calculus and Baby Rudin. 'Bout time! Net, measure theory is a slightly
different way to use limits to define the integral (areas, volumes, etc.) --
the first, biggie difference is that we do the partitioning on the Y axis instead
of on the X axis. The biggie reason: The resulting integral easily handles the
pathological cases, especially involving limits, that the Riemann integral
struggles with. Don't worry: The integral of, say, x^2 over [0,1] is still the
same, IIRC, 1/3rd, right? But consider the function f: [0,1] --> R where f(x)
= 1 if x is rational and 0 otherwise. Then the Riemann integral of f over
[0,1] does not exist, but the measure theory integral does and gives 0 for the
result.

There are at least two now classic books, Royden, _Real Analysis_ and Rudin,
_Real and Complex Analysis_. But there are more, and likely more can be
written. For Rudin, can f'get about the last half on functions of a complex
variable.

Royden is easier to read. Rudin has the math more succinctly presented. Some
people believe that Rudin is a bit too severe for a first version; but if get
used to how Rudin writes, he's really good.

There in Rudin get good introductions to Banach space (a complete, normed
linear space, that is, assume a little less than for a Hilbert space) with a
few really surprising theorems, Hilbert space with an isomorphic argument that
they are really all the same, a really nice chapter on Fourier theory (Baby
Rudin does Fourier series; R&CA does the Fourier integral), and some nice
applications.

With that background in measure theory, then take your good pass through
probability. So, probability becomes a measure as in measure theory except the
values are always real and in [0,1]. A _random variable_ finally gets a solid,
mathematical definition -- it's just a measurable function ( _measurable_ is
very general; in practice and even in nearly all of theory, essentially every
function is measurable; in the usual contexts, it takes some cleverness to
think of a function that is not measurable). And expectation is just the
measure theory integral (with meager assumptions).

And in probability we get some assumptions we don't see in measure theory --
independence and conditional independence, and these two yield wonders
nothing like anything in plain measure theory.

Go somewhere; look at something; get a number; then that's the value of a
_random variable_. In practice, suppose there are 20 random variables, you
have numerical values for 19 of them, and you can argue that those 19 are not
independent of the 20th and that you have some data on how they are dependent,
then you have a shot at estimating the 20th. Presto, bingo, get as rich as
James Simons, do _machine learning_ , get big houses, Cadillacs, Ferraris, a
yacht, a private jet -- maybe!

Also get to use the Radon-Nikodym theorem (proved in both Royden and Rudin,
and in Rudin with von Neumann's cute proof) for the grown up version of
conditioning (i.e., _Bayesian_ ) and, from there, in stochastic processes,
Markov processes and martingales (astounding results).

Books include L. Breiman, _Probability_ , K. Chung, _A Course in Probability
Theory_ , J. Neveu, _Mathematical Foundations of the Calculus of Probability_.
And there are more. IIRC Breiman and Neveu were both students of M. Loeve at
Berkeley -- sure, can also get Loeve's _Probability_ in two volumes. Of these,
Neveu is my favorite; it's elegant; but for most readers it is too succinct.

Hilbert space again? It turns out, the set of all real valued random variables
X such that E[X^2] is finite is a Hilbert space. Yes, completeness holds; with
some thought, that result seems astounding, like there is no way it could be
true, but it is.

Now just derive grown up versions of most of the main results of elementary
statistics for yourself from what you have learned. E.g., for the Neyman-
Pearson result on most powerful hypothesis testing, just use the Hahn
decomposition from the Radon-Nikodym theorem.

And, with the Radon-Nikodym theorem, get a grown up version of sufficient
statistics, right, based on a classic paper by Halmos and Savage.

Along the way will notice that, the last time I looked, Baby Rudin defined the
Riemann integral on closed intervals of finite length, but right away
probability and statistics want to integrate on the whole real line, the whole
plane, etc., do change of variable manipulations with such integrals, etc.
Well, for the prerequisites, those are in measure theory where the first
version of its integral applies also on the whole real line, the whole plane,
and much more. Measure theory also gives you the clean, powerful versions of
differentiation under the integral sign and interchange of order of
integration.

Ah, why bother to teach the Riemann integral at all? :-)

~~~
neutralid
Thanks very much for your suggestions.

I have a utilitarian understanding of mathematics (vector spaces, SVD,
orthogonality, invariances, etc.) and over time have come to appreciate the
underlying characteristics/relationships, which I recently got a taste of
from T. Wickens, _The Geometry of Multivariate Statistics_.

I look forward to understanding measure theory and related math.

------
shawnz
I am no math expert, but I have always thought about it like this. The squared
error is like weighting the error by the error. This causes one big error to
be more significant than many small errors, which is usually what you want. Am
I on the right track?

~~~
tomp
No, that's exactly why absolute error is better. "Big errors" are called
outliers; they're (relatively) rare, often caused by bad data (measurement
errors, typos, etc.), and substantially influence the outcome of your
calculation. In other words, squared error is _less robust_.

But squared error is easier to compute. So, in practice, what you do is you
remove outliers (e.g. cap the data at +-3sigma) then use squared error.

~~~
amelius
> So, in practice, what you do is you remove outliers (e.g. cap the data at
> +-3sigma) then use squared error.

But if you are, say, fitting a function to the data, you can't tell beforehand
which data points are the outliers. So in that case perhaps you need an
iterative approach to removing them (?)
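
Something like the following is a common heuristic (a hedged sketch of the
iterative idea; robust losses such as Huber's, or RANSAC, are the more
principled alternatives):

    # Iteratively fit by least squares, drop points whose residual exceeds
    # k standard deviations, and refit until the inlier set stabilizes.
    import numpy as np

    def trimmed_lstsq(A, b, n_iter=10, k=3.0):
        mask = np.ones(len(b), dtype=bool)   # start with all points as inliers
        for _ in range(n_iter):
            x, *_ = np.linalg.lstsq(A[mask], b[mask], rcond=None)
            r = b - A @ x                    # residuals for all points
            new_mask = np.abs(r) < k * r[mask].std()
            if (new_mask == mask).all():
                break                        # inlier set stabilized
            mask = new_mask
        return x, mask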

------
kazinator
Squared error represents the underlying belief that errors in various
dimensions, or errors in independent samples, are statistically independent.
So they add together like orthogonal vectors, forming a vector whose length
is the square root of the sum of the squares. Minimizing the squared error is
a way of minimizing that square root without the superfluous operation of
calculating it.

------
thomasahle
It's fine to list some reasons for using squared error, but you really can't
decide on the error function without referring to a problem you're trying to
solve.

Just look at the success of compressed sensing, based on taking the absolute
value error seriously.

~~~
Sean1708
Which is basically the entire message of the last section.

------
dnautics
"inner products/gaussians" \- the absolute value (and also cuberoot of
absolute cubes, fourth root of fourth powers) also define inner products.
Likewise, there are "gaussian-like formulas" which take these powers instead
of squared.

However: if you look at the shape of the square root of the sum of squares,
it's a circle, so you can rotate it. If you take the absolute value, it's a
square, so it cannot be rotated; the cube root of cubes, fourth root of
fourths, etc. look like rounded-edge squares, and those cannot be rotated
either, so if you have a change of vector basis, you're out of luck.

With the gaussian forms of other powers, none of them have the central limit
property.

~~~
grodeni
What kind of inner products are defined by the absolute value, the cube root
of absolute cubes, or the fourth root of fourth powers? I have never heard of
that and would be glad to learn about it.

~~~
ska
You may find it interesting to read about Lp norms and their relationship to
inner products on vector spaces. I think the OP is mixing up norm and inner
product terminology. This happens often because you can derive a norm from
any inner product, but the other direction may not exist.

If you plot on the plane the distance = 1 line, then L_1 gives you a diamond,
L_2 a circle, L_inf a square. [More precisely, the unit circle under the
related metric (distance function) looks like those Euclidean shapes.]
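
A quick way to draw those shapes (a throwaway sketch):

    # Plot the unit "circles" {x : ||x||_p = 1} for p = 1, 2, inf:
    # a diamond, a circle, and a square respectively.
    import numpy as np
    import matplotlib.pyplot as plt

    theta = np.linspace(0, 2 * np.pi, 400)
    d = np.stack([np.cos(theta), np.sin(theta)])           # directions
    for p, label in [(1, "L1"), (2, "L2"), (np.inf, "Linf")]:
        norms = np.linalg.norm(d, ord=p, axis=0)
        plt.plot(d[0] / norms, d[1] / norms, label=label)  # rescale to unit norm
    plt.gca().set_aspect("equal"); plt.legend(); plt.show()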

------
j7ake
The Bayesian formulation for the likelihood function would make this squared
error explicitly clear.

~~~
stared
For Gaussian uncertainty. Which still makes it a much more natural assumption
than any other I know.

------
TeMPOraL
My explanation for squared error in linear approximation always was: because
it minimizes the thickness of the line that passes through all the data
points.

(Per the old math joke - you can make a line passing through any three points
on a plane if you make it thick enough.)

------
theophrastus
Or why use variances when there are standard deviations (the square root of
the variance), which have more easily interpreted units? One commonly cited
reason is that one can sum variances from different factors, which one cannot
do with standard deviations. There are other properties of variances which
make them more suitable for continued calculations[1]. This is why, for
instance, variances are often utilized in automated optimization packages.

[1]
[https://en.wikipedia.org/wiki/Variance#Properties](https://en.wikipedia.org/wiki/Variance#Properties)
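
The summing property in symbols (standard, for uncorrelated variables):

    \operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y),
    \qquad
    \sigma_{X+Y} = \sqrt{\sigma_X^2 + \sigma_Y^2} \ne \sigma_X + \sigma_Y
    \ \text{in general}.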

------
bagrow
Interesting discussion. Not sure about the breakdown between ridge regression
and LASSO though. The difference is not in the error term but in the
regularization term.

------
thisrod
Squared error because the uncertainties in independent, normally distributed
random variables add in quadrature. I expect that this could be proved
geometrically using Pythagoras's theorem, so in that sense the comments about
orthogonal axes are vaguely on the right track.

Normally distributed variables because of the central limit theorem.

It isn't all that complicated.

------
jostmey
Why not KL-Divergence, which measures the error between a target distribution
and the current distribution? From the perspective of Information Theory, it
is the best error measurement.

Oh, and let's not forget that for a lot of problems minimizing the KL-
divergence is the exact same operation as maximizing the likelihood function.
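
The equivalence, for the record (a standard identity): with data distribution
p and model q_theta,

    \mathrm{KL}(p \,\|\, q_\theta)
      = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q_\theta(x)}\right]
      = -\,\mathbb{E}_{x \sim p}[\log q_\theta(x)] - H(p),

and since the entropy H(p) does not depend on theta, minimizing the KL
divergence over theta is the same as maximizing the expected log-likelihood.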

~~~
enthdegree
KL divergence has no nice theoretical properties other than 'it is the answer
to these questions'.

It is also extremely poorly behaved numerically and in convergence.

~~~
srean
I am sorry but I have to call bullshit on this.

To give just a taste of the nice properties of KL: if you are using a
one-layer NN with the sigmoid function as the transform, using square loss
gives you an explosion of local minima. OTOH, using KL in its place would
have given you none. Numerical accuracy is pretty much a non-issue; people
have known how to handle KL numerically for the last 40 or so years.

BTW, using KL on Gaussians of equal variance gives you square loss, apparently
the loss you prefer.
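
A one-neuron illustration of the local-minima/flat-region point (my sketch,
assuming a single sigmoid unit): under squared loss the gradient carries a
factor s(1-s) that vanishes when the unit saturates; under the
cross-entropy/KL loss that factor cancels.

    # Gradients w.r.t. the pre-activation z of a sigmoid unit s = sigmoid(z),
    # target y:
    #   squared loss   (s - y)^2 / 2             -> grad = (s - y) * s * (1 - s)
    #   cross-entropy  -y log s - (1-y) log(1-s) -> grad = s - y
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    y, z = 1.0, -8.0    # a badly wrong, saturated prediction
    s = sigmoid(z)
    print("squared-loss grad:", (s - y) * s * (1 - s))  # ~ -3e-4, almost no signal
    print("cross-entropy grad:", s - y)                 # ~ -1.0, strong signal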

~~~
bwwvbiwbw
if your problem is ok with the asymmetry of KLD

------
highd
Another pro tip - the absolute error magnitude (the L1 norm) is the convex
envelope of the non-zero entry count for vectors (the l_0 norm in some
circles). So in the convex minimization context (and for most other smooth
loss terms in general) you end up with solutions with more zero entries and a
few, possibly large, non-zero entries.
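
A concrete one-dimensional instance of why L1 produces exact zeros (standard,
not from the comment): the minimizer of 0.5*(x - a)^2 + lam*|x| is the
soft-thresholding operator, which maps the whole interval |a| <= lam to
exactly zero.

    # Soft thresholding: argmin_x 0.5*(x - a)^2 + lam*|x|.
    import numpy as np

    def soft_threshold(a, lam):
        return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

    print(soft_threshold(np.array([-2.0, -0.5, 0.3, 1.5]), lam=1.0))
    # [-1. -0.  0.   0.5]  -- small inputs are zeroed out exactly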

------
adamzerner
Also see
[http://www.leeds.ac.uk/educol/documents/00003759.htm](http://www.leeds.ac.uk/educol/documents/00003759.htm).

------
redcalx
Somewhat related; here's my attempt at explaining Cross Entropy:

[http://heliosphan.org/cross-entropy.html](http://heliosphan.org/cross-entropy.html)

------
heisenbit
Square often corresponds to power in systems.

~~~
heisenbit
I noticed this got voted up and down more than usual. Maybe a little
elaboration:

Square often corresponds to power/energy in systems AND energy (integral of
power) is preserved. That relationship between physics and math allows a lot
of useful transformations.

------
fiatjaf
Why geometric mean?, I would ask.

~~~
thomasahle
"Why addition?", I would ask.

Different problems, different tools. You can't ask "why geometric mean?"
without referring to a specific problem you're trying to solve.

~~~
fiatjaf
What is a problem the geometric mean solves? That was my question the entire time.

When people ask "why machine learning?" the answers are "machine learning can
do these things blablabla", not "you must specify the problem you're trying to
solve".

~~~
bzbarsky
> What is a problem the geometric mean solves?

It gives you a way to average together two things that have units that have
nothing to do with each other and then compare two such averages and have the
comparison make sense, as long as your units were consistent.

As a silly example, say you want to average 1kg and 1m and compare that
average to the average 2kg and 0.5m. With arithmetic mean, ignoring the fact
that it's nonsense to add different units, you could get numbers like (1+1)/2
= 1 and (2+0.5)/2 = 1.25 if you use kg and m, but numbers like (1 + 100)/2 =
50.5 and (2 + 50)/2 = 26 if you use kg and cm. Notice that which one is bigger
depends on your choice of units. On the other hand, the geometric mean of the
two examples is always the same as long as you use consistent units: 1 for
both if you use kg and m, and 10 for both if you use kg and cm.
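
A two-line check of that unit-invariance claim (trivial, but it makes the
point):

    # Rescaling one unit (m -> cm) multiplies every geometric mean by the
    # same factor, so comparisons between geometric means are unchanged.
    from math import sqrt

    print(sqrt(1 * 1.0), sqrt(2 * 0.5))      # kg, m : 1.0 vs 1.0
    print(sqrt(1 * 100.0), sqrt(2 * 50.0))   # kg, cm: 10.0 vs 10.0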

In practice, this sort of operation is only useful if you have multiple
measures of some sort along different axes (think performance on 3 different
performance tests) and you're being forced to produce a single average number.
Again, a fairly silly thing to do, but _very_ common: just about every single
performance benchmark does this.

~~~
fiatjaf
Thank you very much. Great examples.

------
jayajay
Because linear algebra is a beautiful framework to think in.

------
dschiptsov
To make it positive and to amplify it (as a side-effect).

BTW, "error" is a misleading term - it communicates some fault, at least in
the common sense. Distance would be much better term.

So, "squared distance" makes much more sense, because negative distance is
nonsense.

~~~
tonyedgecombe
Well it will only amplify values > 1.

~~~
esrauch
That's not correct. Even though the magnitude of each value in isolation
shrinks, the relative magnitudes are still amplified, which is what matters.

Consider the values 1/2 and 1/4: in the original space one is double the
other, but in the squared space they become 1/4 and 1/16, so the ratio is 4x.
Also, relevantly, if you compare e.g. 0.9 and 1, the gap between them is
amplified after squaring.

~~~
tonyedgecombe
Those values aren't compared individually; they are summed to calculate the
deviation, and the result of that sum will be reduced if the values are < 1.

------
bitL
An honest question - do we even need statistics when we have machine learning?
Statistics to me appears to be a hack/aggregation for data we couldn't process
at once in the past; these days ML + Big Data can achieve that, and instead of
statistics we can do computational inference. To me this looks like looking
back to the "old ways" for a reference point instead of looking forward to the
unknown but more exciting future.

~~~
klodolph
Machine learning is often considered a statistical technique. The main
difference seems to be that in traditional statistics, people derive practice
from theory, whereas in ML people will try out techniques and figure out the
theory later. That's really just a cultural difference. The techniques for
analyzing ML models are all statistical to begin with.

Statistics, as a field, already used general-purpose optimization algorithms
before modern ML techniques came about, so in that sense, ML just fits into an
existing position in the statistical toolbox (like replacing a chisel with a
3D printer). In the other direction, statistical techniques like cross-
validation are necessary for you to get your ML correct.

~~~
bitL
There is much more to ML than just statistics. I was basically asking why the
"statistics filter" is so often on in ML. Neural networks don't seem like a
statistical technique, even if somebody uses them for regression. Yes, there
is an overlap, but no, ML != statistics. As you mentioned, non-linear
optimization is used in statistics at a meta-level; however, nobody claims
statistics is operations research, or vice versa.

~~~
srean
_In what way is a neural network not a statistical technique?_

Very curious to hear about your point of view. Statistics is not linear
regression and ANOVA, or whatever catalogue of techniques is in a freshman
book, nor even the library of techniques available in R.

Statistics is the application of probability (or more broadly, math) to data.

That said, statisticians did miss the neural net wave because of their
flippant reaction to it. They said, "oh well, yet another non-parametric
function approximator; we worked out the asymptotics 30 years ago".

To paraphrase someone wise: asymptotically we are all dead. Not enough heed
was paid to that. Among their other gaps was expertise in algorithms and
optimization. Mind you, optimization has been at the core of their craft from
their very genesis; it's just that they did not feel it important enough to
ride the cutting edge of research on optimization. Note: you cannot do maximum
likelihood without solving an optimization problem. Gauss was doing it a
couple of hundred years ago for statistics. If I may go on a bit further with
my rant, they got a bit carried away with their fetish over bias and
asymptotic normality. They missed the wave, sure.

But all said and done, by any accepted definition of statistics, NN is very
much also statistics.

~~~
quicknir
Statisticians are either frequentists or Bayesians. Fundamental to both of
these approaches is the involvement of probability. A technique that does not
have a probabilistic interpretation is not a statistical technique. This is
also largely related to the difference in goals between statistics and machine
learning: statistics is primarily about inference, and machine learning is
primarily about prediction. You can predict without necessarily making any
meaningful probabilistic statements about data. There's not really much to
infer without saying something probabilistic.

~~~
srean
Neural nets have absolutely been associated with probabilistic
interpretations. Good resources would be David MacKay and Radford Neal; both
their approaches are Bayesian. A far more trivial way to associate a
probabilistic interpretation is to claim that the neural net is the
conditional expectation.

And who says frequentist and Bayesian are the only two views? Where would you
shelve prequential statistics then? Or nonparametric regression?

Prediction has definitely been a part of statistics but often, as you rightly
claim, as a byproduct. And yes, I would characterize the focus of stats and ML
exactly as you did.

~~~
quicknir
The point is not whether somebody has ever tried to associate neural nets with
probability; sorry if my previous comment made it seem that way. The point is
that neural nets are not fundamentally tied to it. You can try to tie them to
probability, but you certainly don't have to; mostly they aren't, they aren't
usually taught that way, and the big open problems in the field don't involve
it.

Statistics works the other way. You basically always start with some kind of
probabilistic model. And then, if you even bother with prediction, you work
towards prediction from the probabilistic model. With stats you don't need to
interpret or add probability after the fact, it's already there.

Obviously stats and ML are enormous fields, with quite some overlap. And
people tend to go after low-hanging fruit; if many people who studied neural
nets have formal probability backgrounds, it simply makes sense that _someone_
will write a paper on it. And I'm generalizing here (the same goes for the
frequentist & Bayesian comment). But there absolutely is justification for
saying "neural nets are not really a statistical technique".

~~~
srean
I agree with a lot of what you are saying, but rest assured it's colored by
the personal rationalizations one has made while learning these things, and
not by objective facts.

NNs are in no way less fundamentally associated with probability than, say,
linear regression. In both cases you can start with a probability model and
derive the final form as a logical consequence, or you can start with the
final form and slap a consistent probabilistic model on top.

The main thing is that any test you come up with that carves NNs away from
stats is going to carve away a whole lot of other things that people have no
trouble calling stats. This controversy essentially stems from the need to
claim a technique for one's own tribe and not concede it to another. I am
making no moral claim here, just an observation, and neither is it a novel
one.

BTW, I am firmly from the ML camp and not stats. I enjoy poking a little bit
of fun at statisticians and try to be gracious about their criticism of ML.
That said, I feel no need to make a groundless claim to a technique when they
have no less right to it in terms of objective claims. Rights steeped in
culture, fiat, and history are a different matter.

