
Statistics vs. Machine Learning - Jagat
http://brenocon.com/blog/2008/12/statistics-vs-machine-learning-fight/
======
zissou
From one of the comments in the article:

"Economists, of course, broke away hard from mainline stats a while ago,
calling it “econometrics” and reinventing names for EVERYTHING, plus throwing
in a bizarre obsession with the method of moments. In terms of intellectual
arrogance and needless renaming/duplication, economists are much worse than
computer scientists and engineers."

This guy couldn't be more right about [most] [academic] economists. I laughed
at the renaming-of-terms part because it is so true (the classic example is
RSS/SSR/ESS/SSE). Taking a course in the stats department alongside an
econometrics course was bound to confuse any student. But just ask an
economist; they'll tell you which one is right. :)

~~~
pseut
> I laughed at the renaming of terms part because it is so true (classic
> example is RSS/SSR/ESS/SSE). Taking a course in the stats department
> alongside an econometrics course was bound to confuse any student.

Come on, this is silly. Calling things RSS vs. SSR might confuse an undergrad,
but I've taken both classes (stats intro to linear regression + econ intro to
linear regression) at the same time and saw barely a difference in the
material. The stats class had a greater emphasis on finite-sample properties
under normality and the econ class on asymptotics. Not a huge change, and the
differences were complementary.

We both know that the "bizarre obsession with [generalized] method of moments"
is because moments come out of models with rational agents, not distributions.

~~~
zissou
I've taken PhD Econometrics where we touched zero data and the course was
100% theoretical over its 9 months, so I don't think we are talking about the
same thing. I don't like spewing out economically technical nonsense on HN,
so I wasn't going to go into the method of moments part of that comment as it
applies to the current state of econometric theory.

In econometric theory, the method of moments comes up in the biggest way in
the form of the generalized method of moments (or GMM). The idea behind GMM
is one that is in competition with maximum likelihood, the point being that
GMM doesn't force you to make arbitrary assumptions about the true
probability distribution of the data purely for the sake of implementing the
model. This is obviously attractive, because then the results of our model
won't be jeopardized just because one of our assumptions was false. In other
words, GMM provides a way to estimate the parameters of a model without
making assumptions about the population.

Oh, and the topic of rational agents is not relevant here. This is a purely
statistical/philosophical argument.

But there goes the economist in me again... I'm sorry.

~~~
pseut
I've taken just as many stats classes that don't touch data; I don't think
either of us wants to argue the pedagogy of teaching on HN (fuck, I didn't
even want to spell "pedagogy" and it's likely I didn't), but I don't think
that there's a huge difference in how econ and stats departments teach the
same material or use terminology. (There are big differences in the material
selected, obviously.)

This statement can't be true: "GMM provides a way to estimate the parameters
of a model without making assumptions about the population." You need
assumptions, just different assumptions. Frequently those assumptions involve
agent rationality, but not always (after all, MLE is a special case of GMM).
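
For what it's worth, the relationship the thread keeps circling can be stated
in one line (a sketch in standard notation, with f the assumed density):

```latex
\hat\theta_{\mathrm{MLE}}
  = \arg\max_\theta \frac{1}{n}\sum_{i=1}^n \log f(x_i;\theta)
\quad\Longrightarrow\quad
\frac{1}{n}\sum_{i=1}^n
  \underbrace{\frac{\partial}{\partial\theta}\log f(x_i;\hat\theta)}_{\text{score}}
  = 0
```

The first-order condition is the sample analogue of the population moment
condition \(\mathbb{E}[\partial_\theta \log f(x;\theta_0)] = 0\), so taking
the score as the moment function makes GMM reproduce MLE exactly.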

~~~
zissou
"MLE is a special case of GMM"

You have it backwards. MLE is actually a special case of GMM.

I'm done here.

~~~
dbaupp
You seem to have just repeated the quote? Presumably you mean "GMM is a
special case of MLE"?

~~~
zissou
Mistake is on my side, I apologize. I read what the poster said backwards
myself. :) The repeated quote is a true statement: MLE is a special case of
GMM.

------
josemariaruiz
Philipp K. Janert, author of «Data Analysis with Open Source Tools», spends a
few pages explaining how he perceives this «difference».

From his point of view, Machine Learning is a fake science: fragile,
secretive, problem-specific techniques for big problems that depend on secret
parameters that have never been published. These parameters will be supplied
to you, for a price, by the inventor-researchers' companies.

On the other hand, statistics is real science, where everything is published
and studied by a whole community. It is a science that has accumulated
hundreds of years of experience and that offers all its knowledge in any
university. The methods offered by statistics are of broad application,
robust, and open.

And I think he has a point in this reasoning.

PS: Statistics works; ask anyone in the hard sciences. Its contributions have
been essential to science over the last few centuries. Machine Learning was
bashed (like old AI) because it never offered real solutions or helped us
advance our understanding of anything. Machine Learning is a tool, not a
science, one that tries to cope with the limitations of our knowledge. That
makes it a very convenient tool for engineers and problem solvers, as
numerical methods are, but it also means its results share the problems of
numerical methods.

~~~
rm999
I disagree with a few things in your comment. Your criticism of machine
learning feels off-base, and is too specific to describe such a wide field.
Who sells parameters? What does that even have to do with whether it is a
science? I can think of other fields where every detail of an experiment
isn't spelled out in every paper.

It's hard to separate machine learning and statistics because so much of
machine learning derives directly from statistics. Motivation is probably the
most important distinction; machine learning is applied statistics. I'd say
it's a mix of science (the scientific method plays a big part in model
building for example), engineering, and math. Statistics is first and foremost
a branch of mathematics, not science; the scientific method does not play a
role in the vast majority of the field.

~~~
alook
> machine learning is applied statistics

It really is hard to separate ML and statistics - any competent practitioner
of ML appreciates the statistical achievements that made machine learning
methods possible. And statisticians must understand that, to help automate
decision-making systems, using learning methods/boosting is a viable option.

The debate around nomenclature (ML/stats/AI) seems limited to the academic
community. Most data scientists I've met tend to accumulate a repertoire of
tools from different fields, rather than side with either the machine
learning or statistics communities.

------
jlogsdon
The site isn't loading for me, here's the text-only cache:
[http://webcache.googleusercontent.com/search?q=cache:http://...](http://webcache.googleusercontent.com/search?q=cache:http://brenocon.com/blog/2008/12/statistics-vs-machine-learning-fight/&hl=en&tbo=d&strip=1)

------
mbq
When you remove the steering, both groups simply do statistical modelling.
From my observations, the main difference is that stats people make simpler
models and test well, but fall victim to the mathematical tendency to build
on overly optimistic assumptions; CS people stay closer to the data, make
more complex models, and do poor verification (but thus report better
accuracies). Regardless of actual performance, stats has a traditional
monopoly in biomedicine (minus the neuro* stuff) and CS in engineering.

------
pseut
Maybe a moderator could change the link title to highlight that it's from
2008. Not that the date necessarily makes it irrelevant, but it's not really
"news".

------
mjw
I've taken courses from people in a few of these overlapping but historically-
different camps recently, e.g.:

- Frequentist statisticians
- Bayesian statisticians
- Old-school AI researchers
- Statistical learning theorists
- Bayesian machine learners
- Engineers working on optimisation with noisy data
- Information retrieval folks
- ...

I'm really keen to see these guys starting to talk to each other and unify
more of what they're doing around statistics (Bayesian and frequentist,
parametric and non-parametric, generative and discriminative) as the common
language and framework. Hopefully expanding the horizons of statistics a bit
as a field in the process.

I imagine it'll take a while longer though, and some of the differences in
terminology and talking-past-each-other can be a bit maddening in the
meantime for those learning. What would be really nice would be a course
offering a broad, well-rounded introduction to the various philosophies of
modelling data: their histories, interactions and overlaps, differing goals,
strengths and weaknesses. It can be hard to get a sense of this when most
introductory courses are taught by someone who's implicitly from one camp or
another, even if (as is usually the case) they're not overly ideological
about it.

One criticism of (some, not all!) statisticians is that they can seem to have
a strange and rather limiting fear of computation which leads them to give
undue preference to computationally simple models, even when the dataset isn't
big enough to make compute time an issue.

I can see an argument for pedagogical reasons why one would rather not teach
(or learn!) the details of fiddly optimisation algorithms -- but having a
basic literacy in optimisation can free you up to treat the algorithm for
optimising your objective as a black box to some extent. Feeling less "guilty"
about this (whether the compute time itself, or remembering all the details of
the optimisation algorithm) can be quite freeing in allowing you to think
about the modelling itself in a more powerful, modular framework. This seems
to be where machine learning has gotten a big advantage.

On the other hand, machine learners can be frustrating in the way they
reinvent statistical terminology and methods, and in a gratuitous tendency to
skip the modelling stage and go straight to inventing different objective
functions to optimise, or even straight to the algorithms. This leads to
opaque black-box methods which (due to the lack of probabilistic motivation)
make it harder to reason about uncertainty in a principled way.

With the increasing popularity of Bayesian machine learning I think this is
less of an issue though; it's bridging the gap between the two camps. One can
also find a lot of more modern ML research using Statistical Learning Theory
as a nice framework for principled frequentist analysis of their models.

------
michaelochurch
Machine learning is more accurately described (in my view) as an
interdisciplinary region that involves computation _and_ statistics both. It's
when you have to use statistics to push the bounds of computational work, or
CS to tease out statistical relationships that require so much data that the
size of the data itself is part of the problem.

When I think of "statistics", I think of problems where there are a few well-
studied parametric approaches, in large part because there aren't enough data
(in most cases) to do anything extremely complicated. If your data set is
small and you need to build a generally adequate model, you can often use
linear regression to get something good enough, and that may be the best you
can do, because model simplicity is often valuable in its own right (low risk
of overfitting).

Many of the more non-parametric machine learning algorithms (e.g. neural nets)
are computationally intense and often a bitch to debug. They work best in
situations where (a) you have a lot of data, but (b) you have no idea what the
structure of the relationship should be.

Parametric models perform extremely well when the structure is known. You have
inputs X1, X2, X3 and Y, and you know that a linear model will work well. You
fit one, and it captures 65% of the variance. Good. That may be enough.

What do you use, however, if 65% isn't good enough, or if there's a special
case that becomes economically important? You can look at the residuals. They
may be random noise. If they're truly random, then the linear model is the
absolute best you can get. That may not be the case, however. There might be
a region of the X2-X3 space where atypically high Y-values occur for a
structural reason. These atypical points may be connected to an X4 that would
otherwise be discarded (because it had no general correlation with Y).
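
A toy version of that residual check (the variables, coefficients, and data
below are all simulated for illustration; they're not from the comment):

```python
# Simulate a process that is mostly linear in x1 but has a small
# contribution from x4 that the first model ignores.
import random

random.seed(1)
n = 5000
x1 = [random.gauss(0, 1) for _ in range(n)]
x4 = [random.gauss(0, 1) for _ in range(n)]  # not given to the model
y = [2.0 * x1[i] + 0.5 * x4[i] + random.gauss(0, 1) for i in range(n)]

# Ordinary least squares for y ~ a + b*x1 (closed form, one predictor).
mx, my = sum(x1) / n, sum(y) / n
b = sum((x1[i] - mx) * (y[i] - my) for i in range(n)) / \
    sum((v - mx) ** 2 for v in x1)
a = my - b * mx
resid = [y[i] - (a + b * x1[i]) for i in range(n)]

def corr(u, v):
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((u[i] - mu) * (v[i] - mv) for i in range(n)) / n
    su = (sum((w - mu) ** 2 for w in u) / n) ** 0.5
    sv = (sum((w - mv) ** 2 for w in v) / n) ** 0.5
    return cov / (su * sv)

# Residuals are orthogonal to x1 by construction, but if they correlate
# with some other variable, the "noise" has structure worth modelling.
print(corr(resid, x1), corr(resid, x4))  # near zero vs. clearly nonzero
```

The second correlation is exactly the signal that an X4-aware (or more
flexible) model would pick up.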

I think that machine learning approaches start to be worth their additional
complexity in these cases where (a) there are important relationships yet to
be discovered-- no one knows about them!-- and (b) those are, or might be,
economically critical.

~~~
visarga
We can prevent overfitting by adding a regularization term to the expression
we are minimizing, such as the sum of the squares of all coefficients scaled
by a factor.
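
For a single coefficient the penalised objective even has a closed form,
which makes the shrinkage easy to see (the data and the penalty weight below
are made up for illustration):

```python
# Ridge sketch for one coefficient, no intercept:
#   minimize sum((y - b*x)^2) + lam * b^2
# which gives b = sum(x*y) / (sum(x^2) + lam).
import random

random.seed(2)
x = [random.gauss(0, 1) for _ in range(200)]
y = [1.5 * xi + random.gauss(0, 1) for xi in x]

sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)

b_ols = sxy / sxx              # lam = 0: plain least squares
b_ridge = sxy / (sxx + 50.0)   # the penalty shrinks b toward zero

print(b_ols, b_ridge)
```

The same shrinkage happens coefficient-by-coefficient in higher dimensions,
which is what keeps an over-flexible model from chasing the noise.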

Also, if you do testing (using a separate test dataset to determine how well
your model works on unseen inputs) you can determine if you are overfitting
(learning even the noise present in the data - which is detrimental) or
underfitting (not learning enough from the data - which is detrimental, too).
In the end it's about finding a sweet spot, and many times the features
number in the hundreds or thousands, so you can't analyze them by hand.
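
A toy version of that train/test check, pitting a model that memorizes
(1-nearest-neighbour) against a plain linear fit (all data simulated; the
model choices here are mine, not the commenter's):

```python
# Overfitting only shows up on data the model hasn't seen.
import random

random.seed(3)
xs = [random.uniform(-3, 3) for _ in range(400)]
ys = [2.0 * x + random.gauss(0, 1) for x in xs]

train_x, test_x = xs[:200], xs[200:]
train_y, test_y = ys[:200], ys[200:]

# Linear fit y ~ b*x (no intercept, closed form).
b = sum(x * y for x, y in zip(train_x, train_y)) / \
    sum(x * x for x in train_x)

def mse(pred, actual):
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)

def one_nn(x):  # predict the y of the closest training point
    return min(zip(train_x, train_y), key=lambda p: abs(p[0] - x))[1]

lin_test = mse([b * x for x in test_x], test_y)
nn_train = mse([one_nn(x) for x in train_x], train_y)
nn_test = mse([one_nn(x) for x in test_x], test_y)

print(nn_train)            # 0.0: the memorizer learned the noise too
print(nn_test > lin_test)  # but it loses on held-out data
```

The zero training error looks great until the test set exposes it, which is
exactly the over/underfitting diagnosis the comment describes.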

Automatic feature selection and disentangling is an amazing new advancement
that came 7 years ago with the deep learning papers. Watch lectures on
Restricted Boltzmann Machines by Geoffrey Hinton and Andrew Ng for this. It's
what allowed Google to achieve the best speech recognition and image
recognition results ever recorded.

