
The main trick in machine learning - tlarkworthy
http://edinburghhacklab.com/2013/12/the-main-trick-in-machine-learning/
======
tel
A professor of mine stated it very well. If you can imagine that there is a
_true_ model somewhere out in an infinitely large model space, then ML is just
the search for that model.

In order to make it tractable, you pick a finite model space, train it on
finite data, and use a finite algorithm to find the best choice inside of that
space. That means you can fail in three ways---you can over-constrain your
model space so that the true model cannot be found, you can underpower your
search so that you have less ability to discern the best model in your
chosen model space, and you can terminate your search early and fail to reach
that point entirely.

Almost all error in ML can be seen nicely through this lens. In particular,
those who forget to optimize validation accuracy are often making their model
space so large that they have too little data to power the search within
it---which is overfitting.

Devroye, Gyorfi, and Lugosi ([http://www.amazon.com/Probabilistic-Recognition-
Stochastic-M...](http://www.amazon.com/Probabilistic-Recognition-Stochastic-
Modelling-Probability/dp/0387946187)) have a really great picture of this in
their book.
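One standard way to formalize the three failure modes is Bottou and Bousquet's
decomposition of excess risk ("The Tradeoffs of Large Scale Learning"); stated
loosely:

```latex
% f^*: the true model; f_H: the best model in the chosen class H;
% f_n: the empirical minimizer on n samples; \tilde{f}: the model the
% search algorithm actually returns. R denotes expected risk.
\[
  R(\tilde{f}) - R(f^*)
  = \underbrace{R(f_{\mathcal{H}}) - R(f^*)}_{\text{approximation: model space too small}}
  + \underbrace{R(f_n) - R(f_{\mathcal{H}})}_{\text{estimation: too little data}}
  + \underbrace{R(\tilde{f}) - R(f_n)}_{\text{optimization: search stopped early}}
\]
```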

~~~
joe_the_user
_In order to make it tractable, you pick a finite model space, train it on
finite data, and use a finite algorithm to find the best choice inside of that
space. That means you can fail in three ways---you can over-constrain your
model space so that the true model cannot be found, you can underpower your
search so that you have less ability to discern the best model in your
chosen model space, and you can terminate your search early and fail to reach
that point entirely._

It seems like you can "mis-power" your model also.

For example, the Ptolemaic system could approximate the movement of the
planets to any degree of accuracy if you added enough "wheels within wheels",
but since these were "the wrong wheels", the number of wheels needed grew
without bound to achieve a reasonable approximation over time.

~~~
rcthompson
> For example, the Ptolemaic system could approximate the movement of the
> planets to any degree of accuracy if you added enough "wheels within
> wheels", but since these were "the wrong wheels", the number of wheels
> needed grew without bound to achieve a reasonable approximation over time.

That would be an example of over-constraining your model (i.e. imposing the
arbitrary constraint of a stationary Earth).

~~~
joe_the_user
I don't think this is a useful way to phrase the situation.

A system of Ptolemaic circles _can_ approximate the paths taken by any system.
So the system really isn't absolutely constrained to follow or not follow any
given path.

You could claim you have constrained your model not to be some other, better
model, but that, again, seems like a poor way to phrase things, since a more
accurate model is also constrained not to be a poor model.

To be specific, the Newtonian/Keplerian system has the constraint of the sun
being stationary just as much as the Ptolemaic system has the constraint of
the earth being stationary.

Edit: As Eru points out, the Ptolemaic system basically uses the Fourier
transform to represent paths. Thus the approximation is actually completely
unconstrained in the space of paths; that is, it _can_ approximate anything.
But by that token, the fact that it can approximate a given path explains
nothing, and the choices that are simple in this system are not necessarily
the best choices for the case at hand, estimating planetary motion.

See -
[http://en.wikipedia.org/wiki/Deferent_and_epicycle](http://en.wikipedia.org/wiki/Deferent_and_epicycle)
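A quick numerical illustration of the Fourier point (a Python sketch; the
square-ish path and the number of "wheels" are arbitrary choices):

```python
# Each DFT coefficient is one "wheel": a circle of fixed radius rotating at
# a fixed integer speed. Summing enough wheels reproduces any closed path.
import numpy as np

t = np.linspace(0, 2 * np.pi, 256, endpoint=False)
path = np.sign(np.cos(t)) + 1j * np.sign(np.sin(t))  # a square-ish closed path

coeffs = np.fft.fft(path) / len(path)               # wheel radii and phases
freqs = np.fft.fftfreq(len(path), d=1 / len(path))  # wheel speeds (integers)

def reconstruct(n_wheels):
    # Keep only the n largest wheels and sum their rotations.
    top = np.argsort(np.abs(coeffs))[-n_wheels:]
    return sum(coeffs[i] * np.exp(1j * freqs[i] * t) for i in top)

for n in (4, 16, 64):
    err = np.max(np.abs(path - reconstruct(n)))
    print(f"{n:3d} wheels -> max error {err:.3f}")  # error shrinks with more wheels
```

The representation can fit anything, which is exactly why the fit by itself
explains nothing.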

~~~
rcthompson
That's a good point, but after re-reading tel's original comment, I think my
statement is still correct. Notice that tel's statement was that "you can
over-constrain your model space so that the true model cannot be found". This
doesn't necessarily mean constraining your model so that the true model is
excluded from your parameter space. If your constraints technically encompass
the true solution but only admit an overly complex parametrization of it, they
will still reduce (perhaps drastically) your power to find the true model. In
this case, "overly complex" means unnecessarily many nonzero (or
not-almost-zero) coefficients in the Fourier series.

~~~
joe_the_user
My argument is that there are two kinds of situations:

* The model could encompass the behavior of the input in a smooth fashion if its basic parameters are relaxed.

* The model would tend to start finding models that are wildly different from the main model at the edges (in space and time) if its parameters are relaxed, even if it would eventually find the real model with enough input and training.

One has to handle these two situations differently, right?

------
hooande
I applaud the author of this post. I've seen a lot of people struggle with
machine learning because they don't understand this basic concept. Taking MOOC
classes and reading textbooks is a great way to learn, but they tend to focus
a lot on mathematical principles and not on the start-from-nothing practical
considerations.

Machine learning is almost like learning chess in that there are certain
obvious mistakes that noobs continue to make. And like chess there are
multiple levels of thinking and understanding that are almost impossible to
teach to someone who doesn't have lots of experience. Hopefully more blog
posts like this will help people get past the novice level.

Regarding technical content:

N-fold cross-validation [1] can be a more effective approach than a single
held-out validation set. You split your data into N groups, say N = 10. Then
you use groups 2-10 as a training set to make predictions on group 1, then
groups 1 and 3-10 to make predictions on group 2, and so on. Recombine the
prediction outputs and use the measured error to tune and tweak your
predictor. It's more work and can still lead to overfitting, but it's
generally better to overfit the entire training set than it is to overfit one
held-out sample.

[1] [http://en.wikipedia.org/wiki/Cross-
validation_%28statistics%...](http://en.wikipedia.org/wiki/Cross-
validation_%28statistics%29)
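A minimal sketch of the procedure in Python (scikit-learn's KFold; the
logistic-regression model and the toy data are placeholders):

```python
# 10-fold cross-validation: each group takes one turn as the held-out fold
# while the other nine are used for training.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                        # toy features
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)  # toy labels

scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # held-out accuracy

print(f"cross-validated accuracy: {np.mean(scores):.3f}")
```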

~~~
glifchits
The Coursera ML class has a week of lectures specifically on practical
considerations. The prof discusses how to deal with underfitting/overfitting,
and spends a lot of time on this idea of a cross-validation set. To whoever
reads this: it's a good course!

------
waterside81
I work at a company that sells applied machine learning services, so I'd like
to add a few more tricks to machine learning:

1) Have lots of data

2) Accept the possibility that your problem domain cannot be generalized.

I always find, whether in academic literature or on message boards, a desire
to fit every square peg into a round hole. The reality of real-world data is
that sometimes it's just a 50/50 coin toss. This might be because the features
that _really_ indicate some sort of pattern can't be defined, or they can be
defined but the data can't be reliably retrieved, or the humans running things
have a poor understanding of the problem domain to start with.

TL;DR: There's no magic

~~~
tel
My experience with real world (but still academic) data has been that there is
lots of magic---feature selection to be specific.

(I'm not disagreeing, just referring to a different kind of "magic")

Everything else matters, but when your ML doesn't work it's 100% a feature
selection problem. Which usually means it's 99% a problem of getting lots of
domain expertise jammed up against a lot of ML experience and mathematical
understanding. It's also a bear.

~~~
nabla9
The way 80% of real-world (non-academic) data mining problems are solved:

1. Feature selection.

2. Intelligent data massaging. Real-world data usually has noise that humans
can easily identify as irrelevant or erroneous.

3. Logistic regression.

Starting with simple, well-understood algorithms should be the second lesson
after knowing about validation sets. In those cases where they are not enough,
they set the baseline for comparison against other algorithms.
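For concreteness, a sketch of that baseline recipe in Python (the toy data,
the univariate feature filter, and the cleaning rule stand in for real domain
knowledge):

```python
# 1. feature selection, 2. data massaging, 3. logistic regression.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))
y = (X[:, 0] - X[:, 1] + rng.normal(size=1000) > 0).astype(int)

# 2. "data massage": drop rows a human would flag as junk (toy rule here).
ok = np.all(np.abs(X) < 4, axis=1)
X, y = X[ok], y[ok]

# 1 + 3. keep the most informative features, then fit a logistic model.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
baseline = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression())
baseline.fit(X_tr, y_tr)
print(f"held-out accuracy: {baseline.score(X_te, y_te):.3f}")
```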

~~~
larrydag
I would add: 4. Ensemble methods. Combining a few models tends to generalize
fairly well.

~~~
tel
That's a good point.

------
pyduan
It's quite sad that this post is even necessary. That said, having a proper
training/cross-validation/validation setup is sometimes not that obvious, as
you have to stop and think about possible sources of contamination -- some
sampling biases, for instance, can be quite tricky to detect, or your
algorithm design might be flawed in some subtle way.

Personally, I wish people emphasized more the importance of a general
understanding of econometrics when doing machine learning. In most of the
introductory courses I've seen, the link between the two fields is never made
explicit, despite the obvious analogies (coincidentally, there was an article
by Hal Varian on the front page two days ago that discussed how both fields
could benefit from sharing insights [1]). Understanding the idea behind
minimizing generalization error is one thing, but I find that thinking in
terms of internal/external validity and experiment design often gives people a
more intuitive understanding of validation procedures, both regarding why and
how we should do it. The same goes for understanding effect size, confidence
intervals, causality (and causality inference), and so on.

[1]
[https://news.ycombinator.com/item?id=6870387](https://news.ycombinator.com/item?id=6870387)

~~~
zmjjmz
>stop and think about possible sources of contamination

One great one from my Machine Learning professor was an assignment where we
were required to normalize our data to [0,1]. After doing this and then going
through the typical cross-validation cycle, he had us try and figure out where
we contaminated our validation sets. As it turns out, we all normalized our
data _before_ splitting it up, which meant that training data influenced
testing data.

It's a simple fix, but if you've done that and gone on to run a large
convolutional neural network for a week only to find that you made a stupid
error like that, it can be pretty painful. (Especially since the bad
generalization error might not be obvious until you use the model in
production.)
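The fix is small: fit the normalization on the training split only and reuse
those statistics on the held-out data. A sketch (scikit-learn's MinMaxScaler
as the [0,1] normalizer):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(1000, 20))
y = rng.integers(0, 2, size=1000)

# Split FIRST, then normalize.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

scaler = MinMaxScaler()                  # maps each feature to [0, 1]
X_train = scaler.fit_transform(X_train)  # min/max come from training data only
X_test = scaler.transform(X_test)        # test data never influences the scaling
```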

~~~
im3w1l
Maybe one could benefit from a sort of blinding procedure, where the person
designing the learner is never allowed to even look at the validation data.

------
tedsanders
I strongly disagree with the idea that validation sets are central to machine
learning. The whole point of machine learning (usually) is to predict things
well. Validation sets are merely one technique among many to gauge how well
your predictions are doing. Because they are so easy, they are very common.
But just because they are common doesn't mean they are central to the field.
There are many other techniques out there, like Bayesian model selection (as
the author mentions at the end).

~~~
mjw
Good to see Bayesian model selection get a mention. Bayesian model averaging
is pretty interesting, too, in that it comes, in a sense, with built-in
protection against overfitting.

I still think there is something quite fundamental, though, about validation
sets and other related resampling-based methods for estimating generalisation
performance (cross-validation, bootstrap, jackknife and so on).

The built-in picture you get about predictive performance from Bayesian
methods comes with strong caveats -- "IF you believe in your model and your
priors over its parameters, THEN this is what you should expect". Adding extra
layers of hyperparameters and doing model selection or averaging over them
might sometimes make things less sensitive to your assumptions, but it doesn't
make this problem go away; anything the method tells you is dependent on its
strong assumptions about the generative mechanism.

Most sensible people don't believe their models are true ("all models are
false, some models are useful"), and don't really fully trust a method, fancy
Bayesian methods included, until they've seen how well it does on held-out
data. So then it comes back to the fundamentals -- non-parametric methods for
estimating generalisation performance which make as few assumptions as
possible about the data and the model they're evaluating.

Cross-validation isn't the only one of these, and perhaps not the best, but
it's certainly one of the simplest. One thing people do forget about it is
that it _does_ make at least one basic assumption about your data --
independence -- which is often not true and can be pretty disastrous if you're
dealing with (e.g.) time-series data.
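For the time-series case specifically, one common workaround is to split so
that training data always precedes validation data; a sketch with
scikit-learn's TimeSeriesSplit:

```python
# Shuffled K-fold would let the model "peek" at the future; TimeSeriesSplit
# keeps every training index strictly before every test index.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(120).reshape(-1, 1)  # toy time-ordered observations

for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    assert train_idx.max() < test_idx.min()  # no leakage from the future
    print(f"train up to t={train_idx.max()}, "
          f"validate on t={test_idx.min()}..{test_idx.max()}")
```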

~~~
ced
I agree. As a Bayesian hoping to understand my data, P(X|M1) is useful: it's
the probability I have for X under M1's modelling assumptions. Of course M1 is
an approximation, but that's how science is done. You get to understand how
your model behaves, and you may say "Well, X is a bit higher than it should
be, but that's because M1 assumes a linear response, and we know that's not
quite true".

Bayesian model averaging entails P(X) = P(X|M1)P(M1) + P(X|M2)P(M2). It
assumes that either M1 or M2 is true. No conclusions can be derived from that.
It might be useful from a purely predictive standpoint ( _maybe_ ), but it has
no place inside the scientific pipeline.

There is a related quantity, the Bayes factor P(X|M1)/P(X|M2). That's how much
the data favours M1 over M2, and it's a sensible formula because it doesn't
rely on the abominable P(M1) + P(M2) = 1.

~~~
mjw
Yeah good perspective -- I guess I was thinking about this more from the
perspective of predictive modelling than science.

Model averaging can be quite useful when you're averaging over versions of the
same model with different hyperparameters, e.g. the number of clusters in a
mixture model.

You still need a good hyper-prior over the hyperparameters to avoid
overfitting in these cases, though; as an example, IIRC Dirichlet process
mixture models can often overfit the number of clusters.
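As an aside, scikit-learn's BayesianGaussianMixture gives a cheap way to poke
at this (a toy sketch; the data and the component cap are arbitrary):

```python
# A truncated Dirichlet-process mixture: n_components is only an upper bound,
# and unneeded components should end up with weights near zero.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-4, 1, (200, 2)),
               rng.normal(4, 1, (200, 2))])  # two true clusters

bgm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)
print(np.round(bgm.weights_, 2))  # most mass should land on ~2 components
```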

Agreed that model averaging could be harder to justify as a scientist
comparing models which are qualitatively quite different.

~~~
ced
_Model averaging can be quite useful when you're averaging over versions of
the same model with different hyperparameters, e.g. the number of clusters in
a mixture model._

Yeah, but in this case, there's a crucial difference: within the assumptions
of a mixture model M, N=1, 2, ... clusters _do_ make an exhaustive partition
of the space, whereas if I compute a distribution for models M1 and M2, there
is always M3, M4, ... lurking unexpressed and unaccounted for. In other words,

P(N=1|M) + P(N=2|M) + ... = 1

but

P(M1) + P(M2) << 1

Is the number of clusters even a hyperparameter? Wiki says that
hyperparameters are parameters of the prior distribution. What do you think?

------
bravura
In more formal terms, you are trying to minimize the expected risk
(generalization error).

The expected risk is bounded (with high probability) by the sum of the
empirical risk (training set error) and a structural term (model complexity).

In many instances, having low empirical risk comes at the cost of having high
structural risk, which is overfitting.
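Stated loosely as a formula (a standard structural-risk-minimization bound;
the exact form of the complexity term depends on which theory you use):

```latex
% With probability at least 1 - \delta over the draw of n training samples,
% for all hypotheses h in the class \mathcal{H}:
\[
  R(h) \;\le\; \hat{R}_n(h) + \Omega(\mathcal{H}, n, \delta),
\]
% where R is expected risk, \hat{R}_n is empirical risk, and the penalty
% \Omega grows with the capacity of \mathcal{H} (e.g. its VC dimension)
% and shrinks as n grows.
```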

------
danso
I was just browsing through the classic "Mining of Massive Datasets" book
(which is free!) when I noticed this apt passage in its introduction that
explains the difference between data mining and machine learning:

[http://infolab.stanford.edu/~ullman/mmds.html](http://infolab.stanford.edu/~ullman/mmds.html)

> _There are some who regard data mining as synonymous with machine learning.
> There is no question that some data mining appropriately uses algorithms
> from machine learning. Machine-learning practitioners use the data as a
> training set, to train an algorithm of one of the many types used by
> machine-learning practitioners, such as Bayes nets, support-vector machines,
> decision trees, hidden Markov models, and many others._

 _There are situations where using data in this way makes sense. The typical
case where machine learning is a good approach is when we have little idea of
what we are looking for in the data. For example, it is rather unclear what it
is about movies that makes certain movie-goers like or dislike it. Thus, in
answering the “Netflix challenge” to devise an algorithm that predicts the
ratings of movies by users, based on a sample of their responses, machine-
learning algorithms have proved quite successful. We shall discuss a simple
form of this type of algorithm in Section 9.4._

 _On the other hand, machine learning has not proved successful in situations
where we can describe the goals of the mining more directly. An interesting
case in point is the attempt by WhizBang! Labs to use machine learning to
locate people’s resumes on the Web. It was not able to do better than
algorithms designed by hand to look for some of the obvious words and phrases
that appear in the typical resume. Since everyone who has looked at or written
a resume has a pretty good idea of what resumes contain, there was no mystery
about what makes a Web page a resume. Thus, there was no advantage to machine-
learning over the direct design of an algorithm to discover resumes._


~~~
apw
Will you need to change that definition if I show you a machine learning
algorithm capable of significantly outperforming the best human algorithms on
the resume classification problem?

------
khawkins
I would say, to be succinct, that the main trick in ML is Occam's Razor
([http://en.wikipedia.org/wiki/Occam%27s_razor](http://en.wikipedia.org/wiki/Occam%27s_razor)).

It has been found that, for most problems, a simple model that represents
previous experience well should be accepted instead of a more complex one that
represents it only marginally better. I would claim that the fact this
generally works is an empirical discovery, as opposed to a mathematical
result, though its success probably has philosophical implications.

~~~
JASchilz
Check out Bayesian Model Selection. It's the mathematical expression of
Occam's Razor.

~~~
khawkins
My point is that it shows up everywhere, just in different forms. Sparse
coding has a penalty for large bases. Gaussian process regression tunes the
density of its representation using Bayesian model selection. SVMs have a
slack parameter which dictates how many training errors you'll tolerate in
exchange for a simpler separating hyperplane (a larger margin).
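One more form of the same idea, as a sketch: L1-regularized regression makes
the Occam penalty an explicit term in the objective (squared error plus alpha
times the L1 norm of the coefficients):

```python
# Lasso minimizes ||y - Xw||^2 / (2n) + alpha * ||w||_1; larger alpha trades
# fit for simplicity by driving more coefficients to exactly zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only one feature matters

for alpha in (0.01, 0.5):
    model = Lasso(alpha=alpha).fit(X, y)
    print(f"alpha={alpha}: {np.count_nonzero(model.coef_)} nonzero coefficients")
```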

~~~
JASchilz
I apologize, my reply was aimed too low.

------
sampo
Andrew Ng emphasized this quite clearly in his Machine Learning course on
Coursera.

------
rcthompson
It's not directly related, but I always liked this little "koan":

A man is looking around at the ground under a street lamp. You ask him what he
is looking for, and he says "I'm looking for my keys. I dropped them somewhere
in that parking lot over there." "Then why are you looking under this street
lamp?" you ask. He answers: "Because this is the only place I can see!"

------
stingrae
Seems like HN is causing them problems. I saved the article's text at:
[https://www.evernote.com/shard/s360/sh/4e19f93c-8425-440c-b9...](https://www.evernote.com/shard/s360/sh/4e19f93c-8425-440c-b978-cdd7aa6461f9/309dce824d9b6697b37b4c61251b6cfb)

------
yetanotherphd
I think the situation is more complex than the author states.

For example, if I have a linear model, Y = a + b * X, I will choose a and b to
minimize in-sample error. Choosing a and b to maximize out-of-sample fit goes
against all theory.

However, if I want to choose which variables go into my model, maximizing
out-of-sample fit would be a good approach.

So at the end of the day, there is not a huge philosophical difference between
using in-sample and out-of-sample fit, only different approaches to the same
problem. In both cases, the assumption is (usually) that the data is i.i.d.,
and in both cases, you are choosing some
coefficients/parameters/hyperparameters with the intent of maximizing
out-of-sample fit, but using different methods.
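To make the distinction concrete, a sketch (plain numpy; the data and the
candidate feature sets are toys): coefficients are fit in-sample by least
squares, while the choice among candidate models uses held-out error.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 2.0 + 1.5 * X[:, 0] + rng.normal(scale=0.5, size=100)  # feature 0 matters
X_tr, X_val, y_tr, y_val = X[:70], X[70:], y[:70], y[70:]

best_k, best_err = None, np.inf
for k in (1, 2, 3):  # candidate models: intercept + first k features
    A_tr = np.column_stack([np.ones(70), X_tr[:, :k]])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)  # a, b fit in-sample
    A_val = np.column_stack([np.ones(30), X_val[:, :k]])
    err = np.mean((A_val @ coef - y_val) ** 2)          # chosen out-of-sample
    if err < best_err:
        best_k, best_err = k, err
print(f"selected model uses the first {best_k} feature(s)")
```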

~~~
nkurz
Are you coming from a theoretical math point-of-view or background? It's hard
for me to say exactly why, but I feel your response is evidence of just that
"huge philosophical difference" between traditional stats and machine
learning.

To me, even the statement "if I have a linear model" makes very little sense
from the perspective of ML. Contrast with "if I think I'm dealing with a
situation where a linear model might offer a good fit".

Regarding "maximizing out of sample fit would be a good approach", I think ML
is always and just-about-only concerned with maximizing out-of-sample fit, for
if it wasn't, the solution would be a lookup table.

I'm not trying to imply that you're wrong, rather that I think the 'gulf' is
real. Or maybe I'm misunderstanding your point. For example, I feel that mjw's
comment in this thread captures my view, which I think is more ML centered:
[https://news.ycombinator.com/item?id=6878336](https://news.ycombinator.com/item?id=6878336)

Is that comment also in accord with your view, and it's me that's on the wrong
side of that gulf?

------
girvo
Great post. It's like maths: you have to check your answers. Validation sets
are a way of doing that.

I've been getting into ML lately for my startup, a personal finance system
that will learn your habits and use them to predict things in the future. It's
been overwhelming attempting to move into this domain of software engineering
(so much so that I am currently just hard-coding certain important patterns
and using basic statistical modelling instead), but it is absolutely
fascinating!

------
shiven
Funny that this idea is so foreign to ML. In macromolecular crystallography,
R _free_ [0] is something drilled into every student's brain from day one!

TL;DR: Randomly, a certain percentage (5-10%) of the data is 'hidden' and
never used for building/refining your model; it is only used to evaluate how
well your model fits (or explains) that unseen data. This is absolutely,
fundamentally essential to prevent _over_-fitting your data!

EDIT: Think of it as solving a huge jigsaw puzzle made of thousands of jello
pieces. You randomly hide a hundred or so pieces and try to solve the puzzle.
Having used all the pieces (except the hidden hundred), you think the puzzle
forms a Treasure Map. Now you take the previously hidden pieces and try to fit
them into the puzzle. If, after using the hidden pieces, your puzzle still
looks like a Treasure Map, you may have found a (mostly) correct solution. But
if you are unable to fit those hidden pieces in a way that keeps the Treasure
Map intact, you must ask whether you did in fact find the correct solution, or
whether there is another, slightly different solution that may be more correct
because it accounts for the hidden pieces a little better.

[0]
[http://reference.iucr.org/dictionary/Free_R_factor](http://reference.iucr.org/dictionary/Free_R_factor)

~~~
m_ke
Don't kid yourself: the idea must have been foreign to the author of the post,
but you won't find a single published paper that doesn't test its results
using cross-validation, or at least on some standard test set.

------
mendicantB
Honestly, calling validation a trick isn't helping.

Understanding the motivation behind validation is absolutely fundamental, and
a lack of coherence on the topic shows an inherent lack of understanding of
the goal of building the model in the first place: GENERALIZATION.

It's like checking in code that has no issues locally without ever testing it
in the stack or a production environment.

I work and hire in this space and it's actually a bit shocking how widespread
this lack of understanding is. Asking a candidate how to evaluate a model,
even at a basic level, is this field's version of FizzBuzz. Just like
Fizzbuzz, a lot of candidates I've encountered who are "trained" in machine
learning or statistics fail miserably, and my peers seem to have similar
experiences.

These issues are expected, given how popular data science is these days. We
all win when more people are getting their hands dirty with data, but it's
extraordinarily easy to misuse the techniques and reach misleading
conclusions. This can potentially lead to people pointing fingers at the
field, and to its decline. The only thing we can do is correct the wrongs and
do our best to limit the incompetence that only serves to tarnish the field.

~~~
pmiller2
Count me among those who thought validation was a thing you just had to do
when training ML algorithms. After all, the most beautiful theoretical model
in the world is of no use if the predictions it delivers are terrible.

The real trick (for most algorithms) is to select the correct features to
train against. This really is more of a black art than an exact science, so I
think labeling it a trick is justified.

------
sadfaceunread
Link appears to be /.'ed (HN'd). CoralCache/NYUD.net doesn't seem to have it
in cache. Anyone got a cached page/mirror?

------
orting
I think you need to view whatever process generated the answers as part of
your model. In some cases, and in all textbook examples, we have a ground
truth that is correct. But in real-world applications, such as segmentation
problem in medial imaging, we have a gold standard which represents our best
estimate, but is not necessarily correct.

Validation is not a magic bullet; we need to be critical of any part of the
model that is given as truth, otherwise we might end up fitting a solution to
the wrong problem.

More generally I think that textbooks should emphasize the need for the
scientific method and stress that any model (or theory) is only as good as its
ability to explain the entire problem domain.

------
tocomment
How does the brain generalize to data it hasn't seen before? Any theories?

~~~
Maria1987
According to Piaget's theory of development, while we grow up we have
different experiences from which we acquire new information. If we are, let's
say, naive, with no experiences or memories at all (a state otherwise known as
"tabula rasa"), then we start learning this new information and grouping it
into correlated structures of knowledge, known as schemas. For example,
different types of dogs can form one schema, as they share characteristics and
are correlated knowledge. As we learn, we not only create these schemas but
also adapt them when new, unknown information arrives. For example, if I have
only ever experienced dogs, then when I see a cat I know that it is likely to
be an animal sharing characteristics with dogs, since it is similar to dogs
and will most likely belong to the same or a similar schema. And that's how I
personally believe we learn and interpret new information as it arrives...

Of course there are many different theories, but that's my favourite.

~~~
YZF
I think that, in the context of machine learning, the brain's ability to model
the real world has evolved because a better model of the world represents a
survival advantage. I don't know much about how the brain actually models
reality (and I don't know if anyone does), but the theory of machine learning
still applies in the sense that each individual animal's brain is a model, and
if a model is too complex it will generalize poorly, so the owner of that
brain is likely to do poorly in the real world.

It's very interesting in the sense that the totality of brains over time is
essentially a sort of supervised learning with huge amounts of input data.

------
michaelochurch
Validation isn't "a trick", or shouldn't be. It's just being responsible. I'm
sure there are people getting funded who don't know about it, but they're
charlatans if they don't understand the dangers of overfitting (and
underfitting).

~~~
tlarkworthy
See the early history of machine learning: it was a discovery that is actually
counterintuitive and a common trap for beginners (don't just minimize the
training error).

I have seen new PhDs read about it "in theory" but not internalise it in
practice, and then they go off and do Bayesian structure learning without a
validation set. This DOES happen.

This post is to hammer into the brains of any beginner thinking about machine
learning that understanding the validation set's purpose is the most important
thing to internalise first.

e.g. Machine learning is easier than it looks:
[https://news.ycombinator.com/item?id=6770785](https://news.ycombinator.com/item?id=6770785)

