
Machine learning for the impatient: algorithms tuning algorithms - aelaguiz
http://www.aelag.com/147952673
======
danger
As another commenter pointed out, the accuracy _really_ needs to be evaluated
using a validation set, not the test set--the approach described in the post
is training with the testing data. In the field, we call this "cheating".

The basic idea of automatically tuning hyperparameters (the things this post
discusses tuning with genetic algorithms) is cool, though, and is becoming a
popular subject in machine learning research. A couple recent research papers
on the topic are pretty readable:

Algorithms for Hyper-Parameter Optimization:

<http://books.nips.cc/papers/files/nips24/NIPS2011_1385.pdf>

Practical Bayesian Optimization of Machine Learning Algorithms:

<http://arxiv.org/abs/1206.2944>
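The protocol behind those papers can be sketched in a few lines: candidate hyperparameter settings are scored on a held-out validation set, and the test set is touched exactly once at the very end. This is a toy sketch (the threshold "model" and the synthetic dataset are made up for illustration, not the post's actual setup):

```python
import random

random.seed(0)

# Toy data: x in [-1, 1], label = (x > 0) with 10% of labels flipped.
def make_data(n):
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [int(x > 0) if random.random() > 0.1 else int(x <= 0) for x in xs]
    return list(zip(xs, ys))

train, valid, test = make_data(200), make_data(100), make_data(100)

def accuracy(threshold, data):
    return sum(int(x > threshold) == y for x, y in data) / len(data)

# Random search: each candidate threshold is scored on the *validation* set...
candidates = [random.uniform(-1, 1) for _ in range(50)]
best = max(candidates, key=lambda t: accuracy(t, valid))

# ...and the test set is touched exactly once, at the very end.
final_score = accuracy(best, test)
```

The key point is the separation of roles: the validation set absorbs all the repeated peeking during tuning, so the single test-set number stays an honest estimate.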

~~~
aelaguiz
Thanks for the information! I've updated the article to reflect this.

Here's a question: where does "the field" hang out? Is there a cohesive
community of any sort?

~~~
danger
I'd say the closest thing to a cohesive community would be the MetaOptimize
Q&A forum, but maybe others have other suggestions:

<http://metaoptimize.com/qa>

------
scottfr
Such aggressive usage of the test data set in determining the tuning
parameters in effect makes your test data set part of your training data set.

The more times you go back to your test data set to evaluate the effectiveness
of a model, the more optimistic your error predictions will be and the greater
your chance of overfitting. Several iterations of his loop will probably
improve the model, but if you keep repeating it, the true model performance
will eventually start to degrade.
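That optimism is easy to demonstrate with a simulation (a made-up example, not from the post): if the labels are pure coin flips, every model's true accuracy is 50%, yet picking the best of many candidate models by their reused test-set score makes the winner look much better than it really is:

```python
import random

random.seed(1)

n_points, n_models = 100, 200

# Labels are pure coin flips, so every model's true accuracy is 50%.
test_labels = [random.randint(0, 1) for _ in range(n_points)]
fresh_labels = [random.randint(0, 1) for _ in range(n_points)]

def score(preds, labels):
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

# "Models" here are just random prediction vectors.
models = [[random.randint(0, 1) for _ in range(n_points)]
          for _ in range(n_models)]

# Pick the model that looks best on the reused test set...
best = max(models, key=lambda m: score(m, test_labels))
reused_score = score(best, test_labels)

# ...then check it on data that was never used for selection.
honest_score = score(best, fresh_labels)
```

`reused_score` comes out well above 50% purely by selection pressure, while `honest_score` falls back toward the true 50% — the same mechanism that inflates accuracy when tuning against the test set.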

~~~
brador
Question: Why is the data for machine learning split into a training dataset
and a test dataset? Wouldn't using the entire dataset to build the model
result in greater accuracy of the predictions?

~~~
reginaldo
When you develop a model, for instance, when implementing a classifier, you
supposedly want to apply the developed model to _other_ data, i.e., data you
don't have available during development.

In many situations, it doesn't make sense to wait until your model is making
or influencing real-world decisions to test it (although you have to test in
the real world too). You'll want to test the predictions of your model
on data you already have the _actual_ results for. To test your model you'll
split your data into data you know _and will let the model know about_
(training dataset), and data you know _but the model can't know about_ (test
dataset). That way you can use the data the model doesn't know about to make
controlled experiments and compare models (and, if your data is really
representative of the real world, your model comparison and the performance of
your chosen model will hold).

The moral of the story is: if you don't split your data, you won't have any
idea of how your model performs in the real world; you'll only know how it
performs on data it already knew about.
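As a minimal sketch of the split described above (the function name and the 70/30 ratio are arbitrary choices, not from the comment), it boils down to a shuffle and a cut:

```python
import random

def train_test_split(data, test_fraction=0.3, seed=42):
    """Shuffle, then hold out the last chunk as the test set."""
    data = list(data)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]

train, test = train_test_split(range(100))
```

The shuffle matters: without it, any ordering in the original data (by time, by class, by source) leaks into the split and biases the comparison.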

~~~
brador
I understand this is the way it is done; however, it all feels a bit
hand-wavy and not how my gut tells me to do it.

I'd take the entire dataset, treat columns as variables, assign a weight to
each of these variables, and step each weight in 0.1 increments (so the final
number of passes is 11^n, where n is the number of variables). I'd have a
column at the end of each row marking what was predicted right (+1) and what
was predicted wrong (-1), and sum that column. Hit run until the optimal
weights for each variable are found. I'd use the entire dataset to do this.

Is there any mathematics on defining what % of the dataset should be training
vs. testing or is it left to the analyst (like with confidence intervals (95%
hypothesis testing etc.))?
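For what it's worth, the exhaustive search described above is easy to write down. This sketch uses a made-up two-variable dataset and a made-up decision rule (a weighted sum against a 0.5 cutoff) purely to illustrate the 11^n grid — and note that it scores on the full dataset, which is exactly the pitfall the parent comments describe:

```python
from itertools import product

# Made-up two-variable dataset: rows of (x1, x2, label).
data = [(0.9, 0.1, 1), (0.8, 0.3, 1), (0.2, 0.7, 0), (0.1, 0.9, 0)]

weights_grid = [round(i * 0.1, 1) for i in range(11)]  # 0.0 .. 1.0

def score(w1, w2):
    # +1 for each right prediction, -1 for each wrong one, as described above.
    return sum(1 if int(w1 * x1 + w2 * x2 > 0.5) == y else -1
               for x1, x2, y in data)

# 11**2 = 121 passes for n = 2 variables.
best = max(product(weights_grid, repeat=2), key=lambda w: score(*w))
```

Because the weights were chosen to maximize the score on the same data they are scored on, the final sum says nothing about performance on new data — that is the thread's objection.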

~~~
ninjin
> Is there any mathematics on defining what % of the dataset should be
> training vs. testing or is it left to the analyst (like with confidence
> intervals (95% hypothesis testing etc.))?

In my own field, Natural Language Processing (NLP), it is either up to the
original creators of a dataset or you do your own split if there isn't one
established already. I'll go with what I have learned for supervised learning.

In an ideal world all three sets -- training, development (the machine
learning people sometimes call this one the validation set), and test --
should be infinitely large. Also, you should preferably not stratify or try to
make the assignment anything but random (there are cases where stratification
could be justified, but let's not go there just yet).

I personally go for a 3/6 train, 1/6 development, 2/6 test split, but I have
just as often seen 2/4, 1/4 and 1/4, etc. Training is essential, so it gets
the biggest cut; testing is important too, so it also gets a large chunk; and
development is the least important of the three, so it gets the smallest one.
In short, the training set is for making sure your algorithm can learn
something, the development set is there to guide your development so you don't
fool yourself, and the test set is what lets you back the claims you make to
other people (thus, it is pretty darn important).

I then use the train and development sets when constructing the model, do the
write-up of most of my results, and generate the final results by running on
the test set only once, with the model that performed best on development.
What you usually see is a drop in performance, but this is expected, since you
have most likely overfitted the development set. Since the hyperparameters
need to be adjusted as well, I commonly do ten-fold cross-validation on the
test set and use some variant of grid search (I read yesterday that this
approach to hyperparameters is coming under fire as naive; I need to have a
look at what has been going on in ML for the last two years).
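The 3/6-1/6-2/6 split described above might look like this (a sketch; the shuffle-then-cut approach and the integer-proportions interface are my own choices, not from the comment):

```python
import random

def three_way_split(data, parts=(3, 1, 2), seed=0):
    """Random (unstratified) train/dev/test split in the given proportions."""
    data = list(data)
    random.Random(seed).shuffle(data)
    total, n = sum(parts), len(data)
    n_train = n * parts[0] // total
    n_dev = n * parts[1] // total
    return (data[:n_train],
            data[n_train:n_train + n_dev],
            data[n_train + n_dev:])

train, dev, test = three_way_split(range(600))
```

All tuning and model selection happens against `dev`; `test` is reserved for the single final run that produces the numbers you report.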

~~~
ninjin
> I commonly do ten-fold cross-validation on the test set...

Um, darn, edit period ran out, test set should obviously be train set in the
above quote. Otherwise my PI would probably smack me in the face for
overfitting the test set.

------
bencpeters
I had a few questions about the actual implementation of this stuff. I took
the Coursera ML course, so much of the terminology and many of the techniques
are familiar after that, but Professor Ng structured the exercises in the
course around Matlab/Octave and suggested using one of those tools for a
first-pass solution when implementing machine learning problems.

Have you used Matlab much in your work? How does the performance (and
libraries available) compare with Python?

Also, does anyone know a good resource for finding good, high performance ML
libraries for other languages (Ruby, C++, etc.)?

~~~
punee
Scikit-learn for Python, great tutorials/user guide covering various ML
techniques, makes prototyping very easy: <http://scikit-learn.org/stable/>
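A minimal end-to-end example with scikit-learn (assuming a reasonably recent version, where `train_test_split` lives in `sklearn.model_selection`; the iris dataset and logistic regression are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hold out a test set, fit on the rest, and score on the held-out data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

The same estimator/`fit`/`score` interface carries over to nearly everything in the library, which is a big part of why it is pleasant for prototyping.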

------
misiti3780
This article is very interesting but I am really confused about the edit:

"A few people have pointed out that using the testing set for tuning demands
that final measure of effectiveness be doing using a validation test set which
is not part of either the training or testing datasets. This is due for the
very real potential of over fitting. Also – apparently this technique is
called “Hyper-Parameter Optimization.” A helpful commenter over at Hacker News
supplied the following resources"

Does that mean there is definitely overfitting going on, but it is acceptable
for the purposes of the article? That 82.5% accuracy rate has overfitting
written all over it.

Good stuff regardless!

