Hacker News new | comments | show | ask | jobs | submit login

Such aggressive usage of the test data set in determining the tuning parameters in effect makes your test data set part of your training data set.

The more times you go back to your test data set to evaluate the effectiveness of a model, the more optimistic your error predictions will be and the greater your chance of overfitting. Several iterations of his loop will probably improve the model, but if you keep repeating it eventually the true model performance will start to degrade.

Question: Why is the data for machine learning split into a training dataset and a test dataset? Wouldn't using the entire dataset to build the model result in greater accuracy of the predictions?

When you develop a model, for instance, when implementing a classifier, you supposedly want to apply the developed model to other data, i.e., data you don't have available during development.

In many situations, it doesn't make sense to test your model only when it's put to make or influence decisions in the real world (although you have to test in the real world too). You'll want to test the predictions of your model on data you already have the actual results for. To test your model you'll split your data into data you know and will let the model know about (training dataset), and data you know but the model can't know about (test dataset). That way you can use the data the model doesn't know about to make controlled experiments and compare models (and, if your data is really representative of the real world, your mofrl comparison and the performance of your chosen model will hold).

The moral of the story is: if you don't split your data, you won't have any idea of how it performs in the real world, you'll only know how it performs with data it already knew about.

I understand this is the way it is done, however, it all feels a bit hand wavy and not how my gut tells me to do it.

I'd get the entire dataset, set columns as variables, assign weight to each of these variables and process each weight in 0.1 increments. (So the final number of passes is 11^n where n is number of variables). I'd have something at the end of the row to know what was predicted right (+1) and what was predicted wrong (-1), sum this column. Hit run until optimal weights for each variable are found. I'd use the entire dataset to do this.

Is there any mathematics on defining what % of the dataset should be training vs. testing or is it left to the analyst (like with confidence intervals (95% hypothesis testing etc.))?

> I understand this is the way it is done, however, it all feels a bit hand wavy and not how my gut tells me to do it.

I know what you mean, I felt the same way when I first learned about training and test sets. However...

> Is there any mathematics on defining what % of the dataset should be training vs. testing or is it left to the analyst (like with confidence intervals (95% hypothesis testing etc.))?

Indeed there is - the proper name for the statistical technique is "cross-validation" and much work has been done on the matter. The most common way to do it is "k-fold cross validation", which, when k=2 simply means to split your data in half randomly and use one for training and one for test - basically what the author did. The generalized k-fold cross-validation means that you slice up your data into k subsets, pick one to use as a test set and use the rest as a training set. Then repeat the process ("folding") k times, picking a different test set every time and averaging the results. This, of course, takes longer, but it has a better gut feel because your algorithm is "learning" from all the data rather than just whatever slice you originally chose as your training set. Wikipedia lists a few more common ways to do it:


However, since the author's process involves tuning parameters after testing the algorithm against the test set, I'm not really sure how cross-validation would work... Maybe standard k-fold with three sets instead of two - a training set, an intermediate test set, and a true test set?

> Is there any mathematics on defining what % of the dataset should be training vs. testing or is it left to the analyst (like with confidence intervals (95% hypothesis testing etc.))?

In my own field, Natural Language Processing (NLP), it is either up to the original creators of a dataset or you do your own split if there isn't one established already. I'll go with what I have learned for supervised learning.

In an ideal world all three sets; training, development (the Machine Learning people sometimes call this one verification if I remember correctly) and test should be infinitely large. Also, you should preferably not stratify or try to make the assignment anything but random (there are cases where this could be justified, but let's not go there just yet).

I personally go for a 3/6 train, 1/6 development and 2/6 test, but I have just as well seen 2/4, 1/4 and 1/4, etc. Training is essential, so it gets the biggest cut, testing is important too so it also gets a large chunk and development is the least important out of the three so it gets the smaller one. In short, train is for making sure your algorithm can learn something, development is in order to guide your development and not fool yourself, lastly, test is in order to be able to make claims that you state to other people (thus, it is pretty darn important).

I then use the train and development set when constructing the model, I do the write-up of most of my results and then generate the final results by running on the test set only once with the model that performed the best on development. What you usually see is a drop in performance, but this is expected since you have most likely overfitted the development set. Since the hyper parameters need to be adjusted as well I commonly do ten-fold cross-validation on the test set and use some variant of grid-search (read yesterday that this approach for hyper parameters is coming under fire as being naive, I need to have a look at what has been going on in ML for the last two years).

> I commonly do ten-fold cross-validation on the test set...

Um, darn, edit period ran out, test set should obviously be train set in the above quote. Otherwise my PI would probably smack me in the face for overfitting the test set.

Another way to say what you're proposing is to use a linear predictor, and to train it by globally optimizing 0-1 loss (just the number of mistakes that you make) on the full set of data that you have. Even ignoring computational issues (this can be shown to be NP-hard), you seem to be making several mistaken assumptions. I'd really recommend reading some basic stuff on generalization, but a couple of the mistaken assumptions are as follows:

1. That the particular input features that you've chosen are somehow the only possible choice. But who's to say that you shouldn't add new features which are the square of each original feature? Or maybe some cross product terms, like the product of the ith feature times the jth feature. Or maybe some good features to add would be the distance from each point you've seen so far. Etc. Continuing down this path, you basically get to the question discussed in the OP about choosing a kernel for SVMs. This is just one example where hyperparameters come into play, and you need some method for choosing them.

2. That a linear predictor is impervious to overfitting. Consider the extreme case (which comes up often) where you have millions or billions of features and far fewer examples (e.g., if features are n-gram occurrences in text, or gene expression data). Then it's likely that there are many settings of weights that fit the data perfectly, but there's no way to tell if you're just picking up on statistical noise, or if you've learned something that will make good predictions on new data that you encounter. In both theory and practice, you need some form of regularization, and along with this comes more hyperparameters, which need to be chosen.

Finally, by your reasoning, it seems like you would always choose a 1-nearest neighbors classifier [1] (because it will always end up with 0 error under the setting your propose). But there's no reason why this is in general a good idea.

[1] http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm

> Is there any mathematics on defining what % of the dataset should be training vs. testing or is it left to the analyst (like with confidence intervals (95% hypothesis testing etc.))?

I think you're thinking about this wrong. You need to set aside some of your data for testing/cross-validation, otherwise you have no way of knowing even if you have enough data to train your model! i.e. you want to get to the point where additional training data isn't getting you a better model, and if you're not at that point you should probably collect more data.

It seems I'm thinking of a prediction algorithm and being discussed is a genetic algorithm, for which I can understand the necessity of a training set and a test set.

Either way, my knowledge on machine learning is lacking, time to hit up Google.

You might be interested in Andrew Ng's ML course via Coursera.

Solomonoff induction is a principled but uncomputable way to do without a test set. There might be a useful approximation to it for your problem.

For this I actually had sufficient data available that adding more didn't do much.

You need a testing dataset in order to validate the performance of the model. If you validate against the training set really what you're doing is measuring the model's ability to fit the training data - which it will be able to do with high accuracy. That will, however, result in a much diminished ability to predict any new games - as it isn't "learning" the features of college basketball as much as it is memorizing the contents of the training set.

As others pointed out, it'd probably even be better to add a third grouping which is tested against only after the algorithm has finished - as an objective validation against as of yet unseen data.

To prevent overfitting. A common technique is to use cross-validation, where you train some percentage of the data (90%) and test on the remaining (10%), and cycle through which 10% you use as test (each train/test split is called a "fold"). Once you've identified a model that resists overfitting, you can train on all of the data.

edit: As noted in the other reply, having a truly blind validation set is still ideal.

Suppose that I have 2 possible models that I might try to build. I can train them both on my data set, but how do I get a sense as to which is better?

When you have the split, you can then tell which idea seems to work better.

Yes. Training on the test set overestimates performance, cross validation underestimates it.

Yep, I've definitely seen that. I bet if I were to reserve a portion of the dataset for validation after all tuning, I'd get a much better measure of it's actual ability at prediction. Using it in the way that I did, I wouldn't be surprised if there was significant overfitting. In fact I'd be surprised if there wasn't.

You can split it into training, validation, and test sets if you have enough data. Then, the validation set would be used for the training. Presumably by using a genetic algorithm, he isn't going for a perfect fit, which may help avoid overfitting.

Applications are open for YC Summer 2018

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact