The more times you go back to your test data set to evaluate a model, the more optimistic your error estimates will be and the greater your chance of overfitting. Several iterations of this loop will probably improve the model, but if you keep repeating it, eventually the true model performance will start to degrade.
In many situations, it doesn't make sense to test your model only once it's deployed to make or influence decisions in the real world (although you have to test in the real world too). You'll want to test your model's predictions on data you already have the actual results for. To do that, you'll split your data into data you know and will let the model see (the training dataset), and data you know but the model can't see (the test dataset). That way you can use the data the model hasn't seen to run controlled experiments and compare models (and, if your data is truly representative of the real world, your model comparison and the performance of your chosen model will hold up).
The moral of the story is: if you don't split your data, you won't have any idea how your model performs in the real world; you'll only know how it performs on data it has already seen.
I'd get the entire dataset, set columns as variables, assign a weight to each of these variables, and step each weight in 0.1 increments (so the final number of passes is 11^n, where n is the number of variables). I'd have a column at the end of each row recording what was predicted right (+1) and what was predicted wrong (-1), and sum that column. Hit run until the optimal weights for each variable are found. I'd use the entire dataset to do this.
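For illustration, here's a minimal sketch of that brute-force search. All names are mine, and I'm assuming the weighted sum is thresholded at zero to produce the +1/-1 prediction, since the comment doesn't say how weights become predictions:

```python
from itertools import product

def grid_search_weights(rows, labels, n_vars):
    """Exhaustively try every weight vector with entries 0.0, 0.1, ..., 1.0.

    `rows` is a list of feature tuples, `labels` a list of +1/-1 outcomes.
    Scores +1 for each correct prediction and -1 for each wrong one, as
    described above. 11**n_vars combinations in total, so this explodes
    quickly as n_vars grows.
    """
    steps = [i / 10 for i in range(11)]  # 0.0 .. 1.0 in 0.1 increments
    best_score, best_weights = float("-inf"), None
    for weights in product(steps, repeat=n_vars):
        score = 0
        for row, label in zip(rows, labels):
            # Assumption: predict +1 if the weighted sum is positive, else -1.
            prediction = 1 if sum(w * x for w, x in zip(weights, row)) > 0 else -1
            score += 1 if prediction == label else -1
        if score > best_score:
            best_score, best_weights = score, weights
    return best_weights, best_score
```

Note that because the whole dataset is used for the search, the resulting score is exactly the kind of optimistic estimate the article warns about.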
Is there any mathematics for deciding what % of the dataset should be training vs. testing, or is it left to the analyst (like confidence levels in hypothesis testing, e.g. 95%)?
I know what you mean, I felt the same way when I first learned about training and test sets. However...
> Is there any mathematics for deciding what % of the dataset should be training vs. testing, or is it left to the analyst (like confidence levels in hypothesis testing, e.g. 95%)?
Indeed there is - the proper name for the statistical technique is "cross-validation", and much work has been done on the matter. The most common way to do it is "k-fold cross-validation", which, when k=2, simply means splitting your data in half randomly and using one half for training and one for testing - basically what the author did. Generalized k-fold cross-validation means you slice your data into k subsets, pick one to use as the test set, and use the rest as the training set. Then you repeat the process ("folding") k times, picking a different test set each time, and average the results. This, of course, takes longer, but it has a better gut feel because your algorithm is "learning" from all the data rather than just whatever slice you originally chose as your training set. Wikipedia lists a few more common variants.
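A minimal sketch of k-fold cross-validation in plain Python; the `train_and_score` callback is a stand-in for whatever fitting and scoring you actually do:

```python
import random

def k_fold_cv(data, k, train_and_score):
    """Generic k-fold cross-validation.

    `train_and_score(train, test)` is any function that fits a model on
    `train` and returns its score on `test` (the name is illustrative).
    Shuffles once, slices into k folds, and averages the k test scores.
    """
    data = data[:]                        # don't mutate the caller's list
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(train, test))
    return sum(scores) / k
```

With k=2 this reduces to the half-and-half split described above; larger k (5 or 10 are common) trades compute time for lower variance in the estimate.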
However, since the author's process involves tuning parameters after testing the algorithm against the test set, I'm not really sure how cross-validation would work... Maybe standard k-fold with three sets instead of two - a training set, an intermediate test set, and a true test set?
In my own field, Natural Language Processing (NLP), it is either up to the original creators of a dataset, or you do your own split if there isn't an established one already. I'll go with what I have learned for supervised learning.
In an ideal world, all three sets (training, development, and test) should be infinitely large; the Machine Learning people sometimes call the development set the validation set, if I remember correctly. Also, you should preferably not stratify or make the assignment anything but random (there are cases where stratification can be justified, but let's not go there just yet).
I personally go for a 3/6 train, 1/6 development, 2/6 test split, but I have just as often seen 2/4, 1/4 and 1/4, etc. Training is essential, so it gets the biggest cut; testing is important too, so it also gets a large chunk; and development is the least important of the three, so it gets the smallest one. In short: train is for making sure your algorithm can learn something, development is to guide your development without fooling yourself, and test is to back up the claims you make to other people (thus, it is pretty darn important).
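That 3/6, 1/6, 2/6 split comes down to a single shuffle and two slices. A sketch, assuming the data fits in a list and using a fixed seed for reproducibility (the function name is mine):

```python
import random

def three_way_split(data, seed=0):
    """Random 3/6 train, 1/6 development, 2/6 test split,
    the proportions described above. Shuffles a copy so the
    caller's list is left untouched."""
    data = data[:]
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train, n_dev = n * 3 // 6, n // 6
    train = data[:n_train]
    dev = data[n_train:n_train + n_dev]
    test = data[n_train + n_dev:]
    return train, dev, test
```

The test slice takes whatever remains after train and dev, so no examples are lost to integer rounding.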
I then use the train and development sets when constructing the model, do the write-up of most of my results, and generate the final results by running on the test set only once, with the model that performed best on development. What you usually see is a drop in performance, but this is expected, since you have most likely overfitted the development set. Since the hyperparameters need to be adjusted as well, I commonly do ten-fold cross-validation on the test set and use some variant of grid search (I read yesterday that this approach to hyperparameters is coming under fire as naive; I need to have a look at what has been going on in ML for the last two years).
Um, darn, the edit period ran out - "test set" should obviously be "train set" in the above. Otherwise my PI would probably smack me in the face for overfitting the test set.
1. That the particular input features you've chosen are somehow the only possible choice. But who's to say you shouldn't add new features that are the square of each original feature? Or maybe some cross-product terms, like the product of the ith feature and the jth feature. Or maybe some good features to add would be the distance to each point you've seen so far. Etc. Continuing down this path, you basically get to the question discussed in the OP about choosing a kernel for SVMs. This is just one example of where hyperparameters come into play, and you need some method for choosing them.
2. That a linear predictor is impervious to overfitting. Consider the extreme case (which comes up often) where you have millions or billions of features and far fewer examples (e.g., if features are n-gram occurrences in text, or gene expression data). Then it's likely that there are many weight settings that fit the data perfectly, but there's no way to tell whether you're just picking up on statistical noise or have learned something that will make good predictions on new data. In both theory and practice, you need some form of regularization, and along with it come more hyperparameters, which need to be chosen.
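To make the regularization point concrete, here's a sketch of one gradient step on an L2-penalized (ridge-style) squared loss. This is a generic illustration, not anyone's actual method; the penalty term is what keeps an underdetermined linear fit from being "perfect for free":

```python
def ridge_gradient_step(weights, rows, labels, lam, lr):
    """One gradient-descent step on the L2-regularized squared loss

        L(w) = sum_i (w . x_i - y_i)^2 + lam * ||w||^2

    The lam * ||w||^2 penalty pulls weights toward zero, so with more
    features than examples the fit can no longer match the data exactly
    with arbitrarily chosen weights.
    """
    grad = [2 * lam * w for w in weights]          # gradient of the penalty
    for row, y in zip(rows, labels):
        err = sum(w * x for w, x in zip(weights, row)) - y
        for j in range(len(weights)):
            grad[j] += 2 * err * row[j]            # gradient of the data term
    return [w - lr * g for w, g in zip(weights, grad)]
```

With one example and two features the unpenalized problem has infinitely many perfect solutions; adding the penalty selects a unique, smaller-norm one, which is exactly the tie-breaking role regularization plays at scale.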
Finally, by your reasoning, it seems like you would always choose a 1-nearest-neighbors classifier (because it will always end up with 0 error under the setting you propose). But there's no reason why this is in general a good idea.
I think you're thinking about this wrong. You need to set aside some of your data for testing/cross-validation; otherwise you have no way of knowing whether you even have enough data to train your model! I.e., you want to get to the point where additional training data isn't getting you a better model, and if you're not at that point, you should probably collect more data.
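One way to check whether more data would help is a learning curve: train on growing fractions of the data and score each model on the same held-out test set. If the scores are still climbing at the largest fraction, more data would likely help. A sketch (the `fit_and_score` callback is a placeholder for your real training routine):

```python
import random

def learning_curve(data, fit_and_score, fractions=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Score a model on a fixed held-out test set while growing the
    training set.

    `fit_and_score(train, test)` is any function that fits on `train`
    and returns a test-set score (illustrative name). Returns one score
    per training-set fraction.
    """
    data = data[:]
    random.Random(0).shuffle(data)
    split = int(len(data) * 0.8)            # hold out 20% as the test set
    pool, test = data[:split], data[split:]
    return [fit_and_score(pool[:int(len(pool) * f)], test) for f in fractions]
```

The key design point is that the test set is fixed across all fractions, so differences in score reflect the amount of training data, not a different evaluation set.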
Either way, my knowledge on machine learning is lacking, time to hit up Google.
You need a testing dataset in order to validate the performance of the model. If you validate against the training set, what you're really doing is measuring the model's ability to fit the training data - which it will do with high accuracy. That will, however, result in a much diminished ability to predict any new games, since it isn't "learning" the features of college basketball so much as memorizing the contents of the training set.
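That memorization effect is easy to demonstrate: a 1-nearest-neighbor classifier scores perfectly on its own training data even when the labels are pure coin flips, while doing no better than chance on held-out data. A small self-contained sketch (unrelated to basketball, just the smallest example that shows it):

```python
import random

def one_nn_predict(train, query):
    """1-nearest-neighbor: return the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - query))[1]

def accuracy(train, points):
    return sum(one_nn_predict(train, x) == y for x, y in points) / len(points)

# Pure-noise data: features are random, labels are coin flips, so there is
# nothing to learn. 1-NN still scores 100% on the data it memorized.
rng = random.Random(0)
data = [(rng.random(), rng.choice([-1, 1])) for _ in range(200)]
train, test = data[:100], data[100:]
print(accuracy(train, train))   # 1.0 - memorization, not learning
print(accuracy(train, test))    # near 0.5 - chance level on unseen data
```

The training accuracy is perfect by construction (each point's nearest neighbor is itself), which is exactly why it tells you nothing about real predictive power.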
As others pointed out, it'd probably be even better to add a third grouping that is tested against only after the algorithm has finished - as an objective validation against as-yet-unseen data.
edit: As noted in the other reply, having a truly blind validation set is still ideal.
When you have the split, you can then tell which idea seems to work better.