
How and why to create a good validation set - tatadada
http://www.fast.ai/2017/11/13/validation-sets/
======
nl
In warfare, amateurs talk tactics and professionals study logistics. In
machine learning, amateurs talk models and professionals build good
test/train/validation sets.

It's an open secret on Kaggle that a good validation procedure is what
separates the top of the leaderboard from the rest. People are happy to share
their modelling insights and their ensembling processes, but it is very rare
to see someone share their test/train/validation splits before a competition
is finished.

------
cdancette
TL;DR: a random subset of the data is not always the best choice. E.g. for
time series you might want to hold out a contiguous chunk of time
(consecutive examples) instead.
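
For example, a minimal sketch of such a time-based split with pandas (the DataFrame and column names here are made up for illustration):

```python
import pandas as pd

# Toy time series; "timestamp", "feature" and "target" are illustrative names.
df = pd.DataFrame({
    "timestamp": pd.date_range("2017-01-01", periods=100, freq="D"),
    "feature": range(100),
    "target": [i % 2 for i in range(100)],
})

# Sort by time and hold out the most recent 20% as the validation set,
# instead of sampling rows at random.
df = df.sort_values("timestamp")
cutoff = int(len(df) * 0.8)
train, valid = df.iloc[:cutoff], df.iloc[cutoff:]
```

The validation set then only contains examples that come after everything in the training set, which mimics how the model will actually be used.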

------
kmike84
The article is good, but the "The dangers of cross-validation" section is
wrong.

Cross-validation means you split your data into folds and run multiple
experiments, getting better data efficiency as compared to a single split.
Splits don't have to be random.
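
A quick sketch of that fold mechanism with scikit-learn (toy arrays, names made up):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Five folds -> five experiments from one dataset: every sample is used
# for evaluation exactly once and for training in the other four folds.
for train_idx, test_idx in KFold(n_splits=5).split(X):
    X_train, y_train = X[train_idx], y[train_idx]  # fit a model here
    X_test, y_test = X[test_idx], y[test_idx]      # score it on this fold
```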

scikit-learn provides GroupKFold and TimeSeriesSplit objects to use with
cross-validation, which address exactly the problems described in the article.
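
Both produce non-random splits by construction. A self-contained sketch on toy data (the group ids are made up):

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# GroupKFold: samples sharing a group id (e.g. one patient or one user)
# always land on the same side of a split, so groups never leak between
# train and test.
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])  # illustrative ids
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=groups):
    assert not set(groups[train_idx]) & set(groups[test_idx])

# TimeSeriesSplit: every fold trains on the past and tests on the future,
# never the other way around.
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    assert train_idx.max() < test_idx.min()
```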

Cross-validation is not a panacea; in the real world there are many other
issues. For example, it is common to have a training dataset which doesn't
represent the production data distribution, often because such a dataset
turns out to be much cheaper to collect.

Andrew Ng's [http://www.mlyearning.org/](http://www.mlyearning.org/) is a
great resource on this; it discusses real-world problems with
train/test/validation splits and gives practical advice. I can't recommend it
highly enough.

