

Kaggle Ensembling Guide - jphilip147
http://mlwave.com/kaggle-ensembling-guide/

======
sdenton4
Couple points:

a) I think one of the biggest challenges in a Kaggle competition is getting
away from overfitting to the leaderboard. It's super common... I won a Kaggle
competition last year while sitting at something like 65th place on the public
leaderboard at the end: the other teams were overfitting like crazy. As such,
one should be super careful when taking 'well-performing' models to build an
ensemble.

b) The point about ensembling uncorrelated models is hella important. If you
make an ensemble consisting of 20 near-identical predictions from one
algorithm, and 10 near-identical predictions from another algorithm, you're in
effect taking a vote between the two algorithms and giving the first one a
2/3 weighting.
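
If it helps to see the arithmetic, here's a quick NumPy sketch (the
predictions and noise levels are made up purely for illustration): averaging
20 near-copies of one model and 10 near-copies of another collapses to a
2/3 vs 1/3 weighted blend of the two.

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up stand-ins for the outputs of two different algorithms.
    pred_a = rng.random(1000)
    pred_b = rng.random(1000)

    # 20 near-identical copies of A and 10 near-identical copies of B.
    members = [pred_a + rng.normal(0, 1e-3, 1000) for _ in range(20)]
    members += [pred_b + rng.normal(0, 1e-3, 1000) for _ in range(10)]

    naive_average = np.mean(members, axis=0)
    weighted_vote = (2 / 3) * pred_a + (1 / 3) * pred_b

    # The plain average of the 30 members collapses to a 2/3 vs 1/3 blend.
    print(np.allclose(naive_average, weighted_vote, atol=1e-3))  # True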

It might be interesting to think about explicitly de-correlating the model
outputs and finding a nice 'voting' method for combining the results... (And
actually, this comes down to Z_2 arithmetic, so we could probably use a
Fourier transform for it... I think I feel a blog post coming on.)
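
As a rough sketch of the de-correlation idea (made-up predictions, an
arbitrary correlation cutoff, and plain majority voting rather than anything
Z_2/Fourier-based):

    import numpy as np

    def select_diverse(preds, max_corr=0.95):
        # Greedily keep predictions whose correlation with every kept one is low.
        kept = []
        for p in preds:
            if all(abs(np.corrcoef(p, q)[0, 1]) < max_corr for q in kept):
                kept.append(p)
        return kept

    def majority_vote(preds):
        # Combine binary 0/1 predictions by simple majority.
        return (np.mean(preds, axis=0) >= 0.5).astype(int)

    # Toy usage with made-up binary predictions.
    rng = np.random.default_rng(1)
    candidates = [rng.integers(0, 2, 100) for _ in range(5)]
    final = majority_vote(select_diverse(candidates))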

~~~
dthal
>> a) I think one of the biggest challenges in a Kaggle competition is getting
away from overfitting to the leaderboard.

This actually depends on the data. The commenter above won the Social Circles
competition. That competition had a very small number of instances - it looks
like it was 60 in the training set and 50 in the test set. It had one of the
larger shakeups in Kaggle history.

~~~
mikeskim
It's basically impossible to overfit to the leaderboard in some Kaggle
competitions, like Avazu, where both the train and test sets are massive in
terms of unique observations.

------
solve
Surprisingly good, both as a broad overview and in the specifics.

