
Our perfect submission - luu
https://www.kaggle.com/c/restaurant-revenue-prediction/forums/t/13950/our-perfect-submission
======
toxicFork
Can someone please enlighten me on the context? Assume I know nothing about
what Kaggle is, what the contest is about, and even what 'overfitting' means.

~~~
compbio
Kaggle hosts data science competitions. There is a public leaderboard and a
private leaderboard. During the competition only the public leaderboard is
visible. After the competition ends the private leaderboard is revealed and
your standing on this private leaderboard decides your final score.

The public leaderboard gives some feedback on your model's performance. But when
a human is in the feedback loop, there is a risk of overfitting. Overfitting can
be summed up as "memorizing the data", or "learning from noise, not signal".

When you overfit, you do well in cross-validation and may do well on the
public leaderboard, but your predictions do not generalize well to new data.
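
A ten-line illustration of "memorizing the data": fit a high-degree polynomial to a handful of noisy points and compare the training error with the error on fresh data (the degrees and noise level here are arbitrary choices, not anything from the competition):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Noisy observations of a simple linear trend."""
    x = rng.uniform(-1, 1, n)
    return x, 0.5 * x + rng.normal(scale=0.1, size=n)

x_train, y_train = sample(10)
x_test, y_test = sample(1000)

for degree in (1, 8):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")

# The degree-8 fit nearly memorizes the 10 training points, driving train MSE
# toward zero, but it typically does much worse than the straight line on new
# data: it has learned the noise.
```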

What this team did was submit a lot of predictions and keep only the ones that
improved their public leaderboard score. Then they'd add a little random noise
and submit again. They repeated this until they ranked No. 1, leaving quite a
few other competitors scratching their heads: How did they do this? Did they
find a perfect ML algorithm? Is there data leakage? Are they cheating?
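
In effect it's a random search driven entirely by public-leaderboard feedback. Here's a minimal Python simulation of the idea (the split sizes, noise scale, and metric are illustrative assumptions, not the team's actual procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 330 test rows, only 30 of which feed the public
# leaderboard (sizes are made up for illustration).
y_true = rng.normal(size=330)
public_idx = np.arange(30)           # rows scored on the public leaderboard
private_idx = np.arange(30, 330)     # rows scored only after the contest ends

def rmse(pred, idx):
    return np.sqrt(np.mean((pred[idx] - y_true[idx]) ** 2))

best = np.zeros(330)                 # an arbitrary first submission
best_public = rmse(best, public_idx)

for _ in range(1000):                # each iteration stands in for one submission
    candidate = best + rng.normal(scale=0.1, size=330)  # "add a little random noise"
    if rmse(candidate, public_idx) < best_public:       # keep only public improvements
        best = candidate
        best_public = rmse(best, public_idx)

print(f"public RMSE:  {best_public:.3f}")               # steadily shrinks
print(f"private RMSE: {rmse(best, private_idx):.3f}")   # no better than where it started
```

The public score can be driven arbitrarily low this way, while the private rows never receive any signal at all.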

When the private leaderboard was revealed, this team dropped around 2000
spots. Their good performance on the public leaderboard was purely artificial.
The contest was valid and well-organized (this could happen on any Kaggle
competition with little data). They did not receive a prize.

There are other benefits to ranking well on the public leaderboard: it helps
with teaming up with other high-ranking competitors, and you can market
yourself (recruiters are pretty interested in the top 10, even before the
competition has ended).

------
willis77
If you enjoyed this, don't miss Moritz Hardt's entertaining blog post on Wacky
Boosting (linked from the forum thread):

[http://blog.mrtz.org/2015/03/09/competition.html](http://blog.mrtz.org/2015/03/09/competition.html)
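
The "wacky boosting" trick from that post is simple enough to sketch: submit a batch of uniformly random labelings, keep the ones that happen to score above 50% on the public holdout, and take their coordinate-wise majority vote. A toy reconstruction (the holdout size and batch count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1000                                 # size of the hidden public holdout
truth = rng.integers(0, 2, size=n)       # hidden binary labels

def public_accuracy(labels):
    """The only feedback a submission gets back: one scalar."""
    return float(np.mean(labels == truth))

kept = []
for _ in range(300):                     # 300 throwaway submissions
    y = rng.integers(0, 2, size=n)
    if public_accuracy(y) > 0.5:         # lucky labelings are slightly correlated with truth
        kept.append(y)

vote = (np.mean(kept, axis=0) > 0.5).astype(int)
print(f"majority vote accuracy: {public_accuracy(vote):.3f}")  # well above 0.5
```

No feature of the data is ever examined; the gain comes purely from reusing the holdout's feedback.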

------
pieguy
IPSC 2014 problem E looks like a machine learning problem, but the intended
solution was to make multiple submissions and use the judge feedback to
reverse-engineer the test data. The winning team needed only 6 submissions to
hit 95% accuracy. It was one of my favourite problems from the contest.

[http://ipsc.ksp.sk/2014/real/problems/e.html](http://ipsc.ksp.sk/2014/real/problems/e.html)
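
Nowhere near as clever as the intended six-submission solution, but the oracle idea itself is easy to demonstrate: if the judge returns only a count of correct answers, flipping one answer per submission recovers everything exactly (a made-up toy judge, not the actual IPSC setup):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 20
secret = rng.integers(0, 2, size=n)      # the judge's hidden binary answers

def judge(guess):
    """One 'submission': the only feedback is how many answers matched."""
    return int(np.sum(guess == secret))

base = np.zeros(n, dtype=int)
base_score = judge(base)                 # submission 1: counts the zeros

recovered = base.copy()
for i in range(n):                       # n more submissions, one flip each
    probe = base.copy()
    probe[i] = 1
    if judge(probe) > base_score:        # score went up => the true answer is 1
        recovered[i] = 1

print("recovered everything:", bool(np.all(recovered == secret)))
```

Squeezing the same information out of six submissions instead of twenty-one is where the actual problem gets interesting.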

~~~
henrikschroder
What did they use the sixth submission for? You can win it in five
submissions, can't you? :-)

------
jerf
On the 4th page towards the bottom, Kaggle says they will host the team's
writeup "soon"; this was 26 days ago. I do not see it on the Kaggle blog
[1]... still coming? Or did I miss it? (I scanned over the titles for the
previous month and ran a search on BAYZ. Despite the topic I don't think the
"Microsoft Malware Winners' Interview: 1st place, "NO to overfitting!"" post
is related.)

[1]: [http://blog.kaggle.com/](http://blog.kaggle.com/)

~~~
hassiktir
Usually the writeup is just supplemental to the postings on the forum (there is
usually a 'post your solution' thread where people walk through their process
and sometimes post code), but I'm guessing this is one of the competitions
where it takes a month or so to gather all the interviews from the actual
winners as well as from BAYZ. It probably won't be too informative, since the
forum posts already go quite in depth about the process, and the linked blog
post shows an almost literal step-by-step of how to "game" Kaggle
leaderboards. Guarding against that gaming is a huge aspect of how these
competitions are run: it's why there is a private vs. public dataset, why
there are 'must enter by' deadlines, why people post benchmark code that will
beat 25% of the current leaderboard before the competition is over, and a lot
of other reasons besides.

Also, looking back at my submissions and notes from the competition (though I
stopped after the first week or so), I had even noted which 'groups' of rows
were most likely in the public leaderboard split. It becomes easy to tell by
comparing how your personal metric scores locally against how that score lands
on the leaderboard, since you know the evaluation metric for each competition.

------
raverbashing
Winning by overfitting. Is that the end for these contests?

Maybe they need a rule like "any score lower than X will be treated as a base
score", or to score other aspects of the solution (like complexity).

Or just have a lot of data for scoring purposes.

~~~
jerf
Here is the public leaderboard for the now-completed contest, where they are
in first place out of 2,257:
[https://www.kaggle.com/c/restaurant-revenue-prediction/leaderboard/public](https://www.kaggle.com/c/restaurant-revenue-prediction/leaderboard/public)

Here is the private leaderboard, which I believe represents the actual score
of the contest, where they are in place 1,962:
[https://www.kaggle.com/c/restaurant-revenue-prediction/leaderboard/private](https://www.kaggle.com/c/restaurant-revenue-prediction/leaderboard/private)

That's, ah, not in the prize range, to put it lightly.

There's no problem with the contests here. It's just a particularly vivid, and
entertaining to humans, demonstration of the well-known problem with
overfitting.

If I am reading this correctly, the green numbers to the left of the teams in
the private leaderboard show how many positions each team moved relative to
the public leaderboard. Note they're quite large... the 9th place team in the
final scores jumped 1,058 positions! That's nearly half the field. If this is
typical, and I have no idea (reply & let me know!), the public leaderboards
are basically a joke.
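
You can quantify the shake-up directly if you pull both boards into tables. A sketch, assuming you've scraped each leaderboard into a CSV with hypothetical "team" and "rank" columns:

```python
import pandas as pd

# Hypothetical file names and columns, scraped from the leaderboard pages.
public = pd.read_csv("public_lb.csv")     # columns: team, rank
private = pd.read_csv("private_lb.csv")   # columns: team, rank

merged = public.merge(private, on="team", suffixes=("_public", "_private"))
merged["climbed"] = merged["rank_public"] - merged["rank_private"]

# Spearman correlation near 1.0 => the public board was informative;
# near 0 => it really was basically a joke.
rho = merged["rank_public"].corr(merged["rank_private"], method="spearman")
print(f"Spearman rank correlation: {rho:.2f}")
print(merged.nlargest(10, "climbed")[["team", "rank_public", "rank_private", "climbed"]])
```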

~~~
raverbashing
So why have two leaderboards? Or does the private one have limited visibility
to contestants?

(There seem to be simple ways for Kaggle to solve this: put a delay on score
generation, limit the number of submissions, or limit the precision of scores
on the public leaderboard.)
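
The "limit precision" idea is essentially what Blum and Hardt later formalized as the Ladder mechanism: only update a team's reported score when a submission improves on their previous best by more than a fixed step, which starves the noise-and-resubmit strategy of feedback. A toy sketch (the step size is arbitrary):

```python
class LadderLeaderboard:
    """Toy version of the Blum-Hardt Ladder. `loss_fn` is the true loss on
    the hidden public split; lower is better."""

    def __init__(self, loss_fn, step=0.01):
        self.loss_fn = loss_fn
        self.step = step
        self.best = float("inf")

    def submit(self, predictions):
        loss = self.loss_fn(predictions)
        if loss < self.best - self.step:                     # meaningful improvement only
            self.best = round(loss / self.step) * self.step  # report at coarse precision
        return self.best               # otherwise just repeat the previous best score
```

A resubmission that improves by less than `step` gets the exact same number back, so the noise-climbing loop has nothing to climb.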

~~~
Zarel
I'm guessing that the private one isn't visible to contestants or the public
until the contest is over (hence "private").

The public score gives contestants an idea of how good they are, while
scoring the contest on a second, unseen dataset prevents gaming the system by
overfitting.
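
Concretely, the split is made once, up front, over the unlabeled test rows; contestants submit predictions for all of them without knowing which pool each row is in. A minimal sketch (the 30/70 ratio is just an example):

```python
import numpy as np

rng = np.random.default_rng(42)

n_test = 100_000                        # unlabeled test rows given to contestants
idx = rng.permutation(n_test)
cut = int(0.3 * n_test)

public_idx = idx[:cut]                  # scored live, shown on the public board
private_idx = idx[cut:]                 # scored once, revealed after the deadline
```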

------
nilkn
I thought a perfect result on test data via overfitting was trivial (just
memorize the data). What am I missing? (I've never done Kaggle and don't know
how it's set up at all.)

~~~
waqf
You're missing the fact that the leaderboard dataset is not made available to
the contestants, so they cannot memorize it.

The only way they can determine anything about the leaderboard dataset is by
submitting a model, at which point they are told a single scalar score for
that model.

