
Kaggle Post-Mortem: The dangers of overfitting - rouli
http://www.gregorypark.org/?p=359
======
tel
Kaggle is doing ML wrong. It has the opportunity to teach some people about
these perils, this blog post being the beginning of such a realization, but I
feel terrible for anyone with a hard problem who invests in a funded Kaggle
competition.

Black-box machine learning is unlikely to improve a meaningful metric beyond a
stock SVM or random forest. It's possible to tune an algorithm to achieve
arbitrarily good training-set performance, and even to overfit the small
"test set" evaluation if you query it often enough, but this does not make a
generalizable, practical tool.

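To make the training-set half of that claim concrete, here is a minimal
sketch (my own illustration with scikit-learn, assuming deliberately
pure-noise data): an overcapacity model scores perfectly on what it memorized
and at chance on anything fresh.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.RandomState(0)
    # Pure-noise data: the features carry no information about the labels,
    # so no model can genuinely beat 50% accuracy out of sample.
    X_train, y_train = rng.randn(200, 50), rng.randint(0, 2, 200)
    X_test, y_test = rng.randn(2000, 50), rng.randint(0, 2, 2000)

    model = DecisionTreeClassifier()  # unlimited depth: enough capacity to memorize
    model.fit(X_train, y_train)
    print(model.score(X_train, y_train))  # 1.0: perfect training accuracy
    print(model.score(X_test, y_test))    # ~0.5: chance level on fresh data
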
Opportunities might exist if you can bring domain expertise to feature
engineering and keep lowering the achievable Bayes error. Kaggle is designed
not to allow for this, though.

~~~
feral
Could you explain what you mean by it being 'designed not to allow for this'?
That seems a very sweeping criticism.

I took part in the last essay scoring competition on Kaggle.

In that competition, the data set was the original text of the essays, scored
by human raters. A lot of the effort in our submission went into developing
features. We didn't spend much time on our entry, but we did use domain-
specific features: for example, we wrote code to calculate the Gunning fog
index (<http://en.wikipedia.org/wiki/Gunning_fog_index>), implemented spell
checking, etc.
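
For reference, the index is 0.4 * (average sentence length + percentage of
words with three or more syllables). Roughly the kind of feature code
involved (a simplified sketch; the real index has extra rules about proper
nouns and common suffixes, and this syllable counter is a crude vowel-group
heuristic):

    import re

    def count_syllables(word):
        # Crude heuristic: count runs of consecutive vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def gunning_fog(text):
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        complex_words = [w for w in words if count_syllables(w) >= 3]
        return 0.4 * (len(words) / len(sentences)
                      + 100.0 * len(complex_words) / len(words))

    print(gunning_fog("The quick brown fox jumps over the lazy dog."))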

That entire competition took place within the Kaggle framework.

Furthermore, in addition to the training-set (which we did most of our
development with, using cross validation) and the test-set, on which the final
scores were calculated, there was also a validation-set, from which the public
scoreboard was calculated. We could only submit our predictions for the
validation set twice a day, and were told only our overall accuracy on the set
-- not given the scores of each essay.

The role of the validation set in Kaggle's framework is surely to prevent
errors such as overfitting. In fact, we had a situation where our cross-
validation accuracy increased while our validation-set accuracy decreased,
because of a bug: our feature selection wasn't performed inside the cross-
validation loop (similar to the linked article).

The difference between the cross-validation and validation-set scores drew our
attention to this problem.
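
For anyone unfamiliar with this pitfall: if you select features on the whole
training set and only then cross-validate, the held-out folds have already
leaked into the selection, and the CV score becomes meaningless. The fix is
to refit the selection inside every fold. A sketch of both versions with
scikit-learn (the competition code itself was different; this is just the
shape of the bug):

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.RandomState(0)
    X, y = rng.randn(100, 1000), rng.randint(0, 2, 100)  # pure noise

    # WRONG: selection sees all labels, including the folds held out later.
    X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
    print(cross_val_score(LogisticRegression(), X_sel, y).mean())  # optimistically high

    # RIGHT: selection is refit within each training fold only.
    pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
    print(cross_val_score(pipe, X, y).mean())  # ~0.5, honest chance level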

That all happened within the Kaggle framework.

So I don't really understand criticisms as general as 'Kaggle is doing ML
wrong'.

~~~
shoo
I think that the Kaggle competition you describe has essentially the same
structure as the Kaggle competition outlined in the blog post, and is
susceptible to the same potential hazard of overfitting to the validation
score.

Here, by validation score I mean the combination of your own cross-validation
procedure on the training data and the validation score that Kaggle
calculates and displays on the leaderboard while the competition is running.

I imagine that if the essay scoring competition had been extended to run for
another 10 years before the final scores on held-back test data were
reported, we would see a similar trend of overfitting in most if not all of
the entries.

edit: I might be underestimating the sophistication of your approach or the
approaches of others.

At least in the Kaggle competitions I have personally entered, I have used my
own cross-validation scores and the Kaggle leaderboard scores to tune my
approaches, without properly taking into account that this breaks the validity
of the subsequent validation procedures, since I have no way of validating my
tuning.
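
A small simulation of that effect (my own construction, not from the
article): hundreds of submissions that are pure random guessing, scored on a
small public validation set, with the best public score kept. The "winner"
looks well above chance on the leaderboard and is exactly at chance on a
large test set:

    import numpy as np

    rng = np.random.RandomState(0)
    n_valid, n_test, n_submissions = 200, 20000, 500
    y_valid = rng.randint(0, 2, n_valid)
    y_test = rng.randint(0, 2, n_test)

    best_valid, best_test = 0.0, 0.0
    for _ in range(n_submissions):
        # Each "submission" guesses labels at random -- no real signal.
        valid_acc = (rng.randint(0, 2, n_valid) == y_valid).mean()
        test_acc = (rng.randint(0, 2, n_test) == y_test).mean()
        if valid_acc > best_valid:
            best_valid, best_test = valid_acc, test_acc

    print(best_valid)  # ~0.60: looks clearly better than chance
    print(best_test)   # ~0.50: the apparent edge was leaderboard noise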

~~~
feral
Yes - even if you are only allowed to look at the validation set twice a day,
you can still overfit it, given sufficient days.

But that's hardly a criticism of Kaggle.

It's the responsibility of the person building the ML model to take steps to
avoid overfitting. Even if competitors fail to take these steps, the fact
that Kaggle has a built-in validation set helps avoid overfitting.

Yes, with enough time it would still be possible to overfit, but I don't see
how that is Kaggle's fault.

If anything, they've built a framework that discourages overfitting, not one
that encourages it.

------
carlob
And the irony of it is that the author is warning us against the dangers of
overfitting while overfitting the same data that proves his point.

I'm referring to the blue lines in the scatterplots of rank vs. number of
submissions: those unidentified curves should have been straight lines.
What's the point of fitting some high-degree polynomial there?
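
The trap in miniature (a toy illustration with numpy, not the article's
data): on a genuinely linear trend plus noise, the high-degree fit always
achieves smaller residuals, precisely because it is chasing the noise:

    import numpy as np

    rng = np.random.RandomState(0)
    x = np.linspace(0, 1, 30)
    y = 2 * x + rng.randn(30) * 0.3  # truly linear trend plus noise

    for degree in (1, 10):
        coeffs = np.polyfit(x, y, degree)
        residuals = y - np.polyval(coeffs, x)
        print(degree, residuals.std())  # degree 10 "fits" better -- by fitting noise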

------
gbog
I'm reading _Thinking, Fast and Slow_ right now, and it resonates strangely
with this article.

For those not in the know, the book argues that most of our reasoning is
wrong most of the time, shows that people vote based on facial features, that
success is luck, that the world is out of our control, and that experts are
worse forecasters than monkeys randomly picking options -- and the better the
expert, the worse the forecast. Depressing.

Edit: one reassuring thing is that it explains rationally how random people,
moderately clever and sometimes plain despicable, can become the heads of
very successful companies.

~~~
jacques_chester
I second the recommendation for _Thinking, Fast and Slow_.

> that experts are worse forecasters than monkeys randomly picking options

For a more nuanced study on this, see _Expert Political Judgment: How Good Is
It? How Can We Know?_ by Philip Tetlock.

If you aggregate expert opinions then yes, they barely beat monkeys. Happily
they outperform aggregated non-expert opinions.

However you can classify _how_ experts reach conclusions and find that some
categories of thinkers (foxes) will consistently outperform other categories
(hedgehogs).

Mind you -- humans and monkeys both are absolutely _trounced_ by statistical
models.

~~~
cbsmith
Can you briefly summarize what distinguishes foxes from hedgehogs?

~~~
jacques_chester
It comes from an essay by Isaiah Berlin, _The Hedgehog and the Fox_. Hedgehogs
know "one big thing" and Foxes know "lots of little things".

In the expert prediction study, experts who had a universal framework that
they extended rigidly to every problem performed poorly on both calibration
(the correlation between stated confidence and actual outcome frequencies)
and discrimination (the correlation between more/less/same predictions and
actual outcomes).

Foxes tended to throw together a bunch of thoughts, including from
incompatible theories, expressed a lot of "on the one hand, and on the other"
sort of thinking, and qualified most of their predictions. They did much
better on calibration and somewhat better on discrimination.
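
Operationally, calibration is easy to measure (a sketch of the general idea,
not Tetlock's exact procedure): bucket predictions by stated confidence and
compare each bucket's average confidence to the fraction of its predictions
that came true.

    import numpy as np

    def calibration_table(confidence, outcome, n_bins=5):
        # confidence: stated probabilities in [0, 1]; outcome: 0/1 results.
        confidence = np.asarray(confidence)
        outcome = np.asarray(outcome, dtype=float)
        bins = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
        for b in range(n_bins):
            mask = bins == b
            if mask.any():
                # Well-calibrated forecasts make these two columns track.
                print(confidence[mask].mean(), outcome[mask].mean())

    rng = np.random.RandomState(0)
    stated = rng.uniform(size=1000)
    came_true = rng.uniform(size=1000) < stated  # a perfectly calibrated forecaster
    calibration_table(stated, came_true)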

Basically, Hedgehogs tend to make more extreme predictions ("Russia will fall
into civil war by 1997") more confidently ("it's almost certain") than Foxes.
But statistically, about 40% of the time the variables in the complex systems
being forecast actually stayed within a band, moving decisively neither up
nor down.

I know that you're probably thinking of holes and objections to the study
already -- read the book. Tetlock and his collaborators were astonishingly
thorough in trying to deal with all the relevant arguments and
counterarguments with actual data.

~~~
cbsmith
> I know that you're probably thinking of holes and objections to the study
> already

Quite the contrary. Your description of the differences is clear and the
outcomes from the study fit well with a similar meme in machine learning
around ensemble methods.

------
dvse
Not surprising at all - Kaggle (as currently implemented) is a fundamentally
broken model. On top of the rather unpleasant "everybody pays" (all-pay)
auction or "winner takes all" system, they have a severe problem with
metrics: the majority of the datasets are nowhere near large enough to give
stable out-of-sample error estimates, which means that in many cases the
"winners" are barely better than random.
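
Back-of-the-envelope support for that: the standard error of an accuracy
estimate on n held-out examples is sqrt(p * (1 - p) / n), so a 1,000-example
leaderboard carries over a full percentage point of pure noise -- plenty to
shuffle a tightly packed top ten (illustrative numbers, not any specific
competition):

    import math

    def accuracy_se(p, n):
        # Standard error of an accuracy estimate from n i.i.d. examples.
        return math.sqrt(p * (1 - p) / n)

    for n in (100, 1000, 10000):
        print(n, accuracy_se(0.8, n))  # 0.040, 0.013, 0.004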

Perhaps they are onto something with "Kaggle Prospect", but unless they pivot
in some creative new direction, it's hard to see the service being very
useful.

~~~
antgoldbloom
(Disclaimer: I work at Kaggle.)

We're actually using public competitions as a way to find out who some of the
world's strongest data scientists are. Once somebody has performed well in
several public competitions, we start inviting them to private competitions
where: 1. we invite 15 members; 2. the prize money is generally much higher;
and 3. everyone wins something (the higher the position, the more one wins).

We're aiming to give many of the world's data scientists the opportunity to
earn great full-time incomes by competing in our competitions.

~~~
dvse
Thanks for the reply. If you are already running these private competitions,
it might be useful to advertise them a bit more openly (though then you will
need some formal criteria for joining, to avoid user resentment -- e.g. open
to those in the top 100). You are right that the system as it is can help
find people who have some combination of persistence, general intuition for
dealing with data, and a modicum of modelling skills -- "the world's best
data scientists" is doubtful, but it certainly beats interviewing.

------
jboggan
I've been competing in the Facebook competition and submitting twice a day.
I've managed to climb into the 90th percentile of scores, but I wonder how
much the final leaderboard will resemble the daily leaderboard on Tuesday,
when the competition closes. I'm really looking forward to a post-mortem of
this particular problem, since the data was basically featureless and the
most successful approaches seem so similar.

~~~
rouli
How did it go for you? I finished in a disappointing 12th place...

------
stupidhed
From a business perspective, it doesn't matter whether Kaggle is flawed. What
matters is whether they can convince customers to buy into their ideas and
fund their business.

In other words, more snake oil from the computer industry. In some (most?)
cases, the salesmen may even believe that the oil works. That is, they may
not be acting fraudulently, just foolishly (as are their clients who believe
the hype).

This is overconfidence in math and computers to solve problems that are not
suited to being solved this way. In some ways it is partly responsible for
the global financial mess: reliance on models that quants and their bosses
defend vigorously, because the models make money for the bosses (because
clients believe the hype), but which are, in the long run, not sound.

~~~
tensor
Overconfidence in flawed human intuition is equally dangerous, if not more
so. Math and algorithms done correctly are unbiased, unlike humans.
Economics, unlike much of machine learning and computer science, is often
built on very loose foundational assumptions. Hence the flaws.

Science and math have varying levels of rigor depending on the domain. On the
other hand, not using them has a consistent outcome: guessing and ignorance.

------
chris_wot
Why even look at these scoreboards?

~~~
shoo
Assume that the validation data is similar to the test data. Then, by
performing experiments (making submissions) and recording measurements
(analysing the returned leaderboard score) you can attempt to infer something
about the structure of the validation data, and make more accurate predictions
about the test data. You can probably infer quite a bit of useful information
by looking at other people's scores too, even if you don't have access to
their submitted predictions.
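
A classic concrete case (a well-known leaderboard-probing trick, sketched
here under the assumption that the metric is log loss): submit the same
constant probability p for every row, and the returned score pins down the
validation set's positive rate exactly:

    import math

    def infer_positive_rate(leaderboard_loss, p):
        # Recover the positive rate q from the log loss of a constant
        # submission p (any p other than 0.5 works).
        return ((leaderboard_loss + math.log(1 - p))
                / (math.log(1 - p) - math.log(p)))

    # Simulate a validation set that is 37% positive, probed with p = 0.3.
    q_true, p = 0.37, 0.3
    loss = -(q_true * math.log(p) + (1 - q_true) * math.log(1 - p))
    print(infer_positive_rate(loss, p))  # 0.37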

