
Amateurs beat specialists in data-prediction competitions - ivoflipse
http://www.slate.com/articles/health_and_science/new_scientist/2012/12/kaggle_president_jeremy_howard_amateurs_beat_specialists_in_data_prediction.html
======
micro_cam
I am a non-biologist applying random forests and other methods to large
genetic studies and I think it is unfair to discount the importance of
specialists.

I'm not really surprised that domain knowledge doesn't predict success in
these contests, because the data is already featurized and sanitized to
remove any features that could be used to "cheat."

Random forests work really well and will quickly pick up on any features
they can use to build a model that is accurate on the training data but
useless in the real world. This includes unique identifiers and features
that are correlated for the wrong reason.

For example, if you are an e-commerce site with a tiered shipping cost
scheme and you turn a random forest loose on your raw database, it will
think shipping cost predicts purchase price. It will also think order id is
an excellent predictor of purchase price, since every unique order id maps
to exactly one cost.

The role of the domain expert is to determine which features can be fed into
these black-box methods, to recognize when the model is overfitting, and to
re-featurize the data. E.g., they might replace the "shipping cost" feature
with a boolean "discounted shipping promotion" feature and see if it still
affects total purchase.
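
As a toy illustration of that failure mode, here is a minimal sketch using
scikit-learn (the prices and shipping tiers are invented):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    purchase_price = rng.gamma(2.0, 40.0, size=5000)

    # Tiered shipping: the cost is *derived from* the price, so it
    # leaks the target straight into the features.
    shipping_cost = np.select(
        [purchase_price < 25, purchase_price < 100],
        [5.0, 10.0],
        default=0.0,  # free shipping on large orders
    )

    forest = RandomForestRegressor(n_estimators=100, random_state=0)
    r2 = cross_val_score(forest, shipping_cost.reshape(-1, 1),
                         purchase_price, cv=5).mean()
    print(r2)  # high R^2, but the "predictor" is an effect of the
               # target, not a cause -- useless for decisions in production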

~~~
jph00
Excellent point. This is called "data leakage". Analysis of this topic won the
2011 KDD best paper award:
<http://staging.m6d.com/wp-content/uploads/research_Leakage-in-Data-Mining-Formulation-Detection-and-Avoidance.pdf>

~~~
micro_cam
Thanks. I don't suppose you can talk about how you (kaggle) eliminate leakage
in practice?

I would imagine something based on repeated application of random forest?

------
jph00
I'm the Kaggle President & Chief Scientist who is interviewed in this
article. Feel free to ask me any questions you have about the role of domain
experts in data-driven decision making (or any other relevant topic!)

(It didn't occur to me to submit the article to HN - silly me! Thanks
ivoflipse for doing so.)

~~~
pgroves
If I create 1000 predictive models with random parameter settings, some of
those models will outperform others. In fact, the accuracy rankings of the
models on a single hold-out validation set will form their own distribution.
It is difficult, if not impossible, to tell whether one model outranks
another because it is actually better or because it merely occupies a
different place in that distribution due to the random properties of the
hold-out set.

So, why do you believe this isn't happening when hundreds of people submit
their models to be scored against a secret hold-out set? Is there any evidence
that the hundreds of entries are forming a better distribution of accuracy
scores than random models? Is there evidence the winners of these competitions
truly have better models and don't just have a luckier placement in a random
distribution of accuracy scores?
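
Here's a quick simulation of what I mean: every "model" below is identical,
so any spread in hold-out scores is pure sampling noise.

    import numpy as np

    rng = np.random.default_rng(0)
    n_models, n_holdout = 1000, 500

    # Each model guesses correctly with the same 70% probability;
    # differences in hold-out accuracy are chance alone.
    correct = rng.random((n_models, n_holdout)) < 0.70
    scores = correct.mean(axis=1)

    print("true accuracy:", 0.700)
    print("best of 1000 :", scores.max())  # typically ~0.76
    print("worst of 1000:", scores.min())  # typically ~0.64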

~~~
jph00
Yes, we can calculate the distribution and find the likelihood that a given
team would beat another if the data were resampled. Here's a detailed
example: <http://www.tiberius.biz/pakdd07.html>

The size of the improvements seen in competitions vs. the previous
best-performing models (e.g. the Allstate competition showed a 270%
improvement) is well beyond what chance can explain. Also: the test set in
most competitions is hundreds of thousands of rows, so sampling error is
very small in these cases.
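
Conceptually, the resampling check is something like this bootstrap sketch
(the per-row losses are made up; this isn't our actual pipeline):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000  # on the order of a real competition test set

    # Hypothetical per-row losses for two teams; team B slightly worse.
    loss_a = rng.gamma(2.0, 1.00, size=n)
    loss_b = rng.gamma(2.0, 1.01, size=n)

    rounds, wins = 500, 0
    for _ in range(rounds):
        idx = rng.integers(0, n, size=n)  # resample the test set
        wins += loss_a[idx].mean() < loss_b[idx].mean()

    # The sampling error of a mean shrinks like 1/sqrt(n), so on a test
    # set this large even a small true gap shows up consistently.
    print(wins / rounds)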

~~~
pgroves
I still find it hard to believe the people in the top 10 aren't all at the
theoretical limit of predictive accuracy with their final order being
determined by nuances in the experimental design (things like how the raw data
was collected).

That would be the sort of thing statisticians live with all the time, except
that Kaggle then makes it winner-take-all.

~~~
jph00
That certainly does happen - indeed, our tagline is "making data science a
sport"... As in all sports, there can be a stochastic element to who ends up
winning!

However, in private competitions (which is where the money is!) everyone
selected to participate wins something. It's more like golf - make the cut,
and you'll definitely make a few dollars, but the winner will take home the
biggest prize.

------
DougBTX
Title is misleading; it should say "Data-prediction specialists beat
specialists from other fields at data prediction".

~~~
ivoflipse
My bad, I used the Hacker News submit bookmarklet, which probably picked out
a quote rather than the title of the article. I should have paid more
attention, but now it's been too long for me to edit it.

------
zwass
This immediately makes me think of de Moivre's equation, which I just learned
about from another HN article on the front page, "The Most Dangerous
Equation" (<http://news.ycombinator.com/item?id=4893258>).

------
damian2000
How different or similar is this to the wisdom of crowds
(<http://en.wikipedia.org/wiki/The_Wisdom_of_Crowds>), or those online
decision markets which have proved quite successful
(<http://en.wikipedia.org/wiki/The_Wisdom_of_Crowds#Prediction_markets>)?
Seems related in some ways.

~~~
balakk
It's pretty different; sites like Kaggle focus on algorithmic approaches to
data prediction. There's one winner for each contest; the person who applies
the best algorithm and model wins.

~~~
iandanforth
I think the parent was asking for a comparison of Random Forests to the
Wisdom of the Crowds, which isn't a totally bad comparison. Every ensemble
technique needs a way to deal with the many votes from its members: you
could average their answers (as the anecdote in the WoC Wikipedia article
does), use a first-past-the-post voting scheme, or use the historical
success of a subset of models to pick which models should be listened to.
So yes, there are similarities, but the specifics are quite tricky :)
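
For concreteness, a sketch of those three combination rules on an invented
binary ensemble:

    import numpy as np

    rng = np.random.default_rng(0)
    votes = rng.integers(0, 2, size=(5, 100))  # 5 models x 100 predictions
    history = np.array([0.9, 0.6, 0.55, 0.52, 0.51])  # past accuracies

    # 1. Average the answers (also works for probabilities/regression)
    averaged = votes.mean(axis=0) > 0.5

    # 2. First-past-the-post: simple majority vote
    majority = votes.sum(axis=0) * 2 > votes.shape[0]

    # 3. Weight each model's vote by its historical success
    weighted = history @ votes > history.sum() / 2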

~~~
jph00
That's a very insightful connection. Both approaches can be considered
"ensembles of weak learners". However, the wisdom of the crowds is more
likely to suffer from systematic bias.

------
im3w1l
Is this a failure of Bayesianism in practice? The people with the best
priors still lost.

~~~
stochastician
It's not clear that the people with the best priors lost -- the people with a
great deal of prior information (the specialists) are also often still using
SAS/etc. and linear regression / classical neural nets / etc. Many of the
people doing Predictive Analytics (A term I hate) in large organizations have
spent 20 years gaining domain knowledge and not keeping up on state-of-the-art
methods (there are only so many hours in the day). I mean, hell, Andrew
Gelman's multi-level modeling book is considered pretty advanced to this day,
which is hard to understand as a machine learning person.

The challenge with black-box models is "how to extend them" -- reasons I still
have a great deal of faith in the power of Bayesian methods and the utility of
joint inference. I think modern methods have really commoditized the predict-
y-from-x problem, but there's a lot more to it than that.

The future, in my opinion, is letting you specify more, and richer, prior
knowledge to solve more interesting problems. That's why I work on Bayesian
nonparametrics and am excited about probabilistic programming:
<http://probabilistic-programming.org/wiki/NIPS*2012_Workshop>

------
JanneVee
It is actually scary when you consider that Wall Street analysts would be
regarded as specialists in economic forecasting and prediction.

