
Crowdsourced Data mining, forecasting and bioinformatics via competitions - pitdesi
http://www.kaggle.com/
======
mwexler
This stuff is great. The rewards/prizes for the best model are so minimal
compared to what it usually costs to build a great model via a consulting
contract or hiring a high-quality data miner.

Similar to Mechanical Turk, we've managed to create a completely different
value structure for some amazing work by smart folks... mostly by making it a
competition. Great exposure for winners, sure, but these prizes are pretty
minimal.

<http://www.kaggle.com/c/GiveMeSomeCredit>, for example, has a total basket of
US$5K (only US$3K for first place) for a model predicting credit scores (in
this case, likelihood to default or have financial distress). Folks I talk to
who do this type of work professionally tend to charge far more than that to
create these models.

For the company sharing this data, of course, big win: They get a cheap,
potentially fantastic new model, and the creator gets some good exposure and
some cash. But if these take off, they can really change the economics of how
this work is created.

~~~
Vandy_Travis
You are right -- there are some neat aspects to this model.

However, it also tends to devalue the work invested by the analysts. They are
doing the work for essentially a lottery ticket -- winner take all. That's the
reason it can be so cost-effective for the company running the competition
-- it doesn't absorb the costs of the failures (or less optimal approaches).

Those costs have to be absorbed by someone. In this model, they are eaten by
the analysts, who (generally) don't have enough resources to cover those
costs. Due to that, I don't think this model is sustainable or will really
catch on.

OTOH, in some (ideal) academic endeavors, having multiple groups compete for
more funding or for a prize has certainly benefitted the sciences. In that
case, however, the competition was more friendly than zero sum. Also, I
believe the different sides tend to share information more than hoard it,
yielding lessons learned to the entire group from one team's failures.

~~~
mwexler
That's fair. I originally had the word "devalue" in the comment, but I pulled
it out. Why? No one is forcing the analyst to participate, but instead, they
absorb the costs to get exposure or growth (or fun). Whether that's a fair
tradeoff is up to each person, but given the number of folks participating in
Netflix competitions, these, and others out there, a group of smart, analytic
types feel the tradeoff is worth it.

~~~
_delirium
For the Netflix competition in particular, a lot of the entrants were being
"paid" in research papers, grants, and academic salary, because it was a
high-profile competition in a theoretically not-entirely-settled area
(collaborative filtering), so even many non-winning entries could get
published papers out of it. I don't think that can be infinitely replicated,
in part because once there are dozens of contests, it's less likely any one
will be as high profile, and in part because not every ML contest is as
academically interesting (the Netflix setting was sort of "weird" from the
perspective of traditional statistical models, not being a straightforward
regression problem, whereas some of these are pretty straightforward).

------
ap22213
How does Intellectual Property work in competitions like these? Are the
entrants allowed to use proprietary methods? Do they give up IP by
participating? Do the hosts of competitions gain any rights to IP or its
usage?

It's not immediately clear from skimming the legal terms of service. I couldn't
find a FAQ.

~~~
moserware
See section 9 of our terms: <http://www.kaggle.com/pages/terms>

In general, the competition host gets a license to the winning algorithm in
exchange for a winner accepting the prize money.

We also have private competitions that do not appear to you unless you're
invited. In these competitions hosts can offer appearance fees just for
participating as well as access to more restricted data.

------
zeratul
Would "kaggle" be able to handle patient data? Would "kaggle" sign data use
agreements with hospitals that are interested in a shared task? There is a
growing number of medical data mining competitions, e.g.:

<https://www.i2b2.org/NLP/Coreference/PreviousChallenges.php>

But the data mining challenge delivery systems in medicine are scattered,
mostly because of the inability to create a secure and centralized web service.

------
stfu
Very interesting project! Can anyone recommend some good data mining for
dummies tutorials/books/etc?

~~~
moserware
I recommend the video at
<http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/>

I gave more recommendations at
<http://stackoverflow.com/questions/598726/overwhelmed-by-machine-learning-is-there-an-ml101-book/598772#598772>

(Disclaimer: I work at Kaggle)

EDIT: Fixed second link

~~~
stfu
Sweet, thanks! I would very much appreciate it if you guys could somehow
motivate the great data mining talent on your site to produce tutorials. I
would even be willing to pay for a training course there, because traditional
tutorial providers (Lynda etc) have apparently not yet discovered this as a
potential subject.

~~~
moserware
It's something we're interested in doing. I think Jeremy's video (that I
linked to first) is a good start.

In addition, by working in the field, I think I'm in a better position to
write blog tutorials like my TrueSkill one:
<http://www.moserware.com/2010/03/computing-your-skill.html>

It's our hope that we can encourage a lot of our community to post insights
into their approaches. That's what we already do on our blog with our "How I
Did It" posts like
<http://blog.kaggle.com/2011/10/19/deceitful-beast-william-cukierski/#more-1341>

~~~
boneheadmed
This is great. I'm a practicing medical doctor and a coder. My friend and I
are going to give the healthcare one a go. If nothing else, it's good fun and
we'll learn more about the real-world application of data mining to medicine.
Thanks for the links.

