

How Kaggle Is Changing How We Work - antgoldbloom
http://www.theatlantic.com/technology/archive/2013/04/how-kaggle-is-changing-how-we-work/274908/

======
rm999
Kaggle is a great idea; I actually entered the industry partly through a data
mining contest (pre-kaggle). But I think the article overstates Kaggle's
influence on industry, at least at this point in time.

>the Kaggle ranking has become an essential metric in the world of data
science. Employers like American Express and the New York Times have begun
listing a Kaggle rank as an essential qualification in their help wanted ads
for data scientists

No, it hasn't, and the nytimes job posting they link to doesn't list it as an
"essential" qualification (they didn't link to the amex posting). I know many
people with experience in the data science space and very few of us have taken
part in a Kaggle competition. It's not that there is anything wrong with
Kaggle; it's that the pay is low relative to the effort required to
differentiate yourself.
Many modeling competitions I've seen require an inefficient use of time in the
"diminishing returns" part of the process, which means winning requires a lot
of free time. I worked with a guy who won a couple prominent data modeling
competitions, and frankly I thought he was a mediocre data scientist (but a
very hard worker).

I sometimes wonder who takes part in the competitions; and then I remember
myself from six years ago, applying for jobs and looking for a way to make my
resume stand out.

~~~
achompas
Agreed. In conversations with data scientists looking to hire, we've never
discussed Kaggle aside from "wow they reeeeeally focus on that evaluation
metric."

In fact, our conversations quickly turn to the Netflix Prize, where the
first-place team won with an algorithm that could not be ported to
production, and to how poorly these competitions map to reality.

None of the data scientists I know hire based on Kaggle score. Several don't
even think positively of Kaggle.

~~~
tocomment
Why couldn't it be ported to production?

~~~
textminer
If I recall, BellKor made many, many models based on Gradient Boosted Decision
Trees, Restricted Boltzmann Machines, and kNN. They tried many different
feature subsets, added temporal weighting, and tried many reduced-
dimensionality representations (SVD, NMF). They then stacked them all together
into one final ensemble whose RMSE beat everyone else's on a hidden validation
set.
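The stacking idea described above can be sketched in a few lines. This is a toy illustration using scikit-learn on synthetic data, not BellKor's actual Netflix Prize pipeline; the base models and parameters are placeholders chosen to mirror two of the model families mentioned (boosted trees, kNN).

```python
# Toy sketch of stacking: several base models are combined by a
# second-level "blender" fit on their predictions.
# Illustrative only; not BellKor's actual code.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data stands in for the ratings matrix.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two base learners; a linear model blends their predictions.
stack = StackingRegressor(
    estimators=[
        ("gbdt", GradientBoostingRegressor(random_state=0)),
        ("knn", KNeighborsRegressor(n_neighbors=10)),
    ],
    final_estimator=Ridge(),
)
stack.fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, stack.predict(X_te)) ** 0.5
print(f"stacked RMSE: {rmse:.2f}")
```

The real ensemble reportedly blended hundreds of such models, which is exactly why it was so expensive to productionize.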

In a production environment, this is probably an insane amount of
transformation, feature extraction, and classification for marginal gains in
precision (as defined here). But I'm only a year or two into building
production-environment classifiers, and nothing at Netflix's scale
(though not tiny either -- it is a problem if I can't do feature extraction
and high-precision/-recall* classification within a few milliseconds).

* - mid 90s, for a hard NLP/social graph problem.

~~~
willis77
It's funny to me that people are so quick to poo-poo the complicated modeling
done for the Netflix prize. When did production-worthiness become the only
important thing? It's like saying Watson was useless because it can't play a
concurrent game of Jeopardy with thousands of people on the web.

Just like in research, it turns out that relaxing real-world constraints on a
problem is often a great way to make progress. I would not have to search long
or far to provide much worse uses of million-dollar grants/projects/big-data-
software.

~~~
textminer
Don't mean to poo-poo. I love scrambling for models and methods. In the latest
problem I worked on, I did the same thing. Implementing papers, trying to
avail myself of semi-supervised or reduced-dimension representations, tweaking
models and features every which way... it's illuminating work, and good things
come out of it.

But, in the end, companies like Netflix or where I work are looking for ways
to make X happen easier, better, and cheaper. Hopefully the smart papers then
go on a shelf or are easily Googleable, and the rest of us get to learn from
their efforts.

I can fit long-term and short-term goals in my brain, too, mister.

------
reader5000
I think Kaggle is great, but I don't really see "work full time for months
competing with 80,000 other people, in the hope that your data model turns
out 0.001% more accurate than everybody else's [which at some point becomes
largely a function of the initial seed on your RNG when you fit the model],
all for a single paycheck" as a workable model for work.
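The bracketed point about RNG seeds is easy to demonstrate: retraining the same model family on identical data with nothing changed but the seed yields slightly different test scores. A toy sketch with scikit-learn on synthetic data (the numbers are illustrative, not from any Kaggle contest):

```python
# Retrain the same model with different random seeds on identical data;
# the resulting accuracies differ slightly. On a tight leaderboard that
# spread can dwarf genuine modeling improvements.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Only the model's internal RNG seed varies across runs.
scores = [
    RandomForestClassifier(n_estimators=50, random_state=seed)
    .fit(X_tr, y_tr)
    .score(X_te, y_te)
    for seed in range(5)
]
print(f"accuracy spread across seeds: {max(scores) - min(scores):.4f}")
```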

~~~
rm999
Yes, you succinctly made the same argument I was trying to make. I think
Kaggle is pushing the industry to be more efficient, but in an ironically
inefficient way.

------
jph00
A lot of commenters here seem to have missed this: "The really disruptive
thing about Kaggle, though, comes through the company's new service, Kaggle
Connect. Here, Kaggle acts as a match-maker, where customers with a specific
problem can hire a specific data scientist well-suited to their problem;
candidates are drawn from the top tier of Kaggle participants: the top 1/2 of 1
percent, or about 500 data scientists."

The competitions are a good way to learn, practice, and get feedback on your
methods. Kaggle Connect is where you can make a good living while doing a
range of interesting work.

(I work for, and compete on, Kaggle.)

------
achompas
The article is sorely mistaken: a job posting that says "Kaggle participation
will give you an advantage" does not mean Kaggle provides "essential
metric[s]." That's just one example of the article overstating Kaggle's case.

Also love the irony of explaining the importance of a data science competition
host by citing Tom Friedman, the King Of Generalizing Anecdotes.

------
paulgb
It's interesting how differently the data science world has embraced this
model compared with designers, a vocal subset of whom scoff at any mention of
99designs. (I am currently competing in a Kaggle competition, and the
competitive factor is what makes it fun.)

That said, I doubt more than a handful of people could make a living off
competition winnings alone.

~~~
dbecker
Kaggle is pivoting, and they don't intend "making a living on Kaggle" to
involve contest winnings. Instead, they are offering a platform that
prospective employers can use to contract with highly ranked Kaggle members.

And those gigs pay hourly.

------
jack_trades
The cynic in me says, "How Kaggle wastes the time of many talented individuals
while enabling corporations to give the finger to their staff."

I should start providing prizes for the best start-up. I'll give you a $20k
prize and then turn around and sell it for $1M or more. You can put it on your
resume. Win win.

EDIT: Sure, it's good for various things, but it is so detached from reality
that it's a bit out in the thicket. The cynic in me just can't get over the
value handed over by competitors to the sponsors.

~~~
dj_axl
> The cynic in me just doesn't get over the value handed over by competitors
> to the sponsors

I would think you don't have to hand over your algorithm if you forgo the
prize money. Another way of looking at it, as alluded to by other posters:
the "winning" data model may not be the best, so there might be less value
handed over than you think.

------
stratosvoukel
I can't see how Kaggle is good for the data scientist (though I can certainly
see how it is good for the hiring companies). It seems like design spec work.
I find it horrendous that people do work for a company and most of them never
get paid for it. It is disgusting. The #1-ranked competitor on Kaggle,
according to the article, has only been paid 6 times. Let's put it into
perspective: imagine companies that wanted web developers stopped hiring them
and instead used a similar service, in which everyone would build a version of
the application, with only the frontend visible to other contestants to foster
competition. Then the company would choose the best one, and only the winning
team would get paid. This is capitalism at its worst, and long-term it is
severely harmful. In a modern, work-rights-aware society this should be
illegal (spec work, that is). At least our communities should criticize it,
in my opinion, and not embrace it.

------
axusgrad
There was a similar website, where customers posted a bounty for the cheapest
travel itinerary meeting certain conditions, and anyone could submit a plan
meeting those conditions, to try and win the bounty. Does anyone remember what
it was called?

~~~
todsul
Hi axusgrad, that's us, Flightfox. Like Kaggle and 99Designs, we're also
Australian.

In reference to a few comments here about the inefficiency of crowdsourcing,
we're moving more of the work to the back-end (i.e. after an "expert" is
guaranteed the bounty). Instead of awarding an expert on Flightfox, you now
"hire" an expert based on an initial "pitch".

So, instead of receiving work from let's say 99 experts, we're aiming to
provide great results from only 3 experts. To do this, we're working on
segmenting and profiling both customers and experts. Think Uber/Amazon rather
than eBay.

------
jmount
Kaggle is pretty important. But concentrating only on accuracy tuning
(ignoring data collection, data curation, and interviewing stakeholders to
find real business needs) in machine learning is like celebrating only
(premature) optimization in software engineering.

But don't get me wrong, there are lots of top-notch results in the contests.
It's just that they test only one facet of what is needed in a data scientist.

------
dbecker
I wish there were a similar site where participants collaborate rather than
compete to solve interesting data problems.

------
juskrey
A big number of data crunchers in a big game. I wonder how many of the top
scorers are themselves spurious -- random leaders, in effect -- given that
there are 85 thousand of them. Is Kaggle really a place for science, or a
gamble? You know, we have seen things like that in every prediction market,
e.g. finance.

------
danbmil99
I find irony in the cognitive dissonance betwixt these comments and these:
<https://news.ycombinator.com/item?id=5540841>

------
michaelochurch
I'm not anti-Kaggle, but this is not the future.

Credible people will do one or two competitions because they care about the
problem, and because they want to establish themselves well enough to get
better jobs, etc. If it works for them, great. If it doesn't, they'll get
bored and quit. In that case, the best people leave and you have a ghetto.

Right now, Kaggle makes sense because "data science" is still an ill-defined
field that a lot of people want to get into; since no one knows what it means
or what it takes to get in, people will try things out to see what happens.

If Kaggle wants to stay in play for the long term, they'll need to get
_really_ good at connecting talented people with very high-quality jobs.

There is something that I don't think all of the hiring-related startups get
yet: as things are, there's a real shortage of quality jobs, and that's a
5-year existential threat to the whole business model. What happens when
people realize that these sexy startup jobs are just corporate jobs with
better marketing? What happens when the dream dies? Right now, high-quality
jobs are too rare for the hiring startups (unless they genuinely change the
economy) to prevent people from getting just as disillusioned with these new
services as they are with headhunters. Now, that's not because there's an
intrinsic limit on interesting work (see: the Lump of Labor fallacy), but
we'd have to overthrow the management of a whole industry to change it.

Now, data science. It is attractive right now because it carries the promise
of what _software engineering_ was supposed to deliver but, for most people,
doesn't: interesting work, implicit respect, autonomy. I feel like "data
scientist" in many companies means "software engineer who gets dibs on the
most interesting projects". I'm afraid title inflation into the data science
field might dilute that, however.

What we really need is to fire 90+ percent of software managers and trust
engineers to pick their own projects and call some shots. I don't know how to
turn that into a specific startup idea, but it would solve a lot of problems.

