

Engineering Practices in Data Science - numlocked
http://blog.kaggle.com/2012/10/04/engineering-practices-in-data-science/

======
jboggan
Well I guess I'll pat myself on the back for always using git. Having seen a
lab's major university research project literally evaporate due to lack of
proper version control I've been keen on it ever since. Besides, I like the
incremental feel of accomplishment when I make the next commit.

I agree that setting up a dedicated pipeline early on in a project and
committing yourself to work within its confines aids organization, but it also
contributes to putting your creativity in the right place. We often enjoy
building things for the sake of building and those of us with stronger
proclivities towards engineering can sometimes get too jolly putting together
new pipes when we should be training models. I have been guilty in the past of
writing a Perl script that spawned shell scripts that queued Perl scripts on a
cluster that ran R scripts that piped back into Perl scripts. Then again I
always liked the boardgame Mouse Trap.

------
rm999
Cool article, thanks. I used to spend 90% of my downtime trying to improve my
stats and machine learning knowledge, but in the last couple years I've come
to realize how much my lack of proper engineering was hurting me.

A (flawed) analogy: if data science is storytelling, stats is the story and
engineering is the words you use to tell it. You need to do well at both to
effectively tell your story.

>...good engineering going out the window quickly with elaborate ensembling.

This is one of my criticisms of data mining contests (sorry kaggle!). When I
was in grad school I liked doing these contests to get practical experience -
my last company actually recruited me through one. But as they got more
popular I found the best engineers and data scientists stopped having as much
of a chance of winning. Good modelers get 90% of the way there and then are
beaten by impractical solutions that would get someone fired from a job.

~~~
numlocked
It's an interesting criticism and there are competition properties that we've
identified that can lead to overly complex models. An obvious one, for
instance, is running a competition for a very long period of time. A more
subtle factor is the degree to which the context of the data has been stripped
away - with very little context (just an anonymized feature matrix) it's more
likely that victory will go to someone who can squeeze every bit of
performance out of an ensemble of ML techniques. But with good context, it's
more likely that feature engineering and insight will lead to victory.

Many times it is in fact the well-organized, well-engineered solutions that
win. And they are often elegant and intuitive because the data scientist could
iterate quickly and try out lots of ideas. Though the crazy ensemble does
sometimes eke out a victory, it's not an inherent property of competitions by
any means.
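For readers unfamiliar with the term: "ensembling" just means blending the predictions of several models. A minimal sketch of the simplest version, a weighted average (the stand-in models and weights here are purely illustrative, not taken from any particular competition entry):

```python
# Weighted-average ensembling, the simplest form of model blending.
# The "models" below are stand-in functions; a contest ensemble might
# blend dozens of real trained models (GBMs, neural nets, etc.) this way.

def model_a(x):
    return 0.8 * x          # stand-in for one model's prediction

def model_b(x):
    return 0.9 * x + 0.1    # stand-in for another model's prediction

def ensemble(x, weights=(0.5, 0.5)):
    """Weighted average of the individual models' predictions."""
    preds = (model_a(x), model_b(x))
    return sum(w * p for w, p in zip(weights, preds))

print(ensemble(1.0))  # blends 0.8 and 1.0 -> 0.9
```

The "elaborate" ensembles criticized above take this idea much further, stacking many heterogeneous models and tuning the blend for tiny leaderboard gains.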

One final note: I'm not sure if this is widely known, but impractical
solutions are actually just fine for a certain set of competitions. Some
competitions are designed to identify the "data frontier" - the limit of
what's possible in a dataset - rather than getting a model into production.

------
numlocked
It appears our blog is down. Oof. Apologies for the inconvenience. While we're
figuring it out, here is a mirror of the post:
http://blog.untrod.com/2012/10/engineering-practices-in-data-science.html

Edit: And we're back up.

------
peatmoss
I love definitions that wax my ego! I grok revision control, and can sling
enough R / stats to officially make me a nerd of an urban planner. Does this
mean I can start calling myself an "urban data scientist"? I definitely need a
pay raise.

