Engineering Practices in Data Science (kaggle.com)
24 points by numlocked on Oct 4, 2012 | 5 comments


Well, I guess I'll pat myself on the back for always using git. Having seen a lab's major university research project literally evaporate for lack of proper version control, I've been keen on it ever since. Besides, I like the incremental feeling of accomplishment when I make the next commit.

I agree that setting up a dedicated pipeline early in a project and committing yourself to work within its confines aids organization, but it also helps put your creativity in the right place. We often enjoy building things for the sake of building, and those of us with stronger proclivities towards engineering can sometimes get too jolly putting together new pipes when we should be training models. I have been guilty in the past of writing a Perl script that spawned shell scripts that queued Perl scripts on a cluster that ran R scripts that piped back into Perl scripts. Then again, I always liked the board game Mouse Trap.


Cool article, thanks. I used to spend 90% of my downtime trying to improve my stats and machine learning knowledge, but in the last couple of years I've come to realize how much my lack of proper engineering was hurting me.

A (flawed) analogy: if data science is storytelling, stats is the story and engineering is the words you use to tell it. You need to do well at both to effectively tell your story.

>...good engineering going out the window quickly with elaborate ensembling.

This is one of my criticisms of data mining contests (sorry, Kaggle!). When I was in grad school I liked doing these contests to get practical experience - my last company actually recruited me through one. But as they got more popular, I found the best engineers and data scientists stopped having as much of a chance of winning. Good modelers get 90% of the way there and are then beaten by impractical solutions that would get someone fired from a job.


It's an interesting criticism and there are competition properties that we've identified that can lead to overly complex models. An obvious one, for instance, is running a competition for a very long period of time. A more subtle factor is the degree to which the context of the data has been stripped away - with very little context (just an anonymized feature matrix) it's more likely that victory will go to someone who can squeeze every bit of performance out of an ensemble of ML techniques. But with good context, it's more likely that feature engineering and insight will lead to victory.

Many times it is in fact the well-organized, well-engineered solutions that win. And they are often elegant and intuitive because the data scientist could iterate quickly and try out lots of ideas. Though the crazy ensemble does sometimes eke out a victory, it's not an inherent property of competitions by any means.

One final note: I'm not sure if this is widely known, but impractical solutions are actually just fine for a certain set of competitions. Some competitions are designed to identify the "data frontier" - the limit of what's possible in a dataset - rather than getting a model into production.


It appears our blog is down. Oof. Apologies for the inconvenience. While we're figuring it out, here is a mirror of the post: http://blog.untrod.com/2012/10/engineering-practices-in-data...

Edit: And we're back up.


I love definitions that wax my ego! I grok revision control, and can sling enough R / stats to officially make me a nerd of an urban planner. Does this mean I can start calling myself an "urban data scientist"? I definitely need a pay raise.



