Hacker News new | past | comments | ask | show | jobs | submit login

It's worth noting that the article does not discuss the reproducibility of results (e.g. with a Jupyter notebook) or the implementation of said results (e.g. deploying/validating models), both of which matter far more than any code style conventions for data-related projects.

^ this.

I can't count how many times I've been shown results and, on asking how to reproduce them, after several nudges (and sometimes complaints to higher-ups) eventually received a series of command-line arguments or a barely functioning R script.

These conclusions are too important to be so sloppily produced. We need verification, validation and uncertainty quantification for any result provided to decision makers.

Learning uncertainty propagation in engineering statistics was one of those concepts that seemed to be immediately useful and have far more implications than any textbooks emphasized.
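For anyone who hasn't seen it, the core idea is easy to sketch (a toy example of my own, not from the coursework mentioned above): under a first-order approximation, relative uncertainties of a product add in quadrature.

```python
import math

def propagate_product(x, sx, y, sy):
    """First-order (linear) uncertainty propagation for f = x * y.

    Relative uncertainties add in quadrature:
        (sf / f)^2 = (sx / x)^2 + (sy / y)^2
    """
    f = x * y
    sf = abs(f) * math.sqrt((sx / x) ** 2 + (sy / y) ** 2)
    return f, sf

# Hypothetical measured quantities with their standard uncertainties:
f, sf = propagate_product(10.0, 0.1, 20.0, 0.4)
print(f, sf)  # 200.0 with uncertainty ~4.47
```

The same pattern generalizes to any differentiable f via the Jacobian, which is exactly what makes it so broadly useful.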

I was very happy to have that background when I took part in developing statistical models used for wind hazard analysis on nuclear power plants in my first job out of college.

To put it cynically: if other people can reproduce your results, you might not be able to demonstrate that you are x10 more productive than them.

Which is to say, I think starting with the idea that you're aiming for x10 is pernicious and tends to create dysfunctional teams. The claim that some developers in some circumstances are ten times more productive than others may or may not be true, but software development needs processes whose goal is to help an entire team rather than to help an individual reach that "level".

This is why the only time I use 10x is "your team will become 10x more productive."

Yes, it's better to have an x10 team than an x10 developer.

Developers should strive to better themselves, but it's important not to fool yourself, too. Having a strong team is almost always better from a business point of view.

Indeed. I have seen several teams with a bunch of (self-styled) "10x" devs, and found that the productivity and quality of the team decreases in direct proportion to the number of "10x" devs on it.

I shun "rockstar" and "10x" (and whatever other bullshit moniker they come up with next) team members. Give me a group of smart people who gel well together, and are highly self-confident without egos getting in the way, and we can move mountains.

This, 100%. I have read postmortems of some "significant discoveries" which turned out to be reproducible only with a particular build of software on a single analyst's machine. Or not at all. One "result" turned out to hinge on the iteration order of Python dictionaries.
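A minimal sketch of how that failure mode bites (my own illustration, not the actual postmortem code): floating-point addition isn't associative, so any aggregation whose iteration order isn't pinned down — as dict order wasn't guaranteed before Python 3.7 — can change from run to run.

```python
# Same three values, summed in two different orders:
vals = {"a": 1e16, "b": -1e16, "c": 1.0}

order1 = [vals[k] for k in ("a", "b", "c")]   # 1e16, -1e16, 1.0
order2 = [vals[k] for k in ("a", "c", "b")]   # 1e16, 1.0, -1e16

s1 = sum(order1)  # (1e16 - 1e16) + 1.0 = 1.0
s2 = sum(order2)  # 1e16 + 1.0 rounds back to 1e16, then - 1e16 = 0.0
print(s1, s2)  # 1.0 0.0
```

When the iteration order itself varied between interpreter runs (e.g. via hash randomization on older Pythons), this kind of discrepancy showed up nondeterministically, which is exactly what makes it so hard to debug.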

And this definitely didn't happen after those working on tools to help said analysts make reproducible results encouraged the analysts to use said tools... No, that would be crazy.

I completely agree. Almost all of this article appears to have little to do with being a Data Scientist in particular and more to do with some good practices for writing code in general. So the advice itself is fine, just not what I was hoping for based on the title.

Reproducibility, as you say, however, is an issue far more particular to data science, and worth more serious consideration and discussion. Hand in hand with that goes shareability. I'm a fan of what Airbnb has open-sourced to address some of those issues in their knowledge repo project: https://github.com/airbnb/knowledge-repo

Hey, thanks for the comment! I'm the author of the talk-turned-post :-) You are completely correct about reproducibility being super important in data science workflows. While I did mention it in the post (and in the talk the post was based on), I mentioned it only as part of the section on version control tools. That said, I think it's not something that gets enough focus (obviously I'm guilty of that too), so I plan on doing a follow-up post on reproducibility and the tools that can help you recreate your results. Kinda putting the "science" back in data science. Really I want an excuse to play around with tools like https://dataversioncontrol.com/, which looks super useful; I mentioned it in the post but haven't had a chance to use it yet.

'Kinda putting the "science" back in data science.' Exactly! This is the primary goal of the DVC project.

Nothing feels cleaner than storing everything (notebook, raw data, cleansed data, misc scripts, etc.) in a docker image when you're finished with the project. Data science and docker are meant to be besties.

I would prefer to recommend a stable build process instead: a Docker image can be just like having a VM image, or that one PC in the corner of the lab that nobody is sure is safe to unplug. It's far better than having nothing, or just the result file, but it still leaves the possibility of needing to reverse-engineer the internal state, and given how fast the Docker world moves I would not want to bet on format compatibility 5 years out.

Docker could be that stable build process but it requires the additional assertion that there wasn't, say, a truckload of changes made using `docker exec` or a bunch of customizations to files which were copied into the image. Simply putting a note on the source repo which says that might be enough.

(I really like what C Titus Brown has written about reproducibility in computation research over the years: http://ivory.idyll.org/blog/tag/reproducibility.html)

Potentially problematic for those who want to check your findings in 30 years time?

Is there a good solution to that problem, though? (Serious question). I recently did a laptop refresh and am using it as an opportunity to solidify my approach to ML development, and would love to hear if there is a good solution to long-term reproducibility. I'm currently leaning towards Docker, but maybe Vagrant or another "pure" VM approach is better...

Not perfectly, but a good start is to keep all the software assets AND data assets you used to train the model.
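A low-tech starting point for that (a sketch of my own; the file name is made up): write a manifest of content hashes next to the model, so you can at least detect when the training inputs have drifted from what the model was built on.

```python
import hashlib
import json
from pathlib import Path

def manifest(paths):
    """Return {path: sha256 hex digest} for the given files, so a
    training run can be pinned to exact byte-level inputs."""
    out = {}
    for p in paths:
        out[str(p)] = hashlib.sha256(Path(p).read_bytes()).hexdigest()
    return out

# Example with a throwaway data file:
Path("train.csv").write_text("x,y\n1,2\n")
print(json.dumps(manifest(["train.csv"]), indent=2))
```

Checking the manifest before retraining or re-scoring is a crude but honest substitute for a real immutable data store.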

There needs to be an immutable, high-performance read data store with a 30+ year plan for survival if we're really going to retool our world around expert systems.

I think the only real trouble you'll run into is when architectures change: x86, ARM. You'd probably want to port your solution images then, if ever. There will always be folks emulating old hardware in software on new architectures.

Well, you can always fire up LaTeX and write a report. If it's detailed enough, that in conjunction with the data set should be enough to survive anything.

Carve it onto stone tablets.

You joke, but it's a major problem that our tech for very stable WORM media has lagged behind demand.

Our use of data has grown so much faster than our network capacity (and indeed, it seems we're going to hit a series of physical laws and practical engineering constraints here). "Data has gravity," but the only way to "sustainably" hold a non-trivial volume of data for 20 years right now is to run a data center with a big DHT that detects faults and replicates data.

I prefer gold.

I've never used Docker. Searching "reproducible research with Docker" yields lots of results. Any stand-out resource suggestions?

I would just familiarize yourself with the basics, because you don't have to go much further than that to make use of it for research purposes. My usual workflow involves breaking the project down into stages (cleansing, conformance, testing, reporting), including a data dir, and finally creating a Dockerfile that adds the data/source to a simple hierarchy and includes all dependencies. As long as you know how to build from a Dockerfile, you're golden. You can then upload the image to Docker Hub and have somebody else pull it and run it to reproduce your entire environment. Helps a ton for online DS courses and MOOCs.
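For a concrete idea of the shape, a minimal Dockerfile along those lines might look like this (the base image, directory layout, and entry-point script are all illustrative, not from the comment above):

```dockerfile
# Pin the base image so the environment doesn't drift under you
FROM python:3.10-slim

WORKDIR /project

# Dependencies first, so rebuilds after code edits reuse this layer
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Data, source, and notebooks all travel with the image
COPY data/ data/
COPY src/ src/
COPY notebooks/ notebooks/

# Running the container re-runs the whole pipeline
CMD ["python", "src/run_pipeline.py"]
```

After `docker build -t yourname/analysis .` and `docker push yourname/analysis`, anyone can `docker pull` and `docker run` it to get the same environment.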

This doesn't guarantee reproducible results though.

A lack of reproducibility is a major problem for DSEs and practitioners right now. In fact, I'd argue it's the single biggest problem.

Thus making you a 1x Data Scientist because your result can only be demonstrated once? ;)

In my experience if you develop a "data science pipeline" forcing the data scientist to build

- reproducible

- validated

- back-tested

- easy to deploy

models, they are going to hate it. It just kills the fun and/or makes it obvious if they made a mistake.

So we should sacrifice all the things that actually make a Data Scientist's work valuable in the name of fun and obscuring mistakes?

Fun I almost get; it's obviously good for productivity (though I think you'd really be trading productive output for non-productive output). But I just don't get where you're even coming from with the "making mistakes more obvious" angle.

I think he was being facetious. Of course we need all these things, but data science right now is still not that mature I guess.

I blame software.

I don't understand why we couldn't have some system, perhaps using strace and friends, which tracks everything I ever do, and how every file was created. Then I could just say "how did I make X?"

Make it happen!

