Hacker News new | past | comments | ask | show | jobs | submit login

Nothing feels cleaner than storing everything (notebook, raw data, cleansed data, misc scripts, etc.) in a docker image when you're finished with the project. Data science and docker are meant to be besties.

I would prefer recommending a stable build process: — a Docker image can be just like having a VM image or that one PC in the corner of the lab nobody is sure is unneeded. It's far better than having nothing or just the result file but it still has the possibility of needing to reverse-engineer the internal state and given how fast the Docker world moves I would not want to bet on format compatibility 5 years out.

Docker could be that stable build process but it requires the additional assertion that there wasn't, say, a truckload of changes made using `docker exec` or a bunch of customizations to files which were copied into the image. Simply putting a note on the source repo which says that might be enough.

(I really like what C Titus Brown has written about reproducibility in computation research over the years: http://ivory.idyll.org/blog/tag/reproducibility.html)

Potentially problematic for those who want to check your findings in 30 years time?

Is there a good solution to that problem, though? (Serious question). I recently did a laptop refresh and am using it as an opportunity to solidify my approach to ML development, and would love to hear if there is a good solution to long-term reproducibility. I'm currently leaning towards Docker, but maybe Vagrant or another "pure" VM approach is better...

Not perfectly, but a good start is to keep all the software assets AND data assets you used to train the model.

There needs to be an immutable, high performance read data store that has a 30+ year plan for survival if we're really going to retool our world around expert systems.

I think the only real problem times you'll find are when the architectures are changed. x86, arm, you'd probably want to port your solution images then if ever. There will always be folks emulating hardware in software on new architectures.

Well you can always fire up LaTeX and write a report. If detailed enough, that in conjunction with the data set should be enough to survive anything.

Carve it onto stone tablets.

You joke it, but it's a major problem that our tech for very stable WORM media has lagged behind demand.

Our use of data has grown so much faster than our network capacity (and indeed, it seems like we're going to hit a series of physical laws and practical engineering constraints here). "Data has gravity" but the only way to "sustainable" hold a non-trivial volume of data for 20 years right now is to run a data center with a big dht that detects faults and replicates data.

I prefer gold.

I've never used Docker. Searching "reproducible research with Docker" yields lots of results. Any stand-out resource suggestions?

I would simply familiarize with just the basics because you don't have to go much further than that to make use of it for research purposes. My usual process involves breaking the process down into multiple stages (cleansing, conformance, testing, reporting), including a data dir, and finally creating a dockerfile that simply adds the data/source to a simple hierarchy and includes all dependencies. As long as you know how to build a dockerfile, you're golden. You can then upload the image to dockerhub, and have somebody else pull the image and run it to reproduce your entire environment. Helps a ton for online DS courses and MOOCs.

This doesn't guarantee reproducible results though.

A lack of reproducibility is a major problem for DSEs and practitioners right now. In fact, I'd argue its the single biggest problem.

Registration is open for Startup School 2019. Classes start July 22nd.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact