
Recent article on git and reproducibility in science: http://www.scfbm.org/content/8/1/7

It is badly needed.

That article says "Data are ideal for managing with Git."

I once tried using git to manage my data. The problem is that I frequently have thousands of files and gigabytes of data, and git just does not handle that well.[1]

I even tried building a git repo that held nothing but the history of PDB snapshots. The PDB is updated frequently, and I have run into many cases where a paper analyzed a structure 3 years ago but the structure has been revised since then, so the paper made no sense until I thought to look at the history of changes to the structure. Unfortunately, git could not handle this at all when I tried it: the repo took days to construct and was then unbearably slow to use.

Git would probably work well for storing the data used by most bench scientists, but for a computational chemist puking up gigabytes of data weekly on a single project, it is sadly horrible for handling the history of your data.

[1] http://osdir.com/ml/git/2009-05/msg00051.html

You might find git-annex useful:
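
The rough idea behind it, as I understand it, is that the big file contents live outside git's object store and git only tracks a small pointer keyed by a checksum. Here is a minimal Python sketch of that idea. It is not git-annex's actual implementation or on-disk layout, and the names `stash`, `fetch`, and `store_dir` are made up for illustration:

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte files never sit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def stash(path, store_dir):
    """Move a large file into a content-addressed store (hypothetical layout)
    and leave a one-line pointer file in its place. Only the pointer would
    be committed to git."""
    digest = sha256_of(path)
    os.makedirs(store_dir, exist_ok=True)
    target = os.path.join(store_dir, digest)
    if os.path.exists(target):
        os.remove(path)  # identical content is already stored
    else:
        shutil.move(path, target)
    with open(path, "w") as f:
        f.write(digest + "\n")
    return digest


def fetch(pointer_path, store_dir):
    """Resolve a pointer file back to the path of the stored content."""
    with open(pointer_path) as f:
        digest = f.read().strip()
    return os.path.join(store_dir, digest)
```

With something like this, each snapshot commit only records 65-byte pointer files, so the git history stays small, and two snapshots that share an unchanged file point at the same stored copy.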

