Ask HN: Data scientists, how do you organize and store your data files? - willj
======
willj
One problem I'm stumbling into is that multiple projects and experiments
require the same source data, which may be several gigabytes in size. While it
would be nice to have each project folder be self-contained, this isn't
practical due to limits on hard disk size. Unfortunately my workplace doesn't
have S3 or anything similar, and even if it did, keeping multiple copies would
probably be costly.

I imagine this is a common problem. I'd love a solution that lets me maintain
metadata/comments about each dataset: who created it, when it was created, and
what it contains (at a high level). Each entry would also contain a link to
the full dataset which I could copy into Jupyter notebooks. Maybe there's a
better solution that eludes me right now. A related issue is version control
of large datasets, which I'm aware is a difficult problem that people are
working on. But for now, maintaining a library of datasets is the task at hand.
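For the library/metadata piece, here's a minimal sketch of the kind of thing you could roll yourself while waiting for better tooling: a small JSON catalog next to the shared data directory, one entry per dataset with creator, date, description, and the path to the single shared copy. (All file names, paths, and field names below are illustrative, not any standard.)

```python
import json
from datetime import date
from pathlib import Path

# Illustrative location for the shared catalog file
CATALOG = Path("dataset_catalog.json")

def register(name, path, creator, description):
    """Add or update one dataset entry in the shared catalog."""
    catalog = json.loads(CATALOG.read_text()) if CATALOG.exists() else {}
    catalog[name] = {
        "path": str(path),  # link to the single shared copy, not a duplicate
        "creator": creator,
        "created": date.today().isoformat(),
        "description": description,
    }
    CATALOG.write_text(json.dumps(catalog, indent=2))

def lookup(name):
    """Return a dataset's entry, e.g. to paste its path into a notebook."""
    return json.loads(CATALOG.read_text())[name]

register("clickstream-2017", "/shared/data/clickstream-2017.parquet",
         "willj", "Raw clickstream events, one row per pageview")
print(lookup("clickstream-2017")["path"])
```

It doesn't solve versioning, but it covers the who/when/what metadata and gives every notebook one canonical path to copy.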

~~~
jaz46
I'm one of the creators of Pachyderm (github.com/pachyderm/pachyderm), which
might be able to help.

We do require an object store of some kind, but if you don't have S3 you can
always put Minio in front of whatever storage you do have.

Pachyderm will let you mount that data locally in a Jupyter notebook, so you
don't need to constantly copy it around. As many people as you want can do
this, and there will only ever be one true copy of the data in the centralized
system. That central copy also includes version control for your dataset, so
you can make changes to it and get deduplication of the files while still
maintaining all your data lineage info.
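A rough sketch of that workflow, assuming pachctl is installed and pointed at a running cluster (the repo and file names are just placeholders):

```shell
# Create a versioned repo holding the one true copy of the dataset
pachctl create repo raw-data

# Each write becomes a commit, so changes are versioned and deduplicated
pachctl put file raw-data@master:/events.csv -f events.csv
pachctl list commit raw-data@master

# Mount repos locally over FUSE so a notebook can read ~/pfs/raw-data/...
# without copying the data out of the cluster
pachctl mount ~/pfs
```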

~~~
willj
Neat! I'll have to check this out.

