DVC – Open Source Machine Learning Version Control System (dvc.org)
148 points by jonbaer on Feb 10, 2019 | 36 comments

In the intro video they say it’s “based on git but supports large files”.

Are they using Git LFS [1] or did they make something else?

And what is their proposed value add over using git directly?

Edit: They say a little more about the large file stuff

> DVC keeps metafiles in Git instead of Google Docs to describe and version control your data sets and models. DVC supports a variety of external storage types as a remote cache for large files.

So from what they said in the video and what I read on the page this is probably a limited front-end to make using git easier for people that don’t know git.

As for the large-file handling, from what they say it seems they have implemented the equivalent of git-annex [2], or maybe they are even using it. I didn’t check whether they wrote their own or used git-annex.

[1]: https://github.com/git-lfs/git-lfs/blob/master/README.md

[2]: https://git-annex.branchable.com/

Git itself is not suitable for huge files:

- Large binary files tend to be not very "deflatable"

- xdelta (used in Git to diff files) tries to load the entire content of a file into memory, at once.

This is why there are solutions like Git-LFS, where you keep your versions on a remote server / cloud storage and you use git to track only the "metadata" files.

DVC implemented its own solution, in order to be SCM agnostic and cloud flexible (supporting different remote storages).

Here's more info comparing DVC to similar/related technologies: https://dvc.org/doc/dvc-philosophy/related-technologies

EDIT: formatting

Thank you for the link, that’s the kind of comparison I was looking for all the way down to even talking about how DVC compares to git-annex :)

They use cloud storage backends as remotes: AWS, Google Cloud, Azure. There is no specific Git LFS support, but compatibility may be possible by using it to track the DVC cache.

Honestly, I'm confused. I went to the site and watched the video, and I still don't get it. What is it?

Basically, you're not going to check into Git the data used to train models, or the models themselves, if they are multi-GB.

DVC basically symlinks those big files and checks in the symlinks.

It can also download those files from GCS/S3 and track which file came from where (e.g. if you generated output.data using input.data, then whenever input.data changes, DVC can detect that output.data needs to be regenerated as well).

That's my understanding.
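For anyone who wants to see the "output.data needs to be regenerated" idea concretely, here is a minimal sketch. It is not DVC's implementation: the function names, the JSON record file, and the choice of SHA-256 are all assumptions made for illustration. It only shows the general technique of remembering input hashes at generation time and comparing them later.

```python
import hashlib
import json
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return a hex digest of the file's content, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def record_stage(deps: list[Path], record: Path) -> None:
    """Save the current hash of every dependency after a successful run."""
    record.write_text(json.dumps({str(p): file_hash(p) for p in deps}, indent=2))


def is_stale(deps: list[Path], output: Path, record: Path) -> bool:
    """True if the output is missing or a dependency changed since record_stage()."""
    if not output.exists() or not record.exists():
        return True
    saved = json.loads(record.read_text())
    return any(saved.get(str(p)) != file_hash(p) for p in deps)
```

You would call `record_stage([input_path], record_path)` right after producing the output, and `is_stale(...)` before deciding whether to rerun the step.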

> if you generated output.data using input.data, then whenever input.data changes, DVC can detect that output.data needs to be regenerated as well

To my understanding, you could do the same with Docker. E.g. if you COPY your input files into the image, rebuilding the image would only do any work if the input files changed.

Docker can help only if there is a single step in your project. In ML projects you usually have many steps - Preprocess, Train. Each of the steps can be divided: extract Evaluate step from Train etc.

Also, Docker has an overhead - copy of data needs to be created. While DVC just saves links (sym-links, hard-links or reflinks) with a minimum overhead. It is crucial when you work with multi-GB datasets.
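To illustrate the "links instead of copies" point, here is a rough sketch of placing a cached file into a working directory without duplicating its bytes. The function name and the fallback order are assumptions, not DVC's actual behavior, and reflinks are left out because the Python standard library has no portable call for them.

```python
import os
import shutil
from pathlib import Path


def link_from_cache(cached: Path, workspace: Path) -> str:
    """Expose `cached` at `workspace`, avoiding a copy when possible.

    Returns the method that was used: "hardlink", "symlink" or "copy".
    """
    workspace.parent.mkdir(parents=True, exist_ok=True)
    # Remove a previous file or (possibly broken) symlink at the target path.
    if workspace.exists() or workspace.is_symlink():
        workspace.unlink()
    try:
        os.link(cached, workspace)  # same inode, no extra disk space
        return "hardlink"
    except OSError:
        pass  # e.g. cache and workspace are on different filesystems
    try:
        os.symlink(cached.resolve(), workspace)
        return "symlink"
    except OSError:
        pass  # e.g. symlinks not permitted for this user
    shutil.copy2(cached, workspace)  # last resort: a real copy
    return "copy"
```

With a multi-GB file, the hardlink and symlink paths return almost immediately, while the copy path costs both time and disk space, which is the overhead being discussed.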

Good points, I'd appreciate it if you could elaborate since it seems you've thought a lot about this.

> Docker can help only if there is a single step in your project. In ML projects you usually have many steps - Preprocess, Train. Each of the steps can be divided: extract Evaluate step from Train etc.

Yeah this is something I've been struggling with. In a project I'm working on I use docker build to 1) set up the environment 2) get canonical datasets 3) pre-process the datasets. However I've left reproducing as manual instructions, e.g. run the container, call script 1 to repro experiment 1, call script 2 to repro experiment 2, etc. I think I could improve this by providing `--entrypoint` at docker run, or by providing a docker-compose file (wherein I could specify an entrypoint) for each experiment.

What do you think are the generalizability pitfalls in this workflow? How could dvc help?

> Also, Docker has an overhead - copy of data needs to be created. While DVC just saves links (sym-links, hard-links or reflinks) with a minimum overhead. It is crucial when you work with multi-GB datasets.

Good point!

I could see using an entry point for that. The entry point script could take as an argument the experiment to run and then the docker command would be: docker run -ti experiments:latest experiment-1

I could also see creating a base dockerfile, and a dockerfile per experiment. The base docker file would do the setup, and the experiment docker files would just run the commands necessary to reproduce the experiment, and exit.
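The entry-point idea above could look roughly like this. It is only a sketch: the experiment names and the placeholder functions are hypothetical, and a real project would call its own training scripts instead.

```python
import argparse
from typing import Callable


def experiment_1() -> str:
    # Placeholder: a real project would run its first experiment here.
    return "ran experiment-1"


def experiment_2() -> str:
    # Placeholder: a real project would run its second experiment here.
    return "ran experiment-2"


# Maps the name given on the command line to the code that reproduces it.
EXPERIMENTS: dict[str, Callable[[], str]] = {
    "experiment-1": experiment_1,
    "experiment-2": experiment_2,
}


def main(argv: list[str]) -> str:
    """Parse the experiment name from argv and run the matching function."""
    parser = argparse.ArgumentParser(description="Run one named experiment.")
    parser.add_argument("experiment", choices=sorted(EXPERIMENTS))
    args = parser.parse_args(argv)
    return EXPERIMENTS[args.experiment]()
```

If the script ended with something like `print(main(sys.argv[1:]))` and were set as the image's ENTRYPOINT, then `docker run -ti experiments:latest experiment-1` would pass `experiment-1` through as the argument.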

The major pitfall depends on your goal. Your approach looks good if you just need to retrain an existing model/code in production. However, it is not ideal in the ML modeling/development stage. Let me explain...

`--entrypoint` defines a single step/script, so you turn your entire ML process into a single black box. That lacks the granularity of the ML modeling process, where people tend to separate different stages to make the process more manageable and efficient: managing and versioning the dataset separately from modeling, preprocessing data before training, and keeping training code as a separate unit, plus some problem-specific steps/units.

DVC gives you the ability to increase the granularity of your ML project while still keeping it manageable and reproducible. The steps can still be wrapped in Docker, which is good practice. As @theossuary said, run `docker run -ti experiments:latest` as a step in DVC.

my understanding: it's a combination of

1 - git based management (not storage) of data files used in ML experiments;

2 - lightweight pipelines integrated with git to allow reproducibility of outputs and intermediaries

3 - integrating git with experimentation

If you've worked on teams building ML products, this is something you've at least half-built internally. So you can share outputs internally with tracked lineage showing how to repro. Plus the pipeline management.

It allows you to track the progress of your models. Having a tool like this to track training/testing data improves reproducibility: you can see where that data was used to train or test a specific model, the parameters with which that model was built, and how that model affects downstream model performance.

Basically, it computes a fingerprint of each large file (large = too large for Git) and commits the fingerprint to Git, while the large files are stored on some remote location of your choice (AWS S3, local, cache...). So you have Git commands to update your fingerprints, and the same commands with dvc (add, push, pull...) to interact with the large files. Quite simple and neat actually.
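A toy version of the fingerprint idea, for illustration only. The `.meta.json` file name, its fields, the SHA-256 hash, and the local cache layout are all made up for this sketch and are not DVC's real format; the local cache directory stands in for whatever remote storage you choose.

```python
import hashlib
import json
import shutil
from pathlib import Path


def fingerprint(path: Path) -> str:
    """Hash the file content in 1 MiB blocks so large files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def add(path: Path, cache_dir: Path) -> Path:
    """Store `path` in a content-addressed cache and write a small metafile.

    The metafile is what you would commit to Git; the large file is not.
    """
    digest = fingerprint(path)
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(path, cached)
    meta = path.with_name(path.name + ".meta.json")
    meta.write_text(json.dumps({"path": path.name, "hash": digest}, indent=2))
    return meta


def checkout(meta: Path, cache_dir: Path) -> Path:
    """Restore the large file described by `meta` from the cache."""
    info = json.loads(meta.read_text())
    target = meta.with_name(info["path"])
    shutil.copy2(cache_dir / info["hash"][:2] / info["hash"][2:], target)
    return target
```

Checking out an older Git commit gives you an older metafile, and `checkout` then brings back the matching version of the large file, as long as its blob is still in the cache.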

We've started using it as a replacement for git LFS for different projects internally, not especially for Data Science, and we're very happy with it. Works like a charm with Linux, Mac and Windows.

I haven't used Git LFS but I am planning to use it. What made you guys want to replace git LFS with DVC?

We had trouble with setups on different OSes and hard-to-understand error messages, and in general had to use multiple repos, some with Git LFS and some with "standard" Git, and combine them, which was a mess. We ended up deleting the local clones and re-cloning quite often, and playing tech support for less technical colleagues.

Also we realised we don't really need a word- or even line-level diff, but we just want to know which files have been modified (e.g. large binaries). So maybe we shouldn't have started with Git LFS in the first place.

DVC allowed us to have everything in our monorepo, kept in sync, without requiring people to install Git LFS before they clone the repo. You don't have to pull the large files if you don't want to or don't need them for your personal work. In general I think you're more flexible in terms of local and remote caching and sharing of these large files, IMHO. If network is an issue (technical or money-wise), it's pretty useful.

I am sure there are more reasons for and against DVC, but it worked surprisingly well for us, the support on GitHub is super responsive, and so far we couldn't find a reason against it for our use case.

My experience with Git(Hub) LFS was that it doesn't work with files that are several GB in size. I was constantly getting upload/download errors from GitHub.

The sweet spot for LFS is files of a few hundred MB, with less than a couple of GB overall, while DVC was designed with 10 GB-100 GB scenarios in mind.

This is a bit different, but commonwl.org ( https://github.com/common-workflow-language/common-workflow-... ) has similar aims, at a pipeline / workflow level and used in genomics, astronomy, imaging etc.

For those who are familiar with cookiecutter-data-science, there is a good example of how DVC can version and manage data - https://github.com/drivendata/cookiecutter-data-science/issu... + this PR to play with it - https://github.com/drivendata/cookiecutter-data-science/pull...

I haven't tested it yet, but I would have to say this would be the key feature in testing new models:

> Reproducible. The single 'dvc repro' command reproduces experiments end-to-end. DVC guarantees reproducibility by consistently maintaining a combination of input data, configuration, and the code that was initially used to run an experiment.

Hmm. If I understand correctly, in order to reproduce the steps taken in creating machine learning models, I need to version control more things than just the code:

1. Code

2. Configuration (libraries etc)

3. Input/training data

1 and 2 are easily solved with Git and Docker respectively, although you would need some tooling to keep track of the various versions in a given run. 3 is the part I can't quite figure out.

According to the site DVC uses object storage to store input data but that leads to a few questions:

1. Why wouldn't I just use Docker and Git + Git LFS to do all of this? Is DVC just a wrapper for these tools?

2. Why wouldn't I just version control the query that created the data along with the code that creates the model?

3. What if I'm working on a large file and make a one byte change? I've never come across an object store that can send a diff, so surely you'd need to retransmit the whole file?

@mcncfie your understanding is correct. #3 might include output data/models as well, and intermediate results like preprocessed data. DVC also handles dependencies between all of these.

1. DVC does dependency tracking in addition to that. It is like a lightweight ML pipeline tool or an ML-specific Makefile. Also, DVC is simply faster than LFS, which is critical in 10 GB+ cases.

2. This is a great case. However, in some scenarios, you would prefer to store the query output along with the query and DVC helps with that.

3. Correct, there are no data diffs. DVC just stores blobs and you can GC the old ones - https://dvc.org/doc/commands-reference/gc

> Correct, there are no data diffs. DVC just stores blobs and you can GC the old ones

Have you looked into using content-defined chunking (a-la restic or borgbackup) so that you get deduplication without the need to send around diffs? This is related to a problem that I'm working on solving in OCI (container) images[1].

[1]: https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar
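For readers unfamiliar with content-defined chunking, here is a small sketch of the idea using a gear-style rolling hash. The parameters and the hash table are arbitrary choices for illustration; restic and borgbackup use their own rolling hashes and tuning, and none of this is part of DVC.

```python
import hashlib
import random

# Fixed pseudo-random table: one 64-bit value per possible byte.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def chunks(data: bytes, avg_bits: int = 10, min_size: int = 64, max_size: int = 8192):
    """Yield chunks whose boundaries depend on content, not on byte offsets.

    A boundary is placed where the low `avg_bits` bits of the rolling hash
    are zero, so cut points are decided by the most recent bytes only.
    """
    cut_mask = (1 << avg_bits) - 1
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & MASK64
        size = i - start + 1
        if (size >= min_size and (h & cut_mask) == 0) or size >= max_size:
            yield data[start:i + 1]
            start = i + 1
            h = 0  # start a fresh hash for the next chunk
    if start < len(data):
        yield data[start:]


def chunk_ids(data: bytes) -> list[str]:
    """Content hash of every chunk; equal ids mean the chunk can be reused."""
    return [hashlib.sha256(c).hexdigest() for c in chunks(data)]
```

Because boundaries follow the content, inserting a byte in the middle of a file changes the chunk around the edit, while chunks before it and chunks after the boundaries line up again keep their ids. Only the changed chunks would then need to be stored or uploaded.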

Content-defined chunks - very interesting. I'd suggest you ask this question in the DVC issue tracker or the DVC channel: https://dvc.org/chat

Thanks! Regarding 2, could you give an example?

Also, can I combine DVC with a pipeline tool like Apache Airflow?

Example: a query to a DB gives you different results over time, since the data/table evolves. So you just store the query output (let's say a couple of GB) in DVC to make your research reproducible.

This is like assigning a random seed to the DB :)

Sure, some teams combine DVC with Airflow. It gives a clear separation between engineering (reliability) and data science (lightweight and quick iteration). A recent discussion about this: https://twitter.com/FullStackML/status/1091840829683990528

I think maybe this could better illustrate what can be done using DVC: https://dagshub.com/DAGsHub-Official/DAGsHub-Tutorial-MNIST

You can navigate branches and download the data, model, and intermediate pipeline files from shared team storage (AWS, GS, Hadoop, or a plain NFS or SSH server) as they were in a specific commit. You can also compare metrics between branches to compare different experiments, etc. A team member can check out a branch, immediately get the relevant files that were already computed by someone else, modify the training code, reproduce the out-of-date parts of the pipeline using dvc repro, and then git commit the resulting metrics and dvc push the resulting model back to the shared team storage.

Just ran through the official tutorial and I'm pretty impressed.

As I understand it, the idea is to:

- use a git branch for each experiment (change of hyperparameters etc.)

- define pipeline stages (preprocessing, train/test split, model training, model validation)

- after these steps, change any part of the pipeline (say, data preprocessing or model parameters) and run `dvc repro` to reproduce all stages whose dependencies changed, and track metrics for all branches

Which is pretty cool and reduces the need for experiment logs in a wiki.
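To make "reproduce all stages for which dependencies changed" concrete, here is a sketch of the graph logic only. The stage names and file names are hypothetical, and the code conservatively assumes that rerunning a stage always changes its outputs; it is not how `dvc repro` is implemented.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical pipeline: stage name -> (dependencies, outputs).
STAGES = {
    "preprocess": ({"raw.csv", "preprocess.py"}, {"clean.csv"}),
    "split": ({"clean.csv"}, {"train.csv", "test.csv"}),
    "train": ({"train.csv", "train.py"}, {"model.pkl"}),
    "validate": ({"model.pkl", "test.csv"}, {"metrics.json"}),
}


def stages_to_rerun(changed: set[str]) -> list[str]:
    """Return, in execution order, the stages affected by the changed files."""
    # Which stage produces each file.
    producer = {out: name for name, (_, outs) in STAGES.items() for out in outs}
    # Stage -> set of stages it depends on (through their output files).
    graph = {
        name: {producer[d] for d in deps if d in producer}
        for name, (deps, _) in STAGES.items()
    }
    dirty = set(changed)
    rerun = []
    for name in TopologicalSorter(graph).static_order():
        deps, outs = STAGES[name]
        if deps & dirty:
            rerun.append(name)
            dirty |= outs  # outputs of a rerun stage count as changed
    return rerun
```

Changing only `train.py` would rerun `train` and `validate` while leaving `preprocess` and `split` untouched, which is the time saving people care about with multi-GB preprocessing.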

Exactly. And it is as flexible as Git. You can define your own workflow. For example, some data scientists avoid using git branches for experiments and use directories instead.

This is definitely needed and DVC has a few cool ideas. I think the most useful feature missing from existing tools is integrating data versioning with git, and simple commands to tag, push, and pull data files.

I'm not sure I buy the pipeline and repro functionality as useful. I'd rather see nice integration with Docker since it can be used to define the environment as well as repro steps.

Great point! `dvc pull/push mydataset` actually exists in DVC. But `dvc tag` is missing, and we (the DVC team) now understand that this prevents DVC from tracking datasets and ML models properly. It will be implemented soon.

There is an ongoing discussion on the DVC GitHub about dataset tracking and tags, https://github.com/iterative/dvc/issues/1487, and some discussion in the DVC Discord channel: https://dvc.org/chat

How does this compare to pachyderm?

From a very high-level perspective, Pachyderm is a data engineering tool designed with ML in mind, while DVC is a tool to organize and version control an ML project. One way to think about it would probably be Spark/Hadoop or Airflow vs. Git/GitHub.



This is not version control using ML, but version control for ML development and research.
