
DVC – Open Source Machine Learning Version Control System - jonbaer
https://dvc.org/
======
codetrotter
In the intro video they say it’s “based on git but supports large files”.

Are they using Git LFS [1] or did they make something else?

And what is their proposed value add over using git directly?

Edit: They say a little more about the large file stuff

> DVC keeps metafiles in Git instead of Google Docs to describe and version
> control your data sets and models. DVC supports a variety of external
> storage types as a remote cache for large files.

So from what they said in the video and what I read on the page, this is
probably a limited front-end to make using git easier for people who don't
know git.

And in terms of the large-file stuff, it sounds like they have implemented the
equivalent of git-annex [2], or maybe they are even using it. I didn't look to
see if they wrote their own or used git-annex.

[1]: [https://github.com/git-lfs/git-lfs/blob/master/README.md](https://github.com/git-lfs/git-lfs/blob/master/README.md)

[2]: [https://git-annex.branchable.com/](https://git-annex.branchable.com/)

~~~
moocowtruck
Honestly I'm confused. I went to the site, watched the video, and I still
don't get it. What is it?

~~~
deepsun
Basically, you're not going to check the data used to train models, or the
models themselves, into Git if they are multi-GB.

DVC basically symlinks those big files and checks in the symlinks.

It can also download those files from GCS/S3, and track which file came from
where (e.g. if you generated output.data using input.data, then whenever
input.data changes, DVC can detect that output.data needs to be regenerated as
well).

That's my understanding.
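
If I've got that right, the commands look roughly like this (hypothetical file names; `dvc run` is the 0.x-era syntax, newer versions define stages in dvc.yaml):

    # track a big file: DVC moves it into its cache, leaves a link in its place,
    # and writes a small input.data.dvc metafile that you commit to git
    dvc add input.data

    # declare that output.data is produced from input.data by this command
    dvc run -d input.data -d process.py -o output.data python process.py

    # after input.data (or process.py) changes, re-run only what is out of date
    dvc repro output.data.dvc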

~~~
maksimum
> if you generated output.data using input.data, then whenever input.data
> changes, DVC can detect that output.data needs to be regenerated as well

To my understanding, you could do the same with Docker. E.g. if you COPY your
input files into the image, rebuilding the image would only redo that step if
the input files changed.

~~~
dmpetrov
Docker can help only if there is a single step in your project. In ML projects
you usually have many steps - Preprocess, Train - and each of them can be
divided further: extract an Evaluate step from Train, etc.

Also, Docker has overhead - a copy of the data needs to be created - while DVC
just saves links (symlinks, hardlinks or reflinks) with minimal overhead. That
is crucial when you work with multi-GB datasets.
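
To make the multi-step point concrete, a rough sketch of a chained pipeline (hypothetical script and file names, same `dvc run` style as above):

    dvc run -d preprocess.py -d raw.csv   -o clean.csv    python preprocess.py
    dvc run -d train.py      -d clean.csv -o model.pkl    python train.py
    dvc run -d evaluate.py   -d model.pkl -o metrics.json python evaluate.py

    # change preprocess.py or swap in a new raw.csv, then:
    dvc repro metrics.json.dvc   # re-runs only the stages whose dependencies changed

Each stage's outputs are linked from the cache rather than copied, so no extra copies of the multi-GB files are made.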

~~~
maksimum
Good points. I'd appreciate it if you could elaborate, since it seems you've
thought a lot about this.

> Docker can help only if there is a single step in your project. In ML
> projects you usually have many steps - Preprocess, Train - and each of them
> can be divided further: extract an Evaluate step from Train, etc.

Yeah, this is something I've been struggling with. In a project I'm working on,
I use docker build to 1) set up the environment, 2) get canonical datasets, and
3) pre-process the datasets. However, I've left reproducing as manual
instructions, e.g. run the container, call script 1 to repro experiment 1,
call script 2 to repro experiment 2, etc. I think I could improve this by
providing `--entrypoint` at docker run, or by providing a docker-compose file
(wherein I could specify an entrypoint) for each experiment.

What do you think are the generalizability pitfalls in this workflow? How
could dvc help?

> Also, Docker has overhead - a copy of the data needs to be created - while
> DVC just saves links (symlinks, hardlinks or reflinks) with minimal overhead.
> That is crucial when you work with multi-GB datasets.

Good point!

~~~
theossuary
I could see using an entrypoint for that. The entrypoint script could take the
experiment to run as an argument, and then the docker command would be:

    docker run -ti experiments:latest experiment-1

I could also see creating a base Dockerfile, and a Dockerfile per experiment.
The base Dockerfile would do the setup, and the experiment Dockerfiles would
just run the commands necessary to reproduce the experiment, and exit.
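
A rough sketch of that entrypoint script (experiment script names are made up):

    #!/bin/sh
    # entrypoint.sh - dispatch to the requested experiment script
    case "$1" in
      experiment-1) exec python repro_experiment_1.py ;;
      experiment-2) exec python repro_experiment_2.py ;;
      *) echo "usage: docker run <image> <experiment-name>" >&2; exit 1 ;;
    esac

With `ENTRYPOINT ["/entrypoint.sh"]` in the Dockerfile, `docker run -ti experiments:latest experiment-1` passes "experiment-1" straight through as `$1`.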

------
loving-g
Basically, it computes a fingerprint of each large file (large = too large for
Git) and commits the fingerprint to Git, while the large files are stored on
some remote location of your choice (AWS S3, local, cache...). So you have Git
commands to update your fingerprints, and the same commands with dvc (add,
push, pull...) to interact with the large files. Quite simple and neat
actually.
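
For example, the day-to-day flow looks roughly like this (bucket name is made up):

    dvc add data/images.tar            # writes data/images.tar.dvc containing an md5 fingerprint
    git add data/images.tar.dvc .gitignore
    git commit -m "Track images with DVC"

    dvc remote add -d storage s3://my-bucket/dvc-cache   # configure the remote once
    dvc push                           # upload the actual blob to the remote

    # on another machine / by a teammate:
    git clone <repo> && cd <repo>
    dvc pull                           # fetch the blobs matching the committed fingerprints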

We've started using it as a replacement for git LFS for different projects
internally, not specifically for data science, and we're very happy with it.
Works like a charm on Linux, Mac and Windows.

~~~
iamcreasy
I haven't used Git LFS but I am planning to use it. What made you guys want to
replace git LFS with DVC?

~~~
scribu
My experience with Git(hub) LFS was that it doesn't work with files that are
several GB in size. I was constantly getting upload/download errors from GitHub.

~~~
dmpetrov
The sweet spot for LFS is a few 100MB files with less than a couple of GB
overall, while DVC was designed with 10GB-100GB scenarios in mind.

------
dkural
This is a bit different, but commonwl.org ([https://github.com/common-workflow-language/common-workflow-language](https://github.com/common-workflow-language/common-workflow-language))
has similar aims, at a pipeline / workflow level, and is used in genomics,
astronomy, imaging, etc.

------
ishcheklein
For those who are familiar with cookiecutter-data-science, there is a good
example of how DVC can version and manage data -
[https://github.com/drivendata/cookiecutter-data-science/issues/158](https://github.com/drivendata/cookiecutter-data-science/issues/158) -
plus this PR to play with it -
[https://github.com/drivendata/cookiecutter-data-science/pull/159](https://github.com/drivendata/cookiecutter-data-science/pull/159)

------
jonbaer
I haven't tested it yet, but I would have to say this would be the key feature
in testing new models:

> Reproducible: The single 'dvc repro' command reproduces experiments
> end-to-end. DVC guarantees reproducibility by consistently maintaining a
> combination of input data, configuration, and the code that was initially
> used to run an experiment.

------
mcncfie
Hmm. If I understand correctly, in order to reproduce the steps taken in
creating machine learning models, I need to version control more things than
just the code:

1. Code

2. Configuration (libraries, etc.)

3. Input/training data

1 and 2 are easily solved with Git and Docker respectively, although you would
need some tooling to keep track of the various versions in a given run. 3 is
less obvious.

According to the site, DVC uses object storage to store input data, but that
leads to a few questions:

1. Why wouldn't I just use Docker and Git + Git LFS to do all of this? Is DVC
just a wrapper for these tools?

2. Why wouldn't I just version control the query that created the data along
with the code that creates the model?

3. What if I'm working on a large file and make a one-byte change? I've never
come across an object store that can send a diff, so surely you'd need to
retransmit the whole file?

~~~
dmpetrov
@mcncfie your understanding is correct. #3 might also include output data/models
and intermediate results like preprocessed data. DVC also handles dependencies
between all of these.

Answers:

1. DVC does dependency tracking in addition to that. It is like a lightweight
ML pipeline tool or an ML-specific Makefile. Also, DVC is simply faster than
LFS, which is critical in 10GB+ cases.

2. This is a great case. However, in some scenarios you would prefer to store
the query output along with the query itself, and DVC helps with that.

3. Correct, there are no data diffs. DVC just stores blobs and you can GC the
old ones - [https://dvc.org/doc/commands-reference/gc](https://dvc.org/doc/commands-reference/gc)
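
Roughly (exact flags depend on your DVC version):

    dvc gc            # drop cached blobs no longer referenced by the current workspace
    dvc gc --cloud    # also remove unreferenced blobs from the remote storage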

~~~
cyphar
> Correct, there are no data diffs. DVC just stores blobs and you can GC the
> old ones

Have you looked into using content-defined chunking (à la restic or
borgbackup) so that you get deduplication without the need to send around
diffs? This is related to a problem that I'm working on solving in OCI
(container) images[1].
(container) images[1].

[1]: [https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar](https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar)

~~~
dmpetrov
Content-defined chunks - very interesting. I'd suggest you ask this question
in the DVC issue tracker or the DVC chat channel:
[https://dvc.org/chat](https://dvc.org/chat)

------
tolstoyevsky
I think maybe this could better illustrate what can be done using DVC:
[https://dagshub.com/DAGsHub-Official/DAGsHub-Tutorial-MNIST](https://dagshub.com/DAGsHub-Official/DAGsHub-Tutorial-MNIST)

You can navigate branches and download the data, model, and intermediate
pipeline files - from a shared team AWS, GS, or Hadoop store, or a plain NFS
or SSH server - as they were in a specific commit. You can also compare
metrics between branches to evaluate different experiments, etc. A team member
can check out a branch, immediately get the relevant files which were already
computed by someone else, modify the training code, reproduce the out-of-date
parts of the pipeline using dvc repro, and then git commit the resulting
metrics and dvc push the resulting model back to the shared team storage.
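
That workflow, roughly (branch name is made up):

    git checkout experiment-42     # switch to a teammate's experiment branch
    dvc pull                       # fetch the data/models already computed for that branch
    # ...edit the training code...
    dvc repro                      # rebuild only the out-of-date parts of the pipeline
    git commit -am "Tweak training, update metrics"
    dvc push                       # share the regenerated model via the team storage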

------
usasha
Just ran through the official tutorial and I'm pretty impressed.

As I understood it, the idea is to:

- use a git branch for each experiment (change of hyperparameters etc.)
- define pipeline stages (preprocessing, train/test split, model training,
model validation)
- after these steps, change any part of the pipeline (say data preprocessing
or model parameters) and run `dvc repro` to reproduce all stages whose
dependencies changed, and track metrics for all branches, which is pretty cool
and reduces experiment logs in the wiki

~~~
dmpetrov
Exactly. And it is as flexible as Git - you can define your own workflow. For
example, some data scientists avoid using git branches for experiments - they
use directories instead.

------
maksimum
This is definitely needed and DVC has a few cool ideas. I think the most
useful feature missing from existing tools is integrating data versioning with
git, and simple commands to tag, push, and pull data files.

I'm not sure I buy the pipeline and repro functionality as useful. I'd rather
see nice integration with Docker since it can be used to define the
environment as well as repro steps.

~~~
dmpetrov
Great point! `dvc pull/push mydataset` actually exists in DVC. But `dvc tag`
is missing, and we (the DVC team) now understand that this prevents DVC from
properly tracking datasets and ML models. It will be implemented soon.

There is an ongoing discussion about dataset tracking and tags in the DVC
GitHub issues -
[https://github.com/iterative/dvc/issues/1487](https://github.com/iterative/dvc/issues/1487) -
and some discussion in the DVC Discord channel -
[https://dvc.org/chat](https://dvc.org/chat)

------
AlexCoventry
How does this compare to pachyderm?

~~~
ishcheklein
From a very high-level perspective: Pachyderm is a data engineering tool
designed with ML in mind, while DVC is a tool to organize and version control
an ML project. Probably one way to think about it is Spark/Hadoop or Airflow
vs Git/GitHub.

~~~
AlexCoventry
Thanks.

