Hey HN, creator of DVC here!
DVC (https://dvc.org/) is known as Git for data projects. Technically, DVC codifies your data and machine learning pipelines as text metafiles (with pointers to the actual data in S3/GCP/Azure/SSH), while you use Git for the actual versioning. DevOps folks call this approach GitOps or, more specifically in this case, DataOps or MLOps.
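To make the "text metafile with a pointer" idea concrete, here is a minimal Python sketch. It is an illustration of the general technique (content-addressed pointers that are small enough to commit to Git), not DVC's actual implementation. The function names, the `.pointer.json` suffix, and the JSON layout are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large dataset never has to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small text metafile that identifies data_path by content hash.

    The ".pointer.json" suffix and the JSON layout are hypothetical,
    chosen for illustration; they are not DVC's metafile format.
    """
    pointer = {
        "path": data_path.name,
        "md5": file_md5(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2) + "\n")
    return pointer_path
```

The pointer file is a few lines of text, so Git can version it cheaply, while the data itself lives in remote storage and can be looked up by its hash.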
We’ve been working towards 1.0 since we started 3 years ago. What began as my pet project now has 100+ code contributors, 100+ documentation contributors, and thousands of users.
Our community has taught us a lot - here are some of the biggest lessons:
1. Users say the serverless and distributed nature of DVC (inherited from the underlying Git) is one of its "killer features".
2. To share ML projects within and between teams, it’s not enough to track only files and pipelines. You also need to track metrics, plots, and hyperparameters. In DVC 1.0 we implemented hyperparameter, metrics, and plot diffs right from Git history.
3. In DataOps, data transfer optimization is huge: think large deep learning models, datasets with millions of images, etc. We doubled down on optimizing data transfer in 1.0.
4. ML pipelines evolve faster than data engineering pipelines and need to be easy to change. In 1.0, we’ve simplified the pipeline metafile format.
5. More and more teams use DVC as part of CI/CD for ML and inside other MLOps tools. DVC is used under the hood in the CD4ML tool described in the canonical post on Martin Fowler’s blog: https://martinfowler.com/articles/cd4ml.html. We built 1.0 with CI/CD users in mind.
More details at https://dvc.org/blog/dvc-1-0-release.
Happy to answer any questions here or in the DVC Discord chat at https://dvc.org/chat.