Show HN: DVC 1.0 release. 5 lessons from 3 years of building an open-source ML tool
19 points by dmpetrov 10 days ago | 4 comments
Hey HN, creator of DVC here!

DVC (https://dvc.org/) is known as Git for data projects. Technically, DVC codifies your data and machine learning pipelines as text metafiles (with pointers to the actual data in S3/GCP/Azure/SSH), while you use Git for the actual versioning. DevOps folks call this approach GitOps or, more specifically in this case, DataOps or MLOps.
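
To make the "text metafile with a pointer" idea concrete, here is a minimal sketch in Python. It is not DVC's actual metafile format or code: the function name `write_pointer`, the `.pointer.json` suffix, and the choice of SHA-256 are all assumptions made for illustration. The point is only that a large file can be represented in Git by a tiny text file that records its content hash.

```python
import hashlib
import json
from pathlib import Path


def write_pointer(data_path: Path) -> Path:
    """Write a small text pointer file next to data_path and return its path.

    Hypothetical helper for illustration only; not DVC's real format.
    """
    # Hash the file contents so the pointer identifies this exact version.
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    pointer = {
        "path": data_path.name,
        "sha256": digest,
        "size": data_path.stat().st_size,
    }
    # The pointer is plain text, so Git can version and diff it cheaply,
    # while the data itself lives in external storage.
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2, sort_keys=True) + "\n")
    return pointer_path
```

In a setup like this, only the pointer file would be committed to Git; the data file would be uploaded to remote storage under a key derived from its hash.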

We’ve been working towards 1.0 since we started 3 years ago. What began as my pet project now has 100+ code contributors, 100+ documentation contributors, and thousands of users.

Our community has taught us a lot. Here are some of the biggest lessons:

1. Users say the serverless and distributed nature of DVC (inherited from the underlying Git) is one of its "killer features".

2. To share ML projects within and between teams, it’s not enough to track only files and pipelines. You also need metrics, plot, and hyperparameter tracking. In DVC 1.0 we implemented hyperparameter, metrics, and plot diffs right from Git history.

3. In DataOps, data transfer optimization is huge: large deep learning models, millions of images in datasets, etc. We doubled down on optimizing data transfer in 1.0.

4. ML pipelines evolve faster than data engineering pipelines and need to be easy to change. In 1.0, we’ve simplified the pipeline metafile format.

5. More and more teams use DVC as a part of CI/CD for ML and other MLOps tools. DVC is used under the hood in the CD4ML tool that was described in the canonical post on Martin Fowler’s blog: https://martinfowler.com/articles/cd4ml.html. We built 1.0 with CI/CD users in mind.
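
As a rough illustration of the hyperparameter diffs mentioned in lesson 2, here is a minimal Python sketch that compares two parameter sets, such as ones loaded from two different Git commits. This is not DVC's implementation; `diff_params` is a hypothetical helper, and loading the parameters from Git history is left out.

```python
from typing import Any, Dict, Tuple


def diff_params(
    old: Dict[str, Any], new: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {name: (old_value, new_value)} for every parameter that changed.

    Hypothetical helper for illustration; a parameter missing on one side
    is reported with None for that side.
    """
    changes: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


# Example: the learning rate changed and dropout was added.
changed = diff_params(
    {"lr": 0.01, "epochs": 10},
    {"lr": 0.001, "epochs": 10, "dropout": 0.5},
)
# changed == {"dropout": (None, 0.5), "lr": (0.01, 0.001)}
```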

More details on https://dvc.org/blog/dvc-1-0-release.

Happy to answer any questions here or at DVC Discord chat https://dvc.org/chat.

How do you make money? Sorry to be that guy; I wouldn't ask this if you didn't claim "Happy to answer any questions here".

Good question! We build separate products (that use DVC) for monetization. No plans to monetize DVC.

An analogy: Git is free for versioning, while GitHub and GitLab are the monetized products.

Is it possible to use DVC with the new implementation of GitHub Actions? I checked the website and it looks like it is supported, but I wanted to know more about how you are getting ready for this new CI/CD feature.

We have a lot in store here: we are rolling out a new tool for CI/CD soon that works with GitHub Actions and GitLab CI. Adding run-cache to DVC 1.0 is just one way of preparing for more CI/CD uses of DVC. (FYI, I am part of the DVC team.)
