Hey there! I’m one of the maintainers of the DVC project; we just released version 1.0 - https://dvc.org/blog/dvc-1-0-release. I have a software engineering background, and I joined the DVC team when I started doing some ML and realized that there is no process - no version control, no reproducibility, no CI, no CD, and deployment is hard. Here I’d like to share some thoughts on how we think this can be solved with Git.
Across ML, we see teams solving the same problems over and over again (much like DevOps folks were writing bash scripts over and over before Terraform). Where do we put our data files and ML models? How do we name them? How do we get them into prod? How do we roll back? How do we know how a model was built? How is the model performing? What about CI/CD for ML - how do we test a model’s performance before releasing it? How do we test datasets? Etc., etc.
Ironically, DVC (Data Version Control) is not a version control system :) It applies the GitOps philosophy to ML and data, "codifying" data files, data pipelines (think Makefiles for data), and ML experiment “telemetry” in simple declarative YAML. It provides commands to capture state, move and sync files efficiently across your local environment, training machines, and cloud storage, “diff” experiments, etc.
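To make that concrete, here’s a minimal sketch of what a codified pipeline can look like in a `dvc.yaml` file (the stage name, script, and file paths are made up for illustration):

```yaml
# dvc.yaml - a declarative pipeline description, versioned in Git
stages:
  train:
    cmd: python train.py    # hypothetical training script
    deps:
      - data/train.csv      # input data, tracked by DVC rather than Git
      - train.py
    outs:
      - model.pkl           # trained model, cached and synced by DVC
```

With something like this in the repo, `dvc repro` re-runs only the stages whose dependencies changed, and your Git history records exactly how each model was produced.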
GitOps is not a new idea, but the term is catching on thanks to projects that use a Git-based workflow for managing infrastructure. One of the most famous examples is HashiCorp’s tool, Terraform. The essence is to capture infrastructure state in a declarative notation and treat it as code - which means you can version, share, and deliver it with Git whenever you need, use CI to detect changes, and so on. Just a whole bunch of benefits!
So, DVC is more of a Terraform for data. Add Git to this, and we have a powerful mechanism with all the benefits of the GitOps ecosystem applied to data and ML.