
Open-Source Version Control System for Machine Learning Projects - gilad
https://dvc.org/
======
latenightcoding
Be warned: if you star any of their repos they will find your email, add you
to their mailing list and spam you with announcements. Happened to me.

Email looks like this: Hi XXXX, I'm Dmitry Petrov - creator of the open-source
project DVC. Since you were interested in DVC ... spam, spam, spam

~~~
dmpetrov
Hi @latenightcoding, Dmitry Petrov here. Yes, I sent an email after we released
an open source product that complements DVC. I expected that this might be
interesting to DVC fans. I am very sorry if it was unpleasant for you.

~~~
quadrifoliate
As someone who has no knowledge or interest in your project, I will say that
starring can be anything from "I love this project" to "Looks interesting,
maybeee I'll take a look at this in 6 months". There is no guarantee of
fandom.

Finding the email of someone who starred your repo and auto-subscribing them
to your email list is in poor taste. It's the software version of that person
who suddenly adds you on Facebook, LinkedIn and Twitter the same day you
briefly said Hi to them at some event, and then sends you an email saying
"Hi!" just in case.

I would request you to please stop doing it, and not contribute to this
slippery slope of treating your open source project as a marketing exercise.

------
site-packages1
I have used this quite a bit and really enjoy it, but only for a very limited
use case: saving versioned datasets for supervised ML. Features I would like
to see added (or features I don't know exist): getting multiple datasets into
one place, i.e. I want datasets A, B, and C in one place for use, rather than
downloading them all separately and then combining them, mix and match! Also,
some built-in metadata about each dataset (class distribution, class map,
things like that), so I can intelligently pick which datasets I might need at
any given time. Combined, these two things would be like a model zoo, but for
data.
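
To make the wish above concrete, here is a minimal sketch in plain Python of what "combine A, B, and C, and show me the class distribution" could look like. Everything in it (the `Dataset` class, `combine`, the toy labels) is hypothetical and made up for illustration; it is not a DVC feature or API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical in-memory dataset: a name plus (sample_id, label) pairs."""
    name: str
    samples: list

    def class_distribution(self) -> Counter:
        # Count how many samples carry each label.
        return Counter(label for _, label in self.samples)


def combine(*datasets: Dataset) -> Dataset:
    """Concatenate several datasets into one, keeping every sample."""
    name = "+".join(d.name for d in datasets)
    samples = [s for d in datasets for s in d.samples]
    return Dataset(name=name, samples=samples)


# Toy data, invented for this example.
a = Dataset("A", [("a1", "cat"), ("a2", "dog")])
b = Dataset("B", [("b1", "cat"), ("b2", "cat")])
c = Dataset("C", [("c1", "bird")])

# Inspect the per-dataset metadata before deciding what to pull in.
for d in (a, b, c):
    print(d.name, dict(d.class_distribution()))

# Mix and match: one combined dataset with its own distribution.
abc = combine(a, b, c)
print(abc.name, dict(abc.class_distribution()))
```

The point is only that the metadata travels with the dataset, so you can look at the distribution before downloading or merging anything.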

~~~
chatmasta
(Disclaimer: Splitgraph co-founder)

> I want datasets A, B, and C in one place for use, rather than downloading
> them all separately and then combining them, mix and match

This is the use case we're focused on at Splitgraph [0]. We think data is most
interesting when combined with other data, so it's important to be able to
reference multiple datasets in one query.

The Splitgraph engine is powered by Postgres and allows you to build versioned
"data images," which are snapshots of databases (like Docker images are
snapshots of filesystems). You can "mount" multiple databases in one engine
[1] and query them with regular SQL, or you can write a Splitfile [2] that
references both of them when "importing" data into an image.

This can be really useful at times. We have a blog post about mounting
multiple databases and using dbt for cross-DB transformations. [3]

Ultimately, our goal is to build a "data delivery network": a single SQL
endpoint where you can query any databases connected to Splitgraph (whether
live, or cached using Splitgraph images).

[0] [https://www.splitgraph.com](https://www.splitgraph.com)

[1] [https://www.splitgraph.com/docs/ingesting-data/foreign-data-...](https://www.splitgraph.com/docs/ingesting-data/foreign-data-wrappers/introduction)

[2] [https://www.splitgraph.com/docs/concepts/splitfiles](https://www.splitgraph.com/docs/concepts/splitfiles)

[3] [https://splitgraph.com/blog/dbt](https://splitgraph.com/blog/dbt)

~~~
dmpetrov
It is important to know what A and B are. Splitgraph/dbt/Dolt can work if
these are structured tables in a SQL database, while DVC works if A and B are
files or directories.

------
hartem_
It’s a really nice and useful project. It makes it easy to deal with large
files (models) and use familiar GitHub workflows for managing code and data
together. Their documentation is also pretty good.

~~~
ishcheklein
Thanks, the documentation part warms my heart - it's been quite an important
part since day zero.

------
amitport
I've got to comment on the intro video.

I think it tells too much about the problems and not enough about what the
product is and _how_ it's going to solve these problems.

------
stochastastic
I have thought about using DVC at work, but I have held off in part because it
isn’t clear to me how the development is supported financially. I recall
seeing a job posting (almost put in a resume!), but I don’t see any pricing
info, so I’m a bit confused. Am I missing something obvious?

~~~
ishcheklein
Hi! Great question. It started as a pet project, but it is now supported by
and belongs to a VC-funded entity, Iterative. It also has independent
contributors; it's an Apache 2.0 project after all. In terms of long-term
business goals, I would say you can compare us to HashiCorp. There are no
plans to monetize or even build premium features in the DVC or CML.dev
projects; rather, we'll be building an enterprise layer on top of them. I
really hope that clears up your concerns. If not, I would be happy to answer
any questions.

~~~
stochastastic
Thanks!

------
spicyramen
Looks interesting. Is there a similarity with MLflow from Databricks?

~~~
ishcheklein
Hey! DVC maintainer here. I would say they overlap to some extent, but mostly
complement each other for now. And that's what we see across our user base.
MLflow is used mostly as an ML logger - you insert some Python code into your
scripts and log everything related to an experiment to MLflow. DVC, on the
other hand, is strong at managing and versioning data and models with a
Git-like experience.
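
For readers wondering what "Git-like" versioning of large files can mean in practice, here is a rough sketch of the general idea: keep the big file in a content-addressed cache and commit only a small pointer file to Git. This is an illustration under stated assumptions, not DVC's actual implementation; the `snapshot` helper, the `.ptr` suffix, and the cache layout are all hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash a file in chunks so large models do not need to fit in memory."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into the cache under its hash and write a pointer file.

    The pointer file is tiny, so it is the part you would commit to Git.
    """
    digest = file_md5(data_file)
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.parent / (data_file.name + ".ptr")  # hypothetical suffix
    pointer.write_text(f"md5: {digest}\npath: {data_file.name}\n")
    return pointer


# Toy usage in a temporary directory.
workdir = Path(tempfile.mkdtemp())
cache = workdir / "cache"
model = workdir / "model.bin"
model.write_bytes(b"weights v1")

pointer = snapshot(model, cache)
print(pointer.read_text())
```

Checking out an older Git commit then gives you an older pointer file, and the matching data can be restored from the cache by its hash.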

