Data Version Control (dvc.org)
161 points by HerrMonnezza on Oct 2, 2022 | 59 comments



DVC had the following problems when I tested it (about half a year ago):

It gets super slow (waiting minutes) when a few thousand files are tracked. Thousands of files have to be tracked if you have, e.g., a 10GB file per day and region, plus the artifacts generated from it.

You are encouraged (it can only track artifacts) if you model your pipeline in DVC (think make). However, it cannot run tasks in parallel, so running a pipeline takes a lot of time even though you are on a beefy machine and only one core is used. Obviously, you cannot run other tools (e.g. Snakemake) to distribute/parallelize over multiple machines. Running one (part of a) stage also has some overhead, because DVC does checks before and a commit after running the task's executable.
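
(For anyone unfamiliar: a DVC pipeline is described in a dvc.yaml file that works roughly like a Makefile, and `dvc repro` then executes the stages one after another. A minimal sketch, with made-up stage and file names:)

    # sketch only: stage names, scripts, and paths are hypothetical
    cat > dvc.yaml <<'EOF'
    stages:
      prepare:
        cmd: python prepare.py raw/2022-01-01.parquet data/prepared
        deps:
          - prepare.py
          - raw/2022-01-01.parquet
        outs:
          - data/prepared
      train:
        cmd: python train.py data/prepared models/model.pkl
        deps:
          - train.py
          - data/prepared
        outs:
          - models/model.pkl
    EOF
    dvc repro   # runs prepare, then train, sequentially on a single core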

Sometimes you get merge conflicts if you run one part of a (partially parametrized) stage on one machine and the other part on another machine manually. These are cumbersome to fix.

Currently, I think they are more focused on ML features like experiment tracking (I prefer other, more mature tools here) than on performance and data safety.

There is an alternative implementation from a single developer (I cannot find it right now) that fixes some of these problems. However, I do not use it because it probably will not have the same development progress and testing as DVC.

This sounds negative, but I think it is currently one of the best tools in this space.


You might be referring to me/Dud[0]. If you are, first off, thanks! I'd love to know more about what development progress you are hoping for. Is there a specific set of features that bar you from using Dud? As far as testing, Dud has a large and growing set of unit and integration tests[1] that are run in Github CI. I'll never have the same resources as Iterative/DVC, but my hope is that being open source will attract collaborators. PRs are always welcome ;)

[0]: https://github.com/kevin-hanselman/dud

[1]: https://github.com/kevin-hanselman/dud/tree/main/integration...


> You are encouraged if you model your pipeline in DVC.

Encouraged to do what?

You might want to slow down on the use of parentheses; we are both getting lost in them.


I assume they meant to say "you are encouraged to use DVC to run your model and experiment pipeline". They want to encourage you to do this because they are trying to build a business around being a data science ops ecosystem. But the truth is that DVC is not a great tool for running "experiments" searching over a parameter space. It could be improved in that regard, but that's just not what I use it for, nor what I recommend it to other people for.

However, it's fantastic for tracking artifacts throughout a project that have been generated by other means, for keeping those artifacts tightly in sync with Git, and for making it easy to share those artifacts without forcing people to re-run expensive pipelines.
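
The day-to-day flow for that is pretty simple. Roughly (remote and file names made up):

    # producer side: track an artifact and share it
    dvc add models/model.pkl              # writes models/model.pkl.dvc and gitignores the binary
    git add models/model.pkl.dvc models/.gitignore
    git commit -m "Track trained model with DVC"
    dvc push                              # upload the binary to the configured DVC remote
    git push

    # consumer side: get exactly the artifact recorded in the commit
    git pull
    dvc pull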


> But the truth is that DVC is not a great tool for running "experiments" searching over a parameter space.

Would love your feedback on what's missing there! We've been improving it lately, e.g.:

- Hydra support https://dvc.org/doc/user-guide/experiment-management/hydra

- VS Code extension - https://marketplace.visualstudio.com/items?itemName=Iterativ...


Last I checked it wasn't easy to use something like optuna to do hyperparameter tuning with hydra/DVC.

Ideally I'd like the tool I use for data versioning (DVC/git-lfs/git-annex) to be orthogonal to the one I use for hyperparameter sweeping (DVC/Optuna/SageMaker Experiments), orthogonal to the one I use for configuration management (DVC/Hydra/plain YAML), and orthogonal to the one I use for experiment DAG management (DVC/Makefile).

Optuna is becoming very popular in the data science / deep learning ecosystem at the moment. It would be great to see more composable tools, rather than having to go all-in on a given ecosystem.

Love the work that DVC is doing to tackle these difficult problems, though!


Big +1 about composability and orthogonality. I don't want one "do it all" tool, I want a collection of small tools that interoperate nicely. Like how you can use Airflow and DBT together, but neither tool really tries to do what the other one does (not that Airflow is "small", but still).


DVC is great for use cases that don't get to this scale or have these needs. And the issues here are non-trivial to solve. I've spent a lot of time figuring out how to solve them in Pachyderm, which is good for use cases where you do need higher levels of scale or might run into merge conflicts with DVC. There are trade-offs, though. DVC is definitely easier for a single developer / data scientist to get up and running with.


I think it's worth noting that DVC can be used to track artifacts that have been generated by other tools. For example, you could use MLFlow to run several model experiments, but at the end track the artifacts with DVC. Personally I think that this is the best way to use it.

However, I agree that in general it's best for smaller projects and use cases. For example, it still shares the primary deficiency of Make in that it can only track files on the file system, and not things like ensuring a database table has been created (unless you 'touch' your own sentinel files).
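
The sentinel-file workaround can itself be wired into a DVC stage; a rough sketch (stage name, script, and paths are hypothetical):

    # give DVC an on-disk file to hash for a non-file side effect
    dvc stage add -n load_table \
        -d load_table.py \
        -o .stamps/table_loaded \
        'python load_table.py && mkdir -p .stamps && touch .stamps/table_loaded'
    dvc repro load_table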


The alternative tool you are referring to is `Dud`, I believe.

DVC is the best tool I found, in spite of being dead slow and complex (it tries to do many things).

What alternatives would you recommend?


What’s best if parallel step processing is required?


Yeah, we had a lot of problems with things getting out of sync, and we just got tired of it.


The package phones home. One has to set an env var or fix several lines of code to prevent that.



I think their plan was/is to make money on corporate licenses and support, as well as SaaS/cloud products.


They won't; they can only make the investors' money back by selling the company to Amazon/Microsoft/Google, but in this economy that won't happen.


Hey, yes, we've decided to keep it opt-out for now and it collects fully anonymized basic statistics. Here is the full policy: https://dvc.org/doc/user-guide/analytics .

It should be easy to opt out through `dvc config core.analytics false` or the env variable `DVC_ANALYTICS=False`.

Could you please clarify the `several lines of code` part? We were trying to make it very open and visible what we collect (it prints a large message when it starts) and to make it easy to disable.


This seems pretty anti-user, since most users prefer opt-in. It seems pretty shady to keep behavior that users don't like and that potentially harms them (you think it's fully anonymized).

That's your prerogative, as it's your project, but it makes me wonder what else you're doing that's against users' best interest and in your own.


We are fully aware that it raises concerns. Trust me, it hurts my feelings as well. E.g., on the websites (dvc.org, cml.dev, etc.) we don't use any cookies, GA, etc.

We've tried to make it as open as possible: the code is available (it's open source), we write openly about this at the very start, we have a policy online, and we made it easy to opt out. If you have other ideas on how to make it even more friendly, more visible, etc., please let us know.

Still, we've preferred so far to keep it opt-out since it's crucial for us to see major product trends (which features are being used more, product growth MoM etc). Opt-in at this stage realistically won't give us this information.


Yet there are many successful projects that don’t collect this information. So it’s not crucial for them but is crucial for you.

I think the challenge I have is that since you're getting IP addresses, there will be an opportunity for abuse. And there seems to be some rule that any data that can be misused will eventually be misused.

Since you’re not willing to make it opt-in, I think perhaps the only other way would be to support an automated distro that doesn’t include it so users are at least able to easily choose a version.

I admire you for responding to this thread and me as it’s definitely not easy. I just feel like one of the main benefits of open source is its alignment with user benefits so it’s discouraging when an open source project chooses code that users don’t want.


Right, many projects use opt-in, but there are many that have opt-out, too:

- https://docs.brew.sh/Analytics

- https://docs.npmjs.com/policies/privacy#how-does-npm-collect...

- VS Code, etc.

> I think the challenge I have is that since you’re getting IP address that will be an opportunity to abuse.

Yes! And we are migrating to a new package / infrastructure because of this: https://github.com/iterative/telemetry-python . DVC's sister tool MLEM is already on it, and it does not send (or save) IP addresses, nor does it use GA or any other third-party tools; data is saved into BigQuery, and eventually we'll make it publicly accessible (https://mlem.ai/doc/user-guide/analytics) to be fully GDPR compliant. The current behavior is a legacy system that DVC had in place; there was no intention to use those IP addresses in any way.

> I think perhaps the only other way would be to support an automated distro that doesn’t include it so users are at least able to easily choose a version.

Thanks. To some extent a brew-like policy (not sending anything significant before there is a chance to disable it, with a clear, explicit message) should mitigate this, but I'll check whether it works this way now and whether it can be improved.


I wonder what the GDPR implications of this are. I note other projects (e.g. Cura) switched their telemetry to opt-in.

https://github.com/Ultimaker/Cura/issues/2810


If you just want a git for large data files, and your files don't get updated too often (e.g. an ML model deployed in production which gets updated every month) then git-lfs is a nice solution. Bitbucket and Github both have support for it.
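
The setup for that is minimal; roughly (the file pattern is just an example):

    git lfs install
    git lfs track "*.onnx"                # route matching files through LFS
    git add .gitattributes model.onnx
    git commit -m "Store model via git-lfs"
    git push                              # the binary goes to the LFS store, not regular git objects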


I've used both extensively. Git-lfs has always been a nightmare. Because each tracked large file can be in one of two states - binary, or "pointer" - it's super easy for the folder to get all fouled up. It would be unable to "clean" or "smudge", since either would cause some conflict. If you accidentally pushed in the wrong state, you could "infect" the remote and be really hosed. I had this happen numerous times over about 2 years of using lfs, and each time the only solution was some aggressive rewriting of history.

That, combined with the nature of re-using the same filename for the metadata files, meant that it was common for folks to commit the binary and push it. Again, lots of history rewriting to get git sizes back down.

Maybe solutions to my problems exist, but I spent hours wrestling with it trying to fix these bad states, and it caused me much distress.

Also configuring the backing store was generally more painful, especially if you needed >2GB.

DVC was easy to use from the first moment. The separate meta files mean that it can't get into mixed clean/smudge states. If you aren't in a cloud workflow already, the backing store is a bit tricky, but even without AWS I made it work.
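
For reference, pointing DVC at a non-AWS backing store can be as simple as (remote name and path hypothetical):

    # use a shared directory (e.g. an NFS mount) as the default DVC remote
    dvc remote add -d shared /mnt/nfs/dvc-store
    git add .dvc/config
    git commit -m "Configure DVC remote"
    dvc push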


We resolve this in two ways:

1. All git-lfs files are kept in the same folder

2. No one can directly push commits to one of the main branches; they need to raise a PR. This means that commits go through review, it's easy to tell if they've accidentally committed a binary, and we can just delete their branch from the remote, bringing the size back down.
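
(Re: point 1, the tracking rule is a one-liner; the folder name is hypothetical:)

    # only paths under data/ go through LFS
    git lfs track "data/**"
    git add .gitattributes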


I think the one thing that DVC does a bit better than git-lfs is that DVC doesn't keep the files directly in the repo. DVC puts a pointer file with a path and a hash of the file (to detect changes) in the repo instead. As far as I can tell, git-lfs only keeps them in the .git path of the repo.
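
The pointer file is just a tiny YAML file committed to git; it looks roughly like this (hash and names are made up):

    cat data.csv.dvc
    # outs:
    # - md5: 22a1a2931c8370d3aeedd7183606fd7f
    #   size: 1048576
    #   path: data.csv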

For example, I think CodeOcean might use git-lfs under the hood but handles upload/download separately from the UI. In the sample below, you can clone the repo from the Capsule menu, but data and results are downloadable from a contextual menu available on each, respectively.

https://codeocean.com/capsule/2131051/tree/v1


I do feel like git-lfs is a good solution. But once you have 10s or 100s of GB of files (e.g. a computer vision project), it gets pretty pricey.

Ideally I'd love to use git-lfs on top of S3 directly. I've looked into git-annex and various git-lfs proxies, but I'm not sure they're maintained well enough to trust them with long-term data storage.

Hugging Face datasets are built on git-lfs, and it works really well for them for storing large datasets. Ideally I'd love for AWS to offer this as a hosted thin layer on top of S3, or for some well-funded or well-supported community effort to do the same, in a performant way.

If you know of any such solution, please let me know!


Have you tested Weights & Biases Artifacts[1]?

It comes with a smart versioning approach, detects changes (the Δ) based on checksums, and has a feature to visualize the lineage.

You can also use your existing object store and link it for very large / sensitive data.[2]

Disclaimer: I work at W&B.

[1]: https://docs.wandb.ai/guides/data-and-model-versioning/model... [2]: https://docs.wandb.ai/guides/artifacts/track-external-files#...


+1. git-lfs is sufficient for tracking binaries, including an ML model, at that cadence.

Thinking more abstractly, there is a benefit to code and data living "next" to each other, if possible: atomically committed to one codebase, with the latter loaded / used by the former without connecting to yet another workflow.


It seems to be the solution Hugging Face have picked too.


Can anyone compare this to DataLad [1], which someone introduced to me as "git for data"?

[1]: https://www.datalad.org/



Dolt is for tabular data. It's like SQLite but with branching and versioning at the DB level. DVC is file-based: it saves large files, directories, etc. to one of the supported storages (S3, GCP, Azure, etc.). It's more like git-lfs in that sense.

Another difference is that for DVC (surprisingly) data versioning itself is just one of the fundamental layers needed to provide holistic ML experiment tracking and versioning. So DVC has a layer to describe an ML project, run it, and capture and version its inputs/outputs. In that sense DVC becomes a more opinionated / higher-level tool, if that makes sense.


It doesn't use git-annex like DataLad does. That alone is a huge benefit, given the state of that tool.


I'm curious, what's the problem with git-annex?

I've considered using it before as an alternative to Git LFS.


things that I don't like about it:

* git diff doesn't work in any sensible way

* if you forget and do `git add` instead of `git annex add`, everything is fine, but you've now spoilt the nice thing that git annex does of de-duping files. (git annex only stores one copy of identical files)

* for our use case (which I'm sure is the wrong way of doing things) it's possible to overwrite the single copy of a file that git annex stores, which rather spoils the point of the thing. I do think it's down to the way we use it, though, so not specifically a git annex problem

The _great_ thing about git annex is it can be self-hosted. For various reasons we can't put our source data in one of the systems that uses git-lfs.

We've got about 800 GB of data in git annex and I've been happy with it despite the limitations.


If you configure annex.largefiles, git add should work with the annex. I start with something like:

    git annex config --set annex.largefiles 'largerthan=1kb and not (mimeencoding=us-ascii or mimeencoding=utf-8)'

> By default, git-annex add adds all files to the annex (except dotfiles), and git add adds files to git (unless they were added to the annex previously). When annex.largefiles is configured, both git annex add and git add will add matching large files to the annex, and the other files to git. —https://git-annex.branchable.com/git-annex/

Note that git add will add large files unlocked, though, since (as far as I understand) it’s assumed you’re still modifying them for safety:

> If you use git add to add a file to the annex, it will be added in unlocked form from the beginning. This allows workflows where a file starts out unlocked, is modified as necessary, and is locked once it reaches its final version. —https://git-annex.branchable.com/git-annex-unlock/


Yes, it definitely serves a valid use case; I feel like someone should try to bring some competition there. A modern equivalent with fewer gotchas, maybe in Rust/Go, maybe using a FUSE mount and content-defined chunking (borg/restic/...-style), would be amazing.


I'd love to see a well-supported, git-lfs-compatible client/proxy (so you could more easily move backends) that could run on top of S3/object storage. Yes, and written in a modern language like Go/Rust for performance / parallelism. There are some Node.js and various other git-lfs proxies out there, but none well enough maintained that I could count on them being around and working in another 5 years. git-annex at least has been around for a while, even though it has its issues.

Huggingface uses git-lfs for large datasets with good success. git-lfs on GitHub gets very pricey at higher volumes of data. Would love the affordability of object storage, just with a better git blob storage interface, that will be around in the future.

Most of these systems do their own hash calculations and are not interchangeable with each other. I feel like git-lfs has the momentum in data science at the moment, but it needs some better options for people who want a low-cost storage option that they can control.

Hugging Face is great, but it's one more service to onboard if you're in an enterprise. And data privacy/retention/governance means that many people would like their data to reside on their own infrastructure.

If AWS were to give us a low cost git-lfs hosted service on top of S3 it would be very popular.

If anyone knows of some good alternatives, please let us know!


Did some more research to see if anything had changed in this space. I found two interesting projects (haven't used them myself yet though):

One in C# (with support for auth):

https://github.com/alanedwardes/Estranged.Lfs

One in Rust (but no auth; you have to run a reverse proxy):

https://github.com/jasonwhite/rudolfs

Both seem interesting. Anyone use these?


I work with a lot of uncompressed structured binary files so I finally broke down and wrote my own system based on the Restic chunker: https://github.com/akbarnes/dupver It's pretty basic, but it works for me and will hopefully inspire someone to make a "real" data VCS based on content-defined chunking.


It lives in this weird wiki that seems to be read-only most of the time. I don't think it's alive. Its use of hard links also causes too many problems, of the silent corruption variety.


Ikiwiki’s definitely a bit weird, but I’ve been experimenting with git-annex recently and it worked fine every time I commented. Seems like it’s chugging along: https://git-annex.branchable.com/recentchanges/

When does it use hard links? As far as I remember it used symlinks unless you used something like annex.hardlink (described in the man page: https://git-annex.branchable.com/git-annex/)


Symlinks are just as problematic honestly, an app writing to it will change the object in the persistent "immutable" storage. The way the "check out" feature works is also weird, causing a change in the shared version history.


> Symlinks are just as problematic honestly, an app writing to it will change the object in the persistent "immutable" storage.

Well, anything stored by git-annex has read-only file permissions. Apps will follow the symlink, yes, but they will fail to write to the location if they try.

> The way the "check out" feature works is also weird, causing a change in the shared version history.

Unlocking a file changes it from a symlink to a git-annex pointer file from git’s perspective (git-annex accomplishes this via git’s smudge filter interface), but you don’t have to commit the unlock. You can unlock, modify locally, re-lock, and commit the new changed version in one go. It’s nice that you can commit the unlocking action itself if you want a file to be unlocked in all clones of the repository. You can choose whether to commit the unlock depending on if it fits your use case.
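
Roughly, that cycle looks like this (file and script names made up):

    git annex unlock big.dat         # the symlink becomes a regular, writable file
    ./regenerate.sh big.dat          # modify it (hypothetical step)
    git annex add big.dat            # store the new content in the annex
    git commit -m "Update big.dat"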

For curious readers, https://git-annex.branchable.com/tips/unlocked_files/ discusses these topics in more detail.


What's wrong with git-annex? My work has been using it for almost 10 years to manage 40TB+ of data. It's always been rock solid.


If you're looking for something that actually tracks tabular data, there's https://kartproject.org. It's geo-focused but also works with standard database tables. It's built with git (Kart repos are git repos) and can track PostgreSQL, MSSQL, MySQL, etc.


Can it be used for large and fast-changing datasets?

Example: 100 TB, writes every 10 mins.

Or 1 TB of Parquet, 40% of which is rewritten daily.


DVC is expressly for tracking artifacts that are files on disk, and only by comparing their MD5 hashes. So it can definitely track the parquet files, but you are not going to get row or field diffs or anything like that.

Maybe Pachyderm or Dolt would be better tools here.


Why would you use MD5 in anything written in the last 5 years? The SHA family is faster on modern hardware and there aren't trivial collisions floating around out there.


It was definitely a bad choice. I wasn't there, so I can only speculate. My guess is that it is sort of ubiquitous and thus low-hanging fruit and the devs didn't know better, or, the related corollary, that it's what S3 uses for ETags, so it probably seemed logical. Either way, it seems like someone did it without knowing better, no one agrees on a fix or whether a change is even necessary, and thus it's stuck for now.

There's an ongoing discussion about replacing/configuring the hash function, and it looks like there might be some movement toward replacing the hash, plus other speedups, in 3.0:

https://github.com/iterative/dvc/issues/3069

> We not only want to switch to a different algorithm in 3.0, but to also provide better performance/ui/architecture/ecosystem for data management, and all of that while not seizing releases with new features (experiements, dvc machine, plots, etc) and bug fixes for 2.0, so we've been gradually rebuilding that and will likely be ready for 3.0 in the upcoming months. - https://github.com/iterative/dvc/issues/3069#issuecomment-93...


Don't quote me on the specific hash algorithm; maybe it's SHA. The point is that it's just comparing modification times and hashes.


What about Apache Iceberg for those?


I don't think this tool can encompass everything you need in managing ML models and data sets, even if you limit it to versioning data.

I'd need such a tool to manage features, checkpoints and labels. This doesn't do any of that. Nor does it really handle merging multiple versions of data.

And I'd really like the code to be handled separately from the data; Git is not the place to do this. The choice of picking pairs of code and data should happen at a higher level and be tracked along with the results. That's not going in a repo; MLflow or TensorBoard handles it better.


How do you merge multiple versions of data using tensorboard? Or what other tool handles that for you?

What's the case for handling code and data separately? In my experience, the primary motivation for using such a tool are easy reproducibility through easy tracking of code, hyperparams, and data. It's not obvious to me how that goal would be advanced by tracking code and data separately.


TensorBoard doesn't do that; I was referring to things a dataset/model management tool should do. For us, TensorBoard tracks the datasets as hyperparams. The actual multiple versions of data end up being handled on the warehouse side; Prefect is what we use to run the DAGs that make the different versions.

Handling code and data separately is important to allow easy updates to one or the other. They are loosely coupled to allow quicker updates, rather than having to increment versions on both as with DVC. DVC is also far heavier weight, since it pulls the data referenced in the .dvc files and you have to pick out on the CLI which ones you want.

Downloading to a local cache as required from your actual scripts works much better. It's just like what transformers does for pre-trained models.


I forgot to say thanks regarding this!

> Tensorboard tracks the datasets as hyperparams.

Clever!

> Warehouse side .. Prefect

I'll have to check out warehouse-side things and Prefect to see what you mean.

Appreciate all the pointers!


What value does this provide that I can't get by versioning my data in partitioned parquet files on s3?


I think Parquet won't help with images, video, or ML models.

Also, it is one thing to physically provide a way to version data (e.g. partitioned Parquet files, cloud versioning, etc.), but another to also have a mechanism for saving / codifying the dataset version into the project. E.g., to answer the question of which version of the data a given model was built with, you would need to save some identifier / hash / list of the files that were used. DVC takes care of that part as well.

(It also has mechanisms to cache data that you download, Makefile-like pipelines, etc.)
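
Concretely, because the small .dvc / dvc.lock files live in git, getting back the data that matches an old model version is roughly (tag name made up):

    git checkout v1.0    # switch the code and the recorded data hashes
    dvc checkout         # restore matching data from the cache (or `dvc pull` to fetch it first)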



