I have built version control for data, on top of git itself, that can commit and push incremental diffs. Tagging in git creates a version snapshot. S3 can be configured (a) for heavy files and diffs referenced by pointer objects and (b) for shareable snapshots, identified by repo, tag name, and commit SHA. The diffs operate at row, column, and cell level, not by block deduping. Datasets must have some tabular structure. Data goes wherever it is pushed, and to the user's S3 if configured.
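To make the pointer idea concrete, here is a rough Python sketch of what a pointer object and its S3 key could look like; the field names and key layout are my own illustration, not the exact format used by the tool.

    import hashlib
    import os

    def make_pointer(repo: str, tag: str, commit_sha: str, local_path: str) -> dict:
        """Build a small JSON-able pointer that git tracks in place of the heavy file."""
        with open(local_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        return {
            # shareable address: repo / tag / commit SHA / relative file path
            "s3_key": f"{repo}/{tag}/{commit_sha}/{local_path}",
            "sha256": digest,                        # integrity check when pulling
            "size_bytes": os.path.getsize(local_path),
        }

    # The pointer (a few hundred bytes) is what gets committed; the file itself
    # is uploaded to s3_key and fetched only when a snapshot is checked out.
    # Example: make_pointer("demo-repo", "v1.2", "a1b2c3d", "datasets/users.csv")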
The burden of checking out and building snapshots from diff history is currently borne by localhost, but that may change. Smart navigation of git history from the nearest available snapshots, building snapshots with Spark, and other ways to save on data transfer and compute are all on the table. Merge conflict resolution is in the works. This paradigm also enables hibernating or cleaning up history on S3 for datasets that are no longer needed to create snapshots, for example datasets removed from git whose earlier snapshots nobody needs anymore. Individual data entries could also be removed for GDPR compliance using versioning on S3 objects, orthogonal to git.
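For a feel of the "nearest available snapshot" idea, a minimal sketch: walk back from the target commit to the closest materialized snapshot, then replay the intervening diffs forward. The helpers has_snapshot, load_snapshot, and apply_diff are assumptions standing in for the real engine and S3 calls.

    from typing import Callable, List
    import pandas as pd

    def rebuild(commits: List[str],                  # history, oldest -> newest
                target_sha: str,                     # commit to materialize
                has_snapshot: Callable[[str], bool],
                load_snapshot: Callable[[str], pd.DataFrame],
                apply_diff: Callable[[pd.DataFrame, str], pd.DataFrame]) -> pd.DataFrame:
        """Walk back to the nearest stored snapshot, then replay diffs forward."""
        idx = commits.index(target_sha)
        # 1. find the nearest ancestor commit that already has a materialized snapshot
        #    (assumes at least one ancestor, e.g. the first commit, has one)
        base = next(i for i in range(idx, -1, -1) if has_snapshot(commits[i]))
        table = load_snapshot(commits[base])
        # 2. apply each committed diff between that snapshot and the target commit
        for sha in commits[base + 1: idx + 1]:
            table = apply_diff(table, sha)
        return table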
The prototype already cures the pain point I built it for: it was impossible to (1) uniquely identify and (2) make available behind an API multiple versions of a collection of datasets and config parameters, (3) without overburdening HDDs due to small but frequent changes to any of the datasets in the repo, and (4) while being able to see the diffs in git for each commit, to enable collaborative discussion, reverting, or further editing where necessary. Some background: I am building natural language AI algorithms that (+) operate on editable training datasets, meaning changes or deletions in the training data are reflected fast, without traces of past training and without retraining the entire language model (I know this sounds impossible), and (++) explain decisions back to individual training data. LLMs have fixed training datasets, whereas editable datasets call for a collaborative system to manage data efficiently.
I am open to everything including thoughts, suggestions, constructive criticism, and use case ideas.
Dolt came up when I searched for git for data; it seems great, though I have never used it. I know it works on prolly trees rather than on top of git. I am really curious to learn about that choice: why exactly not on git? How can you offer data removal from history without rebuilding repos? Especially here in the EU, there is an ongoing conversation about structured data documentation, collaboration, and tech design choices that affect privacy.
We wanted a solution that scales up to terabytes and has fast database access. Storing giant CSV files in git natively doesn't scale, and if you store them in S3 you lose the fast diff and all merge capability.
GDPR does require rebasing in some cases, as near as we can figure. There are some creative ways to avoid taking an outage during this rebase, or other creative schemes for storing all PII in non-versioned tables. We haven't built any of that yet, though; nobody has asked for GDPR support.
Important points. I aim for version control for data repositories with HDD efficiency, visualization of diffs for collaboration, and API access to individual datasets from multiple identifiable versions in git. Datasets can then be streamed into a database individually. Migrating by processing diffs from commits is a future possibility. Fast, direct, database-like repository access with in-situ editing is not my initial goal; rather, I want a clonable git repository that is separate from the database working areas, can be connected to MySQL through I/O pipelines, and can easily export dataset versions as labeled snapshots.
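As an illustration of the MySQL I/O pipeline idea, a minimal sketch using pandas and SQLAlchemy to stream a labeled snapshot into a table in chunks; the connection string, table name, and paths are placeholders, not part of the actual tool.

    import pandas as pd
    from sqlalchemy import create_engine

    # placeholder connection string
    engine = create_engine("mysql+pymysql://user:password@localhost/analytics")

    def load_snapshot(csv_path: str, table: str, chunksize: int = 100_000) -> None:
        """Stream an exported snapshot CSV into a MySQL table chunk by chunk."""
        for chunk in pd.read_csv(csv_path, chunksize=chunksize):
            chunk.to_sql(table, engine, if_exists="append", index=False)

    # Example: load_snapshot("snapshots/demo-repo/v1.2/users.csv", "users_v1_2")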
On diffs: the diffing workload is lazy and needed just once, upon requesting a commit for a repo. The bidirectional diff results (including index and column changes), not the newly changed files, are then committed as CSV or pointer objects; git natively treats such objects as new. I had to write an engine to rebuild and serve snapshots from git history; it runs on localhost for now, posting to S3, and later also in the cloud. Diffing a dataset (CSV, Excel, SQL) against the currently checked-out version, which lives in a gitignored "datasets" folder in the working directory, currently takes ~20 seconds on a 1 GB CSV dataset with 10M rows. Diffing is not always needed: it can be bypassed with incremental workflows (new data daily) by committing just the diff. I can handle repos of tens of GBs, with individual datasets of GBs each. Where to put the compute workload of diffing, checking out, and building snapshots is under careful consideration.
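For those wondering what a row/column/cell-level diff looks like in practice, here is a heavily simplified pandas sketch assuming each dataset has a stable key column; the actual engine additionally records the changes bidirectionally and writes the result out as CSV or pointer objects.

    import pandas as pd

    def diff_tables(old: pd.DataFrame, new: pd.DataFrame, key: str) -> dict:
        """Row/column/cell-level diff of two versions of a table keyed by `key`."""
        old, new = old.set_index(key), new.set_index(key)
        common = old.index.intersection(new.index)
        shared_cols = old.columns.intersection(new.columns)
        # cell-level changes on rows and columns present in both versions
        changed = old.loc[common, shared_cols].compare(new.loc[common, shared_cols])
        return {
            "added_rows": new.loc[new.index.difference(old.index)],
            "removed_rows": old.loc[old.index.difference(new.index)],
            "added_columns": new.columns.difference(old.columns).tolist(),
            "removed_columns": old.columns.difference(new.columns).tolist(),
            "changed_cells": changed,   # MultiIndex columns: (column, "self"/"other")
        }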
Merge will be assisted, with files in S3 or without: 3-way comparison of commits boils down to reusing features from the snapshot rebuild engine, starting from the common ancestor, using only the diffs, and handling conflicts. Merging small changes on large datasets means dealing with the small changes only. Data and diffs can come from S3 if not already on localhost, so that data is merged, not pointers. Visually presenting diffs between non-adjacent commits requires a UI; current git tools would not interpret the history correctly.
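To show what "assisted" could mean, a deliberately minimal row-level sketch of the 3-way idea: it works on materialized tables rather than the stored diffs, assumes a key column, and only flags rows edited differently on both sides as conflicts, so treat it as an illustration rather than the planned implementation.

    import pandas as pd

    def _row(df: pd.DataFrame, k):
        return df.loc[k] if k in df.index else None   # None == row absent/deleted

    def _same(a, b) -> bool:
        if a is None or b is None:
            return a is None and b is None
        return a.equals(b)

    def three_way_merge(base, ours, theirs, key):
        """Row-level 3-way merge against a common ancestor; returns (merged, conflict keys)."""
        base, ours, theirs = (df.set_index(key) for df in (base, ours, theirs))
        merged, conflicts = {}, []
        for k in base.index.union(ours.index).union(theirs.index):
            b, o, t = _row(base, k), _row(ours, k), _row(theirs, k)
            if _same(o, t):            # both sides agree (including both deleting the row)
                keep = o
            elif _same(o, b):          # only theirs changed this row
                keep = t
            elif _same(t, b):          # only ours changed this row
                keep = o
            else:                      # both changed it differently -> conflict
                conflicts.append(k)
                keep = None
            if keep is not None:
                merged[k] = keep
        out = pd.DataFrame(merged).T.rename_axis(key) if merged else pd.DataFrame()
        return out, conflicts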