
I Want Decentralized Version Control for Structured Data - pcr910303
https://jonas-schuermann.name/projects/dvcs-for-structured-data/blog/2020-03-22-manifesto.html
======
carapace
> ...there are great decentralized version control systems like
> Pijul/Git/Fossil and many more that check every requirement that I have, but
> they are built to work with textual data and are therefore unsuited to be a
> database backend of a graphical application.

> I basically want a DVCS that doesn't operate on text files, but on a proper
> data model like relational algebra or algebraic datatypes.

Prolog + git.

Storing a text file of Prolog rules defining your DB is not as wild as it
might sound. It's IMO much nicer to work with than SQL.

- - - -

TerminusDB (in Prolog)

> TerminusDB started life as a quality control system and only later morphed
> into a full database because we couldn't find a storage layer that was fast
> enough to support quality controlled transactions at scale.

> Terminus uses the W3C's OWL language to define the schema of its databases.
> This gives Terminus a uniquely powerful and expressive way of defining rules
> for the shape and structure of the data that it stores.

> OWL is a very rich language based on first order logic and set theory.

[https://terminusdb.com/docs/schema](https://terminusdb.com/docs/schema)

- - - -

CQL (in Java, I believe)

> The open-source Categorical Query Language (CQL) and integrated development
> environment (IDE) performs data-related tasks — such as querying, combining,
> migrating, and evolving databases — using category theory,

[https://www.categoricaldata.net/](https://www.categoricaldata.net/)

~~~
WorldMaker
Similarly, if you are merely destructuring databases into text formats, most
SQL database engines are pretty good about round-tripping their "dump"
formats. For certain sizes of, for instance, SQLite databases, it is entirely
possible to store the SQL dumps in git and rebuild the database from them to
query/work with them. (Whether it is reasonable to do it that way is another
question.)
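As a sketch of that round trip, Python's built-in `sqlite3` module exposes the dump format directly via `Connection.iterdump()`, so the git-tracked artifact can be plain SQL text:

```python
import sqlite3

def dump_db(path):
    """Serialize a SQLite database to its SQL "dump" text (the artifact
    you would commit to git instead of the binary .db file)."""
    con = sqlite3.connect(path)
    try:
        return "\n".join(con.iterdump())
    finally:
        con.close()

def restore_db(sql_text, path):
    """Rebuild a SQLite database file from a previously dumped SQL text."""
    con = sqlite3.connect(path)
    try:
        con.executescript(sql_text)
        con.commit()
    finally:
        con.close()
```

For small databases the dump diffs reasonably well in git; as the comment notes, whether that is reasonable at scale is another question.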

------
kmike84
We needed to solve a similar problem: version-controlling & synchronizing
.json files across different machines (annotations for ML models).

Writing a custom git merge driver was quite painless: a cmdline script
(written in Python) with task-specific logic on how to merge data from these
.json files. Load the files, parse them, decide how to combine, detect
unresolvable conflicts, etc.
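A minimal sketch of such a driver (a generic illustration, not the actual script described above; the driver name and paths are hypothetical): git invokes it with the ancestor, "ours", and "theirs" versions, expects the result written back to the "ours" file, and treats a non-zero exit status as a conflict.

```python
#!/usr/bin/env python3
"""A hypothetical git merge driver for flat JSON objects. Wire it up with:

  # .git/config
  [merge "jsonmerge"]
      driver = jsonmerge.py %O %A %B

  # .gitattributes
  *.json merge=jsonmerge
"""
import json
import sys


def merge(base, ours, theirs):
    """Key-wise three-way merge of flat JSON objects.

    Returns (merged_dict, conflicting_keys). A key where both sides made
    different changes relative to the ancestor is reported as a conflict.
    """
    merged, conflicts = {}, []
    for key in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:            # both sides agree (including both deleted)
            value = o
        elif o == b:          # only theirs changed this key
            value = t
        elif t == b:          # only ours changed this key
            value = o
        else:                 # both changed it differently: conflict
            conflicts.append(key)
            value = o         # keep ours; the non-zero exit flags it
        if value is not None:
            merged[key] = value
    return merged, conflicts


def main():
    base_path, ours_path, theirs_path = sys.argv[1:4]

    def load(path):
        with open(path) as f:
            return json.load(f)

    merged, conflicts = merge(load(base_path), load(ours_path),
                              load(theirs_path))
    # Git expects the merge result in the "ours" file; a non-zero exit
    # status tells git the merge has conflicts.
    with open(ours_path, "w") as f:
        json.dump(merged, f, indent=2, sort_keys=True)
    sys.exit(1 if conflicts else 0)


if __name__ == "__main__":
    main()
```

Real drivers would carry task-specific rules instead of this generic key-wise policy, which is exactly the point the comment makes.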

It seems one may need custom logic to merge structured data; there is no
single best solution. This could make the creation of a generic tool harder.

git is not a bad base technology for this. I'm not sure what other things
we're missing (e.g. better diffs for structured data?), because .json is still
text; it is just merges that are unreliable if you treat .json as text. There
are also caveats - e.g. you can't install a custom merge driver on GitHub, so
the "merge" button becomes dangerous. But overall, for .json, this approach
works fine.

~~~
elcritch
I wanted to write a git diff for files like KiCad or even Word. I didn’t know
custom git merges were a thing. Do you have a link for how to get started?

~~~
JNRowe
Both custom diff and merge drivers are described at a high-level in
gitattributes(5)¹. They're pretty useful even in really basic ways such as
adding a textconv with "jq -S ." or "xmllint --pretty 2" to pretty print JSON
or XML before calculating diffs.

Plus, if you've already dipped into those docs to see the diff options, be
sure to check the funcname attribute too. It allows you to add custom
diff(1)-style `--show-function-line` options. For example, you can use an ugly
regex such as `^\[\(.*\)\]$` to guess section names in .ini file
diffs. Or the wordRegex option to make CSV files break on fields with `git
diff --word-diff`. Or... well, thousands of other things. There are tonnes of
things you can do to improve diffs and merges for textual data, in addition to
the things you may want to do with binary blobs.
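For instance, the textconv and funcname settings described above might be wired up like this (the driver names here are arbitrary):

```
# .gitattributes
*.json  diff=json
*.xml   diff=xml
*.ini   diff=ini
*.csv   diff=csv

# .git/config (or ~/.gitconfig)
[diff "json"]
    textconv = jq -S .
[diff "xml"]
    textconv = xmllint --format
[diff "ini"]
    xfuncname = "^\\[.*\\]$"
[diff "csv"]
    wordRegex = "[^,]+"
```

With that, `git diff` shows pretty-printed JSON/XML, `[section]` hunk headers for .ini files, and field-level changes for CSVs under `git diff --word-diff`.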

1. [https://git-scm.com/docs/gitattributes](https://git-scm.com/docs/gitattributes)

------
jarofgreen
[https://www.dolthub.com/](https://www.dolthub.com/) ?

Never used it myself, but it's one of the many "git for data" projects I've
seen go past. I'm interested that none of them meets the author's needs.

~~~
diggan
For a decentralized system, dolt is really not good at getting the message
across: I can't find anything in the documentation about how to get the
collaboration/repository features working without DoltHub, which seems to be
their centralized service.

One of the neat things with Git and others is that it doesn't really care
about what type of endpoint you're fetching data from, you can `git clone`
from a unix directory, NFS shared directory, over ssh, via Keybase and so on.
Dolt doesn't seem to support anything other than DoltHub (if it does, they
really need to update the documentation).

~~~
ken
Where did you look? I clicked the "GitHub" link at the top, which is the
standard way to bypass marketing fluff and get to an actual program. The
README has instructions for installing from source, verifying the
installation, configuring, creating an empty local repo, adding data to it,
etc. It barely mentions DoltHub.

~~~
diggan
I think you misunderstood my message. I'm not asking how to install dolt, I'm
asking where the whole "decentralized" feature-set is described and
implemented in dolt, as I've scoured the docs, both the marketing page and the
GitHub repository, and the collaboration features seem to be locked to
dolthub, which is very different from, for example, git.

So I create a local repo with dolt, and now I want someone else to collaborate
on this dataset with me, without using dolthub. How do I do this?

The article in the submission has "decentralized" as the first requirement, so
I assume that dolt is mentioned here because it's somehow decentralized. But I
cannot for the life of me find where that's mentioned.

~~~
bionoid
> So I create a local repo with dolt, and now I want someone else to
> collaborate on this dataset with me, without using dolthub. How do I do this?

The best I could find is utils/remotesrv in the repository [0], which seems to
enable collaboration, but it also seems a bit basic and it's not clear if
HTTPS is supported.

[0] [https://github.com/liquidata-inc/dolt/tree/master/go/utils/remotesrv](https://github.com/liquidata-inc/dolt/tree/master/go/utils/remotesrv)

~~~
diggan
Seems to confirm my suspicion that dolt is not made with decentralization in
mind, and its collaboration is mostly tied to dolthub and a specific protocol
that currently only works over HTTP. Thanks for digging that out!

------
apostacy
I'm surprised nobody has mentioned datalad/git-annex. It might be relevant. It
stores references to binary data in a git repository, and depending on the
backend (like bup) it can benefit from differential compression. It is highly
peer-to-peer, and has git-annex-sync[1], which will synchronize branches with
its remotes.

Datalad is based on git-annex, and is already being used for sharing large
scientific datasets[2].

[1]: [https://git-annex.branchable.com/sync/](https://git-annex.branchable.com/sync/)

[2]: [https://www.datalad.org/](https://www.datalad.org/)

~~~
WhatIsDukkha
I think git-annex is pretty much the right answer if you are already a
Linux-savvy person who expects a mature, battle-tested tool that integrates
well with your existing Linux workflow systems. It sits on top of git in a
sensible, discernible way.

[https://www.datalad.org/for/git-users](https://www.datalad.org/for/git-users)

People's non-text data is going to be some form of indiscernible binary blob
to whatever system is managing it. Think about what a diff looks like for a
JPEG, an Excel file, or an audio file.

[http://docs.datalad.org/en/latest/metadata.html](http://docs.datalad.org/en/latest/metadata.html)

What about git-annex vs git-lfs?

1\. """LFS Test Server is an example server that implements the Git LFS API.
It is intended to be used for testing the Git LFS client and is not in a
production ready state."""

2. git-lfs seems to want to store your files in the GitHub/Microsoft cloud;
it's not really ready to be deployed inside an existing system's workflow. If
you need to check a tickbox and just assume Atlassian/Microsoft has you
covered, it might be a good choice.

3. git-annex has a ton of integrations with multiple backends:

[http://www.chiark.greenend.org.uk/doc/git-annex/html/special_remotes.html](http://www.chiark.greenend.org.uk/doc/git-annex/html/special_remotes.html)

4. Poke around the git-annex website. I think you'll see a tool that's been
used successfully in a variety of workflows for a decade plus. That's not true
of really any of the other tools others have noted in this thread.

~~~
apostacy
I've worked with both over the last few years. Git LFS is really not p2p like
git-annex is. And it is less flexible. But maybe more stable, and git works
with it out of the box.

You really have to pay for specialized LFS hosting, and you can't remove
hosted LFS objects. And while it is possible to download only some LFS
objects, it is not easy or intuitive. Once a repo gets beyond a certain size,
it can be really impractical.

Git annex lets you fairly easily get only the objects you want. And it is
strongly p2p. You can just set up another repo as a remote, and git-annex-copy
whatever files you want to directly over ssh. Or, you can use Google Drive or
DropBox or any number of other hosting services that git-annex knows, and
stash your files there. Sibling repos will be updated about where they can
find the objects. And it is really easy to push your binaries to as many
backends as you want.

Finally, you can find massive multi-terabyte repos of medical image data[1].
There's no way that LFS could handle these terabyte repos. It is really easy
to fork one of these data repos, change it, put your changed files into your
own backend, and then reshare it.

[1]: [http://datasets.datalad.org/](http://datasets.datalad.org/)

~~~
WhatIsDukkha
Also of note (from my understanding): a git-lfs local repo is double the size
of the underlying data because it doesn't hardlink.

------
rgardaphe
You may want to have a look at qri "query" - [https://qri.io](https://qri.io)
(disclosure, I work there).

Free & open-source tools for dataset versioning built on IPFS (the distributed
web). Qri datasets contain commit histories, immutable hashes for each version
(commit), schema info, metadata, readmes, & transform scripts, all of which
ride together with the data (or, body).

Latest versions of our CLI tools support SQL & version diffing. We also have
an electron app, Qri Desktop
([https://qri.io/desktop](https://qri.io/desktop))

~~~
carapace
PgUp PgDown Home and End keys don't work for me on your site. (I'm using
Firefox 75.)

~~~
chriswhong
Thanks for letting us know. I opened an issue here which we will address as
soon as possible: [https://github.com/qri-io/website/issues/202](https://github.com/qri-io/website/issues/202)

~~~
carapace
Cheers!

------
adrianmonk
Is what they want just the same thing as a Conflict-free Replicated Data Type
(CRDT)?

[https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type](https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type)

The only thing on their list of requirements that might not be covered is the
last item, "collaborative". (If replication is a solved problem, it removes
some obstacles to that, even though it doesn't give you collaboration for
free.)
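As a concrete (if minimal) example of the CRDT idea, here is a sketch of a state-based grow-only counter; its `merge` is commutative, associative, and idempotent, so replicas converge no matter the order in which they sync:

```python
class GCounter:
    """Grow-only counter, the "hello world" of state-based CRDTs."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> local increment count

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max: commutative, associative, idempotent.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)
```

Real structured-data versioning needs far richer types than a counter, but the same merge laws are what make "replication without conflicts" possible.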

------
abathologist
[https://irmin.org/](https://irmin.org/) seems like it may fit the bill:

> A distributed database built on the same principles as Git

I haven't used it myself, but it has been around for years and seems pretty
mature.

------
fouc
> A recent paper suggested a new mathematical point of view on version control.
> I first found out about it from pijul, a new version control system (VCS) that
> is loosely inspired by that paper.

[https://jneem.github.io/merging/](https://jneem.github.io/merging/)

~~~
dan-robertson
I think this is quite relevant to the problem.

Whatever your data is, you need a way to merge it, but the format of allowable
patches and the algorithms used must surely depend on the data structure.

Imagine first a set of strings. This may be merged more easily than the list
of lines a DVCS usually deals with, but you still get conflicts. (I’m not sure
what the merge of “add foo” and “add foo; remove foo” should be: it could
reasonably be “add foo; remove foo”, or a conflict over whether foo is added.)
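The ambiguity is easy to see in code. A naive snapshot-based three-way merge of sets cannot even observe that the second side added and then removed foo, because the intermediate state is gone:

```python
def merge_sets(base, ours, theirs):
    """Snapshot-based three-way merge of sets: apply both sides'
    additions and removals relative to the common ancestor."""
    added = (ours - base) | (theirs - base)
    removed = (base - ours) | (base - theirs)
    return (base | added) - removed

# ours applied "add foo"; theirs applied "add foo; remove foo".
# From snapshots alone, theirs looks like a no-op, so foo survives the
# merge; a patch-based model could instead surface this as a conflict.
result = merge_sets(set(), {"foo"}, set())
```

This is exactly the snapshots-versus-patches distinction the linked article is about.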

Now imagine some kind of recursive structure (say, a unix filesystem
structure, except the files are all empty). It seems hard to come up with a
model for patches that works well with things like removing directories or
moving subtrees. Are these two different patches?

move the contents of a directory to another, newly created directory and then
delete the first directory

move a directory to a new place

How would these merge with creating a new file under the directory before
things are moved?

For a set of relations, you want merging to preserve properties of primary and
foreign keys. I’m not sure what merging that would look like.

These all sound like interesting hard problems. I’m not so convinced that
their solution would be particularly useful.

~~~
User23
> Imagine first a set of strings. This may be merged more easily than the list
> of lines a DVCS usually deals with, but you still get conflicts. (I’m not sure
> what the merge of “add foo” and “add foo; remove foo” should be: it could
> reasonably be “add foo; remove foo”, or a conflict over whether foo is added.)

My imagination is pretty confused by the application of a binary operator with
only one argument. Would you please clarify? I think you'll find that when you
explicitly think about the unstated arguments the problem makes sense, but I
could be misunderstanding your point.

~~~
dan-robertson
The point, as per the article mentioned in the parent comment, is to talk
about merging patches rather than snapshots. This allows you to enforce
certain nice properties of merges, like associativity.

Sometimes you need information about deleted things in your patch to be able
to merge correctly, though I don't have an example off the top of my head.

------
ghego1
That's exactly what we have developed at [bohr.app](https://bohr.app).

As far as I can tell, our solution checks all the requirements mentioned in
the post. We have a decentralized data storage system that supports delta
syncing, sends/receives data over encrypted P2P communications between
multiple devices/users without a "central" data storage, and leverages some
concepts taken from blockchain technology to ensure data integrity and
immutability.

------
brynb
A few projects I'm involved with that might be worth sharing here:

- Redwood
([https://github.com/brynbellomy/redwood](https://github.com/brynbellomy/redwood)),
a realtime, p2p database. Data is structured in state trees that evolve over
time. Merge algorithms are configurable at any point in a given state tree,
but the default is a CRDT algorithm called "sync9".

- Braid ([https://github.com/braid-work](https://github.com/braid-work) and
[https://braid.news](https://braid.news)), a draft IETF spec that specifies
certain extensions to HTTP that make it much easier to build systems like
Redwood. The Braid spec is under active development on Github, and we welcome
input from anyone interested in the idea.

- Axon ([http://axon.science](http://axon.science) and
[https://github.com/AxonNetwork](https://github.com/AxonNetwork)), marketed as
a platform for making it easier to collaborate on scientific research, but
under the hood, it's basically just some extensions to git that allow you to
push commits over a peer-to-peer network bound together by a DHT.

I would also highly recommend getting involved in the Internet Archive's DWeb
meetups and conferences, where you'll find hundreds of people interested in
solving exactly these kinds of problems.

------
vinnyhaps
I went to a meetup on TerminusDB. It seemed like a cool project and quite
mature. [https://terminusdb.com/](https://terminusdb.com/)

------
Fiahil
I'm using "topic-based versioning"[0] and it solves most of the issues
presented here. The principles are quite similar to what you would find in a
database, with a central data repository and a write-ahead log.

Each "table" is organized in ordered, write-only topics (or folders)
containing immutable messages (or files). Each operation also adds a message
into a special topic we use as a log. Each participant has to remember the
message id ("cursor") it was at on the last sync, and fetches new messages
from the log first. Then it's only a matter of applying the new ops until the
desired point in time.
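A sketch of that sync loop (the names are my own, not from the linked article): the log is an append-only list of numbered messages, and each participant replays everything past its cursor:

```python
class Participant:
    """One copy of the dataset: a cursor into the shared log plus the
    state materialized by replaying the log's operations in order."""

    def __init__(self):
        self.cursor = 0   # id of the last message applied (ids are 1-based)
        self.state = {}

    def sync(self, log):
        """Fetch and apply every log message past our cursor, in order."""
        for msg_id, op, key, value in log[self.cursor:]:
            if op == "put":
                self.state[key] = value
            elif op == "delete":
                self.state.pop(key, None)
            self.cursor = msg_id

# An append-only topic acting as the write-ahead log.
log = [(1, "put", "k", "v1"), (2, "put", "k", "v2"), (3, "delete", "k", None)]
```

Because messages are immutable and ordered, replaying from any cursor is deterministic, which is what makes the "git pull --rebase"-style sync conflict-free.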

- decentralized, private, efficient: you own a complete copy of all versioned
datasets, and you can store deltas between versions since all messages are
ordered in topics. Each sync is done like a "git pull --rebase".

- reliable: If you follow the principles, conflicts are not possible.

- collaborative: Here's the slightly difficult requirement. You can choose to
defer message-id ("cursor") allocation to another distributed service (like
zookeeper, for example) and accept having to be online to register new write
operations on a topic; or force the system to be linearizable.

[0]: I wrote an article describing this design pattern here:
[https://medium.com/bcggamma/topic-based-versioning-architecture-for-scalable-ai-application-c926ffa92c1](https://medium.com/bcggamma/topic-based-versioning-architecture-for-scalable-ai-application-c926ffa92c1)
Any feedback welcome!

------
keymone
Any immutable database? Datomic is one example.

On a related note: the idea of a database that loses data is perplexing. Why
would anyone want that? Why isn’t data retention the default, with limiting
that retention an opt-in choice? When was the last time you rewrote your git repo
to delete some old commits that you no longer need?

------
ex_amazon_fc
A proper "DVCSD" would help a lot with software packaging, configuration
packaging and deployment, immutable infrastructure.

It could allow fine-grained control and vetting of code and configuration.

(and overcome the security disasters called containers and "configuration
management")

It's sad that few people understand this.

------
barnabee
I'd like to see a system like this that treats the type system (aka schema)
for the data as just more data and puts _that_ under DVCS too.

Then I'd like to add a robust security/permissions model and build an OS
around it.

~~~
MathematicalArt
You might be interested in this:
[https://www.categoricaldata.net/](https://www.categoricaldata.net/).
Categorical Query Language (CQL).

~~~
barnabee
I’ll take a look, thanks!

------
RhysU
Wasn't this IBM Lotus Notes?

~~~
amitport
Why use the past tense? It is still being sold and used.

------
nieve
I'm a little concerned that the author is basing their approach on Pijul,
given that its data-loss issues are exactly what you wouldn't want for the
"reliable" part of their priorities, but hopefully that's just an issue with
Pijul's implementation of the ideas. It seems plausible that the pitfalls are
in the code rather than in the theory itself.

------
cjbprime
It's super unhelpful to write a post like this while showing very little
evidence that you Googled the thing you want to see if it exists already. What
about [https://docs.dat.foundation/](https://docs.dat.foundation/) or
everything else in these comments?

~~~
cxr
I like Dat, but right now it fails on the author's reliability criterion. And
it's not clear[1] whether multi-writer is still a work in progress, which
would mean a failure on "collaborative", too.

1. Looks like it's still not finalized, but it might be—and the uncertainty
about the answer is a failure in and of itself.

~~~
anchpop
Dat multi-writer already exists; here's a demo of it:
[https://dat-shopping-list.glitch.me/](https://dat-shopping-list.glitch.me/)

~~~
pfraze
Unfortunately that demo was built on a protocol prototype which was
deprecated.

Multiwriter is still being worked on, but it's not in the current release
schedule. The upcoming release focuses on performance, scaling, reliability,
and "mounts" (effectively symlinks across drives). Mounts can be used to
create a kind of multiwriter, as the mounts stay in the control of the author,
but we don't yet have "multiple authors of a shared folder."

In user-space, I've been able to create unioned folders like Plan 9 did, which
is a _serviceable_ multiwriter scheme. A more sophisticated EC approach would
use a vector clock in file metadata to track revisions, but it would need some
approach to tombstones, which I don't have a solution for yet. It's solvable;
it'll just take time and performance tuning.
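The vector-clock comparison alluded to above can be sketched in a few lines (a generic illustration, not Dat's actual metadata format): each revision carries a per-writer counter map, and two revisions conflict exactly when neither clock dominates the other.

```python
def compare(a, b):
    """Compare two vector clocks ({writer_id: counter} maps).

    Returns 'before', 'after', 'equal', or 'concurrent'. 'Concurrent'
    is the case a merge layer must surface as a potential conflict.
    """
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither dominates: concurrent edits
```

Tombstones are the missing piece because a deletion also needs a clock entry; otherwise a deleted file is indistinguishable from one that never existed.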

~~~
anchpop
I see, thank you

------
hernantz
You can take a look at this talk:
[https://www.youtube.com/watch?v=DEcwa68f-jY](https://www.youtube.com/watch?v=DEcwa68f-jY).
It describes how to build a dapp with SQLite + CRDTs.

------
lbayes
What about Noms?

"The versioned, forkable, syncable database"

[https://github.com/attic-labs/noms](https://github.com/attic-labs/noms)

~~~
neilpa
Unfortunately development has stalled out

[https://github.com/attic-labs/noms/blob/master/README.md#status](https://github.com/attic-labs/noms/blob/master/README.md#status)

> Nobody is working on this right now. You shouldn't rely on it unless you're
> willing to take over development yourself.

~~~
aboodman
FWIW, we (replicache.dev) have begun working on it again. We're part of the
original team that built Noms.

Unclear what the roadmap for Noms will be yet, which is why we've not updated
the README.

------
User23
I also want a widely adopted purely functional distributed data structure in
the Okasaki style. Persistence is wonderful.

------
pmarreck
Doesn’t this map nicely onto functional data structures, where each new
“version” is just a new top level reference?

~~~
dan-robertson
The hard part is merging different changes.

------
bjonnh
How would one merge updates from all those offline clients?

------
prpl
git-lfs+torrents (over SSH if you need authentication?)

------
pwpwp
IDGI, just store the structured data in Git.

