The code is strictly educational for the time being, but I'm working on a Swift library that can hopefully put this stuff to use in real apps. Hope to start dog-fooding in release software by the end of the year!
(Which is broadly the problem with posting enormous applied CS-research braindumps to HN. Don't know a better place to do it, though. And I see people are upvoting you. Also, the absence of thoughtless responses is itself a great response.)
The key point here, as far as I can tell, is really building a CRDT system as a true platform layer, rather than as a one-off solution for a special-purpose app. I think it's fairly clear that generic sync is a pretty essential part of a modern decentralized computing environment, and I don't think it's clear at all what the best way to solve it is.
But... it seems to me that you've solved a large piece of the problem but not the whole thing. Because the most important document sync and collaboration platform is, of course, source code control. If you have a document sync and collaboration model that doesn't at least generalize to classic revision control, why not?
Now, there's a sensible reason to separate these problems -- CRDT and OT solutions tend to specialize in the zero-maintenance case where a user-resolved merge is just impossible. Whereas if it was possible to build a zero-maintenance revision-control system, which automatically resolved all merges and conflicts, someone would have done so already.
This certainly suggests that the two are different problems. But generalizing across slight differences is what system software does. Maybe one is a special case of the other?
A generalized layer for lightweight collaboration is pretty powerful. But it certainly seems like the case that if you could generalize across lightweight collaboration and heavyweight revision control, you would have something that would be incredibly powerful.
Or is this too ambitious? I don't know so I'm asking you.
If edits carried all the semantic information of what they were doing to the source: "rename symbol <x> to <y> everywhere" "perform step <x> after step <y>", then we could probably build a zero-maintenance revision-control system.
A git repository is nearly a CRDT that represents a file system hierarchy -- the only thing that's missing is a deterministic merge algorithm. If you allow files to be committed in a conflicted state that gets resolved by a later edit, you have a full CRDT that automated systems can work freely with.
The catch, of course, is that despite pushing the resolution to an arbitrary point in the future, these conflicts still have to be resolved by hand eventually.
Have you checked out Noms (https://github.com/attic-labs/noms)?
> But studying Differential Sync further, I realized that a number of details made it a non-starter. First, though the approach seems simple on the surface, its true complexity is concealed by the implementation of diff and patch. This class of functions works well for strings, but you basically need to be a seasoned algorithms expert to design a set for a new data type. (Worse: the inherent fuzziness depends on non-objective metrics, so you’d only be able to figure out the effectiveness of your algorithms after prolonged use and testing instead of formal analysis.) Second, diff and patch as they currently exist are really meant for loosely-structured data such as strings and images. Barring conversion to text-based intermediary formats, tightly structured objects would be very difficult to diff and patch while maintaining consistency.
Noms is a structured database in the spirit of Git. It includes efficient (cost proportional to size of diff, not size of input) diff and merge for sets, lists, maps, and user-defined structs. The built-in merge strategies are commutative and idempotent, so if your operations can be merged using it, then your schema is a (state-based) crdt.
We didn't initially build Noms for this type of p2p, masterless architecture, but more and more recently, I find myself thinking it's a great fit.
> Next, there are some issues with using Differential Sync in an offline-first environment. Clients have to store their entire diff history while offline, and then, on reconnection, send the whole batch to their peers for a very expensive merge. Assuming that other sites had been editing away in the meantime, distantly-divergent versions would very likely fail to merge on account of out-of-date context info and lose much of the data for the reconnected peer. Finally, Differential Sync only allows one packet at a time to be in flight between two peers. If there are network issues, the whole thing grinds to a halt.
I may be misunderstanding here, but I think these issues do not afflict Noms:
1/ Noms can compute the diff between any two states efficiently. It doesn't matter how many local changes you make while offline - what is transmitted is just the diff between two states.
2/ Merge performance is relative to the size of the diff being merged, not the number of changes.
3/ I don't think one-packet-in-flight-at-a-time applies to Neil Fraser's differential sync. It definitely doesn't apply to Noms.
Given your interest in CRDTs, I thought you might be interested to take a look.
There's still more work that needs to be done to make Noms work really well for this kind of use case, but I think it's one that makes a lot of sense.
As somebody who's just spent a ton of time looking at all the options, I'd love to hear your thoughts on how Noms fits into your thinking.
I wonder, how much metadata noms needs to achieve those features?
CRDTs brute-force all those distributed/concurrency issues by adding some metadata. With CRDTs, two hard topics are (1) metadata size and (2) gc.
So, how can we estimate the overhead?
P.S. I think, I will elaborate. So, imagine a real-time collaborative editor. Assuming every edit (1 char) is also a git-like commit, you have to mention its hash and its parent hashes. noms trims hashes to 20 bytes, that's reasonable, so we have at least 40 bytes of metadata, incompressible by definition. That means 1:40 data-to-metadata ratio or 1:200 if we count compression in. That is more or less the reason we did not use hashes in one our project some time ago.
What is your perspective on that?
I hadn't seen RON It looks really interesting - I'll definitely take some time to play with it sometime soon. It's so great how much work is happening in this space right now.
> 1-3 is generally true for ORDTs (in archagon's terms).
That makes sense to me except for calculating diffs efficiently. Are you saying that can be done in time proportional to the _diff_? (as opposed to the number of changes?)
> P.S. I think, I will elaborate. So, imagine a real-time collaborative editor. Assuming every edit (1 char) is also a git-like commit, you have to mention its hash and its parent hashes. noms trims hashes to 20 bytes, that's reasonable, so we have at least 40 bytes of metadata, incompressible by definition. That means 1:40 data-to-metadata ratio or 1:200 if we count compression in. That is more or less the reason we did not use hashes in one our project some time ago.
> What is your perspective on that?
We don't actually transmit the hash of a chunk if we transmit the chunk itself since that's duplicate information. So the overhead would be half your calculation.
My perspective is that this frequently an entirely acceptable tradeoff. Your calculation is the absolute worst case, and there are many applications for which that worst case occurs infrequently. Today's collaborative text editors, for example, usually damp the liveness on purpose for ux reasons. It's cool to see someone type live, until you are that person, and realize that the other side is watching you misspell in realtime :).
The minimum size of an update in Noms is currently one "chunk". Chunks are set to average 4kb in the code currently, though that is tunable (the cost is that cost of access and write locally will go up).
If you wanted to reduce the size of updates in Noms further, you could do that transmitting patches to know chunks rather than new chunks.
It looks like the identifiers in RON are 16 bytes, so I think if you continue to push this direction in Noms you end up with wire formats that cost about the same. It just hasn't been something we've worked on yet.
In exchange for this tradeoff, what you get in Noms is:
- No need to ever accumulate a history of changes in order to know the current state. Each snapshot directly and efficiently represents the current state. It looks like in RON you can transmit a state, but it's not clear to me when nodes would choose to do that vs a patch.
- Local storage properties comparable to a traditional database. Store large amounts of data with efficient queries, sorted scans, mutations, etc.
I'm not sure how this all compares to RON, I'm looking forward to digging into it.
It is like the Lamport event model married to a LSMT database. And their children are CRDTs.
> That makes sense to me except for calculating diffs efficiently. Are you saying that can be done in time proportional to the _diff_? (as opposed to the number of changes?)
Slightly different abstractions here, but yes, in most cases a "diff" is a tail of an op log. On request, it gets compacted and sent. I guess, in noms you leverage immutable data structures for that.
Thanks, now I understand your trade-offs better.
What happens in the case where the op log represents changes that went back and forth a bunch between two states. Does that ever get compacted down locally?
The difference here is that Noms will never have to read or compact those intermediate states. If a character gets inserted and deleted a bunch of times, and then that set of changes get synced with a peer, the source peer diffs the beginning and end state and ignores the intermediate ones. The local work done will just be proportional to the actual change between the two states (in this case finding a diff of a single char). This work is always proportional to the change in the two states, not the amount of ops that transpired between them.
> I guess, in noms you leverage immutable data structures for that.
That, and content-addressing.
But yeah, you have to store history for as far back as the amount of time you want to be able to be remain disconnected and then resynchronize with peers.
So few apps from their sample set were using CloudKit Sharing that they didn't even bother using a percentage! That's quite unusual for an Apple framework.
This was such an enjoyable read!
> The user should never be faced with a “pick the correct revision” dialog box.
These two goals cannot be reconciled.
If we both synchronize a document with content and then we both go offline and we make conflicting edits then someone will have to resolve the conflict.
I'm still reading TFA, but this is probably the biggest reason I won't use this system as described. (I will copy any useful ideas, though, and keep it in mind for future projects.)
For my current project, it just doesn't make sense to blindly merge documents. The result will be broken, or worse -- valid but silently incorrect. Having changes you made and saw applied silently vanish is not acceptable to me, and I think having users resolve merge conflicts might be. Not sure, I'm aiming at non-technical users...
We wouldn't accept this for code, though. If that is a principled stance, and not just a side-effect of traditional code syntax being brittle, we should wonder whether our users deserve the same.