Hi HN! I wanted to research more elegant ways to enable document sync and collaboration in my apps sometime last year, and ended up discovering a new class of data structure that made it possible to build collaboration into documents right on the data level, completely separate from the network architecture. (Maybe you're already familiar with CRDTs. Well, these aren't just ordinary CRDTs, but almost meta-CRDTs that can represent a variety of convergent data types in a generic way. They also make it super easy to implement things like past revision viewing and delta patches.) As a proof of concept, I built a basic real-time collaborative text editor for iOS that works equally well over regular iCloud sync, CloudKit Sharing, and offline, as well as a simple mesh network simulator for macOS that has support for collaborative text editing and simple Bézier editing.
The code is strictly educational for the time being, but I'm working on a Swift library that can hopefully put this stuff to use in real apps. Hope to start dog-fooding in release software by the end of the year!
Unfortunately I wouldn't expect a lot of substantive technical discussion for a post like this on HN -- this post will expire before anyone has time to read it properly and produce thoughtful responses.
(Which is broadly the problem with posting enormous applied CS-research braindumps to HN. Don't know a better place to do it, though. And I see people are upvoting you. Also, the absence of thoughtless responses is itself a great response.)
The key point here, as far as I can tell, is really building a CRDT system as a true platform layer, rather than as a one-off solution for a special-purpose app. I think it's fairly clear that generic sync is a pretty essential part of a modern decentralized computing environment, and I don't think it's clear at all what the best way to solve it is.
But... it seems to me that you've solved a large piece of the problem but not the whole thing. Because the most important document sync and collaboration platform is, of course, source code control. If your document sync and collaboration model doesn't at least generalize to classic revision control, why not?
Now, there's a sensible reason to separate these problems -- CRDT and OT solutions tend to specialize in the zero-maintenance case, where a user-resolved merge is just impossible. Whereas if it were possible to build a zero-maintenance revision-control system, one which automatically resolved all merges and conflicts, someone would have done so already.
This certainly suggests that the two are different problems. But generalizing across slight differences is what system software does. Maybe one is a special case of the other?
A generalized layer for lightweight collaboration is pretty powerful. But it certainly seems that if you could generalize across lightweight collaboration and heavyweight revision control, you would have something incredibly powerful.
Or is this too ambitious? I don't know, so I'm asking you.
I suspect you could build such a thing if you could more rigidly define edit operations on source code files. The way we edit code now is roughly "insert/edit/delete string <x> at position <y>", which doesn't really carry enough context to auto-merge.
If edits carried all the semantic information of what they were doing to the source -- "rename symbol <x> to <y> everywhere", "perform step <x> after step <y>" -- then we could probably build a zero-maintenance revision-control system.
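To make that concrete, here's a rough sketch of what such semantic edits might look like as a data type. Everything here is hypothetical -- no existing editor or VCS records edits this way -- it just illustrates why richer intent makes auto-merging tractable.

```swift
// Hypothetical semantic edit operations, as opposed to raw string edits.
enum SemanticEdit {
    /// "rename symbol <x> to <y> everywhere": commutes with most other
    /// edits, so two offline peers renaming different symbols merge cleanly.
    case renameSymbol(from: String, to: String)
    /// "perform step <x> after step <y>": an ordering constraint rather
    /// than a textual position, so it survives unrelated insertions.
    case reorderStep(step: String, after: String)
    /// Raw text fallback -- the only case that still needs a human merge.
    case insertText(String, atLine: Int)
}
```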
I've had some thoughts along the same lines, but haven't taken the opportunity to flesh them out nearly as far.
A git repository is nearly a CRDT that represents a file system hierarchy -- the only thing that's missing is a deterministic merge algorithm. If you allow files to be committed in a conflicted state that gets resolved by a later edit, you have a full CRDT that automated systems can work freely with.
The catch, of course, is that despite pushing the resolution to an arbitrary point in the future, these conflicts still have to be resolved by hand eventually.
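A minimal sketch of the "commit the conflict, resolve it later" idea -- the types are made up, not git's actual object model, but they show how admitting a conflicted state makes the merge total and deterministic:

```swift
// File contents as a CRDT-friendly value: either clean, or carrying
// both sides of an unresolved merge.
enum FileState: Equatable {
    case clean(String)
    case conflicted(String, String)
}

/// A total, deterministic "merge": instead of failing, it records both
/// sides in canonical order, so merge(a, b) == merge(b, a) and every
/// replica converges. A later (human) edit replaces the conflicted
/// value with a clean one.
func merge(_ a: String, _ b: String) -> FileState {
    if a == b { return .clean(a) }
    let (lo, hi) = a < b ? (a, b) : (b, a)  // canonical order: commutative
    return .conflicted(lo, hi)
}
```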
> But studying Differential Sync further, I realized that a number of details made it a non-starter. First, though the approach seems simple on the surface, its true complexity is concealed by the implementation of diff and patch. This class of functions works well for strings, but you basically need to be a seasoned algorithms expert to design a set for a new data type. (Worse: the inherent fuzziness depends on non-objective metrics, so you’d only be able to figure out the effectiveness of your algorithms after prolonged use and testing instead of formal analysis.) Second, diff and patch as they currently exist are really meant for loosely-structured data such as strings and images. Barring conversion to text-based intermediary formats, tightly structured objects would be very difficult to diff and patch while maintaining consistency.
Noms is a structured database in the spirit of Git. It includes efficient (cost proportional to the size of the diff, not the size of the input) diff and merge for sets, lists, maps, and user-defined structs. The built-in merge strategies are commutative and idempotent, so if your operations can be merged using them, then your schema is a (state-based) CRDT.
We didn't initially build Noms for this type of p2p, masterless architecture, but lately I find myself thinking more and more that it's a great fit.
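For readers who haven't met the term: a state-based CRDT is just a type whose merge is commutative, associative, and idempotent. A generic illustration in Swift (this is not the Noms API, which is Go):

```swift
// The simplest state-based CRDT: a grow-only set. Merge is set union,
// which is commutative, associative, and idempotent, so replicas can
// exchange whole states (or diffs of them) in any order and converge.
struct GSet<Element: Hashable> {
    var elements: Set<Element> = []

    func merged(with other: GSet) -> GSet {
        GSet(elements: elements.union(other.elements))
    }
}

// Two replicas add independently while offline...
let a = GSet(elements: ["x"]), b = GSet(elements: ["y"])
// ...and converge regardless of merge order or repetition.
assert(a.merged(with: b).elements == b.merged(with: a).elements)
```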
> Next, there are some issues with using Differential Sync in an offline-first environment. Clients have to store their entire diff history while offline, and then, on reconnection, send the whole batch to their peers for a very expensive merge. Assuming that other sites had been editing away in the meantime, distantly-divergent versions would very likely fail to merge on account of out-of-date context info and lose much of the data for the reconnected peer. Finally, Differential Sync only allows one packet at a time to be in flight between two peers. If there are network issues, the whole thing grinds to a halt.
I may be misunderstanding here, but I think these issues do not afflict Noms:
1/ Noms can compute the diff between any two states efficiently. It doesn't matter how many local changes you make while offline - what is transmitted is just the diff between two states.
2/ Merge performance is relative to the size of the diff being merged, not the number of changes.
3/ I don't think one-packet-in-flight-at-a-time applies to Neil Fraser's differential sync. It definitely doesn't apply to Noms.
Given your interest in CRDTs, I thought you might be interested to take a look.
There's still more work that needs to be done to make Noms work really well for this kind of use case, but I think it's one that makes a lot of sense.
As somebody who's just spent a ton of time looking at all the options, I'd love to hear your thoughts on how Noms fits into your thinking.
Points 1-3 are generally true for ORDTs (in archagon's terms).
I wonder: how much metadata does Noms need to achieve those features?
CRDTs brute-force all those distributed/concurrency issues by adding some metadata. With CRDTs, the two hard topics are (1) metadata size and (2) garbage collection.
So, how can we estimate the overhead?
P.S. I think I will elaborate. So, imagine a real-time collaborative editor. Assuming every edit (1 char) is also a git-like commit, you have to mention its hash and its parent hashes. Noms trims hashes to 20 bytes, which is reasonable, so we have at least 40 bytes of metadata, incompressible by definition. That means a 1:40 data-to-metadata ratio, or 1:200 if we count compression in. That is more or less the reason we did not use hashes in one of our projects some time ago.
What is your perspective on that?
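(To spell out the arithmetic: the ~5x compression factor below is an assumption chosen to reproduce the 1:200 figure; the hashes themselves are pseudorandom and won't compress.)

```swift
let payloadBytes = 1.0               // one typed character
let metadataBytes = 2.0 * 20.0       // own hash + one parent hash, 20 bytes each
print(metadataBytes / payloadBytes)  // 40.0 -> a 1:40 data-to-metadata ratio

let compressedPayload = payloadBytes / 5.0  // assume text compresses ~5x; hashes don't
print(metadataBytes / compressedPayload)    // 200.0 -> roughly 1:200
```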
I hadn't seen RON. It looks really interesting -- I'll definitely take some time to play with it soon. It's great how much work is happening in this space right now.
> Points 1-3 are generally true for ORDTs (in archagon's terms).
That makes sense to me except for calculating diffs efficiently. Are you saying that can be done in time proportional to the _diff_? (as opposed to the number of changes?)
> P.S. I think I will elaborate. So, imagine a real-time collaborative editor. Assuming every edit (1 char) is also a git-like commit, you have to mention its hash and its parent hashes. Noms trims hashes to 20 bytes, which is reasonable, so we have at least 40 bytes of metadata, incompressible by definition. That means a 1:40 data-to-metadata ratio, or 1:200 if we count compression in. That is more or less the reason we did not use hashes in one of our projects some time ago.
> What is your perspective on that?
We don't actually transmit the hash of a chunk if we transmit the chunk itself, since that's duplicate information. So the overhead would be half your calculation.
My perspective is that this is frequently an entirely acceptable tradeoff. Your calculation is the absolute worst case, and there are many applications for which that worst case occurs infrequently. Today's collaborative text editors, for example, usually damp the liveness on purpose for UX reasons. It's cool to see someone type live, until you are that person and realize that the other side is watching you misspell in real time :).
The minimum size of an update in Noms is currently one "chunk". Chunks are set to average 4 KB in the current code, though that is tunable (the tradeoff is that the cost of local reads and writes goes up).
If you wanted to reduce the size of updates in Noms further, you could do that by transmitting patches to known chunks rather than new chunks.
It looks like the identifiers in RON are 16 bytes, so I think if you continue to push this direction in Noms you end up with wire formats that cost about the same. It just hasn't been something we've worked on yet.
In exchange for this tradeoff, what you get in Noms is:
- No need to ever accumulate a history of changes in order to know the current state. Each snapshot directly and efficiently represents the current state. It looks like in RON you can transmit a state, but it's not clear to me when nodes would choose to do that vs a patch.
- Local storage properties comparable to a traditional database. Store large amounts of data with efficient queries, sorted scans, mutations, etc.
I'm not sure how this all compares to RON; I'm looking forward to digging into it.
It is like the Lamport event model married to a LSMT database. And their children are CRDTs.
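For anyone who hasn't met the reference: a Lamport timestamp totally orders events across replicas with no shared clock, which is the "event model" half of that analogy. A minimal sketch:

```swift
// Minimal Lamport timestamp: a per-event counter plus a site ID for
// tie-breaking gives a total order over all events with no shared clock.
struct LamportID: Comparable {
    var counter: UInt64
    var site: UInt32  // unique per replica; breaks counter ties

    static func < (a: LamportID, b: LamportID) -> Bool {
        (a.counter, a.site) < (b.counter, b.site)
    }

    /// Lamport's rule: on receiving a remote event, advance past it so
    /// causally-later events always sort later.
    mutating func witness(_ remote: LamportID) {
        counter = max(counter, remote.counter) + 1
    }
}
```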
> That makes sense to me except for calculating diffs efficiently. Are you saying that can be done in time proportional to the _diff_? (as opposed to the number of changes?)
Slightly different abstractions here, but yes, in most cases a "diff" is a tail of an op log. On request, it gets compacted and sent. I guess in Noms you leverage immutable data structures for that.
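A toy sketch of that shape, with all names made up (a real ORDT log carries positioning metadata, not just keys, and its compaction is correspondingly subtler):

```swift
// An op log where a "diff" for a peer is the tail past its last-seen
// timestamp, squashed so only the newest write per key survives.
struct Op { var time: UInt64; var key: String; var value: String }

func compactedTail(of log: [Op], since acked: UInt64) -> [Op] {
    var newest: [String: Op] = [:]
    for op in log where op.time > acked {           // unseen tail only
        if let seen = newest[op.key], seen.time >= op.time { continue }
        newest[op.key] = op                         // last write per key wins
    }
    return newest.values.sorted { $0.time < $1.time }
}
```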
> Slightly different abstractions here, but yes, in most cases a "diff" is a tail of an op log. On request, it gets compacted and sent. I guess in Noms you leverage immutable data structures for that.
What happens in the case where the op log represents changes that went back and forth a bunch between two states? Does that ever get compacted down locally?
The difference here is that Noms will never have to read or compact those intermediate states. If a character gets inserted and deleted a bunch of times, and then that set of changes gets synced with a peer, the source peer diffs the beginning and end states and ignores the intermediate ones. The local work done is proportional to the actual change between the two states (in this case, finding a diff of a single char), not to the number of ops that transpired between them.
> I guess in Noms you leverage immutable data structures for that.
The history is stored such that unchanged data is shared between snapshots, so it generally grows much more slowly than the logical size.
But yeah, you have to store history as far back as the amount of time you want to be able to remain disconnected and still resynchronize with peers.
The author is wondering why iCloud Sharing is used by so few developers. I will point out / remind that, fundamentally, it is an API whose usage implies "I will never be able to build an Android app which interoperates with this data" -- already pretty damning for most developers, as a door they don't want to permanently close. Beyond that, the first few releases of iCloud were so bad that documents and even entire clients would end up in permanently wedged states that developers could not fix for users. It got bad enough that at WWDC one year, I remember the Apple iCloud person on stage during the "developer state of the union" talk apologizing for iCloud being so broken and begging the audience for another chance.
While true, I don't think this explains the disparity between regular CloudKit use and CloudKit Sharing use. For example, this snippet from Apple's CloudKit paper[1] particularly struck me: "We identified the top apps using CloudKit, based on their number of active users in the past month, and examined their use of private and public databases. We found that 20% and 49% of the apps use only the public or the private database, respectively, and 31% of apps use both databases (20 apps use the shared database)."
So few apps from their sample set were using CloudKit Sharing that they didn't even bother using a percentage! That's quite unusual for an Apple framework.
> I admit that this revelation made some wily political thoughts cross my mind. Could this be the chance to finally break free from the shackles of cloud computing? It always felt like such an affront that our data had to snake through a tangle of corporate servers in order to reach the devices right next to us. We used to happily share files across applications and even operating systems, and now everything was funneled through these monolithic black boxes. What happened? How did we let computing become so darn undemocratic? It had gotten so bad that we actually expected our content and workflows to regularly vanish as companies folded or got themselves acquired! Our digital assets—some of our most valuable property—were entirely under management of outside, disinterested parties.
> Most obviously, users should be able to edit their documents immediately, without even touching the network [...]
> The user should never be faced with a “pick the correct revision” dialog box.
These two goals cannot be reconciled.
If we both synchronize a document with content, and then we both go offline and make conflicting edits, then someone will have to resolve the conflict.
I think the author is ok with "one of the edits wins," so long as all users agree on the one that will win (without communicating).
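Concretely, "everyone agrees without communicating" usually means a deterministic rule such as last-writer-wins over a total order of edits. A sketch, assuming a Lamport-style (timestamp, site) pair:

```swift
// Last-writer-wins register: the merge compares only data carried by
// the states themselves, so any two replicas pick the same winner
// offline -- no "pick the correct revision" dialog required.
struct LWWRegister<Value> {
    var value: Value
    var timestamp: UInt64   // Lamport counter (or wall clock, with caveats)
    var site: UInt32        // unique replica ID; breaks timestamp ties

    mutating func merge(_ other: LWWRegister<Value>) {
        if (other.timestamp, other.site) > (timestamp, site) {
            self = other
        }
    }
}
```

The cost, of course, is that the losing edit silently disappears -- which is exactly the objection raised below.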
I'm still reading TFA, but this is probably the biggest reason I won't use this system as described. (I will copy any useful ideas, though, and keep it in mind for future projects.)
For my current project, it just doesn't make sense to blindly merge documents. The result will be broken, or worse -- valid but silently incorrect. Having changes you made and saw applied silently vanish is not acceptable to me, and I think having users resolve merge conflicts might be. Not sure; I'm aiming at non-technical users...
We wouldn't accept this for code, though. If that is a principled stance, and not just a side-effect of traditional code syntax being brittle, we should wonder whether our users deserve the same.
Nice overview of the state of the art on CRDTs. Like the OP, I agree that the latest compsci research is more likely than not to bring real clarity to the field. In the future we may wonder how we lived without the ability to do distributed sync of data structures.
Wonderful article, appreciate the detail. I've had great success in the past using event sourcing, so anything that is even remotely similar is always a fun read. Thanks!