
Hi HN! I wanted to research more elegant ways to enable document sync and collaboration in my apps sometime last year, and ended up discovering a new class of data structure that made it possible to build collaboration into documents right on the data level, completely separate from the network architecture. (Maybe you're already familiar with CRDTs. Well, these aren't just ordinary CRDTs, but almost meta-CRDTs that can represent a variety of convergent data types in a generic way. They also make it super easy to implement things like past revision viewing and delta patches.) As a proof of concept, I built a basic real-time collaborative text editor for iOS that works equally well over regular iCloud sync, CloudKit Sharing, and offline, as well as a simple mesh network simulator for macOS that has support for collaborative text editing and simple Bézier editing.
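
To give a rough flavor of the shape of the thing, here's a simplified sketch in Swift (illustrative names only, not the final library API): every operation carries a globally unique, totally ordered ID plus a reference to the operation it causally depends on, so op logs can be merged from any peer, in any order, and still converge.

    import Foundation

    // Simplified sketch, not the actual library types.
    struct OperationID: Hashable, Comparable {
        let lamportClock: UInt64   // logical time, bumped past any clock seen so far
        let siteID: UUID           // tie-breaker that makes the ID globally unique

        static func < (lhs: OperationID, rhs: OperationID) -> Bool {
            (lhs.lamportClock, lhs.siteID.uuidString) < (rhs.lamportClock, rhs.siteID.uuidString)
        }
    }

    // One possible instantiation: a collaborative string, where every op is an
    // insertion anchored to the ID of the character it follows, or a deletion
    // ("tombstone") of an existing ID.
    enum StringOperation {
        case insert(id: OperationID, after: OperationID?, character: Character)
        case delete(id: OperationID, target: OperationID)
    }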

The code is strictly educational for the time being, but I'm working on a Swift library that can hopefully put this stuff to use in real apps. Hope to start dog-fooding in release software by the end of the year!




Unfortunately I wouldn't expect a lot of substantive technical discussion for a post like this on HN -- this post will expire before anyone has time to read it properly and produce thoughtful responses.

(Which is broadly the problem with posting enormous applied CS-research braindumps to HN. Don't know a better place to do it, though. And I see people are upvoting you. Also, the absence of thoughtless responses is itself a great response.)

The key point here, as far as I can tell, is really building a CRDT system as a true platform layer, rather than as a one-off solution for a special-purpose app. I think it's fairly clear that generic sync is a pretty essential part of a modern decentralized computing environment, and I don't think it's clear at all what the best way to solve it is.

But... it seems to me that you've solved a large piece of the problem but not the whole thing. Because the most important document sync and collaboration platform is, of course, source code control. If you have a document sync and collaboration model that doesn't at least generalize to classic revision control, why not?

Now, there's a sensible reason to separate these problems -- CRDT and OT solutions tend to specialize in the zero-maintenance case where a user-resolved merge is just impossible. Whereas if it was possible to build a zero-maintenance revision-control system, which automatically resolved all merges and conflicts, someone would have done so already.

This certainly suggests that the two are different problems. But generalizing across slight differences is what system software does. Maybe one is a special case of the other?

A generalized layer for lightweight collaboration is pretty powerful. But it certainly seems like the case that if you could generalize across lightweight collaboration and heavyweight revision control, you would have something that would be incredibly powerful.

Or is this too ambitious? I don't know so I'm asking you.


I suspect you could build such a thing if you could more rigidly define edit operations on source code files. The way we edit code now is roughly "insert/edit/delete string <x> at position <y>", which doesn't really carry enough context to auto-merge.

If edits carried all the semantic information of what they were doing to the source: "rename symbol <x> to <y> everywhere" "perform step <x> after step <y>", then we could probably build a zero-maintenance revision-control system.
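
For example (hypothetical cases, nothing that exists today), the edits might look like this as data:

    // Hypothetical semantic edit operations, as opposed to raw text edits.
    // Two of these can often be merged mechanically (e.g. a rename and an
    // unrelated reorder), where the equivalent string diffs would conflict.
    enum SemanticEdit {
        case renameSymbol(from: String, to: String)
        case moveDeclaration(symbol: String, toFile: String)
        case reorderSteps(function: String, stepA: Int, stepB: Int)
        case insertStatement(function: String, at: Int, source: String)
    }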


I've had some thoughts along the same lines, but haven't taken the opportunity to flesh them out nearly as far.

A git repository is nearly a CRDT that represents a file system hierarchy -- the only thing that's missing is a deterministic merge algorithm. If you allow files to be committed in a conflicted state that gets resolved by a later edit, you have a full CRDT that automated systems can work freely with.
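
Roughly what I have in mind, as a toy sketch (my illustration, not git's actual data model): make the merge total by letting "both sides" be a legal, committable value, so any two replicas that see the same commits deterministically produce the same tree.

    import Foundation

    // A merged file is either clean content or an explicit conflict value,
    // which is itself committable and resolved by a later edit.
    enum MergedFile: Equatable {
        case resolved(Data)
        case conflicted(base: Data?, ours: Data, theirs: Data)
    }

    func merge(base: Data?, ours: Data, theirs: Data) -> MergedFile {
        if ours == theirs  { return .resolved(ours) }
        if ours == base    { return .resolved(theirs) }  // only "theirs" changed
        if theirs == base  { return .resolved(ours) }    // only "ours" changed
        // In practice the two sides would be ordered canonically (e.g. by
        // commit hash) so both replicas construct the identical conflict value.
        return .conflicted(base: base, ours: ours, theirs: theirs)
    }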

The catch, of course, is that despite pushing the resolution to an arbitrary point in the future, these conflicts still have to be resolved by hand eventually.


Thank you for posting this -- it was a great walkthrough, and very clear. Cool stuff. :-)


Hi archagon,

Have you checked out Noms (https://github.com/attic-labs/noms)?

> But studying Differential Sync further, I realized that a number of details made it a non-starter. First, though the approach seems simple on the surface, its true complexity is concealed by the implementation of diff and patch. This class of functions works well for strings, but you basically need to be a seasoned algorithms expert to design a set for a new data type. (Worse: the inherent fuzziness depends on non-objective metrics, so you’d only be able to figure out the effectiveness of your algorithms after prolonged use and testing instead of formal analysis.) Second, diff and patch as they currently exist are really meant for loosely-structured data such as strings and images. Barring conversion to text-based intermediary formats, tightly structured objects would be very difficult to diff and patch while maintaining consistency.

Noms is a structured database in the spirit of Git. It includes efficient (cost proportional to size of diff, not size of input) diff and merge for sets, lists, maps, and user-defined structs. The built-in merge strategies are commutative and idempotent, so if your operations can be merged using it, then your schema is a (state-based) crdt.
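
To make the commutative/idempotent point concrete, here is the property in miniature (a toy state-based CRDT in Swift, not Noms code): merge is a join, so applying it in any order, any number of times, yields the same state.

    // Toy state-based CRDT: a grow-only set. Merge is commutative, associative,
    // and idempotent, so replicas can exchange full states in any order and
    // re-deliver them any number of times without diverging.
    struct GSet<Element: Hashable>: Equatable {
        var elements: Set<Element> = []

        mutating func add(_ element: Element) { elements.insert(element) }

        func merged(with other: GSet<Element>) -> GSet<Element> {
            GSet(elements: elements.union(other.elements))
        }
    }

    // a.merged(with: b) == b.merged(with: a), and a.merged(with: a) == a.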

We didn't initially build Noms for this type of p2p, masterless architecture, but more and more recently, I find myself thinking it's a great fit.

> Next, there are some issues with using Differential Sync in an offline-first environment. Clients have to store their entire diff history while offline, and then, on reconnection, send the whole batch to their peers for a very expensive merge. Assuming that other sites had been editing away in the meantime, distantly-divergent versions would very likely fail to merge on account of out-of-date context info and lose much of the data for the reconnected peer. Finally, Differential Sync only allows one packet at a time to be in flight between two peers. If there are network issues, the whole thing grinds to a halt.

I may be misunderstanding here, but I think these issues do not afflict Noms:

1/ Noms can compute the diff between any two states efficiently. It doesn't matter how many local changes you make while offline - what is transmitted is just the diff between two states.

2/ Merge performance is relative to the size of the diff being merged, not the number of changes.

3/ I don't think one-packet-in-flight-at-a-time applies to Neil Fraser's differential sync. It definitely doesn't apply to Noms.

Given your interest in CRDTs, I thought you might be interested to take a look.

There's still more work that needs to be done to make Noms work really well for this kind of use case, but I think it's one that makes a lot of sense.

As somebody who's just spent a ton of time looking at all the options, I'd love to hear your thoughts on how Noms fits into your thinking.


Points 1-3 are generally true for ORDTs (in archagon's terms).

I wonder: how much metadata does Noms need to achieve those features?

CRDTs brute-force all those distributed/concurrency issues by adding some metadata. With CRDTs, two hard topics are (1) metadata size and (2) gc.

So, how can we estimate the overhead?

P.S. I think I'll elaborate. So, imagine a real-time collaborative editor. Assuming every edit (1 char) is also a git-like commit, you have to mention its hash and its parent hashes. Noms trims hashes to 20 bytes, which is reasonable, so we have at least 40 bytes of metadata, incompressible by definition. That means a 1:40 data-to-metadata ratio, or 1:200 if we count compression in. That is more or less the reason we did not use hashes in one of our projects some time ago.
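
Spelling that out (my own numbers; the 1:200 figure assumes the character data compresses roughly 5:1 while the hashes, being effectively random, do not):

    // Back-of-the-envelope, under the assumptions above.
    let dataPerEdit = 1.0                                           // one character, in bytes
    let metadataPerEdit = 20.0 * 2                                  // own hash + parent hash
    let rawOverhead = metadataPerEdit / dataPerEdit                 // 40x
    let compressedOverhead = metadataPerEdit / (dataPerEdit / 5.0)  // ~200x, if text compresses ~5:1 and hashes don't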

What is your perspective on that?


Hi gritzko!

I hadn't seen RON. It looks really interesting - I'll definitely take some time to play with it sometime soon. It's so great how much work is happening in this space right now.

> Points 1-3 are generally true for ORDTs (in archagon's terms).

That makes sense to me except for calculating diffs efficiently. Are you saying that can be done in time proportional to the _diff_? (as opposed to the number of changes?)

> P.S. I think I'll elaborate. So, imagine a real-time collaborative editor. Assuming every edit (1 char) is also a git-like commit, you have to mention its hash and its parent hashes. Noms trims hashes to 20 bytes, which is reasonable, so we have at least 40 bytes of metadata, incompressible by definition. That means a 1:40 data-to-metadata ratio, or 1:200 if we count compression in. That is more or less the reason we did not use hashes in one of our projects some time ago.

> What is your perspective on that?

We don't actually transmit the hash of a chunk if we transmit the chunk itself since that's duplicate information. So the overhead would be half your calculation.

My perspective is that this is frequently an entirely acceptable tradeoff. Your calculation is the absolute worst case, and there are many applications for which that worst case occurs infrequently. Today's collaborative text editors, for example, usually dampen the liveness on purpose for UX reasons. It's cool to see someone type live, until you are that person and realize that the other side is watching you misspell in real time :).

The minimum size of an update in Noms is currently one "chunk". Chunks are set to average 4 KB in the code, though that is tunable (the tradeoff is that the cost of local reads and writes goes up).

If you wanted to reduce the size of updates in Noms further, you could do that by transmitting patches to known chunks rather than whole new chunks.

It looks like the identifiers in RON are 16 bytes, so I think if you continue to push in this direction in Noms, you end up with wire formats that cost about the same. It just hasn't been something we've worked on yet.

In exchange for this tradeoff, what you get in Noms is:

- No need to ever accumulate a history of changes in order to know the current state. Each snapshot directly and efficiently represents the current state. It looks like in RON you can transmit a state, but it's not clear to me when nodes would choose to do that vs a patch.

- Local storage properties comparable to a traditional database. Store large amounts of data with efficient queries, sorted scans, mutations, etc.

I'm not sure how this all compares to RON; I'm looking forward to digging into it.


> I hadn't seen RON. It looks really interesting

It is like the Lamport event model married to an LSM-tree database. And their children are CRDTs.

> That makes sense to me except for calculating diffs efficiently. Are you saying that can be done in time proportional to the _diff_? (as opposed to the number of changes?)

Slightly different abstractions here, but yes, in most cases a "diff" is a tail of an op log. On request, it gets compacted and sent. I guess in Noms you leverage immutable data structures for that.
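
To illustrate (hypothetical types, not RON's actual format): each op carries a logical timestamp, so the "diff" for a peer is just the tail of the log after the last op that peer acknowledged, compacted before sending.

    // Hypothetical op log; not RON's actual wire format.
    struct Op {
        let timestamp: UInt64   // Lamport-style logical clock
        let payload: String
    }

    // The "diff" for a peer is the tail of the log after the last timestamp
    // that peer is known to have. A real implementation would also compact
    // the tail (e.g. drop insert/delete pairs that cancel out) before sending.
    func pendingOps(log: [Op], since lastKnown: UInt64) -> [Op] {
        log.filter { $0.timestamp > lastKnown }
    }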

Thanks, now I understand your trade-offs better.


> Slightly different abstractions here, but yes, in most cases a "diff" is a tail of an op log. On request, it gets compacted and sent. I guess in Noms you leverage immutable data structures for that.

What happens in the case where the op log represents changes that went back and forth a bunch between two states? Does that ever get compacted down locally?

The difference here is that Noms will never have to read or compact those intermediate states. If a character gets inserted and deleted a bunch of times, and then that set of changes gets synced with a peer, the source peer diffs the beginning and end state and ignores the intermediate ones. The local work done will just be proportional to the actual change between the two states (in this case, finding a diff of a single char). This work is always proportional to the change between the two states, not the number of ops that transpired between them.

> I guess in Noms you leverage immutable data structures for that.

That, and content-addressing.
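
Roughly, the trick (a toy sketch, not Noms internals): every subtree is addressed by the hash of its contents, so the diff walks two trees and skips any pair of subtrees whose hashes match. The work done is bounded by what actually changed, not by the ops that happened in between.

    // Content-addressed trees: equal hashes imply equal subtrees, so matching
    // branches are skipped in O(1) without reading their contents.
    indirect enum Node {
        case leaf(hash: String, value: String)
        case branch(hash: String, children: [Node])
    }

    func hash(of node: Node) -> String {
        switch node {
        case .leaf(let h, _), .branch(let h, _): return h
        }
    }

    // Collects changed subtree pairs; children are aligned positionally here
    // for simplicity (a real structure would align them by key ranges).
    func diff(_ a: Node, _ b: Node, into changed: inout [(Node, Node)]) {
        guard hash(of: a) != hash(of: b) else { return }
        if case .branch(_, let ac) = a, case .branch(_, let bc) = b, ac.count == bc.count {
            for (x, y) in zip(ac, bc) { diff(x, y, into: &changed) }
        } else {
            changed.append((a, b))
        }
    }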


To compute diffs, you have to store the entire history. Is there any point in time where you prune the history?


The history is stored such that unchanged data is shared between snapshots, so it generally grows much more slowly than the combined logical size of those snapshots.

But yeah, you have to store history going back as far as the amount of time you want to be able to remain disconnected and then resynchronize with peers.



