Hacker News new | comments | show | ask | jobs | submit login
A Conflict-Free Replicated JSON Datatype (arxiv.org)
132 points by DanielRibeiro 402 days ago | hide | past | web | 59 comments | favorite

> Our principle of not losing input due to concurrent modifications appears reasonable, but as illustrated in Figure 4, it leads to merged document states that may be surprising to application programmers who are more familiar with sequential programs.

That decision is a bit odd to me. Not only do they show that their merge system can produce data that the application considers invalid, it can even convert a string to a list without having any operation creating a list explicitly.

> Moreover, garbage collection (tombstone removal) is required in order to prevent unbounded growth of the datastructure

That is always the painful part of a CRDT system. Until that part is done, it cannot really be used in production. That said, previous work on CRDTs give me confidence that it can be implemented with minimal overhead.

> In fact, we wrote the LATEX source text of this paper using an experimental collaborative text editor that is based on our implementation of this CRDT. We are making the character-by-character editing trace of this document available as supplemental data of this paper

String editing is brushed over in the paper. I'd love to see their implementation. Does anyone know where it all is?

>> In fact, we wrote the LATEX source text of this paper using an experimental collaborative text editor that is based on our implementation of this CRDT. We are making the character-by-character editing trace of this document available as supplemental data of this paper

If anyone else wants to feel at the cutting edge of science, you can write your LaTeX paper collaboratively at https://www.sharelatex.com :P

I can't pretend we've got any original algorithms (we lean on ShareJS a lot, which is operational transform based), but we've got a few years and billions of key strokes of experience in making it robust and practical.

We've been developing ShareDB (https://github.com/share/sharedb), a JSON OT system, and using it in production at Lever for 3 years now.

CRDT has some strong advantages in decentralized applications with a need to avoid a central server. But in practice, web apps today do generally have a central server.

I think the core advantages of OT over CRDTs are actually these practical concerns: being able to design transformations that are more intuitive to application developers, easily cleaning up old operation history, and real world production use and examples of the technology already.

One thing OT has enabled us to do that you cannot do as easily in a CRDT is to support turning off publishing of operations when we do large migrations and create huge numbers of ops at once. Because ops have a strict ordering and versioning, clients can lazy get the ops that they have missed if needed later because they submitted a modification on a document or see an even newer op and need the intermediate ops. In a CRDT, you have to make sure that all the clients get all the ops in order to converge. It sounds simple but can be complex in practice.

Another practical issue I'd add is that these algorithms sometimes place the burden of resolving the outcome of concurrent edits at read time vs. write time. With OT, you generally do a lot of complex work to figure out how to transform ops at write time, but when you read, you just read the already fully updated document in its native format. CRDT systems often store the document in a format that can be written to very cheaply as it is commutative, but the work gets pushed to read time when you have to collapse a tree full of all its history into a different data structure, such as a string for text editing. One is not strictly better than the other, but many production systems are vastly more read heavy than write heavy, so less obvious tradeoffs like this can become extremely important in production use.

Have you read the SwiftCloud paper? https://pages.lip6.fr/Marek.Zawirski/papers/SwiftCloud-RR-83...

It's about using a centralized cloud infrastructure in conjunction with CRDTs. This allows to fix some of the issues you mention (pruning of operation history, stricter versioning, no need to push operations, ...).

Of course it complicates the implementation but it's quite interesting.

That last part is a good summary of CRDT versus OT.

Also nice project. I think a section on conflict handling would be a good addition to your README on github. It is the first thing I look for when looking at systems in this space :)

I've been doing some tinkering in this area lately, and I was also looking for an explanation of the string editing mechanics in this algorithm.

The hardest part there is retaining the intent of the simultaneous edits -- the I in the "PCI Consistency model" as described by the WOOT paper. Previous algorithms (LSEQ, WOOT, Doctree, Logoot, etc.) devoted to simultaneous conflict-free editing of text have solved this through relatively complex key generation strategies. I've only done a cursory read of this paper, but this algorithm doesn't seem to implement an alternative solution.

It's not trying to do string editing at all.

It's operating on JSON de-serialized to maps and lists. String editing isn't a valid operation on the data model they specify. You can replace a string value with another.

Section 3.3 outlines the allowed operations, which basically allows you to obtain an iterator, move it, insert, delete and retrieve keys and values.

You could certainly model string editing in it, by putting each character into it as a separate value, but then as you point out it'd probably not do very well because it doesn't contain heuristics to model intent of text editing, but to model changes to a JSON object hierarchy while avoiding loss of data.

You could use this algorithm combined with one intended for string-editing to improve on it for values where partial edits of the values matters, certainly.

wouldn't a string just be a list of chars?

> That decision is a bit odd to me. Not only do they show that their merge system can produce data that the application considers invalid, it can even convert a string to a list without having any operation creating a list explicitly.

It's hard to see good alternatives there. If you move from a single editor to concurrent edits without very strict ordering guarantees or a transactional system, you will need to deal with conflicts or lose data. Creating output that would be considered invalid unless the application is fixed to explicitly deal with it (or explicitly throw it away) seems quite reasonable.

For this you would want a type / schema that explicitly represented the conflicts at the right level of granularity and helped applications resolve them.

Whereas producing an object the pretends to be conflict free, but has some fields that might or might not change type is the worst of both worlds.

Exactly, if there are conflicts that have to be resolved then it's not conflict-free.

It is conflict free in this sense if no change is rejected, and if the result eventually converges (e.g. if two parties receive the full set of changes, they will end up with the same result, but they may diverge locally until all changes have been synchronised to all replicas).

This does mean sometimes producing something that have "application level conflicts" or something that a user will consider a conflict even if the application is perfectly fine with it. Depending on how good the heuristics for determining which changes to keep are, this may be obnoxious or appear perfectly reasonable.

E.g. two users start with the same empty document. They each type a sentence. Merge happens. Did they intend the document to have just their sentence, or do they want both? User A wrote that their mutual boss is an idiot; user B wrote that their mutual boss is a genius. Now what do they want to remain? There's no way for software to resolve that automatically which doesn't have disturbing implications.

But there certainly are some alternatives that are better than others. E.g. you'd probably prefer to retain both sentences, so you can agree on which to keep. You'd certainly prefer to keep at least one to deleting both.

They are presenting an algorithm, not a product, so any implementation would be free to make it available in a different way.

They specifically do address this point about how developers would deal with it in the conclusion of the paper.

> [Garbage collection] is always the painful part of a CRDT system.

This is not unique to CRDTs: a temporal datastructure will accumulate garbage. Blockchain (e.g. Git); MVCC (e.g. Postgresql).

About garbage collection, there is some research done in that direction: Logoot [0] is a collaborative text editor that doesn't depend on tombstones.

[0]: https://hal.archives-ouvertes.fr/inria-00336191/

> String editing is brushed over in the paper.

LaTeX source code is basically just a string. It would have been more interesting to see this applied to a hierarchical data-structure, such as HTML, where e.g. elements are nested inside markup nodes.

LaTeX is also hierarchical, though. Some is explicit (environments) and some is implicit (\chapter, \section etc).

This is true. To preserve document validity and user intent, an algorithm designed for collaborative string editing will not always work correctly when applied to a markup document such as LaTeX, HTML, or Markdown directly.

It is similar to how you need a JSON type to represent the kinds of edits you can do to a JSON document rather than editing JSON text as a string. The kinds of edits done in text better map to editing a markup language like LaTeX or Markdown than JSON, so you are less likely to notice these issues right away. But if the type is designed specifically for the markup language, user intent will be better maintained through concurrent edits and the document will never end up in an invalid format.

The same holds for LaTeX stored in git, yet a lot of people use this combination without problems.

It is a hack, in a sense, but one without serious consequences.

Agreed. It is likely to work pretty well with a format that is very close to plain text like LaTeX most of the time.

Git's resolution strategy is diff-match-patch of text, so it is a good analogy. However, the thing that Git does that CRDTs and OT string types generally do not is create conflicts requiring user intervention before proceeding.

If you are realtime editing concurrently, this might be acceptable, because the users might be able to see the conflict and resolve it. However, I'd say a clear UI for conflicts is the key reason why Git is able to more intuitively and safely deal with concurrent edits of non-plain text using a plain text algorithm.

You already indicate that the LaTeX model is not a "clean" hierarchy. So I highly doubt they modeled it like that in JSON, but it could be possible.

It's a hierarchy like html, where p tags have implicit close tags.

Reminds me of the operational transformation used in Wave.


I see they mention it in the paper, saying the difference is that Wave requires a central server to impose a total ordering of operations.

I developed a "general purpose" JSON-based CRDT[1] as a part of my content addressing system project. The idea was to have an "append-only tree", where subtrees could be used by individual applications for different purposes, for example as counters or mutable sets.

In retrospect, it didn't need to support recursion. A single-level append-only set is enough to be fully general and easier to perform indexing on. Using JSON was also overkill, since too much flexibility is bad for content-addressing.

[1] https://github.com/btrask/stronglink/blob/master/client/READ...

It is impossible to have accurate conflict-free replicated data. Unless the users are given the opportunity to manually resolve conflicts, it will never be 100% accurate - Incorrect data will creep in from time to time (but at least the incorrect data will be consistent across all nodes).

The root of the problem is that the system cannot understand the collaborative intent of concurrent users - If you have 2 users who made a change at the same time (without being aware of each other); you have to account for the fact that maybe UserB would have behaved differently if they had been aware of UserA's input (which happened at the same time while one or both users were offline). If the user has been offline for a while - Many such conflicts could arise (maybe by the time the internet comes back on, the user is looking at a completely different page than the one they made the change on) and it's tedious to make the user resolve them all manually.

Also with this approach, it tends to force you to keep a copy/cache of all the data in your entire app (for that logged-in user) on the frontend - If you have a big app with lots of pages, that could consume a lot of memory.

There are cases where the best solution is to simply tell the user "Sorry, you do not have an internet connection at the moment, so you cannot modify this data" rather than giving them a false sense that the data are correctly backed up in the cloud. I think with CRDTs, it's really important to inform the user when they are offline and when their data are not synced/backed up in the cloud (so they don't get any bad surprises when the internet suddenly comes back on).

CRDTs are good where the accuracy of the data is not critical (E.g. bank transactions). One could argue that they improve the user experience, but at the core, developers like them because they make life easier.

"It is impossible to have accurate conflict-free replicated data. Unless the users are given the opportunity to manually resolve conflicts, it will never be 100% accurate [...] The root of the problem is that the system cannot understand the collaborative intent of concurrent users"

CRDTs are usually designed to model user intent. You just need to pick the right CRDT for your use-case. From there, the CRDT will resolve conflicts automatically and accurately. Your statement regarding impossibility would be true of operational transformation, but certainly not of CRDTs.

To the extent that CRDTs are conflict-free it only means that two edits can commute (be applied in any order) and every party that has applied the same set of edits will produce the same result.

At the low-level that the CRDT is operating on there will be no conflicts.

But that does not mean that the user never perceives there to be conflicts. No matter what the consensus system used be it CRDTs or OT at some point the converge operation has to impose a total ordering to pick a "winner" in the case of conflicting user edits.

If editing a block of text and two users try to replace the same word with another word there are a number of possible outcomes

1) One of the edits "wins" 2) The word is replaced by the concatenation of each user's replacement. 3) Nobody wins and the edit is reverted.

In all cases at least one party perceives to themselves to have "lost". But it generally doesn't matter because humans doing the editing will make repairs to nonsensical edits in real time.

There has to be a tie-breaker when multiple users try to make different edits to the same region of text. At least that's my current understanding.

There are several CRDT algorithms (LSEQ, Logoot, WOOT, Treedoc) that try to properly solve the merging solution (your outcome #2) while still retaining the intent of the edits of each user. Their implementations differ, but the general idea is that each character (or chunk of characters) is assigned a key that can be ordered. When new text is added, it's given a key that is derived from the key of some adjacent text. As a result, the merge is the best-possible approximation of the intent of the edits.

Yes, I think we are in agreement. The algorithm can use heuristics to produce a best-effort pleasing result that captures user intent but if two users want opposite things to happen it has to fall back to a tie breaker.

In an extreme example, if the CRDT state-space was 1-bit and user A wants to make it a 0 and user B wants to make it a 1 a choice must be made by the algorithm.

Right, exactly. The key question is whether the algorithm's resolution heuristic takes into account the intent of the edit, and I'm not sure the algorithm in the article makes an effort to do that. (Not saying it doesn't, I just don't understand how it does, if it does.)

It's main consideration appears to be to make changes in a way that makes sense for a tree consisting of maps and lists without ever losing user input. If you look at their diagram with examples you get the gist of their resulting heuristics, which basically boils down to "only delete if a user explicitly deleted, otherwise keep both values if users try to set the same value to two different things", "if two complex values (maps,list) have been updated, recursively merge them", pretty much.

It will cause stuff that users will perceive as conflicts, and that may even appear to be totally illogical (one of their examples leaves an object that appears to be in a broken state, because one side deleted it, and the other side updated a single attribute, leading the map to continue to exist, but with only the one updated attribute) so there's probably room for improvement, though the rules are simple enough that many of these could be resolved at application level by just carefully deciding what operation to provide.

"To the extent that CRDTs are conflict-free"

There is no "extent" to which CRDTs are conflict free, CRDTs are by definition always 100% conflict free.

"No matter what the consensus system used be it CRDTs or OT"

CRDTs and OT are worlds apart. CRDTs are guaranteed to be conflict free, whereas OT is generally too complicated and unproven to offer that guarantee at all.

"If editing a block of text and two users try to replace the same word with another word there are a number of possible outcomes"

There are actually many more possible outcomes than those you listed, so that with CRDTs designed explicitly for collaborative string editing you can provide perfect intent-preserving merges to the user. There is an excellent paper on preserving intent, see "Replicated abstract data types: Building blocks for collaborative applications" (http://dl.acm.org/citation.cfm?id=1931272")

> whereas OT is generally too complicated and unproven to offer that guarantee at all.

Citation needed.

I've built several production-level OT-based systems on top of ShareJS's JSON OT[1] code. The set of operations supported is guaranteed to be conflict-free and correct. We don't have AGDA proofs but we've used fuzzers to ferret out correctness bugs and its been about 2 years since a bug was found in the transform code. All in all, I'm very happy with the implementation.

Meanwhile, I don't believe a more generic JSON OT / CRDT system can be made conflict-free. (Well it can be conflict-free, but you'll lose data if it is). If you support arbitrary tree-level moves, you have the User A moves x into y's children, user B moves y into x's children problem. There are simply no good ways to resolve these operations without user intervention, or a lot more knowledge of the data structures at play.

[1] https://github.com/ottypes/json0

See: https://en.wikipedia.org/wiki/Operational_transformation#Cri...

"Due to the need to consider complicated case coverage, formal proofs are very complicated and error-prone, even for OT algorithms that only treat two characterwise primitives (insert and delete)" - Du Li & Rui Li - "An Admissibility-Based Operational Transformation Framework for Collaborative Editing Systems"

There's also an interesting comment there from the author of ShareJS, I think that might be you?

The other critical difference between CRDTs and OT is that CRDTs work offline, in a distributed setting, whereas OT cannot. OT requires a central online server to coordinate, which as far as I understand is the cause of the classic UI freeze in Google Docs whenever the network goes.

Yes thats me! For what its worth, I no longer believe that wave would take 2 years to implement now - mostly because of advances in web frameworks and web browsers. When wave was written we didn't have websockets, IE9 was quite a new browser. We've come a really long way in the last few years.

Working offline and working in a distributed setting are two different properties.

OT can work very well offline - the client and server just buffer ops and you do reconciliation. Some OT algorithms experience O(n*m) performance when the client comes back online (n client ops, m server ops), though the constant is quite low, you can make the client do the work. But you can do better - a better designed text reconciliation system can do O(nlog(n) + mlog(m)) if I remember correctly - which is quite usable in real systems and very competitive with CRDTs.)

P2P is much harder. I've got a sketch for a system that would use OT in a P2P setting, but in many ways its a poor man's CRDT. That said, the downside of CRDTs is that they grow unbounded as you make more modifications. Which would work great for documents that are written then discarded (like a HN thread). But it would work much less well for long lived, frequently written documents (like a server status dashboard). You can fix that using fancy coordination algorithms, but if you're going down that path you're back to sketching out an overcomplicated OT system.

There's lots of fun space to explore there - but not enough actual useful systems being made for production use.

Designer & author of the JSON OT algorithm that underlies ShareJS/ShareDB here. I do not have a formal proof for any of the properties of the algorithm, but we have extensively fuzzed it and are confident in its correctness. I believe a proof of correctness is possible, merely tedious.

FWIW, Google Docs can be used offline (both in its Chrome-based and native-app-based incarnations).

> There is no "extent" to which CRDTs are conflict free, CRDTs are by definition always 100% conflict free.

CRDTs are always conflict-free. I am not disagreeing with that. I was trying to draw attention to the fact that just because the CRDT always commutes does not mean that a user does not perceive there to have been a conflict - out of a multiple contradictory edits the commute operation will ensure only one succeeds. So yes, the CRDT itself is conflict free, but it wasn't my point.

They aren't magic. The deal with tie breakers when they have to. I've seen posts where suggestions of building a CRDT-based source control system are being bandied around with claims that there will be no conflicts - sure, there won't be. Will the code make sense or even compile - probably not.

A conflict to a CRDT is one thing and a conflict to a user is something else entirely.

> so that with CRDTs designed explicitly for collaborative string editing you can provide perfect intent-preserving merges to the user.

I don't for a second believe this is event theoretically possible, given that if two people write collaboratively, it is not uncommon for the intent to change "mid stream" as a result of seeing the other person edit.

Add in some lag, and watch confusion ensue.

Figure 4: replica p had the intent of deleting the node while replica q wanted to tick it. The final outcome is not what either human intended. Replica p would see their action partially reverted, replica q might not be able to discern their action in the list (or outright undone if the application skips invalid records).

Consider storing a single value in a document:

    { "A": 1 }
Q updates it to 2, P simultaneously updates it to 3. What is the correct outcome? Interactive consensus would be required even if humans were doing this by hand; unless quantum, there exists no accurate merge. That's not to say that CRDTs are useless, just not as grandiose as "automatic and accurate."

It seems in this case it might be

   {"A" : [2, 3]}

This kind of stuff is useful if users explicitly understand and design their application to expect this kind of stuff. It might not be possible always.

But note, that there is still a good value in all the servers which got the same conflicts ending with the same result. As opposed to say server1: picking {"A" : 2} and server2 {"A" : 3}. Now something can be built on top of it given it understand and accounts for how merges happen.

The problem is that the user's data itself needs to be conflict-free. For example, no programming language models its source code as a CRDT, so DVCSes will always produce merge conflicts (or worse, silently broken merges).

CRDTs are a fundamentally leaky abstraction. That doesn't mean they're bad, and the payoff of offline modification is very tempting. It just means they're hard to use.

In what sense are CRDTs "leaky"? Do you mean that they aren't composable in the way linearizable registers ostensibly are? It turns out that partially commutative monoids combined with invariants can form "logically atomic specifications" that emulate linearizability, and are thus composable: http://plv.mpi-sws.org/iris/paper.pdf. And there are ways to get consistent reads out of CRDTs, too: http://www.cs.indiana.edu/~lkuper/papers/threshold-queries-d.... I'm not saying that all CRDTs fall into this framework, but it's at least nonobvious to me.

"The problem is that the user's data itself needs to be conflict-free"

The user's data needs to support being modeled by a series of commutative operations.

"CRDTs are a fundamentally leaky abstraction"

Are non-commutative data types less fundamentally leaky than commutative data types?

CRDTs are good where the accuracy of the data is not critical (E.g. bank transactions).

It's interesting you say that given that banking invented eventual consistency with manual reconciliation of exceptions, in the form of the cheque.

If you have an API on a shared data structure, either you have distributed locking or you have conflicts. Distributed locking sucks, but CRDTs only prevent conflicts from messing up your data structure, not from existing.

The most important thing is to display the content to every user and allow them to update the intent if the CRDT rules are incorrect for that use case. Therefore the most important thing is not the fact that it's conflict-free, it's that it builds exactly the same document for all users no matter what order the operations are applied in (if it's operations based, rather than merge based).

"the most important thing is not the fact that it's conflict-free"

CRDT is actually an acronym for both "Conflict-free replicated data type" and "Commutative replicated data type". Both properties are provided by a CRDT by definition. Both properties are equally important.

"(if it's operations based, rather than merge based)"

CRDTs are commutative by definition irrespective of whether they are operation based or state based.

As I understand it, sometimes they do this via techniques roughly similar to inserting change markers (that is, remember both versions). The change may technically be "conflict free" according to the algorithm's definition, but from the user's point of view, the merge has not been completed until you pick which version to keep. Informally, each place we need to manually do a merge can be considered an unresolved conflict according to an alternate definition of "conflict".

If you need to pick a version, then it's not a CRDT.

Normally, you would use the right CRDT to fit the problem space, provided the problem space can be modeled using commutative operations. CRDTs require the problem space to be defined in terms of commutative operations. That's just a requirement of the data structure, just like binary search requires a sorted list. If that requirement is met, then there would be no conflicts and no merges by design. That's the beauty of CRDTs. They are perfect for multi-master distributed settings.

You wouldn't use a CRDT to store the raw contents of an Excel document or a Javascript file, since those file formats were not designed to be modeled using commutative operations. I think that's the use case you have in mind? For that you would need manual merges, regardless of the data structure or binary string you use to represent the document.

If the system is based on merging, then in general it is not a good solution. It works in a lot of cases (see git), but essentially, on every merge the algorithm does a "hmm, how did these changes happen?" instead of taking into account the semantics in a rigorous way. But in many practical situations, merging can be considered a very useful "hack".

I am the author of a CRDT database that can store JSON.

Here are Kyle Kingsbury's (Aphyr) thoughts on it: https://twitter.com/aphyr/status/646302398575587332 .

In response to the OT comments here in the thread, people should be aware that OT is not P2P or decentralized. The algorithm in the paper is, however, and so is GUN (https://github.com/amark/gun) .

Probably more important to the discussion, is that it IS possible to implement collaborative rich text editing on top of P2P CRDT systems. Although I do not know of any yet, I am working towards one which uses a linked-list DAG, early demo: https://www.youtube.com/watch?v=rci89p0o2wQ .

so from reading the comments, am I correct that with an object like this


if one user changes it to be


while the other at the same time changes it to be


and another user changes it to

{"key":[2, 3]}

it's impossible to tell if there was a conflict between the first 2 users or the third user just made an update

Related: does anyone know any other good resources for architecting and building OT/CRDT for nested data structures like HTML or JSON? Ideally projects with working examples even?

Author of ShareDB (https://github.com/share/sharedb) here. ShareDB is a stable production JSON OT system that powers the entire backend for Lever (https://www.lever.co/). All of Lever's apps, backend services, migration scripts, etc. go through ShareDB. It scales horizontally; over 1000 companies use Lever, including companies with 1000s of employees, like Netflix and Yelp.

There are some simple examples in the repo to get you started. Please let us know if you have any questions getting going!

We do OT on HTML (or rather the DOM) in the Webstrates (http://www.webstrates.net) project. We use ShareJS (now ShareDB) and represent and store the DOM as a JSON document.

I'm looking for a good 3-way merge for strings implemented in Javascript.

Can't read the paper - is this an implementation of or just related to CRDTs?

"The JSON datatype described in this paper is a kind of CRDT"

Applications are open for YC Winter 2018

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact