
A Conflict-Free Replicated JSON Datatype - DanielRibeiro
http://arxiv.org/abs/1608.03960
======
espadrine
> _Our principle of not losing input due to concurrent modifications appears
> reasonable, but as illustrated in Figure 4, it leads to merged document
> states that may be surprising to application programmers who are more
> familiar with sequential programs_.

That decision is a bit odd to me. Not only do they show that their merge
system can produce data that the application considers invalid, it can even
convert a string to a list without having any operation creating a list
explicitly.

> _Moreover, garbage collection (tombstone removal) is required in order to
> prevent unbounded growth of the datastructure_

That is always the painful part of a CRDT system. Until that part is done, it
cannot really be used in production. That said, previous work on CRDTs gives me
confidence that it can be implemented with minimal overhead.

> _In fact, we wrote the LATEX source text of this paper using an experimental
> collaborative text editor that is based on our implementation of this CRDT.
> We are making the character-by-character editing trace of this document
> available as supplemental data of this paper_

String editing is brushed over in the paper. I'd love to see their
implementation. Does anyone know where it all is?

~~~
nateps
We've been developing ShareDB
([https://github.com/share/sharedb](https://github.com/share/sharedb)), a JSON
OT system, and using it in production at Lever for 3 years now.

CRDT has some strong advantages in decentralized applications with a need to
avoid a central server. But in practice, web apps today do generally have a
central server.

I think the core advantages of OT over CRDTs are actually these practical
concerns: being able to design transformations that are more intuitive to
application developers, easily cleaning up old operation history, and real
world production use and examples of the technology already.

One thing OT has enabled us to do that you cannot do as easily in a CRDT is to
support turning off publishing of operations when we do large migrations and
create huge numbers of ops at once. Because ops have a strict ordering and
versioning, clients can lazily fetch the ops they missed if they need them
later, either because they submit a modification to a document or because they
see an even newer op and need the intermediate ops. In a CRDT, you have to make sure that all the
clients get all the ops in order to converge. It sounds simple but can be
complex in practice.
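
A rough sketch of that lazy catch-up in Python, assuming a toy in-memory log where the document version is simply the number of ops applied so far. `OpLog` and `ops_since` are made-up names for illustration, not the ShareDB API:

```python
class OpLog:
    """Toy server-side log: ops are totally ordered; version == number of ops."""

    def __init__(self):
        self.ops = []

    def submit(self, op):
        self.ops.append(op)
        return len(self.ops)  # new document version

    def ops_since(self, version):
        # A client at `version` needs exactly the ops after it, in order.
        return self.ops[version:]


log = OpLog()
for i in range(5):
    log.submit({"set": i})  # e.g. a migration writing many ops without publishing

client_version = 2  # the client last saw version 2
missed = log.ops_since(client_version)  # fetched only when the client needs them
```

The point is only that a single version number is enough for a client to ask for exactly what it missed.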

Another practical issue I'd add is that these algorithms sometimes place the
burden of resolving the outcome of concurrent edits at read time vs. write
time. With OT, you generally do a lot of complex work to figure out how to
transform ops at write time, but when you read, you just read the already
fully updated document in its native format. CRDT systems often store the
document in a format that can be written to very cheaply as it is commutative,
but the work gets pushed to read time when you have to collapse a tree full of
all its history into a different data structure, such as a string for text
editing. One is not strictly better than the other, but many production
systems are vastly more read heavy than write heavy, so less obvious tradeoffs
like this can become extremely important in production use.
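
To make the read-time cost concrete, here is a minimal Python sketch, assuming a toy sequence CRDT that keeps deleted characters around as tombstones. `Element` and `materialize` are hypothetical names, not taken from the paper or from any library:

```python
from dataclasses import dataclass


@dataclass
class Element:
    # Hypothetical element of a toy sequence CRDT.
    char: str
    deleted: bool = False  # tombstone: deleted elements stay in the structure


def materialize(elements):
    """Read-time work: walk every element ever inserted and skip tombstones."""
    return "".join(e.char for e in elements if not e.deleted)


# Writes are cheap: a delete only flips a flag.
doc = [Element(c) for c in "hello"]
doc[1].deleted = True  # delete "e"
doc[4].deleted = True  # delete "o"

text = materialize(doc)
```

Every read pays for the full history, including the tombstones, which is also why garbage collection matters.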

~~~
arj
That last part is a good summary of CRDT versus OT.

Also nice project. I think a section on conflict handling would be a good
addition to your README on github. It is the first thing I look for when
looking at systems in this space :)

------
btrask
I developed a "general purpose" JSON-based CRDT[1] as a part of my content
addressing system project. The idea was to have an "append-only tree", where
subtrees could be used by individual applications for different purposes, for
example as counters or mutable sets.

In retrospect, it didn't need to support recursion. A single-level append-only
set is enough to be fully general and easier to perform indexing on. Using
JSON was also overkill, since too much flexibility is bad for content-
addressing.

[1]
[https://github.com/btrask/stronglink/blob/master/client/READ...](https://github.com/btrask/stronglink/blob/master/client/README.md#meta-files)
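
A minimal sketch of the single-level idea in Python, assuming entries shaped as `(kind, unique_id, payload)` tuples. This layout is invented for illustration and is not the StrongLink meta-file format; the set below also uses a simple "remove wins" rule:

```python
def merge(replica_a, replica_b):
    # Merging append-only sets is plain union: commutative, associative, idempotent.
    return replica_a | replica_b


def counter_value(entries):
    # A counter is the sum of all "incr" entries.
    return sum(e[2] for e in entries if e[0] == "incr")


def set_members(entries):
    # A mutable set: present if added and never removed (remove wins).
    added = {e[2] for e in entries if e[0] == "add"}
    removed = {e[2] for e in entries if e[0] == "remove"}
    return added - removed


# The unique id keeps two equal increments from collapsing into one entry.
a = {("incr", "a1", 1), ("add", "a2", "x")}
b = {("incr", "b1", 2), ("add", "b2", "y"), ("remove", "b3", "x")}

merged = merge(a, b)
```

Each application just filters the same flat set for the entry kinds it cares about.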

------
jondubois
It is impossible to have accurate conflict-free replicated data. Unless the
users are given the opportunity to manually resolve conflicts, it will never
be 100% accurate - Incorrect data will creep in from time to time (but at
least the incorrect data will be consistent across all nodes).

The root of the problem is that the system cannot understand the collaborative
intent of concurrent users - If you have 2 users who made a change at the same
time (without being aware of each other); you have to account for the fact
that maybe UserB would have behaved differently if they had been aware of
UserA's input (which happened at the same time while one or both users were
offline). If the user has been offline for a while - Many such conflicts could
arise (maybe by the time the internet comes back on, the user is looking at a
completely different page than the one they made the change on) and it's
tedious to make the user resolve them all manually.

Also with this approach, it tends to force you to keep a copy/cache of all the
data in your entire app (for that logged-in user) on the frontend - If you
have a big app with lots of pages, that could consume a lot of memory.

There are cases where the best solution is to simply tell the user "Sorry, you
do not have an internet connection at the moment, so you cannot modify this
data" rather than giving them a false sense that the data are correctly backed
up in the cloud. I think with CRDTs, it's really important to inform the user
when they are offline and when their data are not synced/backed up in the
cloud (so they don't get any bad surprises when the internet suddenly comes
back on).

CRDTs are good where the accuracy of the data is not critical (unlike, e.g.,
bank transactions). One could argue that they improve the user experience, but
at the core, developers like them because they make life easier.

~~~
jorangreef
"It is impossible to have accurate conflict-free replicated data. Unless the
users are given the opportunity to manually resolve conflicts, it will never
be 100% accurate [...] The root of the problem is that the system cannot
understand the collaborative intent of concurrent users"

CRDTs are usually designed to model user intent. You just need to pick the
right CRDT for your use-case. From there, the CRDT will resolve conflicts
automatically and accurately. Your statement regarding impossibility would be
true of operational transformation, but certainly not of CRDTs.

~~~
jamesdsadler
To the extent that CRDTs are conflict-free it only means that two edits can
commute (be applied in any order) and every party that has applied the same
set of edits will produce the same result.

At the low-level that the CRDT is operating on there will be no conflicts.

But that does not mean that the user never perceives there to be conflicts. No
matter what the consensus system used, be it CRDTs or OT, at some point the
converge operation has to impose a total ordering to pick a "winner" in the
case of conflicting user edits.

If editing a block of text and two users try to replace the same word with
another word there are a number of possible outcomes:

1) One of the edits "wins".
2) The word is replaced by the concatenation of each user's replacement.
3) Nobody wins and the edit is reverted.

In all cases at least one party perceives themselves to have "lost". But it
generally doesn't matter because humans doing the editing will make repairs to
nonsensical edits in real time.

There has to be a tie-breaker when multiple users try to make different edits
to the same region of text. At least that's my current understanding.
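
A minimal sketch of such a tie-breaker in Python, assuming a last-writer-wins register where each write is a `(timestamp, replica_id, value)` tuple. This is one simple choice among many, not what the paper does:

```python
def merge(a, b):
    """Last-writer-wins merge of two (timestamp, replica_id, value) writes.

    Ties on timestamp are broken by replica_id, so every replica picks the
    same winner no matter which order the edits arrive in.
    """
    return max(a, b, key=lambda w: (w[0], w[1]))


edit_a = (10, "A", "colour")  # user A replaces the word
edit_b = (10, "B", "color")   # user B replaces the same word concurrently

winner_ab = merge(edit_a, edit_b)
winner_ba = merge(edit_b, edit_a)
```

Both orders converge on the same value, but user A's edit is silently dropped, which is the "lost" perception described above.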

~~~
jorangreef
"To the extent that CRDTs are conflict-free"

There is no "extent" to which CRDTs are conflict free, CRDTs are by definition
always 100% conflict free.

"No matter what the consensus system used be it CRDTs or OT"

CRDTs and OT are worlds apart. CRDTs are guaranteed to be conflict free,
whereas OT is generally too complicated and unproven to offer that guarantee
at all.

"If editing a block of text and two users try to replace the same word with
another word there are a number of possible outcomes"

There are actually many more possible outcomes than those you listed, so that
with CRDTs designed explicitly for collaborative string editing you can
provide perfect intent-preserving merges to the user. There is an excellent
paper on preserving intent, see "Replicated abstract data types: Building
blocks for collaborative applications"
([http://dl.acm.org/citation.cfm?id=1931272](http://dl.acm.org/citation.cfm?id=1931272))

~~~
josephg
> whereas OT is generally too complicated and unproven to offer that guarantee
> at all.

Citation needed.

I've built several production-level OT-based systems on top of ShareJS's JSON
OT[1] code. The set of operations supported is guaranteed to be conflict-free
and correct. We don't have Agda proofs but we've used fuzzers to ferret out
correctness bugs and it's been about 2 years since a bug was found in the
transform code. All in all, I'm very happy with the implementation.

Meanwhile, I don't believe a more generic JSON OT / CRDT system can be made
conflict-free. (Well it can be conflict-free, but you'll lose data if it is).
If you support arbitrary tree-level moves, you have the User A moves x into
y's children, user B moves y into x's children problem. There are simply no
good ways to resolve these operations without user intervention, or a lot more
knowledge of the data structures at play.

[1] [https://github.com/ottypes/json0](https://github.com/ottypes/json0)
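
The move problem can be sketched in a few lines of Python, assuming the tree is stored as a child-to-parent map. `move` and `has_cycle` are illustrative helpers, not part of json0:

```python
def move(parent, node, new_parent):
    """Reparent `node` under `new_parent` in a child -> parent map."""
    parent[node] = new_parent


def has_cycle(parent, node):
    """Walk up from `node`; revisiting a node means the tree is broken."""
    seen = set()
    while node is not None:
        if node in seen:
            return True
        seen.add(node)
        node = parent.get(node)
    return False


# x and y start as siblings under root.
parent = {"root": None, "x": "root", "y": "root"}

move(parent, "x", "y")  # user A: move x into y's children
move(parent, "y", "x")  # user B, concurrently: move y into x's children

cycle = has_cycle(parent, "x")  # both moves applied naively: x -> y -> x
```

Each move is valid on its own; only the combination detaches x and y from the root, so a merge has to drop or rewrite one of them.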

~~~
jorangreef
See:
[https://en.wikipedia.org/wiki/Operational_transformation#Cri...](https://en.wikipedia.org/wiki/Operational_transformation#Critique_of_OT)

"Due to the need to consider complicated case coverage, formal proofs are very
complicated and error-prone, even for OT algorithms that only treat two
characterwise primitives (insert and delete)" \- Du Li & Rui Li - "An
Admissibility-Based Operational Transformation Framework for Collaborative
Editing Systems"

There's also an interesting comment there from the author of ShareJS, I think
that might be you?

The other critical difference between CRDTs and OT is that CRDTs work offline,
in a distributed setting, whereas OT cannot. OT requires a central online
server to coordinate, which as far as I understand is the cause of the classic
UI freeze in Google Docs whenever the network goes.

~~~
josephg
Yes that's me! For what it's worth, I no longer believe that Wave would take 2
years to implement now - mostly because of advances in web frameworks and web
browsers. When Wave was written we didn't have websockets, IE9 was quite a new
browser. We've come a really long way in the last few years.

Working offline and working in a distributed setting are two different
properties.

OT can work very well offline - the client and server just buffer ops and you
do reconciliation. Some OT algorithms experience O(n*m) performance when the
client comes back online (n client ops, m server ops), though the constant is
quite low and you can make the client do the work. But you can do better - a
better designed text reconciliation system can do O(n log(n) + m log(m)) if I
remember correctly - which is quite usable in real systems and very
competitive with CRDTs.
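
A rough Python sketch of the O(n*m) case, assuming insert-only text ops of the form `(position, text)` and a rule that the server wins ties at the same position. This is a simplified illustration, not the ShareJS transform code:

```python
def apply(doc, op):
    pos, text = op
    return doc[:pos] + text + doc[pos:]


def transform(op, other, other_wins_ties):
    """Shift insert `op` so it can be applied after the concurrent insert `other`."""
    pos, text = op
    other_pos, other_text = other
    if other_pos < pos or (other_pos == pos and other_wins_ties):
        return (pos + len(other_text), text)
    return op


def rebase(client_ops, server_ops):
    """Rebase n buffered client ops over m server ops: n * m transform steps."""
    client_ops = list(client_ops)
    steps = 0
    for s in server_ops:
        rebased = []
        for c in client_ops:
            rebased.append(transform(c, s, other_wins_ties=True))  # server wins ties
            s = transform(s, c, other_wins_ties=False)  # carry s past c
            steps += 1
        client_ops = rebased
    return client_ops, steps


base = "abc"
client_ops = [(0, "X"), (4, "Y")]  # offline: "abc" -> "Xabc" -> "XabcY"
server_ops = [(1, "1"), (4, "2")]  # meanwhile: "abc" -> "a1bc" -> "a1bc2"

server_doc = base
for op in server_ops:
    server_doc = apply(server_doc, op)

rebased, steps = rebase(client_ops, server_ops)
merged = server_doc
for op in rebased:
    merged = apply(merged, op)
```

Every buffered client op is transformed against every server op, which is where the n*m comes from.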

P2P is much harder. I've got a sketch for a system that would use OT in a P2P
setting, but in many ways it's a poor man's CRDT. That said, the downside of
CRDTs is that they grow unbounded as you make more modifications. That works
great for documents that are written then discarded (like a HN thread), but
much less well for long lived, frequently written documents (like a server
status dashboard). You can fix that using fancy coordination algorithms, but
if you're going down that path you're back to sketching out an overcomplicated
OT system.

There's lots of fun space to explore there - but not enough actual useful
systems being made for production use.

------
marknadal
I am the author of a CRDT database that can store JSON.

Here are Kyle Kingsbury's (Aphyr) thoughts on it:
[https://twitter.com/aphyr/status/646302398575587332](https://twitter.com/aphyr/status/646302398575587332)
.

In response to the OT comments here in the thread, people should be aware that
OT is not P2P or decentralized. The algorithm in the paper is, however, and so
is GUN ([https://github.com/amark/gun](https://github.com/amark/gun)) .

Probably more important to the discussion, is that it IS possible to implement
collaborative rich text editing on top of P2P CRDT systems. Although I do not
know of any yet, I am working towards one which uses a linked-list DAG, early
demo:
[https://www.youtube.com/watch?v=rci89p0o2wQ](https://www.youtube.com/watch?v=rci89p0o2wQ)
.

------
cwmma
so from reading the comments, am I correct that with an object like this

{"key":1}

if one user changes it to be

{"key":2}

while the other at the same time changes it to be

{"key":3}

and another user changes it to

{"key":[2, 3]}

it's impossible to tell if there was a conflict between the first 2 users or
the third user just made an update?
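
One way a system could keep the two cases apart, sketched in Python under the assumption of a multi-value register that tags each write with the id of the op that made it. The helpers are hypothetical, and `merge` here only handles concurrent writes, not later overwrites:

```python
def assign(value, op_id):
    # A write leaves the register with a single {op_id: value} entry.
    return {op_id: value}


def merge(reg_a, reg_b):
    # Concurrent writes: keep every entry, keyed by the op that wrote it.
    return {**reg_a, **reg_b}


# Two users concurrently overwrite {"key": 1}.
conflict = merge(assign(2, "userA:1"), assign(3, "userB:1"))

# A third user deliberately writes a list in a single operation.
explicit = assign([2, 3], "userC:1")
```

Internally the conflict holds two entries while the explicit list holds one, so they only look alike if the application flattens the conflict into `[2, 3]` when it reads.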

------
ianstormtaylor
Related: does anyone know any other good resources for architecting and
building OT/CRDT for nested data structures like HTML or JSON? Ideally
projects with working examples even?

~~~
nateps
Author of ShareDB
([https://github.com/share/sharedb](https://github.com/share/sharedb)) here.
ShareDB is a stable production JSON OT system that powers the entire backend
for Lever ([https://www.lever.co/](https://www.lever.co/)). All of Lever's
apps, backend services, migration scripts, etc. go through ShareDB. It scales
horizontally; over 1000 companies use Lever, including companies with 1000s of
employees, like Netflix and Yelp.

There are some simple examples in the repo to get you started. Please let us
know if you have any questions getting going!

------
jackweirdy
Can't read the paper - is this an implementation of or just related to CRDTs?

~~~
zbyte64
"The JSON datatype described in this paper is a kind of CRDT"

