Hacker News new | past | comments | ask | show | jobs | submit login
Why CRDT didn't work out as well for collaborative editing xi-editor (github.com/xi-editor)
457 points by UkiahSmith on May 11, 2019 | hide | past | favorite | 83 comments

Relevant discussion from a couple of days ago: https://news.ycombinator.com/item?id=19845776

I don't have much experience in this area, but I'd be interested in an overview of how different pieces of sofware handle the concurrent / multiplayer editing problem, like:

- Etherpad

- Google docs

- Apache / Google Wave (open sourced: http://incubator.apache.org/projects/wave.html)

- repl.it https://repl.it/site/blog/multi

- figma https://www.figma.com/blog/multiplayer-editing-in-figma/ (image editing rather than text editing)

So is the gist of it that OT relies on central servers and they all use OT rather than CRDT? That was not entirely clear to me.

Looking at the xi analysis, it sounds like "IME" is specific to a desktop application using X (?), so it doesn't apply to any of this web software.

The rest of them seem like they do apply?

Is the problem that you have to pay a "CRDT tax" for every piece of state in the application? I thought the same was true of OT. Don't you have to express every piece of state within those constraints too?

repl.it doesn't seem to have a problem with syntax highlighting or brace matching (try it, it's pretty slick). So is it just that they paid that tax with a lot of code or is there something else fundamentally different about xi vs. repl.it ? Or maybe xi is going for a lot lower latency than web-based editors?

recent thread about OT vs. CRDT that might be interesting: https://news.ycombinator.com/item?id=18191867

TL;DR CRDT is completely irrelevant to any of the highlighting/etc stuff

Most highlighters are lexers. Advanced highlighters/folders are parsers.

The lexing/parsing that is required for highlighting is easy to make incremental for all sane programming languages.

for LL(star) grammars, adding incrementality is completely trivial (i sent patches to ANTLR4 to do this)

for LR(k) grammars, it's more annoying but possible (tree-sitter does this)

For lexing, doing it optimally is annoying, doing it near optimally is very easy.

optimal incremental lexing requires tracking, on a per token basis, how far ahead in the character stream the recognizer looked (easy), and computing the affected sets (annoying)

Most real programming languages have a lookahead of 1 or 2. Near-optimally requires tracking only the max lookahead used, and assuming all tokens need that max-lookahead. In a world where min token length is 1, that means you only need to re-lex an additional (max lookahead) tokens before the changed range. In a world where min token length is 0, it's all zero length tokens + (max lookahead) tokens before the changed range. This does not require any explicit per-token computation.

Again for basically all programming languages, this ends up re-lexing 1 or 2 more tokens total than strictly necessary.

Tree-sitter does context-aware on-demand lexing. i have patches on the way for ANTLR to do the same.

The only thing CRDT helps with in this equation at all is knowing what changed and producing the sequence of tree-edits for the lexer/parser.

The lexer only cares about knowing what character ranges changed, which does not require CRDT. The typical model for this kind of edit is what vscode's document text changes provide (IE For a given text edit, old start , old end, new start , new end)

The parser only cares about what token ranges changed, which does not require CRDT.

Oh wow, cool!

I made a simple proof-of-concept realtime PEG parser a couple years ago, which ingests text OT/CRDT operations ("insert at position X", etc) and invalidates & recalculates the compiler output by invalidating all overlapping ranges and recalculating from the root. My implementation is way slower than I expected it to be - though I'm sure you could use a lot of tricks from well optimized parsers to speed it up. I agree - CRDT / OT / whatever is sort of orthogonal, although its really nice being able to feed the parser with the same operation format, and have it update its output.

https://home.seph.codes/public/miniohm/ / https://github.com/josephg/miniohm if you're curious.

I'd love to see this sort of thing applied at a larger scale in a compiler. For example, I could imagine editing a C program with a compiler running live. Instead of batch compiling artifacts to disk like its 1970, instead as I type each character the compiler recompiles just the parts of my program that could be affected by my change. The compiler passes code changes to the linker (maybe with a function level granularity). And the linker could then (live) manage writes to an executable, allocating new byte ranges for changed functions and updating references on the fly.

Even on very large C++ project like Chrome, there's no reason why incremental updates should take more than 1-2ms. If all you're doing is changing a single function in the binary file, why do our linkers rewrite the whole thing?

> Instead of batch compiling artifacts to disk like its 1970 [...]

I am delighted to see people discussing this. Batch orientation made sense in a very resource-constrained era. But at this point we have more RAM and CPU than we know what to do with. It seems so obvious to me that the correct solution is to prioritize developer experience and keep everything hot.

My single biggest barrier to developing faster is the latency between making a change and seeing what that does to the test results. I would like that to be well under a second. Given the cost of programmer time, I could happily throw money at RAM and CPU. But standard, batch-oriented tools won't take advantage of it 99% of the time.

I think there's a revolution in developer experience out there, and I really want to see it happen.

The perfect place to start would be support in LLVM backends, maybe they are already working on this.

I don’t know that that’s true. I doubt you could do this sort of redesign simply in an LLVM backend because a “hot” compiler would work differently at so many levels of abstraction. I think an easier place to start would be a self contained compiler with a little less scope than llvm. There we could figure out the patterns, find a clean set of internal APIs and make some sweet demos. And then with that approach the same task in LLVM.

I have done exactly this in the Markdown editor I developed: Stylo: https://www.textually.net if you're curious.

I used what I call "partial compilation" for each edit made in the Markdown editor that "re-compiles" only affected parts of the text. A function is responsible for computing the ranges to be recompiled based on the edits made in the text. In Stylo it was absolutely essential since CSS is used for syntax highlighting and a complete DOM is kept in memory for the complete Markdown text. The necessity to keep the complete DOM in memory could have been avoided without "following sibling" selectors handling but I really wanted "complete" CSS support. Since Markdown can be used for texts ranging from 1 word to 2000 pages, complete recompilation was unthinkable, so partial one with some exceptions was implemented.

The main difficulty with this kind of approach is that it's possible that the compilation, for whatever range determined, does not stop. Meaning that the effect of the edit goes far beyond the determined range, which function can not know this without compiling... It can happen for example if you edit in C and type "{" then all the scopes below are affected by this opened (and not closed yet) scope. In Markdown it can happen when you open a fence code block (using the ``` opening fence). So you need to have a proper way to handle these edge cases. The other problem is to handle the state associated with these regions and handle the non-stopping compilation case talked before.

All this, as I have discovered, make this kind of compilation and it's associated syntax highlighting really difficult to implement properly and fast enough.

In order to be fast, this process needed to be as asynchronous as possible, because there was some tiny but noticeable lags on the main thread. Even partial compilation was more around 8ms to 12ms on my computer (an old macbook pro 2012). But there is also some cases where asynchronous is not suitable, in which cases synchronous processing is used, but, only when absolutely necessary. So, a mechanic which allows to switch between asynchronous and synchronous processing was needed.

All this to say, it's possible but really difficult to do properly. Every bit of the CPU power have to be extracted to make it comfortable to use. I can just imagine with the complexity of C++ what a nightmare it would be to manage, but it should be possible with a proper compiler architecture that modularise partial compilation and encapsulates the inherent complexity of this approach.

Thanks for writing this - its really interesting to read and I'm glad other people are trying this approach.

Is your markdown editor opensource? I'd love to play with it.

"Even on very large C++ project like Chrome, there's no reason why incremental updates should take more than 1-2ms. If all you're doing is changing a single function in the binary file, why do our linkers rewrite the whole thing"

Because it requires N^3+ dataflow problems to track what is possibly changed with significant accuracy.

Does it?

Let’s say I edit a function, and my editor (attached at the hip to my compiler) knows which bytes have been changed. So it knows my change was isolated to a function in a .cpp file. The compiler has cached the compilation environment from the last build, and with that in mind it could recompile just this one function. Then we need to binary patch the function - which essentially requires allocating disk space inside the binary executable. If the previous compiler pass left some free space, we could allocate out of that and mark the old function memory as reusable. Then we need to update CALL pointers - either store them in a list or have every function do indirect calls through a table, and then you just need to update the table entry. If you want to get fancy you could do better than function level granularity, but I think that alone would give us sub-ms builds.

I wouldn’t want to change the process for release builds - but for incremental debug builds I don’t see where you’re getting an N^3 from.

Yes, it does.

"The compiler has cached the compilation environment from the last build, and with that in mind it could recompile just this one function."

First, this is actually the most trivial possible case, and even there, as you'll see, this is not correct.

Outside of a function it does not handle changing types, changing access permissions in classes, changing between virtual and non, etc. It does not handle changing globals, things with attributes of various sorts, etc.

All of those require understanding the uses, or overestimating them.

As for whether function-local changes can have non-local effects, the answer is flatly "yes". You don't even have to be that creative, since you can do things like force template function output and other things that affect global state from a function in C++.

The assumption that because i changed text in one function, it means the possible changes are limited to that function, is not correct in C++ :)

There are languages in which it is, mind you, just C++ is 100% not one of them.

Honestly, I would strongly suggest you start to read papers on this, this is an incredibly well studied area, and the costs are well known.

If it was easy and cheap, we'd still be doing it ;)

It used to be a pretty standard feature of editing environments for things like LISP, etc.

Do you have any recommendations for places to start reading? I didn't realise it was a well studied area - naively it seems like it would be possible in many languages. It would certainly be trivial for 90% of the transpiled / polyfilled stuff that we do in modern typescript / javascript.

I completely agree that doing this to live-update an optimized build would extremely expensive and difficult. And I'll happily acknowledge that (for example) adding or moving fields in a struct would be hard to implement.

But adding a compiled specialization of a template function / class should be "easy" to handle. Its non-local, but I can imagine a dependancy tree of functions needed for the compilation. On update, we can figure out a diff of that tree, free all the now-unused function bodies in the binary and then add (compiling if necessary) all the added / updated functions.

C++ might be one of the worst languages to implement this for - .NET, JVM, typescript or Go might all be much easier. But yeah, if this has been extensively looked in to I'd love to learn why we don't do it for modern development. It seems like an obvious win, and if we don't do it I hope its for good reasons.

But, this is a hard thing to do, the compiler code will have to be organized in a different way, probably the language should be designed specifically towards this end. And the compiler code will be more brittle for sure and will probably be slower for batch compiles.

Most languages do already support ok-fine grained recompiles (at the module level) and it's just many linkers that are slow (I keep hearing from people that this need not be). And actually some of them already do something that is called "incremental linking".

Even C compilers regularly compile 100K+ LOC/second, so I think there's not much to be gained here. Yes, complicated C++ can be slow to compile. Maybe we shouldn't write compliated C++ code to begin with? In the end, batch compile performance also matters.

Very interesting and informative. But I suppose what the user really wants is the next step: not just incremental parsing, but also incremental compilation (or at least the error-checking part of the compilation process).

The basic error checking comes with incremental parsing. A lot of semantic error checking is also not that bad to make incremental (though some languages are just really really bad here).

Incremental compilation depends on the constraints on environment (IE are you trying to incrementally produce normal object files or what)

The more you relax the environmental constraints the easier it gets.

It also depends on just how non-local an effect the programming language allows. For most programming language, general incremental compilation requires either strongly overestimating possibly-modified sets, or global interprocedural dataflow + pointer analysis.

Most choose the former, or don't do general incremental, but instead something like "edits to functions are incremental, edits to global data/types/etc are not"

(This is even without compiler optimization on).

The Skip language (which I mentioned in another comment) is a language based entirely on incremental computation. Every piece of code you write in it becomes incremental, as far as I know.

AND the compiler is self-hosted!!! So that means they have a fully incremental compiler.

Although there is an LLVM back end, so maybe only the front end of the compiler is incremental? That's still important. It comes from the same people as the Hack language, which I believe has a fast "online" typechecker -- which I think is incremental (?)

I believe you could write a C compiler in Skip the same way?

If you peek at the source code, it has a 6000 line recursive descent parser, and 6000 line type checker, much in the style of OCaml (but in Skip).

I think this deserves a blog post because it's not clear to me that the message of Skip got through (7 months ago on HN). Their website has some examples, but not too many.

Of course, I haven't tried it yet, and it's a super ambitious project, so there could be some big problem with it.

But if the goal is to make an entire suite of compiler and devtools that are incremental, then Skip seems like it's ahead of the state of the art and something to borrow from.

Salsa was also mentioned, and I believe it's based on some of the same academic work like Adapton. But I don't know if they have an entire self-hosted compiler as a demo like Skip does. It seems to be "query-oriented" rather than supporting arbitrary code, which is what I would expect for most incremental computing frameworks. Skip seems to be more ambitious.


I believe that the Skip language does this, because it's compiler is self-hosted. It's http://skiplang.com/blog/2017/01/04/how-memoization-works.ht...


Skip – A programming language to skip the things you have already computed https://news.ycombinator.com/item?id=18077612


This is a great comment!

Yeah I think I get what you're saying -- basically that if you have incremental algorithms then you don't need to take a dependency on CRDT or OT to make the UI updates fast?

In other words, you can do things synchronously BUT still quickly, since you have an incremental algorithm. In any case, I basically asked that same question here:


I've seen Treesitter and am impressed by it, which I wrote a little about in that comment.


Since you speak of "sane" languages, I wrote a parser for an "insane" language by hand, with a principled regex-based lexer:

How to Parse Shell Like a Programming Language http://www.oilshell.org/blog/2019/02/07.html

In fact the first thing I tried was to use ANTLR to translate the POSIX shell grammar (which only covers about 1/4 of the shell language). This drove home the limitations of grammar-based approaches. I found its model of lexing to be very limited (along with insisting on materializing a parser tree rather than providing semantic actions in ANTLR v4).

So it's good to hear that it's being updated for "context-aware lexing", and incremental lexing. Are you using ANTLR for IDE-like tools? I wasn't aware that it was good or commonly used for that kind of use case.


I briefly talked with the author of Treesitter on Github about his bash syntax highlighter with Treesitter, which is of course a parser and not just a lexer, and can be made incremental.

I wasn't sure if the context-aware lexing mechanisms here could really handle the shell language:


I think his response was basically that it doesn't have to be 100% correct, which I agree with for an editor. Trading off some corner cases for speed is probably the right thing.

But I have thought about what it would take make a parser for a "crazy" language incremental and 100% correct (which you would want for further semantic analysis).

I wonder if you can write recursive descent parser in the Skip language and then get incrementality for free? The language is designed so that every algorithm can be made incremental, and it's not limited to pure functional constructs like other systems. Supposedly it goes to some lengths to provide you with a more imperative / mutable state modelk, so it should be able to express an "arbitrary" parser.


In other words, instead of having different algorithms for making LL(star) and GLR parsing incremental, Skip might give you a more expressive model for the parser. A bunch of my blog posts are about the limitations of grammar-based approaches.


FWIW the style of context-aware lexing I use is explained here:

When Are Lexer Modes Useful? http://www.oilshell.org/blog/2017/12/17.html

And the other post lists the 14 different modes or contexts that the shell front end has:


It's a very simple and powerful mechanism, but my parser only cares about batch mode for now. It's not clear to me how it differs in expressiveness from what Treesitter offers (or ANTLR). The current lexer mode basically depends on the control flow of the parser.

Just to answer the IME question, it refers to "Input Method Editor" and is an important problem to solve for all platforms, desktop, mobile, and Web. The API's for IME are often crufty and it's easy to get edge cases wrong. These days, lots of people care because emoji, but formerly it was something that English language speakers tended to ignore.

An anecdote that goes against the conventional wisdom about IMEs vs. ”English” and emoji: Right now in Firefox and Chrome, the events that get fired are more correct if you enter emoji using the Windows 10 Pinyin IME’s emoji palette than if you enter emoji using the Windows 10 on-screen English keyboard’s emoji palette.

Edited to elaborate: Keyboard APIs aren't well exercised for astral characters, but IME APIs deal with strings anyway.

Figma has a blog post about it, they use Fractional Indexing.


Stern Brocot is a better way to do fractional indexing (pathological cases are not aligned with common uses cases like putting something to the front).


OT doesn’t have to depend on central servers, but it is much simpler and less resource-intensive to do it that way. This is what Etherpad, Google Wave and Google Docs do.

As you say, both OT and CRDT come with a “tax” in that you must structure your application’ edits in a way that they can interpret. However, this is easier with OT for the text editing case, as OT uses position based addressing, whereas CRDT is identifier based (I’m assuming this is where the canonical IME issue mentioned comes from - somebody please correct me if I’m wrong).

The interesting thing about OT is that almost every non-trivial editor uses that as its model, because it makes undo/redo very simple.

xi-editor pioneers a hybrid where "identifiers" are actually the tuple of revision and position. The motivation is to buy back some of the performance edge that OT has because of its use of position. I think if you are going to do CRDT it's pretty compelling, but that's also somewhat orthogonal to the complexity concerns raised in my original post.

Interesting - that’s definitely compelling. In a nutshell what exactly is the issue with IME?

Robert Lord's comments on the issue are probably the best explanation, written with the benefit of hindsight.

[0]: https://github.com/xi-editor/xi-editor/issues/1187#issuecomm...

I'm pretty certain that these are OT: Etherpad, Docs, Wave. I've done quite a bit of investigation into these too and I'm convinced that OT is the mathematically inferior, but practically superior model (it is far easier to reason about, which is important because not everyone on your team might read compsci books).

You can abstract OT/CRDT behind a step-locked view model, which should be fine for in-process. IME can be solved in this view model, too, by halting updates to the CRDT/OT between IME entries.

The problem is that, with xi, this would necessitate implementing that view model in each front-end, so there's certainly the "CRDT tax" there. xi is a bit of an oddball.

Another OT/CRDT thread from just a few days ago: https://news.ycombinator.com/item?id=19845776

I don’t think Google docs is distributed at all. I think all edits happen in a single process on a single machine. That’s why the limit of concurrent editors is so low.

It’s distributed but it’s not decentralized. Concurrent edits are possible. This is how Google Docs is able to work offline.

It's distributed because there's communication with the clients going on, which requires some kind of coordination to achieve agreement on the result of two simultaneous edits.

I think the assertion is that the edits are not distributed. Instead, an intent to edit is sent to a central server which forces them to be serialized or rejects them. It is not merging distributed edits.

I'm not sure if this is a) the assertion or b) correct/relevant.

This is incorrect. Edits are made locally and sent to the server. The server merges the edit with any edits it received in the meantime from other clients. So the server serializes the rebased edits.

I'm not sure this is disagreeing with me. If a single central server/service is serializing and rebasing the edits, that is what I took the assertion to be.

Decentralized world be the clients communicating their edits to each other.

It can be confusing, as github is a centralized service for git, a decentralized protocol. So, much of the terminologies get blurred.

> So is the gist of it that OT relies on central servers and they all use OT rather than CRDT?

You are right. I'd take shot at blowing out the why part. Practically, real-time collaboration apps can be categorized into two:

#1 The document editors, which stores information on the cloud. I'll refer these as document editors henceforth.

#2 The ones that doesn't need a thick server, except for broadcasting edits. Like plain text editors and code editors. I'll refer these as code editors for simplicity sake.

Both have a slightly different set of problems.

Document editors, usually have a slightly richer document representation schema (since most of them are rich text documents anyway). But they have the luxury of a central server. A central server means, now they have the power for good versioning, rollbacks to a consistent state, picking winners and ensuring ordering of edits when syncing large edits.

Code editors, although their doc representation is usually arrays of characters (it could be strings, ropes, or whatever but essentially just a set of characters), these don't have the luxury of a central server that could decide a winner when concurrent edits take place. This means the clients have to be very intelligent and converge to the same representation no matter how out-of-order the edits arrive.

Now coming back to the two algorithms camps - OT & CRDT:

The general outlook is that, OT - while being simple, they have this classic TP2 problem that would make it harder to arrive at a consistent state eventually - without a help from a central server. The alternative is to have a complicated system, almost as complicated as CRDTs. (More about this can be read through the Internet. If anybody's interested I'll post links to stuff I've read through)

CRDT - CRDT have (or maybe had?) a strong reputation of being efficient at arriving at a consistent state with out of order edits. They can do it, because basically no characters in a CRDT document is deleted anyway (they are just marked deleted and are termed tombstones). So no information loss, which means a clever algorithm can write a function to easily arrive at a consistent state at each site. This means, a central server is now only optional.

If I didn't lose you until now, you'd have intuitively guessed why document editors, tended to always pick OT and code editors prefer CRDT over OT.

This is complete oversimplifications and certainly there's a ton of things I left out for simplicity sake.

Source: I'm in the business of building a powerful Google Docs alternative, a word processor that runs on the browser: https://writer.zoho.com (Zoho Writer)

> The general outlook is that, OT - while being simple, they have this classic TP2 problem that would make it harder to arrive at a consistent state eventually - without a help from a central server. The alternative is to have a complicated system, almost as complicated as CRDTs.

Well this characterization isn't quite right. With regard to TP2, this is a solved problem, with or without using a server. Most of the conversations around TP2 seems to miss the crucial point that TP2 is not needed for convergence. CRDT academic literature has largely been responsible for propagating the notion that TP2 is a must for every OT system - without TP2, all is lost -- this is a myth. I wrote about it extensively here [1].

This article [2] gives an analysis of 7 coeditors based on OT, including the likes of Google Docs, showing that they avoid having to design transformation functions that require TP2. Some of the systems are fully de-centralized [3], others require a server.

It is also worth making a finer distinction between systems that do need a central server -- there are "transformation servers" and "messaging severs". Google Docs, for example, demands that its servers perform some portion of the transformation logic. Many other solutions rely on using the server only for message transmission/serialization, thus the server side logic is significantly simpler.

[1] https://arxiv.org/pdf/1905.01302.pdf

[2] Achieving convergence in operational transformation: conditions, mechanisms, and systems.ACM CSCW (2014)

[3] A generic operation transformation scheme for consistency maintenance in real-time cooperative editing systems ACM GROUP (1997)

> Indeed, the literature of CRDT does specify a mathematically correct answer. But this does not always line up with what humans would find the most faithful rendering of intent.

This is a very salient point that anyone thinking of using CRDTs to "solve" synchronization in an user-facing application needs to take into consideration. Yes, CRDTs will guarantee that clients converge to an identical, mathematically "consistent" state eventually, but there's no guarantee whether or not that mathematically "consistent" state would make any sense to the application business logic that needs to consume that state, or to the human that needs to reason about the rendered result. That is a completely different can of worms that we'll still have to tackle to build a usable application.

Here's great example to illustrate this from Martin Kleppmann's talk on this topic: https://youtu.be/yCcWpzY8dIA?t=2634

Rest of the talk is also highly recommended for anyone interested in an approachable primer on CRDTs.

The trade-offs to CRDTs mentioned by the author in the context of text-editors make sense, but I would be curious to hear from the Xray team on what their current thinking on the topic is, given that they have collaborative editing as an explicit core objective (which might shift the value prop in favor of using CRDTs relatively speaking since in Xi it seems to be only an aspirational goal), and that their approach to implementation was similar but not quite identical to Xi's:

> Our use of a CRDT is similar to the Xi editor, but the approach we're exploring is somewhat different. Our current understanding is that in Xi, the buffer is stored in a rope data structure, then a secondary layer is used to incorporate edits. In Xray, the fundamental storage structure of all text is itself a CRDT. It's similar to Xi's rope in that it uses a copy-on-write B-tree to index all inserted fragments, but it does not require any secondary system for incorporating edits.


> but I would be curious to hear from the Xray team on what their current thinking on the topic is

Xray is dead:


This really is the most important thing to get from all of this. In an editor the users can see any bad outcomes and correct them. In, say, a database system, bad outcomes can be much more problematic.

@dang, can we edit the title? It's inaccurate. This post is about how the CRDT didn't work for asynchronous editing by automated tools like syntax highlighting, automatic bracket balancing, etc. The post explicitly contrasts those use cases with collaborative editing as a use case that the author didn't implement, but where they think "the CRDT is not unreasonable".

Maybe first build a capable editor, with plugins, etc (xi-editor is not that yet) and worry about "collaborative editing" later?

And even for that, I think simply "taking turns" (where users share an editor session, can chat with each other, and can switch on sequentially who gets to actively edit) is enough for 99% of cases, and is not more difficult than mere single-person editing (since there are no conflicts).

Collaborative editing is very difficult to implement in a pre-existing codebase. Decisions made during the initial design will be prohibitive to just bolting on collaborative editing in the future.

To me, xi seems like an aspirational project, so this seems like the perfect place to take some time and design for these features up front.

I think (from reading the documentation) an implicit aim of Fuchsia is to treat a device no longer as a holder of documents (a storage medium for photos / messages / candy crush scores) but as a view into a universal storage space.


Collaborative editing is not something you can just bolt on after the fact if you want it to actually work well. Things like building a robust server that can be exposed to the internet can certainly wait, but how you are you supposed to develop a plugin ecosystem when you haven't even settled on a conceptual model for how to store and manipulate text yet?

>Collaborative editing is not something you can just bolt on after the fact if you want it to actually work well.

Yes, but my comment also alludes to the opinion that "well working collaborative editing" is not a real problem people have, and a much easier (and easier to bolt on) use case, of consecutive (serialized with "locks") collaborative editing should cover most people's needs...

Start by redoing everything that the mature alternatives do is an advice for creating neither successful not useful things.

By all means, focus on creating a kick-ass collaborative editor, and add just the editing capabilities needed to make it good at collaborative editing.

>Start by redoing everything that the mature alternatives do is an advice for creating neither successful not useful things.

It's the best advice in order to see any update.

There are plenty of programs that do some unique things very well, but fail on doing "everything that the mature alternatives do", so they fail to ever get mainstream traction themselves.

People want a complete solution that ALSO does X unique thing, if they are to drop their existing editors. Not something that they'll have to use alongside them for that special case.

(Joel on Software has written some nice posts about this idea, and why "minimal" competitors, who don't do "everything that the mature alternatives do" frequently fail, though I can't find the link right now)

Without at least doing at least one thing substantially better what is your chance of getting used by anyone.

At the risk of asking a stupid question: is there a reason other than offline support why we bother with conflict resolution algorithms?

Every time concurrent editors come up, one of the main points of discussion is the pros and cons of different possible conflict resolution algorithms. People seem to be spending a lot of time on debating and implementing that. The way I see it, whichever packet reaches the server first gets applied first. Send something like "line 9 column 19: insert <Enter>", and when another client whose cursor is on line 15 receives that, it moves the cursor down to line 16 and scrolls the view down one line. Because you can see each other's cursors and selections, it shouldn't be hard to avoid typing in the same place. Unless you have round trip times of multiple seconds (satellite uplinks maybe?), and unless you edit continuously with more than, say, one people per ten sentences, you should hardly ever need it, and if it happens, the person editing will notice within two seconds and just wait a second for the other to finish. It's not as if you can reliably apply edits anyway: as the article already describes, changing a line from ABC to EFG concurrently with someone modifying B to D, does not really have a good outcome. In a more realistic example, it would be changing "its" to "it's" concurrently with changing the word to "that". There is no good solution (the server wouldn't know which person to ignore: the apostrophe inserter or the replacer), so someone will have to resolve it manually anyway, so why bother with complex resolution algorithms? Heck, I'd be fine if my editor would do exclusive locks for the line I'm on before I can start typing.

For slow things like the customer report, internal documents, code, etc., I use something like git. Collaborative editing is (to me) for realtime things like jotting down notes about what I'm working on and looking at what others are working on right now, where even a proper revision control system is too cumbersome (git pull, vim notes.txt, small edit, :wq, git commit, git push, repeat) because someone might be working on the same thing. In such a case, where I need to work together on a file in real time, I'm not working offline, so this conflict resolution is by definition unnecessary. Is that different from the majority of people that use collaborative editing?

Well (strokes grey beard), before we talk about offline support, we should consider that there are two kinds of "online editing."

The first kind of "online editing" is where you make a request to a server, and nothing happens until the server acknowledges it and sends you an approval. That's synchronous.

The second type of "online editing" is where you have an independent process in your browser or client, and it communicates asynchronously with a server, while simultaneously allowing you to edit a local model of the data.

In the first case, we really need an editor to send every keypress to the server and wait for a response.

In the second case, we really are talking about the browser or local app or whatever being offline, it just synchronizes very frequently with the server.

I think that the moment you want to support typing faster than the round trip to the server, you are already talking about offline editing.

And in most cases, yes you do want to support typing faster than teh round trip to the server, because what works in an urban area with high-speed internet becomes comically unusable in low-bandwidth and low-latency environments.


So... I suggest that we almost always want to design for "offline editing," it's just that we don't always imagine someone writing an essay on a weekend camping trip, then synchronizing when they get back to the office on Monday.

Mosh has this idea that you can keep typing and sending asynchronously even though you need the ACKs to find out what really happened, then when you get them you just redraw accordingly. Humans won't type too fast for too long, so eventually there will be time to catch up and let the user see what actually happened. The key is to distinguish client-side speculative outcomes from actual outcomes. Imagine that the text you're typing is made reverse-video, or a different color (subject to color blindness constraints) to indicate speculative (as-yet-unacknowledged) text.

That is, I think humans can be part of the async system and understand that what they're typing isn't committed yet.

Sometimes when I type I don't even look at the screen or the keyboard for a bit -- entire sentences even. Less so with code, naturally. When I do this I do have to eventually look at what I actually entered, because I might not have noticed some typo, say. I just did that for this entire paragraph. I want to believe that I'd handle speculation in the UI just fine.

You are describing classic branch-and-merge. “Here’s my branch that I created offline.”

“Sorry, there are merge conflicts, please resolve them before resubmitting,” or, “I resolved them without consulting you.”

Yes, but I'm saying that that can work very well for a visual, interactive application.

> we almost always want to design for "offline editing,"

Fair enough, a one-size-fits-all solution is ideal. Offline support would be great, especially when traveling in a country like Germany. But I would already be happy if we had a good collaborative editor without offline support before we have a good collaborative editor and offline support. Right now, all the options I know of are all either web-based and licensed (Confluence/CKEditor5), just web based and really basic (etherpad), really basic and occasionally unstable (gobby+infininote), or some tech demo alpha version.

> it shouldn't be hard to avoid typing in the same place [...] if it happens, the person editing will notice within two seconds and just wait a second for the other to finish

But if it does happen, it needs to result in a consistent state in both people's editors, right?

> In a more realistic example, it would be changing "its" to "it's" concurrently with changing the word to "that". There is no good solution (the server wouldn't know which person to ignore: the apostrophe inserter or the replacer)

Discarding the apostrophe seems like a good solution to me, what's wrong with it?

As long as, if the replacer hits Undo, "it's" is what is restored. (Not sure if any existing algorithms do so, but I believe it's possible and am intending for the collaborative editor I'm designing to do so. Haven't implemented it yet though, so who knows.)

> it needs to result in a consistent state in both people's editors

Ah, that makes sense :). Stupid question indeed.

I figured the server is just the authority and whatever order it arrives there is the truth. But we only send deltas (we must, because how else will the client still know where to put the cursor after applying a new version) and so we can't use that. Makes sense.

It's not a stupid question if you learned something from it.

I never understood this either.

Why not simply lock the file automatically when someone types and unlock it after a few seconds pause.

What do I get from parallel edits? When I finish my stuff and my collaborator isn't finished, we will end up with broken code until they have finished editing.

Sure it is nice to be able to write comments at some place while someone else edits code at another place, but then row-based locking would do the trick too.

Sure, you can do this but you’re basically sacrificing collaborative editing and you’re going to end up with a very laggy system. You’re also going to have to implement distributed locking which isn’t as easy as it looks.

Also if the connection drops then you’re going to loose any pending edits because you’ve lost your lock.

> When I finish my stuff and my collaborator isn't finished, we will end up with broken code until they have finished editing.

You can’t actually do this under the model you’ve suggested because your collaborator will be locked out of editing the document while you edit it. So if they’re editing then you just have to sit there staring at your screen.

In our company, we work with a small team on each pentest, and we do have concurrent edits in the same file quite regularly. One example is going through the network scan results at the beginning of a test: one person works from the top, the other from the bottom, and frequently, we will be editing the file at the same time because we are just getting an overview of the infrastructure and picking interesting targets. Locking the whole file would actually be cumbersome for us.

But I agree that locking sounds like a fairly low-tech solution to reliably solve the issue, though I would do it on row level (or perhaps the current row plus one above and below). The way I see this work is that your client will automatically fetch a lock on the row where your cursor is placed, and you can type at will. If you are inactive, your client could automatically release the lock. Most of the time you're not working on the exact same line as someone else anyway (two persons one sentence, feels a little like trying to type with two persons on one keyboard).

Edit: In a sibling thread, u/espadrine raised a good point:

> what if concurrently someone inserts Enter on line 3? Executing "line 9 […] insert <Enter>" must not be done on line 9.

I guess my idea of line locking may not work after all. Or at least, we would have to change "line number" to some identifier that can not be modified by an edit.

> why we bother with conflict resolution algorithms? […] The way I see it, […] Send something like "line 9 column 19: insert <Enter>", and when another client whose cursor is on line 15 receives that, it moves the cursor down to line 16 and scrolls the view down one line.

What you describe is (an aspect of) the algorithm you want to do away with!

To make it more obvious: what if concurrently someone inserts Enter on line 3? Executing "line 9 […] insert <Enter>" must not be done on line 9.

Worse, divergence is often invisible! Proof of convergence is hard!

More generally, the problem of client-side sync is a hard one, whatever the product: the client has a partial view of the data, and its UI must not show contradictory information. You want your online video game to extrapolate positions to avoid lag. You want your bank account page to have the balance match that after the list of transactions listed below. As we move away from webpage-based navigation (aka "reload everything on every operation") to PWAs, authors get more complex requirements.

I think CRDTs would make much more sense in a projectional editor than a text one. When the changes are mutations to the abstract syntax tree its more well defined how a merge would end. Also, the merge results don't have the opportunity to be invalid syntax.

It is really crazy that we go through these massive hoops to simulate what would be trivial to do in an AST editor. I recently read through some of the literature on projectional editors, and while they have historically had some usability issus I really hope that this will change in the future.

"For syntax highlighting, any form of OT or CRDT is overkill; the highlighting is a stateless function of the document, so if there's a conflict, you can just toss the highlighting state and start again."

I first became interested in CRDTs in a case where this wasn't really true. I was writing an IDE for a custom in-house DSL - think of the application as a special language for interacting with a gigantic and very strange database. Basically, the problem was that the use case really stretched the bounds of what is normally done with syntax highlighting. Some requirements:

- It had syntax and semantic highlighting, where the visual feedback associated with a term would depend on the results of queries to the remote database

- It had to be able to handle documents of several megabytes (and many thousands of terms) fairly smoothly, with as little noticeable lag or flicker as possible

- It couldn't swamp the database with unnecessary requests

- The document itself had implicit procedural state (e.g. if you wrote a command that, if evaluated, would alter the state of a term on the database, appearances of that term later in the document needed to be highlighted as if those changes had already been applied)

So I definitely couldn't throw out metadata and start over with every change. I ended up with a kind of algebraic editing model that allowed me to put bounds on what needed to be updated with every edit and calculate a minimal set of state changes to flow forward. It was extraordinarily complicated. I never got around to learning enough about CRDTs to determine if they'd be simpler than the solution I came up with, but they do seem to target some similar issues.

Yeah definitely. I would consider this a form of semantic analysis, of the kind provided by Language Server implementations rather than the kind of syntax highlighting provided by syntect. Also note, the syntax highlighting done in xi-editor is both incremental and async[11]. This actually worked out well, and I would preserve it going forward. What I wrote above actually overstates the case, I think the ideal solution is an extremely simple form of OT so you can reuse as much as possible of the already-computed highlighting state, but you certainly throw away the highlighting result for the region being edited. Preserving "before" is trivial, and preserving "after" should be a simple matter of fixing up line counts.

[11]: https://xi-editor.io/docs/rope_science_11.html

I'm assuming CRDT refers to conflict free replicated data type: https://en.m.wikipedia.org/wiki/Conflict-free_replicated_dat...

OT, operational transformation: https://en.m.wikipedia.org/wiki/Operational_transformation

That's a correct assumption.

Thank you for writing this up Raph. I've been following CRDT usage in Xray/Xi and am curious to see where collaborative editing goes. I appreciate you thinking about it upfront.

Having a bit of difficulty following this, so I'll break down my understanding of CRDTs and see if someone can help me out.

A CRDT can be thought of as an algebraic structure, consisting of data type D, and a join function. So for all a, b, c in D, it's:


  join(a, join(b, c)) == join(join(a, b), c)

  join(a, b) == join(b, a)

  join(a, a) == a
Partially ordered:

  if join(a, b) == b then a <= b
  a <= a == true
  if (a < b) and (b > a) then a == b
  if (a <= b) and (b <= c) then (a <= c)
So given all of that, I am not sure why the example in the article holds. I assume it's a consequence of the partial ordering, but I don't know what the partial ordering is. What's the join operation and what's the data type?

I'm just learning this, but it seems to be that in some CRDTs "a" and "b" represent two different edits to the common state made by two different users and the join operations is how you combine those two edits into a single joint edit that will produce the same results regardless of which order the edits arrive at the "server".

In other CRDTs "a" and "b" each represent the new states after two users have made different edits to the same common source, and the "join" function is how you combine those independent edited docs/states into a single state again.

Ie if you think of fit branches, set of rules on edits to ensure that no matter the order you merge branches back together conflicts can be automatically resolved and will always reach the same end state with the changes "appropriately" incorporated.

I replied with my thoughts to the github issue, but they might be of interest to people reading along here too. I've got some experience on these systems (wave, sharejs, sharedb, etc).

> As a side note, I've heard an interesting theory about why CRDT-type solutions are relatively popular in the cloud. To do OT well, you need to elect a centralized server, which is responsible for all edits to a document. I believe the word for this is "server affinity," and Google implements it very well. They need to, for Jupiter-style OT (Google Docs) to work.

You don't need to do this. (Although I'm not sure if we knew that on the wave team). You can implement an OT system on top of any database that has a transactional write model. The approach is to enter a retry loop where you first try to apply the operation (but in a way that will reject the operation if the expected version numbers don't match). If an error happens, fetch the concurrent edits, transform and retry. Firepad implemented this retry loop from the client, and it worked much better than I expected. Here is a POC of a collaborative editor on top of statecraft - https://home.seph.codes/edit/test . The only OT code on the server is this middleware function:

https://github.com/josephg/statecraft/blob/b6a82f34268238c90... .

In my experience the reason why semi- or fully- centralized systems are popular in products like google docs is that they're easier to implement. Access control in a decentralized system like git is harder. Gossip networks don't perform as well as straight offset-based event logs (kafka and friends). And if you have a canonical incoming stream of edits, its easier to reason about.


> I have a stronger conclusion: any attempt to automate resolving simultaneous editing conflicts that, e.g., git merge could not resolve, will fail in a way that fatally confuses users.

I think you have to act with intent about what you want to happen when two users edit the same text at the same time. There are basically 2 approaches:

1. Resolve to some sort of best-effort outcome. (Eg "DE F G" or "E F GD")

2. Generate an error of some sort (eg via conflict markers) and let the user explicitly resolve the conflict

As much as it pains me to say, for code I think the most correct answer is to use approach (1) when the code is being edited live and (2) when the code is being edited offline / asyncronously. When we can see each other's changes in realtime, humans handle this sort of thing pretty well. We'll back off if someone is actively editing a sentence and we'll see them typing and let them finish their thought. If anything goes wrong we'll just correct it (together) before moving on. The problem happens when we're not online, and we edit the same piece of code independently, "blind" as it were. And in those cases, I think version control systems have the right approach - because the automated merge is often wrong.

(More: https://github.com/xi-editor/xi-editor/issues/1187#issuecomm... )

From these comnents it seems that OT requires a central server while CRDT can have a far more flexible topology. Is this true? And don’t we have robust implementations of CRDT for simple trees?

No, OT can handle decentralization without issues. It’s just usually far more desirable to centralize it.

I think it would be interesting to let the language mode control the rope, or delegate subtrees of the rope to a mode. This way, you could represent things like lexical scope in the tree of the rope, and a language-specific tokenizer could further reduce the complexity of syntax formatting.

Emacs has the concept of "faces", and many Emacs major modes have proper parsers, lexers, and even some static analyzers that they use to apply the faces. If the rope resembled the AST, then many of the issues Raph talks about could be greatly reduced by localizing edits to their area of influence. If you edit inside a token, and somebody else deletes that whole token, then it is pretty clear how to resolve that. You could conceive of natural language modes which produce humanistic hierarchies, or modes with internal formats other than text (which may have a cached text view on them) like spreadsheets or debuggers.


CRDTs cannot be "bolted ontop".


I really don't like this answer, but it is sadly true - even as an expert in the space (my database https://github.com/amark/gun is one of the few CRDT-based systems out there). And there is a simple reason for this:

Distributed systems are composable, they can be used to build higher-level strongly consistent systems on top. (Note: Sacrificing AP along the way, but then you can have a "tunable" system where each record you save you decide what consistency requirement you need, fast or slow.)

However centralized systems are not composable, you can't go "down" the latter of abstraction by adding more stuff.

What you say is certainly true, but also not a fair characterization of what I wrote; we deliberately designed xi-editor from the early days to be consistent with the CRDT model (see the "rope science" series for thinking from the very early days).

But yes, if you have an existing editor or application and you try to just add CRDT, there are a lot of things that can go wrong.

That is what I thought, though when I read your sections titled "Actual collaborative editing is a lot harder" and "CRDT is a tradeoff" it seemed to suggest otherwise particularly with this comment:

"The flip side of the tradeoff is that you have to express your application logic in CRDT-compatible form"

Previously I assumed this was speaking of xi, but it sounds like this was meant generically (not specific to xi)?

I am curious now, it seems like the decision isn't a matter of CRDTs not working out (technically), and more a matter of the amount of effort not being worth it compared to other more synchronous approaches?

Absolutely the right call to make. Though, the last thing we want is people giving up on CRDT research which is how the hackernews title reads "CRDTs not working". So I was just trying to clarify things for future audience.

For sure. If you're doing collaborative editing, then CRDT is a reasonable choice, but it comes with tradeoffs compared with other approaches such as OT, as I hope I've explained. (The comparison with differential synchronization is not as well understood as it should be - I suspect it works pretty well in practice but doesn't lend itself to academic analysis.)

On the other hand, the idea of using CRDT for something other than collaborative editing (or certain types of databases where eventual consistency is a reasonable consistency model) is almost certainly not worth the complexity. That's what I wish I had known when I started.

Ah, yes. Very well said!!

Hey, I'm in Bay Area also, we should meet up. I've seen you've commented here on HN on Rik's makepad WebGL stuff and Joseph Gentle's OT (chatting with him on Monday!), which are both people & projects I'm fascinated by too.

Parsing & syntax highlighting before CRDT might give better results

Applications are open for YC Winter 2024

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact