Faster CRDTs: An Adventure in Optimization (josephg.com)
686 points by xnx 52 days ago | 151 comments

About a decade ago, I implemented the Causal Tree CRDT (aka RGA, Timestamped Insertion Tree) in regular expressions, using a Unicode string as storage. Later we made a collaborative editor for Yandex based on that code. It used many of the tricks described in the text, even the optimization where you remember the last insertion point. So I am terribly glad it is all getting rediscovered.

The code is on GitHub [1]. There is also an article, which might be tough reading, so no link. Recently I greatly improved CTs by introducing the Chronofold data structure [2]. Regarding that benchmark article: I spent many years in academia, so the quality-of-content problem is familiar to me. In other words, I don't take such articles seriously. CRDTs are fast enough; that is not a concern.

[1]: https://github.com/gritzko/citrea-model

[2]: https://arxiv.org/abs/2002.09511

I remember seeing that (regex CTs) and immediately thinking "wtf, why would anyone want to do that". Took me quite a while to understand that it's actually a pretty clever way to write fast state machines in browserland. So thank you for this work!

And let’s face it, anything is better than writing JavaScript.

That is nice!

A couple of questions: Have you released a CT implementation on top of Chronofold? Do you have any plans to benchmark it against other algorithms?

Thanks. No, I didn't release it, although there are implementations on GitHub: Rust and Kotlin, at least. The RON proof-of-concept wiki runs on the (unreleased) C++ implementation [1]. I benchmarked it at some point; everything was like "k nanoseconds" [2].

[1]: http://doc.replicated.cc/%5EWiki/ron.sm

[2]: https://github.com/dmonad/crdt-benchmarks/issues/3#issue-599...


Googling turned up only one Rust impl: https://github.com/dkellner/chronofold which seems to produce invalid results on some inputs. Actually, the hard part (integration) is made of hacks there...

That PoC wiki sounds really interesting, as does the whole replicated.cc project! Any plans on releasing it?

Here is the Kotlin one: https://github.com/decentralized-hse/collab-edit

libron will be released, yes.

May I ask which inputs produced invalid results for you and which parts you consider hacky? I'd very much like to improve the implementation, so a reply here or an issue on e.g. GitHub would be highly appreciated. Thanks!

Wow, thanks for the reply!

Some time ago (~7-8 months) I tried to implement a chronofold-based app myself in a not-so-fancy language (C#). The data structure itself is very nicely described in the paper, but for the merging strategy there are only references to other papers... So I tried to find it implemented in other languages; most of the repos were bare-bones, and yours was the most complete, so I took it as a reference. I lifted the merging strategy verbatim from it, and it passed all the tests in the repo plus some additional tests. But when it faced real user input, strange things happened.

I will try to find exact cases and open an issue, but not today, unfortunately :(

Edit: AFAIR, I had problems with this func: https://github.com/dkellner/chronofold/blob/16773193b2d21f81... and the problem manifested itself on concurrent deletes. Plus it has reversed order for consecutive deletes relative to the original paper. I did not succeed in fixing it, so I just scrapped the whole thing and used a simpler (though not as performant) merging strategy...

Don't worry, your details already help a lot! I will look into it later this week.

The published version indeed has one bug regarding inserts referencing a deleted element. Maybe that is (part of) what you saw as well. I've fixed it locally already, but unfortunately only after making significant other changes which are still to be refactored & published ;-). Probably time to just backport the fix and release it.

Well, that was a productive lunch break ;-). I've just published version 0.2.1, fixing the bug mentioned above.

On a meta level, does anyone else think that the whole idea of writing a peer-reviewed paper that is just a benchmark of different algorithms should be really rigorously reviewed before being accepted? Writing good benchmarks is hard, and so highly contextual, that writing fair comparisons between algorithms (or data structures) is almost impossible unless you're an expert in all of the algorithms involved.

Yeah, I've also seen several academic papers on performance or "optimization" of existing algorithms which just demonstrate a complete lack of knowledge about how those algorithms are implemented in practice.

For example, there was a paper explaining how you could optimize the GJK algorithm by reducing the number of distance checks required, and in turn the number of square-roots... Despite the fact that everyone (including the authors of the original GJK algorithm) knows that you don't actually need to do a square-root to compare distances...
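
(For readers who haven't seen the trick: square roots are monotonic, so comparing squared distances gives the same answer without ever calling sqrt. A minimal TypeScript sketch:)

    // sqrt is monotonic, so dist(p,q) < dist(r,s) iff distSq(p,q) < distSq(r,s).
    function distSq(ax: number, ay: number, bx: number, by: number): number {
      const dx = bx - ax, dy = by - ay;
      return dx * dx + dy * dy;
    }

    // No Math.sqrt anywhere:
    const closer = distSq(0, 0, 3, 4) < distSq(0, 0, 6, 8); // true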

> Despite the fact that everyone (including the authors of the original GJK algorithm) knows that you don't actually need to do a square-root to compare distances..

Academia's purpose is to produce research, typically measured in publications per unit time. Optimizing one paper leads to a global reduction in the size of the literature by pruning opportunities for subsequent research, harming the overall performance of the system.

how do fewer, more accurate (read: not done wrong) studies prune opportunities for future research?

from my view, it becomes easier to do research.

a smaller, more condensed literature is easier to cite and read, as it's "unified".

from there, many other researchers can easily base their work on the paper for novel research.

less doesn't necessarily mean worse.

The problem is that academics are rarely experts at programming, or as knowledgeable about computer architectures as someone in industry. There are various tricks that are never taught in college, so academics have no idea some of this stuff even exists.

Best example is discrete optimization research (traveling salesman, vehicle routing and its variants, schedule rostering etc.). Stuff you find in the papers there achieves state-of-the-art results very slowly (using integer linear programming or some rarely optimized heuristics) making you believe these instances of a general NP-hard problem can't be solved quickly.

When you start tinkering, you either find that data structures can be added that reduce the complexity significantly or that there are regularities in instances that, when exploited, support massive speedups.

I would say that TSP research is an exception, but most of the stuff coming out that has a lot of citations is way too slow and never as brilliantly implemented as the Lin-Kernighan heuristic or other work from the age of insanely slow computers.

> The problem is that academics are rarely experts at programming, or as knowledgeable about computer architectures as someone in industry. There are various tricks that are never taught in college, so academics have no idea some of this stuff even exists.

I want to push back on this generalization a bit. The academics focused on pushing the mathematical boundaries of discrete optimization are focused, no surprise, on only that. If theoretical improvements are not what you want, don't read those papers; read ones by people more focused on applications. Read literally any databases paper, or stuff like Hyperscan and simdjson. I'd argue that a non-trivial amount of this work is vastly /ahead/ of what's standard in industry, but industry is slow to adopt new ideas and catch up because of legacy software and existing codebases. It's very similar in the programming languages sphere: it took ages for industry to adopt ideas that were popular in academia for a long time (e.g. Rust, LINQ). The idea that academia (at least in CS) is an ivory tower, far from the concerns of industry, is not very true these days. There's a lot of cross-pollination, and a revolving door of people going back and forth from one to the other, spreading ideas.

100x +1. The whole field of query optimization has been devoted to improving the efficiency and speed of database systems for decades. [1] Academic results are studied quite closely in industry. Also, "academic" includes people like Mike Stonebraker and Andy Pavlo. They aren't exactly slouches when it comes to performance.

More generally, major waves of performance innovation in the IT field have been driven by advances in storage, compression, vectorization, virtualization, etc. Academic results have led to entirely new products like Vertica (from C-Store [2]) or VMware (from the Disco system [3]).

[1] https://en.wikipedia.org/wiki/Query_optimization

[2] https://w6113.github.io/files/papers/cstore-vldb05.pdf

[3] http://www.cs.cmu.edu/~15712/papers/bugnion97.pdf

edit: clarity

I must say that when it comes to discrete optimization, the genetic/ant/simulated annealing/etc. stuff is more popular in academia than in industry (at least the industry that doesn't heavily include academics).

Works like the Lin-Kernighan heuristic are extremely rare, and a bunch of knowledge exists in industry only. Even the mentioned heuristic was implemented incorrectly for decades until one individual (K. Helsgaun) came along and demonstrated its superiority.

I mean, most of the code that gets released today doesn't even handle the time-window constraint in the best way possible (optimal updates, constant-time feasibility evaluations, etc.). I believe none of the open-source vehicle routing libraries have any data-structure-level optimizations for handling time windows, task precedence, dynamic workshift starts, etc. These are all fairly simple implementation tricks that just got lost under giant piles of population-based heuristics and mathematical linear programming approaches.

> Even the mentioned heuristic was implemented incorrectly for decades until one individual (K. Helsgaun) came along and demonstrated its superiority.

Does this mean that the Linkern program in the Concorde TSP suite is also implemented incorrectly?

Concorde is fine. The LK heuristic was published in 1973. After that, until the mid-90s, no one could outperform the originally published results with the same heuristic.

Honestly, I found the algorithm description cryptic. It may just be me not being "in the know" when it comes to details, with your "bunch of knowledge [that] exists in industry only" being exactly the details the authors didn't care to elaborate on. Maybe part of the reason for the failure to replicate is that other people found it cryptic as well and didn't understand crucial details.

BTW, in your opinion, is there any readable text on the LK algorithm? I'm afraid everything I've read so far suffers from this problem. I have a very specific optimization need that doesn't seem to be covered even by LKH-3. (It seems my problem could be treated either as asymmetric TSP with time windows on a small number of nodes and a non-metric distance matrix, or alternatively, from the other side, as VRP with asymmetric distances, an unknown number of vehicles, and bounded trip duration for every vehicle; but either of the two needs an additional constraint that all vehicle trips should get close to the number of working hours within a day, with the exception of one trip which can be shorter. Basically, I need an "ATSP with sleeping breaks".) So I may need to roll my own implementation of something, and for an outsider there doesn't seem to be a good text on implementing these things realistically without some prior knowledge already.


Helsgaun's reports are pretty descriptive when it comes to the LK algorithm.

But yes, I do agree that the original writing and the way the algorithm is taught in universities is a little bit cryptic and sometimes a completely different algorithm.

A simple explanation of the algorithm:

1. A 2-opt swap (A .. B -> .. -> C .. D ===> A .. C <- .. <- B .. D) - constant-time computation of everything needed to check feasibility and the new cost,

2. recursive 2-opt - now between C .. B, with bounds (you can figure out which swaps are inferior and not recurse).

That's all there is to the algorithm. Helsgaun improves it in a bunch of ways (preprocessing, deeper and more efficient k-opt moves, more efficient data structures) but keeps the same idea.
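
To make step 1 concrete, here's a minimal TypeScript sketch of the constant-time cost check for a symmetric TSP (the function name and signature are mine, purely illustrative):

    // A 2-opt move replaces edges (a,b) and (c,d) with (a,c) and (b,d),
    // reversing the segment between b and c. The cost delta needs only
    // four distance lookups, i.e. O(1):
    function twoOptDelta(
      dist: (i: number, j: number) => number,
      a: number, b: number, c: number, d: number
    ): number {
      return dist(a, c) + dist(b, d) - dist(a, b) - dist(c, d);
    }

    // Apply the swap only when the delta is negative (the tour got shorter).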

There's another very flexible method called "ejection chains", started by Fred Glover, which is very effective if you couple it with efficient data structures, but also very cryptic. It easily supports time windows, breaks, flexible work hours, a flexible number of vehicles, etc. Of course, the latest papers on "ejection chains" show no such thing :D

Sometimes you stumble upon PhDs publishing their code, and the tricks aren't there at all.

Just recently there was a nice categorization of "timing problems": https://onlinelibrary.wiley.com/doi/abs/10.1002/net.21587

There they try to categorize the time complexity of solving these problems, and the tables are just not up to date with industry. Some O(n log n) entries are O(n) or even better (depending on the local moves you do).

But it's a good start. A lot of these subproblems can be implemented really fast on a CPU, and I guess the moment they are explicitly solved in a software library is when the heuristics research will improve.

Do you have any links on efficient ways to handle time-window constraints, etc.?


Constant-time feasibility checks of inserts, and a linear-time update after each insert.

Of course, there are tricks that support multiple simultaneous inserts and tricks that minimize the update of the data structure after insert.

But, if you want to do 2-opt (k-opt) with time windows, then you have to work out the details because no one in academia did.
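
(For a flavor of the constant-time feasibility check mentioned above, here is a from-memory TypeScript sketch of the standard approach: a forward pass storing each stop's earliest service start, and a backward pass storing the latest start that keeps the rest of the route feasible. The names and the exact model are assumptions, not lifted from any particular paper or library:)

    interface Stop { ready: number; due: number; service: number; }

    // earliest[i]: earliest service start at stop i (forward pass)
    // latest[i]:   latest service start at i that keeps the remaining
    //              route feasible (backward pass)
    // With those precomputed, checking an insertion of customer u
    // between stops i and i+1 touches only O(1) values:
    function insertFeasible(
      route: Stop[], earliest: number[], latest: number[],
      travel: (a: Stop, b: Stop) => number,
      i: number, u: Stop
    ): boolean {
      const prev = route[i], next = route[i + 1];
      const startU = Math.max(earliest[i] + prev.service + travel(prev, u), u.ready);
      if (startU > u.due) return false;            // u's own window violated
      const arriveNext = startU + u.service + travel(u, next);
      return arriveNext <= latest[i + 1];          // rest of route still feasible
    }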

Helsgaun did describe some things in his LKH-3 technical report, although it's quite terse and doesn't really go into details: http://webhotel4.ruc.dk/~keld/research/LKH-3/LKH-3_REPORT.pd... . You may be better served by just looking at the source code at http://webhotel4.ruc.dk/~keld/research/LKH-3/LKH-3.0.6.tgz if you're interested in how some specific constraint is handled.

From what I saw (I worked in an academia-consulting partnership solving scheduling problems in transport), OR researchers very frequently work in association with OR consulting shops, and they're well aware of the difference between academic datasets and industrial datasets. In the conferences I attended, it was not infrequent to see separate tracks for industrial and academic papers, both attended by the same crowd.

The point I agree with, though, is that this is not reflected in the papers. Academic papers focus on academic instances because they are more general (and usually harder, as you said) and because optimizations of specific instances of the problem are not that useful from an academic pov.

It's hard to know who works with whom and who has experience with what if you're not an insider, though.

> Stuff you find in the papers there achieves state-of-the-art results very slowly (using integer linear programming or some rarely optimized heuristics) making you believe these instances of a general NP-hard problem can't be solved quickly.

Yes, or looking for things like mathematical purity, a linear problem statement, etc.

In practice: you don't need the best solution and you can get a great solution in a multitude of ways.

You do realize there is a whole area of the research field dedicated to heuristic algorithms? They have just as proper an academic basis as the "correct" solutions.

Since he mentions the L-K heuristic, he obviously does.

On the one hand, peer review takes long enough already. On the other... I saw an influential paper that published obviously wrong timing data, essentially claiming that time(A) + time(B) < time(B). It seems they were including the Python interpreter startup time (~0.1s) in a profile of a ~0.01s algorithm.

Benchmarking papers are inaccurate when the original algorithms are not open-sourced and the grad student needs to rewrite the algorithm from scratch. They can easily get implementation details wrong, and wind up with an implementation that's slower than the original.

I do think that the original algorithm authors should have the opportunity to correct the benchmarking code, or to release their original implementation as open source to be benchmarked.

In some sense, the benchmarking paper with a slower implementation is more "correct", since an engineer evaluating which algorithm to use is just as likely to implement the algorithm in a slower way than the original paper did. The incentives are right too: the original paper's authors should be giving enough detail to recreate their work, and the benchmarker is showing that, as actually published, their algorithm is slow.

It would be nice if pure benchmark papers were a thing. Most of the time, systems papers get accepted for some new idea, and the evaluation section is often biased towards that new idea. Independent benchmarks could fix this.

Only the holy experts can divulge the realm of gods that is benchmarking and reveal to us the results. How do we beseech them, o wise one?

By the way, as someone who has published academic papers: if you're ever bothered by a paper or have some comments, don't hesitate to email the authors. (Their email addresses are always on the paper; especially target the first author because they have normally done the work.) We are happy to hear when someone has read our work, and I at least would've liked to know if someone found a problem with my papers.

> especially target the first author because they have normally done the work

As someone living with a recently promoted (is that the correct term?) PhD in social sciences, this surprises me. Is that something specific to my country or to the social sciences, or did my wife simply land in a case full of rotten apples?

In computer science the first author does the work and is usually a PhD student. The last author is usually the professor that pushed and helped develop the idea, provided funding, and probably wrote or was heavily involved in writing the paper’s abstract, intro and conclusion sections — the bulk of “framing” the work.

But there are exceptions. Some profs are less student-oriented or don't like delegating so much, and remain "individual contributors" deep into their careers. Those tend to publish a nonzero number of first- and single-author papers.

Edit: I've noticed that in Theory and Algorithms, profs tend to take first author even though the student slaved over the proofs. That field is kind of an outlier in that it's close to pure math, and I think it borrows cultural artifacts from math research.

In a lot of math disciplines, papers follow the Hardy-Littlewood rule, so the author names are ordered alphabetically [1]. Maybe that's what you've been noticing. In my area (programming languages, which is sometimes theoretical but mostly a mixed bag), I've noticed only one group follow that convention. Others follow the "first author is the main contributor, last author is the advisor" convention you described.

[1]: https://en.wikipedia.org/wiki/Academic_authorship#Authorship...

I double-checked and it looks like you're right: the Theory papers where I thought the proof-writer took first author are actually alphabetical.

At least in the fields I've published in, first author does the work (or at least most of it, almost always including writing the paper) and last author secured the funding. Sometimes authors are listed in alphabetical order, in which case one or two are usually listed as "corresponding author". The less-senior corresponding author usually did the work, and the more-senior corresponding author is usually either the one who secured the funding, or a long-term member of the research group who is likely to actually be around to answer correspondence in the future (and who also probably helped out enough to be worth corresponding with).

It's definitely more normal in the social sciences for the senior author to be first, so your wife's experience is probably not strange for her field. What people assume about author order really varies a ton between fields.

I think it's pretty field specific. In a lot of CS for example the first author did most of the work and the last author got the funding. The people in the middle may have done varying degrees of work ranging from helping substantially with the implementation, evaluation or writing of the paper to once having read the paper abstract :)

I think the term you're looking for might be "postdoc" (postdoctoral researcher)? I've seen "recent PhD" a few times as well, seems pretty clear.

In my uni in Belgium, this is the case too. First author does most of the work.

Hello HN! Post author here. I’m happy to answer questions & fix typos once morning rolls around here in Australia

Have you seen my Xi CRDT writeup from 2017 before? https://xi-editor.io/docs/crdt-details.html

It's a CRDT in Rust and it uses a lot of similar ideas. Raph and I had a plan for how to make it fast and memory efficient in very similar ways to your implementation. I think the piece I got working during my internship hits most of the memory efficiency goals like using a Rope and segment list representation. However we put off some of the speed optimizations you've done, like using a range tree instead of a Vec of ranges. I think it also uses a different style of algorithm without any parents.

We never finished the optimizing and polished it up, so it's awesome that there's now an optimized text CRDT in Rust people can use!

Oooohhhh no I haven't read that - thanks for the link! I'm embarrassed to say I knew about the Xi editor years ago, but I totally forgot to go read & learn about your CRDT implementation when I was learning about Yjs and automerge and the others. I'll have a read.

And thanks for writing such an in depth article. It’s really valuable going forward. Maybe it’s addressed in your write up but are there any plans for that code, or has everyone moved on? I’d love to have a zoom chat about it and hear about your experiences at some point if you’d be willing.

Out of curiosity, what do you use to make those diagrams?

https://www.figma.com/ and putting a lot of effort into them

Hi josephg, I'm a CRDT researcher. It's great to see so much work around CRDTs nowadays!

Some of the optimizations you discuss have already been proposed by papers and implementations.

For instance, LogootSplit [1] proposes an implementation based on an AVL tree with extra metadata to get a range tree. LogootSplit also proposes a block-wise approach that stores strings instead of individual characters. Xray [2], an experimental editor built by GitHub and written in Rust, uses a copy-on-write B-tree. Teletype [3] uses a splay tree to speed up local insertions/deletions, based on the observation that a user performs several edits on the same region.

[1] https://members.loria.fr/CIgnat/files/pdf/AndreCollabCom13.p... [2] https://github.com/atom-archive/xray [3] https://github.com/atom/teletype

Cool! It'd be interesting to see those CRDT implementations added to Kevin Jahns' CRDT Benchmarks page[1]. The LogootSplit paper looks interesting. It looks like xray is abandoned, and I'm not sure about teletype. Though teletype's CRDT looks to be entirely implemented in javascript[2]? If the authors are around I'd love to see some benchmarks so we can compare approaches and learn what actually works well.

And I'm not surprised these techniques have been invented before. Realising a tree is an appropriate data structure here is a pretty obvious step if you have a mind for data structures.

To name it, I often find myself feeling defensive when people read my work and respond with a bunch of links to academic papers. It's probably totally unfair and a complete projection from my side, but I hear a voice in my head reword your comment to instead say something awful like: "Cool, but everything you did was done before. Even if they didn't make any of their work practical, usable or good they still published first and you obviously didn't do a good enough literature review if you didn't know that." And I feel an unfair defensiveness arise in me as a result that wants to find excuses to dismiss the work, even if the work might be otherwise interesting.

It's hard to compare their benchmark results because they used synthetic randomized editing traces, which always have different performance profiles than real edits for this stuff. Their own university gathered some great real-world data in an earlier study; it would have been much more instructive if that data set had been used here. At a glance their RAM usage looks to be about 2 orders of magnitude worse than diamond-types or Yjs. And their CPU usage... ?? I can't tell, because they have no tables of results - just some hard-to-read charts with log scales, so you can't even really eyeball the figures. So it's really hard to tell whether their work ends up performance-competitive without spending a couple of days getting their enterprise-style Java code running with a better data set. Do you think that's worth doing?

[1] https://github.com/dmonad/crdt-benchmarks

[2] https://github.com/atom/teletype-crdt

Yes, xray was abandoned and teletype is written in JS.

I understand your point, and as a researcher and engineer I know the feeling. I took some precautions by saying "some optimizations". I value engineering as much as research, and I'm bothered when I hear either side telling the other that their work is worthless. Your work and the work of Kevin Jahns are very valuable, and could improve the way that researchers and engineers do benchmarks.

It is still hard for me to determine when position-based list CRDTs (Logoot, LogootSplit, ...) are better than tombstone-based list CRDTs (RGA, RGASplit, YATA, ...). It could be worth assessing that.

Three years ago I started an update of LogootSplit. The new CRDT is named Dotted LogootSplit [1] and enables delta synchronization. The work is not finished: I had other priorities, such as writing my thesis... I still have to perform some benchmarks. However, I'm more interested in the hypothetical advantages of Dotted LogootSplit for synchronization over unreliable networks. From an engineering point of view, I'm using a partially-persistent-capable AVL tree [2]. Eventually I would like to switch to a partially-persistent-capable B-tree. Unfortunately, writing a paper is very time-consuming, and time is missing.

I still stick with JS/TS because, in my view, Wasm is not mature yet. Ideally, I would like to use a language that compiles both to JS and Wasm. Several years ago I welcomed Rust with a lot of enthusiasm. Now I'm doubtful about Rust, due to the inherent complexity of the language.

[1] https://github.com/coast-team/dotted-logootsplit/tree/dev [2] https://github.com/Conaclos/cow-list

Cool! What do you think is missing from wasm for maturity? It seems great for something like CRDTs, since the code is reasonably self-contained. I hear you about Rust - I'm not convinced it'll ever be as popular as Java / C# / JS for exactly that reason. But Rust doesn't need to be that popular for me to enjoy it, or for the people who use my software to reap the speed & safety benefits.

I'll have to take a read of LogootSplit. I suspect most / all list CRDTs can work with this approach (using a list internally and doing an insertion sort). But I don't know enough about how logoot / logootsplit works to know!

And I really hear you about writing papers taking time. That blog post we're talking about here took nearly a month of time in aggregate to write. I wrote the initial draft in about 2 days, but editing and adding diagrams and everything was exhausting. There's still more work I could have put into it before publishing - I anticipated some of the things people were confused by in this thread. But at the end of the day, published > perfect and I have code to write as well!

> To name it, I often find myself feeling defensive when people read my work and respond with a bunch of links to academic papers. It's probably totally unfair and a complete projection from my side, but I hear a voice in my head reword your comment to instead say something awful like: "Cool, but everything you did was done before. Even if they didn't make any of their work practical, usable or good they still published first and you obviously didn't do a good enough literature review if you didn't know that." And I feel an unfair defensiveness arise in me as a result that wants to find excuses to dismiss the work, even if the work might be otherwise interesting.

I've followed your work for a long time (since the chipmunk-js days), and that is a very honest self-assessment.

When you write:

> Yjs does one more thing to improve performance. Humans usually type in runs of characters. So when we type "hello" in a document, instead of storing ['h','e','l','l,'o'], Yjs just stores: ['hello']. [...] This is the same information, just stored more compactly.

Isn't this not just the same information when faced with multiple editors? In the first implementation, if I pause to think after typing 'hel', another editor might be able to interject with 'd' to finish the word in another way.

In my view, these data structures are only "the same information" if you provide for a reasonably-sized, fixed quantum of synchronization. The merging makes sense if e.g. you batch changes every one or two seconds. It makes less sense if you would otherwise stream changes to the coordinating agent as they happen, even with latency.

I didn't explain this well, but the transformation is lossless. No data is lost by compressing like this. It has no impact on the concurrency protocol or network protocol; it just changes how the data is stored locally.

If we need to, we can split the run back out into individual characters without losing information. And that does happen - we do it if something later gets inserted into the middle of a run of characters. Pausing doesn't help: even the same user could later type something in the middle of a word they'd typed previously.

This limits what we can run-length encode. The code in diamond only run-length encodes items when they are:

- Typed sequentially in time

- And sequentially in insert location (position 10, 11, 12, 13, ...)

- And either all of the characters have been deleted or none of them have been deleted

Notice the other fields in that example in the post. Each subsequent character has:

- An id field = the previous id + 1

- A parent field = the id of the previous item

- The same value for isDeleted

If any of this stuff didn't match, the run would be split up.

Instead of storing those fields individually, we can store the id and parent of the first item, an isDeleted flag and a length field. That's all the information we actually need to store. With Yjs-style entries (which is what diamond-types actually implements), the code is here if you can make heads or tails of it. The poorly named "truncate" method splits an item in two. Spans are only extended if can_append() returns true: https://github.com/josephg/diamond-types/blob/c4d24499b70a23...
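
In TypeScript-flavoured pseudocode, the shape is roughly this (a sketch only - the real diamond-types structures are Rust, use (agent, seq) pairs for IDs, and live inside a b-tree):

    interface Span {
      id: number;        // id of the first item in the run
      parent: number;    // parent of the *first* item; later items
                         // implicitly parent the item before them
      len: number;
      isDeleted: boolean;
      content: string;   // len characters
    }

    // Append only when every implicit field lines up:
    function canAppend(s: Span, id: number, parent: number, deleted: boolean): boolean {
      return id === s.id + s.len      // next sequential id
        && parent === id - 1          // parented by the run's last item
        && deleted === s.isDeleted;
    }

    // Split a run at offset `at` (what the "truncate" method does):
    function split(s: Span, at: number): [Span, Span] {
      return [
        { ...s, len: at, content: s.content.slice(0, at) },
        { ...s, id: s.id + at, parent: s.id + at - 1,
          len: s.len - at, content: s.content.slice(at) },
      ];
    }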

With real data from Martin's editing trace, 182,000 inserted characters get compacted down to 12,500 list items. Which is still pretty good - that's 14x fewer entries. And it means performance doesn't stutter when large paste events happen.

I have a related question: if you're storing [“hello”] as one chunk, what happens when you perform an edit, say adding an extra [“e”] after the [“e”]? In the unoptimised structure I know you can just add the new [“e”] as a child of the original [“e”]. So here would you then delete the chunk [“hello”] and split it into two halves like [“he”] and [“llo”]?

Yes, exactly. You replace the chunk with “he” and “llo”, then insert the extra item in the middle. The result is [“he”, “e”, “llo”]. The code to do all this inline in a b-tree is pretty hairy, but it works pretty well!

Ah, I see. I had thought that the consolidation gave a batched update with a single ID, so 'h' + 'ell' + 'o' would have IDs of 1, 2, and 3 respectively. That would have made an editing conflict in the middle of 'ell' impossible.

Ah, that makes sense! Concurrent changes are one thing, but they're rare. The big problem if we did it that way is that it would become impossible to insert in the middle of "ell", because we wouldn't be able to name any of those internal positions.

To get around that, we assign a sequential ID to every inserted character, regardless of how it's typed or stored. Typing "hello" assigns IDs 1-5, and so does pasting "hello" from the clipboard.
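
Sketched out (with plain integer counters standing in for the real (agent, seq) ID pairs):

    // Every inserted character consumes the next sequence number,
    // whether typed one at a time or pasted in bulk:
    let nextSeq = 1;
    function assignIds(text: string): number[] {
      return Array.from(text, () => nextSeq++);
    }

    assignIds("hello"); // => [1, 2, 3, 4, 5], even for a single paste event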

The chunking should depend entirely on the network synchronization. If synchronization is editing-debounced, then chunking could be applied rather safely.

Wait, it doesn't look like you used the performance branch of automerge (which is now merged into master). It is significantly faster.


I did use the performance branch. And I had a chat with a few people in the automerge community about the performance numbers I was seeing, long before I published, to see if I was doing anything wrong. I tested a few different versions of automerge, but in this test there wasn't much performance difference between 0.12 and the 1.0.x-preview versions (which are built from the merged performance branch), and I also tried the unreleased automerge-rs. When I ran my tests, timing numbers for automerge ranged from about 5m with the old non-performance branch down to about 4m20s or so with automerge-rs. Still far from Yjs's 0.9 seconds.

I just checked, and it looks like automerge 1.0.1-preview-4 has landed. I wrote the post benchmarking preview-2. I've been knee-deep in diamond-types lately and haven't been watching. Fingers crossed there are some more performance improvements in the pipeline. I'd love to do a follow-up in 6 months showing much-improved performance.

Great post! I had no idea that list CRDTs could actually be fast because I read the papers showing how they were impractical. Thanks for investigating and writing this up — please accept my offer of a legitimate academic credential.

It seems that reproducibility in computer science, where no gigantic/proprietary datasets are needed, should not be a problem: just publish a repository with the code. Are there any forces at play that make this so rare in practice?

Credit where it's due: the academics did publish the code they wrote on GitHub. But I don't know if anyone - reviewers or readers - actually took the time to read it, let alone understand why it throws doubt on the paper's conclusions.

Usually (at least in my specific niche of the computer science field), if the code is published at all, it's only published after the paper has been reviewed. This is partly to preserve anonymity during the review process, and also because the code usually isn't seen as "part of the paper" (i.e. "the paper should stand on its own"). Although I agree that for papers about benchmarks, you could argue the code should definitely be considered an essential part of the paper.

Thank you for writing this piece Joseph.

Just want to make sure if something's a possible typo or I'm getting it all wrong :)

Quote: "But how do we figure out which character goes first? We could just sort using their agent IDs or something. But argh, if we do that the document could end up as abcX, even though Mike inserted X before the b. That would be really confusing."

Since the conflict is only between the children of (seph, 0) the only possibilities are, either ending up with "aXbc" or "abXc" right? Or is there a legitimate possibility of ending up with "abcX" ?

I'm assuming we'll apply a common sorting logic only to clashing siblings.

Good question. That part of the article could probably use another diagram to explain it.

The resulting document is generated by doing a depth-first prefix traversal of the tree. The ambiguity comes about because "b" and "X" are both direct children of "a", so it's not clear how they should be ordered relative to each other. Because "c" is a child of "b" in this example, the "X" can't appear between the "b" and "c". The only valid orderings are, as I said, "aXbc" or "abcX". But without knowing how "b" and "X" should be ordered, it's ambiguous which one to use.
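
In code, that traversal looks something like this (a sketch; the rule shown for ordering concurrent siblings, sorting by ID, is just one valid choice, and real implementations break ties more carefully):

    interface Node { id: string; value: string; deleted?: boolean; children: Node[]; }

    // Depth-first prefix traversal: emit this node's character, then
    // recurse into its children. Children of the same parent (like "b"
    // and "X" above) need a deterministic order; here we sort by ID.
    function render(node: Node): string {
      let out = node.deleted ? "" : node.value;
      const kids = [...node.children].sort((a, b) => (a.id < b.id ? -1 : 1));
      for (const child of kids) out += render(child);
      return out;
    }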

Let me know if that's still confusing! This stuff is hard to explain without a whiteboard.

Clear enough Joseph. Thanks :)

I've been following your work for years (and I'm actually neck-deep in a ShareDB project right now) so I just want to say thank you for all of your contributions! I especially enjoyed this post.

I love high-level systems languages like C/C++ and Rust… but everything you said about JavaScript being slow is the same thing assembly programmers experience when optimizing high-level systems languages.

In general, when I see C code and I’m asked to speed it up, I always use “100x” as my target baseline.

Whoa, that's a really impressive baseline to reach for when optimizing C code! I'd love to hear some war stories.

As you can probably tell from my article, most of my skill at this stuff comes from hard-won tricks I've picked up over the years - like reducing heap allocations and packing memory for cache coherency. There are probably lots of things I just haven't learned because I haven't discovered them on my own.

Do you have a blog, or any recommendations for stuff to read by you or others?

I learned low-level programming while at Intel, working with one of their performance teams; so no, nothing written. In general, though, assembly is more expressive than higher-level languages; it lets you "tighten up" the code the computer executes, reducing work & bandwidth.

Specifically, a guy named Mike Abrash taught me some of this stuff, and he’s got some books explaining the theory in terms of practice.

I believe that I understood the code tagged as follows

> (But don't be alarmed if this looks confusing - we could probably fit everyone on the planet who understands this code today into a small meeting room.)

and the follow-up reading confirmed what I believed about this code

should I be worried about myself ?

Terminology nit: cache coherence refers to CPU cache implementation behaviour at the hardware level in the presence of concurrent access from multiple cores. "Data locality" or "cache-friendly data layout" would work better here.

Good point - thanks. I'll tidy that up!

That’s an impressive optimisation! Out of curiosity, what do you think are the most interesting or useful possible applications for an optimised CRDT?

When you’re approaching an optimisation like this, do you mind me asking how you think about it and approach it?

The most useful application I see is moving toward building local first software[1].

And as for optimization, my approach is surprisingly unsystematic. I think about the program as a whole, and think through & investigate where all the time is being spent. There are usually ways to restructure things so the hottest code path runs faster. Sometimes that involves changing data structures or languages. Sometimes it just needs early returns for common cases, or inlining string libraries and things like that to avoid allocations.

Sometimes it helps to imagine yourself going through, by hand, the drudgery the computer is doing. If I were actually doing that boring work by hand, there would almost always be ways I'd start taking shortcuts. All that's left then is programming those shortcuts into the computer.

[1] https://www.inkandswitch.com/local-first.html

Thanks for ShareDB. It's dope. I extended it to support collaborative voxel editing (https://jel.app) and it works great.

Oh that's cool!! Did you use json-ot for that? I haven't touched that code in years, and it's delightful that people are actively maintaining it and using it to make cool stuff.

I use ot-json for other stuff, but wrote my own ot-vox which deals with voxel grid cells that can be assigned a color.

This was a great read, thank you. I wish there were more explanations of the "black magic" part of Yjs. I'll have to dig into that.

If you’re interested in learning more about Yjs’s internals, I interviewed Kevin for 3 hours on zoom and got him to take me through Yjs’s code. He’s a great guy.

The video of that conversation is here: https://youtu.be/0l5XgnQ6rB4

Thanks a lot. That's on my video queue.

There's a series of videos on Yjs, whitepapers, etc. Check out the Yjs web site and search YouTube for details.

Have you used CRDTs to solve any practical problems?

If so, how does the CRDT solution compare to a non-CRDT solution? If a non-CRDT solution is feasible at all?

Drifting off-topic but I've wondered this myself - I've been interested in CRDTs in a "read the news" way but not a "this rises to something I'm going to implement" way.

Perhaps it's blindingly obvious to all here, so no one mentions it: Thinking about more practical and real-world problems seems like collaboration on more complex/structured data. Today, it seems one would still have to transform that data to a large flat string underneath, and implement an editor that only performs edits that maintain the integrity of the higher-level object, while the flat string provides collaboration features. Perhaps it's possible to implement an XML schema known to the editor so all insertions/deletions keep the document syntactically correct.

I wonder if there's some other way to either generalize on top of "big string" or create another base model (somehow?)

> Today, it seems one would still have to transform that data to a large flat string underneath, and implement an editor that only performs edits that maintain the integrity of the higher-level object, while the flat string provides collaboration features.

Lots of people think this and have mentioned it over the years, but it's a dangerous idea. The way concurrent edits are handled makes it really easy for the automatic merging algorithms to mess up the syntax of your XML / JSON content. And that's really hard for non-programmers to fix.

The right answer for this stuff is to just make CRDTs which support other data types, the same way we did for OT with ot-types [1]. We need CRDT code for lists + text (working now - text is just a list of characters), and for rich text and JSON. That'd cover 99% of the use cases. I'm tackling strings first in diamond-types because it's probably the hardest case, but I want to have native support for other data structures soon too.

Yjs and automerge already do this. They have support for plain text, XML and arbitrary JSON structures.

The simplest implementation for JSON structures + tuples is probably shelf[2], which is so simple you can implement it in about 25 lines of javascript. Shelf doesn't support lists, but combining shelf with the list code I already have in diamond-types is probably going to be good enough for most applications.

[1] https://github.com/ottypes/

[2] Shelf's description + code is here: https://braid.org/algorithms/shelf . Or you can watch the video of Greg (shelf's author) discussing the algorithm here: https://braid.org/meeting-8 . Kevin Jahns (Yjs's author) is in that recording too, and he's super jazzed about how simple and beautiful shelf is.
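
For a taste of why people are excited, here is a from-memory TypeScript sketch of the shelf idea - see the braid.org links above for the authoritative version; the tie-breaking rule here is my assumption:

    // A shelf is a [value, version] pair. Higher version wins; equal
    // versions merge recursively for objects, else fall back to a
    // deterministic tiebreak so every peer converges.
    type Shelf = [value: any, version: number];

    const isObj = (x: any) =>
      typeof x === "object" && x !== null && !Array.isArray(x);

    function merge(a: Shelf, b: Shelf): Shelf {
      const [av, an] = a, [bv, bn] = b;
      if (an !== bn) return an > bn ? a : b;
      if (isObj(av) && isObj(bv)) {
        const out: Record<string, Shelf> = { ...av };
        for (const k of Object.keys(bv))
          out[k] = k in av ? merge(av[k], bv[k]) : bv[k];
        return [out, an];
      }
      // Same version, non-mergeable values: deterministic tiebreak.
      return JSON.stringify(av) > JSON.stringify(bv) ? a : b;
    }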

First of all, thank you for the amazing read! I thoroughly enjoyed the entire article, and it gave me a new perspective on the feasibility of CRDTs for real world applications performance-wise.

Though I am curious now to hear your thoughts on the conflict resolution side of the equation for complex data structures like deeply nested JSON.

The biggest takeaway I got from Martin's talk on the topic from a few years ago was that while CRDTs are theoretically guaranteed to eventually converge, the resulting converged state might not make any sense to applications that need to consume and act on that state [1].

At the time it seemed to me like a pretty fundamental challenge to using CRDTs to store arbitrary application state, but I imagine the field has progressed immensely since then, so I'd love to hear any insights you might have around strategies that could be used to alleviate, or at least work around, this challenge if I wanted to build a CRDT-based app today.

[1] https://youtu.be/8_DfwEpHE88?t=2232

I'm not sure how much the field has improved - there's a good chance there are some new papers I haven't read. But I think it's pretty doable. For all the talk of concurrent editing, the reality is that having multiple users edit the same value at the same time is incredibly rare in most applications. It's rare enough that concurrent editing is just basically broken in most web apps, and nobody seems to mind or talk about it. For structured / database data, the best-effort merges of current systems (or simple last-writer-wins approaches) are a fine solution in 95% of applications.

But ideally we want something like the semantics of ot-json1 [1], which supports arbitrary move operations. This is necessary if you wanted to implement a program like Workflowy on top of a CRDT. Martin thinks this is possible in a CRDT by sort of embedding part of an OT system and doing a transform, but I don't feel very satisfied with that answer either.

The other thing I would love to see solved is how you would add git style conflicts into a crdt. The best effort merging strategy of most OT & CRDT systems is fine for real-time editing but it isn’t what you want when merging distant branches.

Automerge today supports arbitrary JSON data, inserts, deletes and local moves. I think that's plenty for the data model in 99% of software. I think most software that fits well into a classical database model should be reasonably straightforward to adapt.

I’m not sure if that answers your question but yeah, I’m thinking about this stuff too.

[1] https://github.com/ottypes/json1

Thanks for the great post. Indeed, as a former scientist myself, I can say you have to take everything you read with a grain of salt. I've seen inside the sausage factory, and concluded that YMMV.

> The other thing I would love to see solved is how you would add git style conflicts into a crdt. The best effort merging strategy of most OT & CRDT systems is fine for real-time editing but it isn’t what you want when merging distant branches.

I found this comment very interesting. I have been playing with the idea of 3-way merging CRDTs, similar to the git approach. Have even used this type of branching in commercial software I work on for handling concurrent changes to files.

I'd be very interested to know if any efforts are being made on this in the CRDT community. (I'm more of an interested onlooker; I use a lot of the same concepts in my software, but not rigorously.)

CRDT is a general concept; editing text is just one possible application. If you have a stronger datatype, great - you can build operations on top of it to implement a CRDT system, depending on its properties.

It would be a little bit strange to build a CRDT system on top of a more traditional data system. CRDTs solve problems at the network layer, at enormous cost to basically everything else. If you're not using them there, I can't quite understand what they're doing for you?

What I've read talks about character insertion and concepts that apply to editing text.

Perhaps I just need to find bedrock to build up from about the properties of a stronger datatype that allow CRDTs to work.

https://arxiv.org/pdf/1805.06358.pdf looks like a decent starting point, with references to work on other types.

and pointers to other papers, too - thanks!

The article mentions at the beginning that the author used CRDTs in Google Wave/ShareJS.

AFAIK Wave and ShareJS both used OT (which the paper that this article refers to was also attempting to benchmark).

FWIW, I am myself also curious about this (the question of comparing CRDT to non-CRDT solutions): I found OT beautiful, but never really felt CRDTs had the same elegance; so I am downright fascinated to see the person I have always seen as a "god of OT" deciding to forsake it and move to CRDTs going forward. Are they really that great? Is there something simple I am missing in the explanation of why they are so much better? (Is it maybe that they can support arbitrary merging without a sequencer to linearize the edits? Honestly, my question probably sucks a bit, as I haven't spent any time thinking about OT or CRDTs in at least three years - and even then only once since a few years before that, as I have had other things to spend most of my mental energy on recently - so I am failing to remember the details of my use cases, or the tradeoffs I saw, or the implementation issues I felt were easy/hard.)

Do you want a centralized server to control the data? Then just use OT. Do you want users to control the data, and have your server essentially just be a forever-present user? Then use a CRDT.

CRDTs certainly do have a mathematical elegance to them.

Yep, this is the best practical advice at the moment. Well, for list CRDTs. State CRDTs (like a counter) are small and fast, and kinda better than OT in every way.

List ("operation based") CRDTs and OT systems are "equivalent" in a very academic sense that nobody really talks about or understands. It's really not obvious unless you've been staring at this stuff for years, but the equivalence is there:

You can make a CRDT out of any OT system by just shipping the entire history of operations to each peer. List CRDTs essentially do that, with a whole lot of tricks to compress that data set and use it without needing to linearly scan.

And you can convert the other way too. You can add a "rename" operation into a list CRDT which assigns a new name to each element currently in the document. Before the rename operation, document "hello" might have IDs [a4, b2, b3, b1, a5]. The rename operation changes the IDs to [c1, c2, c3, c4, c5]. When an operation happens, you specify the version and the ID of the predecessor at that version (eg c2). The insert happens there. Then you need a method to take the ID at one version and "transform" it to the ID of the same item at a different version. Do the rename operation implicitly after every change, and voila! You now have OT semantics. "Insert after c1" means "insert after position 1".

OT systems have one big advantage which is that you don't have to ship the CRDT state to every peer. With a rename operation, we can add back the operational simplicity of OT systems into a CRDT. But the code is (and always will be) much more complicated. So I think OT makes sense for strictly server-client systems.

You can also have a hybrid server, which talks CRDT to full peers on the network but just does OT when talking to browser clients and things like that. We talked about this at our public braid meeting at the start of the week. The discussion about this stuff starts about 30 minutes in: https://braid.org/meeting-15

> OT systems have one big advantage which is that you don't have to ship the CRDT state to every peer... You can also have a hybrid server, which talks CRDT to full peers on the network but just does OT when talking to browser clients and things like that.

Could you clarify what you mean? Assuming your CRDT is defined in terms of "operations" that contain (at minimum) an identifier+sequence tuple, zero or more references to other operations, and a value (as they are in this article) then there's no reason why you couldn't just ship a batch of individual operations to other clients when something changes rather than the whole state, since each operation is defined in absolute terms.

In other words, if you start with [A4="a", B2="b", B3="c", B1="d", A5="e"] at site A, and it gets turned into [A4="a", B2="b", B4="f", B3="c", B1="d", A5="e"] following a change from B, you can ship something like B4="f"->B2 to C as long as C's CRDT has synced up to version vector A5|B3. (And if it hasn't synced up yet, and you're not using a transport with causal delivery guarantees, the change could be cached at C until its dependencies have arrived.)

I don't think there's any need to transition to an OT system or to add renames in order to get this delta-shipping benefit: all the data you need is already there, unless I'm missing something. (But maybe you're describing something else?)
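
A sketch of the buffering this implies, assuming each op carries an (agent, seq) ID plus the IDs it causally depends on (all names here are illustrative):

    type Id = { agent: string; seq: number };
    interface Op { id: Id; deps: Id[]; /* payload elided */ }

    // Version vector: the highest seq integrated from each agent.
    type VV = Map<string, number>;

    const seen = (vv: VV, id: Id) => (vv.get(id.agent) ?? 0) >= id.seq;

    // Apply an op once it's the next one expected from its agent and
    // all of its causal dependencies have already been integrated:
    function canApply(vv: VV, op: Op): boolean {
      return op.id.seq === (vv.get(op.id.agent) ?? 0) + 1
        && op.deps.every(d => seen(vv, d));
    }

    // Ops that fail the check sit in a pending queue and are retried
    // whenever a new op is integrated.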

Yes, my point was that the peer needs to translate a user's insert of "insert f at position 3" into "insert f between IDs B2 and B3". To do that, you need the "CRDT chum" - you basically need that peer to know the ID of every item in the document. This data compresses well, but it's still annoying to ship around and complex to manage. OT doesn't need any of that.

> And you can convert the other way too. You can add a "rename" operation into a list CRDT which assigns a new name to each element currently in the document.

Operations in a CRDT must be commutative for merge/update to be well-defined, so it's not immediately clear how a "rename" operation can be expected to work properly.

Wave used OT rather than CRDTs, as the author discusses in another post: https://josephg.com/blog/crdts-are-the-future/

My mistake, it says OT was used. Thank you for correcting me.

When optimizing `findItem`, did you consider storing the original index of each item on itself and using that as a starting point?

Obviously this might move later (maybe it can only increase?), but usually not by much, so I would guess it would make an efficient starting point / be immediately correct 99% of the time?

Looks like you already have 2 good solutions to this though (start from index of recent edits and range tree).

It's a great article - really informative and enjoyable to read. Thanks for making it happen. :)

I've been looking for a practical OT alternative for our online word processor (https://zoho.com/writer). We already use OT for syncing our realtime edits, and are exploring CRDTs targeting stronger consistency for tackling offline edits (which are typically huge & fragmented, since the edits are not synced in realtime).

So the baseline is that OT has a better model for holding state in terms of performance/memory, since the edits can be compiled into plain string types. CRDTs in comparison force us to hold deleted state as well, and demand richer information per unit (character/string/etc) - which makes them harder on the CPU/RAM.

Here's the story as I understand it:

1. Automerge tackles this by just moving to a better lower-level runtime: Rust.

2. Yjs handles this by using a similar technique, i.e. relying on V8's hidden classes to handle the performance optimizations, and assuming real-world cases to narrow down and optimize the data structures.

But none of these seem to be a fundamental breakthrough in the efficiency of the algorithm itself. They all at best look like workarounds, and this keeps bothering me.

If you've got big offline edits (or you're merging multiple large sets of edits), even existing CRDTs will generally handle that more efficiently than OT will. OT algorithms are usually O(n * m) in time when merging n edits from one peer with m edits from another peer. A CRDT like diamond-types is O((n + m) * log(s)), where s is the current size of the document. In practice it's super fast.

As for holding deleted state and richer information per unit, it's not so bad in absolute terms. 1-2 MB of data in memory for a 17-page document is honestly fine. But there are also a few different techniques for solving this in CRDTs:

1. Yjs supports "garbage collection" APIs. Essentially you say "anything deleted earlier than this point is irrelevant now" and the data structures will flatten all runs of items which were deleted earlier. So storage stays proportional to the size of the not-deleted content.

2. Sync9 has an algorithm called "antimatter" which Mike still hasn't written up (poke poke, Mike!). Antimatter actively tracks the set of all peers on the network. When a version is known to have been witnessed by all peers, all the extra information is safely discarded. You can also set it up to assume any peer which has been offline for however long is gone forever.

3. Personally I want a library to have an API method for taking all the old data and just saving it to disk somewhere. The idea would be to reintroduce the same devops simplicity of OT where you can just archive old history when you know it probably won't ever be referenced again. Keep the last week or two hot, and delete or archive history at will. If you combined this with a "rename" operation, you could reduce the "hot" dataset to basically nothing. This would also make the implementation much simpler - because we wouldn't need all these performance tricks to make a CRDT like diamond-types fast if the dataset stayed tiny anyway.

I know that it is hard to comprehend why modern CRDT implementations are fast. But the data confirms that they work great. OT seems to be much simpler, but there are real advantages in using CRDTs. The performance problems have been solved through an efficient representation of the CRDT model.

The gist of the read below [1] is that it is impossible for a human to create a document that Yjs can't handle (even in the worst-case scenario). But yes, it handles real-world scenarios particularly well.

The concept of "hidden classes" is super old. It was first implemented in a fork of Smalltalk and then became a foundational concept in runtime engines for scripting languages. It is implemented in V8, Python, Ruby, SpiderMonkey, ...

Yjs does not assume a "real-world scenario" and it is not optimized for any specific runtime engine. It runs fast in any browser. The benchmarks confirm this. [2]

Yjs is being used in practice by several companies (eg Nimbus Notes with >3 million users) for quite some time now. I'm not aware of any performance problems.

[1]: https://blog.kevinjahns.de/are-crdts-suitable-for-shared-edi... [2]: https://github.com/dmonad/crdt-benchmarks

You can remove tombstones in a cleanup pass if you constrain behavior a bit.

For instance, if there’s just a single server, after 5 minutes of no active connections you could clean out tombstones. After that, if a client connects with some changes they had been holding onto but got DCed, you can reject the write and let the user’s client merge by some other means (perhaps even manual).
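A minimal sketch of that single-server policy (all the names here are illustrative, not any particular library's API):

  type Edit = { baseVersion: number /* plus the actual operations */ }

  class Server {
    version = 0
    gcVersion = 0 // tombstones at or below this version have been pruned

    maybeCleanTombstones(activeConnections: number, idleMs: number) {
      if (activeConnections === 0 && idleMs >= 5 * 60_000) {
        this.gcVersion = this.version // safe: nobody is mid-merge
      }
    }

    applyEdit(edit: Edit): boolean {
      if (edit.baseVersion < this.gcVersion) {
        // The tombstones needed to merge this edit are gone; reject the
        // write and let the client merge by some other means.
        return false
      }
      this.version++
      return true
    }
  }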

This is great! I'd like to quote a line here, because I think the answer is “someone on HN knows” and I'd like to hear the answer as well.

> V8 is actually suspiciously fast at this part, so maybe v8 isn't using an array internally to implement Arrays? Who knows!

The V8 blog is a good starting point for learning how it works under the hood: https://v8.dev/blog/fast-properties

Hmm... I would have thought they implemented something similar to a std::deque, which gives you amortized O(1) insertion at either end (though inserting into the middle of any array-backed structure is still O(n)).

> There's another approach to making CRDTs fast, which I haven't mentioned here at all and that is pruning.

Pruning is a key thing I appreciate about Yjs, because it's not just a performance optimization - it's a privacy feature. Users often expect that if they delete something from a document, it's gone unless they have explicitly turned on document revisioning. A CRDT without pruning leaves every accidental paste or poorly phrased remark in the permanent record.

Many thanks for writing this detailed article and for the work you are doing on diamond and Braid.

I recently discarded a serious chunk of time and effort invested in Logux.io to move to YJS, and so far it has been a very good move. This is for a major evolution of a knowledge base/notes app I'm developing, Clibu Notes.

I was very pleased to read your very positive comments on YJS. Kevin Jahns continues to do great work with YJS.

Great to see this work from a fellow Australian.

> my range tree is just a slightly modified b-tree. But usually when people talk about b-trees they mean a BTreeMap. That's not what I'm doing here. Instead of storing keys, each internal node of the b-tree stores the total number of characters (recursively) in that item's children. So we can look up any item in the document by character position, or insert or delete anywhere in the document in log(n) time.

Cool! This is essentially the same idea I implemented in 2012; I call it the AList or A-List data structure: http://core.loyc.net/collections/alists-part1

Ever since I made it, I've been looking for a good application for it. I guess this is it! I mean, I knew A-List was a good data structure for text editing, but most applications can use a Gap Buffer which is much simpler. But when it comes to concurrent editing, you've got multiple editing points so a Gap Buffer is suddenly much less attractive.
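For the curious, a minimal sketch of the counted-tree idea from the quote above, with subtree sizes instead of keys in internal nodes (real implementations like diamond-types or counted B-trees pack many entries per node):

  type Node =
    | { kind: 'leaf'; text: string }
    | { kind: 'internal'; children: Node[]; sizes: number[] } // sizes[i] = chars in children[i]

  // Look up the character at a document position in O(log n) by
  // descending, subtracting each subtree's size as we skip past it.
  function charAt(node: Node, pos: number): string {
    if (node.kind === 'leaf') return node.text[pos]
    for (let i = 0; i < node.children.length; i++) {
      if (pos < node.sizes[i]) return charAt(node.children[i], pos)
      pos -= node.sizes[i]
    }
    throw new Error('position out of range')
  }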

> Honestly I'm shocked and a little suspicious of how little ram Yjs uses in this test.

It's good, but still it uses ~30x as much RAM as plain string edits. Not surprisingly, you got 3x better memory usage by using A-List and a more efficient language (Rust in this case, but C# and C/C++ can also do well).

There is a great article about a CRDT concept called "Causal Trees" [1]. I wonder how it compares to flat-list-based CRDTs (it's been too long since I researched this).

By the way, Microsoft has a new set of libraries for concurrent editing called Fluid Framework [2]. I'm told it's a "generalized data structure" that was inspired by Causal Trees but with a unique and intention-preserving editing scheme. I found out about it after they decided to use my fast-cloning copy-on-write B+ tree for TypeScript [3]... they sent me a pull request for diffing versions of B+ trees, but I haven't yet looked into the architecture of their concurrent data type.

[1] http://archagon.net/blog/2018/03/24/data-laced-with-history/

[2] https://fluidframework.com/

[3] https://www.npmjs.com/package/sorted-btree

Correct me if I'm mistaken:

The difference between diamond native and diamond WASM demonstrates how, even with WASM, native implementations beat browsers hard. Native implementations are still very much worth it performance-wise, especially for lower-powered devices and, perhaps, for reducing battery usage (as a consequence of less CPU use) on mobile devices.

The wasm implementation here was still running under a JavaScript test harness, so I suspect it's the JS-WASM boundary interactions that are causing the slowdown. WASM itself (if it doesn't need to interact with JavaScript) usually runs with a much smaller performance penalty.

I suspect so too - given there are 280,000 calls across the JS-wasm boundary, and most of those calls pass a string. I'd love to know for sure though. I considered making this benchmark pass the whole data set in one go through JSON or something, but that felt like cheating - since that's not how the API would be used in practice during a real editing session.

But even paying that cost, it still seems much faster to use rust + WASM than run the same algorithm in native javascript. And the JS-wasm boundary will probably get gradually faster over the next few years.

Yes. Ultimately WASM is executing within a sandbox & involves being JIT compiled (read: not heavily optimized except for hot loops, eventually). If native compilation is an option, it makes sense to go that route.

WASM competes with asm.js, not asm (or, arguably, the JVM etc).

WASM JIT implementations tend to be quite a bit different from JavaScript JIT, so that's not really where the perf difference comes from.

First, WASM gets all the heavy AOT optimizations from the middle end of the compiler producing it. At runtime, WASM JIT doesn't start from program source, but from something that's already been through inlining, constant propagation, common subexpression elimination, loop optimizations, dead code elimination, etc. And WASM is already typed, so the JIT doesn't have to bother with inline caching, collecting type feedback, or supporting deoptimization.

Because of that, the only really beneficial work left to do is from the back end (i.e. arch-specific) part of the compiler- basically, register allocation and instruction selection. WASM JIT compilers don't bother trying to find hot loops or functions before optimizing. Instead, they do a fast "streaming" or "baseline" codegen pass for fast startup, and then eagerly run a smarter tier over the whole module and hot-swap it in as soon as possible. (See e.g. https://hacks.mozilla.org/2018/01/making-webassembly-even-fa...)

The perf difference vs native rather comes from the sandboxing itself: memory access is bounds checked, support for threads and SIMD is limited (for now), talking to the browser has some overhead from crossing the boundary into JavaScript (though this overhead will go down over time as WASM evolves), etc.

These previous HN discussions on the OT vs CRDT paper are an excellent overview of the topics [1], [2].

From the paper's conclusions: "Our discoveries from this work revealed facts and evidences that refute CRDT superiority claims over OT on all accounts, which helps to explain the underlying reasons behind the choices between OT and CRDT in the real world."

The fact that the paper, in its footnotes, provides a refutation of one of the points made in the HN discussion [1] (i.e. a reference to itself) regarding the claimed CRDT superiority is rather amusing, and the first such attempt I have personally seen in a published paper.

Reminds me of the data structures in this markdown library https://github.com/markdown-it/markdown-it/blob/master/docs/...

The author hand-waves away the possibility that there could be memory-locality performance benefits from using arrays in JS, but my hunch is that there is something to it. I know the React codebase, for example, went for a monomorphic object structure to represent components to leverage inline caching of hidden classes.

This is awesome! I'm researching delta-state based CRDTs for a master's dissertation; these kinds of optimizations on op-based CRDTs are really interesting.

Excellent article! As someone who has to work with collaborative editing I must say the complexity of the whole area is at times daunting to say the least. So many edge-cases. So many mines to step on.

Now I think I am convinced that the OT vs CRDT performance comparison is kind of a moot point, and the question is more about the user experience: which version produces nicer results when two very diverged documents are merged. Maybe one of these days I'll read an article about that too.

To get off on a tangent a little bit, I'd be interested to know how one could add in tracking of changes to Diamond or other CRDT? Can you add an arbitrary payload to the operation and then just materialize the areas different from the original snapshot? I know Yjs can do this by creating snapshots and then comparing them to another snapshot but it seemed a bit awkward and not suited for real-time editing.

> Which version produces nicer results when two very diverged documents are merged.

From the user’s perspective merging behaviour is basically identical in all of these systems.

Diamond supports full per character change tracking. So you know who authored what. I think Yjs does this too. I’m not sure what you mean about materialising areas differently? I’d like to have full branch support in diamond at some point too, so you can work in a branch, switch branches, merge branches, and all of that.

> From the user’s perspective merging behaviour is basically identical in all of these systems.

Ah, ok. Neat.

> I’m not sure what you mean about materialising areas differently?

I meant finding the ranges of the document that have been changed relative to snapshot x and showing them based on the user id who changed them. If that can be done in real-time as the changes come in that would be really impressive.

Ah. Yeah, that's doable. Diamond stores the client ID which authored each character in the document, and I have a method for finding out which changes exist in one version but not in another.

What I like about "tests" in software development is that anyone can run them: just download the source code, then run ./test or right-click and "run tests". It would be cool if computer science could offer the same experience - just download the source code and run it, compare whether you got the same result, inspect and learn from the source code, etc. Instead we get "here's some pseudo-code we've never tried", and a mathematical formula that you need to be a mathematics professor to understand... Yes, we know you are not a professional software developer, and the code is going to be at a beginner's level, but that is fine. I am not reading your paper to criticize your code for being "impure" or not using the latest syntax and frameworks; I'm reading it to understand how to implement something, to solve a problem.

I also cannot understand how a paper whose main contribution is a set of benchmarks does not actually make the source code for those benchmarks publicly available. Unbelievable that such a paper can pass peer review. Very unscientific.

I think this data structure is usually called a Counted B-tree https://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtr... instead of a range tree.

Xi has/had a rope library that allowed one to apply many monoids at each internal node. So one could search for a position in the document as TFA is doing, but also count bytes, Unicode codepoints, Unicode characters/glyphs/widths, etc. with just one tree.

What's common to xi's approach and TFA's is monoids. Monoids are at the heart of CRDT.
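A sketch of what a monoid at each internal node can look like, with a single summary type that tracks several metrics at once (the names here are illustrative, not xi's actual API):

  interface Monoid<M> {
    identity: M
    combine(a: M, b: M): M
  }

  type TextSummary = { bytes: number; codepoints: number; newlines: number }

  const textSummary: Monoid<TextSummary> = {
    identity: { bytes: 0, codepoints: 0, newlines: 0 },
    combine: (a, b) => ({
      bytes: a.bytes + b.bytes,
      codepoints: a.codepoints + b.codepoints,
      newlines: a.newlines + b.newlines,
    }),
  }

  // Leaves summarize their chunk; internal nodes just combine the cached
  // summaries of their children, so any metric can be sought in O(log n).
  function summarize(chunk: string): TextSummary {
    return {
      bytes: new TextEncoder().encode(chunk).length,
      codepoints: [...chunk].length,
      newlines: chunk.split('\n').length - 1,
    }
  }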

People who are interested in the topic: I just found out they have open meetings about the parent project, and it seems like anybody can join - https://braid.org/

Great way to share progress. Kudos! :)

Yep! Our next meeting is in two Mondays from now on August 2nd, at 4:00pm Pacific Time. All are welcome: https://braid.org/meeting-16

If it weren’t for academic papers, we probably wouldn’t have the beautiful and nice concept of a CRDT in the first place. We might have 100 different solutions promising us 100 different replication schemes, some with and some without guarantees, hiding behind a gazillion different undefined terms that want to make us believe each solution is better than sliced bread. Also, I don’t think Google Wave failed because of its OT algorithm.

I'm getting mixed messages on CRDTs. Are we at the point now where they are general enough that the human observer is not constantly confronted with 'surprises' from the behavior of the system?

Some of the talks by Kleppmann go straight into the weeds and make it hard to tell if he's just nerding out about finer points or lamenting unsolved problems, or even paradoxes.

As a community, we’re in the process of crossing that river right now. A few years ago it was an accomplishment to get a text-based CRDT working at all. Now implementations are starting to compete on features and performance, and they’re starting to see some use in real-world applications. But there are still some edge cases to iron out and understand, in terms of memory size and pruning and things like that.

In a few years the rough edges will be ironed out and well understood, and there will be a good set of CRDT implementations you could use without worrying about this stuff. I think Yjs might already be there.

Trees are a powerful and practical data structure, but even if it does not appear clearly when doing O(n) style complexity analysis, they are usually slow.

Unfortunately, the difference between slow and fast can be several orders of magnitude, while the perception of the programmer doing a back of the envelope analysis seems to be a logarithmic scaling of the reality...

Trees seemed to work pretty well for me here!

The problem with trees is usually that people don't pack enough data into each node in the tree. I implemented a skip list in C a few years ago[1]. For a lark I benchmarked its performance against the C++ SGI rope class which was shipped in the C++ header directory somewhere. My skip list was 20x faster - which was suspicious. I looked into what the SGI rope does and it turns out it was only putting one character into each leaf node in the tree it constructed. Benchmarking showed the optimal number was ~120 or so characters per leaf. Memcpy is much much faster in practice than main memory lookups.

In diamond-types (benchmarked here), the internal nodes in my B-tree store 16 pointers and leaf nodes store 32 entries. With run-length encoding, all 180,000 inserted characters in this data set end up in a tree with just 88 internal nodes and a depth of 3. It goes fast like this. But if you think an array based solution would work better, I'd love to see it! It would certainly need a lot less code.

[1] https://github.com/josephg/librope
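To make the run-length point concrete, here's a toy sketch of the kind of coalescing that lets 180,000 single-character inserts collapse into a handful of spans (illustrative only, not diamond-types' actual representation):

  // A run of consecutively-typed characters from one peer.
  type Span = { agent: string; seq: number; text: string }

  function appendInsert(spans: Span[], agent: string, seq: number, ch: string) {
    const last = spans[spans.length - 1]
    // If this character continues the previous run from the same peer,
    // extend that span instead of allocating a new entry.
    if (last && last.agent === agent && last.seq + last.text.length === seq) {
      last.text += ch
    } else {
      spans.push({ agent, seq, text: ch })
    }
  }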

Isn't that also the insight which led to HAMT (and friends) being so fast compared to older persistent datastructures? Turns out for a while now copying more but denser memory has been way cheaper than chasing pointers.

Brian Cantrill also noticed that in his comparison of his C and Rust versions of statemap [0]: Rust had way, way better cache behaviour than C (a 96.9% L1 hit rate vs 77.9%, and half the L2 misses, although the better L1 behaviour also led to >90% fewer L2 interactions in the first place), and most of that was attributed to using a BTreeSet instead of an AVL BST.

[0] http://dtrace.org/blogs/bmc/2018/09/28/the-relative-performa...

If you haven't seen https://crdt.tech/ yet, it is a good resource for papers etc. on CRDTs.

Very nice. When I read "doubly linked list" I immediately thought "what about a btree-like structure?" I guess Martin's idea to replace the IDs comes from the "vector clock" idea for concurrent updates.

If anyone is looking for the combination of a piece table and B+ tree (which appears to be what is talked about in this article), I have one that I've been using for years across various GTK components.


What was the original paper referenced in the article? I couldn’t find a link or title. I remember being in France with some researchers working in the CRDT space, and I'm wondering if I might know the authors.

This is a fantastic write-up. Worth a read even if you don’t care about CRDTs.

I mean, CRDTs should be really fast in theory. Now it turns out they can be in practice. Most excellent.

Damn! What a fantastic read! And of course brilliant coding/thinking :) Well done, sir!

> If you want database semantics instead of document semantics, as far as I know nobody has done this well on top of CRDTs yet.

This is very interesting to me. CRDTs for databases sounds fantastic. Databases have much much richer data types and transactional semantics than collaborative text editing, and this makes applying CRDTs to databases harder.

Let's start with the basics. CRDTs are all about designing or picking monoids that fit the problem and allow one to get a very good approximation of the semantics one is after, if not even exactly.

What does that mean for databases? Well, for one, every data type in a database will have to have a monoid associated with it -- this is... limiting, but limiting is good if the benefit is that we don't need any more complex mechanisms to get convergence in a distributed database. For example, we can have a table of "likes" where they can only increase additively, so we'll make that BIGINT with the ADDITIVE monoid or whatever. But `BIGINT with the ADDITIVE monoid` is the easy stuff. The hard part is PRIMARY and UNIQUE KEYs.
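As a toy version of that additive column: a grow-only counter, where each replica owns a slot, increments touch only your own slot, and merge takes the per-replica max (strictly speaking the merge is a join-semilattice rather than a plain monoid, but it's the same spirit):

  type GCounter = Map<string, number> // replicaId -> count

  function increment(c: GCounter, replica: string, by = 1) {
    c.set(replica, (c.get(replica) ?? 0) + by)
  }

  // Merge is commutative, associative, and idempotent, so replicas
  // converge no matter the delivery order.
  function merge(a: GCounter, b: GCounter): GCounter {
    const out = new Map(a)
    for (const [replica, n] of b) {
      out.set(replica, Math.max(out.get(replica) ?? 0, n))
    }
    return out
  }

  function value(c: GCounter): number {
    let sum = 0
    for (const n of c.values()) sum += n
    return sum
  }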

So what about unique keys? Well, we can have monoids for those too. Like tiebreakers based on timestamps. Especially since we have delineated transactions (BEGIN; ..; COMMIT;) we know that if some INSERT fails eventually then the whole transaction containing it fails (unless that INSERT is of the OR IGNORE / ON CONFLICT DO NOTHING variety).

The real problem with CRDT and databases is that CRDT is incompatible with SQL transactional semantics. Similarly for CRDT and filesystems and POSIX semantics. You can't tell if some transaction will commit successfully until you've heard from all collaborators enough to know that it must have. Instead you can consider every local transaction that succeeds locally as committed, but then later every possible conflict has to be resolvable in some way. This gets tricky real fast. It might be easier to start with "trivial" transactions like POSIX file rename(2). If two applications decide to rename the same file to different names, and both observe success, and eventually only one of those renames succeeds, then it has to appear to the loser that the winner came along and renamed the file after the loser did. This sequence of events:

  loser                        winner
  -----                        ------
  rename("/a/b", "/a/c") = 0
                               rename("/a/b", "/a/d") = 0
would have to look to the loser like:

  loser                        winner
  -----                        ------
  rename("/a/b", "/a/c") = 0
                               rename("/a/c", "/a/d") = 0
                   note the difference --+
But, of course, POSIX has something to say about this, and that is "nope!". POSIX says "no" because if the loser and the winner share notes they'll find that what appears to the loser to have happened is not at all what happened. There are very specific rules about observability of writes, and order of events, that POSIX has that a CRDT distributed filesystem simply must break. The loser's successful rename(2) call should have been visible to the winner, so the winner should have lost the race to rename that file. Still, it's a pretty good compromise for the benefits of CRDT.

Similarly for O_CREAT | O_EXCL: the loser can imagine that the winner unlinked the file created by the loser then created a replacement. Again, we've left the land of POSIX at that point, but again, it can be a pretty good compromise for some applications.

SQL transactions are infinitely more complex than POSIX ones, but I think the analysis of the POSIX case generalizes to SQL ACID transactional semantics: you can't quite have that with CRDT. You might find parts of a transaction that looked successful to you undone or changed later by a collaborator, in ways that, if you were to look at the actual events, you'd be annoyed to find violate ACID. And again, for some applications this probably just won't work. Though, too, we might come up with monoids that help us reach acceptable semantics.

For example, in Active Directory each domain controller gets its own pool of RIDs (relative identifiers) for assignment to new users and groups and machine accounts, so there can be no conflicts about those, but there is no sub-namespacing of the names of those things, so the rule AD uses is that the loser's user/group/machine gets renamed to something like "copy_of_{original_name}". Also, AD requires a strong primary role for the DC that hands out RID ranges for allocation, which again means we're leaving the land of CRDT for some things. So in AD one might find a user/group/machine unexpectedly renamed like that (but it rarely ever happens).

But if the application was using atomic transactions to decide whether to have destructive external side-effects, such as "sell!!!", "launch missiles", etc., then this kind of shifting sands transactional semantics may be... unsatisfying.

CRDT resembles eventual correctness in a way. It's not that local state is ever incorrect, but that the observed sequence of events may be one that would be impossible in a system with traditional ACID transactional semantics.

All this said, just note that I've done zero work in this area, just lots of thinking. I've this idea that one could use PG logical replication publications and subscriptions, and monoid choices encoded in COMMENTs, and suitable functions, to implement a CRDT scheme on top of PG to explore this space. Basically, each collaborating server publishes its view of a log of local transactions and subscribes to all the others' (in separate schemas), then periodically it applies the others' logs to the local source of truth. For simple applications that might even suffice instead of a database that natively supports CRDT. I have done work on encoding useful schema metadata in COMMENTs using JSON, so I'm pretty confident that this is something worth exploring.

One thing that is clear is that some of this space has been explored already. For example, again, AD has done so for PRIMARY/UNIQUE KEYs! Indeed, AD even implements a hybrid approach to multi-mastering a distributed database, with an "infrastructure master" for some things, and CRDT for others. AD's metaschema is LDAP's, with some enhancements (especially an ObjectDN syntax for "relations" or "pointers") that make it a lot more like a relational database. And AD is a general purpose database that achieves these things. So it's not like this space is completely new -- there are a few giants' shoulders to stand on.
