Hacker News new | past | comments | ask | show | jobs | submit login
Resilient Sync for Local First (holtwick.de)
185 points by ingve 10 months ago | hide | past | favorite | 58 comments



The ideas described here are very similar to what SSB (https://ssbc.github.io/docs/ssb/faq.html) implemented.

The main problems with having a log is that it grows with every single change (how granular are your changes? with CRDTs, any mutation, no matter how small, is a change). Questions of data retention (is your protocol tolerant to missing log entries?) or data rewriting (if your log is a merkle tree, rewriting something in the past means rewriting all subsequent entries) are also open.

The post also mentions that the log entries could be CRDTs. But if that's so, then you don't need a log at all, since all the information you need to compute the minimal delta to sync between peers is inside the CRDT itself. For a good overview of how this could be done, see (specifically the "Calculating Diffs" section): https://ditto.live/blog/dittos-delta-state-crdts (disclaimer: I work at ditto)


Thanks for the detailed feedback.

The growth of the log is indeed a weak point that could be improved by regularly merging entries. Missing entries are easily recognizable because a consecutive index is used. The checksums on the previous entry should improve data consistency.

The point that CRDTs themselves already contain all the information required for an update is absolutely correct. I have been working on this protocol for some time and one objective was the reproducibility of the individual changes fro accountability reasons. But this may not be necessary for all applications and could possibly be achieved in other ways. Thank you for pointing this out, I will reconsider the concept in this respect!


I would like to refine my answer regarding the rapidly growing log. If we assume that we have a real-time application, then every keystroke or pointer action can indeed create a change entry.

But this storage format is designed for "long term" and "slow" operations. Where "slow" means in the time lapse of a second instead of a milisecond. This allows us to combine multiple changes into a single log entry.

CRDT implementations like Yjs are good at concentrating such changes into smaller chunks of data. For example, writing text in a rich text editor like Prosemirror is then reduced to something like a string and a position.

But the UI can also be lazy and throttle things. A string input field can only fire changes when the field is left or not typed for a second or so.

These steps will significantly reduce the size of the log. They did in my implementations.

But this is not the end of realtime for such applications. These applications could still pass changes directly over P2P, as long as the log remains consistent, so that the resulting document will always eventually consistent.


I was desktop client lead for Syncplicity (major Dropbox competitor) for almost a decade.

Some thoughts:

Most important: Git (the protocol) kinda-sorta does this already. I personally don't know if git is CRDT; but the .git folder is a database that contains the entire history of the repos. The "git" command can be used to sync a repository among different computers. You don't need to use Github or set up a git server. (But everyone does it that way because it's much, much easier.)

Secondly: The proposal made assumes that everyone will rewrite their software to be CRDT-based and fit into this schema. Writing software this way is hard, and then we need to convince the general public to adopt versions of software that are based around this system. Could you port LibreOffice to writing out documents this way?

Thirdly: Resolving conflicts when computers are disconnected from the network is always a mess for end users. (This was a huge pain point for Syncplicity's users, and is a huge pain point in other "sync" products.) CRDTs don't "solve" the problem for users: The best you can hope for is something like git where it identifies merge conflicts, and then the user has to go in and clean them up.

For example: Two computers are disconnected from the network. Edits are made to the same sentence in a document while disconnected. What happens? The CRDT approach might result in "consistent data," but it cannot "read minds" and know what is the correct result.

----

IMO, some things I would consider:

Try to somehow replicate the desired workflows with git or a similar tool. (Remember Mercurial?) See what kind of problems pop up.

Consider writing a "git aware" word processor that is based around Markdown; and is somewhat aware of how git behaves around conflicts.

Try "bolting on" a sync protocol (or git) to LibreOffice as a way to understand just how easy / hard a project like this is.

Consider encapsulating the entire database in a SQLite file, instead of using the filesystem.


Git is obviously not a CRDT because merge conflicts have to be manually resolved whereas CRDTs require this to happen automatically. A less interesting subset of Git can be viewed as CRDT (e.g. mirrors).


Your users will still need to manually merge in a CRDT application; because computers can not read minds.

The best a CRDT application can do is spit out garbage when changes conflict.

IE, start with: The quick brown fox jumped over the fence.

We both disconnect. At the same exact time...

I change it to: The quick brown fox ran around the fence.

You change it to: The quick brown fox dug under the fence.

The best you can do is make a data structure that is consistent and identifies the conflict. CRDT can't "read our minds" and decide between "ran around" or "dug under".


It can't decide human intent flawlessly, but the point of a CRDT is that it does choose one, and all others choose the same one regardless of how they got there.

Git does not do this, so it is not a CRDT. The content-addressable-database portion of git sorta fits this though (as does any other content-addressable system).


This is basically git automatically doing “accept theirs” or “yours” for any fork. You can see that it will not generally be what you want, so whether such a strategy could work is domain-dependent.


No that’s not accurate. If I merge branch X and then branch Y and someone else merges branch Y and then branch X, with CRDT the result should also be the same whereas with git it won’t be if you’re strategy is always “accept theirs” or “accept yours”. CRDT is also order invariant - it doesn’t matter which ordering of edit operations you accept, the end result is consistent across all nodes.

You may want to read up on the Wikipedia page rather than taking 1 thing I said and extrapolating. https://en.wikipedia.org/wiki/Conflict-free_replicated_data_...


I wouldn't argue that CRDT "solves" this problem, either.

The git solution exposes the conflict to the user, who can then fix it. (Or leave it there if they choose to.)

The best a CRDT can do is leave some kind of conflict marker that the user can fix. (Remember, computers can't read minds. See my "quick brown fox" example.)

Git does this. It's predictable and lossless. Deciding if it's a CRDT probably is more of a discussion about semantics than fact, because "git merge" is lossless: It presents a consistent view that the user can accept or change.

And yes, I know what a CRDT is.


But a CRDT would. The end result at node 1 seeing merge X first and then merge Y next would necessarily have to be the same as node 2 seeing merge Y first and then merge X. That’s literally the core property of CRDTs - all nodes eventually converge to the same state regardless of the network partitioning. Git does not have this property and thus is not a CRDT (for edits - it’s a CRDT for mirroring).

Git is not a CRDT not because “git merge is lossless” but because the result is order dependent which is not partition tolerant.

You may want to read the original paper which defines CRDT [1]. Here’s some choice quotes to help you:

> System model: We consider a system of processes interconnected by an asynchronous network. The network can partition and recover

> Clearly, a sufficient condition for convergence of an op-based object is that all its con- current operations commute. An object satisfying this condition is called a Commutative Replicated Data Type (CmRDT).

Git has some CRDT concepts but the core behavior of creating commits and sharing them does not generally meet the criteria of a CRDT. And no. Requiring a manual merge is also not a property of a CRDT as the whole point of it is to generate a “correct” merge result without human intervention. Otherwise the point of the paper would be almost irrelevant.

[1] https://pages.lip6.fr/Marc.Shapiro/papers/RR-7687.pdf


You're now arguing semantics.

Let's change course for a bit: If git was a CRDT, what would happen when there is a merge conflict between two branches?


Whatever happens, the end result on two different nodes doing the same merge operations (or a commutative ordering of those merge operation) would be identical.

Think about a CRDT document: if two people edit the same line, regardless of what happens, once the documents synchronize, the final state of the document will be identical. That’s also the reason manually resolved merges don’t work because two different people might resolve the same conflict in different ways. But again, the conflict resolution being identical under any commutative ordering of simultaneous operations is the hardest requirement of CRDTs. The commutation requirement is what kills the “always theirs” or “always mine” strategy (there are other scenarios but that’s the easiest one to demonstrate).


Ahh, now you're missing some critical details: How can a CRDT perform a sane merge? (Remember my quick brown fox example.) IE, is it destructive (picks one) or does it output something like: "The quick brown fox !!!(ran around|||dug under)!!! the fence."

This is kind-of what git does: It leaves a sane conflict in your source code. (The result is always the same given the same inputs, too.) The merge conflict might not build; but how git handles merge conflicts will always result in a functioning git repository.


tbh it's increasingly sounding like you're defining a CRDT as "something is decided and written down in all cases" and simply ignoring every single other quality they guarantee.

Those other qualities matter. So much so that they're literally the defining qualities.


Yeah I'm done trying to help this person understand the differences between Gits and CRDTs. They're being intentionally difficult by redefining CRDTs to "what Git does" rather than evaluating Git against the properties a CRDT is defined to have.


Whether or not the user is manually involved at some point is a product decision. I think the trend of consumer companies is not to do that. Possibly damaging user data is simply a tradeoff in this way of thinking.


No, that’s literally the definition of CRDT. Requirement 2 out of the 3 listed on Wikipedia:

> An algorithm (itself part of the data type) automatically resolves any inconsistencies that might occur.

So no, human resolution vs automatic is not a product decision but a key definitional requirement to be a CRDT.

https://en.wikipedia.org/wiki/Conflict-free_replicated_data_...


Excellent. This approach to CRDTs existed even before the term itself was invented. In the 2010 article[2] on Causal Trees[1], your humble servant calls these per-peer op logs "yarns". In the 2011, observing the proliferation of proposals, Shapiro&friends propose[3] the term "CRDT".

That is essentially a partially-ordered approach to oplogs. Oplogs (binlogs, WALs) underlie the entire universe of databases. The problems are exactly the same as the problems "normal" database developers faced in the far past: how much history do you want to keep? How to synchronize the log and the state reliably? What is the format of the log? How to store/transfer the log reliably?

The last question seems trivial, but it is not. If you read some databases' source code, you may find a lot of paranoid things[4], obviously inspired by humiliating real-life incidents. The other questions are not trivial by any means.

So, yes, this is the way[5].

[1]: Alexei "archagon" Baboulevitch excellent popular summary of Causal Trees, 2018 http://archagon.net/blog/2018/03/24/data-laced-with-history/

[2]: The 2010 paper https://www.researchgate.net/publication/221367739_Deep_hype...

[3]: The CRDT paper https://pages.lip6.fr/Marc.Shapiro/papers/RR-7687.pdf

[4]: e.g. https://github.com/facebook/rocksdb/blob/main/db/log_reader....

[5]: the simultaneous post by Nikita Prokopov https://tonsky.me/blog/crdt-filesync/


Thanks for the links, awesome!


...and they existed long before 2010, too. 1975 called, and said to check out the SCCS "Weave" data structure: https://en.wikipedia.org/wiki/Interleaved_deltas

Weaves are the same thing as modern text CRDTs!


Cross posting my comment from the other thread about local first https://news.ycombinator.com/item?id=40786425

My comment may fit a bit better here as this post talks about a protocol instead.

---

https://remotestorage.io/ was a protocol intended for this.

IIRC the visison was that all applications could implement this and you could provide that application with your remotestorage URL, which you could self host.

I looked into this some time ago as I was fed up with WebDAV being the only viable open protocol for file shares/synchronization (especially after hosting my own NextCloud instance, which OOMed because the XML blobs for a large folder it wanted to create as a response used too much memory) and found it through this gist [0] which was a statement about Flock [1] shutting down.

It looks like a cool and not that complex protocol, but all the implementations seem to be unmaintained.

And the official javascript client [2] seems to be ironically be used mostly to access Google Drive or DropBox

Remotestorage also has an internet draft https://datatracker.ietf.org/doc/draft-dejong-remotestorage/ which is relatively easy to understand and not very long.

[0] https://gist.github.com/rhodey/873ae9d527d8d2a38213

[1] https://github.com/signalapp/Flock

[2] https://github.com/remotestorage/remotestorage.js


Thanks for sharing, this is super interesting. Although it doesn't seem to be super active these days. Probably because it is difficult to commercialize local first. That might be why we need a widely adopted and super flexible standard to become attractive to hosters of such services.


The challenge of course is that if your write depends on a previous read, offline/online sync can easily in the wrong sync state once you come back online (in the general case). Even CRDTs are not immune from this - for example your offline document copy says step 1 is delete line 3 so you delete it but then you sync and step 1 was corrected to be “delete line 4” your deletion of line 3 is now incorrect even though the merged result is “valid”. We see this everyday in code development - they’re called merge conflicts.


Under what circumstances would a past entry be corrected? Doesn't CRDT require an immutable stream of operations?


Original document has a list of instructions. The first instruction says “delete line 3”. User 1 offline deletes line 3. User 2 updates the instructions so that step 1 says “delete line 4”. Merged result when user 1 comes back online: step 1 says “delete line 4” and line 3 is deleted with the users none the wiser about what happened unless they double check the merged result. Physically the result is merged consistently but logically the end result of the document is not what was intended.

CRDT just guarantees that the document merges an edit stream in some sensible way without any need for manual conflict resolution. However, such resolution doesn’t (and can’t really) understand logically that the merged result is not what was intended to have happened. This is a contrived example but you can imagine similar scenarios and it’s concerning that people think CRDTs would solve the scenario I presented somehow when they solve a much more constrained technical problem. Imagine you have a CRDT algorithm for managing distributed state. And then you update the code about how the state itself is interpreted or managed. Now you have the problem of differing versions of code running mutating the document containing the state and the document is always in some valid state, but the server are not actually in agreement about how to apply the mutations and the state can diverge in unintended ways.


I get it that you can reach conflicting states, but not with this particular example.

Assuming updates are synced through a centralized server, if user 1 has seen the “delete line 3” instruction, it means that instruction was committed by user 2. At this point they can’t “edit” the first instruction. You can only move forward. Undoing the line 3 deletion is in itself a new instruction adding the line back; then line 4 gets deleted and both clients’ states have converged.


Reread what I wrote as what you wrote seems to be a totally different formulation. User 1 deletes line 3 while User 2 updates the instruction concurrently. The end result is the unintended line is deleted due to a race. The CRDT preserved the structure of the document but not it’s semantic contents.

If it helps, I drew out the document in https://news.ycombinator.com/threads?id=vlovich123#40793833


Ha, I thought by instructions you meant CRDT commands, not actual text instructions.

This is not a failure of CRDT but an inevitable consequence of trying to point to a specific line in a live document. It can happen even without concurrent updates, just two people editing a document in turns; as such it is impossible to resolve regardless of merge strategies, you have to use some kind of section marker that stays in place within the text as it moves (which is basically what every collaborative editor does internally).


This is exactly the kind of scenario CRDTs rectify.


CRDT's largely can't rectify merge conflicts... what CRDT's can do is make your merges convergent. If I auto-merged your Git code using a blunt policy I don't know if you'd reach for "correct" as your first description as opposed to "convergent".

The longer you're offline the more you have to explain to your users the circumstances where they might need to do manual data cleanup or expect creative merging policy. No matter what tech you're using.


This is true if you're working with documents edited by different devices, but many CRDTs do not and can indeed be not only "convergent" but also "correct".

How SSB works is a good example.


Can’t find what SSB is. Can you clarify?

But yes, it’s possible there are applications that can constrain the set of operations such that the merged result is also correct. I was just highlighting that CRDTs are trotted out as a catch all when a) it’s just a way of thinking about the problem rather than an actual algorithm (eg while there’s a lot of CRDT work for document editing, that work has to be done for scratch for any other problem domain) b) you have to modify your problem constrains and solution such that merge results are correct. B is a very very hard problem in a distributed system even if you ignore the challenge of a which itself is quite hard. I wouldn’t trust anything in this space too much without a TLA+ proof that it’s correct (unless it’s something low stakes like document editing).


I think SSB is "Secure Scuttlebutt": http://ssbc.github.io/ssb-db/


Some of our most important DB sync'ing algorithms were only very recently verified. It was okay to run the whole world's systems, including hospitals and armed forces this way. Very very few industry algos receive formal verification.


CRDTs solve merge conflicts by construction, because they necessarily don't permit merge conflicts in the first place.


By definition they don't have conflicts in the technical sense of the word, but they can easily get into bad states after a merge.

It's not technically a conflict since there is a path forward, but the resulting output can be nonsensical.

For example you could employ a strategy where given two concurrent edits, the merge "randomly*" picks one of the two edits for each property/fact, and abandons the other.

That's a construction of a conflict free replicated data type, but it says nothing of the quality of the final result.

Ultimately, the quality comes down to how well you can produce merge strategies for your domain.

* "Randomly" in a deterministic way, e.g. by comparing a hashes of transaction IDs or similar.


> By definition they don't have conflicts in the technical sense of the word...

...which is the sense that we're talking about, when we talk about CRDTs.


When someone goes into detail about correct versus convergent, they are explaining how the technical sense is not what people really want, and you bluntly replying that it's solved (because it's technically solved) is unhelpful at best and misleading at worse.


A CRDT is a purely technical construct! Correctness (in terms of merge conflicts) is well-defined in that context and it's orthogonal to any kind of semantic/application definition!

This is a bonkers conversation. It's like claiming integer division isn't correct because 5/2 should be 2.5 instead of 2.


Alright, try this in Google Doc.

Two users simultaneously select the same word, and replace it by a different word each.

You might end up with: "apple" -> "carrot orange"

Sure it technically merged, and did not lose data. But you still need humans to reconcile the final state anyways. Worse, you might not even notice that you need to cleanup.


Bad example. Google docs doesn’t use CRDTs but uses OT instead. CRDTs may handle your scenario just fine depending on how they decide to handle this scenario.

The easier scenario that is independent of the specific algorithm is the one I described (have follow up comments describing it in more explicit detail) where you have a logical meta instruction within the document and then the document is updated based on outdated meta instructions offline but when they come back online the meta instructions were changed.

Then there’s not even a merge conflict to really worry about.


> Bad example. Google docs doesn’t use CRDTs but uses OT instead. CRDTs may handle your scenario just fine depending on how they decide to handle this scenario.

The CRDT may pick one or the other replacement word, but who is to say that either choice is correct? Perhaps including both words is correct.

> Then there’s not even a merge conflict...

Agree, this is what CRDTs are all about.

> ...to really worry about.

I think it is important to make clear that CRDTs do not "solve" the merging problem, they merely make it possible to solve in a deterministic way across replicas.

Often, CRDTs do not capture higher level schema invariants, and so a "conflict free" CRDT merge can produce an invalid state for a particular application.

There is also the example above, where at the application level, one particular merge outcome may be preferred over another.

So, it isn't as simple as having nothing to worry about. When using CRDTs, often, there are some pretty subtle things that must be worried about. :-)


> The CRDT may pick one or the other replacement word, but who is to say that either choice is correct?

You, as the application developer.

> I think it is important to make clear that CRDTs do not "solve" the merging problem

They literally do, in the context in which they are defined. Which is about data consistency, not semantic correctness.

> Often, CRDTs do not capture higher level schema invariants,

CRDTs never capture higher level schema invariants. Just like TCP doesn't enforce HTTP session authentication. Orthogonal concerns.


Yup 100% agreed.


No, they don't, unfortunately. I feel like the term CRDT leads to misunderstandings like this. CRDTs are more like packaging and abstractions on top of various strategies for building multi-user, local-first UX, they let end users program against recognisable data structures like strings, lists and dictionaries. How they handle conflicts is implementation dependant and isn't anything new, many will fall back to LWW for example.


The specific case of an edit to a line being applied to a different line than what it was intended for because a previous change deleted said line is exactly what CRDTs rectify. I'm not referring to merge conflicts per say, or that the final document will make human sense. But CRDTs will prevent changesets from different participats from being applied in the worst possible way. And there will be no data loss.


You’re misunderstanding what I’m saying.

Original document has a list of instructions. The first instruction says “delete line 3”. User 1 offline deletes line 3. User 2 updates the instructions so that step 1 says “delete line 4”. Merged result when user 1 comes back online: step 1 says “delete line 4” and line 3 is deleted with the users none the wiser about what happened unless they double check the merged result. Physically the result is merged consistently but logically the end result of the document is not what was intended.


That's why relative positions are not used in CRDTs in the absolute sense. Entities are added via IDs and operations are recorded via IDs. If user one deletes line 3 and user 2 deletes line 4, user one will see lines 1, 2 and 5 and onwards. The commands that are captured by CRDTs only operate on the state of the document as the user sees it at the time, their own changes might be overwritten later during a merge, but their changes won't interfere with parts they were not destined to interfere with.


You seem to still not be getting it. Here’s the original online document:

> 1. Please delete line 3.

> 2. Blank line

> 3. This is line 3

> 4. This is line 4.

User 1 is accessing the document offline, sees the instruction on line 1 and changes their local copy to:

> 1. Please delete line 3

> 2. Blank line

> 3. This is line 4

User 2 concurrently changes the document to update the instruction:

> 1. Please delete line 4

> 2. Blank line

> 3. This is line 3

> 4. This is line 4

When user 1 & user 2 merge their documents, they’ll get:

> 1. Please delete line 4

> 2. Blank line

> 3. This is line 4

You can also split this into two documents with a similar effect (one contains instructions, the other contents). The point is that it’s not about how changes are represented. It’s that the logical and semantic structure isn’t represented in the CRDT and thus while some eventually consistent result is attained by all nodes, the logical and semantic structure may end up broken. That’s why you generally don’t see CRDTs for code editors - the merge conflicts still need to be resolved by hand and doing it automatically can result in worse and harder to understand results than normal.


For certain scenarios there will be conflicts, take a boolean value. Client A sets it and client B unsets it. There can only be one winner.

But that might be a benefit from the proposed log sync, because these conflicting situations can be shown and marked for human review in the UI. Each step of change is well documented and the history can fully be reviewed.


Note that in the example I gave there’s not even a conflict to notice. You just have meta instructions telling you what to do within a document. You apply those instructions offline but in the mean time your meta instructions are updated online. You come back and synchronize your changes but there’s nothing that suggests your offline sync against the meta instructions was stale so your edits are applied based on stale meta instructions resulting in a logically undesirable result.

This is basically a TOCTOU race in a distributed context and CRDTs do not magically solve this problem because it’s not solvable. That’s why we have PAXOS and RAFT to do such things and why many many papers have been written trying to solve the Byzantine generals problem in constrained scenarios with well defined criteria.


> And there will be no data loss.

CRDT's can absolutely lose data!


Not when the log is append-only.


Yes, and CRDT's can absolutely lose data. That's the point that should be delivered across when someone claims they can't. Nobody needs to be warned that if you save endless data then you could save all data.


The approach seems similar to Delta Lake's consistency model, using object storage like S3, and yet allows concurrent writers and readers: https://jack-vanlightly.com/analyses/2024/4/29/understanding...


Thanks for feedback. The link is broken for me, but I think you are referring to this page? https://jack-vanlightly.com/analyses/2024/4/29/understanding...

Indeed, the approach is similar. Especially the separation of assets (they call it "Data files") from the log ("Data log") is something I consider being a good choice.


Yes, that's the correct link. Sorry for the broken link.

If the approach works for something as heavy duty as Delta Lake, then it is should work for syncing the data of a end-user app across several devices.

Even SQLite is separating its data and WAL in separate files :)




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: