
Hi archagon,

Have you checked out Noms (https://github.com/attic-labs/noms)?

> But studying Differential Sync further, I realized that a number of details made it a non-starter. First, though the approach seems simple on the surface, its true complexity is concealed by the implementation of diff and patch. This class of functions works well for strings, but you basically need to be a seasoned algorithms expert to design a set for a new data type. (Worse: the inherent fuzziness depends on non-objective metrics, so you’d only be able to figure out the effectiveness of your algorithms after prolonged use and testing instead of formal analysis.) Second, diff and patch as they currently exist are really meant for loosely-structured data such as strings and images. Barring conversion to text-based intermediary formats, tightly structured objects would be very difficult to diff and patch while maintaining consistency.

Noms is a structured database in the spirit of Git. It includes efficient (cost proportional to the size of the diff, not the size of the input) diff and merge for sets, lists, maps, and user-defined structs. The built-in merge strategies are commutative and idempotent, so if your operations can be merged using them, then your schema is a (state-based) CRDT.
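
To illustrate what commutative + idempotent buys you, here's a toy sketch in Go of a state-based merge for a grow-only set. This is illustrative only, not the Noms API:

    package main

    import "fmt"

    // GSet is a toy grow-only set. Merge is set union: commutative, associative,
    // and idempotent, so replicas can exchange whole states in any order, any
    // number of times, and still converge -- the defining property of a
    // state-based CRDT.
    type GSet map[string]bool

    func (s GSet) Add(v string) { s[v] = true }

    // Merge folds another replica's state into this one.
    func (s GSet) Merge(other GSet) {
        for v := range other {
            s[v] = true
        }
    }

    func main() {
        a := GSet{}
        b := GSet{}
        a.Add("x")
        b.Add("y")

        a.Merge(b)          // order of merges doesn't matter...
        a.Merge(b)          // ...and repeating a merge changes nothing
        fmt.Println(len(a)) // 2
    }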

We didn't initially build Noms for this type of p2p, masterless architecture, but recently I find myself thinking more and more that it's a great fit.

> Next, there are some issues with using Differential Sync in an offline-first environment. Clients have to store their entire diff history while offline, and then, on reconnection, send the whole batch to their peers for a very expensive merge. Assuming that other sites had been editing away in the meantime, distantly-divergent versions would very likely fail to merge on account of out-of-date context info and lose much of the data for the reconnected peer. Finally, Differential Sync only allows one packet at a time to be in flight between two peers. If there are network issues, the whole thing grinds to a halt.

I may be misunderstanding here, but I think these issues do not afflict Noms:

1/ Noms can compute the diff between any two states efficiently. It doesn't matter how many local changes you make while offline - what is transmitted is just the diff between the two states (see the sketch after this list).

2/ Merge performance is relative to the size of the diff being merged, not the number of changes.

3/ I don't think one-packet-in-flight-at-a-time applies to Neil Fraser's differential sync. It definitely doesn't apply to Noms.
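
To expand on 1/: in a content-addressed tree, equal subtrees have equal hashes, so a diff can skip them wholesale and only walk the parts that changed. A rough sketch of the idea in Go (the node layout and hashing here are made up for illustration; this is not how Noms actually structures or hashes chunks):

    package main

    import (
        "crypto/sha1"
        "fmt"
    )

    // Node is a toy content-addressed tree node. In a real store each node's
    // hash is computed once and persisted with the chunk; it is recomputed
    // here only to keep the sketch short.
    type Node struct {
        Value    string
        Children []*Node
    }

    func (n *Node) Hash() [20]byte {
        h := sha1.New()
        h.Write([]byte(n.Value))
        for _, c := range n.Children {
            sum := c.Hash()
            h.Write(sum[:])
        }
        var out [20]byte
        copy(out[:], h.Sum(nil))
        return out
    }

    // Diff collects leaf values present in b but not in a. Subtrees with equal
    // hashes are skipped wholesale, so the work tracks the size of the
    // difference, not the number of edits that produced it.
    func Diff(a, b *Node, out *[]string) {
        if a != nil && b != nil && a.Hash() == b.Hash() {
            return // identical subtrees: nothing underneath can differ
        }
        if b == nil {
            return
        }
        if len(b.Children) == 0 {
            *out = append(*out, b.Value)
            return
        }
        for i, bc := range b.Children {
            var ac *Node
            if a != nil && i < len(a.Children) {
                ac = a.Children[i]
            }
            Diff(ac, bc, out)
        }
    }

    func main() {
        v1 := &Node{Children: []*Node{{Value: "a"}, {Value: "b"}}}
        v2 := &Node{Children: []*Node{{Value: "a"}, {Value: "c"}}}
        var changed []string
        Diff(v1, v2, &changed)
        fmt.Println(changed) // [c]
    }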

Given your interest in CRDTs, I thought you might be interested to take a look.

There's still more work that needs to be done to make Noms work really well for this kind of use case, but I think it's one that makes a lot of sense.

Since you've just spent a ton of time looking at all the options, I'd love to hear your thoughts on how Noms fits into your thinking.




1-3 are generally true for ORDTs (in archagon's terms).

I wonder how much metadata noms needs to achieve those features.

CRDTs brute-force all those distributed/concurrency issues by adding some metadata. With CRDTs, two hard topics are (1) metadata size and (2) gc.

So, how can we estimate the overhead?

P.S. I think I will elaborate. So, imagine a real-time collaborative editor. Assuming every edit (1 char) is also a git-like commit, you have to mention its hash and its parent hashes. noms trims hashes to 20 bytes, which is reasonable, so we have at least 40 bytes of metadata, incompressible by definition. That means a 1:40 data-to-metadata ratio, or roughly 1:200 once you account for the fact that ordinary text compresses but hashes don't. That is more or less the reason we did not use hashes in one of our projects some time ago.
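
To spell out the arithmetic (the 5:1 text compression figure below is only an assumption for illustration):

    package main

    import "fmt"

    func main() {
        // One keystroke modeled as a git-like commit carrying its own hash
        // plus one parent hash, both trimmed to 20 bytes.
        const payload = 1.0     // bytes of actual data (one character)
        const hashLen = 20.0    // bytes per hash
        metadata := hashLen * 2 // own hash + parent hash = 40 bytes

        fmt.Println(metadata / payload) // 40 -> a 1:40 data-to-metadata ratio

        // Hashes are effectively random and don't compress, while ordinary
        // text compresses roughly 5:1 (an assumed figure), so the effective
        // ratio is closer to 1:200.
        const textCompression = 5.0
        fmt.Println(metadata / (payload / textCompression)) // 200 -> ~1:200
    }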

What is your perspective on that?


Hi gritzko!

I hadn't seen RON. It looks really interesting - I'll definitely take some time to play with it soon. It's so great how much work is happening in this space right now.

> 1-3 is generally true for ORDTs (in archagon's terms).

That makes sense to me except for calculating diffs efficiently. Are you saying that can be done in time proportional to the _diff_? (as opposed to the number of changes?)

> P.S. I think I will elaborate. So, imagine a real-time collaborative editor. Assuming every edit (1 char) is also a git-like commit, you have to mention its hash and its parent hashes. noms trims hashes to 20 bytes, which is reasonable, so we have at least 40 bytes of metadata, incompressible by definition. That means a 1:40 data-to-metadata ratio, or roughly 1:200 once you account for the fact that ordinary text compresses but hashes don't. That is more or less the reason we did not use hashes in one of our projects some time ago.

> What is your perspective on that?

We don't actually transmit the hash of a chunk if we transmit the chunk itself since that's duplicate information. So the overhead would be half your calculation.

My perspective is that this is frequently an entirely acceptable tradeoff. Your calculation is the absolute worst case, and there are many applications for which that worst case occurs infrequently. Today's collaborative text editors, for example, usually damp the liveness on purpose for UX reasons. It's cool to see someone type live, until you are that person and realize that the other side is watching you misspell in realtime :).

The minimum size of an update in Noms is currently one "chunk". Chunks currently average 4kb in the code, though that is tunable (the tradeoff is that smaller chunks make local reads and writes more expensive).
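
The average size comes from content-defined chunking: a rolling hash over the data picks chunk boundaries, and the boundary probability sets the average chunk size. A toy sketch of the idea (not Noms' actual rolling hash or parameters):

    package main

    import "fmt"

    // chunk splits data at positions where a toy rolling-style hash hits a
    // boundary pattern. With a mask of n low bits, boundaries occur roughly
    // every 2^n bytes, so the mask is the knob that tunes average chunk size:
    // smaller chunks mean smaller updates on the wire, but more chunks to
    // read and write locally.
    func chunk(data []byte, mask uint32) [][]byte {
        var chunks [][]byte
        var h uint32
        start := 0
        for i, b := range data {
            h = h*31 + uint32(b) // stand-in hash; not what Noms actually uses
            if h&mask == 0 && i > start {
                chunks = append(chunks, data[start:i+1])
                start = i + 1
                h = 0
            }
        }
        if start < len(data) {
            chunks = append(chunks, data[start:])
        }
        return chunks
    }

    func main() {
        data := make([]byte, 64*1024)
        for i := range data {
            data[i] = byte(i * 7)
        }
        cs := chunk(data, (1<<12)-1) // 12 low bits -> ~4KB average chunks
        fmt.Println(len(cs))
    }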

If you wanted to reduce the size of updates in Noms further, you could do that by transmitting patches against known chunks rather than whole new chunks.

It looks like the identifiers in RON are 16 bytes, so I think if you continue pushing in this direction with Noms, you end up with wire formats that cost about the same. It just hasn't been something we've worked on yet.

In exchange for this tradeoff, what you get in Noms is:

- No need to ever accumulate a history of changes in order to know the current state. Each snapshot directly and efficiently represents the current state. It looks like in RON you can transmit a state, but it's not clear to me when nodes would choose to do that vs a patch.

- Local storage properties comparable to a traditional database. Store large amounts of data with efficient queries, sorted scans, mutations, etc.

I'm not sure how this all compares to RON; I'm looking forward to digging into it.


> I hadn't seen RON. It looks really interesting

It is like the Lamport event model married to an LSMT database. And their children are CRDTs.

> That makes sense to me except for calculating diffs efficiently. Are you saying that can be done in time proportional to the _diff_? (as opposed to the number of changes?)

Slightly different abstractions here, but yes, in most cases a "diff" is a tail of an op log. On request, it gets compacted and sent. I guess, in noms you leverage immutable data structures for that.
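
To give a feel for what "compacted" means here, a crude sketch with made-up last-writer-wins register ops, chosen only for brevity (this is not RON's actual op types or wire format):

    package main

    import "fmt"

    // Op is a toy operation: set a key to a value at a logical timestamp.
    type Op struct {
        Key   string
        Value string
        Time  int
    }

    // compactTail takes the log tail since a peer's last-known position and
    // keeps only the newest op per key, so back-and-forth edits collapse
    // before anything is sent.
    func compactTail(tail []Op) []Op {
        latest := map[string]Op{}
        for _, op := range tail {
            if cur, ok := latest[op.Key]; !ok || op.Time > cur.Time {
                latest[op.Key] = op
            }
        }
        out := make([]Op, 0, len(latest))
        for _, op := range latest {
            out = append(out, op)
        }
        return out
    }

    func main() {
        tail := []Op{
            {"title", "a", 1},
            {"title", "b", 2},
            {"title", "a", 3}, // went back and forth; only this survives
        }
        fmt.Println(compactTail(tail)) // [{title a 3}]
    }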

Thanks, now I understand your trade-offs better.


> Slightly different abstractions here, but yes, in most cases a "diff" is a tail of an op log. On request, it gets compacted and sent. I guess, in noms you leverage immutable data structures for that.

What happens in the case where the op log represents changes that went back and forth a bunch between two states? Does that ever get compacted down locally?

The difference here is that Noms will never have to read or compact those intermediate states. If a character gets inserted and deleted a bunch of times, and then that set of changes gets synced with a peer, the source peer diffs the beginning and end state and ignores the intermediate ones. The local work done will just be proportional to the actual change between the two states (in this case finding a diff of a single char). This work is always proportional to the change between the two states, not the number of ops that transpired between them.

> I guess, in noms you leverage immutable data structures for that.

That, and content-addressing.


To compute diffs, you have to store the entire history. Is there any point in time where you prune the history?


The history is stored such that data that doesn't change is shared between snapshots. So it generally grows much more slowly than the logical size.

But yeah, you have to store history going back as far as the longest time you want to be able to remain disconnected and still resynchronize with peers.
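
Concretely, chunks are deduplicated by content address, so a chunk that appears in many snapshots is stored once. A minimal illustration with a hypothetical in-memory store (not Noms' actual storage layer):

    package main

    import (
        "crypto/sha1"
        "fmt"
    )

    // store keeps every chunk exactly once, keyed by its hash, so retaining
    // many snapshots only costs the chunks that actually differ between them.
    type store map[[20]byte][]byte

    func (s store) put(chunk []byte) [20]byte {
        h := sha1.Sum(chunk)
        s[h] = chunk // storing an identical chunk again is effectively a no-op
        return h
    }

    func main() {
        s := store{}
        // Two snapshots of a three-chunk value that differ in only one chunk.
        snap1 := [][20]byte{s.put([]byte("aaa")), s.put([]byte("bbb")), s.put([]byte("ccc"))}
        snap2 := [][20]byte{s.put([]byte("aaa")), s.put([]byte("bbb")), s.put([]byte("ddd"))}
        fmt.Println(len(snap1), len(snap2), len(s)) // 3 3 4: both snapshots cost one extra chunk
    }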



