Hacker News new | past | comments | ask | show | jobs | submit login
Building a BFT JSON CRDT (jzhao.xyz)
205 points by raykyri on Nov 21, 2022 | hide | past | favorite | 36 comments



It's the Byzantine Fault Tolerant part of this that is particularly innovative and based on Kleppmanns most recent work. I believe it solves the issue of either malicious actors in the networks modifying others transactions, spoofing them, or the messages being modified by third parties ("outside" the network) who have MITM the connection. These are really great problems to solve.

However, when I was experimenting with CRDTs a while back, it seemed to me the other big issue is where multiple transactions from different users combine to create an invalid state. Most CRDT toolkits are aiming for a combination of a JSON like type structure, with the addition on specific structures for rich text. The JSON structures for general purpose data, and those for rich text as that is the most common current use case. These sort of general purpose CRDTs don't have a way to handle, say, the maximum length of an array/string, or minimum and maximum values for a number. They leave this to the application using the toolkit.

For the Yjs + ProseMirror system, effectively the CRDTs are resolved outside of ProseMirror. Thats useful as they can be combined/resolved on the server without a ProseMirror instance running. However there is a strong possibility that the resulting structure is no longer valid for your ProseMirror Schema, this is currently handled by ProseMirror throwing away the invalid parts of the document when it loads.

What I think is needed is a Schema system that is a layer either on top of these toolkits, or as part of them, that provides rules and conflict resolution. So there is a way to specify the maximum length of an array/string, or what the maximum value of a number is. Effectively generic CRDTs that have an understanding of developer supplied rules and bounds.

The "maximum length" is an interesting one, as depending on the order of transactions you could end up with a different result.


Interesting, yeah access-control is kinda open problem with Yjs. Regarding ProseMirror and rich-text documents, you can mess up documents in other ways as well. Eg deploy a faulty command with a transaction that inserts nodes with invalid children (can be prevented though by using createChecked). Or just changing your schema in an incompatible way with the previous version. So you kinda have to deal with possible malformed documents either way.

Havent had documents corrupted by Yjs allowing changes that are not parseable by schema though - has this happened to you?

And about the schema layer on top of Yjs, you possibly could inspect every update and apply some validation rules. Arent all operations just inserts, updates or deletes of nodes? You can at least rollback to previous version as you flush the updates to the doc in the db. Not ideal though.


No, I haven't seen a corrupt "unparceable" document. It's more issues around documents that don't comply with the schema.

Say for example you have a <figure> node that can contain a single optional <caption>. If two users concurrently add a caption, then merge their changes, the Yjs document will contain both. It has no concept of what the valid structure is. When this is loaded into the ProseMirror the second caption will be dropped.


Hmm hadn't thought about that. Yeah Yjs should use more granular steps instead of just replacing the whole doc each time remote update is received.


When I was working on this problem with Fluid Framework, we did a few interesting experiments with "owned objects" and centralized ACL objects. I believe the team primarily implemented centralized ACL because that implementation works for many enterprise use cases.

With a centralized schema provider, you run a connected node on a trusted server and reject changes that are out of schema or should not be accessed by a user.

An owned object is an object where a user (or user group that votes via quorum) that owns the object can veto changes to the object. The changes are temporarily applied until accepted by the owners. I haven't dug deep enough into this BFT implementation to know how our model would map to this model.


How does that compare to Google’s Zanzibar?


I think this is the schema system you're asking for: https://www.hyperhyperspace.org/

(specifically, the data representation part)


This is a cool idea, but I didn't find any examples of max length constraints or other normalization rules in the source code I reviewed. Maybe there's something in there.

Here's some source code for an early, work-in-progress Wiki CRDT: https://github.com/hyperhyperspace/wiki-collab/blob/master/s...

Page in the Wiki. Note that data types have a validate method that returns true or false; maybe if false, they're just dropped from the UI? Not sure how the method is used. https://github.com/hyperhyperspace/wiki-collab/blob/master/s...

I haven't found the underlying text or rich text CRDT implementation yet.


Hey! Author of the post responding here :) Unfortunately max/min-length is a global invariant and not one that can be reinforced by CRDTs without coordination

bloom-lang.net is a really cool project working on trying to figure out what types of program state actually require coordination at compile time


Right - I think what Sam is wondering about is if there’s a better way to maintain some of those invariants when decoding the CRDT data or at least preserve user intent with more fidelity by taking the invariants into account. For example, if a ProseMirror schema says “figure can have one caption child”, then a CRDT library can assist by making Child a LWW register instead of an array; or picking the last item in the array instead of the first when limiting length.


Ah gotcha. This would be a cool area of research! Not sure if I know of many people working on that right now


To me it seems mostly a composition problem; we need a better way of expressing these kinds of application constraints in a formal schema so we can encode/decode application data structures to CRDTs without requiring invasive intermixing of the two concerns in a way that explodes complexity of both the app and the CRDT details.


Okay wow, it's fun and surreal to see code you're working on pop up in the comment section unexpectedly right in front of your eyes :D

I've been helping Santi on this wiki application and we've actually been dreaming about how nice it'd be to eventually have some kind of schema language for defining Hyper Hyper Space types. In the current version of HHS, data types are fully defined via TypeScript classes, and constraints like that are implemented via the validation method you discovered. If that method returns false, the data won't even be stored or synced, let alone make it to the UI. So for now it's an open ended way of defining constraints, making it easier to explore and experiment. Many of the core HHS data types have constraints built into them too, enforced via the validate method. For example, the CausalSet type can be made to only allow certain elements as members.

Anyway there are a few reasons I can see that it'd be amazing to have something more like JSON Schema. One is that it'd just clarify using these data types outside of TypeScript, and ideally they'd be more succinct to read and write than TypeScript classes.

Another reason I'm especially curious about is the possibility of using it along with some kind of query system, which could potentially make a bunch of stuff easier both both in selecting data for applications and in implementing casual validation logic. Validating multiple interacting causal data types requires looking at chains of events in the DAG. Maybe being able to query for these chains could make this nicer?

All that to say, my guess is that the hardest part of moving to a succinct schema definition format would be how to deal with constraint/validation specification.

> I haven't found the underlying text or rich text CRDT implementation yet.

I think something like this in the works! Right now we're just using last write wins. We did experiment with using Y.js for realtime collaboration, but we ran into some complexity in reliably ensuring that the Y.js state could get persisted to the HHS data types without race conditions, so we took a step back from that idea.


I encourage you to follow up on the schema language plus a code generator. It’s a powerful technique that can make your implementation more agile as you develop new strategies.


> I write this blog post mostly as a note to my past self, distilling a lot of what I’ve learned since into a blog post I wish I had read before going in

The best kind of blog post!


For people like me:

BFT - Byzantine Fault Tolerant [0]

CRDT - Conflict-free Replicated Data Type [1]

[0] https://en.wikipedia.org/wiki/Byzantine_fault

[1] https://en.wikipedia.org/wiki/Conflict-free_replicated_data_...


RGA/CT - Replicated Growable Array/Causal Tree

I don't like unexplained acronyms/initialisms.


JSON - JavaScript Object Notation [0]

[0] https://en.m.wikipedia.org/wiki/JSON


Building -

1: a usually roofed and walled structure built for permanent use (as for a dwelling)

2: the art or business of assembling materials into a structure

[0] https://www.merriam-webster.com/dictionary/building


A -

1: used when referring to someone or something for the first time in a text or conversation. "a man came out of the room"

2: used to indicate membership of a class of people or things. "he is a lawyer"

[0] https://www.google.com/search?q=define+a


From the article, this appears to be an implementation of Making CRDTs Byzantine Fault Tolerant [0] in Rust [1].

[0]: https://martin.kleppmann.com/papers/bft-crdt-papoc22.pdf

[1]: https://github.com/jackyzha0/bft-json-crdt


I'm quite surprised by the [benchmarks versus Automerge JS & Rust](https://github.com/jackyzha0/bft-json-crdt#benchmarks) when it comes to memory:

> Ours (Basic) 27.6MB

> Ours (BFT) 59.5MB

> Automerge (Rust) 232.5MB

I would expect adding the public key tracking to use more memory; I wonder how Automerge is spending so much more memory. Possibly on a bunch of internal caches or memoization that give the order-of-magnitude improvement in speed?

> Ops: 100k

> Ours (Basic) 9.321s

> Ours (BFT) 38.842s

> Automerge (Rust) 0.597s


Hash graph reconciliation is generally the harder part of this problem as noted in future work, but this is a very exciting direction.


Check out this implementation of hash-stable trees. The same dataset generates the same Merkle hash, regardless of insertion orders. This makes verifiable sync computationally efficient. https://github.com/mikeal/prolly-trees


when you see it...them...


we are legion


Let's go Jackie!!!


Yessss finally someone is doing it.


That was a bit creepy while it lasted. (the mouse pointer thing that is)


To somebody who knows Erlang, isn't this reinventing the wheel basically?


Erlang is a great language and runtime to build distributed systems in but it doesn’t provide primitives to resolve concurrent overlapping updates.

Hence projects like https://www.antidotedb.eu (CRDT database in Erlang)


Yes, sorry I should have been less terse: never heard of antidotedb but I figured there is probably a bunch of these projects mature already. I was asking, not stating.


AFAIK Erlang isn't byzantine fault tolerant, doesn't use CRDTs, and doesn't even have a JSON library in its stdlib. So just from the title you can tell it has nothing to do with the article.


Could you elaborate? How do you get CRDT's for free in Erlang?


A little early in the morning for so much alphabet soup.


or to give up on an opportunity to learn/enjoy something




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: