W3C Invites Implementations of RDF Dataset Canonicalization (w3.org)
62 points by PaulHoule on Nov 2, 2023 | 16 comments



I worked on implementing this in Clojure last summer; is this the same URDNA2015 specification? I got all but three of the tests in the test suite passing, but in the end ditched it for a Java library that already had it.

This is really useful for signing JSON-LD: it goes a bit further than regular normalization schemes like JCS by allowing different arrangements of the same data graph to hash to the same thing.
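
For example, a minimal sketch assuming the pyld library (whose jsonld.normalize implements URDNA2015) and SHA-256; the documents here are just illustrative:

    import hashlib
    import json
    from pyld import jsonld

    doc_a = {
        "@context": {"name": "http://schema.org/name",
                     "knows": "http://schema.org/knows"},
        "@id": "http://example.org/alice",
        "name": "Alice",
        "knows": {"@id": "http://example.org/bob", "name": "Bob"},
    }

    # The same three triples, arranged differently.
    doc_b = {
        "@context": {"name": "http://schema.org/name",
                     "knows": "http://schema.org/knows"},
        "@graph": [
            {"@id": "http://example.org/bob", "name": "Bob"},
            {"@id": "http://example.org/alice", "name": "Alice",
             "knows": {"@id": "http://example.org/bob"}},
        ],
    }

    def canonical_hash(doc):
        # URDNA2015 canonical N-Quads are identical for isomorphic inputs.
        nquads = jsonld.normalize(
            doc, {"algorithm": "URDNA2015", "format": "application/n-quads"})
        return hashlib.sha256(nquads.encode("utf-8")).hexdigest()

    print(canonical_hash(doc_a) == canonical_hash(doc_b))          # True
    print(hashlib.sha256(json.dumps(doc_a).encode()).hexdigest() ==
          hashlib.sha256(json.dumps(doc_b).encode()).hexdigest())  # False

A JCS-style canonicalization of the JSON would still see two different documents here; canonicalizing at the graph level makes them hash identically.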


Has JCS gotten any traction? I'll eventually have a need for something like JSON signing and recall seeing the JCS draft and a handful of test implementations, but it seemed like really early days for it.


I haven't seen much else that's competing in the same space, and it seems to be progressing as a standard, albeit slowly.


Correct me if I'm wrong, but AIUI known algorithms for graph canonicalization are NP-hard. (Graph isomorphism has been found to be likely lower in complexity, but canonicalization is not the same problem as graph isomorphism.) Given blank nodes, the problem of finding a canonical form for RDF is at least as hard as graph canonicalization. So this is a non-starter for general RDF graphs; you'd have to severely restrict the use of blank nodes to make this practical, and the way you do this will likely be application-dependent.


You are indeed correct that there are NP-hard problems associated with RDF dataset canonicalization. However, earlier academic research by Aidan Hogan[1] describes how it's unlikely to be a problem with real-world datasets:

  ...in our evaluation we demonstrate that there indeed exist difficult synthetic cases, but we also provide results over 9.9 million RDF graphs that suggest such cases occur infrequently in the real world, and that both canonical forms can be efficiently computed in all but a handful of such cases.
There are experimental results for this near the end of the paper. Aidan Hogan's algorithm isn't identical to the one the RDF Dataset Canonicalization specification is attempting to standardise, but I believe it is a fair assumption that the majority of RDF datasets would be unproblematic, and a future test could perhaps be performed to verify this assumption. There should be lots of anecdotal and even academic results coming soon as a result of this W3C Candidate Recommendation publication.

In the case of malicious actors, there is a section for 'Dataset Poisoning'[2] in the RDF Dataset Canonicalization specification that warns implementers to watch out for those datasets that have been designed to cause processing to grind to a halt.

I think I would agree with you that application-dependent datasets will be the most viable until some more research is done; indeed the majority of interest seems to come from the W3C Verifiable Credentials space with RDF data inside digital authorization documents[3], for instance as part of the proposals for a privacy-focused revision to the European Union Digital Identity Wallet.

[1]: https://aidanhogan.com/docs/rdf-canonicalisation.pdf

[2]: https://www.w3.org/TR/2023/CR-rdf-canon-20231031/#dataset-po...

[3]: https://w3c.github.io/vc-data-integrity/


"W3C RDF Dataset Canonicalization: A Standard RDF Dataset Canonicalization Algorithm" > "4.3 Blank Node Identifier Issuer State" https://w3c-ccg.github.io/rdf-dataset-canonicalization/spec/...

IIRC ld-signatures (-> ld-proofs -> Data Integrity) specified URDNA many years ago?


If you search for “isomorphism” in the doc, they state that most real-world cases will have unique IDs for nodes, which makes this efficient in practice, and that implementations can detect if a case is difficult (likely aborting).


https://www.w3.org/wiki/BnodeSkolemization

rdf-concepts/#section-blank-nodes: https://www.w3.org/TR/rdf-concepts/#section-blank-nodes

Rdflib docs on Skolemization: https://rdflib.readthedocs.io/en/stable/persisting_n3_terms.... :

> Skolemization is a syntactic transformation routinely used in automatic inference systems in which existential variables are replaced by ‘new’ functions - function names not used elsewhere - applied to any enclosing universal variables. In RDF, Skolemization amounts to replacing every blank node in a graph by a ‘new’ name, i.e. a URI reference which is guaranteed to not occur anywhere else. In effect, it gives ‘arbitrary’ names to the anonymous entities whose existence was asserted by the use of blank nodes: the arbitrariness of the names ensures that nothing can be inferred that would not follow from the bare assertion of existence represented by the blank node.

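In practice that's a short exercise with rdflib, which ships a Graph.skolemize() helper (a rough sketch; the data is illustrative):

    from rdflib import Graph

    g = Graph()
    g.parse(data="""
        @prefix schema: <http://schema.org/> .
        <http://example.org/alice> schema:knows [ schema:name "Bob" ] .
    """, format="turtle")

    # Replaces each blank node with a fresh, globally unique skolem IRI
    # (a .well-known/genid/ URI), so the graph can be hashed, diffed or
    # merged without any special blank-node handling.
    skolemized = g.skolemize()
    for triple in skolemized:
        print(triple)
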
SPARQL has UUID() and BNODE() functions: https://www.w3.org/TR/sparql11-query/#func-bnode :

> The BNODE function constructs a blank node that is distinct from all blank nodes in the dataset being queried and distinct from all blank nodes created by calls to this constructor for other query solutions. If the no argument form is used, every call results in a distinct blank node. If the form with a simple literal is used, every call results in distinct blank nodes for different simple literals, and the same blank node for calls with the same simple literal within expressions for one solution mapping.

Blank node: https://en.wikipedia.org/wiki/Blank_node


> The document is primarily intended for the following audiences:

    Software developers that want to implement an RDF dataset canonicalization algorithm.
    Masochists.

Pick your side.


In case anyone is wondering, it's really what it says:

https://www.w3.org/TR/2023/CR-rdf-canon-20231031/#how-to-rea...


I've yet to see a convincing use-case to canonicalize RDF graphs.

The document cites these use-cases:

There are different use cases where graph or dataset canonicalization are important:

    Determining if one serialization is isomorphic to another.
    Digital signing of graphs (datasets) independent of serialization or format.
    Comparing two graphs (datasets) to find differences.
    Communicating change sets when remotely updating an RDF source.
These are not real-world use-cases. Why would one want to sign independently of serialization or format? The real-world need is for people to start signing graphs. But why would they sign some abstract format that is independent of serialization format? That supposedly independent format is a format too and will have competition soon. It's the way of the world: fork, fork, fork.

I'm signing my RDF graphs as byte arrays with PGP and avoiding all the hassle.


I assume that serialization formats might reference this standard so that they don’t need to reinvent the wheel that is graph normalization.

> A canonicalization algorithm is necessary, but not necessarily sufficient, to handle many of these use cases.

It’s kind of like how there’s a standard for structured copy of JS objects that gets used for things like the web worker spec.

Signing something independent of serialization might be useful since then the exact serialization format can vary. For example, maybe the data is already serialized using SQLite. I’d prefer to avoid loading the data into memory and reserializing it just to check the signature. Instead, it’d be nice to just canonicalize it and then utilize the indexing capabilities of SQLite to minimize memory usage.
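
A minimal sketch of that flow, assuming pyld for the canonical form and the cryptography package for Ed25519 (all names illustrative); the signature covers the canonical N-Quads digest rather than any particular byte stream:

    import hashlib
    from pyld import jsonld
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    def canonical_digest(doc):
        # Same digest no matter how the publisher or consumer arranged the data.
        nquads = jsonld.normalize(
            doc, {"algorithm": "URDNA2015", "format": "application/n-quads"})
        return hashlib.sha256(nquads.encode("utf-8")).digest()

    doc = {"@context": {"name": "http://schema.org/name"},
           "@id": "http://example.org/alice", "name": "Alice"}

    key = Ed25519PrivateKey.generate()
    signature = key.sign(canonical_digest(doc))

    # A consumer that reloaded the same graph from SQLite, Turtle, etc. can
    # re-canonicalize and verify without ever seeing the original bytes.
    key.public_key().verify(signature, canonical_digest(doc))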


So the use-case is a very tentative optimization. This tentative optimization is achieved by introducing a very complicated algorithm that is not guaranteed to run in reasonable time.

You could also check signatures when loading the data and keep the original bytearray separately in slow/cheap storage.

That way you can sign RDF graphs like you sign any bytearray and keep a simple design.


When you sign an RDF graph as a byte array, how do you cope with the fact that multiple byte arrays serialize the same graph?


I don't. Why would you want to cope with that? When does it matter that multiple byte arrays give the same graph?


I used RDF canonicalization in a system that built a computation graph where the inputs and outputs of each computation were one or more RDF graphs.

Many of the computations were doing things like inference that created new blank nodes, and were doing so in a non-deterministic order, and at the same time many computations created structurally identical outputs (with a low cardinality of triples). By using RDF canonicalization as the basis for content-addressing those small graphs, it became quite easy to avoid re-doing a lot of the computations that would otherwise have been repeated due to the non-deterministic order. For larger graphs we just used a hash of the native serialization, as re-doing the computation was cheaper than trying to canonicalize.
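
Roughly like this (a sketch of the idea using rdflib's compare module, which has its own canonicalization algorithm rather than the RDFC-1.0/URDNA2015 one; names are illustrative):

    import hashlib
    from rdflib import Graph
    from rdflib.compare import to_canonical_graph

    _cache = {}  # canonical hash -> previously computed output

    def graph_key(g: Graph) -> str:
        # Canonical blank-node labels make structurally identical graphs
        # produce identical sorted N-Triples, and therefore identical keys.
        canonical = to_canonical_graph(g)
        nt = "\n".join(sorted(canonical.serialize(format="nt").splitlines()))
        return hashlib.sha256(nt.encode("utf-8")).hexdigest()

    def run(compute, inputs: Graph):
        key = graph_key(inputs)
        if key not in _cache:
            _cache[key] = compute(inputs)
        return _cache[key]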

Adding that canonicalization-based system gave the whole system a significant performance boost, so yeah, there are some scenarios where you "would want to cope with that".



