Hacker News
Update of the RDF and SPARQL (RDF star) families of specifications (w3.org)
71 points by tannhaeuser on May 19, 2023 | 56 comments

I am extremely glad to see RDF being worked on and improved. It's an extremely powerful, general data model. It gets a bad rap because of its unfortunate association with the XML syntax and the fact that many of the reference implementations were written at the peak of overdesigned OO cruft.

This type of semantically precise yet flexible data model is going to become increasingly important as a bridge between highly structured data in traditional databases, and unstructured information processing using LLMs. GPT does a surprisingly good job of converting between unstructured data and RDF, and my hope is that LLMs can provide some of the key components in building an actual semantic web, which has remained elusive for so long (for many good reasons.)

> ...unfortunate association with the XML syntax

XML isn't great for every use-case, but it's really become the Nickelback of formats. Let's be real, it's pretty brilliant for some things, and I think RDF is a good example of where it shines.

RDF is not dependent on XML as a syntax. There's a text-based syntax in common use (Turtle), as well as a separate one based on JSON (JSON-LD).

RDF primer but using Turtle (older doc):


Audio (etc) plugin format that uses .ttl:




https://drobilla.net/software/sord https://drobilla.net/software/serd

drobilla has mentioned that, if there was a C JSON-LD lib, that might be enough to warrant an LV3

For more: https://github.com/lv2/lv2/wiki

P.S. not many people know that LV2 does modular synth style CV https://linuxmusicians.com/viewtopic.php?t=20701&p=112242

As well as a nigh tabular form in "N-triples" and "N-quads"



which can end up as the easiest formats to work with sometimes.

There are tools like 'rapper' and 'serd' to convert to and from the various formats.



The problem is that they aren’t tabular and the examples they give which make them look simple are incomplete. For example, they rarely show examples that specify the language or data type. A truly tabular format is hextuples. https://github.com/ontola/hextuples
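To make the point about language tags and datatypes concrete, here is a minimal Python sketch (standard library only) that flattens N-Triples lines into fixed columns. Treat it as an illustration under assumptions: the regular expression covers only a simplified subset of N-Triples (no escaped quotes inside literals), the example.org data is made up, and the five-column layout is a hypothetical one, not the hextuples format itself.

```python
import re

# Simplified N-Triples pattern (an assumption, not the full grammar):
# subject is an IRI or blank node, predicate is an IRI, object is an IRI,
# a blank node, or a literal with an optional @lang or ^^<datatype> suffix.
# Escaped quotes inside literals are NOT handled.
LINE = re.compile(
    r'^(<[^>]*>|_:\S+)\s+'                    # subject
    r'(<[^>]*>)\s+'                           # predicate
    r'(?:(<[^>]*>|_:\S+)'                     # object: IRI or blank node
    r'|"([^"]*)"'                             # ... or a literal value
    r'(?:@([A-Za-z0-9-]+)|\^\^<([^>]*)>)?)'   # optional language or datatype
    r'\s*\.\s*$'
)

def to_row(line):
    """Return (subject, predicate, value, datatype, language) for one line."""
    m = LINE.match(line)
    if m is None:
        raise ValueError(f"unsupported line: {line!r}")
    subj, pred, node, value, lang, dtype = m.groups()
    if node is not None:
        # IRI or blank-node object: no datatype or language columns.
        return (subj, pred, node, "", "")
    return (subj, pred, value, dtype or "", lang or "")

row = to_row('<http://example.org/alice> <http://example.org/age> '
             '"42"^^<http://www.w3.org/2001/XMLSchema#integer> .')
```

The literal cases are exactly where a naive "three columns per line" reading breaks down, since the third field carries up to three pieces of information.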

1000% agree.

My dabbling in that direction stopped at five (name-spaced) identifiers.

Effectively extending n-quads with an "edge" identifier. Mainly to eliminate blank-nodes but it handled the optional language & typing just fine and more importantly for my use case allowed for the URI to be expressed as a CURIE (namespace) and a local_id each in their own column.

right, but they have a close association as OP says, and RDF is wrongly dismissed for similar reasons as XML (too complex, too much ceremony, too difficult, not enough value)

As somebody who worked with RDF and SPARQL for several years, none of these are actually true - RDF is very simple to work with (especially if you avoid XML stuff), can be operated on by basic string processing tools, and is conceptually pretty simple once you get into the right mindset. I think it's just suffering from bad documentation being overexposed and good examples under-exposed.
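As a rough illustration of the "basic string processing" claim, here is a small Python sketch that counts predicates in an N-Triples document using nothing but `str.split`. The sample data and the `predicate_counts` helper are hypothetical; it relies on the fact that N-Triples puts one triple per line and that subjects and predicates never contain whitespace.

```python
from collections import Counter

# Hypothetical sample data; one triple per line, as N-Triples requires.
NTRIPLES = """\
<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .
<http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .
"""

def predicate_counts(text):
    """Count how often each predicate occurs, using only string splitting."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first two runs of whitespace only, so a literal
        # object containing spaces stays in one piece.
        _subject, predicate, _rest = line.split(None, 2)
        counts[predicate] += 1
    return counts

counts = predicate_counts(NTRIPLES)
```

The same job is a one-liner with `cut`, `sort` and `uniq` on the command line, which is roughly what makes the line-based formats pleasant to work with.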

Can you please recommend me an intro tutorial? I’m trying to query DataCommons and it’s really hard to figure out even simple queries.

Wikidata has a basic tutorial here https://m.wikidata.org/wiki/Wikidata:SPARQL_tutorial

It has a lot of Wikidata-specific stuff, but it lets you get the basics.

I agree, it just has a bad reputation and there are few obvious and well-known success stories

Turtle solves all the XML cruft problems and is very readable. RDF is completely independent of the serialization format, which doesn't have to be XML; I am surprised people still associate it with XML.

This draft seems incomplete / hard to navigate.

Apparently one of the new features of RDF 1.2 is „Quoted Triples“ where triples can be used as Subject / Object. But unfortunately it seems at least the Turtle representation doesn’t support this yet?

Also the „what’s new in SPARQL 1.2“ is rather empty.

Feels like the linked documents are in a very early stage of a draft.

Yes, it's an early draft.

The "quoted triples" technique is otherwise known as "RDF-star". Here's an article that explains the motivations and alternatives pretty comprehensively: https://www.ontotext.com/knowledgehub/fundamentals/what-is-r....

That's one of the reasons for having an official 1.2 standard... so all the formats (including Turtle) will incorporate it in a compatible, correct way.

Just tacking it onto an existing standard is not how you get everyone to "incorporate it in a compatible, correct way".

If anything, you'll get exactly the opposite: laymen will try to shove Turtle 1.2 files (which they just know as "Turtle") into an incompatible Turtle 1.1 parser which claims to be able to parse "Turtle" (which was a true statement until the release of Turtle 1.2), and things explode and headaches ensue.

Sometimes things work out that way. But a new official version of a W3C spec is not just "tacking it on." If data or software claims to be compliant with some version of the relevant W3C spec, they almost always actually are. And while the W3C specs can be a pain in the ass, it's because they're fairly precise and cover most of the edge cases.

N3 (Notation 3), a superset of Turtle, can represent quoted triples.


This primer is a really good introduction to understand RDF and SPARQL. Both seem extremely powerful.

I am glad to see this as well. I decided to use RDF for my personal project because it was well specified, has many implementations, and a human readable syntax. In the end, it is just data but I wanted to make it as accessible as possible. Does this mean that RDF is always the right choice? No, but it worked for my use case. I wish there were more choices in the open source Triplestore space with good OWL2 support but my project works with what is out there and if someone wants to transform it into something else, that is entirely possible to do.

If you are interested, my project is here: https://github.com/cyocum/irish-gen and a few posts about it are here https://cyocum.github.io/.

My impression is that the trade-off when choosing RDF vs. a property graph to model graph data is this. RDF gives you maximal schema flexibility and the ability to infinitely break the data model down into the smallest atomic structures, because literally everything is a node that is either an IRI (as unique identifier) or a primitive. A property graph gives you the convenience of more complex nodes and edges with some structure built in, where you can collapse some fields down and call them properties to describe individual nodes and edges. In RDF you have to create all of that yourself with triples, which can lead to some large structures for relatively common tasks like referencing edges and reifying statements.

RDF-star, which is part of the new draft, extends RDF with property graph support (with accompanying change in SPARQL as SPARQL-star)
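To show what "large structures" means for reification, here is a minimal Python sketch that spells out classic RDF reification for a single edge, using the standard `rdf:Statement`, `rdf:subject`, `rdf:predicate` and `rdf:object` terms. The `reify` helper, the example.org namespace and the `since` property are all hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"  # hypothetical namespace for illustration

def reify(statement_id, s, p, o):
    """Return the four N-Triples lines that describe the triple (s, p, o)
    as an rdf:Statement resource identified by statement_id."""
    return [
        f"<{statement_id}> <{RDF}type> <{RDF}Statement> .",
        f"<{statement_id}> <{RDF}subject> <{s}> .",
        f"<{statement_id}> <{RDF}predicate> <{p}> .",
        f"<{statement_id}> <{RDF}object> <{o}> .",
    ]

lines = reify(EX + "stmt1", EX + "alice", EX + "knows", EX + "bob")
# The "edge property": a statement about the statement itself.
lines.append(f'<{EX}stmt1> <{EX}since> "2020" .')
```

That is five triples to say one thing about one edge, and none of them asserts the original `alice knows bob` triple. In the RDF-star community drafts the same annotation was written roughly as `<< ex:alice ex:knows ex:bob >> ex:since "2020" .`; the exact syntax in the final 1.2 documents may differ.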

In the nicest possible way, and from a position of ignorance of the "Semantic Web": is anyone actually doing anything with these technologies outside of academia?

RDF and related technologies are heavily used in healthcare and scientific fields, as well as industry.

The "semantic web" or "open linked data" concepts never really took off the way people had hoped, but there's still a ton of utility in the underlying standards so you'll tend to find it wherever you need complex, flexible schemas with good interoperability between different entities.

Wikidata (a Wikimedia Foundation project) is an RDF dataset with a SPARQL querying service available at https://query.wikidata.org/
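For anyone who wants to try that endpoint from code, here is a small Python sketch that builds a request URL for a simple query (items that are an instance of, `P31`, house cat, `Q146`, with English labels). It only constructs the URL and sends nothing; the `build_request_url` helper is hypothetical, and the `format=json` parameter and the predefined `wd:`/`wdt:` prefixes are my understanding of how the Wikidata query service behaves, so check its documentation before relying on them.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"

# Items that are an instance of (P31) house cat (Q146), with English labels.
QUERY = """\
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""

def build_request_url(query):
    """Build a GET URL for the endpoint without sending anything."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})

url = build_request_url(QUERY)
```

Fetching `url` with any HTTP client should return JSON result bindings; public endpoints generally expect a descriptive User-Agent header and modest request rates.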

Defense industry, part of ARTT (Acquisition Requirements for Training Transformation), which is an incredibly overdue effort to merge specs. It's also being used to draft MBSE schemas for SysML. SysML has undefined overlap with the many many many other architect tools, and it's going to be the main player for MBSE (maybe... there's some fighting about that).

These so-called "semantic web" technologies seem to come into their own when there's large scale organizations interfacing without a common reference frame. Like one org that does a spec from a programmer standpoint, and another org does one from a formal linguistics standpoint, then they have to integrate. For example, the USDoD Logistics steering group makes a spec for parts data from their requirements based on MTTF, cost, sparing, shelved space. USN makes a spec for parts data based on burn rate, transport, fuel type. It goes on and on like this, repeat a few dozen times, and you have a dump truck full of specs doing the same thing. See where I'm going here? They're speccing out the same thing from their own ivory towers, and - here's the kicker for those trying to LLM their way out of the situation - none of them are going to show their data to anyone else. The only thing that's exposed is the semantics. ARTT/CredEng is - or was, I am not sure if the program OR CredReg is still healthy - trying to solve this by unifying the semantics.

Ultimately someone's got to come along and give all these people a kick in the pants, one way or the other. You can't just float a boat around the ocean with no missiles, not these days.

It's pretty common in healthcare data, or at least the kind that deals with breadth of patient data. When trying to build knowledge about a disease by looking at a lot of patients, it's rare to get much useful info from a single source. Re-associating that multi-source data lends itself to a graph. If the company has been around for a little while, even if the customer-facing products don't use a graph database, at some point somebody has certainly tried it. (And once somebody has tried it, it lives forever in some part of the organization.)

EU data, including all regulations, delegated acts and publications, is searchable via SPARQL


I did struggle to find what I wanted. It's a labyrinth of metadata, and I was looking for the structured regulation text itself. In the end I stuck with good old-fashioned XHTML scraping

Short answer is no. Spent many years (>10) listening to people explain how semantic technologies would transform Academic Publishing and make research more useful. Failure in my opinion because - nobody valued it enough to pay to have it done correctly, academic papers have a shelf life, academic papers often contain inaccurate information. Academics publish because it is required not because they have useful info to communicate. In most cases it is the least pleasant part of research.

I have seen many more useful tools come out of LLM in the short time it has been available than the entire 10 years working with academics using RDF, Ontologies etc. RDF is too difficult to use and has inadequate tooling. LLM is only going to get better.

LLMs and Semantic Web work reasonably well in concert https://friend.computer/jekyll/update/2023/04/30/wikidata-ll...

Yes, pretty much all knowledge management in biology is built upon technologies that came from the semantic web community, and biology is certainly not just academia.

There's not much application for knowledge graphs in e.g. a CRUD app of customer names and addresses, but turns out there are an unlimited number of things you can describe about e.g. a protein, and you can't just design one schema because you don't know how it's going to be queried.

See: https://bioregistry.io for countless examples of public datasets used everywhere from academia to "big pharma".

Adobe developed the XMP metadata format that embeds RDF packets inside almost any kind of file. It is heavily used in Adobe products and also others, see


I'm using it in a personal project. I wanted something extensible, and ad hoc, and RDF is certainly that. But it's also typed, which is nice. I can add my own types (I may do that, not sure yet).

I am not well versed in the other RDF technologies. I haven't paid any attention to ontologies, or OWL or any of that stuff. I just use raw RDF, and defined my own vocabularies for everything, including structure. For example, I have my own type property. RDF also has one, but I just made my own. I have my own structure system to mostly bring order to how things are displayed, or created, etc. I am pretty much as far from the semantic web as you can get.

Since everything is "just a triple" it makes it easy to share data. So, it'll be straightforward to import and export artifacts out of my system and share them with others.

And I get SPARQL "for free", so even after new data structures are added, they're still queried like the first class ones the tool already knows about. SPARQL is pretty neat.

At the moment, I have a mostly complete RDF CRUD tool, with some first class interface prototypes (by first class I mean I have forms and UI specifically for those data types, rather than a generic resource form), and really like working with it. My DB has about 3.5M triples in it.

Which database do you use to store your RDF data, which supports SPARQL, and how does it perform?

> I just use raw RDF, and defined my own vocabularies for everything, including structure.

I think this is the best approach. The ontology part was more of a hindrance for me way back in the 2010s when I was experimenting with semantic web technologies (using dbpedia as a source of my data) and, as a junior-level developer, I tried really hard to avoid going off the beaten path (no matter how flaky it seemed).

I'm using Apache Jena and Java, and using TDB.

Performance? I have nothing to compare it to. I can't complain. I know whenever I saw info about triple stores in the past, all they seemed to crow about was how fast they could import things. "Eleventy trillion triples per bleem!" I guess nobody ever actually queries the data, they just store it.

I routinely export the model to a file, and that takes seconds (<10), using the N3 format (RDF/XML takes a very long time). I export the model to make sure my changes are reflecting properly. The resulting file is 176MB. If I read that into a new TDB instance, I can load it in 25s. Since I've been importing from a SQLite master, the resulting TDB data is roughly the same size as the SQLite DB file. My import from SQLite takes longer than 25s, and that's just data shoving from the SQL data to the RDF. I'm sure I'm the bottleneck in that case, I probably commit too much for one thing.

As for queries, well, it either can find it or it can't. It's either trivially indexed (I assume each of the properties of the triple are indexed), or it table scans. Internally when you do a query from Java, you basically set the base net you want to throw (you've only got 3 values to work with) and iterate through to filter it. When I did my "select count" query to count all the triples, that took a beat or two to be sure as it hoovered the entirety of the model and cursored through it. I have not done any crazy SPARQL queries (I can barely spell SPARQL), so I don't know what kind of decisions it makes, but, in the end, there's really only a few ways you can actually query a triple store.
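The "only 3 values to work with" idea can be sketched in a few lines of Python: each of subject, predicate and object is either bound or a wildcard, which gives 2^3 = 8 possible lookup shapes. This is a plain linear scan over an in-memory list with made-up data, not Jena's API and not how TDB actually indexes anything.

```python
def match(triples, s=None, p=None, o=None):
    """Yield triples matching the pattern; None acts as a wildcard."""
    for t in triples:
        if ((s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)):
            yield t

# Hypothetical data, written with prefixed names for brevity.
TRIPLES = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:bob", "ex:name", "Bob"),
]

names = list(match(TRIPLES, p="ex:name"))   # predicate bound, rest wildcard
total = sum(1 for _ in match(TRIPLES))      # all wildcards: a full scan
```

A real store answers the bound cases from indexes over different orderings of the three positions, while the all-wildcard count has to visit every triple, which fits the "took a beat or two" observation.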

Now, I have a recent Intel iMac I'm running this on, so that may well impact things as well. I have no idea how much memory I'm using, it hasn't been a problem. I've done no tuning whatsoever, I honestly don't know what tuning is available.

I do not foresee my dataset growing much more, so TDB is "fast enough" for my purposes. All told I'm pretty happy with everything.

Is your personal project open source? I would love to see what a real-world application using pragmatic RDF and Jena looks like.

FWIW, RDF does enable a lot of research into useful things, so even though those cases are "academic", they aren't w/o practical outcomes as the subtext of the question implies.

RDF is great for annotating protein interactions, for example.

The schema.org markup that goes into websites for SEO and smart snippets in search engines is all RDF, usually as JSON-LD. Millions of people use RDF every day without even knowing it.
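For readers who have not seen it, this is roughly what that markup looks like. The Python sketch below builds a small JSON-LD document using the schema.org `Article` and `Person` types and wraps it in the `application/ld+json` script tag that pages embed. The article content and the `to_script_tag` helper are hypothetical.

```python
import json

# Hypothetical content; the types and property names come from schema.org.
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Why RDF is not XML",
    "datePublished": "2023-05-19",
    "author": {"@type": "Person", "name": "Jane Doe"},
}

def to_script_tag(data):
    """Wrap a JSON-LD document in the script tag used for embedding.
    Note: real pages should also guard against "</" inside string values."""
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'

snippet = to_script_tag(article)
```

Read as RDF, the nested `author` object becomes a separate node linked from the article, so even this small snippet expands to several triples.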


It's not nothing, even did a consulting gig years ago. Also, some non-profits such as Wikidata are putting it to good use I guess. But not everything benefits from representation as graph; for example, statistical data. Then there's always the unanswered question of who's going to publish data without economic benefit when the money is, at best, in attention/eyeballs, or selling individual queries where backend tech isn't material or even visible. Do we even want to expose more machine-readable information in the age of ChatGPT?

TBL's vision for knowledge graphs is even older than the web. But should it be W3C's job to invent new tech? Does W3C's track record, legal and financial standing invite further standardization work? Their HTML and SVG charters have basically ceased working and W3C's last (final?) HTML recommendation is based on WHATWG HTML Review Draft January, 2020 [1].

[1]: https://sgmljs.net/blog/blog2303.html

The U.S. Open Data catalog [1] has all the metadata and even some data as Linked Data, same with the European Open Data catalog [2]

[1]: https://data.gov/ [2]: https://data.europa.eu/

To extend on this: a lot of this is based on DCAT[1] (which is an RDF vocabulary), and for Europe on the extension DCAT-AP[2], which is then further extended by country-specific standards[3].

[1] https://www.w3.org/TR/vocab-dcat-3/

[2] https://joinup.ec.europa.eu/collection/semantic-interoperabi...

[3] e.g. https://www.dcat-ap.de/ or https://docs.dataportal.se/dcat/en/

If you’re paying a boutique consultancy fat stacks to fix your megaCo’s absolute hairball data integration, the probability of RDF approaches 1 as price goes to ∞.

Trivial (as in ‘cat graphx graphy’) merging of complex data graphs is just too powerful.
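The `cat` remark can be spelled out in Python: because an RDF graph is a set of triples, merging two N-Triples documents is a set union of their lines. A minimal sketch with made-up data follows; the `merge` helper is hypothetical. Two caveats I am assuming apply: identical triples only collapse here if their lines are byte-identical, and blank node labels are local to each document, so files that both use a label like `_:b1` need renaming before a naive merge.

```python
# Hypothetical documents from two sources; they share one triple.
GRAPH_X = """\
<http://example.org/alice> <http://example.org/knows> <http://example.org/bob> .
<http://example.org/alice> <http://example.org/name> "Alice" .
"""
GRAPH_Y = """\
<http://example.org/alice> <http://example.org/name> "Alice" .
<http://example.org/bob> <http://example.org/name> "Bob" .
"""

def merge(*documents):
    """Merge N-Triples documents by set union of their lines."""
    triples = set()
    for doc in documents:
        for line in doc.splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                triples.add(line)
    return sorted(triples)

merged = merge(GRAPH_X, GRAPH_Y)
```

No schema migration or join keys are involved; shared IRIs are what make the two graphs line up.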

Not only via boutique consultancies. Almost every megaCo has some ontology team working for them. Sadly almost all of them also seem to be stuck in a perpetual conceptual phase with very little actual impact on the business.

Yes, having seen this from both ends the more «sciency» guys at reporting/analytics try to bring SOTA practices to bear before core business realizes a need.

If the need is in fact communicating $100bn design specifications between multiple transnational engineering and construction co’s, even the most grounded and pragmatic engineer will gladly implement RDF and ontologies in the hot path. It’s the best tool for the job on the merits.

Sometimes they need a little nudge. A python or C# library here or there, just to take the edge off, you know?

We were using it in production to power our machine learning already half a decade ago, in the region's highest valued startup.

Was also told by some of my colleagues that they were earning a lot consulting in the medical world with semantic data.

Yes, there are a lot of projects using RDF-adjacent technologies and graph databases. But most of them are not very hyped up, so you won't know it unless you know where to look. Amazon has a graph database: https://en.wikipedia.org/wiki/Amazon_Neptune which suggests somebody is using it (besides that I know for a fact people use it)

Google's authorisation system Zanzibar (https://research.google/pubs/pub48190/) is not explicitly using SPARQL, but it's a good example of the sort of thing that a graph-based data model can do. I _think_ Zanzibar can be implemented with SPARQL: we implemented something very very close (and more powerful in some ways) with it

Accenture just invested in Stardog, a leading knowledge graph platform, which is based on these W3C standards. Can't get much less "academic" than Accenture.


UNESCO does some stuff with it in places. (Ocean info hub, which I did some work on). The EU likes DCAT for data syndication, which is generally serialized as RDF. They’re also promoting linked data, but I don’t really see how it’s truly useful. It mainly seems like a way to turn a bulk download into 100k API requests.

Check out https://solidproject.org (If you want a short intro I recently gave a ~30min talk about it: https://noeldemartin.com/fosdem)

Firefox uses RDF internally. Chase Bank uses it presumably.

I wonder if RDF could actually be used to implement "trigger warnings". List triggering tags in your browser, and then the triggering content can be tagged and blocked by your browser. No censorship involved, just responsible disclosure.

You’re seeking a technical solution to a social problem. There have been a number of attempts at very similar things; they always fail because practically no one is interested in putting in the effort required, if they even know about it.

> no one is interested in putting in the effort required, if they even know about it

There is a sizeable web subculture, in different flavors, focused on an accessible, sustainable, semantic, open-standards web.

When you browse sites, read blogs etc. that typically load fast, have a nerd feeling to them and prominently display RSS and some of the more niche open web stuff, then you might be visiting a site of someone who would implement such a feature in the blink of an eye.

It’s the sort of feature that is fairly useless unless it gains very widespread adoption, and also often quite nebulous about how it should be used. My feeling is also that the sorts of people most likely to be willing to implement this kind of thing unprompted are likely to be producing content that can benefit from such stuff less than average.

Also, to repeat, there have been specs for this kind of thing in the past. They haven’t gone anywhere despite significant interest and investment. Granted, some were probably hampered by being unnecessarily complicated, but basically no one used them. (Examples: PICS; and, in a couple of slightly different veins but conceptually similar, Schema.org’s contentRating property and P3P.)

They should maintain the rdf api too. https://www.w3.org/TR/rdf-api/

