Hacker News
A Review of the Semantic Web Field (acm.org)
126 points by hypomnemata on Jan 26, 2021 | 70 comments



We built a new semantic database first in university and then as commercial open source (TerminusDB). We use the web ontology language (OWL) as a schema language, but made two important - practical - modifications: 1) we dispense with the open world interpretation; and 2) insist on the unique name assumption. This provides us with a rich modelling language which delivers constraints on the shapes in the graph. Additionally, we don't use SPARQL, which we didn't find practical (composability is important to us), and use a Datalog in its place (like Datomic and others).
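
To make the closed-world point concrete, a toy sketch in plain Python (not our actual schema language or engine, just an illustration of why closed world plus unique names makes constraints checkable):

```
triples = {
    ("alice", "rdf:type", "Person"),
    ("alice", "name", "Alice"),
    ("bob", "rdf:type", "Person"),  # bob has no name triple
}

# Closed world: what is in the set is all there is, so "every Person has
# exactly one name" is decidable. Unique names: "alice" and "bob" denote
# distinct things, so counting works.
persons = {s for (s, p, o) in triples if p == "rdf:type" and o == "Person"}
for person in sorted(persons):
    names = [o for (s, p, o) in triples if s == person and p == "name"]
    if len(names) != 1:
        print(f"constraint violation: {person} has {len(names)} name(s)")
```

Under the open-world interpretation, the missing name is not a violation, merely unknown - which is exactly why we found it impractical for checking what is actually in the DB.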

Our feeling on interacting with the semantic web community is that innovation - especially when it conflicts with core ideology - is not welcome. We understand that 'open world' is crucial to the idea of a complete 'semantic web', but it is insanely impractical for data practitioners (we want to know what is in our DB!). Semantic web folk can treat alternative approaches as heresy and that is not a good basis for growth.

As we came from university, I agree with comments that the field is too academic and bends to the strange incentives of paper publishing. Lots of big ideas and everything else is mere 'implementation detail' - when, in truth, the innovation is in the implementation details.

There are great ideas in the semantic web, and they should be more widespread. Data engineers, data scientists, and everybody else can benefit, but we must extract the good and remove ideological barriers to participation.


You're right to break free from the grip that SemWeb has had on the field for so long and turn to Prolog/Datalog and practical approaches IMO. Open world semantics and sophisticated theories may have been a vision for the semantic web of heterogeneous data, but in reality RDF and co are only used in certain closed-world niches IME.

Pascal Hitzler is one of the more prolific authors (especially with the EU-funded identification of description logic fragments of OWL2, which are some of the better results in the field IMO), but beginning this whole discussion with W3C's RDF is wrong IMO, when description logics - more or less variable-free fragments of first-order logic with desirable complexity properties - were already a thing in 1991 or earlier.

Nit: careful with Datomic. It's clearly not Datalog, but an ad-hoc syntax, whereas Datalog is a proper syntactic subset of Prolog. And while I don't like SPARQL, it still gives quite good compatibility for querying large graph databases.


NitNit: I think "Datalog" the Prolog subset has been pretty much replaced by "Datalog" the term for the conjunctive query fragment with recursion (and sometimes stratified negation).

Most papers and textbooks I read these days use it as a complexity class for queries and not as a concrete syntax.


This is the sense in which I was using Datalog - and how others like Datomic, Grakn and Crux use it (there is a growing movement of databases with a 'Datalog' query language) - although in our case we can also use it in the former sense, as TerminusDB is implemented in Prolog.


I met Pascal Hitzler on a few occasions, not long after he ended up relocating from Germany to Ohio to work at Wright State. He was a rare bright spot in always trying to bridge the gap between theory and application in the SemWeb community. He was kind enough to meet up with me and a colleague at a coffee shop in Dayton to discuss our project, all pro-bono. A real good dude.


I was "stuck" working with a bunch of leading academics and researchers on a SemWeb project using OWL/RDF, in collaboration with DARPA and the US Department of Defense, around 2008-2009.

You are absolutely correct that they are hostile to anything outside of their "ideology".

The awful, horrific performance of the RDF/OWL databases compared to the impure, heretical evil Neo4j that they despised for its practicality..... that was always funny.

Another interesting thing I encountered in the field was the side-effect of an academic field that sounds really good to people who have never built anything EVER is that it can often get a ton of funding and grant money from central government organizations, thereby creating legions of rather shitty companies (especially in the EU, where the grants were everywhere) that have the word "semantic" in the name even though they do nothing with actual semantic technology. These shitty companies are often just there to employ the academics.

The craziest thing was these projects where they would employ a dozen or so "library scientists" - all just master's and PhD students who had, for some reason, decided to study to be librarians in the digital era - to create the OWL ontologies. None of them knew anything about computer science or programming, and they would just sit there and read thousands of policy documents and use an Eclipse-based GUI to create and edit giant graphs of knowledge and rules. All were being paid six figures, and didn't produce a single goddamned thing of value. So much taxpayer money in those rooms going to complete waste. Glad it wasn't just me that thought the community was a joke.

The semantic web will arrive one day, but OWL and SPARQL won't be anywhere in it. And it won't be any of these academics delivering it.


Are you referring to the BFO crowd? I would be interested in your thoughts about BFO itself if that's the case.


If by BFO you are referring to "Basic Formal Ontologies", then yes, I'm referring to that "crowd".

My thoughts on ontologies in general are that they can certainly be powerful, and I've seen them used in the past in rules engines that powered fraud detection applications.

In the SemWeb community, in the late 2000s, they successfully convinced a bunch of CIOs of massive organizations, especially in the US Federal Government, that the key to centralizing and federating all of their data, and saving money on duplicative systems, was to simply have semantic mappings on top of every IT system's database, and query this semantic layer. Ideally, they could eliminate duplicated data, so that all systems would get data from the "Authoritative Data Source" system instead of duplicating it locally in the application's database.

I'm sure you can immediately see why this is wildly stupid and unrealistic. Imagine what it would look like if every single piece of data that I can technically obtain from another source had to remain in that source, and storing that data locally with my application-specific data were forbidden..... Suddenly, there is a massive increase in I/O, drop in performance, etc.

The whole project taught me a lesson about the politics of academia, and how there is a segment of the population that is highly educated, and has learned how to manufacture work for themselves outside of academia by pushing for high-level government officials to implement programs based on their theories..... MITRE was a big part of this particular project.


I completed my PhD in the area of Semantic Web technologies and I share the same experience that the semantic web community is extremely closed (coming across as feeling "elite"). Having no supervisor from the field myself, it was still possible to publish my ideas (ISWC, WWW etc), but it was impossible to connect to the people and be taken seriously.

I moved on from that field now, and I don't expect to come in touch with any Semantic Web stuff in an open-world context any time soon.

I couldn't agree more with you that the strong ideology that drives this community is one of the main reasons that these technologies are not widely adopted. This, and the failure to convince people outside academia that solving the problems it tries to solve is necessary in the first place.

Good luck with TerminusDB, I think I listened to you at KGC.


Almost every pragmatic implementation of semantic reasoning I've done involved both of the same modifications (closed world and unique names). A couple efforts used SPARQLX, something I created that was a binary form of SPARQL+SPARQLUpdate+StoredProcedures+Macros encoded using Variable Message Format. This was about 18 years ago, before SPARQL and SPARQL update merged, and before FLWOR. One of these days I'll recreate it again. The original work is not available, and I was not allowed to publish.

Oh, and I forgot a few things: SPARQLX had triggers, was customized for OWL DLP, and had commands for custom import and export using N3 (I was a big fan of the cwm software).


> [...] but we must extract the good and remove ideological barriers to participation.

Could you point to some resources that explain the tradeoff between the practical solutions and concepts and the ideological cruft for an outsider?


Not the commenter, but I hope to add something to the discussion. Generally, expanding on the current state of the art is paramount in academia. In this case, I guess that defaulting to closed-world and unique names is frowned upon because academic people "know" that SemWeb concepts would be "easy" to implement under such conditions (for some interpretation of "know" and "easy"). A university lab would be reluctant to invest in such a project, because it would likely result in fewer publications than, say, a bleeding-edge POC.

Of course, practical solutions based on well-understood assumptions are exactly what a commercial operation needs, so it's no wonder that TerminusDB chose that path. They might not publish a ton of papers, but they have something that works and could be used in production.


Very interesting. Thanks!


> "dispense with the open world interpretation“

That can mean anything from "we have some conventional (e.g. plain old RDBMS) CWA systems but describe their schemas in an OWA DL to ease integration across independent systems" (in particular this means no CWA implications outside those built into the conventional systems with or without a semweb layer on top) to "we do a big bucket of RDF and run it all through a set of rules formulated in OWL syntax but applied in an entirely different way" (CWA everywhere). The former would be semweb as intended, or at least a subset thereof, but the latter could easily end up somewhere between simple brand abuse and almost comical cargo culting.

Well, at least that's how I feel as someone who never had to face the realities of the vast unmapped territories between plain old database applications and the fascinating yet entirely impractical academic mind games of DL (an old school symbolic AI ivory tower that suddenly happened to find itself in the center of the hottest W3C spec, right before W3C specs kind of stopped being a thing, with WHATWG usurping HTML and Crockford almost accidentally killing XML).

(also, when did "assumption" turn into "interpretation"? Guess I missed a lot)


"Our feeling on interacting with the semantic web community is that innovation - especially when it conflicts with core ideology - is not welcome."

I wasn't a big fan of the "semantic web" community when it first came out, and the years have only deepened my disrespect, if not outright contempt. The entire argument was "Semantic web will do this and that and the other thing!"

"OK, how exactly will it accomplish this?"

"It would be really cool if it did! Think about what it would enable!"

"OK, fine, but how will this actually work!"

"Graph structures! RDF!"

"Yes, that's a data format. What about the algorithms? How are you going to solve the core problem, which is that nobody can agree on what ontology to apply to data at global scale, and there isn't even a hint of how to solve this problem?"

"So many questions. You must be a bad developer! It would be so cool if this worked, so it'll work!"

There has always been this vacuousness in the claims, where they've got a somewhat clear idea of where they want to go, but if you ever try to poke down even one layer deeper into how it's going to be solved, you get either A: insulted; B1: claims that it's already solved, just go use this solution (even though it is clearly not already solved, since the semantic web promises are still promises and not manifested reality); B2: claims that it's already solved and the semantic web is already huge (even though the only examples anyone citing this can give are trivial compared to the grand promises and the "semantic web" components borderline irrelevant - most frequently "those google boxes that pop up for sites in search results", just like this article does, despite the fact that they're wafer-thin compared to the Semantic Web promises and barely use any "Semantic Web" tech at all); or C: a simple reiteration of the top-level promises, almost as if the person making this response simply doesn't fundamentally grasp that the ideals need to manifest in real code and real data to work.

This article does nothing to dispel my beliefs about it. The second sentence says it all. For the rest, while just zooming in on the reality may be momentarily impressive, compared to the promises made it is nothing.

The whole thing was structured backwards anyhow. I'd analogize the "semantic web" effort to creating a programming language syntax definition, but failing to create the compiler, the runtime, the standard library, or the community. Sure, it's non-trivial forward progress, but it wasn't really the hard part. The real problem for the semantic web and its community is the shared ontology; solve that and the rest would mostly fall into place. The problem is... that's an unsolvable problem. Unsurprisingly, a community and tech all centered around an unsolvable problem haven't been that productive.

A fun exercise (which I seriously recommend if you think this is solvable, let alone easy) is to just consider how to label a work with its author. Or its primary author and secondary authors... or the author, and the subsequent author of the second edition... or, what exactly is an authored work anyhow? And how exactly do we identify an author... consider two people with identical names/titles, for instance. If we have a "primary author" field, do we always have to declare a primary author? If it's optional, how often can you expect a non-expert bulk adding author information in to get it correct? (How would such a person necessarily even know how to pick the "primary author" out of four alphabetically-ordered citations on a paper?)

(I am aware of the fact there are various official solutions to these problems in various domains... the fact that there are various solutions is exactly my point. Even this simple issue is not agreed upon, context-dependent, it's AI-complete to translate between the various schema, and if you speak to an expert using any of them you could get an earful about their deficiencies.)


Yes. I had pretty much this conversation a while back with some non-technically minded people who had been convinced that, by creating an ontology and a set of "semantic business rules", a lot of the writing of actual code could be automated away, leaving the business team to just create rules in a language almost like English and have the machine execute those English-like rules.

I had to explain that they were basically on track to re-implementing COBOL.


It's not what you have to do, or how; it's that for the first time we have a common model for data interchange (RDF) with which you can model concepts and things in your domain, or more importantly across domains, and simply merge the datasets. Try that with the relational model or JSON. Integration is the main value proposition of RDF today; nobody sane is trying to build a single global ontology of the world.
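
For illustration, merging two RDF datasets with rdflib really is a one-liner (hypothetical file names; note that a strict RDF merge would also keep blank nodes from the two documents apart, which this simple union skips):

```
from rdflib import Graph

# Hypothetical file names; any two RDF documents will do.
g1 = Graph().parse("proteins.ttl", format="turtle")
g2 = Graph().parse("drugs.ttl", format="turtle")

merged = g1 + g2  # union of the two triple sets
print(len(g1), len(g2), len(merged))
```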

You can despise the fringe academic research, but how do you explain Knowledge Graph use by FAANG (including powering Alexa and Siri) as well as a number of Fortune 500 companies? Here are the companies looking for SPARQL (RDF query language) developers: http://sparql.club


Many of us who have been in these battles over the decades have decided that the interchange format is almost irrelevant to the real challenge, which is the modeling and semantic alignment. It's a useless parlour trick to merge graphs and call them integrated, approximately as it is to put several CSV files into an archive or to simply load unrelated tables into one RDBMS. Yes, you can run a processing engine on the amalgamation, but all the work remains to be done in establishing a federating model within your query or processing instructions.

Over and over, we see that the real world problem is gated on human effort to negotiate about the models and to do data cleaning and transformation. And, the best results almost always require that modeling, cleaning, and transformation be done with an eye towards a specific downstream consumer or analysis. We get tired of having to steer leadership back to reality after they buy into the snake oil suggestion that integration costs can be avoided and unknown applications solved.

The claim that RDF solves federation more than any other serialization format for structured data and models is about the same as fixating on JSON versus XML or YAML or Lisp s-expressions.


Relational model, XML/JSON etc. simply do not have a generic merge operation defined the same way as RDF does. This can be proved with pen and paper.

And you still haven't addressed my second point about widespread industry use. It seems that SemWeb haters/sceptics always try to avoid this - why could that be?


"simply do not have a generic merge operation defined the same way as RDF does."

Who cares? This is not a problem anyone has, which is precisely why so few formats have a solution.

"widespread industry use"

It's not in "widespread" use. It's in niche use, and it's been in niche use for about two decades, and shows no sign of escaping that niche.

Human perception is a bit broken here. You show a list of 100 users and it looks like a tech is in "widespread use"... because you don't intuit that the market has hundreds of thousands of users, if not millions. (I'm being conservative. It's almost certainly millions.) RDF is niche. You can comfortably read an effectively-complete list of users over a coffee break. Try that trick with JSON.

Also, to be honest, referring to "haters" rather proves my point about just how quickly insults get trotted out. You almost literally just said "RDF!" with no further substantive conversation exactly the way I mentioned! I know about RDF. I used it ~2005 when working on some Mozilla stuff. It had every opportunity to overtake JSON, and was never in any danger of it.

In fact my current job for the last few weeks has been working on a massively cross-team data lake in the company I work for... and nobody is talking about RDF. Not me (and I do know it, actually), not any vendor that might provide useful technology, not any vendor that consumes data to provide reports on it (nobody consumes RDF in this space), nobody. Nominally a core use case for "semanticness", and it's a complete non-starter.


Yes RDF is in its own niche -- data interchange. And that's where merge matters, when you for example need to merge protein data with genes and drugs etc. A bunch of pharma companies are using RDF Knowledge Graphs for that purpose. The need for data interchange comes with a certain company size, and at that point RDF becomes the solution because there are no real alternatives.

I'm not talking about replacing JSON with RDF. Don't need data interchange -- don't use RDF. RDF is both at a different level of abstraction and solving problems of different scope.


> merge protein data with genes and drugs

Could you perhaps recommend some industry case studies or publications on that specific problem area of biopharmaceuticals?


This is one recent meta-study: https://www.nature.com/articles/s41597-021-00797-y

One of the main datasources is uniprot.org.

I know for a fact that AstraZeneca, Novo Nordisk, Novartis, Roche, Boehringer Ingelheim are all using RDF Knowledge Graphs, and there are probably many others. It would take some time to find the references though.

Check out our company page, maybe we can help ;) https://atomgraph.com/


As always, one should look at Metacrap (http://www.well.com/~doctorow/metacrap.htm) when discussing the semantic web:

- Certain kinds of implicit metadata is awfully useful, in fact. Google exploits metadata about the structure of the World Wide Web: by examining the number of links pointing at a page (and the number of links pointing at each linker), Google can derive statistics about the number of Web-authors who believe that that page is important enough to link to, and hence make extremely reliable guesses about how reputable the information on that page is.

This sort of observational metadata is far more reliable than the stuff that human beings create for the purposes of having their documents found. It cuts through the marketing bullshit, the self-delusion, and the vocabulary collisions.

in short, engineering triumphs over data entry.


I found that Job Postings are an exception. Google picks up on them, has a special API to submit them directly (due to slow crawling) and to close them.

So long as you're a good actor that will get you far. If your data is low quality, wrong, error-prone or otherwise bad, you'll not get shown and will likely receive manual actions and end up in the proverbial Google sin bin.

I have found that incentives align for job postings.

That obviously doesn't prove that metadata is not flawed, just that there are areas where it seems to work well.


The whole field has been dominated by research, i.e. the wish to make simple things complicated (in order to publish papers) as opposed to engineering, i.e. making complicated things simple (in order to produce usable software efficiently). As a result the standards are horrendously - and needlessly - complicated. The few major practical outcomes like the schema.org, json-ld and the google annotation system, are results of engineering, not research. Alas, json-ld has also taken a turn towards hypercomplexities.


Yeah, this is an unfortunate consequence of having the whole ecosystem mostly within academia, including the lack of tutorials and proper documentation (e.g. not a 500 page standard).

IMO the most interesting place right now for semantic web development is Wikidata. It's still pretty difficult for newcomers to contribute (as is the case for all Wikimedia projects) but at least it has many eyeballs and a very active community / ecosystem.


+1 for WIKIDATA

There are lots of useful WIKIDATA links and demos on this page: https://www.wikidata.org/wiki/User:Daniel_Mietchen/FSCI_2017...


Maybe a good indicator that there is only minor (industry) need/benefit. The "biggest" Knowledge Graph is Google's, but it is unclear how much of it is actually Semantic Web and how much is search, ML, NLP etc.

They are all nice ideas, but the practical use cases are rare. I am skeptical of the often touted use case in Medicine/Drug Interactions. The only time I saw it in the industry, it was not really used by the lab technicians, because all the questions the system could answer were trivial. The promise of "the system can infer new combinations/interactions" was never fulfilled.


> The "biggest" Knowledge Graph is Google, but it is unclear, how much there is actually Semantic Web and how much search, ML, NLP etc..

The second biggest is possibly WikiData, and it is not that small.

As to the practical use cases, there are many, but it is the premier way of encoding metadata for search engines: https://schema.org/docs/about.html
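
For example, the JSON-LD block a search engine reads from a page is itself just RDF. A hypothetical schema.org snippet, parsed here with rdflib (assuming rdflib 6+, which bundles JSON-LD support):

```
from rdflib import Graph

# The kind of markup a crawler reads from <script type="application/ld+json">.
doc = """
{
  "@context": {"schema": "https://schema.org/"},
  "@type": "schema:JobPosting",
  "schema:title": "SPARQL developer",
  "schema:hiringOrganization": {"@type": "schema:Organization",
                                "schema:name": "Example Corp"}
}
"""

g = Graph().parse(data=doc, format="json-ld")
for s, p, o in g:
    print(s, p, o)  # plain RDF triples, queryable like any other graph
```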

And the amount of datasets and ontologies that exist is quite vast:

- https://lod-cloud.net/dataset

- http://obofoundry.org/

- https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_M...

I would like to understand what other options you would consider better for these datasets, for the metadata and for the ontologies?

I mean if not RDF for web metadata then what? If not semantic web for UK govt data (https://ukparliament.github.io/ontologies/, https://opendatacommunities.org/data_home, https://ckan.publishing.service.gov.uk/dataset?res_format=RD..., https://ckan.publishing.service.gov.uk/dataset?res_format=SP...) then what?

It would be nice to have something even better, but I much prefer RDF to a bunch of CSV files.


Explain this Knowledge Graph usage by Fortune 500 companies then: http://sparql.club/


I agree, the research is overly complicated.

So it's a lot of extra work to sift through, but I've found a lot of gold in there.

If you're looking for a simple, noise-free way to do the semantic web, I'm very confident that Tree Notation will enable it (https://treenotation.org/).

I've played around a bit with turning Schema.org into a Tree Language, and think that would be a fruitful exercise, but plenty more on the plate first.

FWIW I've pitched this concept to W3C for 4 or 5 years to no avail yet. I think though if someone can put together a decent prototype the idea might start clicking.

Imagine a noise free way to encode the semantic web with natural 3-d positional semantics. Could be cool!


It is unclear to me what it would achieve compared to a spog (subject, predicate, object, graph) based representation like it exists in RDF based triplestores.


Yes you are right. Semantic triplets are great. I think the semantics are largely the same. Here's my work in progress argument for why this is relevant.

My take with ontologies is building consensus is hard.

Tree Notation offers a solution to the problem of: what should we agree on for the encoding? I assume that simpler is better, all else being equal. Then Tree Notation is the simplest, in terms of being the thing with the fewest pieces (tokens).

To get to Tree Notation, nothing was added, only stripped. I started with an existing notation and stripped away each visible syntax token that wasn't needed. Surprisingly, not one is needed. Not one quote, parens, bracket, colon, etc.

So now if we can get consensus around going with the simplest thing, we have got a way to agree on whether we should use XML, JSON-LD, Turtle, etc. The simplest thing works (which would be Tree Notation, or a close relative - someone can rebrand the notation but the idea is largely the same). This does not suffer from the 927 problem, as there are a few classes of things where we do have 1 new language that is mathematically superior and of a different kind than others (binary notation, for example).

So after you have agreement on that encoding, versioning and forking and merging schemas is dead simple (just use Git—in Tree Notation all changes are semantic and noise free).

So now we've solved what encoding to use for our ontologies, and we have a very fast and efficient way to collaborate on them (it's just plain text and git).

That brings us to a third advantage which is more theoretical. Tree Notation maps words/nodes to a 3-D representation. This means that there would be an X-Y-Z isomorphism between an ontology and the real world. I don't really know where we go from there, but at least by this point we've moved the semantic web idea a lot further and can start looking at the next realm of possibilities.


Honestly disconcerting to see mostly negative responses in this thread: awful community, overly complicated, research focused, academic nitwits gone wild, etc. Pretty sure there's some truth here, but I would suggest the deeper argument is against the semantic web as an evolution of the world-wide web. Agree this isn't likely to happen in my lifetime.

Right up there with the mostly hated JavaScript, I happen to think there are good parts of the semantic web technologies, and that the pivot towards industry adoption of the graph data models related to knowledge graphs, ontologies, and SPARQL shows there are benefits outside of academic paper mills. I don't have a dog in this fight (TerminusDB), but applying some reasonable expectations and accepting the limitations of the semantic web tools has been very successful on many projects. Even more so, innovation and improvements in graph data repositories are making triple-stores and graph-based models compelling for some use cases. Not going back to CSV hell if there are better alternatives.


Wow, what a great summary with lots of realism and nuances. I agree with the author's conclusions that what is missing is consolidation and interoperability between standards (e.g. make Protégé easier to use and ensure libraries for RDF parsing and serializations exist for all languages). No technology will be adopted if it requires PhD-level ability to handle jargon and complexity... but if there were tutorials and HOWTOs, we could see big progress.

Personally, I'm not a big fan of the "fancy" layers of the Semantic Web Stack like OWL (see https://en.wikipedia.org/wiki/Semantic_Web_Stack ), but the basic layers of RDF + SPARQL as a means for structured exchange of data seem like a solid foundation to build upon.

It's really simple in the end: we've got databases and identifiers. INTERNALLY to any company or organization, you can set up a DB of your choosing and ensure data follows a given schema, with data linked through internal identifiers. When you want to publish data EXTERNALLY, you need to have "external identifiers" for each resource, and URIs are a logical choice for this (this is also a core idea of REST APIs of hyperlinked resources). Similarly, communicating data using a generic schema capable of expressing arbitrary entities and relations, like RDF and JSON-LD, is also a logical next step, rather than each API using its own bespoke data schema...

As for making web data machine-readable, the key there is KISS: efforts like schema.org with opt-in, progressive-enhancement annotations are very promising.

For anyone wanting to know more about this domain, there is an online course here: https://www.youtube.com/playlist?list=PLoOmvuyo5UAeihlKcWpzV... The whole course is pretty deep (would take a month to go through it all), but you can skip ahead to lectures of specific interest.


I'm only a hobbyist in this area, but I wonder why the review wouldn't mention some of the graph databases as, at least, semantic web adjacent. Their relative success seems to lend credence to the overall vision of the semantic web and its supporting technologies. For example, are there really more than surface syntactical differences between SPARQL and Cypher?

Even though it was over-hyped, I like the semantic web because it supports a conception for the future that includes something other than neural network black-boxes. However, whether the ideas deliver remains to be seen.

If anyone is looking for an introduction, then I think the Linked Data book from Manning is worth mentioning--it might be a little dated at this point. The author provides a coherent introduction and helps, especially, in cutting through the confusing proliferation of acronyms that characterizes this field. As others have mentioned, reliable software is a major stumbling block. It's especially unfortunate that there isn't better browser support, of RDFa for example.


Check out our SPARQL-driven Knowledge Graph management system :) https://atomgraph.github.io/LinkedDataHub/


My 10,000 ft layperson's view, to which I invite corrections, is broadly:

- The semantic web set off with extraordinarily ambitious goals, which were largely impractical

- The entire field was trumped by Deep Learning, which takes as its premise that you can infer relationships from the exabytes of human rambling on the internet, rather than having to laboriously encode them explicitly

- Deep Learning is not after all a panacea, but more like a very clever parlour trick; put otherwise, intelligence is more than linear algebra, and "real" intelligences aren't completely fooled by one pixel changing colour in an image, etc.

- Hence, we have come back round to point 1 again

?


>The entire field was trumped by Deep Learning, which takes as its premise that you can infer relationships from the exabytes of human rambling on the internet, rather than having to laboriously encode them explicitly

I don't think machine learning can ever replace data modeling, because data modeling is often creative and/or normative. If we want to express what data must look like and which relationships there should be, then machine learning doesn't help and we have no other choice than to laboriously encode our designs. And as long as we model data we will have a need for data exchange formats.

You could categorise data exchange formats as follows:

a) Ad-hoc formats with ill defined syntax and ill defined semantics. That would be something like the CSV family of formats or the many ad-hoc mini formats you find in database text fields.

b) Well defined syntax with externally defined often informal semantics. XML and JSON are examples of that.

c) Well defined syntax with some well defined formal semantics. That's where I see Semantic Web standards such as RDF (in its various notations), RDFS and OWL.

So if the task is to reliably merge, cleanse and interpret data from different sources then we can achieve that with less code on the basis of (c) type data exchange formats.

But it seems we're stuck with (b). I understand some of the reasons. The Semantic Web standards are rather complex and at the same time not powerful enough to express all the things we need. But that is a different issue than what you are talking about.



I think you are spot on.

I think what we'll see is Deep Learning/Human Editor "Teams".

DL will do the bulk of the relationship encoding, but human domain experts will do "code reviews" on the commits made by DL agents.

Over time fewer and fewer commits will need to be reviewed, because each one trains the agent a bit more.


This seems to me to be an insightful and comprehensive overview of the Semantic Web, both current status and how we got here. People like me, who have long been wanting to better understand the (obviously sprawling) concepts involved will be able to use the article as a good entry point.

That said, the expressed hope of consolidation in the field is likely still some way off. AI has taken over a lot of the promise that the Semantic Web originally held. But AFAICS there are two drivers (also mentioned in the article) that potentially could provide the required impetus for a reignited interest in the Semantic Web:

Firstly the need for explainable AI, and secondly the probable(?) coming breakthrough in natural language processing and the automatic extraction of knowledge graphs or ontologies from text.

All in all, it seems way too early to write off the Semantic Web field at this point.


Maybe for its time it seemed like a good idea... like SOAP or manual features for image classification. Today, it's clear that languages and knowledge don't really work like that, and it's not practical to approach them this way. I learned about OWL and SPARQL 12 years ago, and it already felt like a very dated idea. But then who knows... everybody had given up on NNs once too.


> Today, it's clear that languages and knowledge don't really work like that, and it's not practical to approach them this way.

There are many applications of the Semantic Web that have little to do with natural languages. If you have a better option for all the existing RDF data sets (https://lod-cloud.net/, https://www.wikidata.org/) and ontologies (http://www.ontobee.org/, https://schema.org/) it would be good to be explicit about it.

I would prefer to have more data (e.g. data from US federal reserve data, world bank data) as RDF and accessible via SPARQL endpoints than less, because it is much more useful as RDF than as CSV, in my opinion.
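
To illustrate what "accessible via SPARQL" buys you, a minimal sketch with SPARQLWrapper against the public Wikidata endpoint (Wikidata stands in here because its endpoint exists today; the query and agent string are just examples):

```
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent="hn-example/0.1")  # Wikidata expects a UA string
sparql.setQuery("""
SELECT ?countryLabel ?population WHERE {
  ?country wdt:P31 wd:Q6256 ;        # instance of: country
           wdt:P1082 ?population .   # population
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?population) LIMIT 5
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["countryLabel"]["value"], row["population"]["value"])
```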


The comparison to NLP presents a good view of the problems.

Its "easy" to write some logic rules to parse input text for a 50% demo. But then you want to improve & scale, and suddenly all the nuances, bites you. The rules get bigger, nested and complicated. Traditional NLP tried that avenue for a while, with decent success in small usecases, but for larger problems without success. (Compared to stuff like BERT & GPT, which still have a lot of problems)

Similarly with Knowledge Graphs, you can show some nice properties for inferring knowledge on small problems, but the real world is much more approximate and unclear than some (binary) relationships.

Personally I think we humans lack the mental capacity to build large models with complex interactions.


That's true, as individuals I do not believe we can. Only as a group and with the help of tools, which is what SemWeb tried to achieve. We found out the tools weren't the most practical and learned a lot. Now we need the TensorFlow of those approaches, something easy to use, not platform centric and with a low barrier to entry.


OWL tools are dated. One large lib, OWLAPI, that is full of bugs and impractical, available only for one platform (JVM). Reasoners that work or not, and get abandoned once the grant that funded them is finished... Reasoners that use OWLAPI x but never got ported to OWLAPI y, so there is no way to use them on more recent systems...


Right... except that Uber, Boeing, JP Morgan Chase, Nike, Electronic Arts etc. etc. are looking for SPARQL developers right now: http://sparql.club/


The reason why tools like Protégé have not been sufficiently developed is because of infighting in the academic ontology community in addition to the reasons listed by the author. It has set the whole community back at least 5 years.


I think that's a symptom, not the cause.

The complexity of web standards in general smothers them under their own weight. The common web has enough raw financial and personnel backing to grind through that. The semantic web does not.

CURIEs and the standards they depend on alone are well over 100 pages. Language tags alone have 90.

RDF has like 100, SPARQL has a combined total of more than 300, and OWL has more than 500, even though it assumes that the reader is generally familiar with description logics, so it's probably a couple thousand if you take the required academic literature into account.

Nobody is going to read all of that, let alone build that.

Especially not a bunch of academics who don't care about the implementation as long as it's good enough to get the next paper out the door.

So everybody piles onto these few projects, because they're the only thing that's kinda working. OWLAPI, Protege, ... uh, that's it.

Because everything else is broken and unfinished.

Here's a thought experiment: name one production-ready RDF library for every major programming language (C, Java, Python, JS) that doesn't have major, stale, unresolved issues in its issue tracker. It's all broken, and there is simply too much work required to fix things.

It's only natural that people start to infight when there are only a few hospitable oases.

What we need is a simpler ecosystem, where people can stake their claim on their niche, where they have the ability and power to experiment and explore.


> CURIEs and the depending standards alone are well over 100 pages.

The CURIE standard is 10 pages long, and those "dependent standards" include things like RFC 3986 (Uniform Resource Identifiers (URI): Generic Syntax) and RFC 3987 (Internationalized Resource Identifiers (IRI)) - which are well established technologies that most people should be familiar with. And you really don't need to read all of the referenced standards to be able to understand and use CURIEs quite proficiently.

> RDF has like 100

The normative specification of RDF is contained in two documents:

- RDF 1.1 Concepts and Abstract Syntax ( https://www.w3.org/TR/rdf11-concepts/ ) = 20 pages

- RDF 1.1 Semantics ( https://www.w3.org/TR/rdf11-mt/ ) = 29 pages

These page counts include the TOC, reference sections, appendices and large swathes of non-normative content.

And really the RDF 1.1 primer (https://www.w3.org/TR/rdf11-primer/) should be quite sufficient for most people who want to use it, and that is only 14 pages.

RDF and CURIEs are simple as dirt really, maybe too simple, but I think I can explain them quite well to someone with some basic background in IT in about 30 minutes.
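
As a rough demonstration of that claim, a minimal rdflib sketch: bind a couple of prefixes and the CURIEs fall out of the Turtle output (ex: is a hypothetical namespace; serialize() returns a string in rdflib 6+):

```
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF

EX = Namespace("http://example.org/")  # hypothetical namespace

g = Graph()
g.bind("ex", EX)
g.bind("foaf", FOAF)
g.add((EX.alice, RDF.type, FOAF.Person))
g.add((EX.alice, FOAF.name, Literal("Alice")))

# CURIEs like ex:alice and foaf:Person are just these prefixes applied on output.
print(g.serialize(format="turtle"))
```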

And while the other aspects (e.g. SPARQL, OWL) are not that simple, there is inherent complexity they are trying to address that you cannot just ignore. And not everybody needs to know OWL, and SPARQL is really not that complicated either and again most people can become quite proficient with this rather quickly if they understand the basics.

> What we need is a simpler ecosystem, where people can stake their claim on their niche, where they have the ability and power to experiment and explore.

What are the alternatives? Proliferation of JSON Schema, which is yet to be ratified as a standard and does not address most of the same problems as Semantic Web technology? I think there is some validity to your concerns, but semantic web technologies are being used widely in production; maybe not all of them, but to suggest they are not usable is not true.

I have used RDF in Java (rdf4j and jena), Python (rdflib) and JS (rdflib.js) without serious problems.


Familiarity isn't nearly enough if you want to implement something.

Talking about RDF is absolutely meaningless without talking about Serialisation (and that includes ...URGH.. XML serialisation), XML Schema data-types, localisations, skolemisation, and the ongoing blank-node war.

The semantic web ecosystem is the prime example of "the devil is in the details". Of course you can explain to somebody who knows what a graph is the general idea of RDF: "It's like a graph, but the edges are also reified as nodes." But that omits basically everything.

It doesn't matter if SPARQL is learnable or not; it matters if it's implementable, let alone in a performant way. And that's really, really questionable.

Jena is okay-ish, but it's neither pleasant to use nor bug-free, although Java has the best RDF libs generally (I think that's got something to do with academic selection bias). RDF4J has 300 open issues, but they also contain a lot of refactoring noise, which isn't a bad thing.

C'mon, rdflib is a joke. It has a ridiculous 200 issues / 1 commit a month ratio, buggy as hell, and is for all intents and purposes abandonware.

rdflib.js is in memory only, so nothing you could use in production for anything beyond simple stuff. Also there's essentially ZERO documentation.

And none of those except for Jena even step into the realm of OWL.

> What are the alternatives?

Good question.

SIMPLICITY!

We have an RDF replacement running in production that's twice as fast, and 100 times simpler. Our implementation clocks in at 2.5kloc, and that includes everything from storage to queries, with zero dependencies.

By having something that's so simple to implement, it's super easy to port it to various programming languages, experiment with implementations, and exterminate bugs.

We don't have triples, we have tribles (binary triples, get it, nudge nudge, wink wink). 64 bytes in total, fits into exactly one cache line on the majority of architectures.

16byte subject/entity | 16 byte predicate/attribute | 32 byte object/value

These tribles are stored in knowledge bases with grow-set semantics, so you can only ever append (on a meta level knowledge bases do support non-monotonic set operations), which is the only way you can get consistency with open-world semantics - something that the OWL people apparently forgot to tell pretty much everybody who wrote RDF stores, as they all have some form of non-monotonic delete operation. Even SPARQL is non-monotonic with its OPTIONAL operator...

Having a fixed size binary representation makes this compatible with most existing databases, and almost trivial to implement covering indices and multiway joins for.

By choosing UUIDs (or ULIDs, or TimeFlakes, or whatever, the 16 bytes don't care) for subject and predicate we completely sidestep the issues of naming and schema evolution. I've seen so many hours wasted by ontologists arguing about what something should be called. In our case, it doesn't matter: each consumer of the schema can choose their own name in their code. And if you want to upgrade your schema, simply create a new attribute id, and change the name in your code to point to it instead.

If a value is larger than 32 bytes, we store a 256-bit hash in the trible, and store the data itself in a separate blob store (in our production case S3, but for tests it's the file system; we're eyeing an IPFS adapter but that's only useful if we open-sourced it). Which means that it also works nicely with binary data, which RDF never managed to do well. (We use it to mix machine learning models with symbolic knowledge.)
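
To make that concrete, a rough Python sketch of the layout just described (illustrative only, not the actual implementation):

```
import hashlib
import uuid

def make_trible(entity: bytes, attribute: bytes, value: bytes):
    # 16-byte entity | 16-byte attribute | 32-byte value, 64 bytes total.
    # Values over 32 bytes are replaced by their 32-byte hash; the payload
    # itself would go to a blob store (S3, filesystem, ...).
    assert len(entity) == 16 and len(attribute) == 16
    blob = None
    if len(value) > 32:
        blob, value = value, hashlib.sha256(value).digest()
    return entity + attribute + value.ljust(32, b"\x00"), blob

romeo = uuid.uuid4().bytes      # random 16-byte entity id, no naming debates
name_attr = uuid.uuid4().bytes  # attributes are ids too; names live in metadata
trible, blob = make_trible(romeo, name_attr, "Romeo".encode())
assert len(trible) == 64        # one cache line on most architectures
```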

We stole the context approach from JSON-LD, so that you can define your own serialisers and deserialisers depending on the context they are used in. So you might have a "legacyTimestamp" attribute which returns a util.datetime, and a "timestamp" which returns a JodaTime object. However, unlike JSON-LD, these are not static transformations on the graph, but done just in time through the interface that exposes the graph.

We have two interfaces. One based on conjunctive queries which looks like this (JS as an example):

```

  // define a schema
  const knightsCtx = ctx({
    ns: {
      [id]: { ...types.uuid },
      name: { id: nameId, ...types.shortstring },
      loves: { id: lovesId },
      lovedBy: { id: lovesId, isInverse: true },
      titles: { id: titlesId, ...types.shortstring },
    },
    ids: {
      [nameId]: { isUnique: true },
      [lovesId]: { isLink: true, isUnique: true },
      [titlesId]: {},
    },
  });

  // add some data
  const knightskb = memkb.with(
    knightsCtx,
    (
      [romeo, juliet],
    ) => [
      {
        [id]: romeo,
        name: "Romeo",
        titles: ["fool", "prince"],
        loves: juliet,
      },
      {
        [id]: juliet,
        name: "Juliet",
        titles: ["the lady", "princess"],
        loves: romeo,
      },
    ],
  );

  // Query some data.
  const results = [
    ...knightskb.find(knightsCtx, (
      { name, title },
    ) => [{ name: name.at(0).ascend().walk(), titles: [title] }]),
  ];
```

and the other based on tree walking, where you get a proxy object that you can treat as any other object graph in your programming language, and you can just navigate it by traversing its properties, lazily creating a tree unfolding.

Our schema description is also heavily simplified. We only have property restrictions and no classes. For classes there's ALWAYS a counter example of something that intuitively is in that class, but which is excluded by the class definition. At the same time, classes are the source of pretty much all computational complexity. (Can't count if you don't have fingers.)

We do have cardinality restrictions, but restrict the range of attributes to be limited to one type. That way you can statically type check queries and walks in statically typed languages. And remember, attributes are UUIDs and thus essentially free, simply create one attribute per type.

In the above example you'll notice that queries are tree queries with variables. They're what's most common, and also what's compatible with the data structures and tools available in most programming languages (except for maybe Prolog). However, we do support full conjunctive queries over triples, and it's what these queries get compiled to. We just don't want to step into the same impedance mismatch trap Datalog steps into.

Our query "engine" (much simpler, no optimiser for example), performs a lazy depth first walk over the variables and performs a multiway set intersection for each, which generalises the join of conjunctive queries, to arbitrary constraints (like, I want only attributes that also occur in this list). Because it's lazy you get limit queries for free. And because no intermediary query results are materialised, you can implement aggregates with a simple reduction of the result sequence.

The "generic constraint resolution" approach to joins also gives us queries that can span multiple knowledge bases (without federation, but we're working on something like that based on differential dataflow).

Multi-kb queries are especially useful since our default in-memory knowledge base is actually an immutable persistent data-structure, so it's trivial and cheap to work with many different variants at the same time. They efficiently support all set operations, so you can do functional logic programming a la "out of the tar pit", in pretty much any programming language.

Another cool thing is that our on-disk storage format is really resilient through its simplicity. Because the semantics are append-only, we can store everything in a log file. Each transaction is prefixed with a hash of the transaction and followed by the tribles of the transaction, and because of their constant size, framing is trivial.

We can lose arbitrary chunks of our database and still retain the data that was unaffected. Try that with your RDBMS, you will lose everything. It also makes merging multiple databases super easy (remember: UUIDs prevent naming collisions, monotonic open-world semantics keep consistency, fixed-size tribles make framing trivial), you simply `cat db1 db2 > outdb` them.

Again, all of this in 2.5kloc with zero dependencies (we do have one on S3 in the S3 blob store adapter).

Is this the way to go? I don't know, it serves us well. But the great thing about it is that there could be dozens of equally simple systems and standards, and we could actually see which approaches are best, from usage. The semantic web community is currently sitting on a pile of ivory, contemplating how to best steer the titanics that are Protege and OWLAPI through the waters of computational complexity, without anybody ever stopping to ask if that's REALLY been the big problem all along.

"I'd really love to use OWL and RDF, if only the algorithms were in a different complexity class!"


> Talking about RDF is absolutely meaningless without talking about Serialisation (and that includes ...URGH.. XML serialisation), XML Schema data-types, localisations, skolemisation, and the ongoing blank-node war.

Don't implement XML serialization. The simplest and most widely supported serialization is N-Quads (https://www.w3.org/TR/n-quads/). 10 pages, again with examples, TOC, and lots of non-normative content.
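
For example, a couple of made-up N-Quads lines and a round trip through rdflib:

```
from rdflib import ConjunctiveGraph

# Made-up data: subject, predicate, object, graph, full stop - one statement per line.
data = """
<http://example.org/romeo> <http://example.org/loves> <http://example.org/juliet> <http://example.org/play> .
<http://example.org/romeo> <http://example.org/name> "Romeo" <http://example.org/play> .
"""

g = ConjunctiveGraph()
g.parse(data=data, format="nquads")
print(len(g))  # 2
```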

You don't need to handle every data type, and you can't even if you wanted to because data types are also not a fixed set. And whatever you need to know about skolemisation, localization, and blank-nodes is in the standards AFAIK.

> C'mon, rdflib is a joke. It has a ridiculous 200 issues / 1 commit a month ratio, buggy as hell, and is for all intents and purposes abandonware.

It works, not all functionality works perfectly but like I said I have used it and it worked just fine.

> rdflib.js is in memory only, so nothing you could use in production for anything beyond simple stuff. Also there's essentially ZERO documentation.

For processing RDF in the browser it works pretty well; not sure what you expect, but to me RDF support does not imply it should be a fully fledged triple-store with disk backing. Also not really zero documentation: https://github.com/linkeddata/rdflib.js/#documentation

> > What are the alternatives?

> SIMPLICITY!

> But the great thing about it is that there could be dozens of equally simple systems and standards, and we could actually see which approaches are best, from usage.

Okay, so you roll your own that fits your use case. Not much use to me, and it is not a standard. Let's talk again when you standardize it. Otherwise, do you mind giving an alternative that I can actually take off the shelf to at least the extent that I can with RDF?

I am not going to roll my own standard, and if all the RDF data sets used their own standards instead of RDF, it wouldn't really improve anything.

EDIT: If you compare support for RDF to JSON schema, things are really not that bad.


> Don't implement XML serialization. The simplest and most widely supported serialization is N-Quads (https://www.w3.org/TR/n-quads/). 10 pages, again with examples, TOC, and lots of non-normative content.

You omit the transitive hull that the n-quads standard drags along, as if implementing a deserializer somehow only involved a parser for the most top-level EBNF.

Also, you're still tip-toeing around the wider ecosystem of OWL, SHACL, SPIN, SAIL and friends. The fact that RDF alone even allows for that much discussion is indicative of its complexity. It's like a discussion about SVG and HTML that never goes beyond SGML.

And you can't have your cake and eat it too. You either HAVE to implement the XML syntax or you won't be able to load half of the world's datasets, nor will you even be able to start working with OWL, because they do EVERYTHING with XML.

You're still coming from a user perspective. RDF will go nowhere unless it finds a balance between usability and implementability. Currently I'd argue, it focuses on neither.

JS is a bigger ecosystem than just the browser; if you want to import any real-world dataset (or have persistence) you need disk backing. So anything that just goes poof on a power failure doesn't cut it.

Sorry but "works pretty well", and 6 examples combined with an unannotated automatically extracted API, does not reach my bar for "production quality".

It's that "works pretty well" state of the entire RDF ecosystem that I bemoan. It's enough to write a paper about it, it's not enough to trust the future of your company on. Or you know. Your life. Because the ONLY real world example of an OWL ontology ACTUALLY doing anything is ALWAYS Snowmed. Snowmed. Snowmed. Snowmed.

[A joke we always told about theoreticians finding a new lower bound and inference engines winning competitions: "Can SNOMED be used to diagnose a patient?" "Well it depends. It might not be able to tell you what you have, but it can tell you that your 'toe bone is connected to the foot bone' 5 million times a second!"]

Imagine making the same argument for SQL, it'd be trivial to just point to a different library/db.

And so far we've only talked about complexity inherent in the technology, and not about the complex and hostile tooling (a.k.a. Protege), or even the absolutely unmaintainable rat's nests that big ontologies devolve into.

Having a couple different competing standards would actually improve things quite a bit, because it would force them to remain simple enough that they can still somehow interoperate.

It's a bit like YAGNI. If you have two simple standards it's trivial to make them compatible by writing a tool that translates one to the other, or even speaks both. If you have one humongous one, it's nigh impossible to have two compatible implementations, because they will diverge in some minute thing. See Rich Hickey's talk "Simplicity Matters" for an in-depth explanation of the difference between simple (few parts with potentially high overall complexity through intertwinement and parts taking multiple roles) and decomplected (consisting of independent parts with low overall system complexity).

And regarding JSON Schema: I never advocated for JSON Schema, and the fact that you have to compare RDF's maturity to something that hasn't been released yet...

You would expect a standard that work began on 25 YEARS ago to be a bit more mature in its implementations. If it hasn't reached that after all this time, we have to ask the question: why is that? And my guess is that implementors see the standards _and_ their transitive hull, and go TL;DR, and even if they try, they get overwhelmed by the sheer amount of stuff.


I don’t even work with semantic technologies, but I just love the structure and completeness of arguments in the space. I suppose I should not make enemies by being specific, but compare this comment to the average (or even 90th percentile) argument on almost any other topic.

Although it looks like HN now needs to implement a "download to Kindle" feature :-)


I'm flattered <3


Hi,

thank you for the really cool post! I am trying to understand some key concepts here, so please forgive the simple questions:

> 16byte subject/entity | 16 byte predicate/attribute | 32 byte object/value

What would be the difference between subject and entity? Do you include a timestamp for your entries next to your trible?

> Having a fixed size binary representation makes this compatible with most existing databases (...)

Are you using an external lookup table to identify the human-language definition of the entry, and keeping the 2^128 possible entries for internal use?

> (...) if you want to upgrade your schema (...)

> We stole the context from jsonLD (...)

What were the reasons you did not use jsonLD as a base for your software?

Could you perhaps point me to a case study of your system or, if that is not possible, a similar case published in the literature/www etc.? I would love to learn more about what you are doing (my contact is in my profile).

Wish I could upvote you a couple of times. Thank you.


Glad that you like it :D This actually pushes me a bit more in the direction of open-sourcing the whole thing; we kinda have it planned, but it's not a priority at the moment because we use it ourselves quite happily :D.

Subject and Entity are the same thing, just different names for it. People with a graph DB background will more commonly use [entity attribute value] for triples, while people from the Semantic Web community commonly use [subject predicate object].

We don't use timestamps, but we just implemented something we call UFO-IDs (Unique, Forgettable, Ordered), where we store a 1-second-resolution timer in the first 16 bits, which improves data locality and allows us to forget irrelevant tribles within an 18h window (which is pretty nice if you do e.g. robotics or virtual personal assistants), while at the same time still exercising the overflow case regularly (in comparison to UUIDv1, ULID, or Timeflake), and not losing too many bits of entropy (especially in cases where the system runs longer than 18h).
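
In case a concrete picture helps, here's roughly how I'd sketch such an ID in JS (the layout details and function name are my guesses, not our actual implementation; assumes a runtime with the Web Crypto global):

    // Sketch of a UFO-ID: 16 bytes, the first 2 of which hold a
    // 1-second-resolution counter (2^16 s ≈ 18.2 h before it wraps),
    // followed by 14 random bytes (112 bits of entropy).
    function ufoId() {
      const id = new Uint8Array(16);
      const seconds = Math.floor(Date.now() / 1000) % 0x10000; // wraps every ~18h
      id[0] = (seconds >>> 8) & 0xff; // big-endian, so IDs sort roughly by time
      id[1] = seconds & 0xff;
      crypto.getRandomValues(id.subarray(2)); // fill the remaining 14 bytes
      return id;
    }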

128 bits is actually big enough that you can just choose any random value and be pretty darn certain that it's unique (UUIDv4 works that way). 64 bytes / 512 bits not only fits into a cache line, it's also the smallest size that is statistically "good enough": 128-bit random IDs (entity and attribute) are unlikely enough to collide, and 256-bit hashes (the value) are likewise good enough for the foreseeable future to avoid content collisions.
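
If you want to sanity-check that "unlikely enough to collide" claim, the birthday bound gives roughly n^2 / 2^129 for n random 128-bit IDs; a quick back-of-the-envelope in JS:

    // Approximate collision probability for n random b-bit IDs
    // (birthday bound: p ≈ n^2 / 2^(b+1)).
    const collisionProbability = (n, bits = 128) => (n * n) / 2 ** (bits + 1);

    console.log(collisionProbability(1e12)); // ≈ 1.5e-15, negligible even at a trillion IDs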

And yeah, well, the human-language name, as well as all the documentation about an attribute, is actually stored as tribles alongside the data. We use them for code generation for statically typed programming languages, which allows us to hook into the language's type checker, create documentation on the fly, and power a small ontology-editing environment (take that, Protégé ;) ).
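
Purely to illustrate the idea (the attribute names and metadata fields below are made up by me, not taken from the real system), codegen from attribute metadata could look something like this:

    // Hypothetical attribute metadata, itself stored as ordinary data.
    const attributes = [
      { name: "firstName", type: "string", doc: "Given name of a person." },
      { name: "birthYear", type: "number", doc: "Year of birth." },
    ];

    // Emit a TypeScript interface so the host language's type checker
    // and doc tooling can pick up the schema and its documentation.
    const generated =
      "interface Person {\n" +
      attributes.map(a => `  /** ${a.doc} */\n  ${a.name}: ${a.type};`).join("\n") +
      "\n}";

    console.log(generated);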

We kinda use it as middleware, similar to ROS, so it has to fit into the same soft-realtime, statically typed, compile-everything niche, while at the same time allowing for explorative programming in dynamic languages like JavaScript. We use ObservableHQ notebooks to do all kinds of data analysis, so naturally we want a nice workflow there.

JSON-LD is heavily hooked into the RDF ecosystem. We actually started in the RDF space, but it quickly became apparent that the overall complexity was a showstopper.

Originally this was planned to bring the sub-projects in a large EU research project closer together and encourage collaboration. We found that every sub-project wanted to be the hub that connected all the other ones.

By having a 2.5 kloc implementation, we figured everybody could "own" the codebase and associate with it; make it so stupid and obvious that everybody feels like they came up with the idea themselves. The good old inverse Conway manoeuvre.

JSON-LD is also very static in the ways it allows you to reinterpret data (RDF in =churn=> JSON out), and we wanted to be able to do so dynamically, so that when you refactor code to use new attribute variants (e.g. with different deserialisations) you can do so gradually. Dynamic is also a lot faster.

The tribles-instead-of-triples idea came when we noticed that basically every triple store implementation does a preprocessing step where CURIEs are converted to u64 / 8-byte integers to be stored in the indices.

We just went: "Well, we could either put 24 byte in the index and still have to do 3 additional lookups. Or we could put 64 byte (2.5x) in there and get range queries, sorting, and no additional lookups, with essentially the same write and read characteristics.[Because our Adaptive Radix Tree index compresses all the random bits.]" 64 bit words are already pretty darn big...

Currently there is nothing published (except for it being vaguely mentioned in some linguistics papers), and no studies done. They are planned though, but as this isn't our source of income it's lowish priority (much to my dismay :D).

Keep an eye on tribles.space tho ;)

Edit: Ah well, why wait, might as well start building a community :D

https://discord.gg/KP5HBYfqUf


RDF is absolutely about triples and the corresponding graph form! The serialization formats are immaterial, orthogonal. The fact that you think it's about any particular format makes it obvious that you don't get it at all.


I agree with this. It is common to hear "Partial SPARQL 1.1 support"... or "Partial OWL compatibility" or "A variant of SKOS is supported". While it is true that full ES6/HTTP2/IPv6/SQL support is also rarely provided by implementations, this doesn't hinder their use in production environments. I think it is rare to reach the parts of ECMAScript that aren't implemented, or the corners of SQL that Postgres/MariaDB don't support. In many parts of the "Semantic Web stack", however, one quickly reaches a "not implemented" portion of the 500-page OWL standard.


OWLAPI, Protege - that's it? RDF libraries broken? Dude what rock are you living under? What about Jena, RDF4J, rdflib, redland, dotNetRDF etc? Most of these libraries have been developed and tested for 20+ years and are active. See for yourself: https://github.com/semantalytics/awesome-semantic-web#progra...

Why are you spreading FUD?


Jena is anything but user-friendly; it has a lot of weird edge cases, bugs, and a horrible API.

RDF4J is okay for RDF, but completely ignores OWL.

RDFLib is a bug-ridden mess. Have you ever used it, or checked its issue tracker and commit history? With that number of production-breaking bugs left unresolved for years, it might as well be unmaintained.

Redland was last updated years ago. Sure, there's software that's simply finished, but with the complexity of RDF, OWL, and friends, my hypothesis would be "it's dead, Jim".

I haven't used dotNetRDF, but it looks okay at first glance. So at least you can do semantic web on windows...

YES! THEY'VE BEEN TESTED FOR 20+ YEARS AND ARE STILL HALF-BAKED, BUG-RIDDEN RESEARCH PROJECTS. MY POINT EXACTLY. These are all smart, dedicated people. The semantic web ecosystem is too complex to get right, no matter how many hours and $$$ you throw at it.

I have nothing to gain from talking about the shortcomings of RDF and its related standards, except maybe to inspire people to come up with something better, and to save themselves some pain and suffering.

The pot calling the kettle black. Your motives seem more questionable, with the whole "username matches the topic being discussed and runs a Semantic Web consultancy business" shtick.


Research projects? Research is something coming out of academia, as you know. These are open-source projects with active developer communities. Jena has long been under Apache; RDF4J is now under the Eclipse Foundation.

Can you for once answer why large companies in the industry are using RDF/SPARQL as of today if it's so "dead"? Here's a list: http://sparql.club


Nothing about this changes the code quality and maintenance status.

I'm not saying that it's dead; worse, it's hard to use, bad, and unreliable.

Can you answer why large companies are still using Cobol, if it's so "dead"?

Legacy, lack of alternatives, and managers who don't have technical expertise but fall for the marketing.

The semantic web MASSIVELY overpromises and MASSIVELY under-delivers. Both truly in a "web-scale" way.


Could you give more details on that? How can it slow down the development of an editor?


Even though I have been working off and on with SW and linked data tech for twenty years, I share some of the skeptical sentiments in the comments here.

I am keenly interested in the fusion of knowledge representation with SW tech and deep learning. Two weekends ago I wrote a short and effective NLP interface to DBpedia that leverages Hugging Face's transformer model for question answering; you can experiment with it on Google Colab https://colab.research.google.com/drive/1FX-0eizj2vayXsqfSB2... You can quickly see example use in my blog https://markwatson.com/blog/2021-01-18-dbpedia-qa-transforme...


Nice, though nothing about Turtle or LV2:

https://www.w3.org/2007/02/turtle/primer/

https://github.com/lv2/lv2/wiki

Also, #swig (semantic web interest group) exists on freenode.


I do research in this field, but I was a programmer by training before I entered it. I have talked to many academics and they agree that industry needs something simpler, more approachable, and something that solves their problems in a more direct way, so it's definitely not an "academic exercise" for many researchers.

However, I have failed to convince people that we need to implement the 2001 SciAm use case (https://www-sop.inria.fr/acacia/cours/essi2006/Scientific%20..., see the intro before the first section) using 2021 technologies (smartphones are here, assistants are here, shared calendars are easy, companies have APIs; the only thing missing is the proper glue using semantic web tech). This goes to the core thesis of this paper: the semantic web is awesome as a set of ideas and approaches, but the Semantic Web, as the result of all this work, may look underwhelming or irrelevant today. I like to point everyone who disagrees with me to the 1994 TimBL presentation at CERN (https://videos.cern.ch/record/2671957) where he talks about the early vision of the semantic web (https://imgur.com/aS2dbf6, or around 05:00 in the video), which looks awfully like IoT, many years before the term even existed. We simply cannot fault someone who envisioned communication technologies for IoT in 1994 for getting the technology a bit wrong.

Today's technologies simply cannot properly handle the use cases for which SemWeb was designed:

1) The web is still not suitable for machines. Yes, we have IoT devices that use APIs, but nobody will say it's truly M2M communication at its best. When APIs go down, devices get bricked; there is no way to get those devices to talk to any other APIs. There is no way for two devices in a house to talk to each other unless they were explicitly programmed to do so.

2) We don't have common definitions for the simplest of terms. Schema.org has made progress, but it's very limited because it serves search-engine interests, not the IoT community. There is no reason something like XML NS or RDF NS should not be used across every microservice in a company. Using a key (we call them predicates, but that's not important here) like "email:mbox" (defined in https://www.w3.org/2000/10/swap/ a very long time ago) you can globally denote that the value is an email address (see the sketch after this list).

3) Correctness of data and endpoint definition still matters. We threw away XML and WSDL but came back to develop JSON Schema and Swagger.
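
To make point 2 concrete, here's a minimal, hypothetical JSON-LD-style sketch in JS (the vocabulary IRIs and key names are illustrative placeholders, not an existing vocabulary):

    // A plain JS object with a JSON-LD-style @context so the short key
    // "mbox" unambiguously means "this value is an email address",
    // no matter which microservice produced the document.
    const person = {
      "@context": {
        mbox: "http://example.org/vocab/email#mbox", // globally unique meaning
        name: "http://example.org/vocab/core#name"
      },
      "@id": "http://example.org/people/alice",
      name: "Alice",
      mbox: "mailto:alice@example.org"
    };

    // Any consumer can expand the short keys back to full IRIs and know
    // exactly what each field denotes, without bilateral agreements.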

We are trying to get there. JSON Schema, Swagger, etc. all make efforts in the direction of the problems SemWeb tried to address. One of the most "semantic" efforts I have seen recently is GraphQL federation, which has been a semantic web dream for a long while: being able to get the information you need by querying more than one API. This only indicates that the problems the semantic web tried to address are still relevant.

If anyone has attempted an OSS reimplementation of the 2001 "Pete and Lucy" semantic web use case (i.e. as an Android app and a bunch of microservices), please point me in the right direction. Otherwise, if anyone is interested in doing it, I am all ears (https://gitter.im/linkeddata/chat is an active place for LOD/EKG/SW discussion).


We wasted 20 years trying to replace one form of brackets with another (XML vs. JSON). WHATWG and the browser vendors are responsible for this, just as they are responsible for the fact that we still don't have a machine-readable web. FAANG crawls the structured schema.org metadata like nobody else can and profits from it, and the rest of us are left with the HTML5 and JavaScript crap.



