A Survey of the First 20 Years of Research on Semantic Web and Linked Data [pdf] (inria.fr)
149 points by kkdw on Jan 3, 2019 | 48 comments

RSS/Atom is being slowly killed. XHTML was replaced by a pile of junk that can only be parsed by a single parser. HTML in general became a rendering layer for executable JavaScript. After years of doing integrations, I've seen one(!) HATEOAS web service and it's being replaced by GraphQL right now.

I like the idea of having semantic content, but it is suffocating under the weight of all the overcomplicated designed-by-committee abstractions and formats. In fact, the real web that we're using is becoming less and less semantic every year.

I wish someone re-standardized XML without all the complicated edge case garbage and with better handling of namespaces. So it would actually get used for documents again. JSON is not a good document format, and it seems everyone is busy re-inventing XML in JSON right now.

Let's say I want to define my own concept and put it on a web page. Is there even an "official" way to do it right now, without namespaces in HTML5? Theoretical example:

    <concept tag="computer-game" guid="c1a3e0cb-5872-4e5a-8dce-b8afc00772db" />
    <concept attribute="year-published" guid="ba073188-8bd6-47a7-9065-eb55c4a8b908" />

    <computer-game year-published="1993">Doom</computer-game>

Simple, isn't it? I'm not aware of anything of this sort.

To answer your question, you can use HTML+RDFa to annotate any element with semantic info, or simply use meta-links (in the HTML head) for per-page metadata.

W3C's semantic web is a lost cause, though, as is XHTML. I don't know why the paper speaks about "the first 20 years", as there's been zero activity for years and I think the semweb charters were closed years ago.

If you're interested in document engineering based on real and practical standards, I'd suggest looking into ISO SGML (superset of XML and HTML) and maybe ISO Topic Maps (with Tolog/ISO Prolog as query language).

> Let's say I want to define my own concept and put it on a web page. Is there even an "official" way to do it right now

What's wrong with Microdata? As in https://en.wikipedia.org/wiki/Microdata_(HTML) It can interoperate with RDF and JSON-LD.

First time I hear about it. Let's see.

It seems to aim at roughly the same problem, but god, they've managed to design something even uglier and more verbose than XML namespaces. It still requires valid URLs instead of GUIDs. It still forces you to use external names in your markup. Sigh.

One hard limitation I see is that unlike namespaces it doesn't deal with HTML attributes, so you can't just annotate the same tag with unrelated semantics.

It also makes assumptions about document structure (item properties are nested inside an item). Semantics and structure should not be complected. This will create innumerable issues when doing logic in code or writing CSS queries.

Why does this have to be so ugly and complicated? All we really need is a way to associate data with globally unique identifiers. This gives us semantics. A single layer of indirection (mapping GUIDs to local names of your choice) would allow us to be concise and descriptive in our markup, while avoiding naming conflicts.
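A minimal sketch of that single layer of indirection, assuming a hypothetical page-local table that maps short names to GUIDs. The `CONCEPTS` table and the `resolve` and `same_concept` helpers are invented for illustration and do not correspond to any existing standard; the GUIDs are the ones from the example upthread.

```python
# Hypothetical sketch: each page declares its own local names for global GUIDs.
CONCEPTS = {
    "computer-game": "c1a3e0cb-5872-4e5a-8dce-b8afc00772db",
    "year-published": "ba073188-8bd6-47a7-9065-eb55c4a8b908",
}

# Another page is free to pick a different local name for the same GUID.
OTHER_PAGE = {
    "released": "ba073188-8bd6-47a7-9065-eb55c4a8b908",
}


def resolve(local_name, concepts):
    """Return the GUID declared for a local name, or None if undeclared."""
    return concepts.get(local_name)


def same_concept(name_a, concepts_a, name_b, concepts_b):
    """Two local names mean the same thing when they resolve to the same GUID."""
    guid_a = resolve(name_a, concepts_a)
    guid_b = resolve(name_b, concepts_b)
    return guid_a is not None and guid_a == guid_b


print(same_concept("year-published", CONCEPTS, "released", OTHER_PAGE))  # True
```

The markup stays concise because only the declaration carries the GUID; tools compare GUIDs, never local names.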

No matter how much W3C hates it, people are extending HTML with custom tags and attributes. (ng-whatever, for example.) Having no support for providing semantics for such things is pretty ridiculous.

> Still requires to use valid URLs instead of GUIDs.

Not sure why you would want to do this, but AIUI you can use any URI for a namespace (not just URL), including a UUID via the `urn:uuid:c1a3e0cb-5872-4e5a-8dce-b8afc00772db` syntax.
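As a small illustration of the `urn:uuid:` form, Python's standard `uuid` module can convert between a bare GUID and its URN spelling. This only shows the syntax; whether a given RDFa or Microdata consumer accepts such a URI as a vocabulary identifier is a separate question that this sketch does not answer.

```python
import uuid

# The GUID from the example upthread.
guid = uuid.UUID("c1a3e0cb-5872-4e5a-8dce-b8afc00772db")

# The .urn property renders the UUID as a URN.
urn = guid.urn
print(urn)  # urn:uuid:c1a3e0cb-5872-4e5a-8dce-b8afc00772db

# The constructor also accepts the URN form, so the mapping round-trips.
assert uuid.UUID(urn) == guid
```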

You can, but most of the time, especially with Linked Data, a URL is required, and sometimes even a dereferenceable one. So you have this weird situation where you never really know if a URL is "semantic" (just a string, actually) or can be accessed. Also, having a dependency on the DNS system is a bad idea in my opinion. It's really clear when you read a two-year-old semantic web paper and every link is broken.

I agree with gambler that GUID + local names is a better solution, and I use that in my research.

I’m thinking IPLD is looking like a great fix for that particular issue.


It seems rather overengineered to me. IETF and W3C standards do support the general content-addressing, "naming things with hashes" use case via the existing ni:// (Named Information) URI schema (RFC6920).
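For reference, a rough sketch of how an `ni:` name is built under RFC 6920: hash the content with SHA-256, base64url-encode the digest, and drop the padding. The `ni_uri` helper name is made up here, and the sketch covers only the authority-less `ni:///` form with the `sha-256` algorithm.

```python
import base64
import hashlib


def ni_uri(data: bytes) -> str:
    """Build an authority-less ni: URI (RFC 6920) for the given bytes."""
    digest = hashlib.sha256(data).digest()
    # RFC 6920 uses base64url without trailing "=" padding.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"ni:///sha-256;{encoded}"


print(ni_uri(b"Hello World!"))
# ni:///sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk
```

The name depends only on the bytes, so it carries no DNS dependency and cannot rot the way a URL can.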

It's possible to define custom HTML tags in JavaScript, but I don't think that is what you want.


Indeed. It could be useful for rendering custom tags, but the problem with semantic content is that there needs to be an entry point somewhere.

No matter how I think about it, it seems that at some level there always needs to be a way to say "this piece of data is officially ba073188-8bd6-47a7-9065-eb55c4a8b908 (year of publishing)". Which means that whenever you see another ba073188-8bd6-47a7-9065-eb55c4a8b908, you and all your tools know the data refers to the same kind of thing.

Everything else can be bootstrapped.

XML solved this through namespaces + URLs. IMO, it was a kludgy and inelegant solution with weird syntax and unnecessary levels of abstraction, but at least it was something.

I've come up with a new notation, Mark (http://marknotation.org/), to address some of the issues faced with JSON and XML.

It keeps the simplicity of JSON, and is able to support mixed content at the same time.

At the moment, it leaves namespaces to the application.

Your sample in Mark would look like:

      {concept tag:computer-game guid:"c1a3e0cb-5872-4e5a-8dce-b8afc00772db"}
      {concept attribute:year-published guid:"ba073188-8bd6-47a7-9065-eb55c4a8b908"}
      {computer-game year-published:1993 "Doom"}

Looks pretty good. Not sure how I feel about including JSON as type of embedded content, though.

I've used something vaguely similar in the distant past, but it was strictly internal for my hobby projects and mostly resembled lisp with square brackets. It was way more primitive, but the parser (lexer, really) fit into two blocks of code. All the fancy stuff was done within DOM.

From the viewpoint of an application creator, what I find missing here is that very similar goals have been pursued by the OMG and other organizations. Similarly, RDF data works with F-Logic, XSB Prolog and a wide range of non-SPARQL tools. There has also been a proliferation of document and graph databases that compete w/ SPARQL databases. (Many semweb refugees use Couchbase or ArangoDB.)

I'd like to see that side-by-side with RDF because then you'd see we've made more progress than most people think.

I don't really see much connection with what OMG have done. Yes, they did do a little work on an Ontology Definition Metamodel, but I wouldn't take that too seriously. I don't think they understood what they were doing.

Similarly, I don't think graph databases have that much in common with RDF. Yes, both are graphs, but they're quite different, really.

What I've seen with OMG-promulgated standards is that the people involved know what they are doing but (1) they do a bad job communicating it, (2) people do a bad job of understanding it, (3) as with the W3C, compromise in the standards process leads the specification to miss the last 20% that you need to make something that really works, and (4) some adopters of the standard see filling that last 20% as what differentiates them from competitors, so the standard is not so standard.

There are many homologies between W3C and OMG standards. One of them is that there is a mapping between the semantics of documents and the semantics of API calls, object definitions, etc., linking all the way back to the CORBA standards. Another is between the XSLT/XPath functions and the "Object Constraint Language". The OMG and W3C maintain a largely overlapping list of primitive data types, for instance.

Neo4J and similar products tend to support the "property graph" model which can be modeled with RDF/SPARQL.

As for document databases, that gets to the magic about RDF which is most obscure: a document full of facts is an RDF "graph". You can take the union of all of the facts in two documents and that is also a graph. You can take the union of all the facts in two million graphs and run SPARQL queries on it without doing any data transformation or import!
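A toy sketch of that union property, assuming triples are modelled as plain Python tuples and a graph as a set of them. The `ex:` names and the `match` helper are invented for illustration; a real store would use IRIs and SPARQL, and blank nodes (which need renaming when graphs are merged) are ignored here.

```python
# Two "documents full of facts", each a set of (subject, predicate, object).
doc_a = {
    ("ex:doom", "ex:type", "ex:ComputerGame"),
    ("ex:doom", "ex:yearPublished", "1993"),
}
doc_b = {
    ("ex:doom", "ex:developer", "ex:idSoftware"),
    ("ex:quake", "ex:type", "ex:ComputerGame"),
}

# Merging is plain set union: no schema mapping, no import step.
merged = doc_a | doc_b


def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Which things are computer games, across both sources?
for subject, _, _ in match(merged, p="ex:type", o="ex:ComputerGame"):
    print(subject)
```

The query runs over facts from both documents without either source knowing about the other, which is the point being made above.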

This is a necessary condition for a "universal solvent" for combining data from multiple sources but RDF standards haven't been sufficient. Serious semwebbers know about techniques like "smushing" that go a long way towards finishing the job, but oddly these are not incorporated into standards or widely known among beginners.
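For readers who haven't met it, "smushing" roughly means merging nodes that share a value for an identifying (inverse functional) property, such as `foaf:mbox` in FOAF. The sketch below is a simplified illustration of the idea, not a reference algorithm: the `smush` function is hypothetical, it makes a single pass, and it does not handle chains where a node is linked through several different identifying values.

```python
def smush(triples, identifying_property):
    """Merge subjects that share a value for an identifying property."""
    # Pick one canonical subject per identifying value (first in sorted order).
    canonical = {}
    for s, p, o in sorted(triples):
        if p == identifying_property:
            canonical.setdefault(o, s)

    # Map every subject carrying that value onto the canonical subject.
    alias = {}
    for s, p, o in triples:
        if p == identifying_property:
            alias[s] = canonical[o]

    # Rewrite subjects and objects through the alias table.
    return {(alias.get(s, s), p, alias.get(o, o)) for s, p, o in triples}


triples = {
    ("_:a", "foaf:mbox", "mailto:alice@example.org"),
    ("_:a", "foaf:name", "Alice"),
    ("_:b", "foaf:mbox", "mailto:alice@example.org"),
    ("_:b", "foaf:knows", "_:c"),
}

for triple in sorted(smush(triples, "foaf:mbox")):
    print(triple)
```

After the merge, the facts about `_:a` and `_:b` hang off a single node, so a query for everything known about Alice no longer has to know she appeared under two names.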

These mappings are exactly what I'm talking about. I advised the OMG on one of the mappings in the ODM and I don't think they understood what they were mapping. To model OWL with UML gives you very little. To then map Topic Maps into the same level as OWL is ... just misguided.

Semantics means something very different to the OMG from what it does to the RDF/OWL community. That's the root of the problem. To the RDF/OWL folks it means "mathematically based logical inference of new statements", whereas to the OMG it seems to mean "human-readable text".

Yes, you can map a property graph into RDF, but to make it work well you usually have to add a lot of information that's not in the original data.

Thanks, I know how RDF and SPARQL work.

UML is not the only standard pushed by the OMG.

CORBA and related standards have well-defined semantics. So does BPML.

Human-readable definitions are important. One thing I see missing in both the OMG and W3C worlds is a realistic approach to model visualization. For instance if you try to draw a large OWL ontology or a large UML diagram you might need to blow it up to a full wall just to see everything, never mind understand it.

Really you need to be able to paint on graphical elements to a graph to show what nodes and relationships are relevant to a particular situation or use case.

Many people don't understand OWL because it doesn't actually "make sense". That is, without mechanisms for data validation, you don't know that inference is going to proceed in a correct way, rather you get a "garbage in garbage out" situation where you get new bad facts. Given that the official explanation doesn't make sense, it is natural that people fall back on something they understand.

XSB Prolog seems interesting to me, but it seems to have only a Java interface, no JavaScript/Node.js one? I'd appreciate any info about something similar for JavaScript.

I worked at a company that got sold on Semantic Web and made an initial investment in tools and training. While I found the concepts intriguing, the teams tasked with making it work gave up after about six months of trying to make it work with the use case they had. I understand that it can be a powerful set of concepts for certain kinds of use cases but it feels like the level of dedication and care needed to make it work is probably beyond many organizations' ability to execute.

The impression I got was that it was like deciding to use a Kibble Balance [0] to weigh yourself in the morning. You have to match the use case to the tool and for many organizations this simply will not be the right tool.

[0] https://en.wikipedia.org/wiki/Kibble_balance

I worked for a semantic web startup, back when it was the next big thing.

We had a tool (UI & backend) purportedly for managing ontologies, taxonomies, vocabularies.

Our customers were begging for real world solutions. Would have paid any price.

Our leadership (CTO) was mesmerized by metametametadata. Not kidding. And had zero interest in customers' real needs.

Such a missed opportunity.

The two lessons I took away...

1/ Most real world modeling problems are some narrowed use case of knowledge representation. Our customers didn't want a general purpose tool. They wanted something tailored (customizable) for their immediate use cases. As a UI designer, I guess I should have realized this quicker. My only defense is initial lack of customer interaction.

2/ At the time, for general purpose graph splunking, there was no UI solution for the "focus+context" problem. Human sized ways to query, represent, and navigate large graphs, all in one.

I did come up with a novel UI/UX that I felt would solve "focus+context", but we ran out of runway before I could get past the lo-fi prototypes.

On my to do list is to take another run at the problem, leveraging Neo4j's (awesome) Cypher query language. I may discover that Neo4j's UI may have already solved the "focus+context" problem.

Would love to see examples of useful ux. Do you have anything to share?

I have this hobby project where I’m thinking of using some kind of knowledge graph to represent beliefs about scientific facts and urban myths. An attempt to crowd source peer review of the pop-sci cited in online fora of various kinds.

My prototype app is a nutrition planner based on dietary recommendations sourced from the graph in question.

Not being a ux-designer, I’m a bit stuck on how to approach it.

Very belated reply, but I feel I owe you a response...

Were I to implement my UI today, it'd mostly look like a query builder for Cypher.

I mocked up my UI in the early 2000s. There was nothing comparable to Neo4j's Cypher query language. Sadly, I wasn't clever enough to invent it myself. It's so obvious once you see it. To the best of my knowledge, it's the only graph query language that explicitly models both the nodes and edges.

In sum, my UI would be more useful for developers, enthusiasts and less useful for any specific use cases.

Happy hunting!

Much appreciated, Thanks!

What tools do existing nutritionists use?

My target group for this would not be nutritionists. Just people like me with way too much curiosity about various topics.

But a fair suggestion, I should look into that.

The long term aim of the application isn’t nutrition though. I’m thinking more of a general stack exchange like platform for quality checking beliefs in various domains.

The MusicBrainz database might be a good example of how it would fit into an application ecosystem.


I'm glad someone pointed out Shirky's thesis (which you saw). You may also want to peek at folksonomy. https://en.wikipedia.org/wiki/Folksonomy

I used to believe the world was knowable and representable, definitively. Much like the aspirations of the Cyc effort. https://en.wikipedia.org/wiki/Cyc

I now believe the usefulness of any particular model and captured dataset depends on who's asking the questions.

For nutrition, there might be different views for laypersons, producers, nutritionists, researchers, etc. Sure, the domain is the same. But the details relevant to each, their use cases, will determine the schemas, the datasets, the granularity, the queries. Further, efforts to create the uber-nutrition-o-pedia knowledgebase, useful for all audiences, will inevitably disappoint.

In other words: One size fits none.

I don't mean to be a buzz kill. I'm just relating my pessimism after once having lofty ambitions. YMMV.


I have completely different notions about belief systems, fact checking, etc.

There have been many "what is true" efforts. There will be many more. A contributor at mondaynote.com writes about some school's effort (Berkeley?) and links to similar efforts.

I think they'll all fail to meet their goals.

As we learned from MC 900ft Jesus: Truth is Out of Style.

I no longer care if something is true. I only care who said it.

Every tidbit needs to be sourced, cited, digitally signed. So you can trace who said what when. Authoritatively. Anything without a signature is nothing more than gossip.

Then use the existing web of trust infrastructure. News outlets, bloggers, researchers, anyone who wants to be taken seriously will sign their real names to their works. If someone's cert gets pwned, or is revealed as a shell, then it can be revoked. And everyone will know.

Sorry if that's a lot to take in. I'm still chewing on the notion.

Ah Folksonomy. Had completely forgotten that del.icio.us site :)

I agree completely on authentication being important. I referenced IPLD in another comment. My thinking was to try to build on top of that initially, at least while prototyping, to really shed the idea of having a single moderated database to rely on. I can’t really imagine how to do that without strong authenticity.

When it comes to different audiences, I’m hoping to address that by layering. A layperson will read books, blogs and news. Those will contain various contradictions and misunderstandings of published research, and pure myths.

I’m thinking a similar layering. Start with any unqualified statement. If it is interesting (contradicts “wisdom”, leads to new prescriptions, or whatever), interested parties debate its quality, adding metadata and rules to the graph until consensus on desirable inferences is reached, rendering it uninteresting to debate further. I’m thinking something akin to Wikipedia authoring here.

Now people will have different agendas, so a common shared consensus on a specific set of conclusions is probably not viable. Therefore I’m thinking some kind of personalized, or community curated, axioms and rules will be a thing. Perhaps not unlike how rules for spam filters are maintained and used. As a byproduct, a formal definition of exactly which axioms and weightings lead to conflict between groups could be a useful tool in some contexts. (I’m hoping a system like this could help accelerate the diffusion of beliefs, so that we don’t spend decades debating whether global warming could be important the next time an important question comes up.)

Clarification: meant ad-blockers, not spam-filters.

Look for some 'Skin In The Game' to find what really works.

For example, HFT algorithms use indicators from financial data, news feeds and web scraping, with lots of encyclopedic and historical context. The scraping needs multilingual NLP and corpus ML analytics to extract facts, meaning and sentiment. There are many bad actors spreading fake news and selfish actors talking their book. Contradictions will occur. Some robust inference is needed until the conclusions can trigger inputs of micro+macro economic models of agents (firms, consumers, workers...), markets and whole economies. The models make predictions and the trading systems execute a strategy to make money from the insight. The whole process is constrained hierarchically by time and resources to deliver value over different horizons.

So I wonder what semantic technologies they use?

Would the blue check on Twitter or Instagram not count?

Well, they withdraw the blue tick as a punishment for people who say something unapproved. That person doesn’t stop being who they are!

The Semantic Web is dead. It was never really something. Even the regular decentralized Web has mostly died. It was replaced by mega corporations' walled gardens. 95% of content is in there; in Facebook, Google, Twitter, Instagram, Youtube, Netflix, Reddit, Twitch, Medium and a few small others. The rest is a skeleton barely being touched. The Semantic Web isn't even necessary or useful anymore, even if its technology were good.

Except for Wikipedia, no?

The irony of this paper kicking off with three word clouds reminded me of everything that happened with the semantic web community and movement.

The semantic web is only useful if you believe syllogisms are the way to understand the world.


Clay Shirky unfortunately has no idea what he's talking about. You can interpret a CSV file as a set of syllogisms, too, if you want. RDF is data, just like CSV, except the shape is different. CSV is good for some things. RDF is good for other things.

RDF is a fantastic way to aggregate data from many different sources, for example. CSV sucks at that.

The problem is that RDF has been marketed so poorly, which has totally confused people like Shirky.

Can you point to an example or two where RDF is a "fantastic way to aggregate data..."? My sense is that things like microformats, which are more flexible and narrowly defined than RDF, are useful for this, too, but it hardly constitutes a "semantic web". It's just obvious connections between things. Something like a custom type of hyperlink is all. Not that this "all" is bad, it's just not going to revolutionize the world or anything more than the existing web already has.

The semantic web can be used for much more. You can, for example, put an item up for sale, and crawlers would list it on aggregate sites. Same for events: you post a party invitation, which can automatically be inserted into your friends' calendars, and they are automatically added to the list of participants when they accept the invitation.

Yes exactly.

It reminds me of the difference between Machine Learning in the earlier days, which was based on hard-coded features, compared to things like vector embeddings that derive semantics and semantic connections in a bottom-up manner.

Thanks for that link. Where would one find follow up, or similar, arguments of equal quality? I’m hoping the debate has continued these last 15 years since that was published.

I suppose that means that at least the category theorists might get some use out of it.

What do you mean? Categorical logic is hardly restricted to syllogistic reasoning.

Category theory is the study of the one single syllogism, the mighty arrow.

The semantic web is alive in a different form, with the microblogging community using microformats.

See Indieweb.org and microformats.org, micropub, microsub, websub

I think the main reasons why the semantic web has not become popular are: a) A chicken-and-egg problem. There are no search engines or aggregate sites that make use of it, so we do not bother marking up our data. b) XML is very alien to most people; we need to build the semantic web into graphical user interfaces! c) It's hard to sell the concept, as it's hard to imagine the use cases. We somehow need to bootstrap by creating data and example services; people will want it once they see it.

It seems the Semantic Web is truly taking off with MIT/Inrupt.com's SOLID effort, where all communication between decentralized machines is via RDF triples. We are about to launch this commercially.

It may alarm you, but I cannot tell if you are joking or not.

Cisco was running Tuple Spaces data stores, in 2001. The queries/unifications were pretty simple, I can't recall their use case. But they were using a system internally and very happy with performance.

I never heard about it again, after a couple of conference papers.

I could imagine that the project died because it was impossible to sell. Semantic Web researchers find nothing odd in the discovery that a networking company has some understanding of applications of graph theory. But convincing upper management, or customers, that they can compete with Google and Oracle at the same time... no.

I hope your project breaks through.

Tuple space projects were pretty popular commercially.

There was a pretty decent implementation in Java[1] and very scalable distributed implementations.

The problem is that they are stuck in the space between the flexibility and developer friendliness of databases, and the KISS approach of a simple cache.

[1] https://en.wikipedia.org/wiki/Tuple_space#JavaSpaces
