Hacker News
Ask HN: What happened to the semantic web?
109 points by drdrey 4 months ago | 68 comments

The Semantic Web Meetup happens at MIT in Cambridge, MA occasionally: https://www.meetup.com/The-Cambridge-Semantic-Web-Meetup-Gro...

Though it seems mostly cancelled / sporadic now, it had a lot of interesting people presenting on interesting academic uses of the semantic web / RDF / etc.

A couple times I was there Tim Berners-Lee himself was there too. He's an interesting guy to meet.

Overall though, I think that for business reasons (companies really are not incentivized to share) it has mostly caught on in academia. One shining example outside academia is "microformats", which gained adoption because companies like Google adopted them as a way to make gathering (as opposed to sharing) data easier.


Personally I found a lot of aspects useful but others not all that well thought out when it comes to practical specs. The community has a tendency to try to build complete taxonomies rather than taxonomies that have long term usability. As a result they become stale. For example, Friend of a Friend (FOAF) [1] is nice, but it is very narrowly specced in some areas and not in others: there is a tag for AOL Instant Messenger ID but none for Facebook.

Microformats have some similar issues in a way, though not as bad.

[1] http://xmlns.com/foaf/spec/

Most technologies that were specific to the "Semantic Web", such as OWL and SPARQL, failed to scale and failed to solve realistic problems, and therefore died. (I always maintained that running a SPARQL endpoint amounted to running a DDoS on yourself.)

However, we got something kind of cool out of the RDF model that underlies it, especially when some sufficiently opinionated developers identified the bad parts of RDF and dumped them. We got JSON-LD [1], a way for making APIs describe themselves in a way that's compatible with RDF's data model. For what I mean about sufficiently opinionated developers, I recommend reading Manu Sporny's "JSON-LD and Why I Hate the Semantic Web" [2], a wonderful title for the article behind the main reason the Semantic Web is still relevant.

Google makes use of JSON-LD in real situations: for example, an airline that uses JSON-LD can send you an e-mail that Google Assistant can use to update you on the status of your flight, and Gmail can use to give you a simple button for checking in.

[1] https://json-ld.org/

[2] http://manu.sporny.org/2014/json-ld-origins-2/
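
As a rough illustration of what "compatible with RDF's data model" can mean, here is a minimal Python sketch that flattens a tiny JSON-LD-style document into triples. It is not a real JSON-LD processor: the `to_triples` helper, the example.com identifiers, and the simplified context handling (plain term-to-IRI mapping on a single flat node) are all assumptions made for illustration.

```python
import json

# A tiny JSON-LD-style document. The @context maps short keys to full IRIs.
# The example.com identifiers are made up for this sketch.
doc = json.loads("""
{
  "@context": {
    "name": "http://schema.org/name",
    "knows": {"@id": "http://schema.org/knows", "@type": "@id"}
  },
  "@id": "http://example.com/alice",
  "name": "Alice",
  "knows": "http://example.com/bob"
}
""")


def to_triples(document):
    """Flatten one flat JSON-LD-style node into (subject, predicate, object) tuples.

    Hypothetical helper: it handles only a single flat node and a simple context.
    """
    context = document["@context"]
    subject = document["@id"]
    triples = []
    for key, value in document.items():
        if key.startswith("@"):
            continue  # skip JSON-LD keywords such as @context and @id
        term = context[key]
        # A term is either a bare IRI string or a dict with "@id" (and maybe "@type").
        predicate = term["@id"] if isinstance(term, dict) else term
        is_iri = isinstance(term, dict) and term.get("@type") == "@id"
        obj = f"<{value}>" if is_iri else json.dumps(value)
        triples.append((f"<{subject}>", f"<{predicate}>", obj))
    return triples


for s, p, o in to_triples(doc):
    print(s, p, o, ".")
```

The point of the sketch is only that ordinary-looking JSON keys can be read as RDF predicates once a context is attached; a real application would use a proper JSON-LD library rather than this helper.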

I think the main issue is that even though "knowledge representation" with ontologies is an enticing goal, it's simply a fact that real entities, as used by humans at a practical level, don't map neatly onto mathematically-sound hierarchies. To see this, just look at the arguments the ancient Greeks already had as to whether a human is a "two-legged featherless animal" or the endless online arguments as to whether a "circle is an ellipse" or vice versa.

Because of this, there's just not much utility in taking the time to generate semantic markup: it'll be sloppy and incomplete even when done by a PhD student specializing in this subject.

But RDF was pretty good with that, because it didn't force records into hierarchies. You could just define relationships, and then create them freely between records. Some people created heavy ontologies, but it wasn't a result of the data model.

Right, but the whole dream of the "semantic web" to me was that one website could discuss red wines, and another website could discuss Vietnamese food, and then a reader could use the semantic vocabulary shared between the two sites to arbitrarily ask questions such as "what wine will go best with this dish?"

If you just have a vocabulary where everyone can freely define concepts and their relationships in a fuzzy way, this original goal will never be tractable. There needs to be some sort of unambiguous shared concept space between disparate sites (which in my estimation appears to not be achievable in any practical sense, due to the difficulty in finding "one true way" to build ontologies).

Right, the vocabularies were supposed to be reused as much as possible, but that doesn't require heavy hierarchies, just that they were available. A single record could have facts/relationships defined by any number of ontologies, so they didn't have to be all-encompassing either.

Also, I think there were ways to "map" different ontologies, though I never really explored that.

Then you have the whole adversarial network aspect. People will intentionally pollute the category system for purposes of advertising, propaganda, jokes, etc. Google bombing is basically a kind of semantic hijacking that happens already and we don't even have a semantic web.

The test for any proposed Internet standard or system should be "what happens when 4chan hears about it?" I don't see semantic web ideas taking off outside of closed forums and walled gardens like academic research or the military. On the public Internet you'll rapidly end up with Donald Trump mapping to "small penis," etc.

I suppose you're talking about OWL. The thing is most programmers look at OWL, see the words class and property and immediately think that it's OOP, that they don't have to look more into it and it has the same problems.

But OWL is not based on OOP, it's based on Description Logic, which is a much more powerful abstraction than OOP and it lets you easily represent things which are very hard with something like Java. OWL includes the concept of complex class, in which you define the logical constraints of the class and then it is inferred automatically by a reasoner. This means that you can build really complex multidimensional hierarchies pretty easily.

For example, you can solve the circle/ellipse problem this way: the class circle is a complex class which is the intersection of the class ellipse and the class of two dimensional geometric shapes in which both major and minor axis have the same length. Any object that satisfies those constraints is a circle!
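
To make the idea of a class defined by constraints a bit more concrete, here is a small Python sketch in which membership is computed from an individual's properties instead of being declared up front. This is only an analogy and not OWL or a Description Logic reasoner (there is no open-world assumption, for one); the `Shape` type, the predicate functions, and `classify` are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """A hypothetical individual described only by its properties."""
    kind: str
    dimensions: int
    major_axis: float
    minor_axis: float


def is_ellipse(shape):
    return shape.kind == "ellipse"


def is_circle(shape):
    # Circle = Ellipse AND two-dimensional AND (major axis == minor axis)
    return (
        is_ellipse(shape)
        and shape.dimensions == 2
        and shape.major_axis == shape.minor_axis
    )


# Each class is defined by a membership test, not by explicit inheritance.
CLASS_DEFINITIONS = {"Ellipse": is_ellipse, "Circle": is_circle}


def classify(shape):
    """Return the sorted names of every class whose constraints the shape satisfies."""
    return sorted(name for name, test in CLASS_DEFINITIONS.items() if test(shape))


round_one = Shape(kind="ellipse", dimensions=2, major_axis=3.0, minor_axis=3.0)
stretched = Shape(kind="ellipse", dimensions=2, major_axis=3.0, minor_axis=2.0)

print(classify(round_one))
print(classify(stretched))
```

Nothing ever declares `round_one` to be a circle; it ends up in that class because it satisfies the definition, which is roughly the job a reasoner does for complex classes.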

About the Greek problem: you have to declare that the human class and the complex class which results from the intersection of the class two-legged animal and featherless animal are equivalent. It means that every human is a two-legged featherless animal and vice versa.

You can even declare equivalences between ontologies, which lets you build conceptual bridges.

OWL has problems related to the maturity and performance of its implementations, and it remains to be seen if it's possible to treat the web as a gigantic Prolog program, but its conceptual model is powerful and sound.

> don't map neatly onto mathematically-sound hierarchies.

I think they do, but finding the right mathematical model is very difficult. If it were easy, everyone would be a mathematics PhD.

Learning to program is becoming efficient at recognizing the right spherical cow in any given situation, because such shortcuts are essential to getting shit done.

I think you'd change your tune (like I did) once you dig into the ugly details of such mathematical modeling. A book like this will show you just how hard this is https://www.amazon.com/Knowledge-Representation-Reasoning-Ar...

I recognize it's difficult since I literally said that. But programming is simplified mathematical modelling of a sort (for instance, via Curry-Howard), hence my "spherical cow".

Are viruses alive?

(To pick one question where something doesn't map onto a hierarchy simply)

Precisely define "alive" and you'll have your answer.

There's a difference between correct ontologies and the ones humans use, however. We may succeed at mapping all domains effectively, but in order to get people to use those maps, you would have to convince people that a pop tart is a kind of ravioli.

I disagree. It all boils down to the way we use language - words have meanings, but those meanings are just pretty fuzzy boundaries in conceptspace. It seems that words represent concrete things from a distance, but if you look closely, it all breaks down. Hell, the meaning of words also heavily depends on context in communication, so they can paint barely overlapping borders in conceptspace from one conversation to another.

Programming is actually great for discovering this. Especially OOP, with its introductory examples of animal taxonomy and shapes.

I'm not sure that you disagreed with anything I said.

The Semantic Web is incompatible with the commercial incentives of most technology companies. For instance, it would currently be irrational for Facebook to voluntarily publish their social network using the friend of a friend schema. Their profit is derived from their centralized, private ownership of this data. Hopefully we can move towards a decentralized or federated, public web.

Yes, once everyone thinks that 'Data is the new oil', things like public shared standards fly out the window.

There are many open upper level ontologies available (I counted 16 when I did a review a few years ago - http://www.acutesoftware.com.au/aikif/ontology.html), but the really complete ones are not publicly available (Cyc full version, Google's internal ontology and the countless others held in corporate servers).

Where is Google's internal ontology used exactly?

On Google Knowledge Graph https://youtu.be/mmQl6VGvX-c

I don't know, but I just assumed they use it everywhere. They have a very good mapping of related terms and a fairly consistent mapping.

A visible example is when you look for organisations and they have a classification against it.

e.g. Google IBM and they call it "Computer manufacturing company" - these classifications are different to many of the standards for specific sets of data

Where do you see google calling IBM a "Computer manufacturing company" if you search for IBM? I'm not saying they don't. I just want to see examples of what you are talking about.

I googled IBM, and did not see it classified as such.

When I am logged on to Google I see company details in a right hand side pane (when the search term is unambiguous)

For IBM it says

  Computer manufacturing company
  Image result for ibm
  IBM is an American multinational technology company headquartered in Armonk, New York, 
  United States, with operations in over 170 countries. Wikipedia
  Stock price: IBM (NYSE) USD155.39 +2.71 (+1.77%)
  10 Apr., 4:00 pm GMT-4 - Disclaimer
  Founder: Charles Ranlett Flint
  Founded: 16 June 1911, New York City, New York, United States
  Headquarters: Armonk, North Castle, New York, United States
  Subsidiaries: Trusteer, FileNet, IBM Global Services, Ustream, MORE
  Executives: Ginni Rometty (CEO, President, Chairperson), MORE
  Did you know: IBM is the world's eighth-largest information technology company by revenue. 

It does for me. But this actually highlights one of the problems with classifications like this. Is IBM really primarily a computer manufacturing company today? Their systems business is only about 15% of revenue.

The Semantic Web as originally imagined is also incompatible with privacy, and with malicious fake data. So is the decentralized Web you imagine, I think.

Let's keep using FOAF as an example. The facts about who knows who in FOAF are just bare RDF triples. There's nothing about who's allowed to know who knows who. There isn't even room to specify who's allowed to know who knows who. If any significant number of people had described their friends and relationships with FOAF, all of it would quickly have been slurped into a marketing database.

There's also no room in traditional Semantic Web ontologies to keep track of the provenance of why you believe something, and to disbelieve something that comes from an unreliable source. Every triple is supposed to be a statement of fact that you can derive things from as if it is 100% true. You could use FOAF to say you're married to Tim Berners-Lee, and not even Sir Tim would have a way to say "no you're not".

First, does it matter much if agents spread misinformation in an unstructured versus structured format? I can already assert that I am married to Tim Berners-Lee. I will do so right now: I am married to Tim Berners-Lee.

I will do so again:

  <http://example.com/smadge> <http://schema.org/name> "smadge" .
  <http://example.com/smadge> <http://schema.org/spouse> <http://example.com/timbl> .
  <http://example.com/smadge> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .

I think the maxim "not everything on the internet is true" applies to linked data just as much as unstructured data.

Second, although again I am out of my area of expertise, I think you can make provenance statements about statements using reification in Semantic Web technologies. I don't know if this is a good source but it seems to suggest it's possible [1].

[1] https://wiki.blazegraph.com/wiki/index.php/Reification_Done_...

Reification is at least part of the solution to provenance, but none of the Semantic Web ontologies you've ever heard of use reification, and there is no upgrade path to make an ontology based on plain un-reified triples use reification.

To continue this example: you asserted, in English, "I am married to Tim Berners-Lee". In English, anyone can respond "No you're not".

Then you said it again in RDF, in a way that hypothetically a computer system would use to draw conclusions. And there is no way to say "no you're not" in RDF.
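
To illustrate what the reification idea discussed above is after, here is a minimal Python sketch in which each triple is wrapped in a statement record that carries who asserted it and who disputes it. This is an analogy, not RDF: the `Triple` and `Statement` classes, the `accepted` function, and the trust policy are all hypothetical, and real RDF reification uses the `rdf:Statement`, `rdf:subject`, `rdf:predicate` and `rdf:object` vocabulary instead.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str


@dataclass
class Statement:
    """A triple plus metadata about who asserted it and who disputes it."""
    triple: Triple
    asserted_by: str
    disputed_by: set = field(default_factory=set)


def accepted(statements, trusted):
    """Keep triples asserted by a trusted source and not disputed by any trusted source.

    This policy is an assumption for the sketch; other policies are possible.
    """
    return [
        st.triple
        for st in statements
        if st.asserted_by in trusted and not (st.disputed_by & trusted)
    ]


SMADGE = "http://example.com/smadge"
TIMBL = "http://example.com/timbl"

name_claim = Statement(Triple(SMADGE, "http://schema.org/name", "smadge"), asserted_by=SMADGE)
spouse_claim = Statement(Triple(SMADGE, "http://schema.org/spouse", TIMBL), asserted_by=SMADGE)

# The "no you're not" step: the other party disputes the statement itself.
spouse_claim.disputed_by.add(TIMBL)

for triple in accepted([name_claim, spouse_claim], trusted={SMADGE, TIMBL}):
    print(triple)
```

With plain un-reified triples there is nowhere to hang `asserted_by` or `disputed_by`, which is the gap both sides of this exchange are pointing at.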

Well, some tech companies. StackOverflow, for example, now annotates their questions and answers with http://schema.org/Answer. But then again, they've always been good citizens in that regard: https://archive.org/details/stackexchange

Ontologies are hard. Curation is harder. People are lazy.

The ideas are still around; some [1] were lifted by Facebook [2], for example. There's also continuation work that's related, like web annotations [3], but generally the commercial web is moving even more away from neatly-organized resources [4] and towards Javascript state machines [5].

[1] https://web.archive.org/web/20160713021037/http://dig.csail....

[2] https://developers.facebook.com/docs/graph-api/overview/

[3] https://news.ycombinator.com/item?id=13729525#13740110

[4] https://news.ycombinator.com/item?id=12206846#12207459

[5] https://news.ycombinator.com/item?id=12345693#12346371

I think you hint at great points, especially with "People are lazy".

The semantic web was a great idea, but in the period from 2000 to 2010, people advertised it as a kind of AGI that would solve all hard problems with junk data.

It is still used in biology, for example in Gene Ontology [0] but the main use case (People are lazy) is "If your research cannot find interesting stuff, just query Gene Ontology".

[0] https://en.wikipedia.org/wiki/Gene_ontology

While re-reading what I wrote, a part of a sentence makes me wonder "a kind of AGI that would solve all hard problems with junk data"

What does AGI mean in this context?

One of the big things to come out of the semantic web was RFD-A (embedding semantics in unstructured web pages) and similar technologies (microformats, JSON-LD, schema.org). It's what lets Google show product reviews and rankings in search results, and lets shopping aggregator sites show things like price comparisons from other websites. While it's probably not as widespread as its boosters from a decade ago hoped it would be, it did lead to some helpful technologies that are in widespread use now.

I wonder if Facebook won't someday be forced to publish its social graph data in FOAF format the same way Microsoft was forced to publish its Office document specs as part of an anti-trust decision.

Speaking of Facebook, the OpenGraph tags are another example of widely-used semantic data on the web, maybe the most widely-used, since all kinds of sites pull in page summaries, images, and other data from those tags. So while Facebook doesn't make social network data available, it did popularize a format for sharing other types of data (about companies, articles, websites, etc.).
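
For a sense of how a consuming site might pull a page summary out of those tags, here is a short sketch using only Python's standard-library HTML parser. The sample page and the `OpenGraphParser` / `extract_open_graph` names are made up for the example, and a real consumer would need to handle fetching, encodings, and malformed markup.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        if prop.startswith("og:") and "content" in attributes:
            self.properties[prop] = attributes["content"]


def extract_open_graph(html_text):
    """Return a dict of Open Graph properties found in the given HTML text."""
    parser = OpenGraphParser()
    parser.feed(html_text)
    return parser.properties


# A made-up page used only to exercise the parser.
SAMPLE_PAGE = """
<html><head>
<title>Example article</title>
<meta property="og:title" content="Example article">
<meta property="og:type" content="article">
<meta property="og:image" content="https://example.com/thumb.png">
<meta name="description" content="Not an Open Graph tag">
</head><body></body></html>
"""

print(extract_open_graph(SAMPLE_PAGE))
```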

Sorry, that should be "RDF-A," not "RFD-A."

Long ago I wrote a blog post as an introduction with an identical title: https://joelkuiper.eu/semantic-web

At our company we still use Semantic Web (or rather, RDF) for inference and annotation with medical ontologies (UMLS, Gene Ontology, Human Phenotype Ontology, etc). The ease of use of triples + SPARQL (basically a PROLOG-ish unification scheme) is really powerful (and quite performant when using Jena/Fuseki with Lucene as a text index). But it's a far cry from the "dream" of semantic web like federated queries and OpenAnnotations (now just W3C Annotations). Still, every time someone implements an EAV scheme without even considering an RDF triple store I cringe a bit.
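
The "PROLOG-ish unification" flavour of querying triples can be sketched in a few lines of Python. This is a toy matcher and not SPARQL or Jena: the `match` and `query` functions, the `ex:` identifiers, and the data are all invented for illustration, and it does none of the indexing a real triple store relies on.

```python
def match(pattern, triple, bindings):
    """Try to unify one pattern with one triple.

    Terms starting with "?" are variables. Returns the extended bindings,
    or None if the pattern does not fit the triple.
    """
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against an earlier binding.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def query(patterns, triples):
    """Return every set of variable bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in triples
            if (extended := match(pattern, triple, bindings)) is not None
        ]
    return solutions


# Toy data with made-up identifiers, not taken from any real ontology.
TRIPLES = [
    ("ex:GeneA", "ex:associatedWith", "ex:DiseaseX"),
    ("ex:DiseaseX", "ex:hasPhenotype", "ex:PhenotypeP"),
    ("ex:GeneB", "ex:associatedWith", "ex:DiseaseY"),
]

# "Which genes are associated with a disease that has phenotype P?"
PATTERNS = [
    ("?gene", "ex:associatedWith", "?disease"),
    ("?disease", "ex:hasPhenotype", "ex:PhenotypeP"),
]

print(query(PATTERNS, TRIPLES))
```

The shared `?disease` variable is what joins the two patterns, much like a repeated variable in a SPARQL basic graph pattern.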

It was the sort of largely academic top-down exercise in organizing information that has mostly lost out time and time again to more organic bottom-up/self-organizing approaches. Think Yahoo vs. Google. [ADDED: i.e. manually populating hierarchies vs. search, in case that wasn't clear] I remember when it was going to be Web 3.0. Tim Berners-Lee gave a talk about it when he won the Draper prize.

As others have said, classification is difficult under the best of circumstances. And it just doesn't fit with the way the Internet has evolved. We have Wikipedia, not the Encyclopedia Galactica.

I think there's a lot in this. The first users of the Internet saw themselves as librarians and curators, and sought to impose that vision of the world on everyone else. For a long time, people had trouble with the idea that everything didn't need to link to everything else.

Hierarchical structures are how we organized things historically. So I think it's pretty natural. I know that for a long time I was relatively careful about filing email, files, etc. into a folder hierarchy and categorizing my music collection. I won't say folders (and tags/labels) don't still have their uses. But I've definitely moved away from spending so much time upfront carefully organizing stuff that I may want to find some tiny percentage of some day. Instead I mostly figure I can search for it if I need to.

It happened.

We got meta tags that tell us the published date, author and type of web page.

We got schema for job ads.

We got schema for recipes.

We got schema for thumbnails and images associated with a webpage.

We got schema for ecommerce products

This is all speculation and I have no idea of the actual roadmap for the specs. As I was reading this comment it gave me another reason to love component based architecture... I would think it would make sense just to allow users to self define stuff like that rather than try to do everything top down.


Google guides for SEO show you how the semantic web happened.


No one knows it's called the semantic web these days. It's just what you have to do to have your page get picked up and highly ranked by Google, and to get more links from direct product traffic.

Turns out it's not profitable to encourage understanding, rather it's better to be a hosted service provider, keep knowledge in a walled garden and charge for it.

1. We've realized that people in general can't reliably and consistently mark data up. That's a problem of incentives, technical difficulties, UI, bitrot of invisible metadata, etc.

2. We've settled on extracting information from "raw" text (with everything from regexes to recognize flight info in e-mails to getting word statistics from terabytes of garbage) and duct-taping that with special-purpose APIs.

The flight info example is one of the places where semantic web tech went mainstream. Those flight emails have embedded metadata in JSON-LD (linked data) format and Gmail uses it for more specialized display[1].

[1]: https://developers.google.com/gmail/markup/reference/flight-...
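
As a sketch of what that embedded metadata can look like, the Python below builds a schema.org `FlightReservation` object and wraps it in a `<script type="application/ld+json">` tag for the HTML part of an e-mail. The function name and every value are made up, and which fields a given mail client actually requires is not something this sketch claims to know; the linked reference is the place to check.

```python
import json


def flight_reservation_markup(reservation_number, passenger_name, flight_number,
                              airline_name, airline_code,
                              departure_code, departure_time,
                              arrival_code, arrival_time):
    """Return a JSON-LD script tag describing one flight reservation.

    Hypothetical helper; the property names follow schema.org, but the set of
    fields chosen here is an assumption.
    """
    data = {
        "@context": "http://schema.org",
        "@type": "FlightReservation",
        "reservationNumber": reservation_number,
        "underName": {"@type": "Person", "name": passenger_name},
        "reservationFor": {
            "@type": "Flight",
            "flightNumber": flight_number,
            "airline": {"@type": "Airline", "name": airline_name, "iataCode": airline_code},
            "departureAirport": {"@type": "Airport", "iataCode": departure_code},
            "departureTime": departure_time,
            "arrivalAirport": {"@type": "Airport", "iataCode": arrival_code},
            "arrivalTime": arrival_time,
        },
    }
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# All values below are invented for the example.
markup = flight_reservation_markup(
    reservation_number="ABC123",
    passenger_name="Jane Example",
    flight_number="110",
    airline_name="Example Air",
    airline_code="XX",
    departure_code="SFO",
    departure_time="2027-03-04T20:15:00-08:00",
    arrival_code="JFK",
    arrival_time="2027-03-05T06:30:00-05:00",
)
print(markup)
```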

>That's a problem of incentives, technical difficulties, UI, bitrot of invisible metadata, etc

Perception, culture, linguistics, time, reality.

It pivoted to Linked Data [1] with less focus on ontologies and AI and more focus on linking, open data and a Web of Data [2].

One nice demo of the latest advances is how you can query Wikidata client-side without downloading the whole database for queries like "Directors of movies starring Brad Pitt": http://ldfclient.wmflabs.org

[1] https://en.m.wikipedia.org/wiki/Linked_data

[2] https://www.w3.org/2013/data/

It was always a cruel joke, never to be taken seriously.

At its core were SEO hucksters trying to pass off page rank hacks as a business model for consulting work, during the post-dot-com bust period, when money was scarce and web design couldn't pay the bills anymore.

Many ascended to the priesthood of RESTful web microservice development, where they pooh-pooh and tsk-tsk improper path grammar and noodle with JSON objects, in between periods of intense navel gazing.

I spent 2 years using a semantic reasoner to develop an ontology for reasoning about smartgrid vulnerabilities. Ignoring the web aspect, ontologies are very hard. In addition, one needs to use multiple languages: one to express the ontology, and another to express a query. Change the ontology a little bit and the query will break when you run it. There was no integrated IDE that was complete.

I had a project where we wanted a ticketing / events ontology and budgeted 6 weeks for three people to build it; in the end we spent probably 3 person years on it, which was dim... but we got suckered in by the idea that the ontologies themselves would be valuable (spoiler - nope).

So, knowledge engineering scales badly, but there were other problems. There was a big debate in the EU community about what kind of reasoner to use, and for some god-awful reason F-logic was chosen; at the time we thought that reasoners like Otter wouldn't be able to scale and do FOL tractably. It's a shame that answer sets and MCMC probabilistic reasoners came 10 years later - I think that the weak reasoning and poor representation systems were big gaps.

The other problem was institutional, the way that EU semantic web funding worked, and the way that the projects developed. A lot of money was spent, and then there was no money - there was no self sustaining legacy.

It didn't add any value for commercial entities (and minimal immediate value beyond self-satisfaction for non-commercial entities) so they didn't devote any resources to implementing it.

What about NIEM (National Information Exchange Model) [1][2]?

As I see it, this tech is supposed to be a replacement for paper documents and the medium for government information arbitrage. The only obstacle to using it everywhere is the structural complexity of NIEM and the lack of tools. I've spent a bit of time hacking on it with XML queries and my mind is blown [3].

You can interpret NIEM as a type system similar to types in programming languages, but for composing electronic documents; it could be integrated with payment systems. I think progress will go two ways: composing new documents will be happening with NIEM, older docs could be converted with natural language processing.

The latest version 4.0 is dated 2017, and the US has spent lots of money to build an XML representation of real-life objects.

[1] https://www.niem.gov/

[2] https://en.wikipedia.org/wiki/National_Information_Exchange_...

[3] https://github.com/NIEM

In a way Freebase and DBpedia were/are practical applications of the concept. Now when you search on Google they try to understand a simple query and send you the answer. In Freebase you could write queries about facts retrieved from many sources.

The utopia is more than this but I assume that few people will use these tools directly.

It was a solution looking for a problem.

Hardly. The promise is still there, but there are barriers in place to get there.

One of the most useful aspects of the semantic web is how it enhances the search for information. Some web citizens have become conditioned to see Google as the pinnacle of what we can achieve through search, but we can do a lot better. Let's use an example to illustrate this. Imagine a presidential election was taking place and you want to understand the positions of the candidates on topics that matter to you. Let's say foreign policy was something you were interested in, including their proclivity for war. By allowing for searching on a richer set of metadata you can more easily access the information about the positions of these candidates, without the distortions of Google's page rank algorithms. Think of it like treating the information of the web as a database you can query more directly. That's the main promise of the semantic web.

I would offer more specifically that it was trying to solve a problem with existing workarounds for where the problem existed, without checking if anyone else cared enough to write more sgml

The simple reason is that people are lazy. Someone isn’t going to put in the extra work of marking up their text with the correct semantic structure if they don’t get much out of it, and the ROI for an individual site owner was dubious at best. People keep talking about RDF, but that barely qualifies IMO; it was more of an agreed-upon RFC so that search engines didn’t force site owners to all adopt differing indexable formats (events, addresses, etc). Even with the backing of Google, RDF is not something that most sites are doing until they start tackling some enterprise grade SEO optimizations.

We got separate JSON APIs instead of machine-readable web pages.

It became useful for people whose job is using public data or providing it, such as in the life sciences, governments, archaeology/history and more. It allows for nice bottom-up standards and user interfaces such as flight data in your e-mail.

If you're not consuming a lot of public data or providing data to the public, it is not very useful other than having a bunch of better graph databases associated with it.

The semantic web is alive and well. It's just not in the places you're looking. I recommend checking out indieweb.org for a community devoted to building on the semantic web. Just because the big websites aren't using it doesn't mean the technology is dying.

The (semi)automatic annotation never really happened. There are ontologies, and there are large amounts of raw data everywhere on the web, but we haven't discovered a way to reliably turn that data into RDF triples matching the ontology without doing it by hand.

Google got good enough at divining the content and meaning of a page without needing magical XML pixie dust to annotate the facts on it.

I think the triple model is awesome, but we haven’t been able to develop decent triple stores to back it up.

Machines should learn to understand human language, not the other way around.

It was a stupid idea, although remnants can be seen in HTML5 with elements like address, nav, and section.

Turns out that keeping presentation and data separate is much, much easier. Hopefully HTML 6 will get rid of everything except for div, span, and form elements.

I wasted time, money and effort on it.

What happened is that it was pointless.

Build something people want, not the semantic web.

Answer: it became obsolete by the use of Machine Learning.

uhh...haven't you heard of JSON-LD?

We call it "blockchain" now.

A lot of the solution in search of a problem work that went into semantic web just shifted to the crypto currency space.
