Whatever Happened to the Semantic Web? (twobithistory.org)
376 points by ColinWright on Sept 20, 2018 | 201 comments

I have some insight here because I did a postdoc working on anatomy ontologies in the UK. A big part of the problem with the semantic web is that lots of people in European academia use it as a collection of buzzwords for making grant proposals sexier, without understanding or caring what it actually means.

Instead of saying, "Give us money to build a webpage", they say, "Give us money to expose metadata annotations using a RESTful API on the semantic web."

I would prepare conference presentations where I was just filling slides up with BS to fill time.

Devs from other universities (gotta check that international research box!) understood the technology even less than our team did. We provided them a tool for storing RDF triples for their webpage so they could store triples about anatomical relationships. They wanted to use said RDF store as their backend database for storing things like usernames and passwords. facepalm

So you have all these academics publishing all this extremely important sounding literature about the semantic web, but as soon as you pry one nanometer deep, it's nothing but a giant ball of crap.

Wow, had such a similar story, except with a Master's degree. Most of my graduate program was consumed by a highly-cited and quite arrogant professor who focused on the Semantic Web. I took 2(!) separate Semantic Web courses and couldn't understand what the fuss was all about.

By the end, I figured out the professor was full of crap. When AJAX became a big thing, I remember the professor asking, "Why don't you add AJAX to this project?" What does that have to do with the Semantic Web? In the end, I got a paper published in a fairly prestigious journal by just combining some flashy visuals with the Semantic Web and having the professor be a co-author.

That was one of the big experiences that helped me figure out that academia wasn't for me. And I would be perfectly fine never hearing about triples, RDF, or that other nonsense again!

Yup, I was in academia in the UK and Germany for a total of 7 years. Had a similarly crappy experience and got out. I think 80% of research is crap done "just to get the grant" and not real research.

>>When AJAX became a big thing, I remember the professor asking, "Why don't you add AJAX to this project?"

Sounds like something my manager at my previous job would say. He was a Pointy Haired Boss.

Yeah, semantic web really hacked the brains of academic-facing bureaucrats. It fell into this giant gap between what administrators don't know about business and what they don't know about technology... a gap big enough to shove every utopian idea about "an effortlessly integrated, data driven society" into.

There's no such thing as the "right" way to represent any given data stream, just ways that are more or less suitable to specific tasks and interests. That's why HTML failed as a descriptive language (and has become a fine-grained formatting language), and it's why the semantic web was DOA.

> That's why HTML failed as a descriptive language

I think HTML and the web failed in general. Modern HTML is really nothing more than div tags everywhere, with a handful of span tags. We went from abusing tables to abusing the entire document. We, in effect, eliminated all semantic meaning from a document by making everything generic tag soup.

The DOM + JS has largely supplanted HTML as the source of a web page. Especially when using tools such as React or Angular.

In terms of vision, the rise of native phone apps and the fact that every major site has a mobile version and a separate desktop version really highlights how the web failed.

I do node/React dev for a living. I'll be the first to admit this pile of hacks is total garbage. Mobile web is almost unusable. I hate it. I hate the sites I work on. Their UX is horrid. Native apps are so far superior that they make the web look like an embarrassing relic. But web development pays the bills and keeps the lights on.

> Modern HTML is really nothing more than div tags everywhere, with a handful of span tags. We went from abusing tables to abusing the entire document.

True but there is some reaction to this. On one hand there are people like yourself that just make the document from data and javascript using lots of build tools, frameworks and ever more specialist technologies.

On the other hand there are people wanting to get back to a pure document that has next to no javascript needed, using HTML5 built-ins for forms, CSS Grid for layout, no polyfills for legacy browsers and no frameworks. This is not widely given exciting buzzwords but 'intrinsic web design' is happening.

When I watch conference presentations it seems to me that there are two groups of people, those to whom the ever more complicated appeals and those to whom the make it simple appeals.

The 'make it simple' currently does not fit with existing workflows, maybe for startups but not for most web agencies where a good decade has now been spent nesting divs in more divs and making it a big mess of un-maintainable 'write only' code, with 97% unused stylesheets that have got to the stage where nobody knows what anything does, they just add new hacks to it.

With where we are with HTML5 it should be easy to mark up your document semantically; however, Google have kind of given up on that, and if you want to put some semantic goodness in there then you add some JSON-LD on the end rather than put property tags in everything throughout the document. It is as if Google would prefer the document to be doubled up, once with trillions of divs for some bad CSS and then done again to be machine readable.
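For readers who haven't seen the "JSON-LD on the end" approach, here is a minimal Python sketch of what gets appended to a page. The `json_ld_script` helper and the sample headline and author are made up for illustration. Only the `application/ld+json` script type and the schema.org `@context` / `@type` keys are real conventions.

```python
import json


def json_ld_script(data: dict) -> str:
    """Serialize a dict as a JSON-LD <script> element (hypothetical helper)."""
    payload = json.dumps(data, indent=2)
    # Escape "</" so the payload cannot close the script element early.
    # "\/" is a valid JSON escape for "/", so the data is unchanged.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical values, for illustration only.
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Example headline",
    "author": {"@type": "Person", "name": "Jane Doe"},
}

print(json_ld_script(article))
```

The visible markup stays whatever div soup it already was. The machine-readable description rides along as a separate block, which is the "doubling up" complained about above.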

Regarding mobile, 'progressive web apps' is widely supported and has removed the need for custom mobile applications. This is progress.

I'm working on a project with a React frontend. I think I never tried it on my phone. It probably works but the site is really meant for the desktop and I don't know if it makes sense on a small display.

However... I'm writing this on my phone so HN is part of the mobile web and it works well. I read the post on twobithistory.org on my phone and that also works well. I doubt that they have an app and even if they did, why should I install it and what would happen when I follow a link from HN to them? I'll get the mobile site or would the app catch the link and open itself?

I don't even have the apps of the news sites I read most. Reading them on the phone with Firefox and uBlock is good enough. Their apps probably contain more spyware than the adblocked sites.

So the mobile web did not completely fail. It's still what's on the screen of my phone for about 50% of the time: since last charge 3h 55m of screen time, 57m used by phone calls, 1h 54m Firefox, 15m WhatsApp, etc.

> 57m used by phone calls

You are an unusual smartphone user. For most people, it is 0m.

Do you have stats? You might be projecting.

I checked in the Battery settings. Android 8 shows the stats there.

In the USA, the average time spent on voice calls is something like 20 minutes per day. 60 minutes is unusual, but not that unusual.



I've been recently wondering if there's another, better way. The big usability win of the web is that you can run applications without installing anything. Is there a way we could build a new platform that would get us the advantages of the web without all the awful cruft?

I'm imagining starting with webassembly for sandboxing. We can then expose through to webassembly a useful set of API primitives from the underlying OS for text boxes, widgets and stuff.

Apps would live in a heavily sandboxed container and because of that they could be launched by going to the right URL in a special browser. We could use the same security model as phone apps - apps have a special place to store their own files. They have some network access, and can access the user's data through explicit per-capability security requests.

That would allow a good, secure webapp style experience for users. But the apps themselves would feel native (since they could use native UX primitives, and they would have native performance).

Developers could write code in any language that can compile to webassembly. We could make a bundler that produced normal applications (by compiling the app out of the sandbox). Or we could run the application in a normal web browser if we wanted, by backing the UX primitives with DOM calls, passed through to WASM.

The original usability win of the web was that anyone could put up some text with pictures and links and anyone else could see it. It was proto-Facebook. I think of the evolution of the web as a sequence of technologically trivial but socially innovative changes in format: from static pages to timestamped blog posts, aggregating multiple blogs in a feed, allowing comments and so on. That's the vision laid out in Clay Shirky's writings about social software, as true today as ever before. And unlike the "semantic web", that vision naturally leads to a wealth of semantic information (like friend graphs) which is tremendously useful.

Hopefully that explains why "API primitives exposed to webassembly" feels to me like thinking about the web from the wrong end. The social end is what makes the web tick. It could be built with tinkertoys for all I care.

Didn't we try this in the 90's with Java applets?

What is the essential difference between the success of browsers and the failure of X-Windows and Java applets?

For my money it's that Java applets and X-Windows didn't have a distribution mechanism and security model. They simply didn't do anything I couldn't already do with desktop apps and HTML.

Also, frankly, they were kind of slow and not very good. I think that's the biggest problem with this sort of idea - the breadth of surface area for GUI toolkits is crazy huge. Building something that works well, and works cross-platform, is a seriously huge amount of work.

Discovery. This is the essential difference. And this is mostly based on semantic features of html.

The WWW has 3 main ways of discovery that alternative technologies didn’t offer: 1) search (leading you to the correct info in the site, instead of just to a landing page), 2) overview pages, i.e. short summaries with links to the actual info (Google News, etc.), 3) deep hyperlinks that everyone can easily discover (browser URL) and provide elsewhere (email, Facebook posts, Twitter posts, etc.).

The first one is very much based on the semantic qualities of html, where google can crawl a page and make some educated guess about what the page is about.

Biggest problem with mobile apps is that discovery is completely channeled through commercial app stores.

I would like to see an alternative web tech stack that doesn’t skip the discovery part. Web assembly with canvas for example is completely useless for a search crawler.

> What is the essential difference between the success of browsers and the failure of X-Windows and Java applets?

Timeline is one of the key ones. According to the Chrome task manager (because browsers need task managers now), the page I'm typing this reply on, which contains a text area and your comment, is consuming 30MB of RAM. Back when Java applets were getting their reputation for being slow I would have been lucky to have a computer with 32MB of RAM; 8 and 16MB were still common at that time. Now, there were some other things that made applets awful, but if they were introduced today they wouldn't seem nearly as bad as we remember. On a computer of that era, this page would be clunky too.

For x-windows, it was never really a contender because there was no MS compatibility, but the potential was there.

Java actually has a pretty decent security model. Like most sandboxes it was (and still is) full of holes, but browsers don't fare much better.

How are you crossing the bridge from webassembly to having access to the native UX primitives? Are you directly making C calls to native libraries like win32?

You can do that with PWAs if they are packaged as native apps.

For example on Windows, Microsoft has rebooted hosted UWP JavaScript apps into signed PWAs.

So in that case, you can check if UWP APIs are available and use all of them, depending on UWP permissions for the app.

Chrome is following a similar route with ChromeOS and Chrome Android.

As a native/Web developer I tend to have a native bias, but PWAs look like the way the Web might win back ground from native. It isn't fully there though.

Through a privileged security barrier, which in practice would look like a separate API. That API would in turn make win32 (or UWP or whatever) calls.

At a minimum we'd need a separate API to enforce a web / mobile style security boundary.


When people say this, it feels like we're in different universes, or at least looking for different things.

I main Linux on all of my machines. Most of the native apps on it have terrible UX. Even big apps - On touch screens, Firefox doesn't support two-fingered scroll. Chrome won't snap to the side of a desktop. Neither will bring up a virtual keyboard if I click on the URL bar.

The majority of my native apps that I use don't support fractional scaling - apps like Unity3d are unusable on 13 inch screens and there's no way for me to zoom in or out on them. Even system dialogs suffer from this problem sometimes, it's like nobody on native ever learned what an 'em' unit is and they're still stuck in 1990 calculating pixel positions.

In contrast, most of the websites that I'm using, even when they're badly designed, will work on smallish screens or can be individually zoomed in and out. My keyboard shortcuts work pretty much the same across every site (aside from the rare exception that tries to be all fancy and implement its own). If they break, it's not rare for me to be able to open up an inspector and add one or two CSS rules that fix the problem.

Reading Hacker News, I sometimes wonder if I'm just browsing/using entirely different sites/apps than everybody else is. I don't understand how my experience is so different.

Regarding semantic HTML, I generally don't have too much of a problem there either. I don't think semantic HTML is hard to write - I use it on every single one of my sites. If you're using React and it can't be used without spitting divs all over the place, maybe the solution is just to stop using React? Modern HTML is only going to look like div soup if you fill it with divs.

I mean, I can build you a horrible SQL database that requires 30 joins on every data call, but that doesn't necessarily mean that SQL is bad. It means that auto-generating SQL tables based on a bunch of cobbled-together frameworks and user-scripts is bad. Treat your application's DOM like you would a schema, and put some thought into it. That will also solve a great deal of the responsive design problems on mobile that people are talking about, because light DOM structures are more flexible than heavy ones.

> I don't think semantic HTML is hard to write - I use it on every single one of my sites.

Buzzword-compliant semantic HTML or semantically useful HTML? Is there any user-agent out there that benefits from the extra work you're putting in?

It's important to distinguish between the semantic web and semantic HTML. They are different things.

The criticisms this article levies about the semantic web are pretty much straight on as far as I can tell.

Semantic HTML is pretty straightforward though - it's using HTML to describe content, rather than purely for layout. Some sites do it better than others, but it's certainly not dead or abnormal -- and many static HTML generators are... decent.. ish. Semantic HTML is using stuff like article tags and sections, using actual links instead of just putting a click handler on a div, stuff like that. The stuff that makes it easy to parse and understand a web page.

It's very useful - semantic HTML is the reason that sites like Pocket work, it's the reason why reader mode in Firefox/Safari works. It's the reason why screenreaders on the web work so much better than on native apps (at least as far as my experience on Linux has gone, maybe other people have better apps than me :)) It also (opinion me) makes styling easier, because light descriptive DOM structures tend to be easier to manipulate in CSS than large ones.
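To make the reader-mode point concrete, here is a rough Python sketch of how a tool can lean on the `article` tag to pull the main content out of a page. It is not how Pocket or any browser's reader mode is actually implemented; the `ArticleText` class and the sample page are invented for the example, and real tools use far more heuristics.

```python
from html.parser import HTMLParser


class ArticleText(HTMLParser):
    """Collect text that appears inside <article> elements (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <article> elements we are currently inside
        self.chunks = []  # text fragments found inside <article>

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_article(html: str) -> str:
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: navigation and footer are skipped, the article is kept.
page = """
<nav><a href="/">Home</a></nav>
<article><h1>Title</h1><p>Body text.</p></article>
<footer>Copyright</footer>
"""
print(extract_article(page))  # Title Body text.
```

If the same page were built from nothing but divs with click handlers, a tool like this would have nothing to hold on to and would have to guess.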

The semantic web, to the extent that it's well-defined at all, is more about the metadata associated with a webpage. Very different concepts.

There used to be a plugin for Firefox that would indicate if a page contained any FOAF info.

The same can be said about native apps that don't bother with layout managers, especially if they are part of the OS frameworks to start with.

It's a double-edged sword. I get why some apps say, "I want to handle everything myself", because then you don't have to debug which versions of a framework you're compatible with, and you don't have to deal with these massive layers of abstractions. I hate working with frameworks; if I was building a Linux app I would be very tempted to just directly call into X or Wayland.

On the other hand, the last time I launched Braid on Linux, I had to manually change my resolution back afterwards and it removed my desktop background.

And I just felt like, "I'm sure there was a really good, sensible reason for whatever hack this game relied on when it originally launched on Linux. But... come on, if you had used some common framework, for all of the terrible problems that might have brought, when I launched it years later it would have at least full-screened properly."

So I dunno. The number of really big Linux apps that end up using their own custom display code is surprising to me. Even Emacs isn't fully using GTK. I assume developers of those apps are smart, so I assume there must be a good reason for it.

Nobody is stopping people from using the relatively new article or header tags. There is not really an inherent advantage in using divs except that they are barebones maybe. Apart from that, there are data attributes which are actually used on real-world websites for annotation of texts. Indeed they go particularly well with span tags.

People forget that all these fancy frameworks produce actual HTML5 DOMs; who cares if those are static or dynamic? If someone wants to write a semantic web parser/crawler then it's a great idea, but probably it shouldn't be done using wget. :-)

> Modern HTML is really nothing more than div tags everywhere, with a handful of span tags.

So just add a layout-language to divide it from the content, as already done with style?

'There's no such thing as the "right" way to represent any given data stream, just ways that are more or less suitable to specific tasks and interests.'

My core objection to "the Semantic Web" is the non-existence of "the Semantic". There is no way you can get everyone to agree upon a universal "semantic", and if you can, which you can't, you can't get people to accurately use it, and if you can, which you can't, you can't prevent people from gaming it into uselessness. But it all starts from the fact that the universal semantic doesn't even exist.

Somewhere there's a great in-depth series of blog posts from someone who describes just trying to get libraries to agree upon and correctly use an author field or set of author fields for libraries. This is a near-ideal use case, because you have trained individuals, with no motivations to insert bad data into the system for aggrandizement or ad revenue. And even just for that one field, it's a staggeringly hard problem. Expecting "the web" to do any better was always a pipe dream. Can't dredge it up, though.

(To the couple of other replies about how "it's happening, because it's happening in [medical] or [science]", well, no, that's not it. That's a smaller problem. The Semantic Web (TM) would at most use those as components, but nobody would consider that The Semantic Web (TM), at least in its original incarnation. Yes, smaller problems are easier to solve; that does not make the largest version of the problem feasible.)

I don't think a "universal semantic" was ever a design goal of the semantic web. What's needed is not one semantic, but the ability to map between competing/complementary semantics. Which is still a hard problem, to be sure, but which admits varying degrees of partial progress.

"I don't think a "universal semantic" was ever a design goal of the semantic web."

And I'm pretty sure it was the whole point. Nobody would ever have written as many reams of marketing material if the pitch was "Hey, someday, you'll be able to reach out to the web, and with specialized software for your particular domain you can access specialized web sites with specialized tags that give you access to specialized data sets that can be fed to your specialized artificial intelligence engines!"

Because that pitch is basically a "yawn, yeah, duuuuuh", and dozens of examples could have been produced even ten years ago. The whole point was to have this interconnected web of everything linking to everything, and that's what's not possible.

These two visions you present lie on extreme ends of a continuum. They're both complete strawmen. The folks behind semantic web were aware of both of them, and were careful not to let their work be pigeonholed into either one.

Consider these passages, from "The Semantic Web," Tim Berners-Lee et al, Scientific American, May 2001:

"Like the Internet, the Semantic Web will be as decentralized as possible. Such Web-like systems generate a lot of excitement at every level, from major corporation to individual user, and provide benefits that are hard or impossible to predict in advance. Decentralization requires compromises: the Web had to throw away the ideal of total consistency of all of its interconnections, ushering in the infamous message 'Error 404: Not Found' but allowing unchecked exponential growth."

"Semantic Web researchers... accept that paradoxes and unanswerable questions are a price that must be paid to achieve versatility. We make the language for the rules as expressive as needed to allow the Web to reason as widely as desired. This philosophy is similar to that of the conventional Web: early in the Web's development, detractors pointed out that it could never be a well-organized library; without a central database and tree structure, one would never be sure of finding everything. They were right. But the expressive power of the system made vast amounts of information available, and search engines... now produce remarkably complete indices of a lot of the material out there. The challenge of the Semantic Web, therefore, is to provide a language that expresses both data and rules for reasoning about the data and that allows rules from any existing knowledge-representation system to be exported onto the Web."

> still a hard problem, to be sure, but which admits varying degrees of partial progress.

I think one of the big problems with the semantic web was that it turned "varying degrees of partial progress" into "multiple competing approaches", each of which wanted to detract from the others.

My impression, back then, was that The Semantic Web would be the sort of thing that you could create a real, general, AI upon. But populating the SW accurately, and maintaining it, was far too large a task for a small group of people and required too much coordination for a large group of people. You'd need a real, general, AI to manage it. So you can't create it without the AI, and you don't need it if you've already got the AI.

> Somewhere there's a great in-depth series of blog posts from someone who describes just trying to get libraries to agree upon and correctly use an author field or set of author fields for libraries.

That sounds interesting! Don't suppose you remember any more terms that might enable a Google search to find it?

The entire design of the "linked data" ecosystem is based on the idea that, as you point out, "There is no way you can get everyone to agree upon a universal "semantic", and if you can, which you can't, you can't get people to accurately use it, ..." etc.

It's like the dictionary vs encyclopedia thing. A dictionary (vocabulary) is probably the best we can hope for.

I have a foggy recollection that Jerry Fodor (Holism) or maybe Eco had thoughts on the matter.

>hacked the brains of academic-facing bureaucrats

In all fairness, it was also promoted by legitimate academics including Tim Berners-Lee. I actually saw him give a talk about it a number of years back for his Draper Prize speech.

I remember his AAAI '07 talk where he laid into the audience for changing and reusing URLs. It made me really want to see this world he was imagining where millions of people using the web every which way agree to universally abide by a rule that makes their life a lot harder but makes it easier to reason about algorithms on the Web.

Btw, I don't think this or the failure of the semantic web reflects badly on TBL. He is in a rare class of folks, along with Stallman, who can leave a bigger dent in the world while missing their ideals by a mile than most of us could if we got everything we wanted.

Oh, I'm certainly not going to be very critical of Sir Tim! Along with the obvious reasons, I believe he coined the read/write web term and was a strong advocate of users being creators as well as consumers. And TBH, while it's fairly obvious today that classification and discovery has to happen largely organically if only because of the scale of the Web, thinking in terms of formal schemes is a pretty natural bias for someone of TBL's background to have.

My main point was that this wasn't just some fantasy of out-to-lunch bureaucrats.

I've never worked in any field that might be considered "semantic web", but at my last gig I spent some time researching deductive database technology.

"Semantic web" was like a keyword that popped up in the abstract of virtually every paper published on the topic since the mid 90s or so. It seemed to me one of those phrases that virtually guaranteed grants of funding if you could work it into your paper.

I think you're correct that people working in the field of semantic web oftentimes didn't understand (or lost sight of) the intended nature and utility of the concept. But I also think that its buzzword status led to the label being applied to many things that were only loosely related.

There was (is?) a lot of good work happening under the heading of "semantic web", but—partly because the term was not an apt one for the work, and partly because the dream of the semantic web never really actualized—that work has remained relatively obscure. Obversely, there was much work of questionable worth happening under the same banner, likely because it was a reliable way to get funding, at which point we return to that extremely important-sounding literature that's nothing but a giant ball of crap...

Really interesting to hear that my experience is shared by many. I wrote my master's thesis on RDF and linked data, and the more I got into it the more I realized how it's just a collection of people praising each other's ideas without any outward potential.

In theory the idea is brilliant but the execution is poor. In many cases the world of academia is way too detached from software development to be able to produce any real-world useful tooling beyond toy projects and proof-of-concepts.

> So you have all these academics publishing all this extremely important sounding literature about the semantic web, but as soon as you pry one nanometer deep, it's nothing but a giant ball of crap.

It's not just academia. I made an observation about how most BigCo and BigGovts use a lot of Semantic Web: https://news.ycombinator.com/item?id=18036041

It's not cynicism, it's just how things went. Meanwhile there is much genuine work in knowledge representation waiting to be done and funded, on both the foundations and the platform sides.

This is really fascinating. It seems there's an interfacing problem at a level, with a sort of depth-expectation on one end of the problem (real-world / technology application), and a more exploratory breadth-expectation on the other (pure academia / science). It'd be cool to develop some standards to create a sort of bridge between the two.

On the breadth end there may be a need for a sort of "contract of depth commitment" to provide insurance against the flighty hijinks of the ever-expanding mind, and at the real-world level you need a sort of "grant of exploration" which allows academia to bring its own type of surface-level one-night-stand concern or even ADHD to the game. Without the latter there's probably too much risk of repeating history while someone else is blazing the trails you won't try.

Anyway thanks for that comment, it's really interesting.

Somewhat related, I suggested how the Semantic Web (aka "Linked Data") and the more real-lifeish Data Science field could be merged, at a Linked Data event in Sweden earlier this year:

TLDR; What semweb / linked data lacks the most is a practical logic / reasoning layer, so that facts can more easily be inferred from existing plain data. I further suggest this is already available in a rock-solid proven technology: Prolog.

Blog post: http://bionics.it/posts/linked-data-science

Slides and video: http://bionics.it/posts/semantic-web-data-science-my-talk-at...
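For anyone wondering what "inferring facts from existing plain data" looks like in practice, here is a small sketch. It is written in Python as a stand-in, not in Prolog, and it implements just one rule (treating a predicate as transitive) with naive forward chaining. The `infer_transitive` function and the sample triples are made up for illustration and are not taken from the linked talk.

```python
def infer_transitive(triples, predicate):
    """Return the input triples plus everything implied by treating
    `predicate` as transitive (naive forward chaining to a fixed point)."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, p1, b) in list(facts):
            if p1 != predicate:
                continue
            for (c, p2, d) in list(facts):
                # If a -> b and b -> d both hold, then a -> d is implied.
                if p2 == predicate and b == c and (a, predicate, d) not in facts:
                    facts.add((a, predicate, d))
                    changed = True
    return facts


# Illustrative (subject, predicate, object) triples.
facts = {
    ("left_ventricle", "part_of", "heart"),
    ("heart", "part_of", "cardiovascular_system"),
}

closed = infer_transitive(facts, "part_of")
for triple in sorted(closed):
    print(triple)
```

The third triple, that the left ventricle is part of the cardiovascular system, was never stated; it falls out of the rule. A logic language lets you write many such rules declaratively instead of hand-coding a loop per rule.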

Could you use something like Google Inference API to infer facts from existing plain data?

Yes. I was trying to formalize an ontology for astronomical data sets, found that good people in Strasbourg had already done a good job there, but had no experience with unification over simple tuples. Tried to explain RDF with Prolog queries.

I tried to steer away from typical databases, but in the end I was just getting in the way.

That is so sad. I proposed, and worked briefly on with a handful of really awesome engineers, a project that would use an LDA to try to ascertain the topic for a web page, then use NLP to pull apart the sentences for their contextual components. The next step was to slot the values into an ontology framework to store alongside the web page text. Its initial goal was web content scoring with respect to ontological coherence, and later building knowledge bases dynamically.

Thanks for sharing. I am worried that exactly the same will happen to AI in Europe :-/


Very different beast. The semantic web was an effort to improve the commons. AI/ML benefits are captured by their owners.

I think AI is a different kind of thing. Re:

>More than 2,000 European AI experts have come together to collaborate on research and call for large-scale funding from the European Union to counter China and America’s rapid progress in the field.

you wouldn't get them worrying about the US / China's rapid progress in Semantic Web.

What do you think of SNOMED as an anatomy ontology? Is it good enough or are there serious flaws?

Sorry, I only worked with FMA, with some tangentially related ontologies like CHEBI, and some custom in-house ontologies.

Cool comment... but say you have a friend who's dumb and you wanted to explain to them why it's not OK to store usernames and passwords....

What would you say to them?

Different data store types are used to solve different problems. An RDF store is designed to help solve the problem of ontological relationships. I wouldn't use an RDF store if I wanted to store information about who has the current high score on my video game or who has signed up to my site for a username and password.

Unfortunately some developers think that once you've chosen one particular storage technology (RDF, relational DB, document storage) that's the bucket you have to put everything into in order to do anything in your application. I suspect that was the mental model of the other devs in the story above.
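To show the kind of problem a triple store is shaped for, here is a toy in-memory version in Python. It is only a sketch: the `match` function, the `store` list, and the sample facts are invented for the example, and a real RDF store would use IRIs and a query language such as SPARQL.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def match(triples: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples matching the pattern; None acts as a wildcard."""
    for t in triples:
        if ((s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)):
            yield t


# Illustrative anatomy facts.
store = [
    ("femur", "is_a", "bone"),
    ("femur", "part_of", "leg"),
    ("tibia", "part_of", "leg"),
]

# "What is part of the leg?"
print(sorted(t[0] for t in match(store, p="part_of", o="leg")))
```

Everything is a relationship between two things, which is great for "what is part of what" and awkward for a users table with a password hash, a signup date, and a uniqueness constraint on the username.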

Well for one thing, with triple-stores it's not uncommon to expose an unsanitized read-only query engine (usually SPARQL), usually harmless because none of the data is secret. That goes out the window if you're actually storing business-sensitive stuff in there.

Aside from that, I guess there's theoretically nothing stopping you from using the triplestore for usernames/passwords (I hope you mean salted passwords) but sheesh, talk about killing a fly with a bazooka.

To be fair to those colleagues, it might have been less about them being clueless, and more about them wanting to offload work to my team, lol.

1) users often use the same username and password on various websites, despite this being dumb (because it's so convenient)

2) if you ask such a stupid question you probably won't be able to properly secure the machine hosting this

3) your users will be totally screwed when you get hacked

And this is why you should never store passwords but only a salted hash, and never trust any service that can email you your password when you click "I forgot my password".
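A minimal sketch of the "salted hash" advice, for anyone who wants to see it spelled out. It uses only the Python standard library; the function names, the 16-byte salt and the iteration count are assumptions made for illustration, not a recommendation from anyone in this thread.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

# Illustrative work factor (an assumption); tune it for your own hardware.
ITERATIONS = 200_000


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) computed with PBKDF2-HMAC-SHA256."""
    if salt is None:
        # A fresh random salt per user means identical passwords
        # do not produce identical stored digests.
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)
```

You store the salt and the digest, never the password itself, which is also why a site built this way cannot email your password back to you.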

In my circle the semantic web types morphed seamlessly into data scientists and continued to extract funding without missing a beat

What happened to the semantic web?

Well... it happened.

1) We got schema data for Job Postings that companies like Google read to build a job search engine.

2) We got schema for recipes. https://schema.org/Recipe

3) We got the Open Graph schema for showing headlines/preview images in social networks. http://ogp.me/

4) We got schema for reviews: https://developers.google.com/search/docs/data-types/review

5) We got schema for videos: https://developers.google.com/search/docs/data-types/video

6) We got schema for product listings.

The semantic web happened but the Semantic Web didn’t. Schema.org is used because it a) solves a problem which exists in reality and b) works well with very modest requirements.
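To make the "very modest requirements" point concrete, here is a rough sketch of how a consumer might pull schema.org data out of a page, assuming the site embeds it as JSON-LD in a script tag. The sample page, the class name and the helper name are made up for illustration; only the Python standard library is used.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


# Hypothetical page with a schema.org Recipe embedded as JSON-LD.
PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe",
 "name": "Pancakes", "recipeIngredient": ["flour", "milk", "eggs"]}
</script>
</head><body><h1>Pancakes</h1></body></html>
"""


def extract_jsonld(html: str) -> list:
    """Return every JSON-LD object found in the given HTML."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return parser.items
```

No triple store, no reasoner: the consumer parses one JSON blob and reads the fields it cares about.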

All of the crazy bikeshedding about labyrinthine XML standards, triples, etc. or debating what a URL truly means has very little to show for the immense time investment.

The main lesson I take away is that you absolutely need to start with real consumers and producers, and never get in the state where a long period of time goes by where a spec is unused. Most of the semweb specs spent ages with conflicting examples, no working tooling, etc. which was especially hazardous given the massive complexity and nuance being built up in theory before anyone actively used it.

"Good judgment comes from experience, and experience comes from bad judgment." Maybe humanity had to first try "all of the crazy bikeshedding about labyrinthine XML standards, triples, etc. or debating what a URL truly means" in order to gain enough understanding about this domain to later create things like schema.org.

These are definitely interesting things that semantic web influenced, but they're definitely not what semantic web aspired to be. The differences are a good laundry list of things that make standardized data models viable in the wild:

They are organized around use cases, not data origins. They are opt-in, rather than relying on distribution via some discovery standard. They are only used by subsets of the industry they serve, and that's ok. They aren't integrated with regulation in ways that force their use. (list continues)

I would rather see features on the web appear because users want them and expect them to work, and not because some regulation mandates their use.

I agree. While it didn't turn out to be the solution to all our problems and not every webpage is chock full of semantic markup that anyone can scrape, there are still a lot of bits and pieces that stuck around.

I always felt that one obstacle to more widespread adoption by users was missing software. There were and still are all these great microdata pieces out there, but can my browser save an address, add an event to my calendar, etc.? If these things were easily possible even for an inexperienced user, I think we would see much more semantic markup. But in the meantime, search engines and social networks sure have an easier time scraping information than before we talked about the semantic web.

Like a lot of things--though not everything--with the modern Internet (e.g. PageRank vs. hierarchical directories), some echo of the Semantic Web ended up happening mostly organically. The Semantic Web was a rather academic concept--think a library classification system for the Web. Tim Berners-Lee, among others, promoted it for a long time.

That is literally one of the points of the article.

I worked on the Semantic Web, designing core data infrastructure, back when it was still hot. It disappeared because it had two fatal flaws: it was intrinsically non-scalable, both conceptually and technically.

First, there is no universal "semantic". The meaning of things is ambiguous, and given a broad enough pool of people and cultural contexts, it becomes nigh impossible to converge on a single consistent model for individual terms and concepts in practice. A weak form of this is very evident in global data interchange systems and implementations, and the Semantic Web took this bug factory and dialed it up to eleven. In short, the idea of the Semantic Web requires semantics to be axiomatic and in the real world they are inductive and deeply contextual. (Hence the data model rule of "store the physics, not the interpretation of the physics.") That said, it is often possible to build an adequate semantic model in sufficiently narrow domains -- you just can't generalize it to everything and everybody.

Second, implementation at scale requires an extremely large graph database, and graph database architectures tend to be extremely slow and non-scalable. They were back then and still are today. This is actually what killed the Semantic Web companies -- their systems became unusable at 10-100B edges but it was clear that you needed to have semantic graphs in the many trillions of edges before the idea even started to become interesting. Without an appropriate data infrastructure technology, the Semantic Web was just a nice idea. Organizations using semantic models today carefully restrict the models to keep the number of edges small enough that performance will be reasonable on the platforms available.

The Semantic Web disappeared because it is an AI-Complete problem in the abstract. This was not well understood by its proponents and the systems they designed to implement it were very, very far from AI-Complete.

Third, you can't force people to use the correct semantics. They'll use them wrong on purpose for fun and profit. Mark some disturbing content as wholesome, mark it as whatever is popular at the moment to get it in front of more eyeballs, mark it as something only tangentially related in the hope there's a cross over of market, mark it wrong because they don't actually know better.

Yes, I think this is the biggest issue. Since many websites are funded by advertising, they do not want people to be able to extract data from their pages, they want people to "visit" the site and view the ads.

Also, rights. People "own" things like sports fixture lists... they don't want you extracting that data without paying to use it.

The semantic web could be perfect technically, but it was never going to apply to content that people were attempting to "monetise"... which seems to be most of the web's content.

I don’t really understand this argument, because there already are lies published on the internet. What difference does it make if those lies are published in a standardized machine readable format or not?

A human-targeted web structure that contains some lies is still useful for humans because humans can filter out those lies with somewhat satisfactory efficiency.

A machine-targeted web structure that contains some lies is not useful for machines because they can't filter out those lies yet. It might become useful when they can (but that might be a hard-AI problem), but it's simply not usable until that point.

What's the point of having the marks at all if they are not actually correct?

If you're answering the question, I think it would be good to answer it directly.

We as humans have a lot of intuitive tools for knowing whether a source of data is trustworthy. AI could possibly approach this ability given enough training... we'd need to do something like add a "trust" score to every node in the graph.

Your experience shows!

I made a few observations in my own comment.

One being that there is no usable graph store you and I can use as of 2018.

Another being about monetizing the Semantic Web when playing the role of the data/ontology provider. You provide all the data while the consumers (Siri, Alexa and Google Home) get the glory: https://news.ycombinator.com/item?id=18036041

> First, there is no universal "semantic". The meaning of things is ambiguous, and given a broad enough pool of people and cultural contexts, it becomes nigh impossible to converge on a single consistent model for individual terms and concepts in practice

It sounds like the Semantic Web failed because we tried treating a longstanding (and possibly unresolvable) ontological problem as a straightforward and technical one.

I don’t really follow academic philosophy, but is it known these days if such categorisation problems are even “solvable” in the general case?

Your first point is being worked on and there are a few upper ontologies being used. I find BFO [1] quite promising.

I believe that we actually can generalize modelling. We have dictionaries filled with definitions, given enough time and discipline, I don't see why we couldn't make them formal. It's not an engineering problem though.

[1] http://basic-formal-ontology.org/

This is why category theory is so important! It allows us to move between axiomatic systems by looking at the structure they have in common. I am convinced that the 'semantic web' will be accomplished via some easy to use version control meets functors gui program.

What happened is that the technology spawned by the Semantic Web "fad" is now absolutely everywhere but it looks and works nothing like how people thought it would.

Freebase, after being bought by Google became the foundation of the Google Knowledge Graph (aka "things not links"). This kicked off an arms race between all the major search providers to build the largest and most complete knowledge graphs (or at least keep pace with Google [1]). Instead of waiting for folks to tag every single page, it turned out that simple patterns cross referenced across billions of pages were good enough to extract useful knowledge from unstructured text.

Some companies who had easier access to structured but dirty data (like LinkedIn and Facebook) were also able to utilize (and contribute to) all of that research by building their own knowledge graphs with names like the Social Graph and Economic Graph. Those in turn are helping to power a decent amount of their search and ad targeting capabilities as well as spawning some interesting work[2]

All those knowledge graphs became a major part of Siri, Alexa and Google Home's ability to answer a wide range of natural language queries. As well as being pretty fundamental to a lot of tech like semantic search, improved ecommerce search and a bunch of intent detection approaches for chatbots.

So yeah while the technology and associated research did turn out to be incredibly useful, adding fancier meta-tags to pages was not the direction that proved the most useful.

[1] https://ai.google/research/pubs/pub45634 [2] https://research.fb.com/publications/unicorn-a-system-for-se...

The problem with all this is that Google, Facebook, Linkedin et al are private companies, so their knowledge graphs are, well, theirs.

The idea with the semantic web was that it would be open and it would belong to its users, not to some cabal of giant corporations that would use it to control the internets.

That notion of openness and co-authorship of the knowledge on the web is now as dead as the parrot in the Pythons skit. And we're all much the worse for it- see all the debates about privacy and ownership of personal information and, indeed, metadata.

IIRC, Common Crawl exposes the semantic data from the sites they crawl. One could build their own knowledge graph (or at least bootstrap one) from that and other available data sources (DBPedia, WikiData etc.)

That's not sufficient - the "private" knowledge graphs of e.g. Google aren't "crawlable", they aren't public and don't (solely) rely on the sites. DBPedia+Wikidata+all other open data sources are not sufficient for a good knowledge graph that can be competitive (in terms of coverage, thoroughness, and recency of updates) with what the megacorps can afford to maintain behind closed doors.


I made an observation about monetizing the Semantic Web when playing the role of the data/ontology provider. You provide all the data while Siri, Alexa and Google Home get the glory: https://news.ycombinator.com/item?id=18036041

I thought Freebase [1] was the most promising "Semantic Web" technology, with a powerful query language (MQL) and an application platform called Acre [2]. I'm biased because I worked at Danny Hillis' adjacent company, Applied Minds, and met with the Freebase folks to talk about graph databases. I went to one of Freebase's Hack Days, and I could feel the energy around building applications on a semantically-aware global database.

Unfortunately, they got acquired by Google, and Freebase eventually shut down. Thinking back now, I wonder if there would have been a business model in hosting private data graphs to subsidize the open source data.

[1] - https://en.wikipedia.org/wiki/Freebase [2] - https://opensource.googleblog.com/2010/08/acre-open-source-p...

Danny has been collaborating with some people in the MIT Media Lab and Protocol Labs to get a non-profit, open-source, and distributed version of Freebase off the ground, called the Underlay. I know they're looking for passionate people to work with, so if you're interested you should reach out! https://underlay.mit.edu/

That really sucks. Being bought out by Google probably creates some implicit benefits to their search experience (likely with their topical breadcrumbs which returns widgets inside the search window giving a list of objects that match and near-match, like books by an author).

But the acquisition prevents there from existing a service that can inspire a wider ecosystem on something like a federated platform.

Bloody monopolies, man

You probably would like to have a look at Wikidata, which is very similar to Freebase and is gaining a huge community of contributors and data reusers: https://www.wikidata.org/


You might be interested in the Underlay Project that Danny is starting: https://underlay.mit.edu/

It's great to see he's continuing this work! Makes sense to use existing technology like IPFS to distribute the graph, instead of building the technology from the ground up like we were trying to do back in the day.

Seems pretty clear to me that freebase was a thread to Google search. A semantic knowledge search with a powerful query language could replace a good chunk of free text google searches at least for power users.

Makes sense for them to buy it and get rid of it that way.

You can still download the data at https://developers.google.com/freebase/. Looks like the data is available there and the license is "Creative Commons Attribution (aka CC-BY)". Wonder why someone hasn't created a new company starting with their dataset? It's "only" 2gb compressed, 8gb uncomp, 63 million entries. That is smaller than I expected.

I guess wikidata/wikipedia is the offshoot.

>Makes sense for them to buy it and get rid of it that way.

Except they didn't: the Freebase team moved to Google and carried on the work there, as "Google Search".

btw s/thread/threat/

I was involved in the semweb community ~7 years ago, particularly the "RDF knowledge graph" end, and it's still a bewitching idea. A lot of smart people have/still do work in it, but it never reached any kind of success on the commercial (as opposed to academic) web, because:

Serialization is not the hard part.

The semweb community was obsessed with ontologies and OWL and schemas and taxonomies. If we can just break the problem down enough, the logic went, then systems will be able to infer new data about the world. But it never worked out that way.

Eventually you just have to write some code. If you have to write code anyway, all the taxonomies and RDF in the world aren't helpful (indeed, they're almost certainly the least efficient way to model the problem). You just scrape the pieces of knowledge out of whatever JSON, HTML, or whatever else and glue them together with the code. You don't need the all-knowing semantic web, you just need a .csv of whatever tiny piece of it you care about.
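For what it's worth, the "just write some code" approach described above tends to look something like the following rough sketch. The two sources and their field names are invented for illustration; the point is only that a few lines of glue per source get you the .csv.

```python
import csv
import io
import json

# Two hypothetical sources that describe books with different field names.
SOURCE_A = '[{"title": "Dune", "writer": "Frank Herbert"}]'
SOURCE_B = '[{"name": "Neuromancer", "author": {"name": "William Gibson"}}]'


def rows_from_a(raw: str) -> list:
    """Map source A's fields onto the columns we care about."""
    return [{"title": r["title"], "author": r["writer"]} for r in json.loads(raw)]


def rows_from_b(raw: str) -> list:
    """Map source B's nested author object onto the same columns."""
    return [{"title": r["name"], "author": r["author"]["name"]} for r in json.loads(raw)]


def to_csv(rows: list) -> str:
    """Write the normalized rows out as CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["title", "author"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    print(to_csv(rows_from_a(SOURCE_A) + rows_from_b(SOURCE_B)))
```

Each new source costs one more small mapping function, which is exactly the trade the pitch below ran into.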

I have a distinct memory of trying to sell someone on the startup I was working on, a SPARQL database. I was pitching RDF as a way to model the problem, but eventually the person I was pitching just said "well, we can just outsource the scraping to our eastern european devs and put it all in one big table." I had a kind of "oh my" moment where I realized that the startup was never going to work: in the real world, you just write code and move on. Taking part in the great semantic knowledge base of the world doesn't matter and isn't needed.

The other end of semweb, the "machine-readable web", more or less came to pass. schema.org, opengraph, and that sort of thing did 99% of what the semweb community wanted at 5% of the effort. The fact that all of that data is not in one giant database doesn't really matter to anyone; you rarely care about more than 2 or 3 web pages at once.

I worked for a semantic web startup. The idea was we'd build private "knowledge graphs" for companies, especially Pharma and Biotech. We experienced something similar to what you describe. We had a nice RDF generator and a query engine. The idea was we'd parse data from clients' DBs and unstructured stuff and generate semantic graphs, which would be used for semantic graph apps like search and inference. Looking back, it was never going to work. Most clients came to us for an "analytics dashboard". They were happy with giant tables to power these dashboards (and they were right!)

It's really too bad that XML+XSLT didn't take off as the "replacement" for HTML. Before you recoil in horror hear me out...

Web pages are a giant mess of content and presentation, and CSS doesn't really help much. XML is at least a way of describing data in a meaningful way. <book>, <author>, <chapter>, etc. XSLT provided a way of formatting XML in the browser. Sure the internet would still be full of inconsistent content structures, but it would still be way easier to machine read than the big mess of arbitrary <div>s and <p>s (most of which just display something blinky) that we have today.
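A small sketch of the "easier to machine read" claim, assuming a page served as XML with the element names from the comment above. Python's standard library has no XSLT processor, so this only shows the data-extraction half, not the in-browser formatting; the sample document and function name are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using the <book>, <author>, <chapter> vocabulary.
DOC = """
<book>
  <title>Example Book</title>
  <author>A. Writer</author>
  <chapter>Introduction</chapter>
  <chapter>Conclusion</chapter>
</book>
"""


def summarize(xml_text: str) -> dict:
    """Pull the named fields straight out of the element tree."""
    root = ET.fromstring(xml_text.strip())
    return {
        "title": root.findtext("title"),
        "author": root.findtext("author"),
        "chapters": [chapter.text for chapter in root.findall("chapter")],
    }
```

Compare that with guessing which of a dozen nested divs holds the author's name.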

The two problems here are that XML/XSLT are horrible to work with (to the extent that you want to achieve what they were supposed to, you do it with the modern "single page app" style where you write javascript that retrieves data from an API and renders it into a UI) and that no-one actually wants to separate the content from the presentation anyway.

Except you just contradicted yourself. If everyone wants to use JavaScript frameworks and APIs for data then they ARE separating the content from the presentation.

If you think JS frameworks are used only for data manipulation, you're sorely mistaken. Runtime styling has been a primary aim of JavaScript code since even before the popularization of jQuery.

> If everyone wants to use JavaScript frameworks and APIs for data then they ARE separating the content from the presentation.

That's not true. The DOM that the javascript frameworks mangle around isn't data but UI elements, some of which happen to be data.

> no-one actually wants to separate the content from the presentation anyway.

Except of course the developers who are using react & friends because it allows them to separate content(data) from presentation(view).

Every time I am forced to work with XML, I wonder how differently it would have turned out had 10% of the resources devoted to building increasingly complex and tangled specs instead been directed towards making tools and documentation which aren’t terrible. Little things like the user-hostile APIs for namespaces, validation, extension, etc. and often horrible error messages, not to mention the lack of updates for common open source libraries like libxml2, really cemented the XML=pain reputation by the time JSON came along.

I worked in a web app that used XSLT transformations of XML-formatted domain models to render HTML pages. You ended up doing the same sort of thing that React does, but with _far_ more cumbersome programming constructs. Functions done in XSLT were a nightmare.

The idea behind the whole thing was that the "web devs" could be hired for their HTML/CSS skills and never have to touch code. The reality was, they had to become experts in a peculiar, clunky, XSLT-based programming language of their own.

> Functions done in XSLT were a nightmare.

Did you "push" or did you "pull"?

> Did you "push" or did you "pull"?

I think you're proving my point. :-)

More specifically, we defined parameterized components that could be composed to form the output. The definition of those components was incredibly verbose, as was the instantiation, and it was tightly coupled to parts of the domain model in surprising ways. IIRC WebDevs ended up having to work a step removed from the final XSLT, using some DSL for components that transformed down to XSLT.

Further, deponent sayeth not.

I would love to see, what you did there, because, from your description, I do not understand, why you needed to use some DSL, if your input was XML and your output should have been XML, too.

And how does my question prove your point?

I am asking, because I really would like to understand, what happened there.

My gripe with JSON winning is that we're left with competing standards for JSON paths, transformation, and JSON Schema is still a draft. Clunky as XML is, the tooling is more portable and powerful.

That probably would have been cleaner back in the day, but with rich UIs being what they are, I think JSON + client side templating logic is probably simpler (and composes well).

I never had much luck using XSLT for anything non-trivial, and I imagine that experience isn't uncommon.

This only addresses the easiest and least interesting problem to solve, if it’s really a problem at all (I mean this doesn’t even show up on the Doctorow list of “meta crap” as an issue; like, micro formats are ugly but they can do essentially the same thing). It just makes people’s lives miserable to use more XML for no benefit.

> XSLT provided a way of formatting XML in the browser.

Actually, that was XSL:FO.

XSLT is transformation, not formatting; of course you could translate to (X)HTML with embedded styles, or XSL:FO or something else that includes formatting, but XSLT doesn't do formatting itself.

No, XSL-FO does not do formatting in the browser. It is meant to do printed page formatting. Currently the most common drivers produce PDF. But I heard, there is a LaTeX driver as well as PS.

Typical workflow is: XML -> XSL-T -> XSL-FO

The thing that makes this hard is the security model in browsers. You either keep the XSLT scripts in the same directory as the XML or need to turn off the check for cross-site scripts.

Ok, how do you get XSLT that isn't in the same directory as XML to work in a browser?

With regard to the first example, lowering volume of playing media when you get a phone call, I had that set up on my Nokia N900 a decade ago (Dbus on the N900 would trigger a script to ssh into my computer and pause mpd). Naturally this was a nerdy thing and not something accessible for the general public, but I mention it here just to encourage my fellow nerds to realize how much power they might already have with existing tools.

The writer says that a business owner must add their office info to Google or Yelp and suggests there are no alternatives to such centralized repositories of information. However OpenStreetMap also has opening hours for businesses and medical practitioners and that data is yours to process and play around with as you like.

In fact, there is just so much data present now in OSM, we simply lack convenient end-user tools to extract and process it automatically.

OSM is open, but it is still centralized. It doesn't come to your webpage and get the information about your hours from you (as far as I can tell)

A hybrid approach seems sensible: the OSM data contains a vetted/known URL for your business, then your client uses that URL to fetch the hours from the website. In a perfect world, anyway.

OSM is a wiki, anyone can edit things, and there's wiki-like community review. So the owner of a business can add the details if they want, or someone walking past and looking at a sign can.

Right but the semantic web idea is that the business owner puts the hours on their site and that lets everyone who wants to know the hours find them. No central site (OSM, google, yelp) required.

I mean the person looking for the hours would probably go to one of those sites, but the hours wouldn’t be stored there.

For the same reason graphical programming languages are still unpopular, and the command line still rules: inputting normalized tags is many orders of magnitude harder than typing free text. Even on an adaptive touch interface. Even with tag completion. Even with a template to fill in. And when you do manage to input some well formed AST for your todo list, you're on an island by yourself, because everyone else is using free text (or their own, different tags or syntax). Because even if you have the same structure, you also need to use the same tags! What language are they supposed to be in?! They might as well be unique numbers unless you speak that language.

It's facepalms all the way down.

I took TBL's Semantic Web class ("Linked Data Ventures") when I was a grad student in the fall of 2010. The class was well structured, and included an introduction to basic concepts and languages, lectures by people using it in production environments, and group projects. I wrote an account of the first class here (http://www.ilamont.com/2010/09/encounter-with-tim-berners-le...) and you can see a demo of the rudimentary educational app our team built here (http://www.ilamont.com/2011/03/challenges-of-creating-mobile...).

As the title of the class indicates, the idea was to encourage the creation of real-world applications, and to that end the class groups were encouraged to have a mix of Course 6 and business school team members. At the time, it seemed that the Semantic Web was more of an academic/open source project rather than something that was widely embraced by developers, although some guest speakers did have working applications at their places of business. I think the hope was to seed the Cambridge startup ecosystem with SW/Linked Data examples that could encourage its spread into the real world.

One of the teams in our class actually turned their project into a startup that was later acquired. I ran into one of the co-founders a few years later and asked if they continued to use the Semantic Web/Linked Data model that they had demoed in class. The answer: No, because it couldn't scale. That was an issue that was anticipated and discussed during the class, but there was hopeful talk that scaling issues would be resolved in the near future through various initiatives.

I worked on the Semantic Web. It has so many fatal flaws, that I am amazed in hindsight that I didn't see them back then.

Berners-Lee was successful with the Web because it was not an academic idea like Nelson's and Engelbart's hypertext, but it was a pragmatic technology (HTTP, HTML and a browser) that solved a very practical problem. The semantic web was a vague vision that started with a simplistic graph language specification (RDF) that didn't solve anything. All the tools for processing RDF were horrendous in complexity and performance and everything you could do with it could typically be solved easier with other means.

Then the AI people of old came on board and introduced OWL, a turn for the worse. All the automatic inference and deduction stuff was totally non-scalable on even toy examples, let alone web scale. Humans in general are terrible at making formal ontologies; even many computer science students typically didn't really understand the cardinality stuff. And how would it bring us closer to Berners-Lee's vision? No idea.

Of course, its basic assumptions about the openness, distributedness and democratic qualities of the Web also didn't hold up. It didn't help that the community is extremely stubborn and overconfident. Still. They keep on convincing themselves it all is a big success and will point at vaguely similar but successful stories built on completely different technology as proof that they were right. I think this attitude and type of people in the W3C has also led to the downfall of the W3C as the Web authority.

There are different flavors of OWL nowadays. Some of them are especially dedicated to reason on huge volumes of data (polynomial algorithms), although they are not very expressive. Some are more expressive, but don't scale very well. Some are incredibly expressive, but are undecidable, so you can only use them as a formal representation of a domain, not something you can reason from.

The practice in the community is to choose a fragment of OWL/description logic that fits your needs. Different tools for different uses. In practice I'm especially fond of the simplest languages, just a little more expressive than a database schema or a UML class diagram, as they are easy to describe things with and yet very useful, with lots of efficient algorithms to infer new things.
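To give a feel for what "infer new things" means at the simple end of that spectrum, here is a toy sketch of RDFS-style subclass reasoning in plain Python. It is not an OWL reasoner and does not implement any particular OWL profile; the class names and the function name are invented for illustration.

```python
def infer_types(subclass_of: set, instance_of: set) -> set:
    """Return all (individual, class) facts implied by the subclass hierarchy.

    subclass_of: set of (child_class, parent_class) pairs
    instance_of: set of (individual, class) pairs
    """
    # Step 1: transitive closure of the subclass relation, computed by
    # repeating the "A < B and B < C implies A < C" rule until nothing changes.
    closure = set(subclass_of)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True

    # Step 2: every instance of a class is also an instance of its ancestors.
    inferred = set(instance_of)
    for individual, cls in instance_of:
        for child, parent in closure:
            if child == cls:
                inferred.add((individual, parent))
    return inferred


if __name__ == "__main__":
    hierarchy = {("Femur", "Bone"), ("Bone", "AnatomicalStructure")}
    facts = {("femur_1", "Femur")}
    for fact in sorted(infer_types(hierarchy, facts)):
        print(fact)
```

Both steps run in polynomial time in the number of facts, which is the kind of guarantee the less expressive fragments trade expressiveness for.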

I agree on the problem with abstract goals.

I could never really understand what it was going to do in specific terms, going from a "Programs could exchange data across the Semantic Web without having to be explicitly engineered to talk to each other" to some specific cases that seemed useful.

Spolsky had a great blog post about this kind of thing: CS people looking at Napster, overemphasizing the peer-to-peer aspects and endeavouring to generalize it. Generalising is what science does, so the drive was there.

Generalising a solution is... It can lead you down the path to a solution seeking a problem. The web is also hard. Lots of chicken-egg problems to solve.

When TBL released www he had a browser, server and web pages that you could use right now. The "standards" existed for a non abstract reason.

On criticisms of W3C... idk. They have an almost impossible job. The world's biggest companies control browsers. Standards are very hard to change. Very hard network effect problems, people problems. Enormous economic, political & ideological interests are impacted by their decisions.

You could say that they should not have been involved with the project until it was much more mature and they could decide whether or not to include it. That said, if they were those sorts of people instead of academics... I'm not sure if that's better.

Some things just don't work out.

Nothing "happened to" the Semantic Web. It's here, and it's growing in utility and capability as the technology matures. What isn't necessarily growing is understanding of what the Semantic Web really is, who it's for, how to use SemWeb capabilities, etc.

I'll accept some responsibility for that last bit, as somebody who has been active in promoting, and advocating for the adoption of, SemWeb tech. I could do more / do a better job in that regard.

I think the general understanding of the need for formalized semantics on the web is going to grow when people realize that chatbots think the answer to "name a fruit that isn't orange" is "an orange": https://hashtag.ai/blog/2018/09/23/fruit.html

Well, now's as good a time as any to start! Let's say I wanted to throw a layer of semantic markup over an existing site - where would I go to figure out what schemas to use and how to use any given schema (it's been a while since I tried to SemWeb up a site.)

That's an interesting question, because it has a few assumptions baked into it. I'd love to write a long essay on that right now, but I don't really have time. But to answer the core question, one good place to start familiarizing oneself with the various schemas that are available is:


There's also a lot of good information at


although I fear that site doesn't get as much love / attention as it should, and some of the links might be stale.

Care to share a concrete example of growth? Very curious onlooker here.

Wikidata would be an obvious example. They support SPARQL queries via the endpoint at https://query.wikidata.org/

The main downfall of the Semantic Web effort is not technological, but due to a misalignment of incentives. Semantic web formats require content creators to annotate metadata for machines, whereas webpages are intended for human readers.

We think that the main way to achieve a practical semantic web is to have AI synthesize a Knowledge Graph from applying CV/NLP techniques to understanding all webpages. More about our project here:


We actually are on the semantic web for healthcare. See:


Other fields are moving towards semantic also

What happened is that it became. We learned that we can't trust anything on the Web, but jolly, it is rather nice that you marked up your opening hours.

In the end the semantic Web uptake was on the data not the meta data.

With regard to the academic semweb grant story, these same idiots are now chasing the cloud without a clue. And before that it was grid.

For some fields there is uptake because it solves problems. But they hardly market themselves as semweb. It's more profitable to market that they provide solutions.

The only way this works is if there is some central agency enforcing and standardizing the tags/APIs.

People (and programmers) are lazy and ignorant. If it's not in-your-face broken, it frequently won't get fixed. I used to have an HTML validator enabled by default in Firefox, which would point out HTML errors on every page I landed on. A large share of web pages had in-your-face HTML errors: all the tools for checking broken HTML didn't mean people put in the effort to ensure their pages were error free. Basically, if the page rendered "correctly" in the developer's browser and maybe another test browser or two, then it was job done.

The potential of the semantic web is massive. It's hard to understand why it hasn't been a massive game changer. I remember, a few years ago, writing crazy queries to answer questions that still have no equivalent today, like: find the CEOs of companies that have fewer than 100k employees and were created before Neil Armstrong walked on the moon. The winner-take-all approach we have today, with all those silos, doesn't benefit humankind in any way.

I just had to try this with Wikidata. Unfortunately did not work well


I am surprised that the author did not mention the semantic web we did get. It's in his source, after all. If you look at the header of the page, you'll see the lines below. Sure they are not the utopian version of the semantic web we were promised. Instead, it's even better: it's the pragmatic semantic web we need:

    <!-- Twitter -->
    <meta name="twitter:card" content="summary" />
    <meta name="twitter:site" content="@TwoBitHistory" />

    <!-- OpenGraph -->
    <meta property="og:image" content="https://twobithistory.org/images/logo.png" />
    <meta property="og:url" content="https://twobithistory.org/2018/05/27/semantic-web.html" />
    <meta property="og:title" content="Whatever Happened to the Semantic Web?" />
    <meta property="og:description" content="In 2001, Tim Berners-Lee, inventor of the World Wide Web, published an article in Scientific American.
" />

The author specifically mentioned opengraph, or are you referencing something else?

The author does but as an aside. I think OpenGraph is the practical application of a more utopian semantic web and it deserves a lot more recognition.
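
For anyone wondering what consuming those tags looks like in practice, here is a minimal sketch using only Python's standard library. The names `MetaTagParser`, `extract_metadata` and `SAMPLE_HEAD` are made up for illustration, and real consumers (link previewers, crawlers) surely handle many more edge cases than this does.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect OpenGraph and Twitter card metadata from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # OpenGraph uses the "property" attribute; Twitter cards use "name".
        key = attributes.get("property") or attributes.get("name")
        if key and key.startswith(("og:", "twitter:")) and "content" in attributes:
            self.metadata[key] = attributes["content"]


def extract_metadata(html):
    """Return a dict of og:* and twitter:* keys found in the given HTML."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.metadata


# A shortened copy of the header quoted above.
SAMPLE_HEAD = """
<meta name="twitter:card" content="summary" />
<meta property="og:url" content="https://twobithistory.org/2018/05/27/semantic-web.html" />
<meta property="og:title" content="Whatever Happened to the Semantic Web?" />
"""

metadata = extract_metadata(SAMPLE_HEAD)
print(metadata["og:title"])
```

The whole "schema" here is a flat set of key/value pairs, which is probably a large part of why it got adopted.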

This is why we can't have nice things. When someone (sure, W3C- anyone) tries to, you know, design stuff before it's built, everyone whines and complains about how it's all too "mathy", how the standard is bloated so we shouldn't have any standards at all, how the standard is not good for "real work", etc etc. Then, since all those hard-working programmers are, allegedly, too dumb to get their heads around XML (XML! Oh, the complexity!), RDF and OWL, along come the big companies and create their own, de facto standards. So now, if you want to do work, you have to abide by those standards, whether you like it or not and you don't even get to influence them, because they're not some open web committee that you can badger about the quality of their standards, but closed, walled-up conglomerates that don't care how nice the web is, only that they can control it.

Not to mention, the end result is a hairball alright, a big pile of tangled up hacky, ad-hoc APIs, bashed together as fast as possible, "to get things done quickly".

... and everyone is still using XML anyway.

HN, sorry for the rant. RDF was such a good idea, especially as human-readable, human-editable turtle.

There was a lot of clunkiness there, in the W3C standard, but, W3C standards are made to be openly debated and revised. Facebook APIs, on the other hand - not so much.

It's been replaced by the Decentralized Web as the latest fad (also with the backing of Tim Berners-Lee). Let's see how far this one goes as well (though I'm really rooting for it to succeed).

The semantic web relies on people not lying. Unfortunately, meta tags were instantly filled with seo spam as soon as they were implemented. It's a trusted client approach to data integrity.

It was supplanted by AI and machine learning.

Not only did these outbuzzword the Semantic Web, but as it turns out it's much easier to have a bunch of GPUs running CNNs to extract semantic info from the dirty data you have rather than attempting to cram that data into a well-specified ontology and enforcing that ontology on new incoming data.

Extracting information is not the issue. Figuring out what it is is what the Semantic Web (or any good Ontology) helps solve.

For AI/ML to provide that insight, the ML needs to have access to a good Ontology.

The reason is more nuanced. The main reason being money: https://news.ycombinator.com/item?id=18036041

"The Semantic Web will never work because when it works, you won't know it's the Semantic Web". Source: https://twitter.com/TomDeNies/status/653572860766781440

It couldn't make money...

One of the best examples of the semantic web was Daylife[1], and they wound up being "acquired" by two bit players[2] that figured out how to monetize things better.. :-/

[1] https://en.wikipedia.org/wiki/Daylife

[2] https://techcrunch.com/2012/10/17/content-licensing-service-...


I made a few observations in my own comment.

One being that there is no usable graph store you and I can use as of 2018.

Another being about monetizing the Semantic Web when playing the role of the data/ontology provider. You provide all the data while the consumers (Siri, Alexa and Google Home) get the glory: https://news.ycombinator.com/item?id=18036041

Storage is easy: you put your triples into linearly scalable Cassandra. What you want is some fancy query language on top of that, right?

The financial incentives have become stronger for building walled gardens than a semantically open web. The semantic data has been more useful to giants that monetize it, than to millions of small publishers who are supposed to abide by the rules and maintain it. The issue is even bigger if you are listing valuable goods - from products, to jobs, to real estate/rental listings as a part of your marketplace or business. Aggregators like google can scrape and circumvent you, by taking away your users earlier in the acquisition chain, so why bother giving them your product graph.

There is great power in RDF graph databases (AllegroGraph) and the rapidly growing collection of valuable ontologies: https://lov.linkeddata.es/dataset/lov/

The barrier to entry is thinking in “graph” instead of relational DB, which is a big cultural change, and then shifting focus and attention to the information science of building valuable ontologies. Once you make the leap, it’s hard to go back - it’s an order of magnitude productivity gain.

The Semantic Web would never have worked IMO. All you have to do is take a look at Soundcloud's tags. People will tag their songs with whatever tags they think will help their music get hits.

The Semantic Web is alive and well and doing great, thank you very much. You should drop by some time and check out the real thing. Unfortunately, hype in any field will attract losers and opportunists, but why focus on the negative? Never mind academia: Some of the world's largest companies are investing serious money in semantic-web approaches to get a grip of their information resources. You won't see it because it's mostly behind the scenes, in intranets and infrastructure that feeds data to your nice shiny restaurant recommender, or whatever.

The software stack is getting better and more robust-- you can do things quickly with billions of triples that would take you weeks of development to program in a non-trivial relational database environment. The Semantics 2018 conference just took place in Vienna. It was heavy in industry presence and there was a lot of money going around. These guys don't give money outside the company unless they're going to get value for it.

So yes, reports of the imminent arrival of the Semantic Web ten years ago were greatly exaggerated. But if you're looking for a topic with an amusingly clueless commentariat, you'll do better to google "PHP object-oriented programming" (or just "hacker", for that matter).

I was in the unfavourable position of trying to implement a performance-critical commercial product on top of a triple store with OWL inferencing. It was unworkable. The slightest error anywhere in the dataset could break inferencing elsewhere (butterfly effect), not to mention the performance. We worked out the performance by adding layers of caching, but data correctness at that level is, in my view, unrealistic.

As it is now, the one who provides the data is the one who pays for everything (production, storage, compute, bandwidth), while middleman search engines take all the money. What if the middleman was required to pay a fee to index the data ? For example build money transactions into the HTTP protocol, where web sites could automatically ask for a small fee in order to "see more".

IIRC, the (by some definition) original hypertext system, Xanadu, had this. It was supposed to (among many other things) keep track of who quoted who and how much and make sure everybody got paid fairly. It was much too complex and never got anywhere, and was completely replaced when the WWW came along with its dead simple model of URLs and HTML.

No website would ask for the fee for SEO reasons.

The Semantic Web was not XML but RDF. XML was the serialisation format.

I thought the addition of elements to HTML5 like nav, main, article, etc. was the end result of the semantic web.

Not really. That stuff does relate to "semantics" in the "Semantic Markup" sense, but it doesn't actually have much to do with the "Semantic Web" per-se. I mean, yeah, there is a weak sort of connection there, but when people talk about the "Semantic Web" they are mostly talking about RDF, whether it's encoded using RDF/XML, N3, Turtle, JSON-LD, or "other". And along with RDF are related technologies like OWL, inference engines that reason over a triplestore, etc.
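
To make "triples plus an inference engine" a bit less abstract, here is a toy sketch in plain Python. It is not RDF, OWL, or any real triplestore API: `TripleStore`, `match` and `infer_types` are hypothetical names, and the single rule implemented is a simplified, RDFS-style subclass rule. The anatomy terms are only example data.

```python
class TripleStore:
    """A toy in-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return matching triples as a sorted list; None is a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )

    def infer_types(self):
        """Forward-chain one rule until nothing new is derived:
        (x type C) and (C subClassOf D)  =>  (x type D)."""
        changed = True
        while changed:
            changed = False
            for x, _, cls in self.match(predicate="type"):
                for _, _, parent in self.match(subject=cls, predicate="subClassOf"):
                    if (x, "type", parent) not in self.triples:
                        self.add(x, "type", parent)
                        changed = True


store = TripleStore()
store.add("femur", "type", "LongBone")
store.add("LongBone", "subClassOf", "Bone")
store.add("Bone", "subClassOf", "AnatomicalStructure")

store.infer_types()
femur_types = [t[2] for t in store.match(subject="femur", predicate="type")]
print(femur_types)
```

Only the first fact about the femur was stated explicitly; the other two types are derived. Real reasoners support far richer rules, which is also where the fragility people complain about in this thread comes from.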

One core problem is that the attention economy of the conventional web has driven content creators so far away from the path of truthfulness that they would just ruin any truly distributed ontology with lies.

As soon as you start actually consuming semantic data it becomes a protocol that begs to be "hacked".

The $64,000 question is: how do you implement the semantic web without changing any HTML or backend code?

Because the web is never going to change to adopt a semantic web standard. What we have now are facsimiles of the semantic web, things like Open Graph (which only provides the gist of page media, if that), proprietary search engine results, and proprietary APIs for walled gardens like Facebook.

It's looking like machine learning is going to provide richer gists and then manually-coded directories will provide user interface controllers for those gists in Alexa and other agents. It's a far cry from a truly semantic web but most people won't know the difference.

This is actually a pretty easy problem to solve, but to do it, we'd be running against the wind of capitalism. The semantic web is running behind the scenes at Google, ad agencies, even the NSA. Except they've built it around people's private data instead of publicly accessible documents.

Just to throw some ideas out there, I would start with the low-lying fruit: we need a fully-indexed document store that doesn't barf on mangled data. We need a compelling reason for people to have public profiles again (or an open and secure web of trust for remote API access). We need annotated public relationship graphs akin to ImageNet or NIST for deriving the most commonly-used semantics (edit: DBpedia is a start). Totally doable, but developers gotta pay rent.

> Imagine a Facebook that keeps your list of friends, hosted on your own website, up-to-date, rather than vice-versa. Basically, the Semantic Web was going to be a web where everyone gets to have their own personal REST API, whether they know the first thing about computers or not.

Sounds more or less like what the Urbit project (https://urbit.org/) is trying to accomplish. Not an endorsement; it has serious flaws just like everything else. This is a very hard problem to solve. But I sure do hope someone manages to figure it out.

It was the answer to a problem that nobody had, straight out of the dotcom bubble days where those answers could be sold as a business cases or for university grants.

Basically any form of structured data, be it in XML or JSON, served through some channel of data, is everything people need. There is no benefit in further standardization. Simple, informal standards work better than monstrous specifications that nobody ever bothers to deal with properly. The most important part is reducing friction, that's why JSON is the most successful format despite its shortcomings.


I made a few observations in my own comment.

One being that while a set of SPARQL Federated Queries would elegantly replace my assorted, custom collection of python scripts, scrapy and PhantomJS (slowly porting over to puppeteer) programs talking to Postgres, there is no usable graph store you and I can use as of 2018.

Another being about monetizing the Semantic Web when playing the role of the data/ontology provider.

The majority of your clients will want your data in relational formats rather than Turtle/RDF files anyway.

.. and if you do provide all the data, the consumers (Siri, Alexa and Google Home) get the glory: https://news.ycombinator.com/item?id=18036041

If you ask that question in 5 years perhaps the answer will be that it came alive with AI. As AI becomes more important so should the Semantic Web since it will provide data to train the machines.

Someone commented recently in another thread that you can make software from order (via languages) or from chaos (ML), or often a combination of the two.

Perhaps determining "meaning" on the web is similar, where the synthetic "order" approach is semantic markup, but the analytical "chaos" approach is NLP, image object recognition, etc.

I think you need both, since human produced content doesn't always follow discrete predefined categories, but also has patterns that can be pre-classified to solve real problems more easily.

The web is just a subset of the internet, and a shrinking one as a share.

More communication over the internet is between client-server apps rather than between browsers and other open standards, as was envisaged when the web started. JavaScript apps and mobile apps are tightly coupled with their services.

Although HTTP has proven resilient, HTML/XML has not. XML's verbosity that enables semantic meaning is exactly its undoing compared to JSON. When meaning is built into both client and server, communication needs to be skinny not rich.

> Sean B. Palmer, [...], posits that the real problem was the lack of a truly decentralized infrastructure to host the Semantic Web on. To host your own website, you need to buy a domain name from ICANN, configure it correctly using DNS, and then pay someone to host your content if you don’t already have a server of your own.

Exactly. That was the missing brick of a truly distributed linked (semantic) web, which now has a chance to be filled by IPFS/IPNS/IPLD or some upcoming standardized equivalent.

Every few months my mind circles around that topic.

I think it won’t work if the underlying transport/presentation is „the web“ (i.e as in Web 2.0).

Instead of decorating semantics/hints around the actual information mostly for SEO reasons it should work the opposite: using all available semantic hints and information bits there already are to create new information by aggregating and putting things in a new context.

It adds value while building upon previous knowledge and allows information and context to be relevant indefinitely.

Try expressing not such a vague concept, but a concrete, real world application that actual stakeholders would be interested in investing into.

You will find that:

A) It's probably not that valuable

B) None of the hard problems are technological

You’re right, I didn’t go into concrete detail.

It’s also hard to describe, but the best analogy I can come up with: picture a CMS that actually is about _content_ instead of being tied to presentation. So e.g. writing an article about a certain historical event at a certain place consists of stringing all the information and relationships together.

Bringing the correct pieces together eliminates errors and gives a piece of information more meaning when used in different contexts.

Being able to correctly reference e.g. Venice, Italy instead of Venice LA, CA makes a huge difference when looking up time schedules, weather forecasts, flight connections etc. Sure, there are IATA codes for airports. Wouldn’t it be great to mention Springfield in an article and have all the information about that place (as well as all „backlinks“)?

I also don’t think it is a technology problem.

However, I’d like to think about this more in terms of a DRY principle for information. There are publications on the web that exist solely to duplicate short-lived, relatively low-quality information and put ads on it. This may be acceptable from some consumers’ point of view, but it fails to create a long-lasting contribution to mankind.

Just dumping all the bits we currently store into massive archives is possible, but taking measures to limit the amount of „information archeology“ needed to understand this data feels like the right thing to do.

I’ll iterate on that.

Sorry if this reads even more confusing and esoteric, need some sleep now.

It's worth looking at what you wrote once more from the perspective the parent poster suggested: "that actual stakeholders would be interested in investing into".

You're giving an example about an improved CMS. If I imagine myself in the shoes of any actual stakeholder who's got a bunch of employees using (or is paying for the development of) a nontrivial CMS, I don't really see why they would consider your proposed features as needed and valuable. They don't have a problem with referencing the correct Venice, they can say what they want to say as accurately as they want with current CMSes. If they're writing an article, then either the weather forecast and flight connections would be relevant to the intended message and included by the writer/editor, or otherwise they should be avoided in order not to distract readers from what the publisher wants. Similarly, having 'backlinks' may be considered harmful if the publisher doesn't want the reader to easily go to another resource.

That is the point of looking at the benefit to stakeholders. It doesn't matter if some approach will or will not "create long-lasting contribution to mankind", that's not why technologies get chosen - if the stakeholders who are making the decision on whether to use this technology have an incentive to do so, it will get used, and if they don't have such an incentive, then the technology will die.

And that's the prime weakness of semantic web - its usefulness requires content creators to adopt the technology, but it doesn't provide any strong incentives for these content creators to do so; the main potential benefits accrue to someone else e.g. the general public, not to those who would need to bear the costs of adapting the content. I don't see how it can be successful without addressing this important misalignment of incentives, since incentives matter far more than technology.

Mixing semantic web techs with ML is hot in the domain right now.

Because ML solves problems symbolic approaches cannot solve (dealing with huge amounts of raw, poorly structured data) and symbolic approaches solve problems ML cannot (dealing with logical reasoning and inferences, like in the query "give me all cities of more than 1 million inhabitants that are less than 300km away from Paris, sorted from southernmost to northernmost").
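
Once the facts are structured, that kind of query is just a filter plus a sort. A rough sketch in plain Python, using made-up cities and coordinates (everything in `CITIES`, and the `City` and `big_cities_near` names, are hypothetical; a real system would run something like this as a SPARQL or SQL query over an actual dataset):

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0


@dataclass(frozen=True)
class City:
    name: str
    population: int
    lat: float
    lon: float


def haversine_km(a, b):
    """Great-circle distance between two cities, in kilometres."""
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlambda = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def big_cities_near(cities, center, min_population, max_km):
    """Cities above min_population within max_km of center, south to north."""
    hits = [
        c for c in cities
        if c != center
        and c.population > min_population
        and haversine_km(center, c) < max_km
    ]
    return sorted(hits, key=lambda c: c.lat)  # lower latitude = further south


# Entirely made-up data; one degree of latitude is roughly 111 km.
CENTER = City("Center", 2_000_000, 48.0, 2.0)
CITIES = [
    CENTER,
    City("North Big", 1_500_000, 50.0, 2.0),   # about 222 km away
    City("South Big", 1_200_000, 46.5, 2.0),   # about 167 km away
    City("Far Big", 3_000_000, 51.0, 2.0),     # about 334 km away, too far
    City("Near Small", 200_000, 48.5, 2.0),    # close, but too small
]

result = [c.name for c in big_cities_near(CITIES, CENTER, 1_000_000, 300)]
print(result)
```

The hard part, of course, is getting populations and coordinates into that shape in the first place, which is where the ML side comes in.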

Interesting, this post implements the og tags for social sharing (image and description) but skips the full og:article tags with author, date etc.

Edit: https://search.google.com/structured-data/testing-tool/u/0/#...

Blogger/blogspot and RSS. We already had it, and still have it. It's not as pretty as Instagram, though. It's all over.

The RDF-based RSS versions never saw anywhere near as much adoption as the non-RDF based ones, though.

On a related note, I'd also be interested to know how the semantic web was related to the rise of "knowledge graphs". That's another term that we heard about for a while (and clearly was implemented - Facebook, Google and Microsoft have them, for example), but I haven't heard much more publicly for a while.

Companies that don't have pressing revenue issues and a large petty cash allowance use it. A LOT. As a result, most BigCo and BigGovts use a lot of Semantic Web. There are full time employees at these places where all they do, full time, is write pages of documents about applying the Semantic Web to solving a problem. For example, thousands of man hours are spent by the U.S. Military on the Semantic Web every year.

The reality for the rest of us: It does not make financial sense to build the Semantic Web.

It's a chicken and egg problem.

No one has been able to find a way to monetize the Semantic Web when playing the role of the data/ontology provider.

You can't slap on an ad. You hand off the data and someone else renders it and slaps on an ad, raking in all the money.

If you are a data provider, it's much more practical going the traditional way: using relational databases and importing/exporting/feeding data using relational formats rather than Turtle/RDF files. The majority of your clients will want your data in that format anyway.

Designing, Building, Maintaining, Querying an Ontology takes a huge amount of expertise/resources.

Even if you had all the money in the world to obtain the data: the capable, scalable triple stores that can hold an Ontology/Graph dense enough to be meaningful, while providing any level of practical turnaround time for queries, currently number in the single digits, and none of them is open source or free.

Individuals like you and I or small businesses just don't have this expertise/resources.

We would be spending too much time writing our own graph database, carrying out alignments between entities from various datasets, finding and correcting bad data, etc., before we even get to what we originally set out to do.

Instead, most of us scrape the data from HTML/REST+JSON, use taxonomies at best, write custom code to do what we need to get done, and call it a day.

12 years ago, when I started learning about the Semantic Web, I envisioned that by 2018 we would be using software agents to make our lives simpler:

1. My software bot looks at my calendar to figure out my day's trip and queries the traffic data from the endpoints relevant to my route

2. It also tries to estimate when I will have time to eat and generates a list of nearby restaurants or fast food locations depending on my available time

3. It queries endpoints from gas stations relevant to my route to figure out whether I should fill up on gas

4. If portions of my route have toll roads, it finds out whether I already have a pass and reminds me to put it in my car ...

A critical component of this happening would be support for federation, à la SPARQL Federated Query.

While SPARQL does support Federated Queries, no one has an incentive to support the feature because of the above mentioned monetization challenge.

So is my vision in shambles?

No. I still get things done, but now through an assorted, custom collection of Python scripts, scrapy and PhantomJS (slowly porting over to puppeteer) programs talking to Postgres.

There is not a single line of SPARQL involved in the whole pipeline and it does what I want it to do.

... just like everybody else, we are getting along just fine with our hacky solutions.

Who needs the semantic web when we have machine learning to extract the information we want?

Yes, Google surprised us all and made the semantic web unnecessary, even though I doubt it would have come to pass anyway. Consider a paper in 1998 by Sergey Brin, Larry Page, and others, that showed them finding the titles of books and the names of their authors amid the sludge of the World Wide Web:

> We begin with a small seed set of (author, title) pairs [...]. Then we find all occurrences of those books on the Web [...]. From these occurrences we recognize patterns for the citations of books. Then we search the Web for these patterns and find new books. --- http://dis.unal.edu.co/~gjhernandezp/psc/lectures/MatchingMa...

Starting with a seed of just five author-title pairs, their formula found thousands more.
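
The loop described in that quote can be sketched in a few lines of Python. This is a heavily simplified toy, not the paper's actual algorithm: it treats each line of a tiny in-memory corpus as a "page", learns whole-line contexts instead of scoring pattern specificity, and only handles the title-before-author order. The function names and the corpus are made up for illustration.

```python
import re


def find_patterns(corpus, known_pairs):
    """Learn (prefix, middle, suffix) contexts around known (author, title) pairs."""
    patterns = set()
    for line in corpus:
        for author, title in known_pairs:
            t = line.find(title)
            a = line.find(author)
            if t == -1 or a == -1 or a < t + len(title):
                continue  # this sketch only handles "title ... author" order
            prefix = line[:t]
            middle = line[t + len(title):a]
            suffix = line[a + len(author):]
            patterns.add((prefix, middle, suffix))
    return patterns


def apply_patterns(corpus, patterns):
    """Search the corpus with the learned contexts and return the pairs found."""
    found = set()
    for prefix, middle, suffix in patterns:
        regex = re.compile(
            re.escape(prefix) + r"(?P<title>.+?)"
            + re.escape(middle) + r"(?P<author>.+?)"
            + re.escape(suffix) + r"$"
        )
        for line in corpus:
            m = regex.match(line)
            if m:
                found.add((m.group("author"), m.group("title")))
    return found


def bootstrap(corpus, seeds, rounds=2):
    """Alternate between learning patterns and harvesting new pairs."""
    known = set(seeds)
    for _ in range(rounds):
        known |= apply_patterns(corpus, find_patterns(corpus, known))
    return known


CORPUS = [
    "<li><b>Foundation</b> by Isaac Asimov</li>",
    "<li><b>Dune</b> by Frank Herbert</li>",
    "<li><b>Neuromancer</b> by William Gibson</li>",
    "<p>Frank Herbert wrote many novels.</p>",
]

books = bootstrap(CORPUS, {("Isaac Asimov", "Foundation")})
print(sorted(books))
```

One seed pair is enough to pick up the other two entries here, because the page repeats the same markup for every book. On the real web the patterns are noisy, so the hard part is deciding which patterns to trust.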

Extracting information is not the issue. Figuring out what it is is what the Semantic Web (or any good Ontology) helps solve.

For ML to provide that insight, the ML needs to have access to a good Ontology.

We are trying to build a global knowledge graph from the web using AI at diffbot.com.

Web APIs happened instead.

And ML also happened. If you can do pattern recognition and semantic analysis without the markup, why bother with it?

But is it that easy? Any ready to use OSS projects available?

Standards compliance is short-term expensive, someone like Facebook would have to make it cheap and/or dangle a carrot to make it worth it

not only the Semantic Web, but semantic anything. All the millions of taxpayer funds spent in semantic research...

This I Know Is True:

1. There are few Internet historians. It is difficult and thankless work that doesn't pay well. And too much of the information is lost to history, or can be contested by gainsaying, even the true bits.

2. The first browser, Silversmith, was released in 1987. [http://www.linfo.org/browser.html] (thanks BELUG). It worked in English and grew out of my work on the Association of American Publisher's Electronic Manuscript project, the first U.S. electronic publishing project using tags outside of IBM's product offerings. At the time I had been on the Internet since 1972 and I was tired of typing 128.24.67.xxx. (There was a phone book of IP addresses at the time and I was listed in it for work I was doing on satellite image processing on Illiac.)

3. The second browser, a version of Silversmith, was designed for Old Norse for a researcher and it used Norse runes for the display; the controls were in Roman characters.

4. The third browser, a version of Silversmith, was a semantic browser for a U.S. military application. It was successful as far as I know.

5. The fourth browser, Erwise, [http://www.osnews.com/story/21076/The_World_s_First_Graphica...] came about after I gave a paper on Silversmith in Gmunden, Austria in 1988. Erwise worked in the Finnish language. I understand that TimBL looked at it before developing the W3c browser but decided against using it because the comments were in Finnish.

6. I have seen various dates for the browsers from TimBL and MarcA, but they were at least a few years after Silversmith. We can call them the 5th and 6th, but I'm not sure of the ordering. Both of these browsers were based on the earlier AAP Book tag set.

7. Some of my work on Silversmith grew out of Ted Nelson's work on the Alexandria (Xanadu) project. Much of his work has still not been implemented, but that may soon change.

8. Ted developed hypertext controls for printed documents. In that approach when you finished reading a child section a return page number was there to show you where you left off.

9. I developed the first eHypertext system for networks that would link you back to your source document that you came from by pressing the ESC key. In Silversmith you could link between text, images, sound and semantic information.

10. Silversmith is a scalable system. Please observe that just because you don't know how that is done does not mean it cannot be done. That is what I was told about browsing and searching earlier, too.

11. At the time Silversmith was developed, it was understood that VC's would not talk to you without you having a working product. Once I had it working, I found that VC's would still not talk to you. I talked with about a dozen Boston VC's. They would not even sit for a demonstration. I did a demonstration for the ACM in 2007 (thanks PeterG). That is the nature of tools and the bane of toolsmiths, no one wants to pay for tools. I have a recurring nightmare of the yokel who returned his anvil to the smithy saying, "It doesn't work. I can't use it to make beautiful horseshoes like Kevin does, and he has the same anvil. There's something wrong with this one." With Silversmith I lost a competition among 80 vendors for a search application, when none of the others even had an application. One competitor even called me and demanded all my specs and internal design documents. That is largely why you will not find any published information on Silversmith.

12. I can't tell you how many times I have been schooled on programming languages. "You should program that using ThinkC/ObjectiveC/SmallTalk/the X-System." "You need to switch to Ruby-On-Rails/Perl/Python/Awk, that's the way to do it." People, it's not the language, it's the data structures and the code that is important. And, enough with "speed is important." We are all using supercomputers and they will never be fast enough.

13. Silversmith predated the W3c work by several years; that is why I prefer to use the term "semantic web" (lower case) to distinguish it from the W3c term. I discussed the term "web" with Ted and he agreed that it was in use before the WWW.

14. Monetizing a tool is an interesting discussion. No one wants to pay $1 every time they pick up a hammer. But for a cabinetmaker, his/her primary tool is the table saw. This means that they are more than willing to pay on a regular basis for maintenance. They must, it is their livelihood, and the manufacturer is going to make money on that maintenance. He does not expect to be able to charge the cabinetmaker a portion of his sales. That is not how that market works. To me, even razor and blades are not fully monetized if you sharpen your own blades.

15. Semantic work is "path dependent" work. Once you start down a certain path it becomes very difficult to retrace your steps. I used to be critical of academics who "sold out" to the W3c vision, but now I realize that for the most part they are trying to provide what the industry wants and uses.

16. Work on Silversmith continues and I'm pleased to say that it is progressing well. The next version will assist in finding and using knowledge in a more conceptual way.

Appreciate the history.

History's written by the victors, so this is a little SGML oriented -- I still recall my own transition from developing Gopher sites to WWW sites for Lynx, and how at the time, these things felt the same -- though it was clear what would win.

Wikipedia's discussion of browser history also omits Gopher + clients, which is too bad, it was kind of a big deal at the time.


To put it on the timeline, your Silversmith was 1987, Berners Lee's WWW browser 1990, McCahill's Gopher 1991, Lynx 1992, and Andreessen's Mosaic 1993.

Still blown away by the force of an idea trying to happen.


The author of the semantic web forgot about money. What does a website owner gain by freely sharing markup?

I don't think the use of Semantic Web technologies presupposes open data. Linked Data doesn't have to be Linked Open Data. One could create services that adhere to the standards while still monetizing them.


REST API/JSON is the de facto industry standard for data exchange. No need to parse and extract data from semantic tags.

We're getting there with CSS grid. First, we have to deal with the usual "You actually have to put div A inside div B in order to do X. Now which one gets the semantic tag? And what happens when we also need div C as a container to fix Y problem? "
