Three reasons why the Semantic Web has failed (gigaom.com)
19 points by rhythmvs on Nov 13, 2013 | 28 comments


Warning: snark ahead.

"The result is an inherently boring web of data. Google’s Knowledge Graph promotional video is a great example of how boring this web can be. “Let’s say you’re searching for Renaissance Painters”…. Really? Who searches for that?"

Somebody with a brain that is used for more than gross muscle control over a TV remote?

" I really don’t care what Leonardo DaVinci’s height was or which Nobel prize winners were born before 1945. I care about how other people feel about last night’s Breaking Bad series finale. How did they find the ending? What other series or movies might I enjoy based on those experiences?"

This made me throw up in my mouth a little.

I can understand that the author wants results to be more applicable to the everyday life of the searcher, but I don't see his vision of a Semantic Web as any more useful than the current one.

Finding out what other like-minded people think about my favorite TV show improves the Web, and thus my life? Really?

I'm all for more accurate classification of online data. What I don't want is "boring" (in the author's words) data being pushed aside in favor of what somebody thinks is "interesting".

We've already got companies doing that, in the form of "targeted ads".


Enchanting way to put it! I’m also horrified by the mercantilist eulogy of blatant ignorance, these days. But the author of that gruesome piece has a few points: RDF, OWL, XML specs are tl;dr, too cumbersome to be of practical use, developers don’t bother, and founders don’t care. Meanwhile Snapchat turned down a $3-4 billion offer today. Advances in AI are made because of consumer driven services. The days of CERN’s Internet are long past…


> But the author of that gruesome piece has a few points: RDF, OWL, XML specs are tl;dr, too cumbersome to be of practical use

I don't think that's necessarily true. While RDF, RDF/S, OWL, etc. can get complicated, it isn't that hard to learn the basics, and the 80/20 rule probably applies here. You don't have to walk around mumbling about Horn Clauses, First Order Logic and Gödel's Incompleteness Theorem to do useful stuff with semantic web technologies.

> developers don’t bother, and founders don’t care.

This developer / founder does. We are baking semantic web technologies into our offerings, and that stuff enables some really, really wicked cool functionality, IMO.


"Advances in AI are made because of consumer driven services."

Not necessarily. IBM's Watson doesn't seem to have been aimed at consumer driven services. IBM's market is primarily large companies.

And while we may not have seen much RDF or OWL based stuff on the public web, that doesn't mean that it isn't being used in proprietary services, such as searching legal documents.


That begs the question: searching those documents is only possible, because someone did all of the tedious OWL/RDF/etc. markup beforehand. Basically, that’s the problem with how the semantic web is perceived generally: sprinkling hints for machines into ambiguous, man-written documents (e.g. Google’s schema.org).

The Watson computer is good at playing Jeopardy, but it’s just fact collecting. Inferring a rule set from a corpus of documents, reasoning, and drawing solid conclusions, is a different game. We would be better off with a technology that would go over any given corpus of unstructured plain text, parse, tokenize, normalize, structurize, iterate, and dump the graph in a queryable data store. Natural language processing with a semantic reasoner. Legal documents are indeed a great use case.
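To make the pipeline above concrete, here is a minimal sketch of the "plain text in, queryable graph out" idea. It is only a toy: the `IS_A` pattern, the `GraphStore` class and the sample corpus are all hypothetical, and a single regex stands in for the parsing and reasoning that real natural language processing would need.

```python
import re
from collections import defaultdict

# Hypothetical toy pattern: "<A/An/The> <subject> is a/an <object>".
# Real legal text would need a proper parser, not one regex.
IS_A = re.compile(
    r"^(?:an?\s+|the\s+)?(?P<subj>.+?)\s+is\s+an?\s+(?P<obj>.+)$",
    re.IGNORECASE,
)

def normalize(text):
    """Collapse whitespace and lowercase so equal phrases map to one node."""
    return " ".join(text.split()).lower()

def extract_triples(corpus):
    """Split text into sentences and emit (subject, 'is_a', object) triples."""
    triples = []
    for sentence in re.split(r"[.!?]+", corpus):
        match = IS_A.match(sentence.strip())
        if match:
            triples.append(
                (normalize(match["subj"]), "is_a", normalize(match["obj"]))
            )
    return triples

class GraphStore:
    """Tiny in-memory store, indexed by (predicate, object) for lookups."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, subject, predicate, obj):
        self._index[(predicate, obj)].add(subject)

    def subjects(self, predicate, obj):
        return sorted(self._index.get((predicate, obj), set()))

# Made-up sample corpus, for illustration only.
corpus = "A lease is a contract. A will is a legal document. A mortgage is a contract."
store = GraphStore()
for triple in extract_triples(corpus):
    store.add(*triple)

contracts = store.subjects("is_a", "contract")
print(contracts)
```

The hard part is everything this sketch skips: ambiguity, negation, cross-references between clauses, and inference beyond direct lookup.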

I went to see my lawyer yesterday. Over the last three years I paid (i.e. was extorted) about $30k in legal fees. That’s a lot of money for thin air. Withal, most of the research and paperwork I did myself: my lawyer merely copy/pasted it and put his court-accredited signature under it. To the legal scribes I’m but a layman, and may not speak on my own behalf in court (unless I want to jeopardize my cause). This is despite the fact that I am particularly good at parsing texts, studied linguistics, did a post-doc in natural language processing, and know how to read laws (that are assumed to be known to and understood by all citizens, in the first place, are they not?). Fact is: lawyers, judges and clerks form a self-sustaining caste that benefits from the de facto (and de jure) monopoly on interpreting the law. They have a lucrative interest in laws and bills that are poorly written, contradictory, and full of ambiguity and logical flaws. They share that interest with retarded lawmakers who produce all that ill-conceived cruft.

Suppose we had our laws written in a formal language, with a well defined regular grammar; suppose we had unit testing for new bills — machines could administer justice, and they would be much better at it than the dunces who went to law school.

There’s simply too little at stake in developing semantic technologies that would do away with human corruption, underdeveloped intelligence, sentiment, and subjective interpretations. Or rather: some industries (especially those which are controlled by the powers that be) have too big a stake in preserving backward human knowledge parsing. If big law firms have an interest in using such technologies, they have it only as long as they gain a competitive advantage from it (vis-à-vis those which don’t have/use such technologies). Legal corpora, neatly marked up by hand (or semi-automatically, no difference), are certainly a valuable asset that you wouldn’t want everyone to have cheap access to: not your competitors, and especially not your litigant clients.

If we had instead semantic technologies that would cheaply produce queryable knowledge systems from large, impervious document collections (like our codes and jurisprudence), then, very likely, lots of sectors in our so called “knowledge society” would be disrupted, leaving lots of overpaid “knowledge workers” unemployed, overnight. That’s a threat to the very industries which are supposed to support research and development of such technologies, as potential customers.

That’s different with IBM’s customers, I guess: they do have an interest in disrupting their industries (or rather the industries in which they are newcomers), using tech. And they may do so, only because there’s no monopoly guaranteed by a selfish legal system. Maybe also because present day’s state-of-the-art in semantic technologies and AI offers good enough technologies for these use cases, which are less complicated applications than those that would be needed for use cases wherein more difficult knowledge parsing is required, _and_ are bound by a legal/economic anathema?

Anyhow, I will support any startup that creates such tech with the intention of running the human legalese interpreters out of business. And that’s an exhilarating thought, because if such technology were produced, it would be equally good at solving problems in all branches of science.


> That begs the question: searching those documents is only possible, because someone did all of the tedious OWL/RDF/etc. markup beforehand. Basically, that’s the problem with how the semantic web is perceived generally: sprinkling hints for machines into ambiguous, man-written documents (e.g. Google’s schema.org).

Well, some work is underway to automate a certain level of semantic extraction from works that were not explicitly marked up as such by a human.

That said, I get what you're saying about the law thing, and I think we're still a decent ways off from a computer that can truly understand the legal code. :-(


Thank you, now I feel better as I see more sane souls. The OP's article is full of personal, short-sighted assumptions.


I’m glad, too. Was curious about HN’s position, and hoping for some sensible opinion as regards the ideal of a Web of knowledge — and its future. But my! I’m not the author of that piece. Just submitted it to stir some discussion.


> But my! I’m not the author of that piece.

Ah ok, sorry about that!


> We need a web in which information (both questions and answers) finds you based on how your attention, emotions and thinking interconnects with the rest of the world.

This article sounds like it was written by an ad.


it was, essentially. look at the author's "credentials" at the bottom of the article.


This is one of the worst articles I've read on the subject, and that's a pretty strong statement.

> “Let’s say you’re searching for Renaissance Painters”…. Really? Who searches for that?"

It's an example that happens to be easy to demonstrate. It's pretty easy to think of other cross-dataset queries that you might be more interested in.


It's a horrific example, but the point being made is true, in that in areas such as music enough information is generated regularly in a non-structured way with no tie in to the semantic web that, the Facebook silo aside, practically no one has up to date information about bands etc., certainly not Freebase or similar. It's also a lost cause hoping for such things to be comprehensive.

Facebook have also become a serious problem, in that a lot of places publish there without realising or caring that their information essentially is locked in. This gives their graph search a simply enormous advantage over anyone else.


> practically no one has up to date information about bands etc., certainly not Freebase or similar.

Hmm.. that leads to a couple of random thoughts:

1. How up-to-date and comprehensive does it need to be? What kinds of queries will people need to access (either directly or indirectly) about music, to serve their purposes?

2. DBPedia, through Wikipedia, actually does have a lot of information about bands and musicians and music. For example, see:

http://dbpedia.org/snorql/?query=PREFIX+dbo%3A+%3Chttp%3A%2F...

3. But Wikipedia will never be comprehensive exactly because of their notability guidelines. All the latest new super-underground Norwegian Black Metal bands that are recording in a wood shack in somebody's backyard, are not going to have Wikipedia entries.

4. On the other hand, musicbrainz.org seems to have an awfully comprehensive set of listings. And their data is part of the semantic web / linked data cloud, as well.

By way of example, Verminous are something of an underground act, and are not on Wikipedia, but their info is in MusicBrainz.

Anyway... just thinking out loud here.


I think the complaint is more from a perspective that there has been relatively little activity in the information space for Renaissance Painters. I do not necessarily know if that is a true statement, but I do think it is fair to say that the majority of the information out there has already been heavily vetted in that area.

Also, it falls into the category of searches that you'd likely just as soon hit up wikipedia directly. :)


I did a PhD in art history, and I can assure you my ex-colleagues still do a lot of vetting of Renaissance painters. Their books and papers might make it to the Google Books or Google Scholar silos, one day: the findings in there, are still unretrieved by semantic web technologies. Present day’s search tech serves social chatter and pulp much better.


I have no doubt that they do. How do you think they compare, numbers wise, to other topics?

I'm not so sure I think it is the semantic web that is at fault. Seems that it is as much consumption as anything else. The web, search, and related technologies are all keyed around what people are actively consuming today. If you have trouble finding something, that is mainly because not many other folks are also looking for the same thing.


How many people, numbers wise, do you think were discussing particle physics? Yet, there we are: the Web was devised to foster such niche discussion and make the information in there easily findable and sharable amongst researchers. It just turned out that what Berners-Lee & Cailliau came up with could be applied to what the general public likes to talk about, too.

So I don’t believe that innovation in web and search technologies is keyed around what consumers want. Rather, it’s the other way around: more demanding use cases drive innovation, some of it can be repurposed and then gets fine-tuned thanks to the bigger R&D budgets that come with mass market application.

But the seed of innovation lies in the niches.


The author is from a company called Bottlenose, which (Wikipedia says) is a "company that analyzes social media and business data to detect trends for brands."

So it's pretty clear that he has an angle here.


Instead of the semantic web trying to create knowledge from data the author wants to take all that data and create more chatter and noise, oops, I mean buzz.

Don't waste your time reading this nonsense.


This author has no real, deep understanding of current AI research or of what the semantic web actually means. Throughout the article, the author not only fails to reference any authoritative sources that define the semantic web, but also tries to make fun of it with unrelated examples such as Google's Knowledge Graph.... Comparing the semantic web with the Knowledge Graph is like comparing graphs with oranges. Those are completely different beasts, built for different purposes. This goes to show the author's level of understanding and his/her intent. And ironically, everything that the author tried to sell us on is squarely part of the semantic web. For example, the Stream data he/she defined is precisely what the semantic web is trying to do: label and give meaning to specific data instead of treating it as a big chunk of text/binary. The author also mentioned pushing and pulling; I argue that the way information/data is disseminated has nothing to do with the ontology or the semantics of data.

And there you go, such a shallow article with a very aggressive link-bait title, just pathetic.


Yeah, the article is crap. But did you take a look at the site: bottlenose.com? It's kinda gorgeous.

Having been in the biz (NLP search with sentiment extraction) once, it still comes down to precision and recall, with poor precision being the fastest way to lose sales. I didn't see any indication of bottlenose accuracy. But maybe I was too blinded by the beautiful d3.js.
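For anyone who has not worked with these two metrics, a minimal sketch of how they are computed for something like sentiment extraction follows. The document IDs and the `precision_recall` helper are made up for illustration; they are not Bottlenose's numbers or anyone's production code.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for a set of predicted items.

    precision = share of predicted items that are actually relevant
    recall    = share of relevant items that were actually predicted
    """
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: the system flags four documents as "negative sentiment",
# while human reviewers say three documents really are negative.
flagged = {"doc1", "doc2", "doc3", "doc4"}
truly_negative = {"doc1", "doc2", "doc5"}

precision, recall = precision_recall(flagged, truly_negative)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision is the one customers notice first: every wrongly flagged item is a visible mistake, whereas a missed item usually goes unseen.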

Speed is also an issue. Machine Learning takes setup time and good corpus sets, after which it's pretty fast. Traditional NLP is faster to start but slower and less accurate after. Neither is remotely close to real-time, which makes me wonder what they're really delivering.

Also, at least in my experience, I couldn't get product managers/marketers to give a hoot. But ad-agencies ate it up. And boutique survey shops. And sometimes CEOs. And sometimes customer-service organizations that had too much inbound hate mail and had to triage.

I think they may be smoking it to think they're going to get inbound traffic. I had to pound doors.


So, I actually somewhat agree with the idea that many of the search examples that ads and such use are ridiculously misguided.

Consider, when I'm looking at a blog or an article, it is easy to see other people's reactions if there is a comment box. What is sometimes harder to find is the general context of why a blog/article exists. Did its existence prompt the creation of other blogs and articles? More, is the article still relevant? I think some pushed for this concept with "trackback" and such. But I don't think that really took off. (Maybe I just need to learn to use some tools better.)

However, I think I get lost around the notion that things should be pushed to you. I mean, unless you are referring to twitter style "you probably ignore 90% of what is pushed at you."


Sort of a tragedy of the commons. The expenses are borne individually while the rewards are reaped by the society/community. The only incarnation of the semantic web discovered so far that is profitable individually is blossoming SEO.


I have to admit, I haven't read TFA, and I'm not sure I want to. The Semantic Web has hardly failed - tons of people use the Semantic Web every day and just don't know it. The thing is, the SW isn't necessarily meant to be something that the average end user knows about and uses explicitly. It's just about making it easier for machines to understand semantics around data on the web, so those machines can do a better job of helping the humans do whatever it is they are trying to do. So Google could be using the Semantic Web behind the scenes all day long, and the end user would never know it.

And yes, Google do use the Semantic Web.[1][3] So does Yahoo.[2][3] Etc.[3]

It doesn't matter that some people use RDFa, others use microdata, others use microformats, others use RDF/XML, others use JSON-LD or whatever. Those are irrelevant syntactic details. The point is having explicitly defined semantics associated with things.

Anyway, the Semantic Web is becoming more and more important with every passing day. As tools[4] for automating the process of extracting rich semantics from unstructured data mature and become better and more widely available, the number of applications for explicit semantics is just going to mushroom.

Just to illustrate (and forgive me a bit of what might be seen as self-promotion here) - our Enterprise Social Network product, Quoddy, has Stanbol integration such that we can process all the various bits of "stuff" that flow through the system, do semantic concept extraction, and store those entities and relationships in a triplestore. Our Information Discovery Platform, Neddick, does the same thing as we consume RSS feed data, Tweets, Emails, etc. Now we can do things like show you, for, say, a given status update, the blog posts, emails, tweets, people, documents, etc, that are conceptually related. And while end-user use of "semantic queries" might not seem useful to some people, the bottom line is that this enables searches that you just can't do with "regular" (that is, non-semantic) tech.

An example... let's say you do something with musicians. Your ESN status update messages occasionally mention, say, Jon Bon Jovi, Bob Marley, Richard Marx, and Madonna. How would you do a search without SW tech that says "show me all posts that mention musicians"? Not gonna happen. But with the semantic extraction + triplestore, we can make that kind of query trivial.
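A rough sketch of why that query becomes easy once the entities are typed. This is plain Python over an in-memory set of triples, not Stanbol's API or Quoddy's actual code; the post IDs, the `mentions` and `type` predicates, and the `posts_mentioning` helper are all hypothetical.

```python
# Hypothetical triples, as a semantic extraction step might emit them:
# (subject, predicate, object)
TRIPLES = {
    ("post:1", "mentions", "Bob Marley"),
    ("post:2", "mentions", "Madonna"),
    ("post:3", "mentions", "Raleigh"),
    ("Bob Marley", "type", "Musician"),
    ("Madonna", "type", "Musician"),
    ("Raleigh", "type", "City"),
}

def posts_mentioning(entity_type, triples):
    """Return posts that mention any entity of the given type."""
    # Step 1: find every entity the knowledge base says has this type.
    entities = {s for (s, p, o) in triples if p == "type" and o == entity_type}
    # Step 2: find every post that mentions one of those entities.
    return sorted(
        {s for (s, p, o) in triples if p == "mentions" and o in entities}
    )

musician_posts = posts_mentioning("Musician", TRIPLES)
print(musician_posts)
```

A keyword search cannot do this, because none of the posts contains the word "musician"; the join between "mentions" and "type" is what carries the meaning. In a real triplestore the same two-step join would typically be written as a single SPARQL query.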

It gets better though... Stanbol comes "out of the box" with the ability to dereference entities that are in DBPedia and other knowledge bases, which is cool enough in its own right... but you can also easily add local knowledge and your own custom enhancement engines. So now entities that are meaningful only in your local domain (part numbers, SKUs, customer numbers, employee ID numbers, whatever) can be semantically interlinked and queried as part of the overall knowledge graph.

Hell, I'd go so far as to say that Apache Stanbol (along with OpenNLP and a few related projects... UIMA, Clerezza, etc.) may just be the most important open source project around right now. And nobody has heard of it. Again, the Semantic Web is largely not something that the average end user needs to know or think about. But they'll benefit from the capabilities that semantic tech brings to the table.

<rant-over />

[1]: https://support.google.com/webmasters/answer/99170?hl=en&ref...

[2]: http://developer.yahoo.com/blogs/ydn/searchmonkey-support-rd...

[3]: http://ebiquity.umbc.edu/blogger/2011/06/02/microdata-rdfa-g...

[4]: http://stanbol.apache.org/


- https://github.com/fogbeam/Quoddy - http://www.fogbeam.com/ - http://fogbeam.org/

Looks great!

Any chance any of these could be applied to legal corpora? (Cf. above: https://news.ycombinator.com/item?id=6731714 )


> Any chance any of these could be applied to legal corpora?

That's a pretty broad question, but generally speaking, I'd say the answer is "yes". It depends on exactly what you want to do.

Feel free to email me if you'd like to talk about that in more detail. I will issue this caveat though: We haven't - to date - focused on the legal world, and it's not something I have a lot of specific knowledge of, vis-a-vis the domain specific parts.


It failed because we chose a 1 to 1 relationship between the window object and the document object. There should instead be a 1 to many relationship.



