I've seen enormous chunks of money put into KM projects that go absolutely nowhere, and I've come to understand and appreciate many of the foundational problems the field continues to suffer from. Despite a long period of time, progress on these fundamental problems seems hopelessly stalled.
The semantic web as originally proposed (Berners-Lee, Hendler, Lassila) is as dead as last year's roadkill, though there are plenty of people out there who pretend that's not the case. There are still plenty of groups trying to revive the original idea, or, like most things in the KM field, they've simply changed the definition to encompass something else that looks like it might work instead.
The reasons are complex, but it basically boils down to this: going through all the effort of adding semantic markup to your content with no guarantee of a payoff for yourself was a stupid idea.
You can find all kinds of sub-reasons why this was stupid: monetization, most people are poor semantic modelers, technologies built for semantic systems are generally horrible (there are pitifully few reasoners built on any kind of semantic data; it turns out that's hard), etc.
For years the Semantic Web was like nuclear fusion, always just a few years away. The promise was always "it will change everything", yet no concrete progress was being made, and the vagueness of "everything" turned out not to be a very compelling motivator for people to start adding semantic information to their web projects.
What's actually ended up happening instead has been the rebirth of AI. It's being called different things these days: machine learning, heuristic algorithms, whatever. But the point is, there's lots of amazing work going into things like image recognition, context-sensitive tagging, text parsing, etc. that finds the semantic content within the human-readable parts of the web instead. It's why you can go to Google Images, look for "cats", and get pictures of cats.
Wikipedia and other sources have also started to look more structured than they previously were, with nice tables full of data. These tables have the side benefit of being both machine- and human-readable, so when you look for "cats" in Google's search you get a sidebar full of semantic information on the entity "cats": scientific name, gestation period, daily sleep, lifespan, etc.
Like most things in the fad driven KM world, Semantic Web advocates are now simply calling this new stuff "The Semantic Web" because it's the only area that kind of smells like what they want and is showing progress, but it really has nothing to do with the original proposal and is simply a side-benefit of work done in completely different areas.
You might notice this died about the same time "Mashups" died. Mashups were kind of an outgrowth of the Semantic Web as well. One of the reasons that whole thing died was that existing business models simply couldn't be reworked to make it make sense. If I'm running an ad driven site about Cat Breeds, simply giving you all my information in an easy to parse machine readable form so your site on General Pet Breeds can exist and make money is not something I'm particularly inclined to do. You'll notice now that even some of the most permissive sites are rate limited through their API and almost all require some kind of API key authentication scheme to even get access to the data.
Building a semantic web where huge chunks require payment, and dealing with rate limits (which look like faults from inside a large semantic network), is a plan that will go nowhere. It's like having pieces of your memory sectioned off behind tolls.
Here's TBL on this in 2006 - http://eprints.soton.ac.uk/262614/1/Semantic_Web_Revisted.pd...
"This simple idea, however, remains largely unrealized."
There's a group of people I like to call "Semanticists" who've come to latch onto semantic graph projects not as a technology, but as a religion. They're kind of like the "6 minute ab" guy in "There's Something About Mary". They don't have much in the way of technical ideas, but they understand the intuitive value of semantic modeling, have probably latched onto a specification of some sort, and then belief carries them the rest of the way: "it'll change everything".
But they usually have little experience taking semantic technologies to successful projects (success being defined not as booting up the machine and loading the graph into memory, but as actually producing something more useful than some other approach).
Then there's another group of Semanticists: they recognize that the approaches that have been proposed have kind of dead-ended, but they won't publicly announce that. Then, when some other approach not affiliated with the SW makes progress (language-understanding AI, for example), they simply declare this new approach part of the SW and claim the SW is making progress.
The truth is that Doctorow absolutely nails the problems in his essay "Metacrap" http://www.well.com/~doctorow/metacrap.htm
He wrote this in 2001, and the issues he talks about still haven't been addressed in any meaningful way by professionals working in the field; even new projects routinely fall for most or all of these problems. I've seen dozens of entire companies get formed, funded, and die without addressing even a single one of these issues. This essay is a sobering measuring stick you can use to gauge progress in the field, and I've seen very few projects measure well against any of these issues.
Semanticists of both types are holding the entire field back. If you are working on a semantic graph project of any kind and your project doesn't even attempt to address any of these things through the design of the program (and not through some policy directive or modeling process), you've failed. It's really hard for me to believe that we're decades into semantic graph technologies and nobody's bothered to even understand 2.5 and 2.7.
If your plan to fix the problems you're experiencing with your project, the reason it isn't producing useful results, is to "continue adding data to it" or "keep tweaking the semantic models", you've failed.
"The Semantic Web is not here yet."
No, I've rethought this, the SW is not like Fusion, it's more like Communism.
The key idea is to make it easy for another party to add the semantics on top of your data. This solves some fundamental issues that you and Cory Doctorow mentioned:
1) The economics equation for tagging now works out. The user that's doing the tagging has an immediate need (and payoff) for doing that tagging.
A corollary of this is that the parts of the web that are most valuable (in the sense that users need them the most) tend to get tagged first.
The following are responses to Cory's essay:
2.1) The person that's doing the tagging is also an end user, so there's an incentive to do the tagging honestly. That doesn't stop the underlying website from lying. But that's an issue with the web in general, and is mitigated by things like SEO penalties, reviews, etc.
2.2) Again, the tagger is the person who benefits from the tagging, so as long as the data is valuable enough, it will be tagged despite laziness.
2.3) We haven't overcome human stupidity. Presumably since the person tagging the data needs it, it will be at a "good enough" level to be usable.
2.4) This one doesn't apply; the tagger is a different person.
2.5) 2.6) and 2.7) These are tougher, and we haven't started working on them yet. You have the same problems when trying to consolidate data from multiple sources. One possibility is to have several alternatives and allow searching to choose between them. That's how Bloomberg solves some of these problems, though it does result in fragmentation.
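To make the "several alternatives" idea concrete, here's a minimal sketch of keeping every source's value for a field with provenance attached, and letting the consumer choose by a source-preference ranking. All entity names, field names, and source names are invented for illustration; this is not Bloomberg's actual mechanism.

```python
# Competing values for the same attribute, kept side by side with
# provenance instead of being forced into one "true" value.
facts = {
    ("cat", "lifespan_years"): [
        {"value": 15, "source": "encyclopedia"},
        {"value": 13, "source": "pet_site"},
    ],
}

def resolve(entity, attribute, preference):
    """Pick among competing values using a consumer-chosen source ranking."""
    candidates = facts[(entity, attribute)]
    return min(candidates, key=lambda c: preference.index(c["source"]))["value"]

# Two consumers, two rankings, two answers -- which is exactly the
# fragmentation trade-off mentioned above.
a = resolve("cat", "lifespan_years", ["encyclopedia", "pet_site"])
b = resolve("cat", "lifespan_years", ["pet_site", "encyclopedia"])
```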
I'd love to talk to you about this some more. You can email me at firstname.lastname@example.org
Full disclosure: I'm one of the founders of http://www.parsehub.com
2.5-2.7 are really hard problems. I think that lots of people working in the field get lost on these by trying to achieve some sort of perfect model, or by trying to aggregate every possible option into their model, but neither approach has really been terribly satisfactory or provided the kind of subtle decision framework that humans feel comfortable with.
Watching the explanation of a differential gear https://news.ycombinator.com/item?id=8513209, I thought: why not make Wikipedia the central axis around which you let the diversity of the semantic web spin at its own pace? If most people agree on this authority (or, if you wish, convention over the configuration mess), things become easily connectable.
In other words, instead of relying on sloppy ontologies, rely on the wikidata_id as a sort of referential association table.
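A rough sketch of what that looks like: two sources with incompatible schemas, joined purely on a shared Wikidata ID. The records and field names here are invented (Q146 is Wikidata's ID for the house cat, used only as an illustration).

```python
# Two sources that know nothing about each other's schemas,
# but both carry a Wikidata ID for each entity.
source_a = [  # e.g. a biology site
    {"wikidata_id": "Q146", "scientific_name": "Felis catus"},
]
source_b = [  # e.g. a pet-care site
    {"wikidata_id": "Q146", "daily_sleep_hours": 13},
]

def merge_on_wikidata_id(*sources):
    """Merge records from heterogeneous sources keyed on wikidata_id."""
    merged = {}
    for source in sources:
        for record in source:
            entity = merged.setdefault(record["wikidata_id"], {})
            entity.update(record)
    return merged

cats = merge_on_wikidata_id(source_a, source_b)["Q146"]
```

No shared ontology is needed for the join itself; the agreement is only on the identifier, which is the "convention over configuration" part.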
"Take eBay: every seller there has a damned good reason for double-checking their listings for typos and misspellings. Try searching for "plam" on eBay. Right now, that turns up nine typoed listings for "Plam Pilots."
I wonder, are there search tools, anywhere from functions to libraries to engines, that will search for misspellings? Google, DDG, and probably everyone else will correct your misspelled query, but will anything, large or small, go the extra mile and search for misspelled hits?
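The simplest version of "search for misspellings" is just query expansion: generate likely typos of the query and search for those too. A tiny sketch covering the eBay "plam" case (adjacent-letter transpositions, one of several common typo classes):

```python
def transposition_typos(word):
    """All variants of `word` with two adjacent letters swapped."""
    return {
        word[:i] + word[i + 1] + word[i] + word[i + 2:]
        for i in range(len(word) - 1)
        if word[i] != word[i + 1]  # swapping identical letters changes nothing
    }

variants = transposition_typos("palm")
# A misspelling-aware search would OR these variants into the query.
```

Real engines that do this would also generate deletions, insertions, and substitutions (the full edit-distance-1 neighborhood) and weight variants by keyboard adjacency, but the principle is the same.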
It does actually work for popular things: http://i.imgur.com/emjLPad.png
For example, if you search google right now for "plam pilot" you'll get results for "palm pilot".
In context: "... the knowledge management community ..."
(I guess mainly academic)
I think it's equally disingenuous to suggest that a vision, and associated definitions, aren't allowed to evolve - and to suggest that "X failed" because "X" isn't exactly the same today as it was in 1999.
One of the reasons that whole thing died was that existing business models simply couldn't be reworked to make it make sense. If I'm running an ad driven site about Cat Breeds, simply giving you all my information in an easy to parse machine readable form so your site on General Pet Breeds can exist and make money is not something I'm particularly inclined to do
It may not make sense for every use-case, but plenty of companies have found value in using SemWeb technologies.
I think that's absolutely a fair criticism of my critique. However, I stand by my critique. Nobody is still calling cars "horseless carriages". At some point, things stop evolving and become something else. The SW has had 13 years to demonstrate value and really hasn't been able to do it in any broad sense while everything else the SW was promising has been met and exceeded by non-SW approaches (which are now being co-opted and called the "Semantic Web" simply because they work).
I think my main thrust is that the Semantic Web failed. That's okay. We learned a lot. Now it's time for something else to take over, call it "Cross Domain Reasoning Systems" or "Global Presence" or some other Gartner Conference ready term.
The Dinosaurs died, and that's too bad, but we learned a lot, and out of that we ended up with birds, of which Chickens and Turkeys are delicious.
But it's time to put it to bed and move on. I've always found it curious that people who work with semantics have such a bad understanding of what things actually mean.
Programmers do have to agree on ways to represent information when interfacing different systems. Right now for the web, that is HTTP APIs.
There is a pretty powerful religion pushing "true REST", i.e. emphasis on using the correct HTTP verb in the proper way in relation to the type of entity being updated.
To me, that is enforcing a very basic general (CRUD) type of knowledge representation for the programmed web. I think it demonstrates that the instinct for a common language is there.
You can also see serious KR use in certain domains like biomedicine.
One area I have been thinking about applying KR to is defining information systems and programming languages. I want to do that because there seems to be quite a lot of overlap between a very diverse set of programming languages, and also because I want a format for representing algorithms that can serve as a basis for a type of open source operating system. This operating system either needs to force everyone to produce code for a certain programming language or a common lower-level virtual machine, OR it can use a higher-level semantic metalanguage (maybe based on description logics), and then people can program that in the language they choose (perhaps something most suited to a particular system or domain).
All of these are programmer use cases. I think more and more of this general purpose type of knowledge representation will inevitably start being applied by programmers.
One key reason we don't see it applied more often I believe is convenience. You need a convenient representation and convenient tooling. Most of the semantic web or more generally KR tools haven't focused on that enough, which is understandable because the core requirement is machine-processable exchange. I think the trick might be coming up with a way to embed a compact KR more directly in general purpose programming languages or data formats, or translate automatically to and from domain representations to a KR format.
Another idea I had besides building an operating system on top of a DL (translated to and from different programming languages or representations, serving as a common metalanguage) was to try to popularize semantic computing at the same time as you popularize a standard for the metaverse. https://github.com/runvnc/vr
I recently wrote an article touching on this subject and on how standardized APIs will be a boon to integrating web content in non-standard interfaces, but truth be told I'm secretly a 'semanticist' and have always been excited about the possibilities the SW can produce: https://medium.com/@mcriddy/semantic-web-design-92ef35f66c9f
I think this is a very small use case and there's not enough critical mass behind it. The main reason being that even if you were able to aggregate all the different APIs together to build something new and bold, it's such a one-off idea, specific to one developer, that it really wouldn't be of much use investing huge amounts of time and money towards building a consolidated API that would serve multiple masters.
The rise of AI and machine learning really puts into doubt the economic feasibility of spending money and time on a service that focuses purely on aggregating disparate islands of data unless the payoff existed in that particular market.
You don't see a major API provider that provides data to everything, instead you see small enclaves of niche businesses that focus and specialize in their specific consolidation efforts and get paid for it.
All in all, I think the semantic web will become more of a private endeavor, at least for proprietary commercial goals. As AI gains more ground, the need for such a central API or the need to unify all the different data together will slowly die off.
- I'm not sure the lack of motivation for implementing the SW is economic, but rather: what's the point if no one is using it?
- I don't think the shortcomings described by Metacrap are foundational: it's like saying a word doesn't describe the reality of the thing it 'signifies', therefore language is useless and will not be adopted. There might be competing standards on ontology, but I'm sure winners will emerge.
- AI might be (one of) the tools that make the semantic web actually relevant, which is why I believe Google has an important part to play in it.
- and finally: developers, developers, developers!!! Bring an elegant tool into the ecosystem (JSON-LD maybe?): devs won't shy away like they do in front of horrendous XML, and all of a sudden the SW might get caught in a virtuous circle and become the next big thing.
You might be right, I actually would love to be wrong. But even as a thought experiment (let's suppose the SW was up and working and fully realized today) what are the use cases other than "changing everything"?
Are these use case the same or similar to just doing it through API access to some set of underlying relational databases? Would the API access give us better performance, even under a federated scenario? If not, what's the advantage?
> Bring an elegant tool into the ecosystem (JSON-LD maybe?): devs won't shy away like they do in front of horrendous XML, and all of a sudden the SW might get caught in a virtuous circle and become the next big thing.
I'm conflicted on this. A big part of me says that if something ends up taking off, it won't be the ideas posited by the W3C; it'll just be something else that happened to work, which Semanticists will then co-opt and re-label as "Semantic Web".
I'm deciding not to fall for it. The SW failed. If something else takes off that achieves similar goals but through entirely different means, it's not the SW, it's...whatever it is.
I think it's okay the SW failed. If I sound critical it's not because it was a failure, it's because of the people that keep insisting it wasn't in the face of overwhelming evidence.
The failure of the SW has brought us lots of important information about large scale distributed information systems from a technical and sociological standpoint. It's time to study those issues and outcomes and try something else.
Instead, the field is like the search for the Higgs Boson, except that when the last particle accelerator the community tried failed to produce the Higgs Boson and instead found some other particle, the community simply decided to call the other particle the Higgs Boson. It's kind of mind-bending to work anywhere near the field.
A trivial example I gave on another comment :
> The way Google exposes the web today is unidimensional: keywords => related website list. It's great for humans to parse, but it's extremely limited for machines. Why isn't there yet an API or a UI to ask for "all books by Japanese novelists of the last 2 centuries" (website links included per book)?
Another example that comes to mind: you query a person and, instead of a list of related websites, you get a set of tabs with bio from different sources, work, news, images & videos... Bonus: you don't need to be Google anymore to do that! OK, that's far-fetched, but imagine the potential for webapps.
So that's kind of one of the canonical examples for the SW that's always given. Query for an object and get an entire dossier back on that object, assembled, federation fashion, from tons of disparate sources all over the web.
It turns out that this example doesn't hold up against:
a) API access to a bunch of relational stores all over the place, and the better performance you're likely to get from those stores;
b) the notion that just compiling all that stuff in one place works better;
c) really any of the issues presented by Doctorow. Read his essay again with a critical eye and think about how each of his criticisms would apply to something like this.
Modern search engines have largely figured out how to provide a fairly high-level ontology equivalent to this use-case by simply parsing out the content on the pages and centrally storing it. This is not the Semantic Web.
If Google opened up its index to the world so you could use it as an API to query the web with the power of a relational DB, the case for the semantic web would probably be pointless. But it's locked up and only serves their ad system.
I'm not defending the Semantic Web per se, but the lack of structure in the web makes it only 'parsable' by gigantic entities like Google. Their work is remarkable and they offer a nice service I can consume, but I can't really build on it.
I think the debate should shift from the means to the goals... The semantic web might not be the right way to do it, but I strongly believe in the necessity of a way to (openly) connect the dots of all that data, or it's a giant waste.
What are the relational stores you mention in a)?
Yes yes yes. The conversation needs to change from "if we build the Semantic Web it will change everything" (where everything is unspecified and not discussed) to "we need to do this; here's what we need to do it".
> What are the relational stores you mention in a) ?
Behind most modern online connected websites there's a big Postgres/Oracle/SQL Server database somewhere (and, increasingly, non-relational NoSQL types of stores). The SW basically chose to turn the web into a giant federated RDF triple-store, which, if you think about it, is kind of ridiculous from any sensible performance POV; just accessing the information from the source, the databases that are used to generate the pages, seems to make more sense.
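To see the shape difference being argued about, here's a toy sketch storing the same invented facts twice: once as a normal relational row, once as RDF-style (subject, predicate, object) triples. It's deliberately simplified; real triplestores have indexes tuned for this, but the query shapes still differ this way.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Relational shape: one row per entity, one column per attribute.
db.execute("CREATE TABLE breeds (name TEXT, origin TEXT, lifespan INTEGER)")
db.execute("INSERT INTO breeds VALUES ('Siamese', 'Thailand', 15)")

# Triple shape: one row per (subject, predicate, object) fact.
db.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
db.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("Siamese", "origin", "Thailand"),
    ("Siamese", "lifespan", "15"),
])

# Relational: a single row lookup gets every attribute at once.
row = db.execute(
    "SELECT origin, lifespan FROM breeds WHERE name = 'Siamese'").fetchone()

# Triples: each additional attribute requested costs another self-join,
# and everything is stringly typed.
triple_row = db.execute("""
    SELECT t1.o, t2.o FROM triples t1
    JOIN triples t2 ON t1.s = t2.s
    WHERE t1.s = 'Siamese' AND t1.p = 'origin' AND t2.p = 'lifespan'
""").fetchone()
```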
That question suggests that you see "The Semantic Web" as being exactly equivalent to "HTML with semantic metadata added". But that most definitely is not the case, and "API access" is part of the Semantic Web. An HTTP based protocol for SPARQL has been around for forever, and continues to evolve, as do the standards around semantic discovery of API services.
The whole RDFa, microformats, microdata, GRDDL thing is just a small part of the overall picture.
Some of these things in your list were proposed specifically because the Semantic Web technology inventory had failed to produce anything useful.
Well, that's one possible narrative one could believe in. Another would be that a lot of people engaging their NIH tendencies invented a parallel technology track, covering a lot of the same ground, and offering no real advantages, just because they could. shrug
I know we disagree on this topic, but thanks for having a sensible debate on it!
So you're arguing that JSON-LD isn't a Semantic Web technology?
It doesn't solve any of the fundamental problems that the Semantic Web has. Just taking JSON (not a SW tech) and specifying a serialization method on it doesn't suddenly make it SW.
In fact, because JSON is easier to handle than XML, it should make the failure of the idea even more apparent.
Instead it's acted as a distraction, simulating "progress" while everybody moves their semantic graph engines to support it without actually kicking the ball forward on any specific front. There's really nothing new that JSON-LD introduces other than being easier to parse, except that it somehow specifies a syntax that turns JSON into something about as ugly and verbose as XML.
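For readers who haven't looked at it: a JSON-LD document is ordinary JSON plus conventions like "@context", "@id", and "@type", so any JSON parser reads it unchanged. The document below is an invented example (the schema.org IRIs are real vocabulary terms, the entity is made up):

```python
import json

doc = json.loads("""
{
  "@context": {"name": "http://schema.org/name"},
  "@id": "http://example.org/cats/felix",
  "@type": "http://schema.org/Thing",
  "name": "Felix"
}
""")

plain_name = doc["name"]        # ordinary JSON access works as-is
entity_iri = doc["@id"]         # the "linked data" part is pure convention
```

Which is the point being made above: the semantics live entirely in the conventions and the vocabularies, not in the serialization, so swapping XML for JSON changes the ergonomics but none of the hard problems.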
Honestly, your argument, to me, sounds like saying "cars have failed since Henry Ford didn't mention fuel injection, overhead cams, or turbochargers".
The short thought experiment that would reveal that the cars won't work simply hasn't been done as a field with the SW, and it's usually not done in any kind of semantic technology circles. "Just tweak the model!", "it'll start to work when there's sufficient data", and "we just need to build the reasoning engines" are what the field has been spinning on for more than a decade.
Even in SW circles, there's a general consensus that the SW has not arrived. In some more honest pieces it's recognized that it was a failure. But there's tremendous momentum behind the idea because of TBL and people aren't willing to give it up and jump ship onto what's actually working until it gets a big name and W3C (or some other notable committee) to back it.
AI is definitely big, on multiple levels. Even without AGI, the ability of AI/ML techniques to serve as a bridge to the SemWeb world by, for example, extracting semantic data from unstructured text, is huge. This is why Apache Stanbol excites me so much. I foresee the addition of progressively better and better enhancement engines for Stanbol, constantly improving the ability to do that structured extraction. This will make the overall Semantic Web vision that much more practical.
"what's the point if no one is using it" IS an an economic problem.
It's still very much the reason people are more inclined to scrape a given site: they want to simply piggyback off an existing source and charge people money for it. These types also have a tendency to 'pay as little as possible' when it comes to acquiring the data.
In my experience, semantic technologies have continuously hit the same hard edge cases over and over again, and most of those are human factors and social issues. Moreover, there are a number of technologies that provide near-enough functionality without the huge, painful social overhead associated with STs; basically, worse is better.
What really needs to happen is for somebody to take a few of the huge publicly available triplestores, write some really compelling reasoning systems on them, and demonstrate almost effortless merging across the datasets. Outside of fairly trivial examples, almost all of which are matched or outperformed by near-enough technologies, really complex reasoners haven't happened.
I think the other problem is that most web pages these days are generated from some datastore somewhere. It's a fool's errand to go from this nicely related data to generating human-readable webpages with embedded semantic markup and trying to SPARQL your way across a hugely federated search domain when you can just get API access to the underlying data in the first place. Semanticists will probably call this "the Semantic Web", but querying several databases and merging the results programmatically is something that came along decades before the Semantic Web.
Also, DB people are used to having some lower-level things in their toolbox, while with these you really don't have an idea what's happening with the data structures.
A fair portion of these databases isn't even maintained today.
Just to conclude, it's very hard to find an alternative in the triplestore world to Neo4j, Cassandra, MongoDB, Couch, <put_your_db_here> (especially free or cheap) that developers can just get up and running easily to experiment with, learn, and scale later.
Documentation and community support is another topic altogether, don't make me start on that one...
Right, you mostly wouldn't do that. If you want access to data in a relational store, from a semweb perspective, you use R2RML with something like D2RQ.
There's also a lot more to the Semantic Web than just extracting structured data from otherwise unstructured content like HTML. If you have semantic data you want to expose, depending on the use-case, you just expose it using the remote SPARQL protocol, for direct M2M use.