
Why is "metacrap" a problem for the semantic web, but not for data-Wikipedia?

The Shirky article is a well-known strawman.

Thanks for the pointer to the last one, I'll read it when I get a chance.




I really shouldn't be doing this...

> Why is "metacrap" a problem for the semantic web, but not for data-Wikipedia?

Because Wikipedia is centralized, and the SemWeb isn't.

> The Shirky article is a well-known strawman.

DH3. Contradiction

-----


I don't want to turn this into a huge debate either, but those articles (and uncritical readings of them) have set the web back years.

> Because Wikipedia is centralized, and the SemWeb isn't.

If data-Wikipedia and a television station are both publishing data about when your favorite show is on that station, who are you more likely to believe?

Obviously you need to be careful about where your data comes from, but a single centralized source is not necessarily more trustworthy than many carefully selected sources.

Blind crawling isn't (and probably won't be) the norm for data collection on the semantic web.

> DH3. Contradiction

Heh, got me there.

Shirky's thesis is based on the idea that making inferences from data is the ultimate purpose of the semantic web.

But linked, machine-readable data--that is, the semantic web--is useful even if inferencing is useless. I don't think this is a claim that needs evidence; it should be fairly obvious.

Shirky's portrayal of the semantic web has little to do with the real thing. Here's a much broader debunking of his article: http://www.poorbuthappy.com/ease/semantic/

-----


I'm happy some Semantic Web proponents understand that blind crawling won't work. But TimBL disagrees:

> I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers... The ‘intelligent agents’ people have touted for ages will finally materialize.

The debunking explicitly agrees with Shirky's conclusion, and should have given more serious scrutiny to his premise. The RDF format deals with "triples" precisely to enable inferences ("syllogisms"). Syllogisms are the only thing the SemWeb brings to the table that wasn't there before. If we have to pick sources and massage data by hand as you say, then I'll go with CSV files.

-----


Where does TimBL say that "intelligent agents" will be blindly crawling? Certainly agents have to follow links they haven't seen before (there wouldn't be much point if they didn't), but following links provided by trusted sources is vastly different from what Google does.

> The RDF format deals with "triples" precisely to enable inferences ("syllogisms").

As far as I know, this is not and has never been true.

RDF deals with triples because they're a small unit of data, which makes it easy to take the chunks you want from one dataset and graft them onto another set.

I suppose you can call matching URIs to graft one triple onto another a syllogism, but it would be a stretch; if that's a syllogism then so is joining two tables in a relational database. It has nothing in common with the ridiculous examples Shirky uses.

> If we have to pick sources and massage data by hand as you say, then I'll go with CSV files.

Have fun merging data from multiple sources. RDF can't make this completely painless, but it can make it easier than CSV files.
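To make that concrete, here's a rough sketch of why merging triple data is basically a set union: every subject and predicate carries its own identifier, so there's no column alignment or schema reconciliation to do. (Plain Python tuples stand in for RDF triples here; the "ex:" names are made up for illustration.)

```python
# Two independently published triple sets. Because subjects/predicates
# are globally identified, merging is just a union of sets.
source_a = {
    ("ex:brian", "ex:parentOf", "ex:bct"),
    ("ex:brian", "ex:name", "Brian"),
}
source_b = {
    ("ex:bct", "ex:name", "Brendan"),
}

merged = source_a | source_b
print(len(merged))  # 3
```

With two CSV files you'd first have to agree on which column in file A corresponds to which column in file B; here the triples line up by construction.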

Your third article doesn't make much sense to me. How is RDF "semantically committed"? An individual RDF vocabulary is "semantically committed", but so is an individual XML schema or a documented use of JSON. RDF (like XML and JSON, and the generic tools for all three) doesn't care what you put in it.

-----


Me:

>> The RDF format deals with "triples" precisely to enable inferences ("syllogisms").

You:

> As far as I know, this is not and has never been true.

TimBL, http://www.w3.org/DesignIssues/Semantic.html :

> sometimes it is less than evident why one should bother to map an application in RDF. The answer is that we expect this data, while limited and simple within an application, to be combined, later, with data from other applications into a Web. Applications which run over the whole web must be able to use a common framework for combining information from all these applications. For example, access control logic may use a combination of privacy and group membership and data type information to actually allow or deny access. Queries may later allow powerful logical expressions referring to data from domains in which, individually, the data representation language is not very expressive.

I'm not sure if this quote supports my point of view or yours, or even if there's any factual difference between our views.

-----


This has gotten kind of confused.

When I talk about merging data, I'm talking about taking two independent documents:

    <brian> parentOf <bct>
    <brian> name 'Brian'
and

    <bct> name 'Brendan'
and being able to join those graphs on the <bct> node, to say that a person named Brendan has a parent named Brian. This is what TimBL means by combining data from multiple applications (IMO).

This is trivial for software to do and takes a lot of the effort out of merging datasets. It's what makes the semantic web a web; you're linking different datasets together. I don't see how Shirky's arguments apply here.
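Here's a rough sketch of that join in Python, with plain tuples standing in for triples (toy "ex:" names, not a real RDF library). It mimics what a SPARQL join does: start from the node whose name is 'Brendan' and follow parentOf backwards to the parent's name.

```python
# The merged graph from both documents.
graph = {
    ("ex:brian", "ex:parentOf", "ex:bct"),
    ("ex:brian", "ex:name", "Brian"),
    ("ex:bct", "ex:name", "Brendan"),
}

# Join on the shared <bct> node.
child = next(s for s, p, o in graph if p == "ex:name" and o == "Brendan")
parent = next(s for s, p, o in graph if p == "ex:parentOf" and o == child)
parent_name = next(o for s, p, o in graph if s == parent and p == "ex:name")
print(parent_name)  # Brian
```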

--

When I say "inferencing", I mean something like Swish http://www.ninebynine.org/RDFNotes/Swish/Intro.html#ScriptEx... does.

Given two statements:

    <brian> parentOf <bct>
    <bct> gender <male>
and an appropriate set of rules, an inference engine can create a third statement:

    <bct> sonOf <brian>
This is what I understand Shirky's article to be about. IMO the applications of it are limited. It can also lead to the ridiculous results Shirky suggests.

Enabling inferences of this kind is neat, and it may be useful in the future, but it's not what the semantic web is About.
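For illustration, that kind of rule is easy to sketch in Python over plain tuples (this is a toy, not a real inference engine like Swish): if ?x parentOf ?y and ?y gender male, assert ?y sonOf ?x.

```python
# Starting facts.
graph = {
    ("ex:brian", "ex:parentOf", "ex:bct"),
    ("ex:bct", "ex:gender", "ex:male"),
}

# One forward-chaining rule producing a new triple.
inferred = {
    (child, "ex:sonOf", parent)
    for parent, p1, child in graph if p1 == "ex:parentOf"
    for subj, p2, obj in graph
    if p2 == "ex:gender" and subj == child and obj == "ex:male"
}
print(inferred)  # {('ex:bct', 'ex:sonOf', 'ex:brian')}
```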

-----


Your first example takes

    <brian> parentOf <bct>
    <brian> name 'Brian'
    <bct> name 'Brendan'
and deduces

    'Brendan' hasParentNamed 'Brian'
How is this substantially different from the second example? Forgive me if I'm thick; I'm honestly trying to understand.

-----


It's not deducing a third property "hasParentNamed".

It's joining the two graphs so that you can do a query like this:

    SELECT ?parentName WHERE
    {
      ?child name 'Brendan'
      ?parent parentOf ?child
      ?parent name ?parentName
    }
to find the name of Brendan's parent.

You're being quite patient with me, thanks. :)

-----


Still not getting it; here's your second example in that syntax:

    SELECT ?son WHERE
    {
      <brian> parentOf ?son
      ?son gender <male>
    }
What's the fundamental difference? That one example yields a new RDF triple, and the other yields a query result? Surely this is just a matter of representation.
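To put the same point in code: one graph-pattern match can back either view. A sketch with plain tuples and made-up "ex:" names, where the identical match list is read once as query bindings (like SELECT) and once as new triples (like an inference rule):

```python
graph = {
    ("ex:brian", "ex:parentOf", "ex:bct"),
    ("ex:bct", "ex:gender", "ex:male"),
}

# Match the pattern: ?parent parentOf ?child . ?child gender male
matches = [
    (parent, child)
    for parent, p1, child in graph if p1 == "ex:parentOf"
    for subj, p2, obj in graph
    if p2 == "ex:gender" and subj == child and obj == "ex:male"
]

bindings = matches                                      # "query result" view
new_triples = {(c, "ex:sonOf", p) for p, c in matches}  # "inference" view
print(bindings, new_triples)
```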

-----


Good point. I think you've changed my mind about the utility of inferencing :).

The difference between querying and inferencing isn't what I was trying to emphasise, though. My point was the difference between being designed for making queries/inferences within a dataset, and being designed for joining distinct datasets.

Querying within a dataset is easy: SQL, XPath, XQuery, LINQ, etc. You can write rules for transforming any data model that you can query.

RDF isn't anything special in these areas (though I do think that SPARQL is an awfully nice query language). What it gives you is a way to link and merge datasets.

-----



