Check it out if you haven't seen this before; these guys are rockstars!
I prefer XML and haven't used JSON enough to be proficient with it. If there are pluses to using JSON over XML, I'm ignorant of them.
I use JSON when I'm doing Ajax things. jQuery and Django both have methods to encode/decode JSON.
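For what it's worth, the round trip itself is one-liner territory in Python's standard library. This is a sketch of what a Django view would be doing under the hood, not the actual jQuery/Django helper calls:

```python
import json

# Encode a Python dict to a JSON string (what an Ajax endpoint would return).
payload = {"id": 42, "tags": ["ajax", "django"]}
encoded = json.dumps(payload)

# Decode it back on the receiving side.
decoded = json.loads(encoded)
```

The symmetry is the selling point: the same structure comes back out that went in.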
I think this is one problem with Freebase. Another problem I see is the structuring of the data; it's a hurdle to sharing.
The last problem I see is that data is currently very much seen as a competitive advantage. When Google introduces functionality to let users make corrections on Google Maps, I don't think they do it to share with everyone else. They do it because they want Google Maps to be the destination with the best possible data.
A wiki of data is a great idea, but I’m not sure we have the solution yet.
One of the biggest competitive advantages companies have is data. Like the article and previous comment already said, adoption is the hardest barrier. Unless someone can provide a compelling reason for companies/scientists/etc to give data or data access, there really isn't much else to discuss.
I think sites like mashery and dapper.net are going in the right direction, by providing good licensing rights and monetization controls that can incent large companies with reliable datasets to participate.
Baseball is the leading example of why giving away most of your data is the best use of it. The sport of baseball -- the way it's played on the field, the way players are scouted and trained, and the way it's enjoyed as a fan (Moneyball? Fantasy sports?) -- has been revolutionized by amateurs making use of free open data.
If you give out the great bulk of your data, people will enhance it with metadata, build tools on top of it, and most importantly connect it to the rest of humanity's knowledge store and mine it for connections you'd never have conceived. Giving out "up to last month" or "daily intervals" data will sharply grow the market for "real time" or "second-by-second" data. Baseball's mission statement concerns bats, bases, butts, and seats -- not visualizing correlations among heterogeneous data stores. By releasing their data for free, they let the smartest people in the world perform that second task for free.
We're about to enter the age of ubiquitous information. Drawing these data stores into open formats, making them discoverable, and interconnecting them across knowledge domains presents explosive opportunities. But who will own this data and what access will they allow? If you want to help ensure that the answer is 'everyone' and 'all of it', come join the http://infochimps.org project, a free open community effort to build an Allmanac of everything.
What I don't understand is why sports leagues don't open-source their data. What do they have to lose? It's a good thing that people are so excited about your sport that they build custom apps based on it. Sadly, sports leagues don't seem to get it; I remember MLB cracking down on a fan-generated database of baseball statistics a while ago.
Current major and minor league data is available as well, though MLB will crack down on anyone who is trying to make money off of derivative products. Here's where you'll find it, as XML:
One can do pretty cool stuff with all of it, and many people have, despite the fact that we can't make money off of it:
http://minorleaguesplits.com/ (my site)
http://baseball.bornbybits.com/2008/pitchers.html (analysis based on detailed pitch speed / break information that MLB started collecting last year.)
They think that a popular application using "their data" is necessarily lucrative and they think that they should get a huge hunk of that money.
The Shirky article is a well-known strawman.
Thanks for the pointer to the last one, I'll read it when I get a chance.
> Why is "metacrap" a problem for the semantic web, but not for data-Wikipedia?
Because Wikipedia is centralized, and the SemWeb isn't.
> The Shirky article is a well-known strawman.
> Because Wikipedia is centralized, and the SemWeb isn't.
If data-Wikipedia and a television station are both publishing data about when your favorite show is on that station, who are you more likely to believe?
Obviously you need to be careful about where your data comes from, but a single centralized source is not necessarily more trustworthy than many carefully selected sources.
Blind crawling isn't (and will probably not be) the norm for data collection on the semantic web.
> DH3. Contradiction
Heh, got me there.
Shirky's thesis is based on the idea that making inferences from data is the ultimate purpose of the semantic web.
But linked, machine-readable data--that is, the semantic web--is useful even if inferencing is useless. I don't think this is a claim that needs evidence, it should be fairly obvious.
Shirky's article's portrayal of the semantic web has little to do with the real thing. Here's a much broader debunking of it: http://www.poorbuthappy.com/ease/semantic/
> I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers... The ‘intelligent agents’ people have touted for ages will finally materialize.
The debunking explicitly agrees with Shirky's conclusion, and should have given more serious scrutiny to his premise. The RDF format deals with "triples" precisely to enable inferences ("syllogisms"). Syllogisms are the only thing the SemWeb brings to the table that wasn't there before. If we have to pick sources and massage data by hand as you say, then I'll go with CSV files.
> The RDF format deals with "triples" precisely to enable inferences ("syllogisms").
As far as I know, this is not and has never been true.
RDF deals with triples because they're a small unit of data, which makes it easy to take the chunks you want from one dataset and graft them onto another set.
I suppose you can call matching URIs to graft one triple onto another a syllogism, but it would be a stretch; if that's a syllogism then so is joining two tables in a relational database. It has nothing in common with the ridiculous examples Shirky uses.
> If we have to pick sources and massage data by hand as you say, then I'll go with CSV files.
Have fun merging data from multiple sources. RDF can't make this completely painless, but it can make it easier than CSV files.
Your third article doesn't make much sense to me. How is RDF "semantically committed"? An individual RDF vocabulary is "semantically committed", but so is an individual XML schema or a documented use of JSON. RDF (like XML and JSON, and the generic tools for all three) doesn't care what you put in it.
>> The RDF format deals with "triples" precisely to enable inferences ("syllogisms").
> As far as I know, this is not and has never been true.
TimBL, http://www.w3.org/DesignIssues/Semantic.html :
> sometimes it is less than evident why one should bother to map an application in RDF. The answer is that we expect this data, while limited and simple within an application, to be combined, later, with data from other applications into a Web. Applications which run over the whole web must be able to use a common framework for combining information from all these applications. For example, access control logic may use a combination of privacy and group membership and data type information to actually allow or deny access. Queries may later allow powerful logical expressions referring to data from domains in which, individually, the data representation language is not very expressive.
I'm not sure if this quote supports my point of view or yours, or even if there's any factual difference between our views.
When I talk about merging data, I'm talking about taking two independent documents, say one containing the parentOf triple and another containing the name triples, and combining them:
<brian> parentOf <bct>
<brian> name 'Brian'
<bct> name 'Brendan'
This is trivial for software to do and takes a lot of the effort out of merging datasets. It's what makes the semantic web a web; you're linking different datasets together. I don't see how Shirky's arguments apply here.
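To make "trivial for software" concrete, here's a toy sketch of that merge with triples as plain Python tuples (URIs abbreviated to bare strings for illustration; real RDF would use full URIs):

```python
# Two independent documents, each a set of (subject, predicate, object) triples.
doc_a = {("brian", "parentOf", "bct")}
doc_b = {("brian", "name", "Brian"),
         ("bct", "name", "Brendan")}

# Merging is just set union: triples that mention the same identifiers
# ("brian", "bct") automatically link up into one graph.
merged = doc_a | doc_b

# Now we can follow links across the original document boundary.
children_of_brian = {o for (s, p, o) in merged
                     if s == "brian" and p == "parentOf"}
```

The point is that no schema negotiation happens at merge time; shared identifiers do all the work.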
When I say "inferencing", I mean something like Swish http://www.ninebynine.org/RDFNotes/Swish/Intro.html#ScriptEx... does.
Given two statements:

<brian> parentOf <bct>
<bct> gender <male>

an inference engine can derive a third:

<bct> sonOf <brian>
Enabling inferences of this kind is neat, and it may be useful in the future, but it's not what the semantic web is About.
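For comparison, that kind of rule-based inference is easy to fake in a few lines. This is a hand-rolled sketch of the single rule above, not what Swish actually does:

```python
triples = {("brian", "parentOf", "bct"),
           ("bct", "gender", "male")}

# One hand-written rule: parentOf(X, Y) and gender(Y, male) => sonOf(Y, X)
def infer_sons(facts):
    derived = set()
    for (x, p, y) in facts:
        if p == "parentOf" and (y, "gender", "male") in facts:
            derived.add((y, "sonOf", x))
    return derived

new_facts = infer_sons(triples)
```

A real engine generalizes this: rules are data too, and derived facts can feed further rules.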
<brian> parentOf <bct>
<brian> name 'Brian'
<bct> name 'Brendan'

So merging these to get something like 'Brendan' hasParentNamed 'Brian' isn't an inference?
It's joining the two graphs so that you can do a query like this:

SELECT ?parentName WHERE {
  ?child name 'Brendan' .
  ?parent parentOf ?child .
  ?parent name ?parentName
}
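Here's roughly what that query does, spelled out as plain pattern matching in Python (illustrative only; the predicates are the same shorthand as above, not real URIs, and a real SPARQL engine solves all the patterns jointly rather than in stages):

```python
graph = {("brian", "parentOf", "bct"),
         ("brian", "name", "Brian"),
         ("bct", "name", "Brendan")}

# ?child name 'Brendan' . ?parent parentOf ?child . ?parent name ?parentName
def parent_names(g, child_name):
    children = {s for (s, p, o) in g if p == "name" and o == child_name}
    parents = {s for (s, p, o) in g if p == "parentOf" and o in children}
    return {o for (s, p, o) in g if p == "name" and s in parents}

result = parent_names(graph, "Brendan")
```

Note that the query spans both of the original documents; that's the payoff of the merge.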
You're being quite patient with me, thanks. :)
SELECT ?son WHERE {
  <brian> parentOf ?son .
  ?son gender <male>
}
The difference between querying and inferencing isn't what I was trying to emphasise, though. My point was the difference between being designed for making queries/inferences within a dataset, and being designed for joining distinct datasets.
Querying within a dataset is easy: SQL, XPath, XQuery, LINQ, etc. You can write rules for transforming any data model that you can query.
RDF isn't anything special in these areas (though I do think that SPARQL is an awfully nice query language). What it gives you is a way to link and merge datasets.
But this also isn't purely a technology problem: people have to choose to filter and promote good information themselves.
Some people actually do build their business on this. It's not the most sustainable business model, but in today's world, good data can be - and is - a competitive advantage. In an ideal world, should it be that way? Maybe not. I think that's what this post is getting at.
Could they really end up being the Encarta Encyclopedia of data?
And Freebase, as mentioned by dood
I am not sure if this is true or what the status is.
Open Data Commons has an interesting license for open datasets
And I think you meant 100 thousand words per language. :-)
As for dictionaries, there's wiktionary, but it's broken because it's based on words, not meanings, so you'd need 30*29 translations for each word.
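The arithmetic, assuming 30 languages as above:

```python
languages = 30

# Word-based: every ordered pair of languages needs its own translation table.
pairwise = languages * (languages - 1)   # 30 * 29

# Meaning-based: each language maps its words onto shared concepts once.
via_meanings = languages
```

That's 870 tables versus 30, and the gap grows quadratically as languages are added.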
Mmm, maybe I should do it...
It stores the data as English text, parseable into Prolog predicates (i.e., key/value data chunks), and you have a special editor that shows you the predicates generated by the English you just wrote.
Does this sound useful to anyone? I'd love to pick up on it.
The parser for English I wrote is in Lisp, and works very well (it's recursive, etc).
And of course, you don't have to make a new database; you do the usual wiki editing and just make sure the text you enter is parseable by the latest parser.
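A toy sketch of the idea in Python (the real parser described above is in Lisp; the sentence pattern and predicate name here are made up for illustration):

```python
import re

# Toy grammar rule: "X is the parent of Y" -> ("parentOf", "X", "Y")
PATTERN = re.compile(r"^(\w+) is the parent of (\w+)\.?$")

def parse(sentence):
    m = PATTERN.match(sentence.strip())
    if not m:
        return None  # not parseable by this (tiny) grammar
    return ("parentOf", m.group(1), m.group(2))

fact = parse("Brian is the parent of Brendan.")
```

The editor-feedback loop would just be: run `parse` on each sentence as it's typed and show the resulting predicate (or a warning) next to the text.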
Seems like factual data will inevitably gravitate toward free, though. The sooner the better.