
We need a Wikipedia for data - bootload
http://bret.appspot.com/entry/we-need-a-wikipedia-for-data
======
prakash
Isn't Metaweb's Freebase (<http://www.freebase.com/>) doing this?

Check it out if you haven't seen it before; these guys are rockstars!

~~~
Readmore
Exactly my thought. Freebase is awesome, I just wish they didn't use JSON.

~~~
mrtron
Why no JSON?

~~~
Readmore
I'll be the first to admit it might just be me ;)

I prefer XML and haven't used JSON extensively enough to be proficient with
it. If there are pluses to using JSON over XML, I'm ignorant of them.

~~~
pistoriusp
"Readmore" perhaps? ;)

~~~
Readmore
HA, nice one! Any suggestions on where to start?

~~~
pistoriusp
Here you go:

<http://json.org> <http://en.wikipedia.org/wiki/JSON>

I use JSON when I'm doing Ajax things. jQuery and Django both have methods to
encode/decode JSON.
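For instance, Python's standard json module (the kind of encode/decode Django and jQuery wrap for you) does the round trip in two calls:

```python
import json

# Encode a Python structure to a JSON string...
record = {"name": "Freebase", "topics": ["film", "music"], "open": True}
encoded = json.dumps(record)

# ...and decode it back; the round trip preserves the structure.
decoded = json.loads(encoded)
print(decoded["topics"])  # ['film', 'music']
```

The data here is just an illustration, not anything Freebase actually serves.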

------
cousin_it
Joel had an article on "commoditizing your complements":
<http://www.joelonsoftware.com/articles/StrategyLetterV.html> . Of course we
want to commoditize data, to raise the value of hackers (raise the expected
return from hacking). On the other side, data companies want to commoditize
hackers, to raise the value of data... a process we are sure to resent.

~~~
allertonm
Absolutely, my first thought on reading this was "Why won't these data
companies understand they should throw away their business so us coders can
make some money?"

------
jfno67
It's like Freebase all right, but the article has a point about adoption. He
points out that some big company would have to donate a great starting
dataset to drive adoption.

I think this is one problem with Freebase. Another problem I see is the
structuring of the data; it is a hurdle to sharing.

Finally, the last problem I see is that data is currently very much seen as a
competitive advantage. When Google introduces functionality to make
corrections on Google Maps, I don't think they do it to share with everyone
else. They do it because they want Google Maps to be the destination with the
best possible data. A wiki of data is a great idea, but I'm not sure we have
the solution yet.

~~~
pius
_It's like Freebase all right, but the article has a point about adoption. He
points out that some big company would have to donate a great starting
dataset to drive adoption._

<http://news.ycombinator.com/item?id=157966>

~~~
jyu
This addresses the (easy) technical hurdles, but what about the (very
difficult) social, economic, and bureaucratic ones?

One of the biggest competitive advantages companies have is data. Like the
article and previous comment already said, adoption is the hardest barrier.
Unless someone can provide a compelling reason for companies/scientists/etc to
give data or data access, there really isn't much else to discuss.

I think sites like mashery and dapper.net are going in the right direction, by
providing good licensing rights and monetization controls that can incent
large companies with reliable datasets to participate.

------
mrflip
There's largely no such thing as "closed source" data. Many of the
restrictions people claim on publicly distributed data are bogus: you cannot
claim copyright on a comprehensive collection of facts.
<http://www.iusmentis.com/databases/us/>
<http://blog.infochimps.org/2008/04/02/good-neighbors-and-open-grazing/>
I don't think baseball is cracking down on people making money on this,
unless they infringe its (quite reasonable) "hot news" claims to the
real-time data.

Baseball is the leading example of why _giving away most of your data is the
best use of it_. The sport of baseball -- the way it's played on the field,
the way players are scouted and trained, and the way it's enjoyed as a fan
(Moneyball? Fantasy sports?) -- has been revolutionized by amateurs making
use of free, open data.

If you give out the great bulk of your data, people will enhance it with
metadata, build tools on top of it, and, most importantly, connect it to the
rest of humanity's knowledge store and mine it for connections you'd never
have conceived. Giving out "up to last month" or "daily intervals" data will
sharply grow the market for "real time" or "second-by-second" data.
Baseball's mission statement concerns bats, bases, and butts in seats -- not
visualizing correlations among heterogeneous data stores. By releasing their
data for free, they give the smartest people in the world the opportunity to
perform that second task for free.

We're about to enter the age of ubiquitous information. Drawing these data
stores into open formats, making them discoverable, and interconnecting them
across knowledge domains presents explosive opportunities. But who will own
this data and what access will they allow? If you want to help ensure that the
answer is 'everyone' and 'all of it', come join the <http://infochimps.org>
project, a free open community effort to build an almanac of everything.

------
paddy_m
Two years ago I wrote <http://formula1db.com>, to teach myself SQL. To get
the data, I had to screen scrape formula1.com. Once I had the data, learning
SQL became a joy. I haven't done much with the data since I built the site,
but I am considering open-sourcing it.
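That joy is easy to reproduce with Python's built-in sqlite3 module; the table below is a made-up toy schema with invented results, not formula1db.com's actual one:

```python
import sqlite3

# Hypothetical race-results table; formula1db.com's real schema is unknown.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (driver TEXT, race TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [("Hamilton", "Monaco", 10), ("Massa", "Monaco", 8),
     ("Hamilton", "Turkey", 8), ("Massa", "Turkey", 6)],
)

# Championship standings in one query.
standings = conn.execute(
    "SELECT driver, SUM(points) AS total FROM results "
    "GROUP BY driver ORDER BY total DESC"
).fetchall()
print(standings)  # [('Hamilton', 18), ('Massa', 14)]
```

Once the scraped data is loaded, questions like "who leads the championship?" collapse into single queries like this one.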

What I don't understand is: why don't sports leagues open-source their data?
What do they lose? It's a good thing that people are so excited about your
sport that they build custom apps based on it. Sadly, sports leagues don't
seem to get it; I remember MLB cracking down on a fan-generated database of
baseball statistics a while ago.

~~~
jsackmann
Actually, MLB data is fairly close to being open-sourced. Historical data IS
open-sourced (though not by MLB itself):

<http://baseball1.com/content/view/57/82/> <http://retrosheet.org/>

Current major and minor league data is available as well, though MLB will
crack down on anyone who is trying to make money off of derivative products.
Here's where you'll find it, as XML:

<http://gdx.mlb.com/components/game/>

One can do pretty cool stuff with all of it, and many people have, despite the
fact that we can't make money off of it:

<http://minorleaguesplits.com/> (my site)

<http://baseball.bornbybits.com/2008/pitchers.html> (analysis based on
detailed pitch speed / break information that MLB started collecting last
year.)

------
bct
It's so painfully obvious that this is a good idea, and yet people are
skeptical of the Semantic Web, which has been promoting this idea for almost a
decade (and doesn't require centralization). Why?

~~~
cousin_it
Here are three anti-SemWeb articles that try to answer your question:

<http://www.well.com/~doctorow/metacrap.htm>

<http://www.shirky.com/writings/semantic_syllogism.html>

<http://blahsploitation.blogspot.com/2005/09/i-always-figured-that-at-least-rdf-was.html>

~~~
bct
Why is "metacrap" a problem for the semantic web, but not for data-Wikipedia?

The Shirky article is a well-known strawman.

Thanks for the pointer to the last one, I'll read it when I get a chance.

~~~
cousin_it
I really shouldn't be doing this...

> Why is "metacrap" a problem for the semantic web, but not for data-
> Wikipedia?

Because Wikipedia is centralized, and the SemWeb isn't.

> The Shirky article is a well-known strawman.

DH3. Contradiction

~~~
bct
I don't want to turn this into a huge debate either, but those articles (and
uncritical readings of them) have set the web back years.

> Because Wikipedia is centralized, and the SemWeb isn't.

If data-Wikipedia and a television station are both publishing data about when
your favorite show is on that station, who are you more likely to believe?

Obviously you need to be careful about where your data comes from, but a
single centralized source is not necessarily more trustworthy than many
carefully selected sources.

Blind crawling isn't (and will probably not be) the norm for data collection
on the semantic web.

> DH3. Contradiction

Heh, got me there.

Shirky's thesis is based on the idea that making inferences from data is the
ultimate purpose of the semantic web.

But linked, machine-readable data--that is, the semantic web--is useful even
if inferencing is useless. I don't think this is a claim that needs evidence;
it should be fairly obvious.

Shirky's article's portrayal of the semantic web has little to do with the
real thing. Here's a much broader debunking of it:
<http://www.poorbuthappy.com/ease/semantic/>

~~~
cousin_it
I'm happy some Semantic Web proponents understand that blind crawling won't
work. But TimBL disagrees:

> I have a dream for the Web [in which computers] become capable of analyzing
> all the data on the Web – the content, links, and transactions between
> people and computers... The ‘intelligent agents’ people have touted for ages
> will finally materialize.

The debunking explicitly agrees with Shirky's conclusion, and should have
given more serious scrutiny to his premise. The RDF format deals with
"triples" precisely to enable inferences ("syllogisms"). Syllogisms are the
only thing the SemWeb brings to the table that wasn't there before. If we have
to pick sources and massage data by hand as you say, then I'll go with CSV
files.

~~~
bct
Where does TimBL say that "intelligent agents" will be blindly crawling?
Certainly agents have to follow links they haven't seen before (there wouldn't
be much point if they didn't), but following links provided by trusted sources
is vastly different from what Google does.

> The RDF format deals with "triples" precisely to enable inferences
> ("syllogisms").

As far as I know, this is not and has never been true.

RDF deals with triples because they're a small unit of data, which makes it
easy to take the chunks you want from one dataset and graft them onto another
set.

I suppose you can call matching URIs to graft one triple onto another a
syllogism, but it would be a stretch; if that's a syllogism then so is joining
two tables in a relational database. It has nothing in common with the
ridiculous examples Shirky uses.

> If we have to pick sources and massage data by hand as you say, then I'll go
> with CSV files.

Have fun merging data from multiple sources. RDF can't make this completely
painless, but it can make it easier than CSV files.

Your third article doesn't make much sense to me. How is RDF "semantically
committed"? An individual RDF vocabulary is "semantically committed", but so
is an individual XML schema or a documented use of JSON. RDF (like XML and
JSON, and the generic tools for all three) doesn't care what you put in it.

~~~
cousin_it
Me:

>> The RDF format deals with "triples" precisely to enable inferences
("syllogisms").

You:

> As far as I know, this is not and has never been true.

TimBL, <http://www.w3.org/DesignIssues/Semantic.html> :

> sometimes it is less than evident why one should bother to map an
> application in RDF. The answer is that we expect this data, while limited
> and simple within an application, to be combined, later, with data from
> other applications into a Web. Applications which run over the whole web
> must be able to use a common framework for combining information from all
> these applications. For example, access control logic may use a combination
> of privacy and group membership and data type information to actually allow
> or deny access. Queries may later allow powerful logical expressions
> referring to data from domains in which, individually, the data
> representation language is not very expressive.

I'm not sure if this quote supports my point of view or yours, or even if
there's any factual difference between our views.

~~~
bct
This has gotten kind of confused.

When I talk about merging data, I'm talking about taking two independent
documents:

    
    
       <brian> parentOf <bct>
       <brian> name 'Brian'
    

and

    
    
       <bct> name 'Brendan'
    

and being able to join those graphs on the <bct> node, to say that a person
named Brendan has a parent named Brian. This is what TimBL means by combining
data from multiple applications (IMO).

This is trivial for software to do and takes a lot of the effort out of
merging datasets. It's what makes the semantic web a web; you're linking
different datasets together. I don't see how Shirky's arguments apply here.

--

When I say "inferencing", I mean something like Swish
[http://www.ninebynine.org/RDFNotes/Swish/Intro.html#ScriptEx...](http://www.ninebynine.org/RDFNotes/Swish/Intro.html#ScriptExample)
does.

Given two statements:

    
    
        <brian> parentOf <bct>
        <bct> gender <male>
    

and an appropriate set of rules, an inference engine can create a third
statement:

    
    
        <bct> sonOf <brian>
    

This is what I understand Shirky's article to be about. IMO the applications
of it are limited. It can also lead to the ridiculous results Shirky suggests.

Enabling inferences of this kind is neat, and it may be useful in the future,
but it's not what the semantic web is About.
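For concreteness, that parentOf/gender rule can be mimicked in a few lines of plain Python. This is a toy forward-chaining sketch with the thread's invented names, not Swish or real RDF machinery:

```python
# Toy triples from the example above; tuples stand in for RDF statements.
triples = {
    ("brian", "parentOf", "bct"),
    ("bct", "gender", "male"),
}

def infer_sons(facts):
    """Apply one rule: parentOf(p, c) & gender(c, male) => sonOf(c, p)."""
    derived = set()
    for (p, rel, c) in facts:
        if rel == "parentOf" and (c, "gender", "male") in facts:
            derived.add((c, "sonOf", p))
    return derived

print(infer_sons(triples))  # {('bct', 'sonOf', 'brian')}
```

A real inference engine generalizes this to arbitrary rule sets, but the shape of the operation is the same: match patterns over existing triples, emit new ones.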

~~~
cousin_it
Your first example takes

    
    
        <brian> parentOf <bct>
        <brian> name 'Brian'
        <bct> name 'Brendan'
    

and deduces

    
    
        'Brendan' hasParentNamed 'Brian'
    

How is this substantially different from the second example? Forgive me if I'm
thick; I'm honestly trying to understand.

~~~
bct
It's not deducing a third property "hasParentNamed".

It's joining the two graphs so that you can do a query like this:

    
    
        SELECT ?parentName WHERE
        {
          ?child name 'Brendan'
          ?parent parentOf ?child
          ?parent name ?parentName
        }
    

to find the name of Brendan's parent.
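That join can be sketched in plain Python, with toy tuples standing in for RDF and a hand-rolled pattern match standing in for a real SPARQL engine:

```python
# Merge the two documents' triples, then answer the query by joining
# on shared nodes. Merging is just set union.
doc1 = {("brian", "parentOf", "bct"), ("brian", "name", "Brian")}
doc2 = {("bct", "name", "Brendan")}
graph = doc1 | doc2

def parent_names(g, child_name):
    """?child name child_name . ?parent parentOf ?child . ?parent name ?n"""
    children = {s for (s, p, o) in g if p == "name" and o == child_name}
    parents = {s for (s, p, o) in g if p == "parentOf" and o in children}
    return {o for (s, p, o) in g if p == "name" and s in parents}

print(parent_names(graph, "Brendan"))  # {'Brian'}
```

The point survives the simplification: neither document alone can answer the query; the union, joined on the shared <bct> node, can.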

You're being quite patient with me, thanks. :)

~~~
cousin_it
Still not getting it; here's your second example in that syntax:

    
    
        SELECT ?son WHERE
        {
          <brian> parentOf ?son
          ?son gender <male>
        }
    

What's the fundamental difference? That one example yields a new RDF triple,
and the other yields a query result? Surely this is just a matter of
representation.

~~~
bct
Good point. I think you've changed my mind about the utility of inferencing
:).

The difference between querying and inferencing isn't what I was trying to
emphasise, though. My point was the difference between being designed for
making queries/inferences within a dataset, and being designed for joining
distinct datasets.

Querying within a dataset is easy: SQL, XPath, XQuery, LINQ, etc. You can
write rules for transforming any data model that you can query.

RDF isn't anything special in these areas (though I do think that SPARQL is an
awfully nice query language). What it gives you is a way to link and merge
datasets.

------
aneesh
"No one really wants factual data accuracy and completeness to be their
competitive advantage"

Some people actually do build their business on this. It's not the most
sustainable business model, but in today's world, good data can be - and is -
a competitive advantage. In an ideal world, should it be that way? Maybe not.
I think that's what this post is getting at.

------
cousin_it
Free vector map data is growing: <http://openstreetmap.org> . They recently
got donated the whole Netherlands.

~~~
craig-faber
Plus India. (It's not up yet.)

------
andreyf
I remember seeing this company that organized data... oh yeah, they also wrote
most of my textbook and acquired Reuters:

<http://en.wikipedia.org/wiki/The_Thomson_Corporation>

Could they really end up being the Encarta Encyclopedia of data?

------
dood
Sounds a lot like what I think Freebase [<http://www.freebase.com/>] are
trying to do.

------
omarseyal
I've thought about this before... Wikipedia covers unstructured data, but we
need something to be its analog for structured data. Others have pointed out
Freebase (which kicks ass, IMO), but there's also Swivel
(<http://www.swivel.com/>). Swivel's correctness model, however, doesn't seem
to be the same open idea that Freebase's is. It seems more focused on data
"authorities" and "official" data providers, as opposed to the "accuracy of
the masses" system that the more open sites rely on.

------
danohuiginn
<http://theinfo.org/get/data> <http://www.ckan.net/package/list>

And Freebase, as mentioned by dood

------
asp742
There was a rumor that Google would host scientific data

<http://blog.wired.com/wiredscience/2008/01/google-to-provi.html>

I am not sure if this is true or what the status is.

Open Data Commons has an interesting license for open datasets

<http://www.opendatacommons.org>

------
pius
<http://dbpedia.org>

------
mdemare
Bret is spot on. Open data would unlock a vast amount of wealth. Other
suggestions: A collection of various kinds of texts, translated into 20-30
different languages. 100 million words (per language) would be fine. 20
minutes of text, spoken in thousands of different voices/accents.

~~~
david927
An open translation dictionary is a fantastic idea. You would have to be
careful to clarify the context; a word in one language can often translate to
several words in another language, for example. But I think it's doable.

And I think you meant 100 thousand words per language. :-)

~~~
mdemare
I don't mean a dictionary (also a good idea), I meant texts: articles, novels,
blog posts, transcripts of conversations, etc.

As for dictionaries, there's wiktionary, but it's broken because it's based on
words, not meanings, so you'd need 30*29 translations for each word.
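The arithmetic and the fix are easy to sketch; the concept IDs below are invented purely for illustration:

```python
# Word-based: every directed language pair needs its own translation table.
languages = 30
word_pair_tables = languages * (languages - 1)
print(word_pair_tables)  # 870

# Meaning-based: each language maps shared concept IDs to words, so
# adding a language adds one table instead of 2*(n-1) of them.
en = {"bank_river": "riverbank", "bank_money": "bank"}
de = {"bank_river": "Ufer", "bank_money": "Bank"}

def translate(word, src, dst):
    """Pivot through the concept ID instead of a direct word pair."""
    concepts = [c for c, w in src.items() if w == word]
    return [dst[c] for c in concepts]

print(translate("bank", en, de))  # ['Bank']
```

Pivoting through meanings also sidesteps the ambiguity problem: a word with two senses maps to two concept IDs, and each translates independently.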

Mmm, maybe I should do it...

~~~
david927
But a good translation is really, really tough to do. It's an art form. Just
having a basic context isn't usually enough. "I'm going there to see you." Am
I going on foot or by car? (Gehen vs. fahren; chodit vs. jezdit.) And how well
do I know you? Each culture has a different point where they go from formal
(Vous/Sie/Vy) to informal (tu/du/ty), based on how well you know someone. I'm
telling you, it's tough.

------
zandorg
My degree thesis was on this subject, with one twist:

It stores the data as English text, parseable into Prolog predicates (i.e.,
key/value data chunks), and you have a special editor which shows you the
predicates generated by the English you just wrote.

Does this sound useful to anyone? I'd love to pick up on it.

The English parser I wrote is in Lisp and works very well (it's recursive,
etc.).

And of course, you don't have to make a new database, you just do the usual
Wiki editing and just make sure the text you enter is parseable by the latest
parser.

------
nreece
A Dataset Catalog: <http://www.datawrangling.com/some-datasets-available-on-the-web.html>

------
jexe
Finding high-quality data is _very_ expensive, and it's very well guarded
with license restrictions once you do cough up for it.

Seems like factual data will inevitably gravitate toward free, though. The
sooner the better.

------
Dylanfm
Regarding (some) stocks: <http://blog.infochimps.org/2008/03/20/stock-market-dataset-is-up/>

------
yters
Have you heard of freebase? From what I've read online, I think they are doing
something like this.

------
neilc
There is also Swivel: <http://www.swivel.com/>

------
kirubakaran
<http://theinfo.org/>

------
nazgulnarsil
"We need a Wikipedia for ______" is generally true. The problem is getting
all the knowledgeable people involved with ______ to contribute their time.

------
redorb
The gatekeepers of this information are profiting off of it. As long as that
continues, they will resist change.

