We need a Wikipedia for data (bret.appspot.com)
66 points by bootload on Apr 9, 2008 | 54 comments

Isn't Metaweb's Freebase (http://www.freebase.com/) doing this?

Check it out if you haven't seen it before; these guys are rockstars!!

Exactly my thought. Freebase is awesome, I just wish they didn't use JSON.

I like them more because they use JSON.

Why no JSON?

I'll be the first to admit it might just be me ;)

I prefer XML and haven't used JSON extensively enough to be proficient with it. If there are pluses to using JSON over XML, I'm ignorant of them.

"Readmore" perhaps? ;)

HA, nice one! Any suggestions on where to start?

Here you go:

http://json.org
http://en.wikipedia.org/wiki/JSON

I use JSON when I'm doing Ajax things. jQuery and Django both have methods to encode/decode JSON.
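For anyone coming from XML, the round trip is about as simple as it gets. A minimal sketch using Python's standard-library `json` module (the record shown is a made-up example, not a real Freebase response):

```python
import json

# A hypothetical record of the sort a structured-data site might return:
# a topic with a few typed properties.
record = {"name": "Bob Dylan", "type": "/music/artist", "albums": 37}

# Encode to a JSON string (what an Ajax endpoint would send over the wire)...
payload = json.dumps(record)

# ...and decode it back into native data structures on the other side.
decoded = json.loads(payload)
print(decoded["name"])  # Bob Dylan
```

The appeal over XML for this use case is that the decoded result is already an ordinary dict/list structure, with no DOM traversal needed.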

Joel had an article on "commoditizing your complements": http://www.joelonsoftware.com/articles/StrategyLetterV.html . Of course we want to commoditize data, to raise the value of hackers (raise the expected return from hacking). On the other side, data companies want to commoditize hackers, to raise the value of data... a process we are sure to resent.

Absolutely, my first thought on reading this was "Why won't these data companies understand they should throw away their business so us coders can make some money?"

It's like Freebase, all right, but the article has a point about adoption: some big company would have to donate a great starting dataset to drive it.

I think this is one problem with Freebase. Another problem I see is the structuring of the data; it is a hurdle to sharing.

Finally, the last problem I see is that data is currently very much seen as a competitive advantage. When Google introduces functionality for making corrections on Google Maps, I don’t think they do it to share with everyone else. They do it because they want Google Maps to be the destination with the best possible data. A wiki of data is a great idea, but I’m not sure we have the solution yet.



This addresses the (easy) technical hurdles, but what about the (very difficult) social, economic, and bureaucratic ones?

One of the biggest competitive advantages companies have is data. Like the article and previous comment already said, adoption is the hardest barrier. Unless someone can provide a compelling reason for companies/scientists/etc to give data or data access, there really isn't much else to discuss.

I think sites like mashery and dapper.net are going in the right direction, by providing good licensing rights and monetization controls that can incent large companies with reliable datasets to participate.

It's a start for sure and it's great that freebase does this. But what I refer to is more like Google gives all their business location listings to Freebase. I don't think this will happen, but maybe something of less value while being still significant.

There's largely no such thing as "closed source" data. Many of the restrictions people claim on publicly distributed data are bogus: you cannot claim copyright on a comprehensive collection of facts.

http://www.iusmentis.com/databases/us/
http://blog.infochimps.org/2008/04/02/good-neighbors-and-ope...

I don't think baseball is cracking down on people making money on this, unless they infringed their (quite reasonable) hot news claims to the real-time data.

Baseball is the leading example of why giving away most of your data is the best use of it. The sport of baseball -- the way it's played on the field, the way players are scouted and trained, and the way it's enjoyed as a fan (Moneyball? Fantasy sports?) -- has been revolutionized by amateurs making use of free open data.

If you give out the great bulk of your data, people will enhance it with metadata, build tools on top of it, and most importantly connect it to the rest of humanity's knowledge store and mine it for connections you'd never have conceived. Giving out "up to last month" or "daily intervals" will sharply grow the market for "real time" or "second-by-second". Baseball's mission statement concerns bats, bases, butts and seats -- not visualizing correlations among heterogeneous data stores. By releasing their data for free, they let the smartest people in the world perform that second task for free.

We're about to enter the age of ubiquitous information. Drawing these data stores into open formats, making them discoverable, and interconnecting them across knowledge domains presents explosive opportunities. But who will own this data, and what access will they allow? If you want to help ensure that the answer is 'everyone' and 'all of it', come join the http://infochimps.org project, a free open community effort to build an almanac of everything.

Two years ago I wrote http://formula1db.com , to teach myself sql. To get the data, I had to screen scrape formula1.com . Once I had the data, learning SQL became a joy. I haven't done much with the data since I built the site, but I am considering open sourcing it.

What I don't understand is why sports leagues don't open-source their data. What do they lose? It's a good thing that people are so excited about your sport that they build custom apps based on it. Sadly, sports leagues don't seem to get it; I remember MLB cracking down on a fan-generated database of baseball statistics a while ago.

Actually, MLB data is fairly close to being open-sourced. Historical data IS open-sourced (though not by MLB itself):

http://baseball1.com/content/view/57/82/
http://retrosheet.org/

Current major and minor league data is available as well, though MLB will crack down on anyone who is trying to make money off of derivative products. Here's where you'll find it, as XML:


One can do pretty cool stuff with all of it, and many people have, despite the fact that we can't make money off of it:

http://minorleaguesplits.com/ (my site)

http://baseball.bornbybits.com/2008/pitchers.html (analysis based on detailed pitch speed / break information that MLB started collecting last year.)

They think that they can't sell the data if they open source it. They probably also have control issues.

They think that a popular application using "their data" is necessarily lucrative and they think that they should get a huge hunk of that money.

It's so painfully obvious that this is a good idea, and yet people are skeptical of the Semantic Web, which has been promoting this idea for almost a decade (and doesn't require centralization). Why?

Why is "metacrap" a problem for the semantic web, but not for data-Wikipedia?

The Shirky article is a well-known strawman.

Thanks for the pointer to the last one, I'll read it when I get a chance.

I really shouldn't be doing this...

> Why is "metacrap" a problem for the semantic web, but not for data-Wikipedia?

Because Wikipedia is centralized, and the SemWeb isn't.

> The Shirky article is a well-known strawman.

DH3. Contradiction

I don't want to turn this into a huge debate either, but those articles (and uncritical readings of them) have set the web back years.

> Because Wikipedia is centralized, and the SemWeb isn't.

If data-Wikipedia and a television station are both publishing data about when your favorite show is on that station, who are you more likely to believe?

Obviously you need to be careful about where your data comes from, but a single centralized source is not necessarily more trustworthy than many carefully selected sources.

Blind crawling isn't (and will probably not be) the norm for data collection on the semantic web.

> DH3. Contradiction

Heh, got me there.

Shirky's thesis is based on the idea that making inferences from data is the ultimate purpose of the semantic web.

But linked, machine-readable data--that is, the semantic web--is useful even if inferencing is useless. I don't think this is a claim that needs evidence; it should be fairly obvious.

Shirky's article's portrayal of the semantic web has little to do with the real thing. Here's a much broader debunking of it: http://www.poorbuthappy.com/ease/semantic/

I'm happy some Semantic Web proponents understand that blind crawling won't work. But TimBL disagrees:

> I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers... The ‘intelligent agents’ people have touted for ages will finally materialize.

The debunking explicitly agrees with Shirky's conclusion, and should have given more serious scrutiny to his premise. The RDF format deals with "triples" precisely to enable inferences ("syllogisms"). Syllogisms are the only thing the SemWeb brings to the table that wasn't there before. If we have to pick sources and massage data by hand as you say, then I'll go with CSV files.

Where does TimBL say that "intelligent agents" will be blindly crawling? Certainly agents have to follow links they haven't seen before (there wouldn't be much point if they didn't), but following links provided by trusted sources is vastly different from what Google does.

> The RDF format deals with "triples" precisely to enable inferences ("syllogisms").

As far as I know, this is not and has never been true.

RDF deals with triples because they're a small unit of data, which makes it easy to take the chunks you want from one dataset and graft them onto another set.

I suppose you can call matching URIs to graft one triple onto another a syllogism, but it would be a stretch; if that's a syllogism then so is joining two tables in a relational database. It has nothing in common with the ridiculous examples Shirky uses.

> If we have to pick sources and massage data by hand as you say, then I'll go with CSV files.

Have fun merging data from multiple sources. RDF can't make this completely painless, but it can make it easier than CSV files.

Your third article doesn't make much sense to me. How is RDF "semantically committed"? An individual RDF vocabulary is "semantically committed", but so is an individual XML schema or a documented use of JSON. RDF (like XML and JSON, and the generic tools for all three) doesn't care what you put in it.


>> The RDF format deals with "triples" precisely to enable inferences ("syllogisms").


> As far as I know, this is not and has never been true.

TimBL, http://www.w3.org/DesignIssues/Semantic.html :

> sometimes it is less than evident why one should bother to map an application in RDF. The answer is that we expect this data, while limited and simple within an application, to be combined, later, with data from other applications into a Web. Applications which run over the whole web must be able to use a common framework for combining information from all these applications. For example, access control logic may use a combination of privacy and group membership and data type information to actually allow or deny access. Queries may later allow powerful logical expressions referring to data from domains in which, individually, the data representation language is not very expressive.

I'm not sure if this quote supports my point of view or yours, or even if there's any factual difference between our views.

This has gotten kind of confused.

When I talk about merging data, I'm talking about taking two independent documents:

   <brian> parentOf <bct>
   <brian> name 'Brian'

   <bct> name 'Brendan'
and being able to join those graphs on the <bct> node, to say that a person named Brendan has a parent named Brian. This is what TimBL means by combining data from multiple applications (IMO).

This is trivial for software to do and takes a lot of the effort out of merging datasets. It's what makes the semantic web a web; you're linking different datasets together. I don't see how Shirky's arguments apply here.
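The "trivial for software" claim is easy to see with a toy sketch. This is not real RDF tooling, just plain (subject, predicate, object) tuples standing in for triples; merging two graphs is set union, and shared identifiers link the datasets for free:

```python
# Two independent "documents", as sets of (subject, predicate, object) triples.
doc_a = {("brian", "parentOf", "bct"), ("brian", "name", "Brian")}
doc_b = {("bct", "name", "Brendan")}

# Merging RDF graphs is just set union: the shared node "bct"
# joins the two datasets automatically, with no schema mapping step.
graph = doc_a | doc_b

# Walk the merged graph: find the name of Brendan's parent.
child = next(s for s, p, o in graph if p == "name" and o == "Brendan")
parent = next(s for s, p, o in graph if p == "parentOf" and o == child)
parent_name = next(o for s, p, o in graph if s == parent and p == "name")
print(parent_name)  # Brian
```

With CSV files, that same join requires agreeing on column layouts and key columns up front; with triples, the join key is baked into the data itself.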


When I say "inferencing", I mean something like Swish http://www.ninebynine.org/RDFNotes/Swish/Intro.html#ScriptEx... does.

Given two statements:

    <brian> parentOf <bct>
    <bct> gender <male>
and an appropriate set of rules, an inference engine can create a third statement:

    <bct> sonOf <brian>
This is what I understand Shirky's article to be about. IMO the applications of it are limited. It can also lead to the ridiculous results Shirky suggests.

Enabling inferences of this kind is neat, and it may be useful in the future, but it's not what the semantic web is About.
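For contrast, the inference step described above can be sketched the same way. The rule here (parentOf(X, Y) and gender(Y, male) implies sonOf(Y, X)) is hand-written for this one example, which is exactly the limitation: each new derived predicate needs its own rule.

```python
# Starting facts, as (subject, predicate, object) triples.
graph = {("brian", "parentOf", "bct"), ("bct", "gender", "male")}

def infer_sons(triples):
    """Apply one rule: parentOf(X, Y) & gender(Y, male) -> sonOf(Y, X)."""
    parents = {(s, o) for s, p, o in triples if p == "parentOf"}
    males = {s for s, p, o in triples if p == "gender" and o == "male"}
    return {(child, "sonOf", parent)
            for parent, child in parents if child in males}

# Add the derived statements to the graph.
graph |= infer_sons(graph)
print(("bct", "sonOf", "brian") in graph)  # True
```

Unlike the merge, this step manufactures a statement nobody wrote down, which is where Shirky's garbage-in, garbage-out worries start to bite.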

Your first example takes

    <brian> parentOf <bct>
    <brian> name 'Brian'
    <bct> name 'Brendan'
and deduces

    'Brendan' hasParentNamed 'Brian'
How is this substantially different from the second example? Forgive me if I'm thick; I'm honestly trying to understand.

It's not deducing a third property "hasParentNamed".

It's joining the two graphs so that you can do a query like this:

    SELECT ?parentName WHERE
      ?child name 'Brendan'
      ?parent parentOf ?child
      ?parent name ?parentName
to find the name of Brendan's parent.
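A toy evaluator makes the query above concrete. This is a hypothetical pattern matcher in the spirit of SPARQL, not a real implementation: strings starting with "?" are variables, and each pattern narrows the set of candidate variable bindings.

```python
# The merged graph from earlier in the thread.
graph = [("brian", "parentOf", "bct"), ("brian", "name", "Brian"),
         ("bct", "name", "Brendan")]

def match(pattern, bindings):
    """Return all extensions of `bindings` under which `pattern` matches a triple."""
    results = []
    for triple in graph:
        b = dict(bindings)
        for pat, val in zip(pattern, triple):
            if pat.startswith("?"):
                if b.setdefault(pat, val) != val:
                    break  # variable already bound to something else
            elif pat != val:
                break  # literal mismatch
        else:
            results.append(b)
    return results

def query(*patterns):
    solutions = [{}]
    for pattern in patterns:
        solutions = [b for s in solutions for b in match(pattern, s)]
    return solutions

rows = query(("?child", "name", "Brendan"),
             ("?parent", "parentOf", "?child"),
             ("?parent", "name", "?parentName"))
print(rows[0]["?parentName"])  # Brian
```

Note that this just reads facts already in the graph; no new triples are produced, which is the querying/inferencing distinction being drawn here.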

You're being quite patient with me, thanks. :)

Still not getting it, here's your second example in that syntax:

      <brian> parentOf ?son
      ?son gender <male>
What's the fundamental difference? That one example yields a new RDF triple, and the other yields a query result? Surely this is just a matter of representation.

Good point. I think you've changed my mind about the utility of inferencing :).

The difference between querying and inferencing isn't what I was trying to emphasise, though. My point was the difference between being designed for making queries/inferences within a dataset, and being designed for joining distinct datasets.

Querying within a dataset is easy: SQL, XPath, XQuery, LINQ, etc. You can write rules for transforming any data model that you can query.

RDF isn't anything special in these areas (though I do think that SPARQL is an awfully nice query language). What it gives you is a way to link and merge datasets.

Something like this is really what we need, and it's the thing that would be really revolutionary about the web. Otherwise, even though people talk about the information singularity and such, the signal of genuinely useful information stays low. Too much useless or mediocre information is worse than useless, because it makes everyone more stupid.

But, this is also not only a technology solution. People have to make a choice themselves to filter and promote good information.

"No one really wants factual data accuracy and completeness to be their competitive advantage"

Some people actually do build their business on this. It's not the most sustainable business model, but in today's world, good data can be - and is - a competitive advantage. In an ideal world, should it be that way? Maybe not. I think that's what this post is getting at.

Free vector map data is growing: http://openstreetmap.org . They were recently donated data covering the whole of the Netherlands.

Plus India. (It's not up yet.)

I remember seeing this company that organized data... oh yeah, they also wrote most of my textbook and acquired Reuters:


Could they really end up being the Encarta Encyclopedia of data?

Sounds a lot like what I think Freebase [http://www.freebase.com/] are trying to do.

I've thought about this before... Wikipedia covers unstructured data, but we need something to be its analog for structured data. Others have pointed out freebase (which kicks ass IMO), but there's also swivel (http://www.swivel.com/). Swivel's correctness model, however, doesn't seem to be the same open idea that freebase is. It seems more to be focused on data "authorities" and "official" data providers as opposed to just an "accuracy of the masses" system that the more open sites rely on.

There was a rumor that Google would host scientific data


I am not sure if this is true or what the status is.

Open Data Commons has an interesting license for open datasets


Bret is spot on. Open data would unlock a vast amount of wealth. Other suggestions: A collection of various kinds of texts, translated into 20-30 different languages. 100 million words (per language) would be fine. 20 minutes of text, spoken in thousands of different voices/accents.

An open translation dictionary is a fantastic idea. You would have to be careful to clarify the context. A word in one language can often translate to several words in another language, for example. But I think it's do-able.

And I think you meant 100 thousand words per language. :-)

I don't mean a dictionary (also a good idea), I meant texts: articles, novels, blog posts, transcripts of conversations, etc.

As for dictionaries, there's wiktionary, but it's broken because it's based on words, not meanings, so you'd need 30*29 translations for each word.
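The words-vs-meanings point has a nice arithmetic consequence: keying on meanings needs only one entry per language per concept (n entries), instead of a directed translation for every ordered language pair (n*(n-1)). A hypothetical sketch of a meaning-keyed dictionary (the concept IDs and vocabulary are invented for illustration):

```python
# Each concept maps to one surface form per language; 4 languages need
# 4 entries here, versus 4*3 = 12 directed word-to-word translations.
concepts = {
    "concept/water": {"en": "water", "de": "Wasser", "fr": "eau", "cs": "voda"},
    "concept/house": {"en": "house", "de": "Haus", "fr": "maison", "cs": "dům"},
}

def translate(word, src, dst):
    """Find the concept the word expresses in `src`, then read off `dst`."""
    for senses in concepts.values():
        if senses.get(src) == word:
            return senses.get(dst)
    return None  # word not in the dictionary

print(translate("Wasser", "de", "fr"))  # eau
```

Real words map to multiple concepts, of course, so a production version would return a list of candidate senses rather than a single hit.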

Mmm, maybe I should do it...

But a good translation is really, really tough to do. It's an art form. Just having a basic context isn't usually enough. "I'm going there to see you." Am going on foot or by car? (Gehen vs. fahren; chodit vs jezdit.) And how well do I know you? Each culture has a different point where they go from formal (Vous/Sie/Vy) to informal (tu/du/ty), based on how well you know someone. I'm telling you, it's tough.

My degree thesis was on this subject, except this...

It stores the data as English text, parseable into Prolog predicates (i.e., key/value data chunks), and you have a special editor which shows you the predicates generated by the English you just wrote.

Does this sound useful to anyone? I'd love to pick up on it.

The parser for English I wrote is in Lisp, and works very well (it's recursive, etc).

And of course, you don't have to make a new database, you just do the usual Wiki editing and just make sure the text you enter is parseable by the latest parser.

Finding high-quality data is very expensive, and it's very well-guarded with license restrictions once you do cough up for it.

Seems like factual data will inevitably gravitate toward free, though. The sooner the better.

Have you heard of freebase? From what I've read online, I think they are doing something like this.

There is also Swivel: http://www.swivel.com/

"We need a wikipedia for ______" is generally true. The problem is getting all the knowledgeable people involved with ______ to contribute their time.

The gatekeepers of this information are profiting from it. As long as that continues, they will resist change.

