

Startup idea for theinfo.org community - andreyf

I love Aaron's idea to create a community around people "build[ing] a Web of data". There may be potential for a startup to help them out, also...

Well organized data can be *very* valuable. When the data are so valuable that companies providing them have revenues in the billions [1], the costs of distribution are negligible, and there is a lot of competition in such fields. However, there must be many not-as-valuable data sets, the sale of which would cover gathering and organization costs, but the distribution costs of which make it not worthwhile (hosting, coding of payment processing backend, keeping track of legal issues, etc.).

If so, would it make sense to create a website which lets people scrape their own data streams (for example, tagged and organized texts of political speeches from around the world), focusing on the quality of the data, and letting the site take care of hosting and distribution? At the least, it would be a searchable repository of organized data sets. Optimistically, it could be the search engine of the semantic web... ;)

What do you think, ladies and gents?

[1] The Thomson Corporation had revenues of $6.6 billion in 2006: http://en.wikipedia.org/wiki/The_Thomson_Corporation
======
fauigerzigerk
Yes, I think you're right that something like this could and should be done.
I've been collecting ideas and experimenting with stuff like that for a long
time. It's not easy. Finding datasets is not the big problem, so I think
tagging datasets doesn't help as much as it does with, say, photo collections.

The crucial thing is data quality. You basically have three kinds of public
datasets:

1) Academic ones, which are mostly high quality, but tend to gather dust and
are not kept up to date.

2) High quality commercial datasets, which are expensive and tightly guarded.

3) Free datasets of mostly low quality. Yes, you can use Dapper to scrape them
and Freebase to store them, but what's missing is a process to assure data
quality. That's what a community effort could provide or coordinate. Something
like apache.org for data. And there would have to be a way for non-programmers
to help, because with most datasets programmers are not the ones who know the
data best and the coding can be extremely dull. It's unbelievable how many
different ways there are to screw up data and how difficult it is to clean.
There's always some manual work left and you can't beat a pair of eyes (yet)
to spot errors.
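
To make the "process to assure data quality" a bit more concrete, here is a
minimal sketch in Python of the automated half of such a process: cheap checks
that flag suspicious records and leave the final call to a human reviewer. The
record layout (speaker, country, date, text) and the names Issue and
find_issues are assumptions made up for illustration, loosely based on the
political speeches example from the original post. They are not taken from
Dapper, Freebase, or any existing tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Issue:
    row: int      # index of the suspicious record
    field: str    # field that triggered the check
    reason: str   # short note for the human reviewer


def find_issues(records):
    """Flag records that look wrong; a person decides what to do with them."""
    issues = []
    seen = set()
    for i, rec in enumerate(records):
        # Required text fields must be present and non-blank.
        for field in ("speaker", "country", "text"):
            if not str(rec.get(field) or "").strip():
                issues.append(Issue(i, field, "missing or blank"))
        # The date must parse in year-month-day form.
        try:
            datetime.strptime(str(rec.get("date")), "%Y-%m-%d")
        except ValueError:
            issues.append(Issue(i, "date", "not a year-month-day date"))
        # A repeated (speaker, date) pair is a likely duplicate.
        key = (rec.get("speaker"), rec.get("date"))
        if key in seen:
            issues.append(Issue(i, "speaker", "duplicate speaker/date pair"))
        seen.add(key)
    return issues


if __name__ == "__main__":
    sample = [
        {"speaker": "A. Example", "country": "FR", "date": "2007-03-01", "text": "..."},
        {"speaker": "", "country": "FR", "date": "01/03/2007", "text": "..."},
        {"speaker": "A. Example", "country": "FR", "date": "2007-03-01", "text": "..."},
    ]
    for issue in find_issues(sample):
        print(issue)
```

Checks like these only catch the mechanical mistakes. The flagged rows would
still go into a review queue that non-programmers who know the data can work
through, which is the part a community would have to supply.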

There would also have to be a way for users of datasets to pay a reasonable
amount of money to have a particular dataset brought up to high quality
standards.

------
tocomment
Call me pessimistic, but I always assume most data collections are illegal in
some way: violating TOS, privacy, etc.

------
bayareaguy
Something like a community-based Dapper? <http://www.dapper.net>

------
whacked_new
freebase.com?

I don't follow them, but the vision seems similar. They launched about a year
ago.

