So what other datasets would you like to see/use that don't yet exist? Perhaps there's a startup idea for somebody in here...
2. open restaurant data -- comprehensive listings of restaurants with information like hours, address, possibly menus.
I found this old thread: http://news.ycombinator.com/item?id=911853 but couldn't find your contact information.
Restaurants, locations, contact details, opening times, photos of the facade and sometimes of the inside, some menus.
We also have a lot of other data for things like bars (what beers they have, do they have a garden, how many TV screens) and other business types.
The obstacles we've encountered to opening the data are:
1) Fear. Parts of Yell fear that the data they spend time and money creating and collating will be scraped and the crown jewels given away for free.
2) Politics. Different parts of Yell view the opening of data differently and will fight over any move even if they are in general agreement... nothing gets done.
On #1, I am trying to argue that if they really need to fear something, it should be not moving with the times. Also, by using things like API keys we could rate limit to some extent and prevent major scraping, if they really think it's a big problem.
And on #2 it's currently part of their business to argue a lot it seems.
I believe that if they could see the value in opening the data, both would be non-issues and it would just be done. So I'm also working on documenting examples of how people could use such an API in ways that are beneficial to Yell. And on that note... if anyone has such examples I'd be glad to hear them.
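For what it's worth, the per-key rate limiting mentioned above can be quite small. Here is a rough sketch using a token bucket; the class name, the `rate` and `capacity` parameters, and the in-memory dict are all my own assumptions for illustration, not anything Yell actually runs.

```python
import time


class KeyRateLimiter:
    """Hypothetical per-API-key token bucket.

    Each key may burst up to `capacity` requests and is refilled
    at `rate` tokens per second.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.buckets = {}           # api_key -> (tokens, last_seen_time)

    def allow(self, api_key):
        now = self.clock()
        # A key seen for the first time starts with a full bucket.
        tokens, last = self.buckets.get(api_key, (self.capacity, now))
        # Refill for the elapsed time, never exceeding capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[api_key] = (tokens - 1, now)
            return True
        self.buckets[api_key] = (tokens, now)
        return False
```

A scraper hammering one key burns through its bucket and starts getting refusals, while ordinary users on other keys are unaffected. It won't stop someone registering many keys, so it only limits scraping "to some extent".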
I would like to see easy access to data from the Consumer Product Safety Commission (and all relevant organisations from around the world), and to be able to have this linked directly to products.
You could then use your phone to take a picture of a bar code while in a store and an app would check this against the above data for historical information about the vendor.
In the US and UK I see some effort being made with the new government portals to download data. I just wish this practice would become more common around the world.
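The barcode check itself would be the easy part. A minimal sketch, assuming the scanned code is a 12-digit UPC-A and that the recall data has already been loaded into a dict keyed by UPC (the `recalls_by_upc` structure and the sample record below are made up for illustration; as far as I know no such linked dataset exists yet):

```python
def upc_a_is_valid(code):
    """Return True if `code` is a 12-digit UPC-A string with a correct check digit."""
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(c) for c in code]
    # Positions 1,3,...,11 are weighted 3; positions 2,4,...,10 are weighted 1.
    total = 3 * sum(digits[0:11:2]) + sum(digits[1:11:2])
    return (10 - total % 10) % 10 == digits[11]


def recalls_for(code, recalls_by_upc):
    """Look up recall records for a scanned code; reject misreads early."""
    if not upc_a_is_valid(code):
        raise ValueError("not a valid UPC-A code: %r" % code)
    return recalls_by_upc.get(code, [])


# Placeholder data, purely illustrative.
sample_recalls = {
    "036000291452": ["example recall notice (placeholder)"],
}
```

The check-digit test catches most misreads from the phone camera before anything is sent to a server. The hard part remains getting the safety data published and linked to product codes in the first place.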
I mean, really, I want to know what every person I meet of the opposite gender is thinking, but that dataset won't exist until I build more hardware.
You can build the dataset yourself by talking to them :)
But even a profile dump would be interesting; you could, for example, build a movie recommendation system based on it. As would aggregate statistics of which section of your profile raises the most interest, etc. (actually I'm guessing they might consider the latter if someone asked them).
For instance, I searched for a while for a nearly complete dictionary of English verbs and their corresponding conjugations. My friend entitled this a "declension-ary."
(a) Most dictionaries are also declension-aries,
(b) I really like your word declension-ary,
and (c) Declension is what you do to nouns and is analogous to conjugation of verbs, which I think is what you are looking for :-)
For people like us who need these things, but don't want to do it alone, there should be a sort of Wikipedia/Digg-like distributed lexical database.
Everyone contributes data as it's acquired, and it is then verified by the users with a simple up/down vote for each child node.
The data could even be automatically acquired, and just manually verified as it is utilized.
Still an overwhelming task, but I think it would be worth it given all the fields that overlap with Natural Language Processing these days.
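To make the voting idea concrete, here is a toy sketch. Everything in it (the `LexicalVotes` name, the threshold of 3 net votes, keying entries by lemma, label, and form) is an assumption of mine, just one way it might work:

```python
from collections import defaultdict


class LexicalVotes:
    """Hypothetical store of user-submitted word forms with up/down votes."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        # (lemma, label, form) -> net votes, e.g. ("go", "past", "went") -> 3
        self.votes = defaultdict(int)

    def vote(self, lemma, label, form, up=True):
        """Record one up or down vote on a proposed form."""
        self.votes[(lemma, label, form)] += 1 if up else -1

    def verified(self, lemma):
        """Return {label: form} for forms at or above the threshold.

        If several forms compete for the same label, the one with
        the highest net score wins.
        """
        best = {}
        for (entry_lemma, label, form), score in self.votes.items():
            if entry_lemma != lemma or score < self.threshold:
                continue
            if label not in best or score > best[label][1]:
                best[label] = (form, score)
        return {label: form for label, (form, _) in best.items()}
```

Automatically acquired forms would enter with a score of zero and only show up as verified once enough users have voted them up while using the data.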
Also, sequencing still isn't actually cheap. The price is dropping quickly and its cost is reasonable for research, but it's not reasonable for "let's sequence everything" (yet)
Anyway, these were the kinds of discussions I had for a year and then decided to leave the world of semi-academic meta-bio research because there were too many "committees" and "boards" and "experts" and nothing ever got done.
Anyhow, HIPAA only covers personally identifiable patient information so I think they'd be cool as long as it couldn't be traced back to the original person.
Although theoretically the DNA itself could be considered personally identifiable. I guess it'd be up to the courts to decide really.
Full Google search query logs - This would be a godsend for entrepreneurs. Imagine being able to figure out what things people are looking to buy that don't exist yet.
A giant pharmaceutical database.
Then average all of the samples together, shotgun the results (à la shotgun sequencing), and produce an overall average model of the relative capacitance between all the neurons in the human brain.
I'm sure it's a hopelessly naive idea, but man would it be cool to emulate an entire human brain, even if it was in 1/1000 or 1/100000 time.
Things like 'Condom sales rise when House is a rerun.'
No idea what I'd do with it, but I'd find something to reference it against.
Transit-related data would be interesting too. Where people are travelling to/from, and how they do so.
I'm not sure what I'd do with either data set, but my sense of curiosity would have fun with them.
It would be great if it contained average dimensions for each, along with how each can vary.
Being able to see how well each option works for people of varying degrees of health would also be useful. Some options may work for the morbidly obese but fail miserably with Olympic athletes.
Such data would be very difficult to collect but could help people and organizations make better choices.
Does this for the four biggest supermarkets in the UK.
Example would be: average hold time. But this is already somewhat available. I'm thinking there's tons of other data that is not available. Such as: do doctors at this clinic/medical practice send reminders when their patients are due for checkups?
So I'd have thought that grades are largely irrelevant in this case; you could have a really stupid person with rich parents, etc.
What killed dot-coms wasn't expensive chairs, it was a lack of sane customer development techniques... oh, and greed.
Is it available somewhere (apart from scraping)?
corporate balance sheet
hotel occupancy rates