1. open financial tick data -- opentick used to have some, but that's gone as far as i can tell. also, one of the biggest problems with financial data is cleaning it. with an open dataset, people might be encouraged to clean the open dataset like people contribute patches to open source code.
2. open restaurant data -- comprehensive listings of restaurants with information like hours, address, possibly menus.
I've spent a non-trivial amount of effort trying to figure out how to do number two at scale, with a high level of accuracy, in a manner that could eventually be profitable. So far I'm not close.
Over at Yell we have such data and I'm trying to persuade them to open it up through a public facing API.
That covers restaurants, locations, contact details, opening times, photos of the facade and sometimes of the inside, and some menus.
We also have a lot of other data for things like bars (what beers they have, do they have a garden, how many TV screens) and other business types.
The obstacles we've encountered to opening the data are:
1) Fear. Parts of Yell fear that the data they spend time and money creating and collating will be scraped and the crown jewels given away for free.
2) Politics. Different parts of Yell view the opening of data differently and will fight over any move even if they are in general agreement... nothing gets done.
On #1, I am trying to argue that if they really need to fear something, it should be not moving with the times. Also, with things like API keys we could rate limit to some extent and prevent major scraping, if they really think it's a big problem.
And on #2 it's currently part of their business to argue a lot it seems.
I believe that if they could see the value in opening data that both would be a non-issue and it would just be done. So I'm also working on trying to document examples of how people could use such an API that is beneficial to Yell. And on that note... if anyone has such examples I'd be glad to hear them.
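To make the API-key idea from #1 a bit more concrete, here is a minimal sketch of per-key rate limiting with a token bucket. It is not anything Yell actually runs; the class name, the rate and the burst size are all made up for illustration.

```python
import time


class TokenBucket:
    """Hypothetical per-API-key limiter: `rate` requests/second, bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.buckets = {}           # api_key -> (tokens_left, last_seen_time)

    def allow(self, api_key):
        now = self.clock()
        # A key we have not seen before starts with a full bucket.
        tokens, last = self.buckets.get(api_key, (self.capacity, now))
        # Refill in proportion to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[api_key] = (tokens - 1, now)
            return True
        self.buckets[api_key] = (tokens, now)
        return False
```

A casual user never notices a limit like this, while someone trying to pull the whole database through one key gets throttled to `rate` requests per second. It does nothing against a scraper who registers many keys, so it only reduces the problem "to some extent", as said above.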
It's pretty simple. I've never heard of Yell before now. If they'd opened up this data, then I would have. Especially if they only allowed usage under an attribution-required license.
Your number 2 is interesting. My company actually basically does this. We crawl the UK web for business sites, incl. retail and hospitality/leisure and suck out contact details, addresses, products, services, food & drink items on menus, opening times as well as a few other things. We're looking at making available an API.
Cleaning financial tick data can be very tricky, because it depends very much on what you want to use the data for. If you're not careful, any models you build on the cleaned data can fall apart in real-world usage because they fail to take into account the abnormalities in market data.
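As a rough illustration of that point, one option is to flag suspicious ticks instead of deleting them, so each user decides what to drop. The sketch below is an assumption about how that might look, not a standard cleaning method: the function name, the window of 5 and the 5% threshold are arbitrary, and it assumes prices are positive.

```python
from statistics import median


def flag_outliers(prices, window=5, max_rel_dev=0.05):
    """Return one boolean per price: True where the price deviates from the
    median of the preceding `window` prices by more than `max_rel_dev`."""
    flags = []
    for i, price in enumerate(prices):
        previous = prices[max(0, i - window):i]
        if not previous:
            # Nothing to compare the first tick against.
            flags.append(False)
            continue
        m = median(previous)
        flags.append(abs(price - m) / m > max_rel_dev)
    return flags


ticks = [100.0, 100.1, 99.9, 10.0, 100.2, 100.0]
print(flag_outliers(ticks))
```

Whether the flagged tick is a bad print or a real flash move is exactly the kind of thing that depends on your use case, which is why the flags are kept separate from the data.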
There is so much data in government that could be useful for everybody. Just a quick brainstorm produced this:
I would like to see easy access to data from the Consumer Product Safety Commission (and all relevant organisations from around the world), and to be able to have this linked directly to products.
You could then use your phone to take a picture of a bar code while in a store and an app would check this against the above data for historical information about the vendor.
In the US and UK I see some effort being made with the new government portals to download data, I just wish this practice would become more common around the world.
I would love to have complete tax filings for the US. Something like an anonymous list of people with revenue (with type) and taxes paid. It would be very interesting to do some data mining and run some simulations of different tax strategies.
Use Google Maps: locate your intended area on the map, in the "More" drop menu check the "Real estate" box, then use the ensuing checkboxes to create a pretty detailed search result, including rentals and foreclosures. What you're asking for is all there, now.
I think some aggregate, separate list of rental properties would be nice. MLS typically covers the available properties for sale fairly thoroughly but renters are left scouring through craigslist, newspapers, and shoddy realtor sites.
The strange thing about that problem is that it seems entirely an infrastructure/middle-man problem, because the people ultimately producing the data (landlords) want it as public as possible, so they can rent out their properties. There are very few landlords who, if you managed to find them and ask, would not answer yes to, "may I include your rental listing in the Big Database Anything Can Access?"
I am a Realtor in Texas and this is something I've been talking about on twitter recently. I would absolutely love to see real estate data opened up for consumption by developers.
Out of curiosity, what would you do if you had access to all national real estate data? What MLS/IDX restrictions currently prevent you from what you want to do?
I'd like a full dump of all okcupid data, including how people rate me, when they rate me, their ratings, etc. I'd like to know how long they stay on my profile, how often they click to my profile, and what they're searching for.
I mean, really, I want to know what every person I meet of the opposite gender is thinking, but that dataset won't exist until I build more hardware.
"I mean, really, I want to know what every person I meet of the opposite gender is thinking, but that dataset won't exist until I build more hardware."
You can build the dataset yourself by talking to them :)
I'd assume they wouldn't release that data because of privacy implications, how someone rates you is private to the person rating you.
But even a profile dump would be interesting, you could for example build a movie recommendation system based on it. As would aggregate statistics of which section of your profile raises the most interest, etc. (actually I'm guessing they might consider the latter if someone asked them).
I wish this were possible, but AFAIK it's not at this point. Someone correct me if I'm wrong, but they don't have an API to speak of and there isn't any means to link external resources so you could run hit trackers, analytics, etc.
While only slightly off-topic, I find myself more in need of curated information than raw data in my sidebar research.
For instance, I searched for a while for a nearly complete dictionary of English verbs and their corresponding conjugations. My friend entitled this a "declension-ary."
Yes, and inflection is the general case which includes declension and conjugation. It's unfortunate that most people are only aware of the other meaning of inflection, having to do with tone of voice.
Whenever someone gets something removed from their body at a research hospital, it gets put into a glass jar and stuck somewhere in the basement. It would be cool to sequence all of the dna in tissue along with the phenotype data (what is the tissue, what's wrong with it, why was it removed, etc) and put the data somewhere publicly accessible. I believe there are already pushes to do this but they are mired in academic and hospital bureaucracy.
Wouldn't you have to get a signed waiver from everyone?
Also, sequencing still isn't actually cheap. The price is dropping quickly and its cost is reasonable for research, but it's not reasonable for "let's sequence everything" (yet)
Yes, when I was involved with this... privacy was a major issue. But I think everyone was overthinking it. In this day and age when people twitter their turds, it seems like it would be less of an issue. The sequencing is cheap enough.
When I was involved with some people who wanted to do this, they had a security expert on the "board" who would bring up that study where given 2 (or 3?) pieces of information about a person, you can ID them. I guess just knowing the hospital and the sample and the date, you could ID the person.
Anyway, these were the kinds of discussions I had for a year and then decided to leave the world of semi-academic meta-bio research because there were too many "committees" and "boards" and "experts" and nothing ever got done.
Oh right, Information Theory. There was a cool article on HN about that a while back, showing you only need something like 3 pieces of information (zip code, birth date, and something else iirc) to identify anyone in the world.
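For a sense of scale, the information-theory arithmetic is easy to check: singling out one person among N needs about log2(N) bits, and an attribute with k equally likely values contributes log2(k) bits. The sketch below assumes a world population of roughly 7 billion and uniformly distributed attributes, which real data is not, so treat the numbers as ballpark only. The function names are mine.

```python
import math


def bits_to_identify(population):
    """Bits needed to single out one individual among `population` people."""
    return math.ceil(math.log2(population))


def bits_from_attribute(distinct_values):
    """Bits revealed by an attribute, assuming its values are equally likely."""
    return math.log2(distinct_values)


print(bits_to_identify(7_000_000_000))    # 33
print(round(bits_from_attribute(2), 2))    # two-valued attribute: 1.0
print(round(bits_from_attribute(365), 2))  # day of year of birth: 8.51
```

So a few fairly ordinary attributes already add up to a large share of the bits needed, which is why the security expert's objection carries weight.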
I used to work in a mental hospital, and we were very serious about HIPAA.
Anyhow, HIPAA only covers personally identifiable patient information so I think they'd be cool as long as it couldn't be traced back to the original person.
Although theoretically the DNA itself could be considered personally identifiable. I guess it'd be up to the courts to decide really.
Product inventory for all retail stores - it would be really nice to be able to easily figure out where I can buy something locally without a lot of legwork.
Full Google search query logs - This would be a godsend for entrepreneurs. Imagine being able to figure out what things people are looking to buy that don't exist yet.
I'd like someone to take a bunch of recently deceased human brains and apply very small amounts of voltage to the individual neurons while the brain is in some sort of Magnetoencephalography device (SQUID, SERF, Whatever).
Then average all of the samples together, shotgun the results (ala shotgun sequencing), and produce an overall average model of the relative capacitance between all the neurons in the human brain.
I'm sure it's a hopelessly naive idea, but man would it be cool to emulate an entire human brain, even if it was in 1/1000 or 1/100000 time.
Interesting idea. Extending this, I would like to see comparisons between different brains. Some people are more intelligent than others in different ways, so perhaps by examining the relative differences, we can see how different people think differently.
All book sales. Not just bestsellers, but any book. If possible over time as well to watch as things become popular. A window into what people are reading would be fascinating to me.
Thanks for the tip about Bookscan! According to Wikipedia BookScan's US Consumer Market Panel currently covers approximately 75% of retail sales. That's pretty wide coverage, which is just what I had in mind. Amazon had occurred to me, but I didn't think they would have a large enough percent of total sales.
It's everyone's birthright to have access to whatever genealogical records exist that can help trace their ancestry. The data sets exist as public records all over the world, yet a handful of gatekeepers are getting rich selling subscriptions for the "service" of getting at public records. That should stop and these databases should be standardized and open to the public for free.
I would love to find a database that describes all structures in the body and how they are connected to each other, including how these vary (not everybody has the same number of muscles, for example).
It would be great if it contained average dimensions for each, along with how each can vary.
I'd like to see statistics on the relative success of all medical procedures. I think you ought to be able to see exactly how successful the different options are, including alternatives like massage, exercise and nutritional approaches.
Being able to see how well each option works for people of varying degrees of health would also be useful. Some options may work for the morbidly obese but fail miserably with Olympic athletes.
Such data would be very difficult to collect but could help people and organizations make better choices.
Note that one problem here is that the success of individual surgeons is highly correlated with how often they do a procedure. I.e. a raw rate for surgical success won't necessarily tell you the odds of the surgeon you end up with.
I'd like complete pricing data for common purchases including both brick and mortar and online retailers. Eventually, I'd like a web site where I could enter my regular purchases (groceries + household goods) and it would tell me the cheapest way to acquire them (e.g. buy this list from the local supermarket, these items from walgreens, and order these 3 things from amazon). Obviously, the pricing data would need to be up to date esp. with regards to sales.
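If the pricing data existed, the simplest version of that site is a few lines of code. The sketch below just picks the cheapest store per item; it ignores travel time, shipping fees and minimum orders, which a real version would have to model. The store names come from the comment above, but the prices (in cents) and the function name are made up.

```python
def cheapest_plan(shopping_list, prices):
    """Assign each item to the store with the lowest price for it.

    `prices` maps store -> {item: price in cents}.
    Returns (plan, total_cents, missing), where plan maps store -> items to
    buy there and missing lists items no store carries.
    """
    plan, total, missing = {}, 0, []
    for item in shopping_list:
        offers = {store: items[item] for store, items in prices.items() if item in items}
        if not offers:
            missing.append(item)
            continue
        store = min(offers, key=offers.get)
        plan.setdefault(store, []).append(item)
        total += offers[store]
    return plan, total, missing


# Made-up prices in cents, purely illustrative.
prices = {
    "supermarket": {"milk": 250, "bread": 300, "soap": 400},
    "walgreens": {"soap": 350, "milk": 300},
    "amazon": {"soap": 380, "batteries": 900},
}
print(cheapest_plan(["milk", "bread", "soap", "batteries", "caviar"], prices))
```

The hard part is not this loop but keeping `prices` current, especially with sales, as noted above.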
Data that allows the ranking of companies in any major industry by their adherence to best practices, especially consumer-friendly best practices.
Example would be: average hold time. But this is already somewhat available. I'm thinking there's tons of other data that is not available. Such as: do doctors at this clinic/medical practice send reminders when their patients are due for checkups?
Because of RateMyStudentRental, I'd love to see data that includes college student grade-point averages, retention/graduation rates, and quality of off-campus housing/landlord. My hunch is that the higher quality the housing, the better grades and higher retention/graduation rates the students have, but there is as of yet no dataset which includes the housing-quality metric.
Actually I can already tell you from my own dataset that quality of housing has very little correlation with rent amount off-campus. An insane percentage of great landlords undercharges significantly and doesn't realize it (or maybe it's the bad landlords overcharging). Yet, because of the nature of student housing and the level of enrollment, all housing still seems to get filled at the same speed, whether under- or overcharging.
I wouldn't be too sure about that. Dotcoms used to think that Aeron chairs would provide similar ROI. Now it's back to basics, excepting the actual computer.
Actually, sitting on bad chairs all day still affects productivity in a negative way. I've bought several Aerons because I sit all day, and I rely on my ass. It should be kept happy.
What killed dot-coms wasn't expensive chairs, it was a lack of sane customer development techniques... oh, and greed.
The gaming platform 'Steam' collects a large amount of data from people's machines. While this isn't 'benchmark' data in terms of performance, it records what hardware people are using, so they have a good idea what the 'average' specification of their users' computers is.