Ask HN: What data set would you like to see/use, but doesn't exist?
37 points by jlangenauer on July 4, 2010 | 73 comments
I was inspired by the Movity guys' post showing how they're collecting a dataset for noise in the Tenderloin.

So what other datasets would you like to see/use that don't yet exist? Perhaps there's a startup idea for somebody in here...

1. open financial tick data -- opentick used to have some, but that's gone as far as I can tell. Also, one of the biggest problems with financial data is cleaning it. With an open dataset, people might be encouraged to clean it the way people contribute patches to open source code.

2. open restaurant data -- comprehensive listings of restaurants with information like hours, address, possibly menus.

I've spent a non-trivial amount of effort trying to figure out how to do number two at scale, with a high level of accuracy, in a manner that could eventually be profitable. So far I'm not close.

I would be interested in finding out more about your approach to this or chatting about ways to make this happen.

I found this old thread: http://news.ycombinator.com/item?id=911853 but couldn't find your contact information.

Over at Yell we have such data and I'm trying to persuade them to open it up through a public facing API.

Restaurants, locations, contact details, opening times, photos of the facade and sometimes of the inside, some menus.

We also have a lot of other data for things like bars (what beers they have, do they have a garden, how many TV screens) and other business types.

The obstacles we've encountered to opening the data are: 1) Fear. Parts of Yell fear that the data they spend time and money creating and collating will be scraped and the crown jewels given away for free. 2) Politics. Different parts of Yell view the opening of data differently and will fight over any move even if they are in general agreement... nothing gets done.

On #1, I am trying to argue that if they really need to fear something, it should be not moving with the times. Also, by using things like API keys we could rate limit to some extent and prevent major scraping, if they really think it's a big problem.

And on #2 it's currently part of their business to argue a lot it seems.

I believe that if they could see the value in opening data, both would be non-issues and it would just be done. So I'm also working on documenting examples of how people could use such an API in ways that are beneficial to Yell. And on that note... if anyone has such examples I'd be glad to hear them.
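
For illustration, the per-key rate limiting mentioned above could look something like the token bucket below. This is only a minimal sketch under stated assumptions: the class name `KeyRateLimiter`, its parameters, and the in-memory dictionary are all hypothetical and say nothing about how Yell's systems actually work.

```python
import time


class KeyRateLimiter:
    """Token bucket per API key (hypothetical sketch, in-memory only).

    Each key may burst up to `capacity` requests and regains
    `refill_per_sec` requests of allowance per second.
    """

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock  # injectable so the limiter can be tested
        self._buckets = {}  # api_key -> (tokens_left, time_of_last_update)

    def allow(self, api_key):
        """Return True if this request is within the key's allowance."""
        now = self.clock()
        tokens, last = self._buckets.get(api_key, (self.capacity, now))
        # Refill in proportion to elapsed time, never beyond capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        if tokens >= 1:
            self._buckets[api_key] = (tokens - 1, now)
            return True
        self._buckets[api_key] = (tokens, now)
        return False


# Usage: a key limited to bursts of 100 requests, refilled at 1 request/second.
limiter = KeyRateLimiter(capacity=100, refill_per_sec=1.0)
if not limiter.allow("example-key"):
    print("429 Too Many Requests")
```

A scraper working through the whole dataset would drain its bucket quickly and then be held to the refill rate, while a normal application would rarely notice the limit.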

It's pretty simple. I've never heard of Yell before now. If they'd opened up this data, then I would have. Especially if they only allowed usage under an attribution-required license.

Open restaurant data is slowly being added to OpenStreetMap. I too think it is a great resource (or will be), same for shops, doctors etc.

Your number 2 is interesting. My company actually basically does this. We crawl the UK web for business sites, incl. retail and hospitality/leisure and suck out contact details, addresses, products, services, food & drink items on menus, opening times as well as a few other things. We're looking at making available an API.

1. The R Empirical Finance Task View lists a number of packages with data - http://cran.r-project.org/web/views/Finance.html. Especially check out quantmod - http://www.quantmod.com.

Cleaning financial tick data can be very tricky, because it depends very much on what you want to use the data for. If you're not careful, any models you build on the cleaned data can fall apart in real world usage because they fail to take into account the abnormalities in market data.

There is so much data in government that could be useful for everybody. Just a quick brainstorm produced this:

I would like to see easy access to data from the Consumer Product Safety Commission (and all relevant organisations from around the world), and to be able to have this linked directly to products.

You could then use your phone to take a picture of a bar code while in a store and an app would check this against the above data for historical information about the vendor.

In the US and UK I see some effort being made with the new government portals to download data; I just wish this practice would become more common around the world.

I would love to have complete tax filings for the US. Something like an anonymous list of people with revenue (with type) and taxes paid. It would be very interesting to do some data mining and run some simulations of different tax strategies.

All national real estate data with no MLS / IDX restrictions.

Use Google Maps: locate your intended area on the map, in the "More" drop menu check the "Real estate" box, then use the ensuing checkboxes to create a pretty detailed search result, including rentals and foreclosures. What you're asking for is all there, now.

I think some aggregate, separate list of rental properties would be nice. MLS typically covers the available properties for sale fairly thoroughly but renters are left scouring through craigslist, newspapers, and shoddy realtor sites.

The strange thing about that problem is that it seems entirely an infrastructure/middle-man problem, because the people ultimately producing the data (landlords) want it as public as possible, so they can rent out their properties. There are very few landlords who, if you managed to find them and ask, would not answer yes to, "may I include your rental listing in the Big Database Anything Can Access?"

I am a Realtor in Texas and this is something I've been talking about on twitter recently. I would absolutely love to see real estate data opened up for consumption by developers.

Out of curiosity, what would you do if you had access to all national real estate data? What MLS/IDX restrictions currently prevent you from what you want to do?

I'd like a full dump of all okcupid data, including how people rate me, when they rate me, their ratings, etc. I'd like to know how long they stay on my profile, how often they click to my profile, and what they're searching for.

I mean, really, I want to know what every person I meet of the opposite gender is thinking, but that dataset won't exist until I build more hardware.

"I mean, really, I want to know what every person I meet of the opposite gender is thinking, but that dataset won't exist until I build more hardware."

You can build the dataset yourself by talking to them :)

I'd assume they wouldn't release that data because of privacy implications, how someone rates you is private to the person rating you.

But even a profile dump would be interesting, you could for example build a movie recommendation system based on it. As would aggregate statistics of which section of your profile raises the most interest, etc. (actually I'm guessing they might consider the latter if someone asked them).

I wish this were possible, but AFAIK it's not at this point. Someone correct me if I'm wrong, but they don't have an API to speak of and there isn't any means to link external resources so you could run hit trackers, analytics, etc.

While only slightly off-topic, I find myself more in need of curated information than raw data in my sidebar research.

For instance, I searched for a while for a nearly complete dictionary of English verbs and their corresponding conjugations. My friend entitled this a "declension-ary."

Being overly pedantic, I wanted to note that

(a) Most dictionaries are also declension-aries,

(b) I really like your word declension-ary,

and (c) Declension is what you do to nouns and is analogous to conjugation of verbs, which I think is what you are looking for :-)

Yes, and inflection is the general case which includes declension and conjugation. It's unfortunate that most people are only aware of the other meaning of inflection, having to do with tone of voice.

I've looked for similar things, and been unable to find them.

For people like us who need these things, but don't want to do it alone, there should be a sort of Wikipedia/Digg-like distributed lexical database.

Everyone contributes data as it's acquired, and it is then verified by the users with a simple up/down vote for each child node.

The data could even be automatically acquired, and just manually verified as it is utilized.

Still an overwhelming task, but I think it would be worth it given all the fields that overlap with Natural Language Processing these days.

Whenever someone gets something removed from their body at a research hospital, it gets put into a glass jar and stuck somewhere in the basement. It would be cool to sequence all of the DNA in the tissue along with the phenotype data (what is the tissue, what's wrong with it, why was it removed, etc) and put the data somewhere publicly accessible. I believe there are already pushes to do this but they are mired in academic and hospital bureaucracy.

Wouldn't you have to get a signed waiver from everyone?

Also, sequencing still isn't actually cheap. The price is dropping quickly and its cost is reasonable for research, but it's not reasonable for "let's sequence everything" (yet)

Yes, when I was involved with this... privacy was a major issue. But I think everyone was overthinking it. In this day and age when people twitter their turds, it seems like it would be less of an issue. The sequencing is cheap enough.

How about if the data collectors just agree to ignore any personally identifiable information? Probably more interested in the data aggregate anyway.

When I was involved with some people who wanted to do this, they had a security expert on the "board" who would bring up that study where given 2 (or 3?) pieces of information about a person, you can ID them. I guess just knowing the hospital and the sample and the date, you could ID the person.

Anyway, these were the kinds of discussions I had for a year and then decided to leave the world of semi-academic meta-bio research because there were too many "committees" and "boards" and "experts" and nothing ever got done.

Oh right, Information Theory. There was a cool article on HN about that a while back, showing you only need something like 3 bits (zip code, birth date, and something else iirc) to identify anyone in the world.
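
As a rough sanity check on the "bits" framing: singling out one person among N equally likely candidates takes about log2(N) bits, and an attribute with k equally likely values contributes about log2(k) bits. The sketch below works that arithmetic through. The population figure is an approximate 2010 value and the uniform-distribution simplification is an assumption for illustration, not something taken from the article mentioned above.

```python
import math

# Assumption: rough world population around 2010.
WORLD_POPULATION = 6_900_000_000


def bits_to_identify(population):
    """Bits needed to single out one member, if all are equally likely."""
    return math.log2(population)


def bits_from_attribute(distinct_values):
    """Bits revealed by an attribute with equally likely distinct values."""
    return math.log2(distinct_values)


needed = bits_to_identify(WORLD_POPULATION)
day_of_year = bits_from_attribute(365)  # birthday without the year
gender = bits_from_attribute(2)

print(f"bits to identify one person worldwide: {needed:.1f}")
print(f"bits from day-of-year birthday:        {day_of_year:.1f}")
print(f"bits from gender:                      {gender:.1f}")
```

Under those assumptions the total comes to roughly 33 bits, so a handful of attributes that each carry several bits can add up to an identification surprisingly quickly.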

There are also HIPAA laws to consider if you opened up all that information. Not to mention the privacy concerns.

I used to work in a mental hospital, and we were very serious about HIPAA.

Anyhow, HIPAA only covers personally identifiable patient information so I think they'd be cool as long as it couldn't be traced back to the original person.

Although theoretically the DNA itself could be considered personally identifiable. I guess it'd be up to the courts to decide really.

Product inventory for all retail stores - it would be really nice to be able to easily figure out where I can buy something locally without a lot of legwork.

Full Google search query logs - This would be a godsend for entrepreneurs. Imagine being able to figure out what things people are looking to buy that don't exist yet.

I'd love to see open historical weather data by geography. It's definitely available but not that easy to get.

On a similar note, free realtime radar data could also be useful.

Product UPC code data. It's much more difficult than it should be to disambiguate a UPC code to a product name.

All medicines with their posology, side effects, etc.

A giant pharmaceutical database.

I'd like someone to take a bunch of recently deceased human brains and apply very small amounts of voltage to the individual neurons while the brain is in some sort of Magnetoencephalography device (SQUID, SERF, Whatever).

Then average all of the samples together, shotgun the results (ala shotgun sequencing), and produce an overall average model of the relative capacitance between all the neurons in the human brain.

I'm sure it's a hopelessly naive idea, but man would it be cool to emulate an entire human brain, even if it was in 1/1000 or 1/100000 time.

Interesting idea. Extending this, I would like to see comparisons between different brains. Some people are more intelligent than others in different ways, so perhaps by examining the relative differences, we can see how different people think differently.

A full dump of Wal-Mart's product inventory by store id and location updated in real-time

That would be fun. Especially when cross referenced with other data sets.

Things like 'Condom sales rise when House is a rerun.'

I'm pretty sure that's close to existing; you just can't get your hands on it.

Tesco (third largest grocery retailer) have an API which offers that data.

All book sales. Not just bestsellers, but any book. If possible over time as well to watch as things become popular. A window into what people are reading would be fascinating to me.

Nielsen BookScan collect that data but you have to licence it. Alternatively, you could get the data from Amazon.

Thanks for the tip about Bookscan! According to Wikipedia BookScan's US Consumer Market Panel currently covers approximately 75% of retail sales. That's pretty wide coverage, which is just what I had in mind. Amazon had occurred to me, but I didn't think they would have a large enough percent of total sales.

TV listings and ratings data for the last 20 years.

No idea what I'd do with it, but I'd find something to reference it against.

It's everyone's birthright to have access to whatever genealogical records exist that can help trace their ancestry. The data sets exist as public records all over the world, yet a handful of gatekeepers are getting rich selling subscriptions for the "service" of getting at public records. That should stop and these databases should be standardized and open to the public for free.

I've wanted for a while to see a heatmap of the prices of ice cream trucks or hot dog vendors in Toronto in the summer.

Transit-related data would be interesting too. Where people are travelling to/from, and how they do so.

I'm not sure what I'd do with either data set, but my sense of curiosity would have fun with them.

Couldn't help but have mental images of a human junk-food pacman, chasing trucks.

I would love to find a database that describes all structures in the body and how they are connected to each other, including how these vary (not everybody has the same number of muscles, for example).

It would be great if it contained average dimensions for each, along with how each can vary.

I'd like to see statistics on the relative success of all medical procedures. I think you ought to be able to see exactly how successful the different options are, including alternatives like massage, exercise and nutritional approaches.

Being able to see how well each option works for people of varying degrees of health would also be useful. Some options may work for the morbidly obese but fail miserably with Olympic athletes.

Such data would be very difficult to collect but could help people and organizations make better choices.

Note that one problem here is that the success of individual surgeons is highly correlated with how often they do a procedure. I.e. a raw rate for surgical success won't necessarily tell you the odds of the surgeon you end up with.

I'd like complete pricing data for common purchases including both brick and mortar and online retailers. Eventually, I'd like a web site where I could enter my regular purchases (groceries + household goods) and it would tell me the cheapest way to acquire them (e.g. buy this list from the local supermarket, these items from walgreens, and order these 3 things from amazon). Obviously, the pricing data would need to be up to date esp. with regards to sales.
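
Given such a pricing dataset, the "cheapest way to acquire them" step could be sketched roughly as follows. This is a deliberately naive version: it picks the lowest listed price per item and ignores travel time, shipping fees, and minimum order sizes. The store names, items, and prices are hypothetical placeholders, not real data.

```python
def cheapest_plan(shopping_list, prices):
    """Assign each item to the store with the lowest listed price.

    `prices` maps store name -> {item: price in cents}.
    Returns (plan, total_cents), where plan maps store -> items to buy there.
    """
    plan = {}
    total = 0
    for item in shopping_list:
        # Collect every store that lists this item.
        offers = {
            store: catalog[item]
            for store, catalog in prices.items()
            if item in catalog
        }
        if not offers:
            raise KeyError(f"no store lists {item!r}")
        store = min(offers, key=offers.get)
        plan.setdefault(store, []).append(item)
        total += offers[store]
    return plan, total


# Hypothetical prices in cents, purely for illustration.
prices = {
    "supermarket": {"milk": 299, "bread": 250, "soap": 199},
    "drugstore": {"soap": 149, "toothpaste": 320},
    "online": {"toothpaste": 275, "bread": 310},
}

plan, total = cheapest_plan(["milk", "bread", "soap", "toothpaste"], prices)
for store, items in plan.items():
    print(f"{store}: {', '.join(items)}")
print(f"total: {total / 100:.2f}")
```

A more realistic version would add a per-store cost for each extra trip or delivery, which turns the problem into a small optimization rather than a per-item minimum. The hard part remains what the comment says: keeping the price data current.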


Does this for the four biggest supermarkets in the UK.

Data that allows the ranking of companies in any major industry by their adherence to best practices, especially consumer-friendly best practices.

Example would be: average hold time. But this is already somewhat available. I'm thinking there's tons of other data that is not available. Such as: do doctors at this clinic/medical practice send reminders when their patients are due for checkups?

Because of RateMyStudentRental, I'd love to see data that includes college student grade-point averages, retention/graduation rates, and quality of off-campus housing/landlord. My hunch is that the higher quality the housing, the better grades and higher retention/graduation rates the students have, but there is as of yet no dataset which includes the housing-quality metric.

I'd imagine 'quality of housing' is directly proportional to 'cost of rent'

So I'd have thought that grades are largely irrelevant in this case, you could have a really stupid person with rich parents etc.

Actually I can already tell you from my own dataset that quality of housing has very little correlation with rent amount off-campus. An insane percentage of great landlords undercharge significantly and don't realize it (or maybe it's the bad landlords overcharging). Yet, because of the nature of student housing and the level of enrollment, all housing still seems to get filled at the same speed, whether under- or overcharging.

I wouldn't be too sure about that. Dotcoms used to think that Aeron chairs would provide similar ROI. Now it's back to basics, excepting the actual computer.

Actually, sitting on bad chairs all day still affects productivity in a negative way. I've bought several Aerons because I sit all day, and I rely on my ass. It should be kept happy.

What killed dot-coms wasn't expensive chairs, it was a lack of sane customer development techniques... oh, and greed.

I'd like to see the twitter archive available on something like AWS. There's lots of cool data mining you could do with it.

Google's datasets. All of them. I mean, they exist, but not in completeness for anyone outside Google.

Government data. Everything that is legally open, put in databases with a standard, open interface.

I'd love a HackerNews data dump. Really.

Is it available somewhere (apart from scraping)?

Depending on what you want to do with it, this might be what you're looking for: http://www.mattmazur.com/2010/03/six-months-of-hackernews-fr...

Every website's publicly accessible forms. Think Yahoo Pipes, except using forms that already exist.

Short revisit satellite imagery for outdoor fire detection.

http://activefiremaps.fs.fed.us/ is a good place to start. I think ESA has similar data but possibly not as easily available.

Aggregate benchmark data for computer hardware

The gaming platform 'Steam' collects a large amount of data from people's machines. While this isn't 'benchmark' data in terms of performance, it records what hardware people are using, so they have a good idea what the 'average' specification of their users' computers is.

See: http://store.steampowered.com/hwsurvey/

prescription drugs inventory, by pharmacy/city/state, updated daily

customs (imports/exports)

corporate balance sheet

hotel occupancy rates

government contracts
