

Ask HN: What data set would you like to see/use, but doesn't exist? - jlangenauer

I was inspired by the Movity guys' post showing how they're collecting a dataset for noise in the Tenderloin.

So what other datasets would you like to see/use that don't yet exist? Perhaps there's a startup idea for somebody in here...
======
arthurdent
1\. open financial tick data -- opentick used to have some, but that's gone as
far as i can tell. also, one of the biggest problems with financial data is
cleaning it. with an open dataset, people might be encouraged to clean the
open dataset like people contribute patches to open source code.

2\. open restaurant data -- comprehensive listings of restaurants with
information like hours, address, possibly menus.

~~~
transatlantic
I've spent a non-trivial amount of effort trying to figure out how to do
number two at scale, with a high level of accuracy, in a manner that could
eventually be profitable. So far I'm not close.

~~~
arthurdent
I would be interested in finding out more about your approach to this or
chatting about ways to make this happen.

I found this old thread: <http://news.ycombinator.com/item?id=911853> but
couldn't find your contact information.

~~~
buro9
Over at Yell we have such data and I'm trying to persuade them to open it up
through a public facing API.

Restaurants, locations, contact details, opening times, photos of the facade
and sometimes of the inside, some menus.

We also have a lot of other data for things like bars (what beers they have,
do they have a garden, how many TV screens) and other business types.

The obstacles we've encountered to opening the data are: 1) Fear. Parts of
Yell fear that the data they spend time and money creating and collating will
be scraped and the crown jewels given away for free. 2) Politics. Different
parts of Yell view the opening of data differently and will fight over any
move even if they are in general agreement... nothing gets done.

I am trying to argue on #1 that if they really need to fear something, it
should be not moving with the times. Also, through the use of things like
API keys, we could rate limit to some extent to prevent major scraping
if they really think it's a big problem.
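
A minimal sketch of what per-key rate limiting could look like, using a token bucket. Nothing here reflects Yell's actual systems; the class names, the `rate` and `capacity` parameters, and the injectable `now` clock are all assumptions made for illustration.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.now = now                    # injectable clock, handy for testing
        self.tokens = float(capacity)     # start with a full bucket
        self.stamp = now()

    def allow(self):
        current = self.now()
        elapsed = current - self.stamp
        self.stamp = current
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class KeyedLimiter:
    """Keep one independent bucket per API key (hypothetical helper)."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.now = now
        self.buckets = {}

    def allow(self, api_key):
        bucket = self.buckets.get(api_key)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.now)
            self.buckets[api_key] = bucket
        return bucket.allow()
```

With something like `KeyedLimiter(rate=1.0, capacity=60)`, a single key could burst to 60 requests and then sustain about one per second, which slows bulk scraping without getting in the way of ordinary use.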

And on #2 it's currently part of their business to argue a lot it seems.

I believe that if they could see the value in opening data that both would be
a non-issue and it would just be done. So I'm also working on trying to
document examples of how people could use such an API that is beneficial to
Yell. And on that note... if anyone has such examples I'd be glad to hear
them.

~~~
steveitis
It's pretty simple. I've never heard of Yell before now. If they'd opened up
this data, then I would have. Especially if they only allowed usage under an
attribution-required license.

------
zmmz
There is so much data in government that could be useful for everybody. Just a
quick brainstorm produced this:

I would like to see easy access to data from the Consumer Product Safety
Commission (and all relevant organisations from around the world), and to be
able to have this linked directly to products.

You could then use your phone to take a picture of a bar code while in a
store and an app would check this against the above data for historical
information about the vendor.

In the US and UK I see some effort being made with the new government portals
to download data, I just wish this practice would become more common around
the world.

------
protomyth
I would love to have complete tax filings for the US. Something like an
anonymous list of people with revenue (with type) and taxes paid. It would
be very interesting to do some data mining and do some simulations of different
tax strategies.

------
dpcan
All national real estate data with no MLS / IDX restrictions.

~~~
tomhogans
I think some aggregate, separate list of rental properties would be nice. MLS
typically covers the available properties for sale fairly thoroughly but
renters are left scouring through craigslist, newspapers, and shoddy realtor
sites.

~~~
_delirium
The strange thing about that problem is that it seems entirely an
infrastructure/middle-man problem, because the people ultimately producing the
data (landlords) _want_ it as public as possible, so they can rent out their
properties. There are very few landlords who, if you managed to find them and
ask, would not answer yes to, "may I include your rental listing in the Big
Database Anything Can Access?"

------
stochastician
I'd like a full dump of all okcupid data, including how people rate me, when
they rate me, their ratings, etc. I'd like to know how long they stay on my
profile, how often they click to my profile, and what they're searching for.

I mean, really, I want to know what every person I meet of the opposite gender
is thinking, but that dataset won't exist until I build more hardware.

~~~
thibaut_barrere
"I mean, really, I want to know what every person I meet of the opposite
gender is thinking, but that dataset won't exist until I build more hardware."

You can build the dataset yourself by talking to them :)

------
brg
While only slightly off-topic, I find myself more in need of curated
information than raw data in my sidebar research.

For instance, I searched for a while for a nearly complete dictionary of
English verbs and their corresponding conjugations. My friend entitled this a
"declension-ary."

~~~
carbocation
Being overly pedantic, I wanted to note that

(a) Most dictionaries are also declension-aries,

(b) I really like your word declension-ary,

and (c) Declension is what you do to nouns and is analogous to conjugation of
verbs, which I think is what you are looking for :-)

~~~
benpbenp
Yes, and inflection is the general case which includes declension and
conjugation. It's unfortunate that most people are only aware of the other
meaning of inflection, having to do with tone of voice.

------
scottyallen
Product inventory for all retail stores - it would be really nice to be able
to easily figure out where I can buy something locally without a lot of
legwork.

Full Google search query logs - This would be a godsend for entrepreneurs.
Imagine being able to figure out what things people are looking to buy that
don't exist yet.

------
starkfist
Whenever someone gets something removed from their body at a research
hospital, it gets put into a glass jar and stuck somewhere in the basement. It
would be cool to sequence all of the dna in tissue along with the phenotype
data (what is the tissue, what's wrong with it, why was it removed, etc) and
put the data somewhere publicly accessible. I believe there are already pushes
to do this but they are mired in academic and hospital bureaucracy.

~~~
btmorex
Wouldn't you have to get a signed waiver from everyone?

Also, sequencing still isn't actually cheap. The price is dropping quickly and
its cost is reasonable for research, but it's not reasonable for "let's
sequence _everything_ " (yet)

~~~
starkfist
Yes, when I was involved with this... privacy was a major issue. But I think
everyone was overthinking it. In this day and age when people twitter their
turds, it seems like it would be less of an issue. The sequencing is cheap
enough.

~~~
SkyMarshal
How about if the data collectors just agree to ignore any personally
identifiable information? Probably more interested in the data aggregate
anyway.

~~~
starkfist
When I was involved with some people who wanted to do this, they had a
security expert on the "board" who would bring up that study where given 2 (or
3?) pieces of information about a person, you can ID them. I guess just
knowing the hospital and the sample and the date, you could ID the person.

Anyway, these were the kinds of discussions I had for a year and then decided
to leave the world of semi-academic meta-bio research because there were too
many "committees" and "boards" and "experts" and nothing ever got done.

~~~
SkyMarshal
Oh right, Information Theory. There was a cool article on HN about that a
while back, showing you only need something like 3 pieces of information (zip
code, birth date, and something else iirc) to identify almost anyone.
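
A rough back-of-the-envelope sketch of the arithmetic behind this kind of re-identification claim. Singling out one person among N takes about log2(N) bits, and each attribute contributes at most log2 of its number of possible values. The population and attribute counts below are assumed round numbers for illustration, not figures from the article mentioned.

```python
import math


def bits_to_identify(population):
    """Bits needed to single out one member of a population of this size."""
    return math.log2(population)


def bits_from_attributes(cardinalities):
    """Upper bound on the bits carried by a set of attributes.

    Assumes the attributes are independent and uniformly distributed,
    which real data is not, so actual identifying power is lower.
    """
    return sum(math.log2(n) for n in cardinalities)


# Assumed, rough figures for illustration only.
WORLD_POPULATION = 7_000_000_000
ATTRIBUTES = {
    "zip code": 40_000,        # assumed number of distinct zip codes
    "birth date": 365 * 80,    # assumed 80 years of possible birth dates
    "gender": 2,
}

needed = bits_to_identify(WORLD_POPULATION)
available = bits_from_attributes(ATTRIBUTES.values())
print(f"needed: {needed:.1f} bits, available at most: {available:.1f} bits")
```

Under these assumptions, identifying anyone in the world takes roughly 33 bits, and three ordinary attributes can in principle supply most of that.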

------
dangoldin
I'd love to see open historical weather data by geography. It's definitely
available but not that easy to get.

~~~
seancron
On a similar note, free realtime radar data could also be useful.

------
Kilimanjaro
All medicines with their posology, side effects, etc.

A giant pharmaceutical database.

------
babyshake
Product UPC code data. It's much more difficult than it should be to
disambiguate a UPC code to a product name.
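
A small sketch of the first step such a lookup might take: validating a 12-digit UPC-A code with its standard check digit before consulting a product table. The `PRODUCTS` table and the `lookup` helper are hypothetical stand-ins for the open database this comment wishes existed.

```python
def upc_a_check_digit(first_eleven):
    """Compute the UPC-A check digit from the first 11 digits."""
    digits = [int(c) for c in first_eleven]
    if len(digits) != 11:
        raise ValueError("expected exactly 11 digits")
    odd = sum(digits[0::2])    # 1st, 3rd, ..., 11th digits
    even = sum(digits[1::2])   # 2nd, 4th, ..., 10th digits
    return (10 - (3 * odd + even) % 10) % 10


def is_valid_upc_a(code):
    """True if `code` is 12 ASCII digits whose last digit is the correct check digit."""
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        return False
    return upc_a_check_digit(code[:11]) == int(code[11])


# Hypothetical product table; a real one is exactly the missing dataset.
PRODUCTS = {"036000291452": "example product"}


def lookup(code):
    """Return a product name for a valid, known code, else None."""
    if not is_valid_upc_a(code):
        return None
    return PRODUCTS.get(code)
```

Validation only catches mistyped or misread codes; mapping a valid code to a product name still depends on having the data.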

------
steveitis
I'd like someone to take a bunch of recently deceased human brains and apply
very small amounts of voltage to the individual neurons while the brain is in
some sort of Magnetoencephalography device (SQUID, SERF, Whatever).

Then average all of the samples together, shotgun the results (ala shotgun
sequencing), and produce an overall average model of the relative capacitance
between all the neurons in the human brain.

I'm sure it's a hopelessly naive idea, but man would it be cool to emulate an
entire human brain, even if it was in 1/1000 or 1/100000 time.

~~~
palish
Interesting idea. Extending this, I would like to see comparisons between
different brains. Some people are more intelligent than others _in different
ways_ , so perhaps by examining the relative differences, we can see how
different people think differently.

------
physcab
A full dump of Wal-Mart's product inventory by store id and location updated
in real-time

~~~
steveitis
That would be fun. Especially when cross referenced with other data sets.

Things like 'Condom sales rise when House is a rerun.'

------
jeromec
All book sales. Not just bestsellers, but any book. If possible over time as
well to watch as things become popular. A window into what people are reading
would be fascinating to me.

~~~
ig1
Nielsen Bookscan collect that data but you have to licence it, alternatively
you could get the data from amazon.

~~~
jeromec
Thanks for the tip about Bookscan! According to Wikipedia BookScan's US
Consumer Market Panel currently covers approximately 75% of retail sales.
That's pretty wide coverage, which is just what I had in mind. Amazon had
occurred to me, but I didn't think they would have a large enough percent of
total sales.

------
steveitis
TV listings and ratings data for the last 20 years.

No idea what I'd do with it, but I'd find something to reference it against.

------
fakefred
It's everyone's birthright to have access to whatever genealogical records
exist that can help trace their ancestry. The data sets exist as public
records all over the world, yet a handful of gatekeepers are getting rich
selling subscriptions for the "service" of getting at public records. That
should stop and these databases should be standardized and open to the public
for free.

------
paulgb
I've wanted for a while to see a heatmap of the prices of ice cream trucks or
hot dog vendors in Toronto in the summer.

Transit-related data would be interesting too. Where people are travelling
to/from, and how they do so.

I'm not sure what I'd do with either data set, but my sense of curiosity would
have fun with them.

~~~
mahmud
Couldn't help but have mental images of a human junk-food pacman, chasing
trucks.

------
stretchwithme
I would love to find a database that describes all structures in the body and
how they are connected to each other, including how these vary (not everybody
has the same number of muscles, for example).

It would be great if it contained average dimensions for each, along with how
each can vary.

------
stretchwithme
I'd like to see statistics on the relative success of all medical procedures.
I think you ought to be able to see exactly how successful the different
options are, including alternatives like massage, exercise and nutritional
approaches.

Being able to see how well each option works for people of varying degrees of
health would also be useful. Some options may work for the morbidly obese but
fail miserably with Olympic athletes.

Such data would be very difficult to collect but could help people and
organizations make better choices.

~~~
hga
Note that one problem here is that the success of individual surgeons is
highly correlated with how often they do a procedure. I.e. a raw rate for
surgical success won't necessarily tell you the odds of the surgeon you end up
with.

------
btmorex
I'd like complete pricing data for common purchases including both brick and
mortar and online retailers. Eventually, I'd like a web site where I could
enter my regular purchases (groceries + household goods) and it would tell me
the cheapest way to acquire them (e.g. buy this list from the local
supermarket, these items from walgreens, and order these 3 things from
amazon). Obviously, the pricing data would need to be up to date esp. with
regards to sales.
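
A minimal sketch of the "cheapest way to acquire them" step, assuming the pricing data already exists as a per-store price table. The store names, items, and prices (in cents) in the usage note are made up, and the approach ignores shipping costs and travel time, which a real service would have to weigh.

```python
def cheapest_plan(shopping_list, prices):
    """Pick the cheapest store for each item.

    `prices` maps store name -> {item: price in cents}.
    Returns (plan, total) where plan maps store -> list of items to buy there.
    """
    plan = {}
    total = 0
    for item in shopping_list:
        # Collect every store that stocks this item.
        offers = {store: stock[item] for store, stock in prices.items() if item in stock}
        if not offers:
            raise KeyError(f"no store sells {item!r}")
        store = min(offers, key=offers.get)
        plan.setdefault(store, []).append(item)
        total += offers[store]
    return plan, total
```

For example, with `prices = {"supermarket": {"milk": 199, "bread": 250}, "walgreens": {"bread": 229, "soap": 150}, "amazon": {"soap": 120, "milk": 350}}`, calling `cheapest_plan(["milk", "bread", "soap"], prices)` sends milk to the supermarket, bread to walgreens, and soap to amazon.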

~~~
ig1
<http://www.mysupermarket.co.uk/>

Does this for the four biggest supermarkets in the uk.

------
natch
Data that allows the ranking of companies in any major industry by their
adherence to best practices, especially consumer-friendly best practices.

Example would be: average hold time. But this is already somewhat available.
I'm thinking there's tons of other data that is not available. Such as: do
doctors at this clinic/medical practice send reminders when their patients are
due for checkups?

------
JangoSteve
Because of RateMyStudentRental, I'd love to see data that includes college
student grade-point averages, retention/graduation rates, and quality of off-
campus housing/landlord. My hunch is that the higher quality the housing, the
better grades and higher retention/graduation rates the students have, but
there is as of yet no dataset which includes the housing-quality metric.

~~~
djhworld
I'd imagine 'quality of housing' is directly proportional to 'cost of rent'

So I'd have thought that grades are largely irrelevant in this case, you could
have a really stupid person with rich parents etc

~~~
JangoSteve
Actually I can already tell you from my own dataset that quality of housing
has very little correlation with rent amount off-campus. An insane percentage
of great landlords undercharge significantly and don't realize it (or maybe
it's the bad landlords overcharging). Yet, because of the nature of student
housing and the level of enrollment, all housing still seems to get filled at
the same speed, whether under- or overcharging.

------
ig1
I'd like to see the twitter archive available on something like AWS. There's
lots of cool data mining you could do with it.

------
savant
Google's datasets. All of them. I mean, they exist, but not in completeness
for anyone outside google.

------
radu_floricica
Government data. Everything that is legally open, put in databases with a
standard, open interface.

------
thibaut_barrere
I'd love a HackerNews data dump. Really.

Is it available somewhere (apart from scraping) ?

~~~
paulgb
Depending on what you want to do with it, this might be what you're
looking for:
<http://www.mattmazur.com/2010/03/six-months-of-hackernews-front-page-data/>

------
photon_off
Every website's publicly accessible forms. Think Yahoo Pipes, except using
forms that already exist.

------
helwr
prescription drugs inventory, by pharmacy/city/state, updated daily

customs (imports/exports)

corporate balance sheet

hotel occupancy rates

government contracts

------
jdc
Short revisit satellite imagery for outdoor fire detection.

~~~
adoyle
<http://activefiremaps.fs.fed.us/> is a good place to start. I think ESA has
similar data but possibly not as easily available.

------
zokier
Aggregate benchmark data for computer hardware

~~~
djhworld
The gaming platform 'Steam' collects a large amount of data from people's
machines. While this isn't 'benchmark' data in terms of performance, it
collects what hardware people are using, so they have a good idea what the
'average' specification of their users' computers is.

See: <http://store.steampowered.com/hwsurvey/>

