Hacker News
Ask HN: How or Where can I get data sets?
29 points by yalurker on Aug 15, 2009 | 18 comments
I need a lot of data to make my application really work. Statistics without the underlying data would help some, but I really want to be able to do my own modeling/analysis on a data set.

I've been searching the web without luck. Does anyone know of any resources for acquiring valid data sets that can be used commercially? In my case specifically, I'm looking for data related to human health, exercise and weight loss.

infochimps.org (http://infochimps.org/search?query=weight) is easy to search in general. It certainly has many health-related datasets, but I didn't see any specifically on exercise. The Census Bureau (http://www.census.gov/hhes/www/hlthins/hlthins.html) also collects some health data, though a good part of its focus is on insurance.

Most datasets on the web are going to be cross-sectional surveys of a population at a point in time, though they may be repeated over several years. It sounds like you might want a specific kind of longitudinal study, one that follows individuals over time, tracks their weight changes, and relates those changes to their personal exercise routines. As garish as it is, http://www.bettycjung.net/Phdata.htm actually has a wide variety of health data links and is probably your best bet.

Finally, a few more links from my bookmarks which look like they might help you (though I haven't looked through them to be sure): http://www.who.int/whosis/en/, http://www.lib.berkeley.edu/PUBL/stats.html, and http://phpartners.org/health_stats.html. Good luck!
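To make the distinction above between one-off population surveys and a longitudinal study concrete, here is a minimal Python sketch. The records are invented toy values for illustration only, not real measurements, and the field layout and the function name `weight_change_per_person` are assumptions, not something from any of the linked datasets.

```python
from collections import defaultdict

# Invented toy records for illustration only; not real measurements.
# Each row: (person_id, week, weight_kg, exercise_minutes_that_week)
longitudinal = [
    ("a", 0, 90.0, 0),
    ("a", 8, 86.0, 150),
    ("b", 0, 75.0, 0),
    ("b", 8, 75.5, 20),
]


def weight_change_per_person(rows):
    """Return {person_id: last_weight - first_weight}, ordered by week."""
    by_person = defaultdict(list)
    for person, week, weight, _minutes in rows:
        by_person[person].append((week, weight))

    changes = {}
    for person, observations in by_person.items():
        observations.sort()  # sort by week so first/last are well defined
        changes[person] = observations[-1][1] - observations[0][1]
    return changes


changes = weight_change_per_person(longitudinal)
```

A one-off survey would contribute only a single row per person, so the per-person difference computed here would not exist. That is why repeated measurements of the same individuals matter for this kind of question.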

Aaron Swartz (of reddit fame, and sometimes news.YC visitor) started http://theinfo.org/

Amazon hosts some large public datasets: http://aws.amazon.com/publicdatasets/

UC Irvine Machine Learning Repository: http://archive.ics.uci.edu/ml/ It's where the Netflix dataset is headed. You'll find many datasets there that have also been used in scientific papers, so you can benchmark whatever tests you want to run against published results.

The Data Wrangling blog also has a pretty good list of datasets: http://www.datawrangling.com/some-datasets-available-on-the-...

The Guardian newspaper has a lot of interesting datasets, plus a blog and a Flickr group for those interested in using the datasets for data visualisation.


Generally speaking, data analysis has to be tuned to the specific question and data you're interested in. What are you trying to do with this data? Would random numbers work just as well?

Also, I worry about the health-related part of the question. Aggregating medical datasets is serious business, not just a matter of pooling everything together. If there is a specific question of interest, either take data from a valid meta-analysis (like a Cochrane review) or stick to a single sufficiently large dataset so that you can understand the biases in its data collection.


is a site with a lot of different types of data sets, including health data. Their mission is to "liberate the world's data and make it useful so new insights can be discovered and shared." Whatever that means.

machine learning repository here: http://archive.ics.uci.edu/ml/

spider + scraper ?
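A rough sketch of the scraper half of that suggestion, using only the Python standard library. It turns an HTML table into CSV text. The sample markup, the class name `TableScraper`, and the helper `table_to_csv` are all hypothetical; a real spider would also need to fetch pages (for example with `urllib.request.urlopen`) and check that the site's terms allow the commercial use the original question asks about.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Parse HTML and return its table rows as CSV text."""
    scraper = TableScraper()
    scraper.feed(html)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(scraper.rows)
    return out.getvalue()


# Made-up page fragment standing in for a downloaded document.
SAMPLE = (
    "<table><tr><th>city</th><th>state</th></tr>"
    "<tr><td>Austin</td><td>TX</td></tr></table>"
)
csv_text = table_to_csv(SAMPLE)
```

This handles only simple, non-nested tables. Messier pages usually need per-site cleanup, which is a large part of the effort in scraping your own dataset.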

I've had this idea on my to-do list for a long time: a place for everybody to download datasets of any kind (lists of countries, states, cities, names, etc.), where anybody can upload a dataset and make money from every download, in any format: CSV, XML, JSON, YAML, etc.

Small sets would be free, since anybody can find them online anyway, but they would bring traffic. Huge sets could cost a lot of money, since it would be easier to just pay than to type the whole set into a database by hand.

For example, I was looking for a list of all prescription medicines to build a med website. Guess what: I would have paid good money to get all that info in an easily consumable format.

And so on: millions of datasets, for as many different scenarios as you can imagine.

The names of all known stars? You betcha we have them!

All movies ever filmed? Yep, just $9.95.

Is it a viable business? I have no crystal ball, but it has been a need of mine, many many times...

Well, if the semantic web dream comes true, all someone would have to do is create a SPARQL query and run it against the web to obtain the dataset.
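As a sketch of what that could look like, the Python below builds a SPARQL query for the prescription-medicine example mentioned earlier in the thread and encodes it as a GET request, which is how the standard SPARQL protocol accepts queries (in a `query` parameter). The endpoint URL and the `ex:` vocabulary are hypothetical placeholders; no real endpoint or schema is implied.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical endpoint and vocabulary: neither comes from the thread.
ENDPOINT = "https://example.org/sparql"

QUERY = """
PREFIX ex: <https://example.org/schema#>
SELECT ?name ?dose
WHERE {
  ?drug a ex:PrescriptionMedicine ;
        ex:name ?name ;
        ex:typicalDose ?dose .
}
LIMIT 100
""".strip()


def build_request_url(endpoint, query):
    """Encode the query as the `query` parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlparse(url).query)["query"][0]
```

Actually running it would mean sending `url` to an endpoint that publishes the data as RDF, which is the part of the semantic web dream that has to come true first.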

There's a company called AggData that scrapes and sells location datasets: http://aggdata.com

Geocommons Finder http://finder.geocommons.com/ has quite a lot of free location data.
