

Ask HN: How or Where can I get data sets? - yalurker

I need a lot of data to make my application really work.  Statistics without the underlying data would help some, but I really want to be able to do my own modeling/analysis on a data set.

I've been searching the web without luck.  Does anyone know of any resources for acquiring valid data sets that can be used commercially?  In my case specifically, I'm looking for data related to human health, exercise and weight loss.
======
lotze
infochimps.org (<http://infochimps.org/search?query=weight>) is an easy search
in general--it certainly has many health-related datasets, but I didn't see
any specifically on exercise. The census bureau
(<http://www.census.gov/hhes/www/hlthins/hlthins.html>) also collects some
health data, though a good part of its focus is on insurance.

Most datasets on the web are going to be cross-sectional surveys: snapshots of
a population at one point in time, though they may span several years. It
sounds like you might want a
specific kind of longitudinal study following individuals over time and
tracking their individual weight changes and relating that to their personal
exercise routines. As garish as it is, <http://www.bettycjung.net/Phdata.htm>
actually has a wide variety of health data links and is probably your best
bet.
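
To make the snapshot-survey versus longitudinal distinction above concrete, here is a minimal Python sketch. The records, field layout, and function names are all made up for illustration; they don't come from any of the linked datasets.

```python
from collections import defaultdict

# Hypothetical records: (person_id, year, weight_kg). All values are invented.
records = [
    ("a", 2007, 90.0), ("a", 2008, 86.0), ("a", 2009, 84.0),
    ("b", 2007, 70.0), ("b", 2008, 71.0), ("b", 2009, 73.0),
]

def snapshot_mean(rows, year):
    """Snapshot view: average weight of everyone observed in one year."""
    weights = [w for _, y, w in rows if y == year]
    return sum(weights) / len(weights)

def longitudinal_change(rows):
    """Longitudinal view: each person's change from first to last observation."""
    by_person = defaultdict(list)
    for pid, year, weight in rows:
        by_person[pid].append((year, weight))
    changes = {}
    for pid, observations in by_person.items():
        observations.sort()  # order each person's observations by year
        changes[pid] = observations[-1][1] - observations[0][1]
    return changes

print(snapshot_mean(records, 2007))
print(longitudinal_change(records))
```

The snapshot mean only describes the population in a given year. The per-person changes need the same individuals tracked across years, which is what a longitudinal dataset provides.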

Finally, a few more links from my bookmarks which look like they might help
you (though I haven't looked through them to be sure):
<http://www.who.int/whosis/en/>,
<http://www.lib.berkeley.edu/PUBL/stats.html>, and
<http://phpartners.org/health_stats.html>. Good luck!

------
andreyf
Aaron Swartz (of reddit fame, and sometimes news.YC visitor) started
<http://theinfo.org/>

------
DenisM
Amazon has large datasets: <http://aws.amazon.com/publicdatasets/>

------
radu_floricica
<http://www.freebase.com/> ?

------
physcab
UC Irvine Machine Learning Repository: <http://archive.ics.uci.edu/ml/>. It's
where the Netflix dataset is headed. You'll find many datasets there that have
been used in scientific papers as well, so you can benchmark whatever tests
you want to run.

The Data Wrangling blog also has a pretty good list of datasets:
<http://www.datawrangling.com/some-datasets-available-on-the-web>

------
psyklic
<http://data.gov/>

------
paulsb
The Guardian newspaper has a lot of interesting datasets, plus a blog and
Flickr group for those interested in using the datasets for data
visualisation.

<http://www.guardian.co.uk/data-store>

------
Estragon
Generally speaking, data analysis has to be tuned to the specific question and
data you're interested in. What are you trying to do with this data? Would
random numbers work just as well?

~~~
frossie
Also, I worry about the health-related question part - aggregating data sets
of medical data is a serious business; it's not just a case of pooling it all
together. If there is really a specific question of interest, either gather
data from a valid meta-analysis (like a Cochrane review) or stick to a single
sufficiently-large dataset so that you can understand the biases in the data
collection.
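
One well-known way naive pooling can mislead is Simpson's paradox, where a trend seen in every individual dataset reverses once the datasets are lumped together. This may not be the exact bias meant above; it's just a small Python sketch of the general problem. The counts and the study/group names below are invented purely for illustration and are not real medical data.

```python
# Hypothetical counts: (successes, participants) per study and group.
# Invented numbers, chosen only to show how pooling can flip a result.
studies = {
    "study_1": {"exercise": (81, 87), "control": (234, 270)},
    "study_2": {"exercise": (192, 263), "control": (55, 80)},
}

def rate(successes, total):
    """Success rate within one group of one study."""
    return successes / total

def pooled_rate(data, group):
    """Success rate after naively summing counts across all studies."""
    successes = sum(data[name][group][0] for name in data)
    total = sum(data[name][group][1] for name in data)
    return successes / total

for name, groups in studies.items():
    print(name, {g: round(rate(*counts), 3) for g, counts in groups.items()})

print("pooled", {g: round(pooled_rate(studies, g), 3)
                 for g in ("exercise", "control")})
```

Within each made-up study the exercise group does better, but the pooled totals suggest the opposite, because the two studies put very different numbers of participants into each group.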

------
EastSmith
Ask this question here: <http://groups.google.com/group/get-theinfo/topics>

------
RK
<http://www.swivel.com/>

is a site with a lot of different types of data sets, including health data.
Their mission is to "liberate the world's data and make it useful so new
insights can be discovered and shared." Whatever that means.

------
swapspace
<http://delicious.com/pskomoroch/dataset>

------
ssn
See <http://kevinchai.net/datasets/>

------
pjonesdotca
machine learning repository here: <http://archive.ics.uci.edu/ml/>

------
jacquesm
spider + scraper ?
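
A rough sketch of the scraper half, assuming the data you want sits in a plain HTML table. It uses only Python's standard-library `html.parser`. The `TableScraper` class, `scrape_table` helper, and `SAMPLE` markup are made up for illustration, and real pages are usually messier than this.

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

def scrape_table(html):
    """Return table contents as a list of rows, each a list of cell strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows

# Hypothetical markup standing in for a fetched page.
SAMPLE = """
<table>
  <tr><th>day</th><th>minutes</th></tr>
  <tr><td>mon</td><td>30</td></tr>
  <tr><td>tue</td><td>45</td></tr>
</table>
"""

print(scrape_table(SAMPLE))
```

The spider half would fetch pages (for example with `urllib.request.urlopen`) and feed each one to `scrape_table`. Since the original question is about commercial use, check each site's terms and robots.txt before crawling it.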

------
TweedHeads
I've had this idea on my to-do list for a long time: a place where everybody
can download datasets of any kind (lists of countries, states, cities, names,
etc.), and where anyone can upload a dataset and make money from every
download, in any format: csv, xml, json, yaml, etc.

Small sets will be free, as anybody can find them anywhere online, but they
will bring traffic. Huge sets can cost lots of money, since it would be easier
to just pay instead of typing the whole set by hand in a database.

For example, I was looking for a list of all prescription medicines to build a
med website. Guess what: I would have paid good money to get all that info in
an easily consumable format.

And like that, millions of datasets for as many different scenarios as you can
imagine.

The names of all known stars? You betcha we have it!

All movies ever filmed? Yep, just $9.95

Is it a viable business? I have no crystal ball, but it has been a need of
mine, many many times...

~~~
wicknicks
Well, if the semantic web dream comes true, all someone would have to do is
create a SPARQL query and run it against the web to obtain the dataset.
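
For a feel of what that might look like, here is a hedged Python sketch that builds a SPARQL query and the GET URL a SPARQL endpoint would accept (the SPARQL Protocol passes the query text in a `query` parameter). The endpoint `http://example.org/sparql` and the `PrescriptionDrug` class URI are placeholders, not a real service or vocabulary.

```python
from urllib.parse import urlencode

# Hypothetical query: list labels of things typed as a (made-up) drug class.
QUERY = """
SELECT ?name WHERE {
  ?drug a <http://example.org/schema/PrescriptionDrug> ;
        <http://www.w3.org/2000/01/rdf-schema#label> ?name .
}
LIMIT 10
"""

def build_request_url(endpoint, query):
    """URL-encode the query into the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query.strip()})

# Placeholder endpoint; swap in a real SPARQL endpoint to actually run it.
url = build_request_url("http://example.org/sparql", QUERY)
print(url)
```

Nothing is fetched here; the sketch only shows the shape of the request. Whether any endpoint holds the data you want is, of course, the hard part.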

