

Where can I get large datasets open to the public? - helwr
http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public

======
physcab
Asking "What datasets are available to me?" is sometimes the wrong question. A
better approach is to ask something more specific, like
"How can I create a heat-map of U.S. poverty?" The latter is better because it
not only focuses your attention on something doable, it also teaches you more
about data analysis than just searching for datasets.

For example, to answer the question above, you will end up asking yourself
follow-up questions like these:

1) Where do I get a map of the U.S.?

2) How do I make a heat-map?

3) How do I feed my own data into this heat-map?

4) What colors do I use?

5) Can I do this in real time? Do I need a database? What language do I use?

6) What's a FIPS code?

7) How do I find a poverty dataset with FIPS codes?

8) This poverty dataset doesn't have FIPS codes, but I can join it with this
other dataset that does have FIPS codes.
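
Step 8 can be sketched roughly as follows. This is a minimal illustration, not a recipe for any particular dataset: the column names, the `normalize` and `join_on_county` helpers, and the poverty rates are all invented for the example. Only the idea (match on state and county name, then key everything by FIPS code) comes from the list above.

```python
import csv
import io

# Placeholder data: the poverty rates are made-up numbers, not real statistics.
POVERTY_CSV = """state,county,poverty_rate
Alabama,Autauga County,10.0
Alabama,Baldwin County,12.5
Illinois,Cook County,15.0
"""

# A second dataset that does carry FIPS codes (2 state digits + 3 county digits).
FIPS_CSV = """state,county,fips
Alabama,Autauga County,01001
Alabama,Baldwin County,01003
California,Los Angeles County,06037
"""


def normalize(name):
    """Lowercase and collapse whitespace so keys match across files."""
    return " ".join(name.lower().split())


def join_on_county(poverty_text, fips_text):
    """Return ({fips: poverty_rate}, [(state, county) rows with no FIPS match])."""
    fips_by_key = {}
    for row in csv.DictReader(io.StringIO(fips_text)):
        key = (normalize(row["state"]), normalize(row["county"]))
        fips_by_key[key] = row["fips"]  # kept as a string to preserve leading zeros

    joined, unmatched = {}, []
    for row in csv.DictReader(io.StringIO(poverty_text)):
        key = (normalize(row["state"]), normalize(row["county"]))
        if key in fips_by_key:
            joined[fips_by_key[key]] = float(row["poverty_rate"])
        else:
            unmatched.append((row["state"], row["county"]))
    return joined, unmatched


rates_by_fips, missing = join_on_county(POVERTY_CSV, FIPS_CSV)
```

Keeping the unmatched rows around is deliberate: name-based joins tend to miss a few counties because of spelling differences, and those are the rows you will want to inspect by hand.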

~~~
buddydvd
Open datasets are hard to come by. It's potentially easier to find problems to
solve by looking at the available datasets than seeking datasets for the
problems you wish to solve.

~~~
physcab
When you want to create a new website, do you start by looking at a bunch of
clip art and images? Browsing through datasets such as the ones listed in this
thread leaves me overwhelmed. I've never had any difficulty finding open
datasets. If there is a dataset I need that doesn't exist or costs money, I
find a way to create it from scratch.

~~~
spiffytech
Datasets aren't always sought with a goal in mind. Sometimes you want a
specific dataset to solve a specific problem, and sometimes you'll take any
dataset just to see what you can learn, or to make interesting infographics.
Lists like this help in the latter situation.

------
machinespit
data.gov and other US gov data sites are getting severe cuts even though
they're saving money
(<http://www.federalnewsradio.com/?nid=35&sid=2327798>)

Very upsetting for fans of open / accessible (government) data.

FWIW, petition at <http://sunlightfoundation.com/savethedata/>

~~~
tybris
Google or Amazon should offer to sponsor it and make the data accessible in
their respective cloud computing platforms. There's tons of potential for data
analysis / consultancy companies to work on this data and it's too big to
process anywhere else.

~~~
wizard_2
The major cost of the data is in gathering it, not distributing it. But I agree
this is something that needs to be archived.

------
iamelgringo
Hackers & Founders SV is hosting a hackathon[1] in two weeks at the Hacker
Dojo in Mountain View. It's going to be geared towards working with Factual's
open data API.

Factual's[2] goal is to provide an API to connect all those available data
sets, and they have a fairly impressive list of data sets available. Factual
is very interested in hearing what datasets you want to work with, and they
are willing to bust ass to get them available before the hackathon.

We still have around 40 RSVP slots open. You can register here:
<http://factualhackathon.eventbrite.com/>

</shameless plug>

[1] <http://www.hackersandfounders.com/events/16535156/>

[2] <http://www.factual.com/>


~~~
spoiledtechie
How does Factual make its money?

~~~
iamelgringo
It's free for developers, but if you want premium access, or if you're a large
corporation, then they have a paid version.

------
bigiain
<http://jacquesmattheij.com/Free%2C+Public+Data+Sets> And discussion:
<http://news.ycombinator.com/item?id=2165497>

------
bOR_
<http://www.hiv.lanl.gov/content/index>

For sentimental value: HIV sequence data (and other data) from 1980 till now.
Did my thesis on these ;-).

In general, there is an enormous amount of gene sequence data around, not just
HIV.

<http://www.ncbi.nlm.nih.gov/sites/>

Whole genome sequences of eukaryotes (including humans):
<http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi>

~~~
Anon84
Is there any HIV sequence data indexed by patient? I mean, sequences of
strains extracted from the same patient at different points in time?

I would email you directly about this, but you don't have any contact
information :(

------
shii
<http://www.reddit.com/r/datasets/>

------
svag
Previous discussions:

<http://news.ycombinator.com/item?id=2165497>
<http://news.ycombinator.com/item?id=764982>
<http://news.ycombinator.com/item?id=1024966>

------
espeed
Linked Data Sets
<http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets>

Web Services Directory
<http://www.programmableweb.com/apis/directory/1?sort=mashups>

------
raghus
Also, check out <http://aws.amazon.com/datasets>

------
drblast
Edit: Whoops, I thought this was an "Ask HN." The below post still stands for
anyone who finds it useful.

The U.S. Census has an extremely well-documented large data set:

<http://www2.census.gov/census_2000/datasets/>

And the documentation is here:

<http://www.census.gov/prod/cen2000/doc/sf1.pdf>

The software that they provide to go through the data is crappy, however (90's
era).

I have an equally crappy Common Lisp program (though one more useful to a
computer scientist) that pulls out specific fields from the data set based on
a list of field names. If you want that, I can dig it up for you.

Also, before you start parsing this, it's worthwhile to read the documentation
to find out how the files are laid out, and what each field really means.
These files are not relational databases, so if you're looking at it through
those lenses, confusion will result. In particular, some things are already
aggregated within the data set.
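
For anyone who wants the general idea without the Lisp, here is a minimal Python sketch of pulling out fields by name. It is not the program mentioned above. It assumes a comma-delimited file with no header row, and the `LAYOUT` names, the sample records, and the `extract_fields` helper are all invented for illustration; the real field names and their order have to come from the documentation.

```python
import csv
import io

# Assumed layout: column names in file order. Invented for this example;
# take the real names and order from the data set's documentation.
LAYOUT = ["FILE_ID", "STATE", "RECORD_NO", "TOTAL_POP", "URBAN_POP"]

# Made-up sample records standing in for a headerless data file.
SAMPLE = "demo,AL,0000001,100,60\ndemo,AL,0000002,250,0\n"


def extract_fields(lines, layout, wanted):
    """Yield only the wanted fields from each record, in the order requested."""
    # list.index raises ValueError if a requested name is not in the layout.
    positions = [layout.index(name) for name in wanted]
    for record in csv.reader(lines):
        yield [record[i] for i in positions]


rows = list(extract_fields(io.StringIO(SAMPLE), LAYOUT, ["RECORD_NO", "TOTAL_POP"]))
```

Values are left as strings on purpose, so identifiers with leading zeros survive. Note that this does nothing about rows that are already aggregated; summing a column blindly can double count.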

------
barefoot
How many of these allow me to create for-profit websites with them?

------
Maro
There's a startup called kaggle.com that is all about hosting data mining
competitions around datasets, like the Netflix Prize.

------
buss
<http://aws.amazon.com/publicdatasets/> includes my former advisor's dataset
(the UF sparse matrix collection), which in turn includes a matrix or two from
my research.

------
latch
I believe Steven Levitt used the Fatality Analysis Reporting System (FARS)
from the National Highway Traffic Safety Administration (NHTSA) for his
seat belts vs. car seats work:

ftp://ftp.nhtsa.dot.gov/fars/

------
nowarninglabel
At <http://build.kiva.org> there are some nice datasets in the "data
snapshots" section. I have high hopes we will be releasing a lot more data.

------
brandnewlow
On that topic, does anyone have suggestions for the easiest way to prepopulate
a directory of local businesses in the U.S.?

~~~
jbermudes
Yelp has an API that returns business data for a given geographic area. You
could probably get a list of ZIP codes from Wikipedia and then just loop
through that.
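
The loop might look something like the sketch below. It deliberately does not call any real API: `search` is a hypothetical function you would write yourself around whatever business-search service you use, and `fake_search` with its canned records exists only to make the example runnable. The one real point is that neighboring areas return overlapping results, so merge on a stable business id.

```python
def collect_businesses(zip_codes, search):
    """Call search(zip_code) for each ZIP code and merge results by business id."""
    seen = {}
    for zip_code in zip_codes:
        for business in search(zip_code):
            # Keep the first record seen for each id; later duplicates are ignored.
            seen.setdefault(business["id"], business)
    return list(seen.values())


def fake_search(zip_code):
    """Stand-in for a real API client; returns canned, made-up records."""
    canned = {
        "94041": [{"id": "a", "name": "Cafe A"}, {"id": "b", "name": "Shop B"}],
        "94043": [{"id": "b", "name": "Shop B"}, {"id": "c", "name": "Gym C"}],
    }
    return canned.get(zip_code, [])


businesses = collect_businesses(["94041", "94043", "00000"], fake_search)
```

A real version would also need rate limiting and paging, and should respect the service's terms of use.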

~~~
achompas
That wouldn't come close to a full list of local businesses, only those that
are customer-facing. Yelp has little coverage for B2B-focused businesses.

------
arethuza
UK Government data sets: <http://data.gov.uk/>

------
shafqat
We provide API access to more than 20 million articles (headlines, excerpts).
People have done all sorts of interesting things with it -
<http://platform.newscred.com>.

------
kordless
Infochimps?

------
thesuperformula
You can find many large datasets here:
<http://beta.fcc.gov/data/download-fcc-datasets>. Some are over a gigabyte.

------
plannerball
Freebase?

------
mrzerga
Microsoft Azure - they have some large datasets...

