
Free, Public Data Sets - iisbum
http://jacquesmattheij.com/Free%2C+Public+Data+Sets
======
mindcrime
If anyone is looking for more datasets, see:

<http://datasets.reddit.com>

<http://opendata.reddit.com>

and

<http://www.quora.com/Where-can-I-get-large-datasets-open-to-the-public?q=dataset>

for some good lists of available stuff.

------
bravura
get.theinfo is the best way to find data sets. They are a bunch of data
hoarders who can help you: <http://groups.google.com/group/get-theinfo/?pli=1>

I always ask there if I can't find what I'm looking for.

Here are more data sets. These are general ones. Email me if you have a
specific kind of data set in mind (e.g. web-as-corpus, spam, images, social,
reviews, etc.). I have a big file of information.

    
    
        http://theinfo.org/
        http://infochimps.org/datasets
        http://ckan.org [Comprehensive Knowledge Archive Network]
        http://www.datawrangling.com/some-datasets-available-on-the-web.html
        http://del.icio.us/pskomoroch/dataset
        http://www.reddit.com/r/datasets/
        http://news.ycombinator.com/item?id=1242029
        http://www.reddit.com/r/opendata
        http://www.trustlet.org/wiki/Repositories_of_datasets
        http://www.daniel-lemire.com/blog/data-for-data-mining/
        http://www.quantlet.org/mdbase/
        http://datamob.org/
        http://freebase.com/
        http://infochimp.info/ics/data/ripd/www-personal.umich.edu/~mejn/netdata/
        http://www.archive-it.org/public/all_collections
    
        Large:
            http://www.ckan.net/tag/read/size-large
            http://www.diggingintodata.org/Repositories/tabid/167/Default.aspx
    

Web as corpus:

    
    
        Good instructions:
            http://corpus.leeds.ac.uk/internet.html#description
        http://sslmit.unibo.it/~baroni/bootcat.html
    
        http://www.drni.de/wac-tk/index.php/Documentation
    

etc. Email me if you need more.

<http://cleaneval.sigwac.org.uk/>

<http://liste.sslmit.unibo.it/pipermail/sigwac/2007-November/000041.html>

<http://wacky.sslmit.unibo.it/doku.php?id=>

<http://clic.cimec.unitn.it/marco/research.html>

------
seancron
Here's some more links to data sets:

<http://radar.oreilly.com/2010/03/open-data-pointers.html>

<http://www.datawrangling.com/some-datasets-available-on-the-web>

<http://del.icio.us/pskomoroch/dataset>

<http://infochimps.com/collections/datamob> (and the other collections on the
site)

<http://www.data.gov/>

------
zipdog
The Wikipedia dump is great, but I've started using <http://wiki.dbpedia.org/>
which has an API to query the dumps.
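
As a rough illustration of querying DBpedia instead of parsing the dump yourself, the sketch below only builds a request URL for a SPARQL endpoint. The endpoint address `http://dbpedia.org/sparql`, the `format` parameter, the example resource and the helper name `build_sparql_url` are assumptions made for illustration, not details given in this thread, so check the DBpedia documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed public SPARQL endpoint; the thread only links to wiki.dbpedia.org.
DBPEDIA_ENDPOINT = "http://dbpedia.org/sparql"


def build_sparql_url(query, endpoint=DBPEDIA_ENDPOINT):
    """Return a GET URL that submits `query` to a SPARQL endpoint.

    The `format` parameter asking for JSON results is an assumption
    about how the endpoint is configured.
    """
    params = {
        "query": query,
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


# Hypothetical example query: fetch a few labels for one resource.
EXAMPLE_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Berlin> rdfs:label ?label .
} LIMIT 5
"""

url = build_sparql_url(EXAMPLE_QUERY)
print(url)
```

The resulting URL can be fetched with any HTTP client. Nothing here contacts the network, so the sketch says nothing about rate limits or availability.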

Thanks for these, iisbum. I wish more public data were available in DB, XML or
similar structured formats - too often I find myself scraping government sites
or PDFs to get the tables I need.

------
adw
We've got quite a lot of public economic data: <http://timetric.com/>.

If you're up to something in the economic data space we'd love to talk. Happy
to take this to email (andrew@timetric.com) if anyone's interested.

~~~
hessenwolf
I looked at the site, and I see some data but I didn't find what I would have
hoped for. I couldn't find yield curves, or historical exchange rates _up to_
today (available on the ECB site in XML format). Certainly I would have
thought yield curves were a front page item.

Things that would be very cool: 1. Financial statements in a database format.
I know you can scrape these, but I don't know whether they are available
legitimately. 2. Historical implied volatilities and historical observed
volatilities.

~~~
adw
<http://timetric.com/dataset/exchange_rates_forex_europe/> for the exchange-
rate data, at least.

~~~
hessenwolf
Okay - it's there...

Is it your site? Are you going to add yield curves?

------
gtani
<http://www.kdnuggets.com/datasets/index.html>

<http://lib.stat.cmu.edu/datasets/>

<http://datamob.org/>

------
sosuke
Heh, a day after he leaves HN he makes the first page. He will still be here
whether he visits the site or not.

------
cstuder
And I recently discovered Google Refine, for cleaning up messy datasets.

<http://code.google.com/p/google-refine/>

~~~
LiveTheDream
née Freebase GridWorks:
<http://blog.freebase.com/2010/11/10/google-refine-previously-freebase-gridworks-2-0-announced/>

------
agentultra
What about <http://ckan.org/> ?

The Comprehensive Knowledge Archive Network! Pretty sweet resource really.

~~~
djsun
The CKAN software is a platform for hosting data and metadata, but as far as I
can see, <http://ckan.org> does not actually list data sets.

~~~
pudo
try <http://ckan.net> for the data, <http://ckan.org> is for the software
behind it :)

------
dmpayton
Kinda surprised no one has mentioned Factual. I'm using some of their diabetes
data for my side-startup.

<http://www.factual.com/>

~~~
casperc
They write that most of the data is available for download. I can't find it
anywhere though, only the various APIs. Have they removed the possibility of
downloading the data?

------
steveklabnik
Don't forget Stack Overflow! <http://data.stackexchange.com/>

------
svag
There is also the IMDB database in various formats, provided by IMDB itself
here: <http://www.imdb.com/interfaces>

Edit: Although the use of this database is not free, I believe it is fine to
download and experiment with for personal use...

------
agbell
Non-Free Google data:

<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13>

This data set, contributed by Google Inc., contains English word n-grams and
their observed frequency counts.

------
hvs
Don't forget the Lahman Baseball Database with information from 1871-2010

<http://baseball1.com/statistics/>

~~~
jamwt
And, for very detailed play-by-play data for decades of games, check out
retrosheet: <http://www.retrosheet.org/game.htm>

------
balakk
<https://datamarket.azure.com/>

Some free, some paid.

------
damoncali
<http://infochimps.com> also has a bunch.

------
pwenzel
For those interested in transit data, check out the GTFS Data Exchange, a
directory of many agencies' scheduling and map data, following the Google
Transit Feed Specification.

<http://www.gtfs-data-exchange.com/>
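
For a sense of what a feed from such a directory looks like once downloaded, here is a minimal sketch that reads stop records with Python's `csv` module. The column names `stop_id`, `stop_name`, `stop_lat` and `stop_lon` are those of the GTFS `stops.txt` file; the sample rows and the `load_stops` helper are invented for illustration. A real feed is a zip archive containing several such text files.

```python
import csv
import io

# Invented sample rows in the shape of a GTFS stops.txt file.
SAMPLE_STOPS_TXT = """stop_id,stop_name,stop_lat,stop_lon
S1,Main St & 1st Ave,44.9778,-93.2650
S2,Central Station,44.9800,-93.2700
"""


def load_stops(text):
    """Map each stop_id to a (name, latitude, longitude) tuple."""
    stops = {}
    for row in csv.DictReader(io.StringIO(text)):
        stops[row["stop_id"]] = (
            row["stop_name"],
            float(row["stop_lat"]),
            float(row["stop_lon"]),
        )
    return stops


stops = load_stops(SAMPLE_STOPS_TXT)
for stop_id, (name, lat, lon) in stops.items():
    print(stop_id, name, lat, lon)
```

With a real feed, the same function could be given the decoded contents of `stops.txt` extracted from the agency's zip file.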

------
tszming
Open Directory RDF Dump: <http://rdf.dmoz.org/>

------
Perceval
For international relations data, Correlates of War hosts a number of data
sets: <http://www.correlatesofwar.org/Datasets.htm>

------
joubert
I have links to a few govt.-provided data sets at <http://elev.at>

------
jcr
United Nations stats (lots of goodies)

<http://unstats.un.org>

some free, some paid

<http://infochimps.com/>

AIS Data (Marine Traffic)

<http://www.aishub.net/>

<http://www.marinetraffic.com/ais/>

And there's a great list of sources on Quora

<http://www.quora.com/Where-can-I-get-large-datasets-open-to-the-public>

------
llimllib
<http://www.gapminder.org/data/>

------
l0nwlf
OpenStreetMap data : <http://wiki.openstreetmap.org/wiki/Planet.osm>

Geonames : <http://download.geonames.org/export/dump/>

OS Open Data (UK Specific) :
<http://www.ordnancesurvey.co.uk/oswebsite/opendata/>

------
LiveTheDream
I track datasets that I come across at
<http://www.delicious.com/tobym/dataset>

------
lkozma
<http://news.ycombinator.com/item?id=1493768>

------
nico_h
<http://www.naturalearthdata.com/> From the website : Natural Earth is a
public domain map dataset available at 1:10m, 1:50m, and 1:110 million scales
as tightly integrated vector and raster data ...

------
WildUtah
Does anybody have precinct-level election results for the USA? A set for
recent elections would be great for public access redistricting apps that will
become relevant this year.

------
toisanji
Anyone know of a dataset that has dates for when companies were registered or
announced in the news? For example, I would like to see the date Hacker News
was launched.

------
jmtame
i've had trouble finding geographical boundaries on neighborhoods in U.S.
cities (e.g. downtown areas and residential neighborhoods). anyone know where
i can find this?

~~~
kfranken
It's not exactly neighborhoods, but the US Census TIGER database has block and
blockgroup boundaries with associated demographic data. You could probably
synthesize that into "neighborhood" definitions.
<http://www.census.gov/geo/www/tiger/tgrshp2010/tgrshp2010.html>

------
eli
Some US Gov't data sites no one else mentioned:

<http://data.govloop.com/> has data and lots of pointers to local government
data.

Also I'm surprised no one mentioned Carl Malamud's site:
<http://public.resource.org/> - Lots of US gov't and legal data in friendly
formats.

------
mcauser
Heaps of useful info: <http://www.nationmaster.com>

------
random42
I'd prepared (based on other datasets) a smallish movie tweet dataset. You may
find it useful, if working with tweets and/or reviews.

<https://github.com/mohitranka/TwitterSentimentCorpora>

------
youknow
CIA World Factbook (demographics, geography, communications, government,
economy, military stats of countries):

<https://www.cia.gov/library/publications/download/>

------
_topher
Thank you all for posting links and links to links to datasets. I have an
unrelenting interest in data aggregation and machine learning, and didn't even
know where to start. So helpful, and I am no longer stuck. :)

------
fedd
Do all of them have some uniform API? That would be great, ideally: query
and cache all of them on demand from your own app without additional
programming.

bookmarked and shared this thread.

------
nivertech
I'm looking for free, public-domain, large, high-resolution imaging datasets.

Something like satellite imagery, medical imaging, semiconductor mask and
wafer photos or CAD files, etc.

Any pointers?

~~~
brainid
Here are the medical imaging datasets I am aware of.

Neuroimaging (see <http://www.nitrc.org> for others):

OASIS: <http://www.oasis-brains.org/>

ADNI: <http://adni.loni.ucla.edu/> (huge dataset, requires application)

OpenfMRI: <http://openfmri.org/>

EEG: <http://eeg.pl/epi>

Some other applications, for example CT colonography: <http://www.acrin.org/>

------
kaffeinecoma
This is a real treasure to come across. I hope we'll keep seeing jacquesm's
blog postings here.

Anyone know of any publicly available song lyric databases?

------
random42
Anyplace I can find a _small_ free web spam dataset? (For commercial use, sorry
:( )

All the datasets I found on the web are huge (double-digit GBs).

------
wladimir
Wow, useful stuff. This thread goes into my bookmarks.

