
Data sets released by Google - supo
http://svonava.com/post/62186512058/datasets-released-by-google
======
aidanf
If you want to play around with data, here's another good list of open/free
datasets:
[http://bitly.com/bundles/hmason/1](http://bitly.com/bundles/hmason/1)

~~~
supo
Nice, thanks! I wish there were a cleaned-up repository of datasets like these,
in a unified format, directly accessible to a public MapReduce engine like
Elastic MapReduce on AWS.

------
gtani
Here are some other data hubs and search engines, with endless lists:

[http://datahub.io/](http://datahub.io/)

[http://blog.bigml.com/2013/02/28/data-data-data-thousands-of...](http://blog.bigml.com/2013/02/28/data-data-data-thousands-of-public-data-sources/)

[http://tm.durusau.net/?p=39312](http://tm.durusau.net/?p=39312)

[http://dvn.iq.harvard.edu/dvn/](http://dvn.iq.harvard.edu/dvn/)

_____________

This subreddit seems like a decent place to ask questions:

[http://www.reddit.com/r/datasets](http://www.reddit.com/r/datasets)

------
imurray
Another one from Google, 1000 scanned books for OCR and other scanned document
processing research:
[http://commondatastorage.googleapis.com/books/icdar2007/READ...](http://commondatastorage.googleapis.com/books/icdar2007/README.txt)

------
thangalin
[http://bitly.com/bundles/hmason/1](http://bitly.com/bundles/hmason/1)

[http://commondatastorage.googleapis.com/books/icdar2007/READ...](http://commondatastorage.googleapis.com/books/icdar2007/README.txt)

[http://datahub.io/](http://datahub.io/)

[http://blog.bigml.com/2013/02/28/data-data-data-thousands-of...](http://blog.bigml.com/2013/02/28/data-data-data-thousands-of-public-data-sources/)

[http://iatiregistry.org/](http://iatiregistry.org/)

[http://open.undp.org/](http://open.undp.org/)

[http://data.worldbank.org/](http://data.worldbank.org/)

[https://explore.data.gov/catalog/raw/](https://explore.data.gov/catalog/raw/)

[http://www.data.gov/opendatasites](http://www.data.gov/opendatasites)

[http://data.gov.be/datasets](http://data.gov.be/datasets)

[http://opencorporates.com/](http://opencorporates.com/)

[http://glasspockets.org/work/reportingcommitment/api.html](http://glasspockets.org/work/reportingcommitment/api.html)

[http://thedata.harvard.edu/dvn/](http://thedata.harvard.edu/dvn/)

[http://www.reddit.com/r/datasets](http://www.reddit.com/r/datasets)

[http://archive.ics.uci.edu/ml/](http://archive.ics.uci.edu/ml/)

[http://cleandatahub.org/](http://cleandatahub.org/)

[http://datacatalogs.org/](http://datacatalogs.org/)

[http://archive.org/details/oxford-2005-facebook-matrix](http://archive.org/details/oxford-2005-facebook-matrix)

------
X4
BitTorrent, please! Why does it cost so much? They grabbed our data for free
and they have enough free bandwidth. Even if we assume they are greedy, they
could at least offer it through BitTorrent. Shipping DVDs for that amount of
data is ridiculous. I don't even have a DVD reader…

_I can't afford to buy all that plus shipping to Europe, but I would like to
play with the data for my NLP project._

~~~
skun
I agree! I can't afford it either, but I would really love to play around with
that data because I'm just beginning to learn about NLP. I also feel that it
shouldn't have been priced, and shouldn't be distributed on DVD!

------
agibsonccc
Here's another good one.

[http://archive.ics.uci.edu/ml/](http://archive.ics.uci.edu/ml/)

------
avidas
Here is a good one: [http://cleandatahub.org/](http://cleandatahub.org/). They
are trying to aggregate cleaned data sets from across the web.

------
PaulHoule
no links...

Remember the days when people used to make links on the web because they
weren't greedy with their PageRank?

At least Google left us some machine learning data sets after they took all
the links. You just can't find them because nobody links to them.

~~~
supo
I'm sorry for not making it more obvious, but each bullet point in the list
ends with a link.

------
ChikkaChiChi
Fantastic links throughout this thread.

When playing with new programming languages, instead of a 'todo' list I always
end up building an XKCD password generator. Interestingly enough, I've never
found a word frequency/comprehension list worth using to populate it for
public consumption.
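
For anyone who hasn't built one, here is a minimal sketch of the idea in
Python. The function name `make_passphrase` and the tiny `SAMPLE_WORDS` list
are placeholders made up for illustration; a usable generator would load a few
thousand common words from whatever frequency list you trust.

```python
import secrets


def make_passphrase(words, count=4, separator=" "):
    """Join `count` words chosen uniformly at random from `words`.

    Uses the `secrets` module so the choice comes from a
    cryptographically secure random source.
    """
    if not words:
        raise ValueError("word list must not be empty")
    return separator.join(secrets.choice(words) for _ in range(count))


# Placeholder word list (an assumption for illustration only).
# Eight words is far too few for real use.
SAMPLE_WORDS = [
    "correct", "horse", "battery", "staple",
    "orange", "window", "planet", "silver",
]

if __name__ == "__main__":
    print(make_passphrase(SAMPLE_WORDS))
```

The strength comes entirely from the size of the word list: each word adds
log2(len(words)) bits, which is why a good frequency list matters so much.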

------
option_greek
Is there any data set that embodies human relationships with everyday objects?

------
ma2rten
Also:

[https://code.google.com/p/word2vec/#Pre-trained_entity_vecto...](https://code.google.com/p/word2vec/#Pre-trained_entity_vectors_with_Freebase_naming)

------
kineticfocus
The ML competition site Kaggle should also get a mention here.
[http://www.kaggle.com/competitions](http://www.kaggle.com/competitions)

------
chatman
Where is the Web1T dataset? Would you not consider it useful for Machine
Learning?

~~~
cardine
I think this list only includes free datasets.

