
Interesting Data Sets for Statistics - aficionado
http://rs.io/2014/05/29/list-of-data-sets.html
======
mikecb
In related news, the Global Database of Events, Language, and Tone (GDELT) is
now available in BigQuery, for free. [1]

[1] [http://googlecloudplatform.blogspot.com/2014/05/worlds-large...](http://googlecloudplatform.blogspot.com/2014/05/worlds-largest-event-dataset-now-publicly-available-in-google-bigquery.html)

------
kevinwang
The Reddit data isn't actually the top 2.5 million posts - it's the top 1000
posts of each of the top 2500 subreddits. An important distinction to make if
anyone's planning to do statistical analyses on the set.
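
To make the distinction concrete, here is a minimal sketch comparing the two sampling schemes. The toy posts, the subreddit names "big" and "small", and both helper functions are hypothetical and are not taken from the actual Reddit dump.

```python
from collections import defaultdict


def top_k_per_group(posts, k):
    """Keep the k highest-scoring posts within each subreddit."""
    by_sub = defaultdict(list)
    for sub, score in posts:
        by_sub[sub].append((sub, score))
    kept = []
    for sub_posts in by_sub.values():
        sub_posts.sort(key=lambda p: p[1], reverse=True)
        kept.extend(sub_posts[:k])
    return kept


def top_n_overall(posts, n):
    """Keep the n highest-scoring posts regardless of subreddit."""
    return sorted(posts, key=lambda p: p[1], reverse=True)[:n]


# Hypothetical toy data: (subreddit, score) pairs.
posts = [
    ("big", 900), ("big", 800), ("big", 700),
    ("small", 30), ("small", 20), ("small", 10),
]

per_group = top_k_per_group(posts, k=2)  # two posts from each subreddit
overall = top_n_overall(posts, n=4)      # dominated by the large subreddit
```

Both samples contain four posts, but the per-subreddit sample gives low-traffic subreddits equal weight, so score distributions computed from it will not match those of a global top-N sample.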

------
sytelus
Surprisingly, it doesn't mention HN itself, which is a treasure trove of data. I
know there are APIs to download HN content, but is there a permanent location
for an HN data dump (the way Stack Overflow publishes its data dump on the
Internet Archive)? This is a great article, BTW, for anyone who wants some cool
projects to do in data mining and machine learning.

~~~
j2kun
I currently have a subset of one year's worth of all HN stories and comment
trees, organized by story, but it's on my local machine. Where is a good place
to post it? It's quite big, on the order of multiple GB.

The problem (if you want an easy scraper) is that the HN API limits you to 1k
requests per hour. So it took me about 10 days of continuous running and
restarting because of random crashes to get all the data.
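
For a rough sense of scale, here is a back-of-the-envelope sketch of what that limit implies, plus a simple throttle. The 1,000 requests/hour figure comes from the comment above. The `Throttle` class and the helper names are hypothetical and not part of any HN API client.

```python
import time

REQUESTS_PER_HOUR = 1000  # limit quoted in the comment above


def min_interval_seconds(limit_per_hour=REQUESTS_PER_HOUR):
    """Smallest gap between requests that stays within the hourly limit."""
    return 3600.0 / limit_per_hour


def max_requests(days, limit_per_hour=REQUESTS_PER_HOUR):
    """Upper bound on requests possible in `days` of continuous running."""
    return days * 24 * limit_per_hour


class Throttle:
    """Hypothetical helper: spaces calls at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self._interval = interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request made yet

    def wait(self):
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._interval


# At 1,000 requests/hour, one request every 3.6 seconds; ten days of
# continuous running allows at most 240,000 requests.
interval = min_interval_seconds()
ten_day_budget = max_requests(10)
```

Calling `Throttle(interval).wait()` before each request would keep a scraper under the limit. Crash recovery (checkpointing which stories are already saved) is a separate concern this sketch does not cover.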

------
privong
Looks like an interesting list of datasets, but it's such a large number that
it's tough to get a feel for what all is in it (without reading a lot of the
entries). I wonder if some sort of organized table might be a way to present
the information in a more skimmable fashion.

------
Hortinstein
I think it will be really interesting in a few years when people start doing
in-depth analysis of the Bitcoin blockchain (though some is going on today).
If Bitcoin hits mainstream adoption, it may be the first time anyone can run
an analysis on a complete financial system, and that's not even including the
applications built on top of the blockchain.

~~~
Hortinstein
Follow-up: this is the kind of stuff I was alluding to:

[http://www.technologyreview.com/view/527906/data-mining-reve...](http://www.technologyreview.com/view/527906/data-mining-reveals-the-factors-driving-the-price-of-bitcoins/)

------
mxfh
For starters, and for people who miss their R sample datasets, there is a
well-maintained archive of 731 of them available as CSV at
[http://vincentarelbundock.github.io/Rdatasets/](http://vincentarelbundock.github.io/Rdatasets/)

Index:
[http://vincentarelbundock.github.io/Rdatasets/datasets.html](http://vincentarelbundock.github.io/Rdatasets/datasets.html)

Github:
[https://github.com/vincentarelbundock/Rdatasets](https://github.com/vincentarelbundock/Rdatasets)
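
A minimal sketch of pulling one of these CSVs with only the standard library. The `csv/<package>/<item>.csv` path layout is an assumption about how the archive is organized (check the index page for the real links). The helper names and the inline sample rows are hypothetical.

```python
import csv
import io
import urllib.request

BASE = "http://vincentarelbundock.github.io/Rdatasets"


def csv_url(package, item):
    """Build a dataset URL, ASSUMING the layout <BASE>/csv/<package>/<item>.csv."""
    return f"{BASE}/csv/{package}/{item}.csv"


def parse_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_rows(package, item):
    """Download and parse one dataset (needs network access; not called here)."""
    with urllib.request.urlopen(csv_url(package, item)) as resp:
        return parse_csv(resp.read().decode("utf-8"))


# Hypothetical inline sample so the parsing step can be tried offline.
sample = "name,mpg,cyl\ncar_a,21.0,6\ncar_b,22.8,4\n"
rows = parse_csv(sample)
```

With network access, something like `fetch_rows("datasets", "mtcars")` should return one dict per row, with all values as strings.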

------
sytelus
So someone put 2.5 million Reddit posts on GitHub. I was thinking about doing
the same for the HN data I've downloaded (1.3 million stories ~ 1.7GB of JSON).

Does GitHub have any restrictions on hosting data files like this?

~~~
jmpe
Their guidelines suggest using Dropbox for that:

[https://help.github.com/articles/what-is-my-disk-quota](https://help.github.com/articles/what-is-my-disk-quota)

[https://help.github.com/articles/working-with-large-files](https://help.github.com/articles/working-with-large-files)

------
shobhitverma
I have always wanted to create a data science training course that picks the
right dataset to show off the power of each technique in question. I think this
list will give me a good start. Thank you!

~~~
armenarmen
I would be way interested in your course!

------
ejain
This looks like a good resource. But be sure to understand all the implicit
assumptions in each data set before announcing your amazing discoveries!

------
joshu
Some stuff I hadn't seen before.

------
atestu
This is a gold mine, thank you!

------
toxik
Good article!

