
A dataset of every Reddit comment - avinassh
TLDR; magnet link:

    magnet:?xt=urn:btih:7690f71ea949b868080401c749e878f98de34d3d&dn=reddit%5Fdata&tr=http%3A%2F%2Ftracker.pushshift.io%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80

Someone has already put it on Google BigQuery - https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2015_05

Link to original Reddit thread - https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/
======
minimaxir
As linked below, I've played around a bit with this dataset:
[https://www.reddit.com/r/dataisbeautiful/comments/3cjyvb/rel...](https://www.reddit.com/r/dataisbeautiful/comments/3cjyvb/relationship_between_reddit_comment_score_and/)

BigQuery is the best interface for it. It can resolve queries over the entire
dataset in seconds. However, you only get 1TB of processing free per month;
since the full dataset is ~285GB, that's only about 3-4 full-table queries per
month. Plan ahead by prototyping on the May 2015 table, which is only 8GB.
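The free-tier arithmetic is worth checking (numbers taken from the comment above, with 1TB counted as 1,000GB):

```python
# Free-tier budget check: BigQuery gives 1 TB of query processing per month,
# and a query that touches every column of this dataset scans ~285 GB.
FREE_GB_PER_MONTH = 1000   # 1 TB monthly free tier
FULL_SCAN_GB = 285         # approximate size of a whole-dataset scan

full_scans = FREE_GB_PER_MONTH // FULL_SCAN_GB
print(full_scans)  # 3 whole-dataset scans fit in the monthly free tier
```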

I can answer any other questions that people have.

~~~
daave
> Since the full dataset is ~285GB, you only get 4 queries per month.

That's only true if your 4 queries need to read every single column.

One of the big advantages of BigQuery's column-oriented storage is that you
only pay to read the columns that are actually needed to answer your query.

For example, this query to extract the top 10 authors only cost me 19GB to run
(and took 7.0s):

    
    
      SELECT
        author,
        COUNT(*) AS COUNT
      FROM
        TABLE_QUERY([fh-bigquery:reddit_comments], "table_id CONTAINS '20' AND LENGTH(table_id)<8")
      GROUP EACH BY
        author
      ORDER BY
        COUNT DESC
      LIMIT
        10;
    
      author,COUNT
      [deleted],228425822
      AutoModerator,3677774
      conspirobot,575576
      ModerationLog,547671
      autowikibot,402076
      PoliticBot,388395
      imgurtranscriber,360248
      dogetipbot,358093
      qkme_transcriber,301968
      TweetPoster,293309

~~~
minimaxir
Good point.

Although if you're doing analysis on the body column itself, that will scan the
majority of the dataset, of course.

------
danso
Not trying to throw cold water on this... but are the Reddit execs OK with
this? Twitter and Facebook would most likely issue takedowns for similar kinds
of data dumps. Is there something in the Reddit API TOS that says this is OK?
I wouldn't be surprised if there's a fairly liberal license, as that would be
aligned with the early spirit of the site, and an onerous TOS would likely
have curbed the active user-run bot ecosystem that helps manage and monitor
Reddit's discussions.

~~~
hackcasual
I believe reddit comments are made under a CC license

~~~
mattrepl
Closest thing to an answer I've found is from the licensing page:
[https://www.reddit.com/wiki/licensing](https://www.reddit.com/wiki/licensing)

But it doesn't specifically address collection and distribution of API results
as a dataset.

(From the page:)

A licensing agreement is required in order to:

* use the reddit API for commercial purposes. Use of the API is considered "commercial" if you are earning money from it, including via in-app advertising or in-app purchases. Open source use is generally considered non-commercial.

* use the reddit alien logo ("snoo") in your app or for its thumbnail. Any new apps you create must be approved as well before usage. The circular "r" logo is reserved solely for use by reddit, Inc.

* allow users to subscribe to reddit gold via in-app purchases. If your platform allows for it, we encourage you to work with us to make this happen. We see gold purchases as a way for you to help reddit and to give back to the reddit community.

~~~
danellis
> use the reddit alien logo ("snoo") in your app or for its thumbnail

Do any of the apps actually do this, though? (Especially the free ones.)

~~~
anastasds
UReddit does, but it may not be a great example because I received permission
to use the logo in a private conversation with hueypriest years ago (as long
as I didn't try to monetize the project).

------
alexggordon
I used this[0] query to find the top ten[1] most downvoted comments of all
time on Reddit.

The most downvoted comment[2] is in IAmA, by a mod of IAmA (ironic, given the
recent drama).

I'd find the top ten most upvoted, but I ran out of free bandwidth on BigQuery
:(.

[0]
[https://gist.github.com/alexggordon/7b56353dcf8044a7a5f9](https://gist.github.com/alexggordon/7b56353dcf8044a7a5f9)

[1] [https://drive.google.com/file/d/0Bzxo-UKxFmN-eWticy1BR2tCRDQ...](https://drive.google.com/file/d/0Bzxo-UKxFmN-eWticy1BR2tCRDQ/view?usp=sharing)

[2]
[https://www.reddit.com/r/IAmA/comments/s5guk/iam_bad_luck_br...](https://www.reddit.com/r/IAmA/comments/s5guk/iam_bad_luck_brian_ama/c4b8m3u)

~~~
etblg
Here's the top ten most upvoted using that query with desc instead of asc:

[http://pastebin.com/Zvg6mdjZ](http://pastebin.com/Zvg6mdjZ)

~~~
sova
I don't even want to spoil any bit of the surprises that await you, dear
readers. Thank you for BigQuerying this

------
avinassh
Some insights:

Relationship between Reddit Comment Score and Comment Length for 1.66 Billion
Comments [0] and the Github repo [1]. Reddit cliques N°2 - deeper into the
subs [2]

[0] -
[https://www.reddit.com/r/dataisbeautiful/comments/3cjyvb/rel...](https://www.reddit.com/r/dataisbeautiful/comments/3cjyvb/relationship_between_reddit_comment_score_and/)

[1] - [https://github.com/minimaxir/reddit-comment-length](https://github.com/minimaxir/reddit-comment-length)

[2] -
[https://www.reddit.com/r/dataisbeautiful/comments/3cofju/red...](https://www.reddit.com/r/dataisbeautiful/comments/3cofju/reddit_cliques_n2_deeper_into_the_subs_oc/)

------
avinassh
Clickables:

[https://bigquery.cloud.google.com/table/fh-bigquery:reddit_c...](https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2015_05)

[https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_eve...](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/)

------
jesrui
Someone also created a dump of all HN data a while ago:
[https://github.com/sytelus/HackerNewsData](https://github.com/sytelus/HackerNewsData)

~~~
minimaxir
I have a GitHub repository for getting all the HN data the hard way using
Python and the Algolia API, and storing them in PostgreSQL:
[https://github.com/minimaxir/get-all-hacker-news-submissions...](https://github.com/minimaxir/get-all-hacker-news-submissions-comments)

------
ers35
I wanted the data in a form I could easily query, so I wrote a program to
convert the JSON to an SQLite database:
[https://gist.github.com/ers35/3b615a75fa0ed5e6d5cc](https://gist.github.com/ers35/3b615a75fa0ed5e6d5cc)

I have the program running on Amazon EC2 right now converting the whole
dataset. I plan to upload the database to the Internet Archive when it
completes.
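The conversion described above can be sketched in a few lines of Python. The field names (`id`, `author`, `subreddit`, `body`, `score`) are real keys in the reddit comment dump's newline-delimited JSON, but the schema here is an illustrative subset, not ers35's actual program:

```python
import json
import sqlite3

def json_lines_to_sqlite(jsonl_path, db_path):
    """Convert a newline-delimited JSON comment dump into a SQLite table.

    Keeps only a handful of columns; the real dump has many more fields.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments "
        "(id TEXT PRIMARY KEY, author TEXT, subreddit TEXT, body TEXT, score INTEGER)"
    )
    with open(jsonl_path, encoding="utf-8") as f:
        rows = (
            (c["id"], c["author"], c["subreddit"], c["body"], c["score"])
            for c in (json.loads(line) for line in f if line.strip())
        )
        conn.executemany(
            "INSERT OR REPLACE INTO comments VALUES (?, ?, ?, ?, ?)", rows
        )
    conn.commit()
    return conn
```

The generator pipeline keeps memory flat regardless of dump size, which matters at ~1.66 billion rows.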

~~~
ers35
Here it is:
[https://archive.org/details/2015_reddit_comments_corpus_sqli...](https://archive.org/details/2015_reddit_comments_corpus_sqlite)

------
nathan_f77
I think it's fascinating that if anyone uses this dataset to train an AI, some
tiny piece of my personality might make a contribution.

~~~
kristopolous
It's going in my /raid/datasets with about 1.4TB of other comment data. So the
answer is going to be "yes" ... well probably by Tuesday morning it will be.

I've used older reddit dumps before ... it's not great data for most things,
but there's an awful lot of it!

~~~
zo1
What are you going to use it for? If you don't mind letting us in on some
details, however vague.

~~~
kristopolous
sure. [http://frustrometer.com](http://frustrometer.com)

~~~
zo1
Interesting, thanks!

------
whoopdedo
Could this be used to model the lifecycle of a Reddit post? What makes one
thread get more votes and more replies than a similar thread posted at the
same time? Is there a time when a comment is more likely to be voted for,
either by wall clock or relative to the start of the thread? Which topics are
more likely to receive votes than others? Is there a minimum interval after
which reposts can still get upvotes? What are the most reposted things on
Reddit? Why do people keep voting for reposts?
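One of those questions, whether a comment's score depends on when it lands relative to the thread's start, could be probed with a two-pass bucketing job over the dump. The field names (`link_id`, `created_utc`, `score`) are real fields in the comment data, but the analysis itself is only a sketch:

```python
from collections import defaultdict

def score_by_thread_age(comments, bucket_minutes=30):
    """Average comment score, bucketed by minutes since the thread's first comment.

    `comments` is a list of dicts with `link_id`, `created_utc`, `score`
    (it is iterated twice, so materialize a generator first).
    """
    # Pass 1: earliest comment timestamp per thread, as a proxy for thread start.
    first_seen = {}
    for c in comments:
        t = first_seen.get(c["link_id"])
        if t is None or c["created_utc"] < t:
            first_seen[c["link_id"]] = c["created_utc"]

    # Pass 2: accumulate score sums and counts per age bucket.
    totals = defaultdict(lambda: [0, 0])  # bucket -> [score sum, count]
    for c in comments:
        age_min = (c["created_utc"] - first_seen[c["link_id"]]) / 60
        bucket = int(age_min // bucket_minutes)
        totals[bucket][0] += c["score"]
        totals[bucket][1] += 1
    return {b: s / n for b, (s, n) in totals.items()}
```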

------
mbell
Wow, that is much, much smaller than I thought it would be.

Anyone have a recent link on Reddit's infrastructure? Given the small size I
would think it easily fits in memory so I'm a bit curious how they handle it.

~~~
sdrothrock
This is the newest (2010) one I know of, and probably the one everyone's seen
(I think it was on HN back when it was first posted):
[http://highscalability.com/blog/2010/5/17/7-lessons-learned-...](http://highscalability.com/blog/2010/5/17/7-lessons-learned-while-building-reddit-to-270-million-page.html)

------
akhilcacharya
This should be pretty great for research - the Reddit API is actually fairly
limited in my experience, I wasn't even able to get my full comment history
last time I tried.

------
tspike
This could be amazing as input to a question answering engine.

~~~
Houshalter
I actually built an IRC bot that did this once. It searched reddit for your
question, took the first result, and posted the top comment. It worked very
well for certain kinds of questions. Especially if it came from the better
subreddits like askscience, but even just reddit in general.

I improved on it a lot with a whitelist of subreddits and some machine
learning to select the best thread. But I was only touching on what is
possible with that data.

The scope of the discussions on reddit is huge. Despite a lot of jokes in the
comments here, the quality of the comments on average, is pretty decent. And
the metadata like subreddit and score are extremely useful for filtering it
down more.
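The retrieve-and-rank pipeline described above can be sketched as a pure ranking step. The subreddit whitelist and the scoring heuristic here (prefer whitelisted subreddits, then take the top-scored comment) are illustrative assumptions, not the actual bot's code:

```python
def best_answer(threads, whitelist=frozenset({"askscience", "explainlikeimfive"})):
    """Pick an answer from search results.

    Each thread is a dict with `subreddit` and `comments`, where each comment
    is a dict with `body` and `score`. Returns the body of the top comment in
    the best thread, or None if nothing matched.
    """
    def thread_key(t):
        # Whitelisted subreddits win ties; otherwise prefer the highest-scored comment.
        top = max((c["score"] for c in t["comments"]), default=0)
        return (t["subreddit"].lower() in whitelist, top)

    candidates = [t for t in threads if t["comments"]]
    if not candidates:
        return None
    best_thread = max(candidates, key=thread_key)
    return max(best_thread["comments"], key=lambda c: c["score"])["body"]
```

The retrieval step (searching reddit for the question) is omitted; this is just the selection logic that the subreddit metadata and scores make possible.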

~~~
Houshalter
Sample of conversation with it. Not cherry picked, just showing some of the
good and the bad:

[https://i.imgur.com/LDD9isL.png?1](https://i.imgur.com/LDD9isL.png?1)

~~~
avinassh
Awesome! How does one get started programming 'intelligent' things? I know how
to program, but I haven't coded anything 'intelligent'. I'd really love to
know which things/concepts I should learn to build a bot like yours. Thanks!

------
visarga
I grep-ed my own comments in it and it only shows 39 of the thousands I've
made. Is it really complete?

------
Houshalter
I really want to get my full commenting history from this, but the download is
enormous. Anyone know how I could go about this with cloud services?
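Short of cloud services, the dump can also be stream-filtered locally in one pass with constant memory. This sketch assumes the dump's bz2-compressed newline-delimited JSON format and its `author` field:

```python
import bz2
import json

def extract_author(dump_path, author, out_path):
    """Stream a bz2-compressed newline-delimited JSON dump and write out
    only the given author's comments. One pass, constant memory.
    """
    count = 0
    with bz2.open(dump_path, "rt", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Cheap substring pre-filter before the expensive JSON parse.
            if author not in line:
                continue
            comment = json.loads(line)
            if comment.get("author") == author:
                dst.write(line)
                count += 1
    return count
```

The substring check skips the JSON parse for the vast majority of lines, which is what makes a single-machine pass over ~285GB tolerable.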

~~~
stuck_in_the_ma
Hey there! I'm the one that uploaded the original dataset. I'm creating API
endpoints that will easily allow you to do this. It should be completed in 1-2
weeks.

~~~
stuck_in_the_ma
I just realized my username on here is ... oh god.

------
TTPrograms
This would be a really amazing way to make factually backed statements about
the nature of Reddit for news sources given the recent publicity.

I.e. the frequency of comments of a certain nature, typical karma scores for
those comments, breakdown by subreddit etc.

~~~
Lerc
You have to be careful determining the nature of a certain comment
algorithmically.

I saw one analysis of subreddits that checked for negativity. /r/PathOfExile
came in as one of the most negative subreddits, which could easily be turned
into a narrative about gaming culture.

To a person familiar with the context, it seems far more likely that game
concepts were skewing the data. Discussions involve killing, physical damage,
life leech, dying, and so on. Not to mention the creature names themselves:
Devourers, Plummeting Ursa, Miscreations, ...
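This failure mode is easy to reproduce with a naive lexicon-based sentiment scorer. The tiny negative-word list below is invented for illustration:

```python
# A toy negative-word lexicon (illustrative, not a real sentiment resource).
NEGATIVE_WORDS = {"kill", "killing", "damage", "dying", "die", "death", "leech"}

def naive_negativity(text):
    """Fraction of tokens that appear in the negative-word lexicon."""
    tokens = [w.strip(".,!?").lower() for w in text.split()]
    if not tokens:
        return 0.0
    return sum(t in NEGATIVE_WORDS for t in tokens) / len(tokens)

# A perfectly friendly gameplay tip scores far more "negative" than actual hostility:
tip = "Killing Devourers is easy if your life leech outpaces their damage."
rant = "This community is toxic and awful."
```

Here the helpful Path of Exile tip trips three lexicon hits while the genuinely hostile sentence trips none, which is exactly the skew described above.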

------
linkmotif
Does anyone know if this includes the spam? This dataset is obviously very
interesting in its own right without spam, but far more useful for me would be
all of it, including the spam.

EDIT: regardless, does anyone know of similar datasets that do include spam?

~~~
jedberg
If they grabbed it via the API then no, it doesn't. It also won't include a
lot of older stuff which is no longer accessible through the API.

------
iaw
This is an awesome dump, what types of analysis could we use this for?

tspike mentioned a QA engine which would be awesome, what else can people
think of?

------
roflchoppa
If anyone is working with the data, we should see how many users just post
hivemind comments vs. OC.

~~~
minimaxir
You can't determine what is OC from comments, only from submission titles.

But by sheer coincidence, I have made a chart comparing average scores for OC
submissions vs. non-OC submissions a few months ago:
[https://www.reddit.com/r/dataisbeautiful/comments/2rv76z/oc_...](https://www.reddit.com/r/dataisbeautiful/comments/2rv76z/oc_reddit_submissions_receive_almost_double_the/)

~~~
stuck_in_the_ma
Hey there! I know you! Posts are coming soon!

~~~
minimaxir
Good to hear! :)

------
fit2rule
Anyone know how big it is yet?

~~~
minimaxir
Dataset on BigQuery is 265GB.

~~~
fit2rule
I wasn't able to get access to it on BigQuery .. the progress bar just sat and
sat .. can you confirm that it's usable on BigQuery? I'd sure love to play
with this data.

------
joelthelion
Does someone have the equivalent for reddit links (with associated title)?

------
colordrops
Does this data include private subreddits?

------
thomasfromcdnjs
Someone should try to make a sentient Twitter bot that learns from the Reddit
data.

~~~
avinassh
How does one go about doing that? I know how to program, but I haven't written
anything that does 'intelligent' stuff.

~~~
thomasfromcdnjs
I have no idea either, I would also love for someone to chime in with some
first steps.

~~~
Ianvdl
Markov chains would probably be a good start.

See reddit.com/r/subredditsimulator (not always safe for work)
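A minimal word-level Markov chain is only a few lines. This sketch shows the general technique, not SubredditSimulator's actual code:

```python
import random
from collections import defaultdict

def train(corpus_lines):
    """Map each word to the list of words observed to follow it (a first-order chain)."""
    chain = defaultdict(list)
    for line in corpus_lines:
        words = line.split()
        for a, b in zip(words, words[1:]):
            chain[a].append(b)
    return chain

def generate(chain, start, max_words=20, rng=random):
    """Random-walk the chain from `start` until a dead end or the length cap."""
    words = [start]
    while len(words) < max_words and chain[words[-1]]:
        words.append(rng.choice(chain[words[-1]]))
    return " ".join(words)
```

Training on comment bodies from a single subreddit is what gives the output its recognizable "voice"; duplicated successor words in the lists naturally weight the transitions by frequency.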

------
avinashgupta
It's good.

------
curiousjorge
how to access the BigQuery data?

I get this screen

Welcome to BigQuery! What is BigQuery?

but it doesn't show the data.

~~~
minimaxir
You'll need to set up a project in BigQuery first. (you don't have to give
billing information to use the free quota)

~~~
curiousjorge
got it running!

I just did a query for my reddit handle and it took 6.5 seconds to retrieve
all of my comments. Kind of a Snowden moment, but this is super interesting;
it's the first time I've played with BigQuery.

Would love to run some Google API for sentiment analysis on it.

------
justwannasing
Since 80% of all Reddit comments are vitriol and nonsense from 15-year-old
kids, why would anyone do this?

~~~
ska
How does complaining about the SNR on reddit in a way that decreases the SNR
here help anyone?

