A dataset of every Reddit comment
294 points by avinassh on July 11, 2015 | 91 comments
TLDR; magnet link:

Someone has already put it on Google BigQuery - https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2015_05

Link to original Reddit thread - https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/

As linked below, I've played around a bit with this dataset: https://www.reddit.com/r/dataisbeautiful/comments/3cjyvb/rel...

BigQuery is the best interface for it. It can resolve queries on the entire dataset in a few seconds. However, you only get 1TB of processing free per month. Since the full dataset is ~285GB, you only get 4 queries per month. Plan ahead on the May 2015 dataset, which is only 8GB.

I can answer any other questions that people have.

> Since the full dataset is ~285GB, you only get 4 queries per month.

That's only true if your 4 queries need to read every single column.

One of the big advantages of BigQuery's column-oriented storage is that you only pay to read the columns that are actually needed to answer your query.

For example, this query to extract the top 10 authors only cost me 19GB to run (and took 7.0s):

    TABLE_QUERY([fh-bigquery:reddit_comments], "table_id CONTAINS '20' AND LENGTH(table_id)<8")
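To put rough numbers on the column-pruning point, here is a back-of-the-envelope sketch in Python. It assumes the free tier is 1 TB of scanned data per month, taken as 1024 GB, and uses only the two scan sizes quoted in this thread (~285 GB for every column, 19 GB for the top-authors query). The function name is made up for illustration.

```python
# Assumption: the free quota is 1 TB of scanned data per month, taken here as 1024 GB.
FREE_QUOTA_GB = 1024


def queries_per_month(scanned_gb, quota_gb=FREE_QUOTA_GB):
    """Return how many queries of the given scan size fit into the monthly quota."""
    if scanned_gb <= 0:
        raise ValueError("scanned_gb must be positive")
    return quota_gb / scanned_gb


if __name__ == "__main__":
    # Scan sizes quoted in the thread: the full dataset vs. the top-authors query.
    print(f"reading every column (~285 GB): {queries_per_month(285):.1f} queries")
    print(f"top-authors query (19 GB): {queries_per_month(19):.1f} queries")
```

So a query that touches only a narrow column can be repeated dozens of times within the quota, while a full scan fits only a handful of times.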


Good point.

Although if you're doing analysis on the body column itself, the query will read the majority of the dataset, of course.

I want all the url submissions in a given subreddit, but all I can find in the tables is "link_id". How do I map link_ids to urls?

There is no submission data yet.

/u/Stuck_in_the_Matrix commented on another thread that he's working on it.

Is this data in a format that allows recovering threads, i.e., that comment X was a response to comment Y?

parent_id is a column, so you could recreate the hierarchy by joining the table to itself.
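A minimal sketch of that self-join, using Python's built-in sqlite3 on a toy in-memory table rather than BigQuery. It assumes each comment has a `name` column holding its full identifier (for example `t1_a`) and that `parent_id` uses the same form. That column layout and the sample rows are assumptions for illustration, not something confirmed in this thread.

```python
import sqlite3

# Made-up sample rows: (name, parent_id, body). "t3_" marks a submission, "t1_" a comment.
SAMPLE = [
    ("t1_a", "t3_post", "top-level comment"),
    ("t1_b", "t1_a", "reply to a"),
    ("t1_c", "t1_b", "reply to b"),
]


def reply_pairs(rows):
    """Return (child, parent) pairs by joining the comments table to itself."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (name TEXT, parent_id TEXT, body TEXT)")
    conn.executemany("INSERT INTO comments VALUES (?, ?, ?)", rows)
    pairs = conn.execute(
        """
        SELECT child.name, parent.name
        FROM comments AS child
        JOIN comments AS parent ON child.parent_id = parent.name
        ORDER BY child.name
        """
    ).fetchall()
    conn.close()
    return pairs


if __name__ == "__main__":
    for child, parent in reply_pairs(SAMPLE):
        print(f"{child} is a reply to {parent}")
```

Top-level comments drop out of the inner join because their parent is a submission, which is not in the comments table.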

Not trying to throw cold water on this...but are the Reddit execs OK with this? I mean, Twitter and Facebook would most likely issue takedowns for similar kinds of data dumps...but is there something in the Reddit API TOS that says this is OK?...I wouldn't be surprised if there is a fairly liberal license, as that would be aligned with the early spirit of the site. And an onerous TOS would have likely curbed the active user-run bot ecosystem that helps manage and monitor Reddit's discussions.

I believe reddit comments are made under a CC license

Closest thing to an answer I've found is from the licensing page: https://www.reddit.com/wiki/licensing

But it doesn't specifically address collection and distribution of API results as a dataset.

(From the page:)

A licensing agreement is required in order to:

* use the reddit API for commercial purposes. Use of the API is considered "commercial" if you are earning money from it, including via in-app advertising or in-app purchases. Open source use is generally considered non-commercial.

* use the reddit alien logo ("snoo") in your app or for its thumbnail. Any new apps you create must be approved as well before usage. The circular "r" logo is reserved solely for use by reddit, Inc.

* allow users to subscribe to reddit gold via in-app purchases. If your platform allows for it, we encourage you to work with us to make this happen. We see gold purchases as a way for you to help reddit and to give back to the reddit community.

> use the reddit alien logo ("snoo") in your app or for its thumbnail

Do any of the apps actually do this, though? (Especially the free ones.)

UReddit does, but it may not be a great example because I received permission to use the logo in a private conversation with hueypriest years ago (as long as I didn't try to monetize the project).

According to the user content section of Reddit's User Agreement:

"You retain the rights to your copyrighted content or information that you submit to reddit ('user content') except as described below."

And the exceptions just state that Reddit has a perpetual irrevocable worldwide license.

So it seems like there's no default license and others don't have any automatic rights to use the content. Does this assessment seem correct? In practice, it may not be a big problem, particularly for academic research and such, but I'm guessing there are some uses that might cause problems.

Good find, I think I got reddit confused with stackoverflow or wikipedia. If there's no clear assignment of copyright to api users, I would imagine that would be problematic for 3rd party app makers.

Isn't this true only if the 3rd party apps are saving the data and/or redistributing it? Apps should be able to request data from reddit just as a normal user would through their browser, since reddit has permission to use (and serve) that content.

Nit: I don't think copyright assignment is the correct term here. That refers to transferring the ownership of the root rights to the material, but you're just referring to granting specific rights under a license.

Not sure if the comments are under any license.

I used this[0] query to find the top ten[1] most downvoted comments of all time on Reddit.

The most downvoted comment[2] is ironically in IAmA, by a mod of IAmA (ironic because of the recent drama).

I'd find the top ten most upvoted, but I ran out of free bandwidth on BigQuery :(.

[0] https://gist.github.com/alexggordon/7b56353dcf8044a7a5f9

[1] https://drive.google.com/file/d/0Bzxo-UKxFmN-eWticy1BR2tCRDQ...

[2] https://www.reddit.com/r/IAmA/comments/s5guk/iam_bad_luck_br...
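The gist isn't reproduced here, so the following is only a guess at the general shape of such a query, written against a toy sqlite3 table in Python instead of BigQuery. The column names and sample rows are made up. The only idea it shows is that sorting by score ascending gives the most downvoted comments and sorting descending gives the most upvoted.

```python
import sqlite3

# Made-up sample rows: (id, subreddit, score).
SAMPLE = [
    ("a", "IAmA", -500),
    ("b", "pics", 1200),
    ("c", "funny", 3),
    ("d", "news", -40),
]


def extreme_comments(rows, limit=10, direction="ASC"):
    """Return up to `limit` comments ordered by score; ASC = most downvoted first."""
    if direction not in ("ASC", "DESC"):
        raise ValueError("direction must be 'ASC' or 'DESC'")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (id TEXT, subreddit TEXT, score INTEGER)")
    conn.executemany("INSERT INTO comments VALUES (?, ?, ?)", rows)
    # The direction is validated above, so it is safe to place it in the SQL text.
    result = conn.execute(
        f"SELECT id, subreddit, score FROM comments ORDER BY score {direction}, id LIMIT ?",
        (limit,),
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print("most downvoted:", extreme_comments(SAMPLE, limit=2))
    print("most upvoted:", extreme_comments(SAMPLE, limit=2, direction="DESC"))
```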

Here's the top ten most upvoted using that query with desc instead of asc:


I don't even want to spoil any bit of the surprises that await you, dear readers. Thank you for BigQuerying this

Some insights:

Relationship between Reddit Comment Score and Comment Length for 1.66 Billion Comments [0] and the Github repo [1]. Reddit cliques N°2 - deeper into the subs [2]

[0] - https://www.reddit.com/r/dataisbeautiful/comments/3cjyvb/rel...

[1] - https://github.com/minimaxir/reddit-comment-length

[2] - https://www.reddit.com/r/dataisbeautiful/comments/3cofju/red...

There was also someone that created a dump of all HN data a while ago: https://github.com/sytelus/HackerNewsData

I have a GitHub repository for getting all the HN data the hard way using Python and the Algolia API, and storing them in PostgreSQL: https://github.com/minimaxir/get-all-hacker-news-submissions...

I wanted the data in a form I could easily query, so I wrote a program to convert the JSON to an SQLite database: https://gist.github.com/ers35/3b615a75fa0ed5e6d5cc

I have the program running on Amazon EC2 right now converting the whole dataset. I plan to upload the database to the Internet Archive when it completes.
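For anyone who wants to try the same idea on a small sample first, here is a rough Python sketch of a JSON-to-SQLite loader. It is not the linked program. It assumes the dump has one JSON object per line and that each object carries the fields listed in `FIELDS`; both are assumptions to check against the real files.

```python
import json
import sqlite3

# Assumption: these field names exist in each comment object; adjust to the real dump.
FIELDS = ("id", "author", "subreddit", "parent_id", "score", "body")


def load_comments(lines, conn):
    """Insert one comment per non-empty JSON line; return the number of rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments "
        "(id TEXT PRIMARY KEY, author TEXT, subreddit TEXT, "
        "parent_id TEXT, score INTEGER, body TEXT)"
    )
    count = 0
    with conn:  # commit once at the end instead of once per row
        for line in lines:
            line = line.strip()
            if not line:
                continue
            comment = json.loads(line)
            conn.execute(
                "INSERT OR REPLACE INTO comments VALUES (?, ?, ?, ?, ?, ?)",
                tuple(comment.get(field) for field in FIELDS),
            )
            count += 1
    return count


if __name__ == "__main__":
    sample = [
        json.dumps({"id": "c1", "author": "alice", "subreddit": "datasets",
                    "parent_id": "t3_x", "score": 5, "body": "hello"}),
    ]
    connection = sqlite3.connect(":memory:")
    print(load_comments(sample, connection), "row(s) loaded")
```

Wrapping the inserts in a single transaction matters at this scale, since committing after every row is far slower.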

Yes, please do that!

I think it's fascinating that if anyone uses this dataset to train an AI, some tiny piece of my personality might make a contribution.

You might enjoy http://reddit.com/r/subredditsimulator, a markov chain powered simulation of Reddit, fueled by all the different subreddits.

It's going in my /raid/datasets with about 1.4TB of other comment data. So the answer is going to be "yes" ... well probably by Tuesday morning it will be.

I've used older reddit dumps before ... it's not great data for most things, but there's an awful lot of it!

What are you going to use it for? If you don't mind letting us in on some details, however vague, I'd find them interesting.

Interesting, thanks!

Could this be used to model the lifecycle of a Reddit post? What makes one thread get more votes and be replied to more often than another similar thread posted at the same time? Is there a time when a comment is more likely to be voted for, based on wall clock time or relative to the start of the thread? What topics are more likely to receive votes than others? Is there a minimum time after which a repost can be posted and still get upvotes? What are the most reposted things on Reddit? Why do people keep voting for reposts?

Wow, that is much, much, smaller than I thought it would be.

Anyone have a recent link on Reddit's infrastructure? Given the small size I would think it easily fits in memory so I'm a bit curious how they handle it.

This is the newest one I know of (2010), and probably the one everyone's seen (I think it was on HN back when it was first posted): http://highscalability.com/blog/2010/5/17/7-lessons-learned-...

This should be pretty great for research. The Reddit API is actually fairly limited in my experience; I wasn't even able to get my full comment history last time I tried.

This could be amazing as input to a question answering engine.

I actually built an IRC bot that did this once. It searched reddit for your question, took the first result, and posted the top comment. It worked very well for certain kinds of questions, especially when the answer came from the better subreddits like askscience, but even with reddit in general.

I improved on it a lot with a whitelist of subreddits and some machine learning to select the best thread. But I was only touching on what is possible with that data.

The scope of the discussions on reddit is huge. Despite a lot of jokes in the comments here, the quality of the comments, on average, is pretty decent. And metadata like subreddit and score is extremely useful for filtering it down more.

Sample of conversation with it. Not cherry picked, just showing some of the good and the bad:


Awesome! How does one get started programming 'intelligent' things? I know how to program, but I haven't coded anything that is 'intelligent'. I would really love to know what things/concepts I should learn to build a bot like yours. Thanks!

This is awesome, I love the "current drama on reddit" answer..

I'd be very interested to learn more about your project and findings. Is that bot still alive?

I turned it back on for the time being. It can be found at irc.snoonet.org at #mybots


EDIT: Reddit changed their api, give me a moment to fix it.

EDIT2: It works now!

AMAbot - is that the one?

Yes. It automatically replies to anything you say in the chat on that channel.

If it's open source, I'd love to take a look.

I pastebined the code. It's terrible. I should rewrite it. But here it is: http://pastebin.com/CM9u17jq

Nice, I see you use neural networks, can you explain a bit how you are training them?

It searches reddit for whatever the user queries. Reddit returns up to 100 threads. I then choose a thread and take the top comment.

The neural network predicts which thread is most likely to produce a satisfying answer. The main features are number of n-gram matches with the question, the score, the number of comments, and some other metadata.

It's far from optimal but it does improve it a bit.
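To make the feature description a bit more concrete, here is a small Python sketch of how such features might be computed for one candidate thread. It is not the bot's code from the pastebin. The tokenizer, the choice of unigrams plus bigrams, and the `title`, `score`, and `num_comments` keys on the thread dict are all assumptions for illustration, and the neural network itself is left out.

```python
import re


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(words, n):
    """Return the set of n-grams (as tuples) found in a list of words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_matches(question, title, max_n=2):
    """Count distinct n-grams (n = 1..max_n) shared by the question and the title."""
    q_words, t_words = tokens(question), tokens(title)
    return sum(
        len(ngrams(q_words, n) & ngrams(t_words, n)) for n in range(1, max_n + 1)
    )


def thread_features(question, thread):
    """Build a feature vector: [n-gram matches, thread score, number of comments]."""
    return [
        ngram_matches(question, thread["title"]),
        thread["score"],
        thread["num_comments"],
    ]


if __name__ == "__main__":
    # Hypothetical thread, shaped like a simplified search result.
    candidate = {"title": "How do magnets actually work?", "score": 120, "num_comments": 45}
    print(thread_features("how do magnets work", candidate))
```

A vector like this, computed for each of the returned threads, is what a model would score in order to pick the thread to answer from.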

I think Reddit was one of the places they had to wall off Watson from data-mining, because it devolved into foul-mouthed memes. I wish they hadn't, nothing would be better than a billion dollar piece of technology deciding to reenact Sean Connery Jeopardy skits, on Jeopardy.

[2001 HAL voice]: I have a question about the penis mightier. Does it work?

It's not that hard to filter out the foul-mouthed content.

But what about phrases with double meanings (aka: Fun With English):

Children make nutritious snacks

Not necessarily an issue in the grand scheme of things.

I would love to feed my megahal with that.

I grepped for my own comments in it and it only shows 39 out of thousands. Is it really complete?

I really want to get my full commenting history from this, but the download is enormous. Anyone know how I could go about this with cloud services?

Hey there! I'm the one that uploaded the original dataset. I'm creating API endpoints that will easily allow you to do this. It should be completed in 1-2 weeks.

I just realized my username on here is ... oh god.


here it is, only selected the body of your comments though, out of 4005 comments... http://s000.tinyupload.com/?file_id=09238386475544637092

here is the query that I used:

    author = 'Houshalter'

this is where you can run another query: https://bigquery.cloud.google.com/table/fh-bigquery:reddit_c...

I did the same for my account and some comments are definitely missing... The first one that I cross-referenced is missing (5 month old and 1 point comment). But I have other comments with only 1 point and also 5 months old and those are there.

Thank you so much! I'd like to try it myself, but your link redirects me to some getting started page.

Any idea why some comments are missing? I haven't checked to see if they are all there, but as far as I know they are.

I know that any comments in closed subreddits, or comments that were removed by mods, might be missing. But if you can access the comment's permalink without being signed in, then it should be in your data.

No problem. I believe that you need to create a project first, for the page to work.

You were right and that comment has been deleted for some reason.

This would be a really amazing way to make factually backed statements about the nature of Reddit for news sources given the recent publicity.

I.e. the frequency of comments of a certain nature, typical karma scores for those comments, breakdown by subreddit etc.

You have to be careful determining the nature of a certain comment algorithmically.

I saw one analysis of subreddits that checked for negativity. /r/PathOfExile came in as one of the most negative subreddits, which could easily be turned into a narrative about gaming culture.

To a person familiar with the context it seems far more likely that game concepts were skewing the data. Discussions involved killing, physical damage, life leech, dying, etc. Not to mention the creature names themselves: Devourers, Plummeting Ursa, Miscreations.
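A toy Python illustration of how that skew can happen with a naive word-list approach. The lexicon below is made up for the example and is not taken from any real sentiment tool or from the analysis mentioned above.

```python
import re

# Hypothetical toy lexicon, invented for this example.
NEGATIVE_WORDS = {"kill", "killing", "die", "dying", "damage", "hate", "awful"}


def negativity(text):
    """Return the fraction of words in the text that appear in the negative lexicon."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in NEGATIVE_WORDS)
    return hits / len(words)


if __name__ == "__main__":
    game_talk = "this build is great for killing bosses with physical damage without dying"
    small_talk = "thanks, that was really helpful"
    print("game comment:", negativity(game_talk))
    print("friendly comment:", negativity(small_talk))
```

The game comment is enthusiastic, but a quarter of its words hit the lexicon, so a word-count scorer would rank it as far more negative than the friendly one.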

Does anyone know if this includes the spam? This dataset is obviously very interesting in its own right without spam, but far more useful for me would be all of it, including the spam.

EDIT: regardless, does anyone know of similar datasets with spam?

If they grabbed it via the API then no, it doesn't. It also won't include a lot of older stuff which is no longer accessible through the API.

This is an awesome dump, what types of analysis could we use this for?

tspike mentioned a QA engine which would be awesome, what else can people think of?

If anyone is working with the data, we should see how many users actually hivemind comment on things vs OC

You can't determine what is OC from comments, only from submission titles.

But by sheer coincidence, I have made a chart comparing average scores for OC submissions vs. non-OC submissions a few months ago: https://www.reddit.com/r/dataisbeautiful/comments/2rv76z/oc_...

Hey there! I know you! Posts are coming soon!

Good to hear! :)

Anyone know how big it is yet?


One bz2 file of comments per month.

Dataset on BigQuery is 265GB.
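If you have one of the monthly bz2 files locally, you can stream it without decompressing it to disk first. The Python sketch below assumes each line of the decompressed file is one JSON comment object with an `author` field; that layout and the file name in the usage comment are assumptions, so check them against the actual dump.

```python
import bz2
import json


def comments_by_author(path, author):
    """Yield every comment object in a bz2-compressed JSON-lines file by `author`."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            comment = json.loads(line)
            if comment.get("author") == author:
                yield comment


# Usage (hypothetical file name):
#     for comment in comments_by_author("RC_2015-05.bz2", "some_username"):
#         print(comment.get("body"))
```

Because it is a generator, memory use stays flat no matter how large the month is; the cost is one full pass over the file per lookup.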

I wasn't able to get access to it on BigQuery... the progress bar just sat and sat. Can you confirm that it's usable on BigQuery? I'd sure love to play with this data.

Does someone have the equivalent for reddit links (with associated title)?

Does this data include private subreddits?

Someone should try to make a sentient Twitter bot that learns from the Reddit data.

People have been trying to make sentient bots of any kind for some time now.

I'd be extremely impressed with a somewhat general AI at the insect or nematode level, let alone human.

How does one go about doing that? I know how to program, but I haven't written anything that does 'intelligent' stuff.

"The Analytics Edge" from edx [1], which is running now, could interest you. Great MOOC by the way. There is even a lecture around this idea [2].

[1] https://www.edx.org/course/analytics-edge-mitx-15-071x-0

[2] https://www.youtube.com/watch?v=JzaoJZNCVWA

I have no idea either, I would also love for someone to chime in with some first steps.

Markov chains would probably be a good start.

See reddit.com/r/subredditsimulator (not always safe for work)
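As a first step, a word-level Markov chain fits in a few lines of Python. This is a generic sketch, not how subredditsimulator is implemented: it records which word follows which in some training text, then walks those transitions at random.

```python
import random
from collections import defaultdict


def build_chain(texts):
    """Map each word to the list of words that followed it in the training texts."""
    chain = defaultdict(list)
    for text in texts:
        words = text.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return dict(chain)


def generate(chain, start, max_words=20, rng=None):
    """Walk the chain from `start`, picking a random recorded follower each step."""
    rng = rng or random.Random()
    word = start
    output = [word]
    # Stop at the length limit or when the current word was never followed by anything.
    while len(output) < max_words and word in chain:
        word = rng.choice(chain[word])
        output.append(word)
    return " ".join(output)


if __name__ == "__main__":
    comments = ["the cat sat", "the dog sat down"]
    print(generate(build_chain(comments), "the"))
```

Training it on comment bodies from a single subreddit, and keying on pairs of words instead of single words, are the usual next steps toward output that reads more like the source.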

its good

how to access the BigQuery data?

I get this screen

Welcome to BigQuery! What is BigQuery?

but it doesn't show the data.

You'll need to set up a project in BigQuery first. (you don't have to give billing information to use the free quota)

got it running!

I just did a query for my reddit handle and it took 6.5 seconds to retrieve all of my comments. Kind of a Snowden moment, but this is super interesting. It's the first time I've played with BigQuery.

Would love to run some google API for sentiment analysis.

Since 80% of all Reddit comments are vitriol or nonsense by 15-year-old kids, why would anyone do this?

How does complaining about the SNR on reddit in a way that decreases the SNR here help anyone?
