A dataset of every Reddit comment
165 points by avinassh 9 hours ago | 52 comments
TL;DR: magnet link:

    magnet:?xt=urn:btih:7690f71ea949b868080401c749e878f98de34d3d&dn=reddit%5Fdata&tr=http%3A%2F%2Ftracker.pushshift.io%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80
Someone has already put it on Google Big Query - https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2015_05

Link to original Reddit thread - https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/






As linked below, I've played around a bit with this dataset: https://www.reddit.com/r/dataisbeautiful/comments/3cjyvb/rel...

BigQuery is the best interface for it. It can resolve queries on the entire dataset in a few seconds. (However, you only get 1TB of processing free per month; since the full dataset is ~285GB, that's only about 4 full-table queries per month. Plan your queries ahead, or practice on the May 2015 table, which is only 8GB.)

I can answer any other questions that people have.


> Since the full dataset is ~285GB, you only get 4 queries per month.

That's only true if your 4 queries need to read every single column.

One of the big advantages of BigQuery's column-oriented storage is that you only pay to read the columns that are actually needed to answer your query.

For example, this query to extract the top 10 authors only cost me 19GB to run (and took 7.0s):

  SELECT
    author,
    COUNT(*) AS COUNT
  FROM
    TABLE_QUERY([fh-bigquery:reddit_comments], "table_id CONTAINS '20' AND LENGTH(table_id)<8")
  GROUP EACH BY
    author
  ORDER BY
    COUNT DESC
  LIMIT
    10;

  author,COUNT
  [deleted],228425822
  AutoModerator,3677774
  conspirobot,575576
  ModerationLog,547671
  autowikibot,402076
  PoliticBot,388395
  imgurtranscriber,360248
  dogetipbot,358093
  qkme_transcriber,301968
  TweetPoster,293309

Good point.

Although if you're doing analysis on the body column itself, that column accounts for the majority of the dataset, of course.


Is this data in a format that allows one to recover threads? I.e., that comment X was a response to comment Y?

parent_id is a column, so you could recreate the hierarchy by joining the table to itself.
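A minimal local sketch of that self-join, using sqlite3 as a stand-in for BigQuery. The table and ids here are toy data; in the real dump, ids carry type prefixes (e.g. "t1_" for comments, "t3_" for the submission a top-level comment replies to):

```python
import sqlite3

# Toy stand-in for the dataset's comments table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id TEXT, parent_id TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [
        ("t1_a", "t3_post", "top-level comment"),
        ("t1_b", "t1_a", "reply to a"),
        ("t1_c", "t1_b", "reply to b"),
    ],
)

# Self-join: pair each comment with its parent's body.
pairs = conn.execute("""
    SELECT child.id, parent.body
    FROM comments AS child
    JOIN comments AS parent ON child.parent_id = parent.id
""").fetchall()

# Recursive CTE: walk the whole reply chain under one root comment.
thread = conn.execute("""
    WITH RECURSIVE thread(id, body, depth) AS (
        SELECT id, body, 0 FROM comments WHERE id = 't1_a'
        UNION ALL
        SELECT c.id, c.body, t.depth + 1
        FROM comments c JOIN thread t ON c.parent_id = t.id
    )
    SELECT id, depth FROM thread ORDER BY depth
""").fetchall()
```

Note that BigQuery's legacy SQL dialect has no recursive queries, so there you'd chain one self-join per level of nesting you want to recover.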

Not trying to throw cold water on this... but are the Reddit execs OK with this? I mean, Twitter and Facebook would most likely issue takedowns for similar kinds of data dumps... but is there something in the Reddit API TOS that says this is OK? I wouldn't be surprised if there's a fairly liberal license, as that would be aligned with the early spirit of the site. And an onerous TOS would likely have curbed the active user-run bot ecosystem that helps manage and monitor Reddit's discussions.

I believe reddit comments are made under a CC license.

Closest thing to an answer I've found is from the licensing page: https://www.reddit.com/wiki/licensing

But it doesn't specifically address collection and distribution of API results as a dataset.

(From the page:)

A licensing agreement is required in order to:

* use the reddit API for commercial purposes. Use of the API is considered "commercial" if you are earning money from it, including via in-app advertising or in-app purchases. Open source use is generally considered non-commercial.

* use the reddit alien logo ("snoo") in your app or for its thumbnail. Any new apps you create must be approved as well before usage. The circular "r" logo is reserved solely for use by reddit, Inc.

* allow users to subscribe to reddit gold via in-app purchases. If your platform allows for it, we encourage you to work with us to make this happen. We see gold purchases as a way for you to help reddit and to give back to the reddit community.


According to Reddit's User Agreement, regarding user content:

"You retain the rights to your copyrighted content or information that you submit to reddit ('user content') except as described below."

And the exceptions just state that Reddit has a perpetual irrevocable worldwide license.

So it seems like there's no default license and others don't have any automatic rights to use the content. Does this assessment seem correct? In practice, it may not be a big problem, particularly for academic research and such, but I'm guessing there are some uses that might cause problems.


Good find. I think I got Reddit confused with Stack Overflow or Wikipedia. If there's no clear assignment of copyright to API users, I would imagine that would be problematic for third-party app makers.

Not sure if the comments are under any license.

Beyonce's Publicist is on it!

I used this[0] query to find the top ten[1] most downvoted comments of all time on Reddit.

The most downvoted comment[2] is, ironically, in IAmA, by a mod of IAmA (ironic because of the recent drama).

I'd find the top ten most upvoted, but I ran out of free bandwidth on BigQuery :(.

[0] https://gist.github.com/alexggordon/7b56353dcf8044a7a5f9

[1] https://drive.google.com/file/d/0Bzxo-UKxFmN-eWticy1BR2tCRDQ...

[2] https://www.reddit.com/r/IAmA/comments/s5guk/iam_bad_luck_br...


Some insights:

Relationship between Reddit Comment Score and Comment Length for 1.66 Billion Comments [0] and the Github repo [1]. Reddit cliques N°2 - deeper into the subs [2]

[0] - https://www.reddit.com/r/dataisbeautiful/comments/3cjyvb/rel...

[1] - https://github.com/minimaxir/reddit-comment-length

[2] - https://www.reddit.com/r/dataisbeautiful/comments/3cofju/red...


There was also someone who created a dump of all HN data a while ago: https://github.com/sytelus/HackerNewsData

I have a GitHub repository for getting all the HN data the hard way using Python and the Algolia API, and storing them in PostgreSQL: https://github.com/minimaxir/get-all-hacker-news-submissions...

Could this be used to model the lifecycle of a Reddit post? What makes one thread get more votes and more replies than another similar thread posted at the same time? Is there a time when a comment is more likely to be voted for, based on wall-clock time or time relative to the start of the thread? What topics are more likely to receive votes than others? Is there a minimum interval after which reposts can be posted and still get upvotes? What are the most reposted things on Reddit? Why do people keep voting for reposts?

Clickables:

https://bigquery.cloud.google.com/table/fh-bigquery:reddit_c...

https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_eve...


I think it's fascinating that if anyone uses this dataset to train an AI, some tiny piece of my personality might make a contribution.

Wow, that is much, much smaller than I thought it would be.

Anyone have a recent link on Reddit's infrastructure? Given the small size, I would think it easily fits in memory, so I'm a bit curious how they handle it.


This could be amazing as input to a question answering engine.

I think Reddit was one of the places they had to wall off Watson from data-mining, because it devolved into foul-mouthed memes. I wish they hadn't; nothing would be better than a billion-dollar piece of technology deciding to reenact Sean Connery Jeopardy skits, on Jeopardy.

[2001 HAL voice]: I have a question about the penis mightier. Does it work?


It's not that hard to filter out the foul-mouthed content.

But what about phrases with double meanings (a.k.a. Fun With English):

Children make nutritious snacks


Not necessarily an issue in the grand scheme of things.

I actually built an IRC bot that did this once. It searched Reddit for your question, took the first result, and posted the top comment. It worked very well for certain kinds of questions, especially when the result came from one of the better subreddits like askscience, though it worked on Reddit in general too.

I improved on it a lot with a whitelist of subreddits and some machine learning to select the best thread. But I was only touching on what is possible with that data.

The scope of the discussions on Reddit is huge. Despite a lot of jokes in the comments here, the quality of the comments is, on average, pretty decent. And the metadata, like subreddit and score, is extremely useful for filtering it down further.
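The bot's source isn't shown here, but the selection step described above (a subreddit whitelist plus score ranking) might look something like this sketch; all names and data below are made up:

```python
# Hypothetical reconstruction of the thread-selection step: given search
# results with subreddit/score metadata, prefer threads from a whitelist
# of higher-quality subreddits, then rank by score within each tier.
WHITELIST = {"askscience", "explainlikeimfive", "AskHistorians"}

def pick_best_thread(results):
    """results: list of dicts with 'subreddit', 'score', 'top_comment'."""
    def rank(r):
        # Tuples compare lexicographically: whitelist membership first,
        # then raw score as a tiebreaker.
        return (r["subreddit"] in WHITELIST, r["score"])
    return max(results, key=rank) if results else None

results = [
    {"subreddit": "funny", "score": 9000, "top_comment": "a pun"},
    {"subreddit": "askscience", "score": 120, "top_comment": "an actual answer"},
]
best = pick_best_thread(results)
```

Here the whitelisted askscience thread wins despite its much lower score, which matches the parent's observation that subreddit metadata matters more than raw popularity.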


If it's open source, I'd love to take a look.

I'd be very interested to learn more about your project and findings. Is that bot still alive?

I turned it back on for the time being. It can be found on irc.snoonet.org in #mybots.

https://kiwiirc.com/client/irc.snoonet.org/mybots

EDIT: Reddit changed their API; give me a moment to fix it.

EDIT2: It works now!


I would love to feed my MegaHAL with that.

This should be pretty great for research. The Reddit API is actually fairly limited in my experience; I wasn't even able to get my full comment history last time I tried.

This would be a really amazing way to make factually backed statements about the nature of Reddit for news sources given the recent publicity.

I.e. the frequency of comments of a certain nature, typical karma scores for those comments, breakdown by subreddit etc.


You have to be careful when determining the nature of a comment algorithmically.

I saw one analysis of subreddits that checked for negativity. /r/PathOfExile came in as one of the most negative subreddits, which could easily be turned into a narrative about gaming culture.

To a person familiar with the context, it seems far more likely that game concepts were skewing the data. Discussions involve killing, physical damage, life leech, dying, etc. Not to mention the creature names themselves: Devourers, Plummeting Ursa, Miscreations.
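A toy illustration of that skew, using a naive word-list scorer (the word lists and example sentences here are invented for the example, not taken from any real analysis):

```python
# Naive keyword-based negativity scorer: fraction of a comment's words
# that appear in a fixed "negative" word list.
NEGATIVE = {"killing", "damage", "dying", "leech", "devourer"}

def negativity(text):
    words = [w.strip(".,") for w in text.lower().split()]
    return sum(w in NEGATIVE for w in words) / len(words)

game_talk = "life leech makes killing devourer packs easy without dying"
hostile = "this community is terrible and everyone here is awful"

# The friendly game-mechanics sentence scores as far more "negative"
# than the genuinely hostile one, because the jargon hits the word list.
assert negativity(game_talk) > negativity(hostile)
```

Which is exactly the /r/PathOfExile problem: a scorer with no notion of context reads game vocabulary as hostility.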


I really want to get my full commenting history from this, but the download is enormous. Does anyone know how I could go about this with cloud services?

Hey there! I'm the one who uploaded the original dataset. I'm creating API endpoints that will easily allow you to do this. They should be completed in 1-2 weeks.

I just realized my username on here is ... oh god.

How do I access the BigQuery data?

I get this screen:

"Welcome to BigQuery! What is BigQuery?"

but it doesn't show the data.


You'll need to set up a project in BigQuery first. (You don't have to give billing information to use the free quota.)

Got it running!

I just did a query for my Reddit handle and it took 6.5 seconds to retrieve all of my comments. Kind of a Snowden moment, but this is super interesting; it's the first time I've played with BigQuery.

Would love to run some Google API for sentiment analysis on it.


This is an awesome dump. What types of analysis could we use it for?

tspike mentioned a QA engine, which would be awesome; what else can people think of?


If anyone is working with the data, we should see how many users actually post hivemind comments vs. OC.

You can't determine what is OC from comments, only from submission titles.

But by sheer coincidence, I made a chart comparing average scores for OC submissions vs. non-OC submissions a few months ago: https://www.reddit.com/r/dataisbeautiful/comments/2rv76z/oc_...


Hey there! I know you! Posts are coming soon!

Anyone know how big it is yet?

160GB.

One bz2 file of comments per month.


The dataset on BigQuery is 265GB.

Someone should try to make a sentient Twitter bot that learns from the Reddit data.

How does one go about doing that? I know how to program, but I've never written anything which does 'intelligent' stuff.

"The Analytics Edge" from edX [1], which is running now, might interest you. Great MOOC, by the way. There's even a lecture around this idea [2].

[1] https://www.edx.org/course/analytics-edge-mitx-15-071x-0

[2] https://www.youtube.com/watch?v=JzaoJZNCVWA


I have no idea either, I would also love for someone to chime in with some first steps.

Markov chains would probably be a good start.

See reddit.com/r/SubredditSimulator (not always safe for work)
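The core idea behind SubredditSimulator-style bots is a word-level Markov chain; here's a minimal sketch (the real bots use more sophisticated tooling, and the tiny corpus below is just for illustration):

```python
import random
from collections import defaultdict

def build_chain(corpus):
    """Map each word to the list of words observed to follow it."""
    chain = defaultdict(list)
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            chain[a].append(b)
    return chain

def generate(chain, start, length=8, seed=0):
    """Random-walk the chain from a start word, one bigram at a time."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:  # dead end: no observed successor
            break
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
chain = build_chain(corpus)
text = generate(chain, "the")
```

Trained on millions of Reddit comments instead of two sentences, the same walk produces the uncanny almost-sense SubredditSimulator is known for.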


People have been trying to make sentient bots of any kind for some time now.


