
Full Reddit Submission Corpus now available for 2006 thru August 2015 - voltagex_
https://www.reddit.com/r/datasets/comments/3mg812/full_reddit_submission_corpus_now_available_2006/
======
Houshalter
This is not as cool as the comment dataset, but it's still pretty cool. I
think in the future AIs will make extensive use of these kinds of datasets.
It's amazing how much useful information is in there.

Last year I made a simple IRC question answering bot that just searched reddit
for your question and returned the top comment of the top result. It worked
surprisingly well. It's amazing how many questions someone else has asked on
reddit before. And the comment quality is usually pretty good. Sample
conversation with it:
[https://i.imgur.com/LDD9isL.jpg](https://i.imgur.com/LDD9isL.jpg)

I improved on it a lot with a whitelist of subreddits and some machine
learning to select the best thread. But I was only touching on what is
possible with that data.

You can play with it here:
[https://kiwiirc.com/client/irc.snoonet.org/mybots](https://kiwiirc.com/client/irc.snoonet.org/mybots)

EDIT: It only works well with certain kinds of questions: things that are
likely to have been asked before and that don't have too many unique keywords.
I see some people ask it really unique questions, or talk to it like a regular
chatbot without even asking questions, and then get frustrated when it returns
nonsense. There are a lot of improvements that could be made with natural
language processing and such, but right now it's pretty simple.
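The approach described above (search Reddit for the question, then take the top comment of the top result) can be sketched in Python. This is not the actual bot's code — the helper names and JSON shapes here are assumptions based on Reddit's public API:

```python
import urllib.parse

def search_url(question):
    """Build a Reddit search URL (JSON endpoint), sorted by relevance."""
    params = {"q": question, "sort": "relevance", "limit": 1}
    return "https://www.reddit.com/search.json?" + urllib.parse.urlencode(params)

def top_comment(comment_listing):
    """Given a parsed /comments/<id>.json response (a two-element listing:
    the submission, then its comment tree), return the top comment's body."""
    children = comment_listing[1]["data"]["children"]
    for child in children:
        if child["kind"] == "t1":  # "t1" = comment; skips "more" stubs
            return child["data"]["body"]
    return None
```

The actual HTTP fetches (the search URL, then the top result's comments page) are left out; any HTTP client works, and Reddit's API asks for a descriptive User-Agent.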

~~~
nbadg
Do you have any of the source on Github? I'd be very interested in taking a
look and potentially using it as an excuse to get playing with a natural
language toolkit (in all my spare time haha)

~~~
Houshalter
Yes
[https://github.com/Houshalter/amabot](https://github.com/Houshalter/amabot)

~~~
khakimov
Could you share
"C:\Users\Daniel\Documents\Programming\AMAbotData.txt" as well, or is it
not necessary to run the bot?

~~~
nbadg
Presumably that's the large dataset he was referring to.

~~~
Houshalter
Ok, I added it and removed some of the stuff that was configured specifically
for me. It _should_ work; please report any and all issues. I'm not sure if
JavaScript supports local file paths, so you might need to edit it.

------
TorKlingberg
I wonder if an analysis of this and the comment corpus could quantifiably show
shifts in the opinions of the Reddit "hivemind" since the start. Has the
Reddit userbase turned more conservative over the years, or is that just my
impression?

~~~
creshal
There's sure as hell a lot more fear-mongering since the refugee "crisis"
started.

~~~
scintill76
Why quotes on crisis? I'm not challenging it, because I also find it weird
that overnight the phrase started popping up in the news daily. And what is
the fear-mongering going on at Reddit?

~~~
creshal
> And what is the fear-mongering going on at Reddit?

The usual:

• The Arabs will outbreed us! (Before or after the Turks will outbreed us? Or
was it the Somalis? I lost track.)

• The refugees will cripple our economy! (153 million EU citizens are on tax-
funded pensions or unemployment benefits. Please tell me how a million
refugees is going to make a dent in that.)

~~~
theorique
They don't need to outbreed us - they are just sending us their displaced
population, costing €700K per head in social services for "integration". And
of those refugees "integrated" at vast expense, how many are already ISIS
operatives? How many will be radicalized and return with full bellies and full
bank accounts to fight for jihad in Syria and Iraq?

Europe is doing the wrong thing.

------
minimaxir
Last year, I did a statistical analysis with the same data and can confirm
that the data is robust: [http://minimaxir.com/2014/12/reddit-
statistics/](http://minimaxir.com/2014/12/reddit-statistics/)

The Comment data set is already on BigQuery, allowing for quick analysis
without having to download that corpus (example:
[https://www.reddit.com/r/bigquery/comments/3kfnmq/reddit_sub...](https://www.reddit.com/r/bigquery/comments/3kfnmq/reddit_subreddits_dataset_900000_subreddits/)
)

Once the Submission dataset is also on BigQuery, I'll write a blog post with
more info on how to use it.
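As a sketch of what such a quick analysis looks like, here is a legacy-SQL query against one month of the comment corpus. The `fh-bigquery` dataset and the per-month table naming are assumptions based on the linked /r/bigquery post — check there for the canonical names:

```python
def comments_table(year, month):
    """Table ID for one month of the comment corpus on BigQuery
    (the naming convention here is an assumption)."""
    return "fh-bigquery:reddit_comments.{:04d}_{:02d}".format(year, month)

# Top 10 subreddits by comment volume for May 2015 (BigQuery legacy SQL).
QUERY = """
SELECT subreddit, COUNT(*) AS n_comments
FROM [{table}]
GROUP BY subreddit
ORDER BY n_comments DESC
LIMIT 10
""".format(table=comments_table(2015, 5))
```

The same query pasted into the BigQuery web console runs in seconds, which is the whole appeal over downloading the corpus.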

------
runlevel1
Submissions Corpus Magnet Link:

    magnet:?xt=urn:btih:9941b4485203c7838c3e688189dc069b7af59f2e&dn=RS%5Ffull%5Fcorpus.bz2&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969

Comments Corpus Magnet Link:

    magnet:?xt=urn:btih:32916ad30ce4c90ee4c47a95bd0075e44ac15dd2&dn=RC%5F2015-01.bz2&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Fopen.demonii.com%3A1337&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969

Note: The comment corpus was built a few months ago (2015-01), so the two
datasets don't overlap completely.

------
lysium
I'm not quite sure how I feel about it. Yes, that data was publicly available,
but it 'feels' different that there is now an unalterable snapshot available
to the public. No more comment deletion.

~~~
grey-area
There's also a backup copy in Utah with all your comments from every site tied
to common selectors like your usernames, emails, passwords, ip, credit cards
etc. No doubt that'll leak at some point too.

So comment deletion was never going to save you.

~~~
tdkl
But we cannot download that one.

This is actually what has been bothering me for a while now. In the past, when
search engines (well, at least the big one) weren't that thorough, you didn't
mind if something stayed on the Internet, since it wasn't easily accessible.

But it's a major difference once it's indexed, searchable, or downloadable.

~~~
hippo8
This is why I think we need to make ICT (Information and Communication
Technology) a mandatory subject in our schools.

Teach kids about the internet. Many people are still under the assumption that
just deleting something makes it disappear forever.

Maybe a lot of people from our generation are doomed to have shared too many
things online, but we can at least save the next generation from themselves.

If you made a mistake 10 years ago, maybe a few close friends in your town
knew about it. Make a mistake now, and the whole world has access to that
information.

Government regulations and bans are not going to stop the spread of
information; we need to educate people to protect themselves from their own
selves.

~~~
eru
Judging by what a hash lots of schools make of literacy and numeracy (and
we've been teaching those for ages), I don't have high hopes for ICT.

------
tomtoise
I've spent some time turning this dataset into a Torrent - magnet link is
here. If the Reddit OP doesn't want it to be shared in this way for whatever
reason, I will of course kill it. Just trying to save him some bandwidth.

magnet:?xt=urn:btih:9941b4485203c7838c3e688189dc069b7af59f2e&dn=RS%5Ffull%5Fcorpus.bz2&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80

~~~
voltagex_
magnet:?xt=urn:btih:9941b4485203c7838c3e688189dc069b7af59f2e&dn=RS%5Ffull%5Fcorpus.bz2&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&ws=http://reddit-data.s3.amazonaws.com/RS_full_corpus.bz2

This adds the S3 link as a web seed.

~~~
tomtoise
That's pretty cool. So would it pull the data from the web or the seedbox?
Sorry - I'm fairly newbish with p2p technology.

~~~
voltagex_
Both, if the client is new enough. I'm not sure what weight is given to each.
You have no contact details and the site on your keybase profile is dead - can
you get in touch? I'd like to experiment with this.

------
jffry
The same author just posted about a realtime stream of comment data:
[https://www.reddit.com/r/datasets/comments/3mk1vg/realtime_d...](https://www.reddit.com/r/datasets/comments/3mk1vg/realtime_data_is_available_including_comments/)

------
dssddsds
Now we wait for someone to train a RNN and produce a fake but believable
reddit parody with generated titles :)

~~~
cmdli
Have you seen this?

[https://www.reddit.com/r/subredditsimulator](https://www.reddit.com/r/subredditsimulator)

~~~
dssddsds
Some of them are amazing - "TIL Robin Williams died a year ago yesterday,
Donald Trump and Bernie Sanders Will Consider Legalizing Dank Maymays if Jet
Fueled"

------
kzhahou
Correlate the comment corpus against LinkedIn DB and other public data
sources, and one could create an auto-dox system.

------
afandian
Crossref did some analysis on when DOIs are submitted to reddit. DOIs are
identifiers / persistent links, mostly for scholarly content.

[http://crosstech.crossref.org/2015/09/dois-in-
reddit.html](http://crosstech.crossref.org/2015/09/dois-in-reddit.html)

HN thread here:
[https://news.ycombinator.com/item?id=10303295](https://news.ycombinator.com/item?id=10303295)

------
stared
Is it possible to use it directly with Spark? (I.e. without downloading it
from AWS.) I mean, if the file is public, does that mean there is an s3n://
address where it is accessible?
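If the bucket allows anonymous reads, the PySpark side would look roughly like this. The s3n:// path below is inferred from the public S3 URL shared elsewhere in the thread (bucket "reddit-data") and is not verified, and credential or anonymous-access configuration is left to Hadoop settings:

```python
import json

# Inferred from the public S3 URL; not verified.
CORPUS_PATH = "s3n://reddit-data/RS_full_corpus.bz2"

def load_submissions(sc, path=CORPUS_PATH):
    """Read the corpus straight from S3 into an RDD of dicts.

    `sc` is an existing pyspark.SparkContext; Hadoop's s3n handler must be
    configured (fs.s3n.awsAccessKeyId etc., or anonymous access for a
    public bucket). Hadoop splits bz2 input, so this parallelizes.
    """
    return sc.textFile(path).map(json.loads)
```
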

------
marcosscriven
I couldn't see from the source what form the data is in. It mentions what
properties are available, but is it a DB snapshot? Are 'submission objects'
text only?

~~~
stuck_in_the_ma
Hi there. I'm the one that released this. The data is just a bunch of JSON
objects separated by new lines (\n). It's basically the exact same info you
would get in JSON format from Reddit's API.
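Given that layout (one JSON object per line, bz2-compressed), reading the dump is straightforward. A minimal Python sketch that streams the file without decompressing it to disk first:

```python
import bz2
import json

def iter_submissions(path):
    """Yield one submission dict per line of the bz2-compressed,
    newline-delimited JSON corpus."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip any blank lines
                yield json.loads(line)
```

For example, `sum(1 for s in iter_submissions("RS_full_corpus.bz2") if s.get("subreddit") == "programming")` counts one subreddit's submissions in a single streaming pass.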

------
yuvipanda
Does anyone know what license this would be under?

------
mbrutsch
Damn shame this excludes all my censored content.

