
What are the Hidden Communities of Reddit? - eli_awry
http://www.cs.utexas.edu/~elie/networks.html#/sfw-reddit
======
DanBC
I'm not sure what "hidden" means in the title. See, eg,
(<http://www.reddit.com/r/proana>). There are a bunch of these closed groups.

The author's work seems really useful for detecting spam. There are some
people / bots who post a lot of specialist content. They only ever post links
to content on domains that pay when visitors click links. These domains have a
lot of ads. There's no other interaction on the site.

_NOT SAFE FOR WORK_:

This user (<http://www.reddit.com/user/walfa2>) only posts content from sites
which pay when viewers see the images. The domains have heavy ad content, with
popups etc.

Here's an example domain:

(<http://www.reddit.com/domain/img1.picfoco.com/>)

Once you find one user you can find a bunch of these domains, and the other
users posting to those domains, and thus find a few more domains.

With a bit of tinkering you could show a colour-coded chart of spam domains;
of users that only post content from those domains; and users that never make
replies but only make top level comments.

That could be run once a week and (with human oversight) used to remove
content which is not good for reddit.
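
A minimal sketch of the user-to-domain-to-user expansion described above, assuming you already have a list of `(user, domain)` pairs for submissions. The function name, the `max_rounds` cap, and the example users and domains are all hypothetical; nothing here calls the reddit API.

```python
from collections import defaultdict


def expand_spam_network(posts, seed_user, max_rounds=3):
    """Grow a set of suspect users and domains from one known spam account.

    posts: iterable of (user, domain) pairs, one per submission.
    Returns (users, domains) reached within max_rounds expansion rounds.
    """
    domains_by_user = defaultdict(set)
    users_by_domain = defaultdict(set)
    for user, domain in posts:
        domains_by_user[user].add(domain)
        users_by_domain[domain].add(user)

    users, domains = {seed_user}, set()
    for _ in range(max_rounds):
        # Domains posted by any suspect user that we have not seen yet.
        new_domains = set().union(*(domains_by_user[u] for u in users)) - domains
        domains |= new_domains
        # Other users who post to any suspect domain.
        new_users = set().union(*(users_by_domain[d] for d in domains)) - users
        if not new_domains and not new_users:
            break
        users |= new_users
    return users, domains


# Hypothetical data: two link-dumping accounts and one ordinary user.
example_posts = [
    ("dumper_1", "img.example"),
    ("dumper_1", "pics.example"),
    ("dumper_2", "pics.example"),
    ("dumper_2", "ads.example"),
    ("regular", "news.example"),
]
suspect_users, suspect_domains = expand_spam_network(example_posts, "dumper_1")
```

This over-collects by design: any account that ever posts to a mainstream image host alongside a spam domain would drag that host in, which is why the round cap and the human review step matter.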

~~~
true_religion
It's not entirely obvious that posting from a domain that incentivizes traffic
is a bad thing.

If the posts are upvoted by the community, then it should be seen as a good
and not a negative.

One of the oddities of reddit as compared to other social sites is that
content owners and traffickers are looked down upon simply because they can
profit from attention.

~~~
DanBC
> It's not entirely obvious that posting from a domain that incentivizes traffic
> is a bad thing.

I agree. To me it's not a problem. But unfortunately some of these posts leak
into unsuitable subreddits. In a NSFW porn subreddit it's not much of a
problem when someone links to a site heavy with horrible porn ads; but when
that link is posted to a non-porn subreddit it's more of a problem.

And, really, Reddit is better as a community rather than a dump for links. So
people who have no interaction with the site other than dumping links can be a
problem. Being paid when people visit those links can make them more of a
problem. They have no interest in Reddit.

~~~
true_religion
> when that link is posted to a non-porn subreddit it's more of a problem.

I think that's a separate concern. Moderators delete bad posts, and the spam
filter operates on a per-subreddit basis.

> Reddit is better as a community rather than a dump for links

I'd agree it's certainly different as a community than say Digg, but I'm not
sure if it's for the better because reddit freezes out content creators from
using Reddit to syndicate their content.

People posting links to stuff copied onto a pay-per-click image sharing site
are pretty far down the food chain. What worries me is that essentially
content-creators can't use Reddit without becoming a part of Reddit's
community and 'paying their dues' as it is.

Reddit's mods and users have banned or severely punished even major newspapers
who wanted to just post their articles on Reddit, and not become Redditors
themselves.

~~~
pavel_lishin
> What worries me is that essentially content-creators can't use Reddit
> without becoming a part of Reddit's community and 'paying their dues' as it
> is.

> Reddit's mods and users have banned or severely punished even major
> newspapers who wanted to just post their articles on Reddit, and not become
> Redditors themselves.

A lot of those content creators toe the line between marketing their content
on Reddit, and spamming it.

I have a moderately high karma on Reddit, and sometimes get private messages
asking me to help people get submissions upvoted in r/politics, and other
subreddits. It almost always sounds shady, and I almost always reply with the
same thing - post quality content relevant to the subreddit with good titles.

Newspapers trying to post content to reddit without "paying their dues" is
like trying to get free advertising. If you have to pay for ads with money,
what's wrong with paying for reddit views with "dues" and becoming part of the
community?

~~~
true_religion
Essentially the only way to pay your dues in Reddit is to post content apart
from your own, and make bunches of comments to prove that you are a 'real
person'.

That's problematic because content creators don't want to help the
competition, and they have real jobs making stuff so don't want to make reddit
their part time job.

Advertisement comes with certain guarantees---pay X amount and you get Y
amount of impressions or clicks. Becoming part of a community has no guarantee
of success, and even once successful Reddit can always turn against you if
they view your posting 1 article per day as "spam" because you work for the
newspaper who makes the article.

------
NZ_Matt
I vaguely remember several years back Reddit added the option for users to
allow their subreddit and votes data to be used for research purposes with the
hope of building a recommendation engine similar to this. Does anyone know if
anything came from that? It would be great if the dataset was publicly
available.

Edit: Here are the original threads, I don't think the project got very far.
[http://www.reddit.com/r/announcements/comments/ddz0s/reddit_...](http://www.reddit.com/r/announcements/comments/ddz0s/reddit_wants_your_permission_to_use_your_data_for/)

[http://www.reddit.com/r/redditdev/comments/dtg4j/want_to_hel...](http://www.reddit.com/r/redditdev/comments/dtg4j/want_to_help_reddit_build_a_recommender_a_public/)

~~~
naner
The option is opt-in (which is good, the fickle reddit community would revolt
otherwise) which means almost nobody uses it. If Reddit would remind their
users frequently (e.g. at the end of popular Reddit Blog posts as an aside) or
reward people for enabling the option (free Reddit Gold for a week, etc.) I'm
sure many more people would sign up.

EDIT: (Sorry for all the parentheticals.)

~~~
dmix
Why not just anonymize the user data part?

There are startups selling _health data_ this way, I don't think it would be
so bad for subreddit subscription data.

~~~
eli
I don't think Reddit users would go for that. Also, it's surprisingly
difficult to anonymize data effectively without removing nearly all of it.

~~~
otakucode
I'm not so sure about that. I believe the problem has been solved, the
solution just isn't widely known yet. I read a paper on arxiv probably a year
ago that describes a method that seems pretty straightforward and secure, but
I've never seen anything about it since. It involved, essentially, throwing
out any records which could actually contribute to a change in a statistical
measure. You basically end up finding what aspects of the data are actually
identifiable, and throw out any records that contain that. It's guaranteed not
to screw up your observations because, by definition, if something is
statistically significant it has to show up often enough that it CAN'T be used
to single out a source.
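
I can't reproduce the paper's method from memory, so this is only a rough sketch of the general idea of throwing out records that are rare enough to identify someone. It is plain suppression in the style of k-anonymity, not the paper's algorithm; the function name, the threshold `k`, and the field names are made up for illustration.

```python
from collections import Counter


def suppress_rare(records, quasi_identifiers, k=5):
    """Drop records whose combination of quasi-identifier values is rare.

    records: list of dicts. quasi_identifiers: keys that could identify a
    person when combined. A record survives only if at least k records
    share its exact combination of values.
    """
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] >= k]


# Hypothetical rows: five users share one profile, one user is unique.
rows = [{"sub": "programming", "region": "US"} for _ in range(5)]
rows.append({"sub": "tinyhobby", "region": "NZ"})
kept = suppress_rare(rows, ["sub", "region"], k=5)
```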

~~~
gwern
Given the dismal history of anonymization, a paper on arxiv is roughly up
there with a blogger saying 'I've proven p!=np'...

> It's guaranteed not to screw up your observations because, by definition, if
> something is statistically significant it has to show up often enough that
> it CAN'T be used to single out a source.

What's 'statistically significant' here? The usual p<0.05 convention? You
realize that there can be multiple measurements or pieces of data all of which
individually have p>0.05 but together have p<<0.05... Information leakage
should be measured in bits, not p-values.

(This kind of aggregation is one of the benefits of approaches like meta-
analysis.)
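
To make the aggregation point concrete, here is a small sketch using Fisher's method, which is one standard way to combine p-values and is my choice for the illustration. It assumes the individual tests are independent, and the example p-values are invented.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the upper tail has a closed form, so no stats library is needed.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    tail_terms = sum(half_stat**i / math.factorial(i) for i in range(k))
    return math.exp(-half_stat) * tail_terms


# Five measurements, none individually significant at the 0.05 level.
weak_signals = [0.08] * 5
combined = fisher_combined_p(weak_signals)
```

Here `combined` comes out well under 0.05 even though every input is above it, which is why filtering records one attribute at a time says little about what an attacker learns from several attributes together.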

------
gurkendoktor
OT - both Safari (w/o Flash) and Google Chrome max out all CPU cores as long
as this site is open. The visualisation might need an upper limit on the work
it is doing per second...

~~~
mikegioia
I'm on chrome without flash (ubuntu) and I took a screenshot of chrome using
172% of the cpu!

~~~
eli_awry
Here's a copy of the post with just text and screenshots:
<http://www.cs.utexas.edu/~elie/noscriptnetworks.html> .

I developed this on Chrome on a Mac and Chromium on Ubuntu, and it worked on
both of those. Sorry it's giving you problems.

------
dmix
I'd be curious to see the connection between politics/economics and other
subreddits.

Such as what subreddits are /r/ liberals, conservatives, libertarians,
anarchists, etc likely to follow?

Are liberals commonly in /r/trees? Are libertarians big on /r/economics? Are
conservatives avoiding /r/wtf and /r/trees?

~~~
eli_awry
Those subs weren't really active enough for me to get significant data in the
time I was scraping (Reddit doesn't let you go back very far.) Great idea for
a future post though.

------
the_cat_kittles
This is one of only a handful of graphviz-esque visuals that ACTUALLY convey
information effectively, as far as I have seen. Not to mention it is really
interesting info! Nice work!

------
razkul
Awesome data. Really interesting to look at, and great presentation.

But there are a few things that kinda bother me with this:

The problem I can find with this data is that it isn't a representation of the
reddit hidden communities as a whole, just the hidden communities of those who
actually post (only 20% of Reddit).

A question I have is whether these are two-way connections with the groups.
It's not clear exactly how the analysis is done 100% (perhaps I missed this
portion), but could connections between subreddits be generated by there being
a lot of people who post in a very tiny subreddit also posting in a larger
subreddit? This means that though someone may like Large Subreddit A, they may
not like the more specific Subreddit B. But a lot who like Subreddit B like
Subreddit A.

~~~
eli_awry
I combatted this in two ways - first, I only looked at the top 433 reddits.

There are always going to be the same people cross-subscribed between A and B
as between B and A. This graph is _not_ of the number of people cross-
subscribed between two reddits - it's of the sum (number of people cross-
subscribed)/(users in A) + (number of people cross-subscribed)/(users in B).
So if a lot of people in a tiny subreddit are cross-subscribed, they get a big
boost from the first term, but almost no boost if they make up a tiny sliver
of subscribers to reddit B.
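
In code, the edge weight is roughly the following. This is a simplified sketch rather than my actual script: it assumes each reddit is represented as a set of the usernames observed in it, and the function name and example sets are just for illustration.

```python
def edge_weight(users_a, users_b):
    """Symmetric overlap score between two reddits.

    weight = shared / |A| + shared / |B|, where shared is the number of
    users seen in both. Returns 0.0 if either set is empty.
    """
    if not users_a or not users_b:
        return 0.0
    shared = len(users_a & users_b)
    return shared / len(users_a) + shared / len(users_b)


# Hypothetical example: a 10-user reddit entirely contained in a 1000-user one.
big = {f"user{i}" for i in range(1000)}
tiny = {f"user{i}" for i in range(10)}
other_big = {f"user{i}" for i in range(990, 1990)}  # shares 10 users with big

tiny_vs_big = edge_weight(tiny, big)       # 10/10 + 10/1000
big_vs_big = edge_weight(big, other_big)   # 10/1000 + 10/1000
```

The same ten shared users give a strong edge when they are the whole of the tiny reddit and a negligible one when they are a sliver of two large reddits.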

------
msds
I did a similar thing with all of the departments of the UW:
<http://www.sorens.in/posts/2012-8-11-uw-courses>

------
1wheel
Really cool! Couple of comments:

1\. I'm assuming you downloaded comment threads from the front page of each of
the subreddits you looked at and then looked at the subreddit each of the
posters had commented in. How many requests did you end up making?

2\. Did you hand select the subreddits you analysed? If so, what criteria were
you looking for?

3\. Have you thought about doing any more research into this area? I made
<http://redditgraphs.com/> and was looking into ways of guessing a user's age
& gender based on their commenting history. I found some papers about similar
sites:

twitter: <http://www.aclweb.org/anthology-new/D/D11/D11-1120.pdf>

blogspot:
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.136...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.136.9952&rep=rep1&type=pdf)

youtube:
[http://static.googleusercontent.com/external_content/untrust...](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/38143.pdf)
(This one looks the most promising; using their methods, treat subreddits as
youtube videos to create more accurate profiles of communities and users. They
also examine the propagation of speech patterns which capture the spread of
some memes.)

Unfortunately, reddit doesn't have user profiles or name-like user names (so
there isn't an easily available training set) and I was having difficulties
organizing and analyzing the large amount of data I was downloading, so I put
the project aside. There has been basically no research done specific to
reddit
([http://scholar.google.com/scholar?as_ylo=2008&q=reddit+d...](http://scholar.google.com/scholar?as_ylo=2008&q=reddit+demographics&hl=en&as_sdt=0,14))
which is surprising to me because of its size and unique subreddit system.

4\. If you want to examine the spread of memes, you need access to old
threads. <http://stattit.com/> is the best way of getting around the reddit
API's 1000 most recent post limitation.

5\. Last month, a similar data set (which only looked at reddit) was collected
- I think you're trying to do something different and your presentation is much
better, but you might be interested in the discussion:
[http://www.reddit.com/r/TheoryOfReddit/comments/126pth/scrap...](http://www.reddit.com/r/TheoryOfReddit/comments/126pth/scraped_110k_comments_from_45000_users_in_527/)

~~~
eli_awry
I looked at about 60,000 distinct users. But you're right about my overall
strategy. I chose all of the subreddits with over some number of subscribers
(I forget what the number was now.) I ended up with 433 subs. I filtered out
the current default subreddits from this visualization.

One thing I was wondering in terms of reddit research - have you looked into
this at all - is that they have users check a specific box if they are ok with
their voting data being used for research - even if it's already public. My
question then is this - is it somehow wrong to use (already-public) data for
research? Anyway, I talk about my original aims for the project in some other
comments.

Thanks for the link to stattit. My strategy for getting enough threads for my
other project was just to keep a slow scraper running for a month and then go
back to it - stattit will be incredibly helpful.

~~~
1wheel
> One thing I was wondering in terms of reddit research - have you looked into
> this at all - is that they have users check a specific box if they are ok
> with their voting data being used for research - even if it's already
> public. My question then is this - is it somehow wrong to use (already-
> public) data for research? Anyway, I talk about my original aims for the
> project in some other comments.

Based on the dozens (at least) of papers published each year that use twitter
data, I'm pretty sure it's kosher to use public posts. You might want to
double check with your IRB though. Depending on how you present the
information, some users might be concerned about their privacy - I wrote a bot
that replied to people posting variations of 'your comment history' with a
link to the referenced person's redditgraph and several people said they were
creeped out by it (a little more here, if you're interested:
<http://www.roadtolarissa.com/redditgraphs-retrospective/>).

Depending on what you are looking for the rate limit might slow you down a
lot; you might want to contact the site admins:

> tl;dr If you need old data, we'd much rather work out a way to get you a
> data dump than to have you scrape.

[https://groups.google.com/forum/?fromgroups=#!topic/reddit-d...](https://groups.google.com/forum/?fromgroups=#!topic/reddit-dev/y_BaqD3QPeU)

------
Kluny
Insanely fascinating. Keep working and adding more graphs and stuff. Everyone
is going to look for their favorite subreddit first, then see how common it is
for members of that subreddit to be into other things they are into.

For instance, I usually read /r/bicycles, but also programming, motorcycles,
cars, and 2xc. How many other people have that unique mix of interests?

------
TGJ
The bottom interactive graph is kinda neat. Setting zero friction and minimal
spring tension and gravity center turns the whole thing into a spheroidal
structure much like the accretion of objects in space.

------
toadi
Good work on the visualization of the data. Take a look at
<http://www.datapointed.net/visualizations/>; his visuals are superb.

------
skadamat
I go to UT and am on the FAI newsletter and totally get your emails!

------
rhizome
Adrian Chen thanks you.

------
mahesh_rm
Isn't r/WTF missing from this picture?

~~~
eli_awry
Indeed. I took out all of the default subreddits because they added too much
noise - everyone starts out subscribed to all of them.

~~~
corin_
Can you not just flip the switch and get equally useful data - don't look for
people who subscribe to a default, but to those who unsubscribe from it?

~~~
ninetax
IIRC that can be hard to tell since you can't get the subreddits that people
are subscribed to, just the ones they comment on, or post in. Many people are
subscribed to r/wtf but don't post or comment in it.

------
jrochkind1
Hi Eli, neat work!

