
What interests Reddit? A network analysis of 84M comments by 200K users - alexcasalboni
http://markallenthornton.com/blog/what-interests-reddit/
======
faizshah
I'm working on a project relevant to this. Does anyone know if the author has
shared this data set anywhere? Or does anyone know of any data sets that could
be used for developing mixture models to classify users into interest groups
(like photographers, programmers etc)?

~~~
aroch
If I'm remembering correctly, the raw data was given under an NDA/DND as a
one-time-only deal. There was a subreddit associated with the data and its
collection, but it's since been banned.

~~~
minimaxir
I was the one with this data. The subreddit that discussed and provided the
data went private.

I have the raw data, but it's infeasible to distribute due to its size, and the
original source got in trouble for making it easily accessible.

------
stared
I like such analyses based on network structure a lot (not long ago I made
something similar for Stack Exchange:
[http://stared.github.io/tagoverflow/](http://stared.github.io/tagoverflow/),
a continuation of my older one,
[https://github.com/stared/tag-graph-map-of-stackexchange/wiki](https://github.com/stared/tag-graph-map-of-stackexchange/wiki)).

Though, technology-wise, it is one use case where SVG beats pixel graphics,
both in terms of usability and interface (whether it is custom D3.js or
something graph-oriented such as [http://sigmajs.org/](http://sigmajs.org/)).

~~~
wamatt
if you set tag coloring to "% answered", an interesting pattern emerges

[http://i.imgur.com/ZLmWHrq.png](http://i.imgur.com/ZLmWHrq.png)

responsiveness of the community in order of most to least

- oldschool hacker (c/c++, bash, perl, regex)

- web dev (jquery, javascript, html, css)

- app dev (ios, objectivec, android, java)

------
jedberg
Fun fact: We did this exact analysis at reddit many years ago, and used it to
figure out which subreddits were related to each other. We never got around to
productizing it, unfortunately, but the idea was to use it to suggest new
reddits to you.
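
One way such a "related subreddits" score could be computed is sketched below. This is only an illustration, not reddit's actual method: the Jaccard overlap of commenter sets, the function names `related_subreddits` and `suggest`, and the toy data are all assumptions made for the example.

```python
from itertools import combinations


def related_subreddits(commenters_by_sub, min_shared=1):
    """Score each subreddit pair by the Jaccard overlap of its commenter sets.

    `commenters_by_sub` maps a subreddit name to the set of users who
    commented there. Pairs with fewer than `min_shared` users in common
    are dropped.
    """
    scores = {}
    for a, b in combinations(sorted(commenters_by_sub), 2):
        users_a, users_b = commenters_by_sub[a], commenters_by_sub[b]
        shared = users_a & users_b
        if len(shared) < min_shared:
            continue
        scores[(a, b)] = len(shared) / len(users_a | users_b)
    return scores


def suggest(sub, scores, top_n=3):
    """Return the subreddits most related to `sub`, best match first."""
    ranked = []
    for (a, b), score in scores.items():
        if sub == a:
            ranked.append((score, b))
        elif sub == b:
            ranked.append((score, a))
    # Highest score first; ties are broken alphabetically.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in ranked[:top_n]]


# Hypothetical toy data, not taken from the real data set.
toy = {
    "python": {"ann", "bob", "cat"},
    "programming": {"ann", "bob", "dan"},
    "photography": {"cat", "eve"},
}
scores = related_subreddits(toy)
suggestions = suggest("python", scores)
```

On real data the raw overlap would probably need to be adjusted for subreddit size, since very large general subreddits share users with almost everything.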

~~~
sinemetu11
I guess this might get into some special sauce territory, but was there a
specific reason why this type of recommendation system was deprecated?

~~~
Houshalter
All the reddit subreddit recommenders I've seen produce garbage
recommendations. Outside of a handful of popular, general subreddits which
everyone already knows about, everything is niche special interest stuff that
you need to find on your own.

~~~
bduerst
Subreddit specificity is so messily complex that it would be very difficult to
do any recommendations based on your own subscriptions. Without reddit's
cooperation in categorization (unlikely) it's probably not going to happen.

------
hooo
I find these network visualizations nice to look at, but not all that
insightful. They're generally hard to read, and it's hard to track relations
outside of the main clusters. Am I missing something?

~~~
th0ma5
No I don't think so. A lot of people call these things "hairballs," and
probably a more useful interface would be some kind of faceted browser that
allows you to do pivots and look at aggregate stats of the various lenses you
can put on top of a graph. Additionally, measurements such as node separation,
"betweenness," or perhaps even looking at common chain patterns are probably
more useful ways of trying to dissect graph structures.
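
As a rough sketch of the "betweenness" measurement mentioned above, the following implements Brandes' algorithm for a small unweighted, undirected graph. The function name `betweenness` and the toy "bowtie" graph are made up for illustration; a real analysis would more likely use a graph library.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalized betweenness centrality (Brandes' algorithm).

    `adjacency` maps each node to the set of its neighbours in an
    unweighted, undirected graph. The score of a node is the number of
    shortest paths between other node pairs that pass through it, with
    each pair's weight split evenly among its tied shortest paths.
    """
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        preds = {node: [] for node in adjacency}
        paths = {node: 0 for node in adjacency}
        dist = {node: -1 for node in adjacency}
        paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {node: 0.0 for node in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {node: value / 2 for node, value in score.items()}


# Hypothetical "bowtie": two triangles that share the node "hub".
bowtie = {
    "a": {"b", "hub"},
    "b": {"a", "hub"},
    "hub": {"a", "b", "c", "d"},
    "c": {"hub", "d"},
    "d": {"hub", "c"},
}
result = betweenness(bowtie)
```

Here `hub` sits on the only shortest path between the two triangles, so it gets all of the betweenness while the other nodes get none. That ranking is the kind of summary that is hard to read off a hairball plot.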

~~~
Chronic31
Let's say you compute how distant/similar two concepts are. Then what? You
update a link on Wikipedia?

~~~
th0ma5
Yeah I don't know! :D I guess I was talking about graph processing in general:
[https://en.wikipedia.org/wiki/Betweenness_centrality](https://en.wikipedia.org/wiki/Betweenness_centrality)

------
SwellJoe
What interests reddit? Casual racism and misogyny. Also cats.

Seriously though, it's interesting how interconnected _some_ things can be in
this view. I'm not sure what sense I can make of those interconnections,
though. Mousing around, while being a very frequent redditor (so my own neural
network is making connections based on experience), I can kinda infer order
out of things like the "government->state" topics connected to "force" and
"property" among others (hints at the libertarian-leaning general population),
and the "women" topic connecting to a whole host of stuff...the cyan colored
section off to the top right might even kinda hint at the casual misogyny
thing (which was a "ha ha only serious" kind of joke), with words like
"bullshit", "logic", "proof", "assumption", "reasoning", and "evidence", being
connected to "women" but not to "men".

But, without having spent years on reddit, and without my particular flavor of
reddit (the subs I'm subscribed to), maybe I'd interpret the data very
differently. I never quite know how to interpret network graphs like this,
honestly, short of for things that _are_ networks. i.e. a computer network
topology on a graph shows useful data...the hops from one machine to the next.
When connecting up one word to the next, it seems difficult to draw meaningful
conclusions. Like my interpretation of the meaning of
"government->state->property" as being a hint at the libertarian leanings of
many subreddits, or the connection of "women->reasoning->evidence" as being a
hint of many redditors' belief that women are illogical liars (which is the
impression many of my female friends have of reddit, in general, particularly
when topics like date rape or the "friend zone" come up). Is that actually the
context in which these connections are made? I wouldn't really know how to
check. It'd be cool to be able to drill down to the conversations in which the
connections were made, but presenting that in a coherent UI seems
challenging.

~~~
thejaredhooper
I agree. It would be nice to drill down into the data in order to further
analyze everything. I also feel there was a particular sort of censorship in
the dataset, an indicator of which was the explicit racist and misogynistic
words that were absent. There was a large lack of swears and bad terms in this
analysis (bitch being a particularly obvious cut) and I, for one, see examples
of these slurs prevalently used by young men far too often on the site.

Perhaps the data was tailored when it was provided to the analyst, or it was
censored after reception, but this felt too "PG-13" for an analysis of
reddit's "interests".

------
6stringmerc
I get the feeling that Conde Nast may not like this type of approach when
they're not directly profiting from it. A study of the language between the
SFW and NSFW type tags might be pretty interesting, or, well, not very
pleasant. I did participate in a couple music communities for a while, but
there's something in the stew over there that I'm glad I closed my account and
never looked back. YMMV.

~~~
brandonwamboldt
Contrary to popular belief, Condé Nast no longer owns Reddit.

Since 2012, Reddit has operated as an independent company (Advance
Publications, the parent company of Condé Nast, is a majority shareholder
though).

See:
[http://www.redditblog.com/2013/08/reddit-myth-busters_6.html#independent-reddit-inc](http://www.redditblog.com/2013/08/reddit-myth-busters_6.html#independent-reddit-inc)

~~~
thieving_magpie
So they don't own it, but they are the majority shareholder? That doesn't
feel very independent. Maybe I'm misunderstanding.

~~~
brandonwamboldt
You are indeed misunderstanding.

Reddit is an independent entity, not a subsidiary of Condé Nast (like it used
to be) and not a subsidiary of Advance Publications (like it used to be).

It is an independent corporation, with its own board of directors and
control of its own finances.

Just being a majority stakeholder doesn't mean you control the company either.
There are a lot of details like share types and company by-laws that determine
that.

~~~
bhayden
It is probably safe to say Condé Nast has a huge influence in who the board of
directors are, and therefore controls reddit still.

~~~
bradleyjg
Not Conde Nast, Conde Nast's owners, Advance Publications, or when you really
get down to it, the Newhouse family.

And the grandparent saying that Advance Publications is "only" a majority
shareholder is a little deceptive. The shareholders are Advance Publications,
current and former employees (as part of an ESOP) and a small residual
ownership of angels in the original company.

While it is true that a majority owner can't just do whatever it wants, the
rules protect the financial interests of minority shareholders, mostly in the
context of takeovers, not the editorial independence of employees. If Si and
Donald decided they really didn't like the NSFW part of reddit I think they
could get rid of it.

~~~
dublinben
Did they not just raise many millions of dollars in a new investment round?
Did those new investors not receive a percentage of equity in Reddit Inc. as
it is organized today?

~~~
bradleyjg
You're right, mea culpa. It was a $50M investment on a reported $500M
valuation. So 10% to the new investors (with a possibly defunct plan to give
10% of that, i.e. 1%, to the site's users), and 90% split between Advance
Publications, ESOP, and the legacy angels (reported at less than 1% of the
pre-investment total).
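
The arithmetic above can be checked with a few lines. This sketch assumes the reported $500M is a post-money valuation, which is what makes $50M come out to exactly 10%. The variable names are illustrative, and the angel figure is only an upper bound derived from "less than 1%".

```python
from fractions import Fraction

investment = 50_000_000
valuation = 500_000_000  # assumed to be the post-money valuation

# Share of the company held by the new investors.
new_investor_share = Fraction(investment, valuation)

# The plan to pass 10% of the new investors' stake on to the site's users.
user_share = new_investor_share / 10

# What remains for Advance Publications, the ESOP, and the legacy angels.
existing_share = 1 - new_investor_share

# Angels were reported at under 1% of the pre-investment total, so their
# stake after the round is bounded by 1% of the remaining share.
angel_upper_bound = existing_share / 100

print(f"new investors: {float(new_investor_share):.1%}")
print(f"users (planned): {float(user_share):.1%}")
print(f"existing holders: {float(existing_share):.1%}")
print(f"legacy angels: under {float(angel_upper_bound):.1%}")
```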

As for the ESOP percentage, all I've found is a reference in Forbes that
describes it as a "sizable minority".

------
erroneousfunk
Small point: Is it really considered "scraping" ("I scraped approximately 84
million comments") if you used a Python library that uses the Reddit API
rather than the actual site directly?

------
fspacef
Salute the effort put into this, quite thought provoking

~~~
okasaki
Maybe I'm just stupid, but I don't see anything thought provoking.

In fact I feel that a better way to see what redditors are interested in would
be to just find (there may even be stats on reddit on this) the ~50 most
active subreddits.

~~~
zipppy
If the 50 most active only constituted 10% of all reddit activity, though,
they wouldn't necessarily paint an accurate picture of all of reddit.

Maybe nothing could paint that picture, but if there are themes prevalent
independent of the subreddit topics themselves, this kind of analysis could
shed light on them.

------
grabcocque
Misogyny seems to be big in Reddit comments.

~~~
seany
You misspelled misandry.

~~~
sliverstorm
I'm thinking misanthropy is more accurate

