Quantifying and Visualizing the Reddit Hivemind (minimaxir.com)
36 points by tomkwok on Oct 10, 2015 | 11 comments



May be relevant: https://www.reddit.com/r/no_sob_story/ This is a subreddit of images with the possibly made-up sob story removed. No story about the girlfriend, etc.


This is actually really interesting.

I'd love to see it for Voat and HN to get a comparison.



HN is a bit harder since there is a high frequency of idiosyncratic titles.


I think this type of analysis would be more telling if "phrases" were looked at instead of single words, i.e. 2-word and 3-word combinations.


That's the next step, although I've had difficulty using bigrams in BigQuery.
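
For a smaller exported sample, the bigram counting itself is easy to do locally in Python. A rough sketch, with the titles list standing in for a real export (not the actual pipeline):

    # Minimal bigram counting over submission titles; `titles` is a
    # placeholder for whatever gets exported from BigQuery.
    import re
    from collections import Counter

    def bigrams(title):
        """Yield lowercase 2-word combinations from one title."""
        words = re.findall(r"[a-z0-9']+", title.lower())
        return zip(words, words[1:])

    titles = [
        "TIL something surprising about space",
        "My girlfriend made me this cake",
    ]

    counts = Counter(bg for t in titles for bg in bigrams(t))
    print(counts.most_common(10))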


Also, a couple of other things...

(1) You might be better off writing a small Python script for this (a rough sketch follows below). That is one heck of a query you wrote. I used to have a job where I regularly wrote queries like this, and they sometimes took up to an hour to run. When I discovered Python I never looked back.

(2) This type of analysis has a flaw, so be careful what conclusions you draw from it. What you are doing describes the data, but it (a) does not identify a causal relationship between keywords and submission scores and (b) would likely hold very little predictive power. If you were to build a new dataset by generating titles with a random-walk process and upvoting/downvoting submissions at random, this analysis would still yield apparent "hive mind keywords", even though there is obviously no underlying causal relationship in that case.
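
To make (1) concrete, this is roughly the kind of small script I mean, assuming the submissions are exported to a CSV with "title" and "score" columns; the filename, columns, and minimum-count cutoff are all made up:

    import csv
    import re
    from collections import defaultdict

    scores = defaultdict(list)

    # Collect the score of every submission each keyword appears in.
    with open("submissions.csv", newline="") as f:
        for row in csv.DictReader(f):
            score = int(row["score"])
            for word in set(re.findall(r"[a-z0-9']+", row["title"].lower())):
                scores[word].append(score)

    # Average score per keyword, ignoring words too rare to matter.
    averages = {w: sum(s) / len(s) for w, s in scores.items() if len(s) >= 100}
    for word, avg in sorted(averages.items(), key=lambda kv: -kv[1])[:25]:
        print(word, round(avg, 1))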

You should look up the topic of "cross-validation". The easiest thing you could do, and the best first step, would be to take the Reddit data and split it in half (while maintaining consistency, obviously) so that you have two groups of data. Then perform your analysis on each group and compare the results.

Another method would be to take repeated random subsamples and perform your analysis. See if you get consistent results.
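
A rough sketch of the split-half version, assuming the same (title, score) pairs as above; load_submissions() is just a placeholder for however the data gets loaded:

    import random
    from collections import defaultdict

    def keyword_means(submissions, min_count=50):
        """Average score per keyword over a list of (title, score) pairs."""
        scores = defaultdict(list)
        for title, score in submissions:
            for word in set(title.lower().split()):
                scores[word].append(score)
        return {w: sum(s) / len(s) for w, s in scores.items() if len(s) >= min_count}

    submissions = load_submissions()  # placeholder: list of (title, score) pairs
    random.shuffle(submissions)
    half = len(submissions) // 2
    means_a = keyword_means(submissions[:half])
    means_b = keyword_means(submissions[half:])

    top_a = set(sorted(means_a, key=means_a.get, reverse=True)[:50])
    top_b = set(sorted(means_b, key=means_b.get, reverse=True)[:50])
    print("top-50 keyword overlap between halves:", len(top_a & top_b))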


I answer both of these in the conclusion:

> All in all, this is still just a first step for analyzing the importance of keywords in Reddit submission [...] but [the next steps] require very significant and very expensive computing power.

Of course I want to use cross-validation and other things like that, but it's slightly harder to do on a 200GB dataset.


I still can't get this particular website to load.


Do you happen to be behind a proxy or something like that?


Nope, even tried different OSs (Windows and Linux).

Chrome shows the request as Pending; I tried disabling my ad blockers, but it's still an endless "Loading".



