Quantifying and Visualizing the Reddit Hivemind (minimaxir.com)
36 points by tomkwok on Oct 10, 2015 | 11 comments



May be relevant: https://www.reddit.com/r/no_sob_story/ This is a subreddit of images with the possibly made-up sob story removed. No story about the girlfriend, etc.


This is actually really interesting.

I'd love to see it for Voat and HN to get a comparison.



HN is a bit harder since there is a high frequency of idiosyncratic titles.


I think this type of analysis would be more telling if "phrases" were looked at instead of single words, i.e. 2-word and 3-word combinations.


That's the next step, although I've had difficulty using bigrams in BigQuery.
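
For a smaller exported sample, the bigram counting itself is easy to do locally in Python. A rough sketch, with the titles list standing in for a real export (not the actual pipeline):

    # Minimal bigram counting over submission titles; `titles` is a
    # placeholder for whatever gets exported from BigQuery.
    import re
    from collections import Counter

    def bigrams(title):
        """Yield lowercase 2-word combinations from one title."""
        words = re.findall(r"[a-z0-9']+", title.lower())
        return zip(words, words[1:])

    titles = [
        "TIL something surprising about space",
        "My girlfriend made me this cake",
    ]

    counts = Counter(bg for t in titles for bg in bigrams(t))
    print(counts.most_common(10))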


Also, a couple of other things...

(1) You might be better off writing a small Python script for this (a rough sketch follows below). That is one heck of a query you wrote. I used to have a job where I regularly wrote queries like this, and they sometimes took up to an hour to run. When I discovered Python I never looked back.

(2) This type of analysis has a flaw, so be careful what conclusions you draw from it. What you are doing describes the data, but it (a) does not identify a causal relationship between keywords and submission scores and (b) would likely hold very little predictive power. If you were to build a new dataset by generating titles with a random-walk process and upvoting/downvoting submissions at random, this analysis would still yield apparent "hive mind keywords", even though there is obviously no underlying causal relationship in that case.
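
To make (1) concrete, this is roughly the kind of small script I mean, assuming the submissions are exported to a CSV with "title" and "score" columns; the filename, columns, and minimum-count cutoff are all made up:

    import csv
    import re
    from collections import defaultdict

    scores = defaultdict(list)

    # Collect the score of every submission each keyword appears in.
    with open("submissions.csv", newline="") as f:
        for row in csv.DictReader(f):
            score = int(row["score"])
            for word in set(re.findall(r"[a-z0-9']+", row["title"].lower())):
                scores[word].append(score)

    # Average score per keyword, ignoring words too rare to matter.
    averages = {w: sum(s) / len(s) for w, s in scores.items() if len(s) >= 100}
    for word, avg in sorted(averages.items(), key=lambda kv: -kv[1])[:25]:
        print(word, round(avg, 1))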

You should look up the topic of "cross-validation". The easiest thing you could do, and the best first step, would be to take the Reddit data and split it in half (while maintaining consistency, obviously) so that you have two groups of data. Then perform your analysis on each group and compare the results.

Another method would be to take repeated random subsamples and perform your analysis. See if you get consistent results.
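
A rough sketch of the split-half version, assuming the same (title, score) pairs as above; load_submissions() is just a placeholder for however the data gets loaded:

    import random
    from collections import defaultdict

    def keyword_means(submissions, min_count=50):
        """Average score per keyword over a list of (title, score) pairs."""
        scores = defaultdict(list)
        for title, score in submissions:
            for word in set(title.lower().split()):
                scores[word].append(score)
        return {w: sum(s) / len(s) for w, s in scores.items() if len(s) >= min_count}

    submissions = load_submissions()  # placeholder: list of (title, score) pairs
    random.shuffle(submissions)
    half = len(submissions) // 2
    means_a = keyword_means(submissions[:half])
    means_b = keyword_means(submissions[half:])

    top_a = set(sorted(means_a, key=means_a.get, reverse=True)[:50])
    top_b = set(sorted(means_b, key=means_b.get, reverse=True)[:50])
    print("top-50 keyword overlap between halves:", len(top_a & top_b))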


I answer both of these in the conclusion:

> All in all, this is still just a first step for analyzing the importance of keywords in Reddit submission [...] but [the next steps] require very significant and very expensive computing power.

Of course I want to use cross-validation and other things like that, but it's slightly harder to do on a 200GB dataset.


I still can't get this particular website to load.


Do you happen to be behind a proxy or something like that?


Nope, even tried different OSs (Windows and Linux).

Chrome shows the request as Pending; I tried disabling my ad blockers, but it's still an endless "Loading".



