
Quantifying and Visualizing the Reddit Hivemind - tomkwok
http://minimaxir.com/2015/10/reddit-topwords/
======
sheensleeves
May be relevant:
[https://www.reddit.com/r/no_sob_story/](https://www.reddit.com/r/no_sob_story/)
This is a subreddit of images with the possibly made-up story removed. No
story about the girlfriend, etc.

------
hugh4
This is actually really interesting.

I'd love to see it for voat and HN to get a comparison.

~~~
Excavator
There are these two for HN:

[http://blog.datadive.net/which-topics-get-the-upvote-on-hack...](http://blog.datadive.net/which-topics-get-the-upvote-on-hacker-news/)

[http://insightmine.com/hacking-y-combinator/](http://insightmine.com/hacking-y-combinator/)

------
leeleelee
I think this type of analysis would be more telling if "phrases" were looked
at instead of single words, i.e. 2-word and 3-word combinations.
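
As a rough illustration of the phrase idea, here is a minimal Python sketch
that counts 2-word or 3-word combinations across a list of titles. The
function names (`ngrams`, `top_phrases`) and the simple regex tokenizer are
assumptions made for this example, not anything from the linked article.

```python
import re
from collections import Counter


def ngrams(title, n):
    """Return the n-word phrases in a title, lowercased, in order."""
    words = re.findall(r"[a-z0-9']+", title.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def top_phrases(titles, n=2, k=10):
    """Return the k phrases of length n that appear in the most titles."""
    counts = Counter()
    for title in titles:
        # Use a set so a phrase repeated inside one title counts once.
        counts.update(set(ngrams(title, n)))
    return counts.most_common(k)
```

Calling `top_phrases(titles, n=3)` gives the 3-word version. Counting each
phrase once per title is a design choice here, so one repetitive title cannot
dominate the ranking.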

~~~
minimaxir
That's the next step, although I've had difficulty using bigrams in BigQuery.

~~~
leeleelee
Also, a couple other things...

(1) You might be better off writing a small Python script for this. That is
one heck of a query you wrote. I used to have a job where I regularly wrote
queries like this, and they sometimes took up to an hour to run. Once I
discovered Python, I never looked back.

(2) This type of analysis has a flaw, so be careful what conclusions you draw
from it. What you are doing _describes_ the data, but it (a) does not identify
a causal relationship between keywords and submission scores and (b) would
likely hold very little predictive power. If you were to form a new set of
data by simulating a random walk process to generate titles and
upvote/downvote submissions, this analysis would also yield apparent "hive
mind keywords", but obviously there is no underlying causal relationship in
that case.
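
A small sketch of that point, under loud assumptions: instead of a literal
random walk, it draws titles as random words from a made-up vocabulary and
draws scores independently from a heavy-tailed distribution, so no word has
any effect on score. All names and parameters below are hypothetical.

```python
import random
from collections import defaultdict


def simulate_null(n_titles=2000, vocab_size=50, words_per_title=5, seed=0):
    """Generate (words, score) rows where score is independent of the words."""
    rng = random.Random(seed)
    vocab = [f"w{i}" for i in range(vocab_size)]
    rows = []
    for _ in range(n_titles):
        words = rng.sample(vocab, words_per_title)
        # Heavy-tailed score (assumption): most are small, a few are large.
        score = int(rng.paretovariate(1.5))
        rows.append((words, score))
    return rows


def mean_score_by_word(rows):
    """Average the scores of all titles that contain each word."""
    totals = defaultdict(lambda: [0, 0])
    for words, score in rows:
        for w in words:
            totals[w][0] += score
            totals[w][1] += 1
    return {w: s / c for w, (s, c) in totals.items()}


rows = simulate_null()
means = mean_score_by_word(rows)
ranked = sorted(means.items(), key=lambda kv: -kv[1])
print("apparent 'top' words:", ranked[:3])
print("apparent 'bottom' words:", ranked[-3:])
```

Some words still come out ranked well above others, even though the ranking
is pure noise by construction.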

You should look up the topic of "cross-validation". The easiest thing you
could do, and the best "first step", would be to take the Reddit data and
split it in half (while maintaining consistency, obviously) so that you have
two groups of data. Then perform your analysis on each group and compare the
results.

Another method would be to take repeated random subsamples and perform your
analysis. See if you get consistent results.
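
Both checks can be sketched in a few lines of Python. This assumes the data
is a list of `(title, score)` pairs, that "keywords" are ranked by the mean
score of the titles containing them, and that a plain random split is good
enough; all of those are simplifications for illustration, and the function
names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def top_keywords(rows, k=10, min_count=2):
    """Return the k words with the highest mean score (ties broken by word)."""
    totals = defaultdict(lambda: [0, 0])
    for title, score in rows:
        for word in set(title.lower().split()):
            totals[word][0] += score
            totals[word][1] += 1
    # Ignore words seen in fewer than min_count titles.
    means = {w: s / c for w, (s, c) in totals.items() if c >= min_count}
    return set(sorted(means, key=lambda w: (-means[w], w))[:k])


def split_half_overlap(rows, k=10, seed=0):
    """Split rows randomly in half; return Jaccard overlap of the top-k sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    mid = len(rows) // 2
    a, b = top_keywords(rows[:mid], k), top_keywords(rows[mid:], k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def subsample_stability(rows, k=10, frac=0.5, trials=20, seed=0):
    """Fraction of random subsamples in which each word makes the top k."""
    rows = list(rows)
    rng = random.Random(seed)
    size = max(1, int(len(rows) * frac))
    hits = Counter()
    for _ in range(trials):
        hits.update(top_keywords(rng.sample(rows, size), k))
    return {word: count / trials for word, count in hits.items()}
```

An overlap near 1.0, or words that land in the top k in almost every
subsample, would suggest the keyword ranking is stable; values near 0 would
suggest it is mostly noise.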

~~~
minimaxir
I answer both of these in the conclusion:

> _All in all, this is still just a first step for analyzing the importance of
> keywords in Reddit submission [...] but [the next steps] require very
> significant and very expensive computing power._

Of course I want to use cross-validation and other things like that, but it's
slightly harder to do on a 200GB dataset.

------
KhalilK
I still can't get this particular website to load.

~~~
minimaxir
Do you happen to be behind a proxy or something like that?

~~~
KhalilK
Nope, I even tried different OSes (Windows and Linux).

Chrome shows the request as Pending, tried disabling my ad-blockers, still an
endless "Loading".

