
Using Doc2Vec to Suggest SubReddits - jmportilla
http://www.reddit2vec.com
======
sdrothrock
This is pretty neat, but the biggest problem for me is the case sensitivity;
reddit itself doesn't use case sensitivity, so it's hard to remember the exact
capitalization of a subreddit name.

~~~
jmportilla
Yeah, I know it's super annoying. Hopefully I'll have time to update the model
with lowercase names sometime next week.
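
Until the model itself is retrained, one possible stopgap is a lookup table from
lowercase names to whatever capitalization the model stores. This is only a
sketch: `build_case_index`, `resolve`, and the sample vocabulary are made up
for illustration and are not taken from the site's code.

```python
def build_case_index(names):
    """Map each lowercase subreddit name to the spelling the model uses."""
    index = {}
    for name in names:
        # setdefault keeps the first spelling seen if two names collide
        index.setdefault(name.lower(), name)
    return index


def resolve(name, index):
    """Return the stored spelling for a user-typed name, or None if unknown."""
    return index.get(name.strip().lower())


# Hypothetical vocabulary; the real model's labels may differ.
vocab = ["AskReddit", "pcmasterrace", "MachineLearning"]
index = build_case_index(vocab)
print(resolve("askreddit", index))
print(resolve("MACHINELEARNING", index))
```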

------
utunga
Hi!

Great work. My question is: do you use averaging of word vectors or the
Chinese Restaurant process to get to subreddit vectors? You describe the
Chinese Restaurant process as a "more sophisticated method" that you "can"
use, but in my experiments with word2vec and reddit
([https://github.com/utunga/gensimred](https://github.com/utunga/gensimred)) I
quickly discovered that simple averaging just does not work. Averaging has
this awful 'revert to mean' effect that turns all the paragraph vectors into a
sort of bland gray goo where they all look the same.

If you did use Chinese Restaurant process (I love that phrase - brings back
memories of an occasion at a Dim Sum restaurant where this almost literally
happened) it'd be great to see any source code you may feel like releasing ;_)
... well, it can't hurt to ask..
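
The 'revert to mean' effect can be reproduced on toy data. The sketch below is
not the gensimred experiment: it assumes every word vector is a shared offset
plus a small topic offset plus noise, with all numbers invented for
illustration. Under that assumption, averaging a few thousand words cancels
the noise and leaves the shared offset dominating, so two documents about
different topics end up pointing in almost the same direction.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def make_doc(rng, topic, n_words):
    # Assumption: word = shared offset (1.0) + topic offset + Gaussian noise
    return [[1.0 + t + rng.gauss(0.0, 1.0) for t in topic]
            for _ in range(n_words)]


rng = random.Random(0)
dim = 50
topic_a = [0.3] * 5 + [0.0] * (dim - 5)   # topic lives in the first 5 dims
topic_b = [0.0] * (dim - 5) + [0.3] * 5   # topic lives in the last 5 dims
doc_a = make_doc(rng, topic_a, 2000)
doc_b = make_doc(rng, topic_b, 2000)

# Cosine similarity of the two averaged "paragraph vectors"
avg_sim = cosine(mean_vector(doc_a), mean_vector(doc_b))
print(round(avg_sim, 3))
```

In this toy setup the similarity comes out close to 1 even though the two
topics do not overlap at all, which is the gray goo described above.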

~~~
jmportilla
I used the gensim Doc2Vec implementation. You can check out some of the source
code here:
[https://github.com/jmportilla/Reddit2Vec](https://github.com/jmportilla/Reddit2Vec)

~~~
utunga
Hi... Thanks for that. Awesome and much appreciated.

------
joelthelion
Very cool. Little tip: use "-funny" to get high-quality subs :)
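
For anyone wondering what a "-" term could do under the hood: a common approach
with embedding models is to add the vectors of the wanted terms, subtract the
unwanted ones, and rank everything else by cosine similarity to the result.
Whether the site does exactly this is an assumption; the 3-d vectors and the
`suggest` helper below are invented for illustration.

```python
import math

# Hypothetical vectors, not taken from the real model.
vectors = {
    "funny": [1.0, 0.0, 0.0],
    "memes": [1.0, 0.1, 0.0],
    "pics": [0.9, 0.3, 0.0],
    "science": [0.1, 0.9, 0.2],
    "askscience": [0.0, 1.0, 0.3],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest(positive, negative, topn=2):
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for name in positive:
        query = [q + v for q, v in zip(query, vectors[name])]
    for name in negative:
        query = [q - v for q, v in zip(query, vectors[name])]
    # Never suggest the terms the user typed
    typed = set(positive) | set(negative)
    scored = [(cosine(query, vec), name)
              for name, vec in vectors.items() if name not in typed]
    scored.sort(reverse=True)
    return [name for _, name in scored[:topn]]


print(suggest([], ["funny"]))
print(suggest(["science"], ["funny"]))
```

With these made-up vectors, "-funny" pushes the results toward whatever is
least aligned with the "funny" direction.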

------
Yadi
Awesome seeing someone use the reddit dataset :)!

Wouldn't w2v as a recommender for the user have been better?

Taking the user's comments/likes/subreddits as features.

~~~
jmportilla
I think you're thinking of a classic collaborative filtering recommendation
system. A simple w2v system would take all words into account, and the results
would then have to be filtered down to words that are subreddit names.
Although, I may have misunderstood your suggestion.
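
The filtering step mentioned above might look roughly like this. It is a sketch
only: `filter_to_subreddits` is a hypothetical helper, and the neighbour list
stands in for the (word, similarity) pairs a word-level model would return.

```python
def filter_to_subreddits(neighbours, subreddit_names, topn=5):
    """Keep only the neighbours whose word is also a subreddit name."""
    known = {name.lower() for name in subreddit_names}
    hits = [(word, score) for word, score in neighbours
            if word.lower() in known]
    return hits[:topn]


# Hypothetical output of a word-level nearest-neighbour query, best first.
neighbours = [("kitten", 0.91), ("aww", 0.88), ("dog", 0.80), ("pets", 0.77)]
subreddits = ["aww", "Pets", "funny"]
print(filter_to_subreddits(neighbours, subreddits))
```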

------
riffraff
Neat. I'd suggest treating spaces as "+", i.e. "cats awww" should be the
same as "cats+awww", I guess :)
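
A query parser along these lines could make the two spellings equivalent. This
is a sketch with a hypothetical `parse_query` function, assuming subreddit
names contain only letters, digits, and underscores, so that "+", "-", and
whitespace can all safely act as separators.

```python
import re

# An optional sign, optional whitespace, then a name
TERM = re.compile(r"([+-]?)\s*([A-Za-z0-9_]+)")


def parse_query(text):
    """Split a query into (positive, negative) lists of subreddit names."""
    positive, negative = [], []
    for sign, name in TERM.findall(text):
        # A term with no sign counts as "+", so a bare space works like "+"
        (negative if sign == "-" else positive).append(name)
    return positive, negative


print(parse_query("cats awww"))
print(parse_query("cats+awww"))
print(parse_query("cats -funny"))
```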

------
haxiomic
Nice idea :), works well. Spotted a small typo in the examples:

pcmasterace+mac should be pcmasterrace+mac (missing an r)

~~~
jmportilla
Thanks! I'll fix it

