
Show HN: Sense2vec model trained on all 2015 Reddit comments - syllogism
https://sense2vec.spacy.io
======
henningpeters
Accompanying blog post is at [https://spacy.io/blog/sense2vec-with-spacy](https://spacy.io/blog/sense2vec-with-spacy)

------
minimaxir
As someone who's tried to quantify the Reddit hivemind by analyzing word usage
([http://minimaxir.com/2015/10/reddit-topwords/](http://minimaxir.com/2015/10/reddit-topwords/)), I think the spaCy
semantic approach is a much more robust way to analyze the Reddit data corpus.

However, I'm not sure I agree with using phrase similarity as an indicator of
Reddit hivemind behavior, as discussed in the accompanying blog post. The tool
is more an indicator of the writing styles of Reddit's primary demographic
(male, 18-30) and of which phrases co-occur, rather than a weighting of the
_importance_ of given phrases to Reddit discussion.

------
rockmeamedee
Word vectors are really intriguing to me, and I have a few questions:

If you also trained a model on a different news aggregator's comments, would
it be possible to "match up" the meaning vector spaces and see differences in
what meanings each community ascribes to words?

Additionally, could you determine sentiments from the positions of the words
in the meaning space? One example, looking for underlying assumptions, like
associations of the words 'black person' and 'criminal'. Or idk, man + sex =
player, but woman + sex = slut.

Would it be possible to go higher level and see how much a corpus agrees with
"free markets are good" based on its word positions?

It seems like word2vec has the potential to bring Sapir-Whorf to a whole new
level.

~~~
syllogism
Let's say we want to know how usage differs in one subreddit vs another. If
you just train two entirely separate models, you end up with two vector
spaces. The meanings within each space are entirely relative --- the absolute
positions obviously aren't significant. You can try to learn a mapping, but
the transform is not necessarily linear. (Interesting empirical question
there...)

What you need is a 'seam' that connects the two vector spaces. I would do it
like this.

Train a single model, with the words decorated by their subreddit. So you have
combat:/r/gaming and combat:/r/history as different tokens. Then you have
shared tokens which aren't decorated in this way.
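A minimal sketch of that decoration step (the function name and the shared-vocabulary set here are made up for illustration):

```python
# Sketch: decorate subreddit-specific tokens before a single word2vec
# training run, so "combat" in /r/gaming and /r/history become distinct
# tokens while common words stay shared and act as the 'seam'.

def decorate(tokens, subreddit, shared_vocab):
    """Tag every non-shared token with its subreddit of origin."""
    return [
        tok if tok in shared_vocab else f"{tok}:{subreddit}"
        for tok in tokens
    ]

# Tokens common to both communities anchor the two sub-spaces together.
SHARED = {"the", "a", "is", "of", "in", "very"}

gaming = decorate("the combat is very fluid".split(), "/r/gaming", SHARED)
history = decorate("the combat of 1815 in detail".split(), "/r/history", SHARED)
```

All decorated sentences then go into one word2vec training run; the shared, undecorated tokens keep the two sub-spaces in a common coordinate system, so combat:/r/gaming and combat:/r/history can be compared directly.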

------
MasterScrat
Now can't you experiment with algebra operations on those vectors, like "queen
minus female plus male = king"?

~~~
syllogism
This is supported in the underlying data, but we don't expose it in the UI at
the moment. We'll probably do something a bit different to help people make
those queries.
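Under the hood, such a query is just vector addition followed by a nearest-neighbour search. A toy sketch with hand-made 3-d vectors (the numbers are invented for illustration; real models use hundreds of learned dimensions):

```python
import math

# Toy illustration of "queen - female + male ≈ king" on made-up vectors.
vecs = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "male":   [0.1, 0.9, 0.1],
    "female": [0.1, 0.1, 0.9],
    "apple":  [0.0, 0.1, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# queen - female + male, component-wise
target = [q - f + m
          for q, f, m in zip(vecs["queen"], vecs["female"], vecs["male"])]

# Rank the remaining words by similarity to the target point.
best = max((w for w in vecs if w not in {"queen", "female", "male"}),
           key=lambda w: cosine(vecs[w], target))
# best == "king"
```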

~~~
j2kun
Would it be possible for you to make the trained vectors available for
download and independent analysis without having to rescrape and retrain the
model?

~~~
MasterScrat
FYI, there are publicly available Reddit data dumps; no need to hammer Reddit
by scraping it:

[https://archive.org/details/2015_reddit_comments_corpus](https://archive.org/details/2015_reddit_comments_corpus)

~~~
tartakovsky
Awesome.

------
andrewtbham
This is very cool. I have played with the word2vec download a lot and I have
been sorta amazed. If you had a lot of money... like Google... it would be
cool to train word vectors on the whole internet.

Something like this definitely makes WordNet obsolete. :-(
[http://wordnetweb.princeton.edu/perl/webwn](http://wordnetweb.princeton.edu/perl/webwn)

~~~
rspeer
word2vec embeddings tend to be improved when you add WordNet (Faruqui et al.,
2015): [https://www.cs.cmu.edu/~hovy/papers/15HLT-retrofitting-word-vectors.pdf](https://www.cs.cmu.edu/~hovy/papers/15HLT-retrofitting-word-vectors.pdf)
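The core update in that paper is simple enough to sketch: each vector is iteratively pulled toward the average of its lexicon neighbours (e.g. WordNet synonyms) while staying anchored to its original corpus-trained position. A toy version with uniform weights (the paper weights edges per relation; the 2-d vectors here are invented):

```python
# Minimal retrofitting sketch (after Faruqui et al., 2015).
vectors = {"happy": [1.0, 0.0], "glad": [0.0, 1.0], "sad": [-1.0, 0.0]}
lexicon = {"happy": ["glad"], "glad": ["happy"], "sad": []}

def retrofit(vectors, lexicon, iterations=10, alpha=1.0, beta=1.0):
    """Pull each vector toward its neighbours, anchored by its original."""
    original = {w: list(v) for w, v in vectors.items()}
    new = {w: list(v) for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            if not neighbours:
                continue  # no lexicon info: keep the corpus vector
            dims = len(new[word])
            new[word] = [
                (sum(beta * new[n][d] for n in neighbours)
                 + alpha * original[word][d])
                / (beta * len(neighbours) + alpha)
                for d in range(dims)
            ]
    return new

fitted = retrofit(vectors, lexicon)
# "happy" and "glad" move toward each other; "sad" is untouched.
```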

------
1ris
It's impressive how good it is.

But there is still room for improvement. Searching "Haskell" leads to Clojure
and C++. Yes, these are both programming languages, but out of all of them I
personally wouldn't have picked C++. :)

"Scheme" leads to Haskell, which is very fitting, and Prolog, which seems to
fit as a "university language" as well.

For "Agda" it outputs "typeclasses" at 74%. This is a much-discussed topic.
But for a "truly" semantic understanding it should know that both "Agda" and
"Haskell" are in the same category, and that "typeclasses" is a property that
elements in this category have or don't have.

Still, very impressive. But not the singularity yet.

~~~
verroq
I would have said C++ because the template system is a functional programming
language.

------
gradi3nt
The first name to show up when I search 'evil man' is 'Mr Rogers'...what the
hell, reddit.

(first name, not first word, had to scroll down a bit, past words like 'Nazi
soldier' and 'evildoer')

~~~
realusername
I was half-expecting this but Comcast is associated with Swastika and Nazi.

------
andrewtbham
The band names are particularly cool. It is sorta like a recommendation
engine: if you like Pink Floyd, you might like Led Zeppelin.

------
jldugger
Pretty good at picking up the Smash Bros. jargon:

[https://sense2vec.spacy.io/?dash_dance|NOUN](https://sense2vec.spacy.io/?dash_dance|NOUN)

------
thekingshorses
NSFW doesn't work:
[https://sense2vec.spacy.io/?NSFW%7CNOUN](https://sense2vec.spacy.io/?NSFW%7CNOUN)

~~~
neo2006
Try RTFM; it doesn't work either!

~~~
singham
Try ftfy

------
fla
Submitting 'NSFW' (without quotes) will display mangled URLs.

~~~
Houshalter
Not necessarily a bug on their end: those URLs actually come from a bot,
AutoWikibot, which uses the word "NSFW" in every post, so all of its
formatting is highly correlated with it:
[https://www.reddit.com/user/AutoWikibot](https://www.reddit.com/user/AutoWikibot)

The last line (Parent commenter can toggle NSFW or delete...) has this
formatting:

    
    
        ^Parent ^commenter ^can [^toggle ^NSFW](/message/compose?to=autowikibot&subject=AutoWikibot NSFW toggle&message=%2Btoggle-nsfw+ct3omf8) ^or[](#or) [^delete](/message/compose?to=autowikibot&subject=AutoWikibot Deletion&message=%2Bdelete+ct3omf8)^. ^Will ^also ^delete ^on ^comment ^score ^of ^-1 ^or ^less. ^| [^(FAQs)](/r/autowikibot/wiki/index) ^| [^Mods](/r/autowikibot/comments/1x013o/for_moderators_switches_commands_and_css/) ^| [^Call ^Me](/r/autowikibot/comments/1ux484/ask_wikibot/)
    

You can see how that formatting might fuck up their nice natural language
processor and tokenizer.
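One rough guard would be stripping the Reddit-specific markup before tokenization. A sketch (the regexes are illustrative, not what the sense2vec pipeline actually does):

```python
import re

def clean_reddit_markup(text):
    """Drop inline markdown links and superscript carets from a comment."""
    # [label](url) -> label
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # superscript markers: ^word -> word
    text = re.sub(r"\^+", "", text)
    return text

raw = "^Parent ^commenter ^can [^toggle ^NSFW](/message/compose?to=autowikibot)"
cleaned = clean_reddit_markup(raw)
# cleaned == "Parent commenter can toggle NSFW"
```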

------
ecesena
Very cool!

I've been working for a while on extracting "semantics" from raw text (mostly
news). One of the big limits of word2vec is that meaning is derived from word
proximity in sentences, which works in some cases, but not always.

I'll give one example: travel/geography. If you query something like Italy
[1], the results are other European states. But if you're looking for news
about Italy, or planning a vacation to Italy, or searching for Italian food...
or for anything related to Italy itself, you probably don't expect "Spain" to
be the first result.

It would be nice if word2vec had an easy way to define domains and their
relationship to word proximity in sentences, to overcome situations like this
one.

[1]
[https://sense2vec.spacy.io/?Italy%7CGPE](https://sense2vec.spacy.io/?Italy%7CGPE)

~~~
nl
Don't think of Word2Vec (etc) as a search - it's a different thing.

It's more like a recommender: If I like Pizza in Napoli where should I go in
Spain and what should I eat there?

Italy:Pizza -> Spain:?

Italy:Napoli -> Spain:?

------
dbunkerx
Tried something similar a while back, only using Spark's Word2Vec
implementation, though just looking at individual organizations as used in
different subreddits. It is surprising how far Word2Vec can take you in
deriving word similarity.

I used POS tagging in a previous post, though not with Word2Vec, since I
wasn't sure whether distinctions like duck-the-verb vs. duck-the-noun would
improve the result, because the placement of a verb vs. a noun would already
differ. It's certainly an interesting approach, though, and I wonder whether
going the other way might yield better results for the POS tagger as well,
since verb vs. noun senses would span disparate word clusters.

1) [http://dbunker.github.io/2016/01/05/spark-word2vec-on-reddit/](http://dbunker.github.io/2016/01/05/spark-word2vec-on-reddit/)

------
Grue3
Strange: I enter "Harry Kane" and it shows me various footballers (none of
whom are on the same team as him). But then I enter "Jamie Vardy" and there
are no results whatsoever, even though he's been HUGE on /r/soccer in 2015.

~~~
syllogism
Hmm. Interesting. I wonder whether we're missing some data. I didn't do so
much to verify that. Thanks.

------
stared
And interactive sentence parsing:
[https://api.spacy.io/displacy/index.html?full=A+machine+lear...](https://api.spacy.io/displacy/index.html?full=A+machine+learning+how+to+do+machine+learning).

------
aledalgrande
Indeed accurate:
[https://sense2vec.spacy.io/?turd%7CNOUN](https://sense2vec.spacy.io/?turd%7CNOUN)

~~~
taneq
Compare and contrast:
[https://sense2vec.spacy.io/?dank_meme|NOUN](https://sense2vec.spacy.io/?dank_meme|NOUN)

~~~
aledalgrande
Wow, didn't know the term "dank" before. Although Urban Dictionary says it is
a term used by stoners and hippies. ;)

------
empath75
Oh this is fun if you look up politically charged words like 'thug' or even
something that should be pretty harmless like 'girls'.

~~~
Houshalter
I don't see anything surprising about those searches. E.g. "thug" gets
"gangster" which seems pretty fitting. "girls" gets "female friends", which is
semantically very close.

------
Palomides
is it just me or does it not search when you hit enter?

~~~
syllogism
The server logs say everything's working, so I hope not! Are you still having
trouble?

Sometimes I get tricked when I enter a query that has the same top result as
the one that's currently displayed. Then it looks like the results haven't
changed, but further down the list, they have.

~~~
ssalenik
Nothing happens for me when I press enter either; I have to click on the
search icon (Firefox on Linux).

~~~
syllogism
Working on this. Thanks.

Edit: We have this fixed but we're reluctant to roll it out. Two days ago when
we did an AWS deploy, they replaced healthy machines with unhealthy ones, and
we had an hour of outage.

If you want to fix this on the client, I think the following quick fix works.
At the bottom of the sense2vec script
([https://sense2vec.spacy.io/js/sense2vec.js](https://sense2vec.spacy.io/js/sense2vec.js)),
change:

    input.addEventListener('keydown', function() {
       if(event.keyCode == 13) run();
    });

to

    input.addEventListener('keydown', function(event) {
       if(event.keyCode == 13) run();
    });

~~~
nl
Also noting that linking URLs with the pipe symbol (eg
[https://sense2vec.spacy.io/?Autechre|PERSON](https://sense2vec.spacy.io/?Autechre|PERSON))
will result in the "|PERSON" being dropped from the search box.

------
cowsandmilk
tried "netflix and chill", didn't find anything? That to me is the phrase of
the year that is more than the sum of its parts.

~~~
syllogism
The underlying linguistics of a "phrase" here are sort of narrow. What we did
is retokenize the text so that entities and basic noun phrases are merged into
a single token. "Netflix and chill" is analysed as multiple tokens, so it
doesn't come up as a query result.
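A toy stand-in for that retokenization step, using a fixed phrase list instead of the parser's entities and noun chunks (which is what the real pipeline merges):

```python
# Merge known multi-word phrases into single underscore-joined tokens.
PHRASES = {("dank", "meme"), ("natural", "language", "processing")}

def merge_phrases(tokens, phrases, max_len=3):
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 1, -1):  # prefer the longest match
            chunk = tuple(tokens[i:i + n])
            if len(chunk) == n and chunk in phrases:
                out.append("_".join(chunk))
                i += n
                break
        else:  # no phrase starts here; keep the single token
            out.append(tokens[i])
            i += 1
    return out

merged = merge_phrases("what a dank meme".split(), PHRASES)
# merged == ['what', 'a', 'dank_meme']

# "netflix and chill" spans a conjunction, so no noun chunker would
# merge it, and it never becomes a single queryable token:
kept = merge_phrases("netflix and chill".split(), PHRASES)
# kept == ['netflix', 'and', 'chill']
```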

------
IshKebab
This is really cool. The next logical thing is to extend it to allow multiple
meanings for the same word with the same type. For example " _lead_ is a heavy
element" vs "I own a dog _lead_ ". Not sure how you'd do that without
explicitly giving each unique word an ID and manually annotating the training
data.

------
danso
Great to see another use case of spacy...I've been wanting to give it a go
ever since OP pitched it as an alternative to NLTK.

------
natch
Looking at the blog post, I wonder why spaCy thinks Barack Obama is spelled
Barrack Obama. Maybe it forces words to a more canonical spelling, regardless
of context? Edit: or maybe somebody slipped and inserted a stray keystroke
when editing the blog post?

------
PeCaN
Interesting side effect: it has learned to give somewhat decent anime
recommendations. At the very least it recognizes some basic genres to an
extent (for example, TTGL is close to Code Geass and Kill la Kill, while
Kokoro Connect is close to Clannad and Toradora).

~~~
1ris
Try "Haruhi" and "Madoka". IMHO impressive.

~~~
PeCaN
It does characters pretty well too. For most popular series that I tried, with
the weird exception of Gintama, searching any of the main characters would
pull up the rest of the main cast as highly related.

It seems to have learned the relationship between characters and shows, though
it always ranks the Monogatari characters as relatively close to any character
search.

------
finishingmove
Searched for "Best Iron Maiden album" and it was smart enough not to pick one.
Then I tried "Seinfeld" and it returned Seinfeld as the 10th result behind
some other sitcoms, including IT Crowd (that was an oddball!) :)

~~~
finishingmove
Interesting, the result seems to be for Jerry Seinfeld. If I try
Seinfeld|WORK_OF_ART, I get a bunch of sitcoms but not Seinfeld.

------
Houshalter
Weirdly, their dataset doesn't include the word "thirteen" or "13", and it's
not a matter of excluding numbers: they have "twelve", "12", "14", and
"fourteen".

------
soft_dev_person
Cool. It seems to distinguish words with and without a trailing period for
some reason ("MySQL" and "MySQL."), which seems a bit too specific. Perhaps
there's a good reason to, though.

~~~
syllogism
The reason isn't good --- it's just a tokenization problem.

------
biot
Any chance you'll also have a Markov chain generator? Bonus points if you can
pick your subreddit, give it a starting topic, and let 'er rip.

~~~
rangibaby
I think that's exactly what this does:
[https://www.reddit.com/r/SubredditSimulator/](https://www.reddit.com/r/SubredditSimulator/)

------
andrewchambers
I think the server just returns nothing when it is overloaded.

I searched "dank memes" and got "steel beems". Other searches all failed.

------
beyondcompute
[http://imgur.com/2S0eCnY](http://imgur.com/2S0eCnY)

------
Kiro
This is awesome! I would pay for an API.

------
bawana
FOUND NOTHING for 'dogs f--king', where the dashes need to be substituted for
the obvious letters.

------
the_cat_kittles
Just a random hilarious oddity: I wonder why it associates Ricky Rubio with
Pokemon?
[https://sense2vec.spacy.io/?Ricky_Rubio%7CPERSON](https://sense2vec.spacy.io/?Ricky_Rubio%7CPERSON)

edit: and now it doesn't?

------
drakmail
It crashes on "/r/AsianFetish" phrase with 500 error :-)

------
rocky1138
Searched for something NSFW and didn't get any results :(

~~~
xyzzy123
Looks like the parser got confused by the NSFW link tag.

------
gleb
Doesn't work as advertised for [Obama]

~~~
syllogism
?

[https://sense2vec.spacy.io/?Barack_Obama%7CPERSON](https://sense2vec.spacy.io/?Barack_Obama%7CPERSON)

[https://sense2vec.spacy.io/?Obama%7CPERSON](https://sense2vec.spacy.io/?Obama%7CPERSON)

~~~
gleb
I get:
[https://sense2vec.spacy.io/?Obama%7CORG](https://sense2vec.spacy.io/?Obama%7CORG)

First result is GOP

~~~
nl
It is showing things which are used in English in the same way as "Obama" when
the word "Obama" is used as an organization.

Clearly "Obama" isn't an organization, but it does a decent job. Consider the
sentence "Obama’s unsolicited advice that Congress should “go ahead and vote”
has only hardened the GOP majority’s resistance."

