Show HN: Sense2vec model trained on all 2015 Reddit comments (spacy.io)
225 points by syllogism on Feb 15, 2016 | hide | past | favorite | 79 comments



Accompanying blog post is at https://spacy.io/blog/sense2vec-with-spacy


As someone who's tried to quantify the Reddit hivemind by analyzing word usage (http://minimaxir.com/2015/10/reddit-topwords/), I think the spaCy semantic approach is a much more robust way to analyze the Reddit corpus.

However, I'm not sure I agree with the use of phrase similarity as an indicator of Reddit hivemind behavior, which was discussed in the accompanying blog post. The tool is more an indicator of the writing styles of Reddit's primary demographic (male, 18-30) and of which phrases co-occur, rather than a weighting of how important given phrases are to Reddit discussion.


Word vectors are really intriguing to me, and I have a few questions:

If you also trained a model on a different news aggregator's comments, would it be possible to "match up" the meaning vector spaces and see differences in what meanings each community ascribes to words?

Additionally, could you determine sentiments from the positions of the words in the meaning space? One example: looking for underlying assumptions, like associations between the words 'black person' and 'criminal'. Or idk, man + sex = player, but woman + sex = slut.

Would it be possible to go higher level and see how much a corpus agrees with "free markets are good" based on its word positions?

It seems like word2vec has the potential to bring Sapir-Whorf to a whole new level.
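
I'd guess the association question could be probed directly with cosine similarity between token vectors. A minimal sketch with gensim, assuming the vectors were exported to word2vec text format with "word|SENSE" keys (the path and key names here are hypothetical):

    from gensim.models import KeyedVectors

    # Hypothetical export of the trained vectors to word2vec text format
    vectors = KeyedVectors.load_word2vec_format("reddit_vectors.txt")

    # Higher cosine similarity = the corpus uses the tokens in more
    # similar contexts, which is one crude measure of association.
    print(vectors.similarity("black_person|NOUN", "criminal|NOUN"))
    print(vectors.similarity("white_person|NOUN", "criminal|NOUN"))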


Let's say we want to know how usage differs in one subreddit vs another. If you just train two entirely separate models, you end up with two vector spaces. The meanings within each space are entirely relative --- the absolute positions obviously aren't significant. You can try to learn a mapping, but the transform is not necessarily linear. (Interesting empirical question there...)
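
For illustration, fitting a linear map on shared anchor words with least squares might look like this (a rough numpy sketch on toy stand-in vectors; as said, the true transform may well not be linear):

    import numpy as np

    # Toy stand-ins for two separately trained models (word -> vector)
    rng = np.random.default_rng(0)
    vocab = ["combat", "strategy", "empire"]
    space_a = {w: rng.normal(size=50) for w in vocab}
    space_b = {w: rng.normal(size=50) for w in vocab}
    anchors = vocab  # words shared by both corpora

    X = np.stack([space_a[w] for w in anchors])
    Y = np.stack([space_b[w] for w in anchors])

    # Least-squares linear map W such that X @ W ~= Y
    W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)

    # Project a word from space A into space B's coordinates to compare usage
    projected = space_a["combat"] @ W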

What you need is a 'seam' that connects the two vector spaces. I would do it like this.

Train a single model, with the words decorated by their subreddit. So you have combat:/r/gaming and combat:/r/history as different tokens. Then you have shared tokens which aren't decorated in this way.
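
A rough sketch of that setup with gensim's Word2Vec, on a toy corpus (in practice you'd choose the shared, undecorated vocabulary more carefully, e.g. by frequency):

    from gensim.models import Word2Vec

    SHARED = {"the", "a", "of", "is", "and"}  # tokens left undecorated

    def decorate(tokens, subreddit):
        # Subreddit-specific tokens get a suffix; common glue words stay shared
        return [t if t in SHARED else f"{t}:{subreddit}" for t in tokens]

    corpus = {  # toy stand-in: subreddit -> tokenized comments
        "/r/gaming": [["the", "combat", "is", "clunky"]],
        "/r/history": [["the", "combat", "of", "agincourt", "is", "studied"]],
    }
    sentences = [decorate(toks, sub) for sub, docs in corpus.items() for toks in docs]

    # One model, one vector space: combat:/r/gaming and combat:/r/history are
    # distinct tokens, anchored by the shared undecorated ones.
    model = Word2Vec(sentences, vector_size=32, window=5, min_count=1)
    print(model.wv.most_similar("combat:/r/gaming"))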


There's a fantastic blog post that does this with male and female ratemyprof reviews:

http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-th...


Now can't you experiment with algebraic operations on those vectors, like "queen minus female plus male = king"?


This is supported in the underlying data, but we don't expose it in the UI at the moment. We'll probably do something a bit different to help people make those queries.
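
In the meantime you can run such queries offline against the downloaded model. A gensim-style sketch, assuming the vectors have been converted to word2vec text format first (conversion not shown; path is hypothetical):

    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("reddit_vectors.txt")

    # queen - female + male ~= king, in "word|SENSE" key terms
    print(vectors.most_similar(positive=["queen|NOUN", "male|NOUN"],
                               negative=["female|NOUN"], topn=5))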


Would it be possible for you to make the trained vectors available for download and independent analysis without having to rescrape and retrain the model?


Maybe it wasn't clear, but we open-sourced everything, including the trained model. See here:

https://github.com/spacy-io/sense2vec

For installation:

$ pip install -e git+git://github.com/spacy-io/sense2vec.git#egg=sense2vec

$ python -m sense2vec.download

This downloads and installs the sense2vec package and model we used in the demo.

We are currently working on finishing the PyPI package. Meanwhile only Linux is supported and the docs are pretty much non-existent. Also, please make sure you have a recent BLAS/ATLAS package installed (RedHat: atlas, atlas-devel).

Direct link to the model file (~600MB):

https://index.spacy.io/models/reddit_vectors-1.0.1/archive.g...
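
Once installed, querying the model from Python looks roughly like this (per the blog post; the API may still change while we finish the package):

    import sense2vec

    model = sense2vec.load()
    freq, query_vector = model["natural_language_processing|NOUN"]
    print(model.most_similar(query_vector, n=3))
    # e.g. [('natural_language_processing|NOUN', 1.0),
    #       ('machine_learning|NOUN', 0.8987), ...]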


FYI there are Reddit data dumps publicly available, no need to hammer Reddit by scraping it:

https://archive.org/details/2015_reddit_comments_corpus


Awesome.


This is very cool. I have played with the word2vec download a lot and I have been sorta amazed. If you had a lot of money... like Google... it would be cool to train word vectors on the whole internet.

Something like this definitely makes WordNet obsolete. :-( http://wordnetweb.princeton.edu/perl/webwn


word2vec embeddings tend to be improved when you add WordNet (Faruqui et al., 2015): https://www.cs.cmu.edu/~hovy/papers/15HLT-retrofitting-word-...
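
The retrofitting procedure itself is short: hold the original vectors as anchors and repeatedly average each word with its lexicon neighbours. A condensed numpy sketch of the paper's update rule (alpha = 1, beta = 1/degree, their defaults; toy inputs):

    import numpy as np

    def retrofit(vectors, lexicon, iterations=10):
        # vectors: word -> original embedding (kept fixed as the anchor)
        # lexicon: word -> semantically linked words (e.g. WordNet synonyms)
        new = {w: v.copy() for w, v in vectors.items()}
        for _ in range(iterations):
            for w, neighbours in lexicon.items():
                nbrs = [n for n in neighbours if n in new]
                if w not in new or not nbrs:
                    continue
                # q_i = (q_hat_i + mean of neighbour vectors) / 2
                new[w] = (vectors[w] + sum(new[n] for n in nbrs) / len(nbrs)) / 2.0
        return new

    vecs = {"happy": np.array([1.0, 0.0]), "glad": np.array([0.0, 1.0])}
    print(retrofit(vecs, {"happy": ["glad"], "glad": ["happy"]})["happy"])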


I don't think WordNet (and similar resources) will ever be obsolete. From my point of view, they are a "curated" subset of the semantics you can obtain with word2vec. The latter can fail, even on very trivial cases (I discussed geography in another comment), and curated, structured knowledge bases such as WordNet help overcome the limitations of any machine learning algorithm.


I wouldn't go as far as that; WordNet has other interesting uses (and its Prolog version makes for a cool "query" language).


It's impressive how good it is.

But there is still room for improvement. Searching "Haskell" leads to Clojure and C++. Yes, these are both programming languages, but of all the options I personally wouldn't have said C++. :)

"Scheme" leads to Haskell, which is very fitting, and Prolog, which seems to fit as a "university language" as well.

For "Agda" it outputs "typeclasses" at 74%. This is a much discussed topic. But for a "truly" semantic understanding it should know that both "Agda" and "Haskell" are in the same category, and that "typeclasses" is a property that elements in this category have or don't have.

Still, very impressive. But not the singularity yet.


I would have said C++ because the template system is a functional programming language.


https://sense2vec.spacy.io/?Agda|NOUN is enough of a hint for it to rank Haskell (and OCaml) above "dependent types"


The first name to show up when I search 'evil man' is 'Mr Rogers'...what the hell, reddit.

(first name, not first word, had to scroll down a bit, past words like 'Nazi soldier' and 'evildoer')


I was half-expecting this but Comcast is associated with Swastika and Nazi.


[flagged]


....now we know


The band names are particularly cool. It is sorta like a recommendation engine. If you like Pink Floyd, you might like Led Zeppelin.


Pretty good at picking up the Smash Bros jargon:

https://sense2vec.spacy.io/?dash_dance|NOUN



NSFW|ADJ does work, though. Maybe NSFW is a reserved word of some kind?


Try RTFM, it doesn't work either!


Try ftfy


Submitting 'NSFW' (without quotes) will display mangled URLs.


Not necessarily a bug on their end: those URLs are actually from a bot, AutoWikibot, which uses the word "NSFW" in every post, so all of its formatting is highly correlated with it: https://www.reddit.com/user/AutoWikibot

The last line (Parent commenter can toggle NSFW or delete...) has this formatting:

    ^Parent ^commenter ^can [^toggle ^NSFW](/message/compose?to=autowikibot&subject=AutoWikibot NSFW toggle&message=%2Btoggle-nsfw+ct3omf8) ^or[](#or) [^delete](/message/compose?to=autowikibot&subject=AutoWikibot Deletion&message=%2Bdelete+ct3omf8)^. ^Will ^also ^delete ^on ^comment ^score ^of ^-1 ^or ^less. ^| [^(FAQs)](/r/autowikibot/wiki/index) ^| [^Mods](/r/autowikibot/comments/1x013o/for_moderators_switches_commands_and_css/) ^| [^Call ^Me](/r/autowikibot/comments/1ux484/ask_wikibot/)
You can see how that formatting might fuck up their nice natural language processor and tokenizer.


Very cool!

I've been working for a while on extracting "semantics" from raw text (mostly news). One of the big limits of word2vec is that semantics are assumed to track word proximity in sentences, which works in some cases, but not always.

I'll give one example: travel/geography. If you query something like Italy [1], the results are other European states. But if you're looking for news about Italy, or to plan a vacation to Italy, or for some Italian food... or anything related to Italy itself, you probably don't expect "Spain" to be the first result.

It would be nice to have an easy way in word2vec to define domains and how they relate to word proximity in sentences, to overcome situations like this one.

[1] https://sense2vec.spacy.io/?Italy%7CGPE


Don't think of Word2Vec (etc) as a search - it's a different thing.

It's more like a recommender: If I like Pizza in Napoli where should I go in Spain and what should I eat there?

Italy:Pizza -> Spain:?

Italy:Napoli -> Spain:?


Well, you can't do this in the web API obviously, but there are a few ways you could do this. This is a good idea really -- thanks for bringing this up.

One way would be to predefine the entity types or tags that you want to get in your results. So you could ask for things like Italy that are nouns.

The other way is to use the vector space. The classic demonstration of this is the arithmetic, doing something like "Italy|GPE - *|GPE + food". My results with this have been very mixed. I wouldn't expect the query above to work.

I would think you'd have more luck specifying the query as a combination of constraints: first query for foods in some way, and then sort them by distance from Italy.
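
In gensim-style terms, that two-step query might look like this (hypothetical export path and key names):

    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("reddit_vectors.txt")

    # Step 1: gather food candidates, e.g. the neighbourhood of a seed term
    foods = [w for w, _ in vectors.most_similar("pizza|NOUN", topn=200)]

    # Step 2: re-rank the candidates by similarity to Italy
    foods.sort(key=lambda w: vectors.similarity(w, "Italy|GPE"), reverse=True)
    print(foods[:10])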


Tried something similar a while back, only using Spark's Word2Vec implementation, though just looking at individual organizations as used in different subreddits. It is surprising how far Word2Vec can take you in deriving word similarity.

Used POS tagging in a previous post, though not with Word2Vec, since I wasn't sure whether distinctions like duck-the-verb vs. duck-the-noun would improve the result: the sentence positions of verbs and nouns already differ. Though it's certainly an interesting approach, and I'm wondering if going backwards might yield better results for the POS tagger as well, since verb and noun senses would span disparate word clusters.

1) http://dbunker.github.io/2016/01/05/spark-word2vec-on-reddit...


Strange, I enter "Harry Kane" and it shows me various footballers (none of whom are on the same team as him). But then I enter "Jamie Vardy" and there are no results whatsoever, even though he's been HUGE on /r/soccer in 2015.


Hmm. Interesting. I wonder whether we're missing some data. I didn't do so much to verify that. Thanks.


Wow, didn't know the term "dank" before. Although Urban Dictionary says it is a term used by stoners and hippies. ;)


Oh this is fun if you look up politically charged words like 'thug' or even something that should be pretty harmless like 'girls'.


I don't see anything surprising about those searches. E.g. "thug" gets "gangster" which seems pretty fitting. "girls" gets "female friends", which is semantically very close.


Is it just me, or does it not search when you hit enter?


The server logs say everything's working, so I hope not! Are you still having trouble?

Sometimes I get tricked when I enter a query that has the same top result as the one that's currently displayed. Then it looks like the results haven't changed, but further down the list, they have.


Same problem here, on the latest Firefox on Arch Linux. It searches properly when you hit the search button, but it does not respond to hitting enter on the keyboard.

It also does not find anything for "duck sized horses" or "horse sized duck".


Nothing happens for me when I press enter either; I have to click the search icon (Firefox on Linux).


Working on this. Thanks.

Edit: We have this fixed but we're reluctant to roll it out. Two days ago when we did an AWS deploy, they replaced healthy machines with unhealthy ones, and we had an hour of outage.

If you want to fix this on the client, I think the following quick fix works. At the bottom of the sense2vec script https://sense2vec.spacy.io/js/sense2vec.js , change:

    input.addEventListener('keydown', function() {
       if(event.keyCode == 13) run();
    });
to

    input.addEventListener('keydown', function(event) {
       // Name the event parameter: Firefox has no global window.event,
       // so the unnamed version above only worked in Chrome.
       if(event.keyCode == 13) run();
    });


Also noting that linking URLs with the pipe symbol (e.g. https://sense2vec.spacy.io/?Autechre|PERSON) will result in the "|PERSON" being dropped from the search box.


It's not just you, I need to click the search icon.


Tried "netflix and chill", didn't find anything? To me that's the phrase of the year that is more than the sum of its parts.


The underlying linguistics of a "phrase" here are sort of narrow. What we did is retokenize the text so that entities and basic noun phrases are merged into a single token. "Netflix and chill" is analysed as multiple tokens, so it doesn't come up as a query result.
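
For the curious, the merge step looks roughly like this with spaCy's retokenizer API (a sketch; the demo's actual preprocessing lives in the sense2vec repo and differs in detail):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Natural language processing is fun.")

    # Merge base noun phrases into single tokens before training vectors
    with doc.retokenize() as retokenizer:
        for chunk in list(doc.noun_chunks):
            retokenizer.merge(chunk)

    # Keys come out as "token|SENSE", e.g. natural_language_processing|NOUN
    print([t.text.lower().replace(" ", "_") + "|" + t.pos_ for t in doc])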


I tried "rick and morty" and got the same issue. My guess is it has something to do with the term "and."


This is really cool. The next logical thing is to extend it to allow multiple meanings for the same word with the same type. For example "lead is a heavy element" vs "I own a dog lead". Not sure how you'd do that without explicitly giving each unique word an ID and manually annotating the training data.


Great to see another use case of spaCy... I've been wanting to give it a go ever since OP pitched it as an alternative to NLTK.


Looking at the blog post, I wonder why spaCy thinks Barack Obama is spelled Barrack Obama. Maybe it forces words to a more common spelling, regardless of context? Edit: or maybe somebody slipped and inserted a stray keystroke when editing the blog post?


Interesting side effect: it has learned to give somewhat decent anime recommendations. At the very least it recognizes some basic genres to an extent (for example, TTGL is close to Code Geass and Kill la Kill, while Kokoro Connect is close to Clannad and Toradora).


Yup. Mention Cowboy Bebop and it goes to Samurai Champloo. I hated Champloo and loved Bebop, but they are obviously very much in the same style.


Try "Haruhi" and "Madoka". IMHO impressive.


It does characters pretty well too. For most popular series that I tried, with the weird exception of Gintama, searching any of the main characters would pull up the rest of the main cast as highly related.

It seems to have learned the relationship between characters and shows, though it always ranks the Monogatari characters as relatively close to any character search.


Searched for "Best Iron Maiden album" and it was smart enough not to pick one. Then I tried "Seinfeld" and it returned Seinfeld as the 10th result behind some other sitcoms, including IT Crowd (that was an oddball!) :)


Interesting, the result seems to be for Jerry Seinfeld. If I try Seinfeld|WORK_OF_ART, I get a bunch of sitcoms but not Seinfeld.


Weirdly, their dataset includes neither the word "thirteen" nor "13". And it's not a matter of excluding numbers: they have "twelve", "12", "14", and "fourteen".


Cool. It seems to distinguish words with and without a trailing period for some reason ("MySQL" and "MySQL."), which seems a bit too specific. Perhaps there's a good reason to, though.


The reason isn't good --- it's just a tokenization problem.


Any chance you'll also have a Markov chain generator? Bonus points if you can pick your subreddit, give it a starting topic, and let 'er rip.


I think that's exactly what this does: https://www.reddit.com/r/SubredditSimulator/
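
The core of a generator like that is tiny. A toy bigram sketch (SubredditSimulator's actual implementation is more involved, and output quality depends entirely on the corpus you feed it):

    import random
    from collections import defaultdict

    def build_chain(tokens):
        # Map each word to the list of words observed after it
        chain = defaultdict(list)
        for a, b in zip(tokens, tokens[1:]):
            chain[a].append(b)
        return chain

    def generate(chain, start, length=20):
        out = [start]
        for _ in range(length):
            followers = chain.get(out[-1])
            if not followers:
                break
            out.append(random.choice(followers))
        return " ".join(out)

    corpus = "the quick brown fox jumps over the lazy dog and the quick dog naps".split()
    print(generate(build_chain(corpus), "the"))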


I think the server just returns nothing when it is overloaded.

I searched "dank memes" and got "steel beems". Other searches all failed.



This is awesome! I would pay for an API.


FOUND NOTHING for 'dogs f--king', where the dashes need to be substituted with the obvious letters.


Just a random hilarious oddity: I wonder why it associates Ricky Rubio with Pokemon? https://sense2vec.spacy.io/?Ricky_Rubio%7CPERSON

edit: and now it doesn't?


It crashes on the phrase "/r/AsianFetish" with a 500 error :-)


Searched for something NSFW and didn't get any results :(


Looks like the parser got confused by the NSFW link tag.


Doesn't work as advertised for [Obama]



I get: https://sense2vec.spacy.io/?Obama%7CORG

First result is GOP


It is showing things which are used in English in the same way as "Obama" when the word "Obama" is used as an organization.

Clearly "Obama" isn't an organization, but it does a decent job. Consider the sentence "Obama’s unsolicited advice that Congress should “go ahead and vote” has only hardened the GOP majority’s resistance."


Try searching for Obama|PERSON



