As someone who's tried to quantify the Reddit hivemind by analyzing word usage (http://minimaxir.com/2015/10/reddit-topwords/), I think the spaCy semantic approach is a much more robust way to analyze the Reddit data corpus.
However, I'm not sure I agree with the use of phrase similarity as an indicator of Reddit hivemind behavior, as discussed in the accompanying blog post. The tool is more an indicator of the writing styles of Reddit's primary demographic (male, 18-30) and of phrases that co-occur, rather than a weighting of how important given phrases are to Reddit discussion.
Word vectors are really intriguing to me, and I have a few questions:
If you also trained a model on a different news aggregator's comments, would it be possible to "match up" the meaning vector spaces and see differences in what meanings each community ascribes to words?
Additionally, could you determine sentiments from the positions of the words in the meaning space?
One example: looking for underlying assumptions, like an association between the words 'black person' and 'criminal'. Or, say, man + sex = player, but woman + sex = slut.
Would it be possible to go higher level and see how much a corpus agrees with "free markets are good" based on its word positions?
It seems like word2vec has the potential to bring Sapir-Whorf to a whole new level.
Let's say we want to know how usage differs in one subreddit vs another. If you just train two entirely separate models, you end up with two vector spaces. The meanings within each space are entirely relative --- the absolute positions obviously aren't significant. You can try to learn a mapping, but the transform is not necessarily linear. (Interesting empirical question there...)
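Since each space is only defined up to its own coordinate system, the simplest thing to try is a linear "translation matrix" fitted by least squares on words both corpora share. Everything below is a toy sketch (made-up anchor words, tiny dimensions, and a space B manufactured as a rotation of A), and as noted above the true transform between two independently trained spaces may well not be linear:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for two independently trained embedding spaces.
# In practice A and B would hold the vectors of shared anchor words
# from two word2vec models; here B is constructed as a rotation of A
# plus a little noise, so a linear fit is known to be achievable.
anchor_words = ["food", "travel", "music", "game", "film"]
A = rng.normal(size=(len(anchor_words), 4))          # anchors in space A
rotation = np.linalg.qr(rng.normal(size=(4, 4)))[0]  # arbitrary linear map
B = A @ rotation + rng.normal(scale=0.01, size=A.shape)

# Fit W minimizing ||A @ W - B||^2 over the shared anchors.
# Vectors from space A can then be projected into B's coordinates
# as vec_a @ W and compared against space B's neighbours.
W, *_ = np.linalg.lstsq(A, B, rcond=None)
residual = np.linalg.norm(A @ W - B)
```

If the residual stays large on held-out shared words, that's evidence the mapping between the two communities' spaces isn't linear, which is the interesting empirical question.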
What you need is a 'seam' that connects the two vector spaces. I would do it like this.
Train a single model, with the words decorated by their subreddit. So you have combat:/r/gaming and combat:/r/history as different tokens. Then you have shared tokens which aren't decorated in this way.
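A minimal sketch of that preprocessing step. The `:/r/...` decoration follows the scheme described above; the choice of which tokens stay shared (undecorated) is a modelling decision, illustrated here with a small stoplist of function words:

```python
def decorate(tokens, subreddit, shared):
    """Tag tokens with their subreddit so a single model learns
    per-community vectors, while the undecorated shared tokens act
    as the 'seam' anchoring both communities in one space."""
    return [
        tok if tok in shared else f"{tok}:{subreddit}"
        for tok in tokens
    ]

# Which tokens stay shared is an assumption; high-frequency function
# words are one plausible choice.
SHARED = {"the", "a", "of", "is", "in"}

decorate(["the", "combat", "is", "fun"], "/r/gaming", SHARED)
# -> ["the", "combat:/r/gaming", "is", "fun:/r/gaming"]
```

The decorated corpus can then be fed to any word2vec-style trainer, and `combat:/r/gaming` vs. `combat:/r/history` compared directly within the one vector space.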
This is supported in the underlying data, but we don't expose it in the UI at the moment. We'll probably do something a bit different to help people make those queries.
Would it be possible for you to make the trained vectors available for download and independent analysis without having to rescrape and retrain the model?
This downloads and installs the sense2vec package and model we used in the demo.
We are currently working on finishing the PyPI package. Meanwhile only Linux is supported and the docs are pretty much non-existent. Also, please make sure you have a recent BLAS/ATLAS package installed (RedHat: atlas, atlas-devel).
This is very cool. I have played with the word2vec download a lot and I have been sorta amazed. If you had a lot of money... like Google... it would be cool to train word vectors on the whole internet.
I don't think WordNet (and similar resources) will ever be obsolete. From my point of view, they are a "curated" subset of the semantics you can obtain with word2vec. The latter can fail, even in very trivial cases (I discussed geography in another comment), and curated, structured knowledge bases such as WordNet help overcome the limitations of any machine learning algorithm.
But there is still room for improvement. Searching "Haskell" leads to Clojure and C++. Yes, these are both programming languages, but of all the options I personally wouldn't have said C++. :)
"Scheme" leads to Haskell, which is very fitting, and Prolog, which seems to fit as a "university language" as well.
For "Agda" it outputs "typeclasses" at 74%. That is a much-discussed topic. But for a "truly" semantic understanding it should know that both "Agda" and "Haskell" are in the same category, and that "typeclasses" is a property that elements of this category either have or don't have.
Still, very impressive. But not the singularity yet.
Not necessarily a bug on their end: those URLs are actually from a bot, AutoWikibot, which uses the word "NSFW" in every post, so all of its formatting is highly correlated with it: https://www.reddit.com/user/AutoWikibot
The last line (Parent commenter can toggle NSFW or delete...) has this formatting:
I've been working for a while on extracting "semantics" from raw text (mostly news). One of the big limits of word2vec is that it assumes semantics correlate with word proximity in sentences, which works in some cases, but not always.
I'll give one example: travel/geography. If you query something like Italy [1], the results are other European states. But if you're looking for news about Italy, or planning a vacation in Italy, or searching for Italian food... or anything related to Italy itself, you probably don't expect "Spain" to be the first result.
It would be nice to have some easy way in word2vec to define domains and their relationship with word proximity in sentences, to overcome situations like this one.
Well, you can't do this in the web API obviously, but there are a few ways you could do this. This is a good idea really -- thanks for bringing this up.
One way would be to predefine the entity types or tags that you want to get in your results. So you could ask for things like Italy that are nouns.
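That kind of constraint can be sketched as a post-filter on the result list, assuming each result key carries a sense tag such as "Italy|GPE" (the query syntax used in this thread; the actual storage format inside the demo is a guess):

```python
def filter_by_tag(results, tag):
    """Keep only (key, score) results whose key ends with the given
    sense tag, e.g. '|GPE' for places or '|NOUN' for plain nouns.
    The 'word|TAG' key format is an assumption based on the demo's
    query syntax."""
    suffix = "|" + tag
    return [(key, score) for key, score in results if key.endswith(suffix)]

neighbours = [("Spain|GPE", 0.88), ("France|GPE", 0.85), ("pasta|NOUN", 0.61)]
filter_by_tag(neighbours, "NOUN")
# -> [("pasta|NOUN", 0.61)]
```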
The other way is to use the vector space. The classic demonstration of this is the arithmetic, something like "Italy|GPE - *|GPE + food". My results with this have been very mixed. I wouldn't expect the query above to work.
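As a toy illustration of that arithmetic with made-up vectors (real vectors would come from the trained model, and the "*|GPE" term is approximated here as a mean over country vectors, which is one possible reading of it):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 8-dimensional vectors standing in for trained embeddings.
rng = np.random.default_rng(1)
vocab = {w: rng.normal(size=8)
         for w in ["Italy|GPE", "Spain|GPE", "food|NOUN", "pasta|NOUN"]}

# 'Italy|GPE - *|GPE + food': subtract a generic country direction
# (here, the mean of the country vectors), add the food direction,
# then rank the whole vocabulary against the resulting query vector.
country_mean = (vocab["Italy|GPE"] + vocab["Spain|GPE"]) / 2
query = vocab["Italy|GPE"] - country_mean + vocab["food|NOUN"]

ranked = sorted(vocab, key=lambda w: cosine(vocab[w], query), reverse=True)
```

With random toy vectors the ranking is meaningless; the point is only the shape of the computation, and even with real vectors the results are, as said, mixed.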
I would think you'd have more luck specifying the query as a combination of constraints: first query for foods in some way, and then sort them by distance from Italy.
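A sketch of that two-step version: take a candidate set from a first query (hard-coded here as an assumption; in practice it might be the neighbours of "food|NOUN"), then re-rank the candidates by similarity to Italy:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors standing in for the trained model.
rng = np.random.default_rng(2)
words = ["Italy|GPE", "pasta|NOUN", "sushi|NOUN", "taco|NOUN"]
vocab = {w: rng.normal(size=8) for w in words}

# Step 1 (assumed done elsewhere): a query for foods produced these.
candidates = ["pasta|NOUN", "sushi|NOUN", "taco|NOUN"]

# Step 2: apply the second constraint by sorting the food candidates
# by closeness to the Italy vector.
ranked = sorted(candidates,
                key=lambda w: cosine(vocab[w], vocab["Italy|GPE"]),
                reverse=True)
```

Separating the constraints this way avoids relying on the arithmetic landing in a meaningful region of the space: each step only ever asks the model a question it is known to answer well.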
I tried something similar a while back, using Spark's Word2Vec implementation, though only looking at individual organizations as used in different subreddits. It is surprising how far Word2Vec can take you in deriving word similarity.
I used POS tagging in a previous post, though not with Word2Vec, since I wasn't sure whether distinctions like duck-the-verb vs. duck-the-noun would improve the results, given that verbs and nouns already tend to occupy different positions. It's certainly an interesting approach, though; I wonder whether going in the other direction might also improve the POS tagger, since verb and noun senses would span disparate word clusters.
Strange, I enter "Harry Kane" and it shows me various footballers (none of which are on the same team as him). But then I enter "Jamie Vardy" and there are no results whatsoever, even though he's been HUGE on /r/soccer in 2015.
I don't see anything surprising about those searches. E.g. "thug" gets "gangster" which seems pretty fitting. "girls" gets "female friends", which is semantically very close.
The server logs say everything's working, so I hope not! Are you still having trouble?
Sometimes I get tricked when I enter a query that has the same top result as the one that's currently displayed. Then it looks like the results haven't changed, but further down the list, they have.
Same problem here, on the latest Firefox on Arch Linux. It searches properly when you hit the search button, but it does not respond to hitting Enter on the keyboard.
It also does not find anything for "duck sized horses" or "horse sized duck".
Edit: We have this fixed but we're reluctant to roll it out. Two days ago when we did an AWS deploy, they replaced healthy machines with unhealthy ones, and we had an hour of outage.
If you want to fix this on the client, I think the following quick fix works. At the bottom of the sense2vec script https://sense2vec.spacy.io/js/sense2vec.js , change:
The underlying linguistics of a "phrase" here are sort of narrow. What we did is retokenize the text so that entities and basic noun phrases are merged into a single token. "Netflix and chill" is analysed as multiple tokens, so it doesn't come up as a query result.
This is really cool. The next logical thing is to extend it to allow multiple meanings for the same word with the same type. For example "lead is a heavy element" vs "I own a dog lead". Not sure how you'd do that without explicitly giving each unique word an ID and manually annotating the training data.
Looking at the blog post, I wonder why spaCy thinks Barack Obama is spelled Barrack Obama. Maybe it forces spellings to the most common variant, regardless of context? Edit: or maybe somebody slipped and inserted a stray keystroke when editing the blog post?
Interesting side effect: it has learned to give somewhat decent anime recommendations. At the very least it recognizes some basic genres to an extent (for example, TTGL is close to Code Geass and Kill la Kill, while Kokoro Connect is close to Clannad and Toradora).
It does characters pretty well too. For most popular series that I tried, with the weird exception of Gintama, searching any of the main characters would pull up the rest of the main cast as highly related.
It seems to have learned the relationship between characters and shows, though it always ranks the Monogatari characters as relatively close to any character search.
Searched for "Best Iron Maiden album" and it was smart enough not to pick one. Then I tried "Seinfeld" and it returned Seinfeld as the 10th result behind some other sitcoms, including IT Crowd (that was an oddball!) :)
Weirdly, their dataset doesn't include the word "thirteen" or "13". And it's not a matter of excluding numbers: they have "twelve", "12", "14", and "fourteen".
Cool. It seems to distinguish words with and without a trailing period for some reason ("MySQL" and "MySQL."), which seems a bit too specific. Perhaps there's a good reason to, though.
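If the goal were to merge those entries at query time, one workaround (a guess at the cause: the tokenizer may simply have kept sentence-final punctuation attached) is to strip trailing punctuation before lookup, falling back to the raw token:

```python
def normalize_query(token):
    """Strip trailing sentence punctuation ('.', '!', '?') before
    vocabulary lookup, falling back to the raw token if stripping
    leaves nothing (e.g. an ellipsis '...')."""
    stripped = token.rstrip(".!?")
    return stripped or token

normalize_query("MySQL.")
# -> "MySQL"
```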
It is showing things which are used in English in the same way as "Obama" when the word "Obama" is used as an organization.
Clearly "Obama" isn't an organization, but it does a decent job. Consider the sentence "Obama’s unsolicited advice that Congress should “go ahead and vote” has only hardened the GOP majority’s resistance."