For anyone interested in text analysis: PLEASE study and use this code and the referenced papers. Its importance is hard to overstate; it is far, far better than previous approaches to word analysis. These representations are the dimensional compression that occurs in the middle of a deep neural net, and the resulting vectors encode rich information about the semantics and usage patterns of each word in a very concise way.
We have barely scratched the surface of the applications of these distributed representations. This is a great time to get started in the field: previous techniques are almost totally obsoleted by this, so everyone is starting from the same point.
Do you / does anyone know if there is an easy way to use word2vec to compare the similarity of two different documents (think TF-IDF and cosine similarity)? It is stated on the page that "The linearity of the vector operations seems to weakly hold also for the addition of several vectors, so it is possible to add several word or phrase vectors to form representation of short sentences", but the referenced paper has not yet been published.
It would be super interesting if there were a simple way to compare the similarity of two documents using something like this.
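Can't speak for the unpublished paper, but the quoted trick is easy to prototype: average (or sum) the word vectors of each document and compare the results with cosine similarity. A rough sketch, assuming a recent version of the gensim library and a word2vec-format vectors.bin (both are my assumptions, not part of the released code):

    import numpy as np
    from gensim.models import KeyedVectors

    # Pretrained vectors in the binary format the word2vec tool writes out.
    vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    def doc_vector(text):
        # Average the vectors of the words the model actually knows.
        words = [w for w in text.lower().split() if w in vectors]
        return np.mean([vectors[w] for w in words], axis=0)

    def doc_similarity(a, b):
        # Cosine similarity between the two averaged document vectors.
        va, vb = doc_vector(a), doc_vector(b)
        return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

    print(doc_similarity("the cat sat on the mat", "a kitten rested on a rug"))

Averaging washes out word order entirely, so I'd only expect this to work for short texts, which matches the hedged claim in the quote.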
The value W(i,j) is the distance between words i and j, and row i of the matrix is the vector representation of word i. Research on word vectors is largely about computing W(i,j) in a way that is both efficient and useful in natural language processing applications.
Word vectors are often used to compute similarity between words: since each word is represented as a vector, we can take the cosine of the angle between a given pair of word vectors to find out how similar the two words are.
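As a toy illustration of that cosine measure (the three-dimensional vectors below are invented for the example; real embeddings have hundreds of dimensions):

    import numpy as np

    # Invented 3-dimensional "word vectors", purely for illustration.
    w = {
        "king":  np.array([0.90, 0.10, 0.40]),
        "queen": np.array([0.85, 0.15, 0.45]),
        "apple": np.array([0.10, 0.90, 0.20]),
    }

    def cosine(a, b):
        # Cosine of the angle between two vectors: 1 means same direction.
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(w["king"], w["queen"]))  # ~0.99: very similar
    print(cosine(w["king"], w["apple"]))  # ~0.28: not similar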
A Google search suggests that he was born in 1961.
I ran a few queries using the code and its default dataset, trying to use neutral words for subtraction: "mosquito -small +mountaineer", "mosquito -big +mountaineer", "mosquito -loud +mountaineer", "mosquito -normal +mountaineer", "mosquito -usual +mountaineer", "mosquito -air +mountaineer", "mosquito -nothing +mountaineer".
The most frequent words for these queries are:
But please write free and open source software. Otherwise no one can ever trust the software or consider it secure and safe.
For one project I'm working on, I was looking to compare the word/topic frequencies of a document against common usage. I'm sure something that does this already exists, as opposed to me doing it (poorly) from scratch.
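Roughly what I mean, as a crude from-scratch sketch in Python (the example texts, whitespace tokenizer, and smoothing constant are all arbitrary placeholders):

    from collections import Counter

    def rel_freqs(text):
        # Relative frequency of each word in a whitespace-tokenized text.
        words = text.lower().split()
        counts = Counter(words)
        return {w: c / len(words) for w, c in counts.items()}

    def overrepresented(doc, background, top=10):
        # Score each document word by how much more often it appears
        # than in the background text; 1e-6 keeps unseen words finite.
        doc_f, bg_f = rel_freqs(doc), rel_freqs(background)
        scores = {w: f / (bg_f.get(w, 0.0) + 1e-6) for w, f in doc_f.items()}
        return sorted(scores.items(), key=lambda kv: -kv[1])[:top]

    print(overrepresented("the mosquito net kept every mosquito out",
                          "the cat sat on the mat and the dog slept"))

A proper tool would use a large reference corpus for the background frequencies and a better scoring rule than a raw ratio, which is exactly why I'd rather not build it myself.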
I downloaded the code, but I don't understand how to get "queen" as a result. I tried submitting "king - man + woman", but the binary doesn't understand it.
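For what it's worth, the analogy can also be run in Python through the third-party gensim library, which reads the vectors.bin file the demo scripts produce (gensim is my assumption here, not part of the word2vec package itself):

    from gensim.models import KeyedVectors

    # Load the binary vectors produced by the word2vec training demo.
    vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    # vector("king") - vector("man") + vector("woman") should land near "queen".
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=5))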
Many different types of models were proposed for estimating continuous representations of words, including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). In this paper, we focus on distributed representations of words learned by neural networks, as it was previously shown that they perform significantly better than LSA for preserving linear regularities among words.
A good intuition as to why these kinds of regularities could even exist was given by Chris Quirk in a blog comment. Essentially, imagine that each word is approximately represented by the contexts it appears in; if so, swapping the contexts of other words in and out could indeed preserve some linguistic regularities.
I wrote an LSA implementation a few years ago for a BI product (written about at http://www.innoveerpunt.nl/interactief-innoveren/innoveerpun... , sorry that it's in Dutch :) ).
I wonder how well this works; my takeaway was that you need to tweak the internal thresholds and matrix sizes a lot to get optimal results, and the right settings are highly dependent on the datasets you use (something every LSA paper you'll read also makes very clear).
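For anyone curious what the moving parts look like, here's a minimal LSA sketch in Python using scikit-learn (my substitution for illustration; the product wasn't built on it). The n_components value is exactly the kind of knob I mean:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = [
        "word vectors encode semantics",
        "LSA factorizes a term-document matrix",
        "cosine similarity compares vector representations",
    ]

    tfidf = TfidfVectorizer().fit_transform(docs)  # term-document weight matrix
    lsa = TruncatedSVD(n_components=2)             # latent dimensionality to tune
    doc_topics = lsa.fit_transform(tfidf)          # documents in the latent space
    print(doc_topics)

With real data you would also tune the vectorizer's vocabulary cutoffs, and the best n_components can vary wildly from one corpus to the next.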