
word2vec – Tool for computing continuous distributed representations of words - linux_devil
https://code.google.com/p/word2vec/
======
jph00
I just gave an invited talk at KDD about deep learning in which I covered this
algorithm, so it's great to see this code appear now.

For anyone interested in text analysis: PLEASE study and use this code and the
referenced papers. Its importance is hard to overstate. It is far, far better
than all previous approaches to word analysis. These representations are the
dimensional compression that occurs in the middle of a deep neural net. The
resulting vectors encode rich information about the semantics and usage
pattern of each word in a very concise way.

We have barely scratched the surface of the applications of these distributed
representations. This is a great time to get started in this field - previous
techniques are almost totally obsoleted by this, so everyone is starting from
the same point.

~~~
mlla
I have previously used the Explicit Semantic Analysis (ESA) algorithm for
individual word similarity calculations. ESA uses the text of Wikipedia
entries and its ontology as its source, and it worked quite well.

Do you / does anyone know if there is an easy way to use word2vec to compare
similarities of two different documents (think of TF-IDF & cosine similarity)?
It is stated on the page that "The linearity of the vector operations seems to
weakly hold also for the addition of several vectors, so it is possible to add
several word or phrase vectors to form representation of short sentences [2]",
but the referenced paper has not yet been published.

It would be super interesting if there were a simple way to compare the
similarity of two documents using something like this.
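
One naive baseline would be to average the word vectors of each document and
compare the averages with cosine similarity. A minimal sketch of that idea,
assuming the trained vectors can be loaded with the third-party gensim
library ("vectors.bin" is just whatever path a local training run produced):

```python
import numpy as np
from gensim.models import KeyedVectors

# Load vectors trained by word2vec (binary format); the path here
# is an assumed output of a local training run.
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def doc_vector(tokens):
    """Average the vectors of all in-vocabulary tokens."""
    vecs = [vectors[w] for w in tokens if w in vectors]
    return np.mean(vecs, axis=0) if vecs else None

def doc_similarity(tokens_a, tokens_b):
    """Cosine similarity between two averaged document vectors."""
    a, b = doc_vector(tokens_a), doc_vector(tokens_b)
    if a is None or b is None:
        return 0.0
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(doc_similarity("the cat sat on the mat".split(),
                     "a kitten rested on the rug".split()))
```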

------
_m19m
Could anybody explain (or provide a pointer to an explanation of) the details
of how the individual words are mapped to vectors? The source is available,
but optimized such that the underlying hows are a bit opaque, and the
underlying whys even more so.

~~~
quentinp
You can think of this as a square matrix W. The size of the matrix is the size
of the vocabulary. If we look at the 100k most frequent words in our corpus, W
will be a 100k x 100k matrix.

The value of W(i,j) is the distance between words i and j, and a row of the
matrix is the vector representation of that word. Research around word
vectors is all about computing W(i,j) in an efficient way that is also
useful in natural language processing applications; in practice, tools like
word2vec compress each row into a dense vector of a few hundred dimensions.

Word vectors are often used to compute similarity between words: since words
are represented as vectors, we can compute the cosine of the angle between a
given pair of word vectors to find out how similar the two words are.
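
As a minimal sketch, that cosine computation in plain numpy, using made-up
toy vectors in place of real trained embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| |v|); close to 1 means the two words
    # appear in similar contexts, close to 0 means they are unrelated.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional vectors standing in for real embeddings
# (real word2vec vectors typically have a few hundred dimensions).
v_king = np.array([0.5, 0.1, -0.3, 0.8])
v_queen = np.array([0.45, 0.2, -0.25, 0.75])
print(cosine_similarity(v_king, v_queen))
```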

~~~
rcfox
Does that mean there actually is an answer for "What do you get when you cross
a mosquito with a mountaineer?"

~~~
SergeyHack
TL;DR: The answer to your query is a person named Chaudhry Sitwell Borisovich
who is definitely an entomologist-hymnist and probably is also a mineralogist-
ornithologist.

A google search suggests that he was born in 1961.

I ran a few queries using the code and its default dataset, trying to use
neutral words for subtraction: "mosquito -small +mountaineer", "mosquito -big
+mountaineer", "mosquito -loud +mountaineer", "mosquito -normal +mountaineer",
"mosquito -usual +mountaineer", "mosquito -air +mountaineer", "mosquito
-nothing +mountaineer".

The most frequent words for these queries are:

6 times: "borisovich" "chaudhry" "entomologist" "hymnist" "sitwell"

5 times: "mineralogist" "ornithologist"

~~~
rcfox
Well done, sir.

------
PaulHoule
Just as a hint, you'll get better results if you apply some dimensionality
reduction to this. LDA is an old standby, but I like what these people are
doing...

[http://www.cept.at/](http://www.cept.at/)
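
For the generic idea, here is a minimal sketch of dimensionality reduction
with truncated SVD (the technique behind LSA, used here as a stand-in for
LDA) via scikit-learn, on a made-up matrix:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Stand-in for a (vocabulary x features) matrix, e.g. word
# co-occurrence counts; the values here are random placeholders.
X = np.random.rand(1000, 500)

# Project onto the top 50 singular directions; each row becomes
# a dense 50-dimensional representation of a word.
svd = TruncatedSVD(n_components=50)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (1000, 50)
```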

~~~
simonb
This already does dimensionality reduction using neural networks (for
details, see the papers cited at the bottom of the page).

------
logn
If you've ever complained on HN about NSA technology ruining our society,
then it's time to take this tool and save society. This is powerful and cool
technology. It could expose corruption. It could be a bullshit detector. The
same types of tools that have stripped our liberties can also be the tools
that re-balance democracy.

But please write free and open source software. Otherwise no one can ever
trust the software or consider it secure and safe.

------
AliAdams
This is great! It would be amazing to have a list of other tools/libraries
like this.

For one project I'm working on, I was looking to compare the word/topic
frequencies of a document against common usage, and I'm sure something out
there already does this, as opposed to me doing it (poorly) from scratch.
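
As a starting point, a minimal sketch of that kind of comparison; the
background frequency table is hypothetical and would come from a large
reference corpus:

```python
from collections import Counter

def distinctive_words(doc_tokens, background_freqs, top_n=10):
    """Rank words by how over-represented they are in the document
    relative to a background corpus (a simple keyness ratio)."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    scores = {}
    for word, count in counts.items():
        doc_rate = count / total
        # Crude smoothing for words unseen in the background corpus;
        # a real implementation would want something more principled.
        bg_rate = background_freqs.get(word, 1e-9)
        scores[word] = doc_rate / bg_rate
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Hypothetical background relative frequencies.
background = {"the": 0.05, "cat": 0.0002, "sat": 0.0001}
print(distinctive_words("the cat sat on the mat".split(), background))
```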

------
calufa
"... and vector('king') - vector('man') + vector('woman') is close to
vector('queen') [3, 1]. You can try out a simple demo by running demo-
analogy.sh..."

I downloaded the code, but I don't understand how to get queen as a result.
I tried submitting king - man + woman, but the binary doesn't understand it.

~~~
BitMastro
Just use "man king woman". The analogy tool takes three words A B C and
finds the D such that A is to B as C is to D; internally it computes
vector(B) - vector(A) + vector(C) and returns the nearest words.
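
For what it's worth, the same query can also be run outside the C binary; a
minimal sketch, assuming the trained vectors can be loaded with the
third-party gensim library:

```python
from gensim.models import KeyedVectors

# "vectors.bin" is an assumed path from a local word2vec training run.
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# "man is to king as woman is to ?" -- returns the words nearest to
# vector('king') - vector('man') + vector('woman').
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=3))
```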

------
galapago
I'm working on an experimental library to do a similar thing, but using echo
state networks
([https://github.com/neuromancer/libmind](https://github.com/neuromancer/libmind)).
It would be nice to compare both approaches with an SVM for classifying words.

------
sumit_psp
Isn't it similar to what LSA
([http://en.wikipedia.org/wiki/Latent_semantic_analysis](http://en.wikipedia.org/wiki/Latent_semantic_analysis))
does?

~~~
danieldk
From the linked paper:

 _Many different types of models were proposed for estimating continuous
representations of words, including the well-known Latent Semantic Analysis
(LSA) and Latent Dirichlet Allocation (LDA). In this paper, we focus on
distributed representations of words learned by neural networks, as it was
previously shown that they perform significantly better than LSA for
preserving linear regularities among words_

------
rotbart
Our technology provides much better identification of noun senses
([http://springsense.com/api/](http://springsense.com/api/))

