word2vec – Tool for computing continuous distributed representations of words (code.google.com)
92 points by linux_devil 1442 days ago | 25 comments

I just gave an invited talk at KDD about deep learning in which I covered this algorithm, so it's great to see this code appear now.

For anyone interested in text analysis: PLEASE study and use this code and the referenced papers. Its importance is hard to overstate. It is far, far better than all previous approaches to word analysis. These representations are the dimensional compression that occurs in the middle of a deep neural net. The resulting vectors encode rich information about the semantics and usage pattern of each word in a very concise way.

We have barely scratched the surface of the applications of these distributed representations. This is a great time to get started in this field - previous techniques are almost totally obsoleted by this so everyone is starting from the same point.

I have previously used the Explicit Semantic Analysis (ESA) algorithm for individual word similarity calculations. ESA uses the text of Wikipedia entries and its ontology as its source, and it worked quite OK.

Do you / does anyone know if there is an easy way to use word2vec to compare similarities of two different documents (think of TF-IDF & cosine similarity)? It is stated on the page that "The linearity of the vector operations seems to weakly hold also for the addition of several vectors, so it is possible to add several word or phrase vectors to form representation of short sentences [2]", but the referenced paper has not yet been published.

It would be super interesting if there was a simple way to compare the similarities of two documents using something like this.
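For what it's worth, here is a minimal sketch of one way to read the quoted remark about adding vectors: average the word vectors of each document and compare the averages with cosine similarity. This is an assumption about how one might do it, not something the word2vec tool provides. The names `average_vector` and `document_similarity` and the 3-dimensional `toy_vectors` are invented for illustration; real vectors would come from a trained model.

```python
import math

def average_vector(tokens, vectors):
    """Average the vectors of the tokens that have one; None if none do."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)

def document_similarity(doc_a, doc_b, vectors):
    """Compare two documents via the cosine of their averaged word vectors."""
    a = average_vector(doc_a.lower().split(), vectors)
    b = average_vector(doc_b.lower().split(), vectors)
    if a is None or b is None:
        return 0.0  # no known words in at least one document
    return cosine_similarity(a, b)

# Hypothetical vectors, made up for this example only.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "pet": [0.85, 0.85, 0.1],
    "car": [0.1, 0.2, 0.9],
    "engine": [0.2, 0.1, 0.95],
}

pets = document_similarity("the cat is a pet", "a dog is a pet", toy_vectors)
mixed = document_similarity("the cat is a pet", "the car engine", toy_vectors)
print(pets > mixed)
```

Words without a vector are simply skipped here, and every word gets equal weight; weighting by TF-IDF would be a natural variation, but that is a guess rather than anything the page claims.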

Could anybody explain (or provide a pointer to an explanation of) the details of how the individual words are mapped to vectors? The source is available, but optimized such that the underlying hows are a bit opaque, and the underlying whys even more so.

You can think of this as a square matrix W. The size of the matrix is the size of the vocabulary. If we look at the 100k most frequent words in our corpus, W will be a 100k x 100k matrix.

The value of W(i,j) is the distance between words i and j, and a row of the matrix is the vector representation of that word. Research around word vectors is all about computing W(i,j) in an efficient way that is also useful in natural language processing applications.

Word vectors are often used to compute similarity between words: since words are represented as vectors, we can compute the cosine of the angle between the vectors for a given pair of words to find out how similar the two words are.
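As a rough illustration of the cosine comparison described above, here is a small sketch in plain Python. The 3-dimensional `toy_vectors` are invented for the example and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical vectors, made up for this example only.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(cat_dog > cat_car)
```

A value near 1 means the two vectors point in nearly the same direction; a value near 0 means they are close to orthogonal.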

Does that mean there actually is an answer for "What do you get when you cross a mosquito with a mountaineer?"

TL;DR: The answer to your query is a person named Chaudhry Sitwell Borisovich who is definitely an entomologist-hymnist and probably is also a mineralogist-ornithologist.

A Google search suggests that he was born in 1961.

I ran a few queries using the code and its default dataset, trying to use neutral words for subtraction: "mosquito -small +mountaineer", "mosquito -big +mountaineer", "mosquito -loud +mountaineer", "mosquito -normal +mountaineer", "mosquito -usual +mountaineer", "mosquito -air +mountaineer", "mosquito -nothing +mountaineer".

The most frequent words for these queries are:

6 times: "borisovich" "chaudhry" "entomologist" "hymnist" "sitwell"

5 times: "mineralogist" "ornithologist"

Well done, sir.

Well you can dot them if not cross them (cross needs 3 dimensions iirc)!

You inadvertently stumbled onto the punchline of the joke - "You can't cross them because a mountaineer is a scalar." (scaler) - works better when spoken.

Don't forget the part about the mosquito being a vector :)

Yeah, the papers linked in the references are probably a better place to start than the readme (though I'm not sure how closely aligned this implementation is with that research, but the paper is still a good read), especially [1]

[1] http://arxiv.org/pdf/1301.3781.pdf

I wrote a simple library[1] in Ruby for measuring the similarity between documents using word vectors. It has none of the cleverness of this one, but is much simpler, if that helps?

[1] https://github.com/bbcrd/Similarity

Just as a hint, you'll get better results if you apply some dimensional reduction to this. LDA is an old standby, but I like what these people are doing...


This already does dimensional reduction using neural networks (for details see papers cited at the bottom of the page).

If you've ever complained on HN about NSA technology ruining our society, then time to take this tool and save society. This is powerful and cool technology. It could expose corruption. It could be a bullshit detector. The same types of tools that have stripped our liberties can be the same tools to re-balance democracy.

But please write free and open source software. Otherwise no one can ever trust the software or consider it secure and safe.

This is great! It would be amazing to have a list of other tools/libraries like this.

On one project I'm working on, I was looking to compare the word/topic frequencies of a document/text against common usage, and I'm sure something out there already does this, as opposed to me doing it (poorly) from scratch.

"... and vector('king') - vector('man') + vector('woman') is close to vector('queen') [3, 1]. You can try out a simple demo by running demo-analogy.sh..."

I downloaded the code but I don't understand how to get queen as a result. I tried submitting king - man + woman, but the binary doesn't understand it.

Use just "man king woman"
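To make the arithmetic behind that input concrete, here is a toy sketch, assuming the three words A B C stand for vector(B) - vector(A) + vector(C), so "man king woman" means vector('king') - vector('man') + vector('woman'). The `analogy` helper and the 3-dimensional `toy_vectors` are invented for illustration and are not the tool's code; excluding the three input words from the candidates is a choice made in this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vector(b) - vector(a) + vector(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Skip the query words themselves so they cannot be returned as the answer.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

# Hypothetical vectors, made up so that the regularity holds exactly.
toy_vectors = {
    "man": [1.0, 0.0, 0.1],
    "woman": [0.0, 1.0, 0.1],
    "king": [1.0, 0.0, 0.9],
    "queen": [0.0, 1.0, 0.9],
    "apple": [0.5, 0.5, 0.0],
}

result = analogy("man", "king", "woman", toy_vectors)
print(result)
```

With real trained vectors the target only lands near the expected word rather than exactly on it, which is why the result is picked by nearest cosine similarity.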

I'm working on an experimental library to do a similar thing, but using echo state networks (https://github.com/neuromancer/libmind). It would be nice to compare both approaches with an SVM to classify words.

Isn't it similar to what LSA (http://en.wikipedia.org/wiki/Latent_semantic_analysis) does?

From the linked paper:

Many different types of models were proposed for estimating continuous representations of words, including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). In this paper, we focus on distributed representations of words learned by neural networks, as it was previously shown that they perform significantly better than LSA for preserving linear regularities among words

It is similar, but at least according to what Mikolov wrote as a response to reviewer comments regarding LDA/LSA/tf-idf [1], LDA does not preserve linguistic regularities such as king - man + woman ~ queen. I asked for additional clarification, but so far I haven't received a reply.

A good intuition as to why these kinds of regularities could even exist was given by Chris Quirk as a blog comment [2]. Essentially, imagine that each word is approximately represented by the contexts it appears in; if so, swapping in and out the contexts of other words could indeed preserve some linguistic regularities.

[1]: http://openreview.net/document/7b076554-87ba-4e1e-b7cc-2ac10...

[2]: http://www.blogger.com/comment.g?blogID=19803222&postID=5373...
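The "word represented by its contexts" intuition can be sketched with plain counts. This is a count-based toy, not how word2vec trains its vectors; the `context_counts` and `shared_context` helpers, the window size, and the three-sentence corpus are all invented for illustration.

```python
from collections import Counter

def context_counts(sentences, window=1):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    contexts = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            neighbours = tokens[lo:i] + tokens[i + 1:hi]
            contexts.setdefault(word, Counter()).update(neighbours)
    return contexts

def shared_context(a, b, contexts):
    """Count the context occurrences that two words have in common."""
    return sum((contexts[a] & contexts[b]).values())

# Hypothetical corpus, made up for this example only.
corpus = [
    "the king rules the land",
    "the queen rules the land",
    "the dog chases the ball",
]

contexts = context_counts(corpus)
print(shared_context("king", "queen", contexts))
print(shared_context("king", "dog", contexts))
```

Words that occur in the same slots ("king" and "queen" here) end up with overlapping context counts, which is the raw material the intuition relies on.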

Yeah, the bag of words and rank-k reduction stuff basically is LSA.

I wrote an LSA implementation a few years ago for a BI product (written about at http://www.innoveerpunt.nl/interactief-innoveren/innoveerpun... sorry that it's Dutch :) ).

I wonder how well it works; my takeaway was that you need to tweak the internal thresholds and matrix sizes a lot to get the optimal results, which in turn is highly dependent on the datasets you use (which is also made very clear in every LSA paper you'll read).

Latent Relational Analysis is more in the spirit of LSA, see http://research.microsoft.com/apps/video/dl.aspx?id=104771

Our technology gives a much better identification of noun senses (http://springsense.com/api/)
