Demystifying Word2Vec (deeplearningweekly.com)
129 points by buss_jan on Feb 7, 2017 | hide | past | favorite | 14 comments

I'm reading the new Michael Lewis book right now, on Kahneman and Tversky, and there's a point where the notion of encoding similarity as distances first becomes very popular, but is then torn apart in the sixties.

Essentially, it comes down to statements like 'this woman is like Queen Elizabeth' being different from 'Queen Elizabeth is like this woman.' Human perception of similarity is asymmetric. Tversky's suggestion is that we essentially think of things as a collection of tags, and A is like B if A's tags are mostly a subset of B's tags. But if B has more tags (because it's a specific thing that we know a lot about, for example) then B's tags won't mostly be a subset of A's tags, creating the asymmetry.
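That tag-based idea can be sketched in a few lines. This is a toy illustration with made-up tags, not Tversky's exact formulation, though the ratio form below is one of the measures he proposed:

```python
def tversky_similarity(a, b, alpha=1.0, beta=0.0):
    """Tversky's ratio model over feature (tag) sets.

    alpha weighs features of A absent from B; beta weighs features of B
    absent from A. With alpha != beta the measure is asymmetric.
    """
    a, b = set(a), set(b)
    common = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    return common / (common + alpha * only_a + beta * only_b)

# A specific, well-known entity carries many tags; a generic one, few.
queen = {"woman", "british", "monarch", "famous", "wealthy", "elderly"}
woman = {"woman", "elderly"}

# 'this woman is like Queen Elizabeth' scores higher than the reverse:
sim_woman_to_queen = tversky_similarity(woman, queen)  # 1.0 (subset)
sim_queen_to_woman = tversky_similarity(queen, woman)  # ~0.33
```

The asymmetry falls directly out of the set-difference terms: the generic thing's tags sit inside the specific thing's tags, but not vice versa.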

This directly attacks the kind of similarity that word2vec relies on, and I'm wondering if there are critiques along these lines in the literature.

I've not noticed this as a specific critique of Word2Vec models, but despite their near-magic for some purposes it's well-understood that they don't capture all interesting shades of human-perceived meaning – at least not yet.

There are some interesting extensions of the basic model which include echoes of this 'tags' idea.

First, some projects have also modeled each word's vector as a sum-of-subvectors, based on the word's substrings-of-various-lengths. (Notably this is a feature of Facebook's FastText library.) This also learns useful vectors for subword features (like linguistic morphemes/roots), which then lets models often bootstrap useful vectors for new words, not included in the training corpus, based on their similarity with known words. (While those are strictly composing 'tags' of orthographic representation, it would be natural to seek a way to decompose words into varying-sized collections of more primal concepts/denotations/connotations.)
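The subword idea can be sketched roughly like this. Toy code with random stand-in vectors: in FastText the n-gram vectors are learned during training (and hashed into buckets rather than kept in a dict), but the decomposition into boundary-marked character n-grams looks like this:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    # FastText-style: wrap the word in boundary markers, slide windows.
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

rng = np.random.default_rng(0)
dim = 50
ngram_vecs = {}  # in real training these are learned; random stand-ins here

def word_vector(word):
    """A word's vector as the mean of its subword n-gram vectors."""
    grams = char_ngrams(word)
    for g in grams:
        ngram_vecs.setdefault(g, rng.standard_normal(dim))
    return np.mean([ngram_vecs[g] for g in grams], axis=0)
```

An unseen word like 'unhappiness' shares n-grams with 'happiness', so it gets a nearby vector even if it never appeared in the training corpus.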

Second, modeling longer texts (like sentences) as collections-of-word-vectors gives some interesting ways to calculate similarities between texts. A first, simple way is to treat a text as some (perhaps significance-weighted) average of all its words, thus getting a single vector, in the word-space, for the text. Another is to treat the text as having an invisible floating pseudo-word that tries to help in the same word-prediction task that's used to create word-vectors. When that pseudo-word is trained-up, as if it were a rare word, treat it as a vector for the text. (That's Mikolov/Le's 'Paragraph Vector', often referred-to as 'Doc2Vec'.)
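A minimal sketch of the first (averaging) approach. The `a / (a + freq)` weighting is one common significance-weighting choice (smooth inverse frequency, which up-weights rare words); the vectors and frequencies are assumed inputs:

```python
import numpy as np

def sentence_vector(tokens, vectors, word_freq, a=1e-3):
    """Significance-weighted average of a text's word vectors.

    vectors: dict word -> np.ndarray; word_freq: dict word -> relative freq.
    Words without a vector are skipped.
    """
    vecs, weights = [], []
    for tok in tokens:
        if tok in vectors:
            vecs.append(vectors[tok])
            weights.append(a / (a + word_freq.get(tok, 0.0)))
    if not vecs:
        return np.zeros_like(next(iter(vectors.values())))
    return np.average(vecs, axis=0, weights=weights)
```

The result lives in the same space as the word vectors, so text-to-text and text-to-word similarities come for free.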

But following a practice useful from image-color-histogram comparisons, you can also treat the constituent word-vectors of text as if they were 'piles of meaning' at their individual vector coordinates, and calculate the distance between texts as the sum-of-distances/efforts to move one sentence's pile configuration to another. With text this has been called "Word Mover's Distance" but is analogous to "Earth Mover's Distance"/"Wasserstein Distance" in statistics/probability.
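As a rough sketch: in the special case where both texts have the same number of words, each carrying uniform 'mass', the transport problem reduces to an optimal assignment, which scipy can solve directly (full Word Mover's Distance solves the more general transportation problem over unequal, weighted piles):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def toy_word_movers_distance(vecs_a, vecs_b):
    """Balanced special case of Word Mover's Distance: equal word counts,
    uniform mass, so optimal transport reduces to optimal assignment."""
    cost = cdist(vecs_a, vecs_b)             # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)  # cheapest one-to-one matching
    return cost[rows, cols].mean()            # average per-word moving effort
```

Two texts whose words sit at nearby coordinates need little 'effort' to transform into each other, even with no words in common.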

As this has been applied to texts – multi-word collections – I could easily see it also applied to decomposed words/concepts - multi-tag collections, per your summary of Tversky's model above. At some point, I'd also expect these models to account for words/sentences as large 'smears' in space, rather than single points, to better capture the full range of possible meanings, and then likely intended meanings, based on other relevant context.

word2vec is a naive technique, which relies only on the co-location of words. Yes, it misses many things - word order, grammar (e.g. negations), words with many distinct meanings (lead (Pb) vs. (to) lead), synonyms vs. antonyms, etc.
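That reliance on co-location alone shows up in how the training pairs are generated - just (center, context) pairs from a sliding window, with order inside the window discarded. A minimal sketch:

```python
def skipgram_pairs(tokens, window=2):
    """Generate skip-gram (center, context) training pairs from a token list."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

Note that 'cat sat' and 'sat cat' yield the same pair set - the model never sees which side of the center a context word was on, which is exactly why word order and negation are lost.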

What's interesting is that it works at all. My blog post on that: http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html

And if you want an interactive vis, there is one: https://lamyiowce.github.io/word2viz/ (and some discussion: https://news.ycombinator.com/item?id=13346104).

This is great! Thanks.

Isn't this all based on LSA (Latent Semantic Analysis; e.g. the first major paper was by Landauer back in 1990, with roots going back to the 1960s)?

I implemented LSA a few years ago based on these papers and it just seems like LSA/LSI to me..


Just found this:

> In this sense we have come full circle to the methods presented earlier that rely on matrix factorization (such as LSA). Where LSA uses SVD to find the best fitting subspace in terms of the co-variances of words across documents, the Glove model explicitly optimizes wordvectors to reflect the likelihood of their co-occurrence. The point, however, stands that Word2Vec is not an entirely novel approach to arrive at word vectors that capture semantic meaning in their substructure.

Just for some extra clarification:

With LSA, each document is transformed into a single vector with the length of the vocabulary, where the vocabulary is the set of unique words across all documents. In the simplest variant, a word that is present in a document is represented as a 1 in the vector and an absent word as a 0 (in practice, counts or tf-idf weights are often used instead). So after this transformation, the corpus is represented as a D by V matrix, where D is the number of documents and V is the vocabulary size.
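A minimal sketch of that construction, with documents as token lists and the simple presence/absence encoding described above:

```python
import numpy as np

def doc_term_matrix(docs):
    """Build the D x V binary document-term matrix from tokenized docs."""
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: j for j, w in enumerate(vocab)}
    M = np.zeros((len(docs), len(vocab)), dtype=int)
    for i, doc in enumerate(docs):
        for w in doc:
            M[i, index[w]] = 1  # presence/absence encoding
    return M, vocab

docs = [["king", "rules"], ["queen", "rules"]]
M, vocab = doc_term_matrix(docs)  # M is 2 x 3 over ['king', 'queen', 'rules']
```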

GloVe (and word2vec by equivalence), however, works on a different matrix. In this algorithm, the matrix is V by V, where V is the vocabulary size and each cell counts the number of times a word appears next to another word in a document. That is, the matrix represents the counts of how often words are neighbors across all documents. (There are some other technicalities, but this is the thrust of GloVe.)
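The corresponding construction for the V by V co-occurrence matrix might look like this. A toy sketch: real implementations stream the corpus and often down-weight counts by distance within the window:

```python
import numpy as np

def cooccurrence_matrix(docs, window=2):
    """Build the V x V co-occurrence count matrix from tokenized docs."""
    vocab = sorted({w for doc in docs for w in doc})
    idx = {w: j for j, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)), dtype=int)
    for doc in docs:
        for i, w in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    C[idx[w], idx[doc[j]]] += 1
    return C, vocab
```

With a symmetric window the matrix is symmetric: row i records how often word i's neighbors were each other word.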

The similarity between LSA and GloVe is that once the respective matrix is built, it is followed by matrix factorization. By matrix factorization, I mean SVD. This is a process which takes an N by M matrix and returns an N by K matrix, essentially squeezing the matrix into fewer columns (K is a free parameter that is smaller than M).
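A minimal numpy sketch of that truncation step. Scaling the left singular vectors by the singular values is one common convention for forming the reduced matrix:

```python
import numpy as np

def truncated_svd(M, k):
    """Squeeze an N x M matrix down to N x k via SVD, keeping the top-k
    singular directions."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * s[:k]  # N x k

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))
E = truncated_svd(M, 2)  # 6 rows, each compressed to 2 dimensions
```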

So ultimately, what we get with LSA is a D by K matrix (where K is just an arbitrary number, like 100). The interpretation of this matrix is that each row represents a document and each column represents a topic. In GloVe, we get a V by K matrix where each row represents a word and each column represents an embedding.

From LSA/SVD you get a V x K matrix as well - that's exactly what the factorisation is doing.

The following two papers also go into detail about the mathematical similarities between LSA and neural embeddings and achieving similar performance with both:

Levy, O. and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. https://www.cs.bgu.ac.il/~yoavg/publications/nips2014pmi.pdf

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. http://www.anthology.aclweb.org/Q/Q15/Q15-1016.pdf

True, you could use the right matrix of the SVD on the term document matrix. As far as I know, the embeddings won't have the same interpretation as the left matrix from SVD on the word-context matrix. (By LSA I mean SVD on the term document matrix).

Those are excellent papers, by the way.

As someone just now getting into the basics of NLP, thank you for this brief summary.

Good summary, thanks!


A good read, thanks! This technique largely inspired a project we did for school this year, a subreddit recommender system with an RNN learning an embedding space for subreddits. I've just finished up exams and am starting work on getting a minimal webapp up for people to play with, but links to the final report and an interactive bokeh plot of the final embedding can be found here: http://cole-maclean.github.io/blog/RNN-Based-Subreddit-Recom...

This may be helpful, too: https://deeplearning4j.org/word2vec
