
Demystifying Word2Vec - buss_jan
http://www.deeplearningweekly.com/blog/demystifying-word2vec
======
sdenton4
I'm reading the new Michael Lewis book right now, on Kahneman and Tversky, and
there's a point where the notion of encoding similarity as distances first
becomes very popular, but is then torn apart in the sixties.

Essentially, it comes down to statements like 'this woman is like Queen
Elizabeth' being different from 'Queen Elizabeth is like this woman.' Human
perception of similarity is asymmetric. Tversky's suggestion is that we
essentially think of things as collections of tags, and A is like B if A's
tags are mostly a subset of B's tags. But if B has more tags (because it's a
specific thing that we know a lot about, for example) then B's tags won't
mostly be a subset of A's tags, creating the asymmetry.

This directly attacks the kind of similarity that word2vec relies on, and I'm
wondering if there are critiques along these lines in the literature.

~~~
gojomo
I've not noticed this as a specific critique of Word2Vec models, but despite
their near-magic for some purposes it's well-understood that they don't
capture all interesting shades of human-perceived meaning – at least not yet.

There are some interesting extensions of the basic model which include echoes
of this 'tags' idea.

First, some projects have also modeled each word's vector as a sum of
subvectors, based on the word's substrings of various lengths. (Notably this
is a feature of Facebook's FastText library.) This also learns useful vectors
for subword features (like linguistic morphemes/roots), which often lets
models bootstrap useful vectors for new words not included in the training
corpus, based on the subwords they share with known words. (While those are
strictly 'tags' of orthographic representation, it would be natural to seek a
way to decompose words into varying-sized collections of more primal
concepts/denotations/connotations.)
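
For concreteness, here's a minimal sketch of that subword idea using gensim's
FastText implementation (assuming gensim 4.x; the toy corpus and parameters
are illustrative, not anything from the article):

    from gensim.models import FastText

    # Toy corpus; each sentence is a list of tokens.
    corpus = [
        ["the", "queen", "greeted", "the", "crowd"],
        ["a", "kingdom", "needs", "a", "king"],
        ["the", "king", "and", "queen", "ruled", "together"],
    ]

    # Each word vector is the sum of vectors for its character n-grams
    # (min_n..max_n) plus a whole-word vector.
    model = FastText(corpus, vector_size=32, window=2, min_count=1,
                     min_n=3, max_n=5, epochs=50)

    # Out-of-vocabulary words still get a vector, composed from the
    # n-grams they share with words seen in training.
    print(model.wv["queendom"])
    print(model.wv.similarity("king", "kingdom"))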

Second, modeling longer texts (like sentences) as collections-of-word-vectors
gives some interesting ways to calculate similarities between texts. A first,
simple way is to treat a text as some (perhaps significance-weighted) average
of all its words, thus getting a single vector, in the word-space, for the
text. Another is to treat the text as having an invisible floating pseudo-word
that tries to help in the same word-prediction task that's used to create
word-vectors. When that pseudo-word is trained-up, as if it were a rare word,
treat it as a vector for the text. (That's Mikolov/Le's 'Paragraph Vector',
often referred-to as 'Doc2Vec'.)
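
As an illustration of the first, simple approach, here's a sketch of averaging
word vectors into a single text vector (assuming `wv` is any trained
word-to-vector lookup, e.g. a gensim KeyedVectors; weighting and
out-of-vocabulary handling are left out):

    import numpy as np

    def text_vector(tokens, wv):
        # Average the vectors of the words we have; ignore unknown words.
        vecs = [wv[t] for t in tokens if t in wv]
        return np.mean(vecs, axis=0) if vecs else None

    sent_a = text_vector("the queen greeted the crowd".split(), wv)
    sent_b = text_vector("the king addressed his subjects".split(), wv)

    # Cosine similarity between the two text vectors.
    sim = np.dot(sent_a, sent_b) / (np.linalg.norm(sent_a) * np.linalg.norm(sent_b))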

But following a practice borrowed from image-color-histogram comparisons, you
can also treat the constituent word-vectors of a text as if they were 'piles of
meaning' at their individual vector coordinates, and calculate the distance
between texts as the summed distance/effort required to move one sentence's
pile configuration into another's. With text this has been called "Word Mover's
Distance", and it's analogous to "Earth Mover's Distance"/"Wasserstein Distance"
in statistics/probability.
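
A minimal sketch of that, again assuming `wv` is a trained gensim KeyedVectors
instance (gensim hands the underlying transport problem to an EMD solver such
as pyemd or POT, depending on the version):

    doc1 = "the queen greeted the crowd".split()
    doc2 = "the monarch addressed the people".split()

    # Word Mover's Distance: cost of moving one bag of word-vectors
    # onto the other. Lower means more similar.
    distance = wv.wmdistance(doc1, doc2)
    print(distance)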

As this has been applied to texts – multi-word collections – I could easily
see it also applied to decomposed words/concepts - multi-tag collections, per
your summary of Tversky's model above. At some point, I'd also expect these
models to account for words/sentences as large 'smears' in space, rather than
single points, to better capture the full range of _possible_ meanings, and
then _likely intended_ meanings, based on other relevant context.

------
stared
And if you want an interactive vis, there is one:
[https://lamyiowce.github.io/word2viz/](https://lamyiowce.github.io/word2viz/)
(and some discussion:
[https://news.ycombinator.com/item?id=13346104](https://news.ycombinator.com/item?id=13346104)).

~~~
buss_jan
This is great! Thanks.

------
NKCSS
Isn't this all based on LSA (Latent Semantic Analysis; e.g. the first major
paper was by Landauer back in 1990, and the idea goes back to the 1960s)?

I implemented LSA a few years ago based on these papers and it just seems like
LSA/LSI to me...

[Update]

Just found this:

> In this sense we have come full circle to the methods presented earlier that
> rely on matrix factorization (such as LSA). Where LSA uses SVD to find the
> best fitting subspace in terms of the co-variances of words across
> documents, the GloVe model explicitly optimizes word vectors to reflect the
> likelihood of their co-occurrence. The point, however, stands that Word2Vec
> is not an entirely novel approach to arrive at word vectors that capture
> semantic meaning in their substructure.

~~~
twelfthnight
Just for some extra clarification:

With LSA, each document is transformed into a single vector whose length is the
size of the vocabulary, i.e. the number of unique words across all documents.
If a word is present in a document, it is represented as a 1 in the vector,
and 0 if it is not. So after this transformation, the text is turned into a
D by V matrix, where D is the number of documents and V is the vocabulary size.
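
A quick sketch of that document-term matrix with scikit-learn (binary
presence/absence to match the description above; in practice LSA pipelines
more often use raw counts or tf-idf weights):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "the queen greeted the crowd",
        "a kingdom needs a king",
        "the king and queen ruled together",
    ]

    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(docs)   # shape: (D documents, V vocabulary)
    print(X.shape)
    print(vectorizer.get_feature_names_out())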

GloVe (and word2vec, by equivalence), however, works on a different matrix. In
this algorithm, the matrix is V by V, where V is the vocabulary size and each
cell counts the number of times a word appears next to another word in a
document. That is, the matrix represents the counts of how often words are
neighbors across all documents. (There are some other technicalities, but this
is the thrust of GloVe.)
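
Roughly, that co-occurrence counting looks like this (window size and corpus
are illustrative):

    from collections import defaultdict

    corpus = [
        "the queen greeted the crowd".split(),
        "a kingdom needs a king".split(),
        "the king and queen ruled together".split(),
    ]
    window = 2

    # cooc[(w1, w2)] = how often w2 appears within `window` tokens of w1.
    cooc = defaultdict(int)
    for tokens in corpus:
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    cooc[(w, tokens[j])] += 1

    print(cooc[("king", "queen")])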

The similarity between LSA and GloVe is that once the respective matrix is
built, it is followed by matrix factorization. By matrix factorization, I mean
SVD. This is a process which takes an N by M matrix and returns an N by K
matrix, essentially squeezing the matrix into fewer columns (K is a free
parameter that is smaller than M).
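
For example, with scikit-learn's TruncatedSVD (K = 2 here, purely for
illustration):

    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    X = np.random.rand(6, 10)           # stand-in for a D x V (or V x V) matrix
    svd = TruncatedSVD(n_components=2)  # K = 2
    X_k = svd.fit_transform(X)          # the squeezed N x K matrix
    print(X_k.shape)                    # (6, 2)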

So ultimately, what we get with LSA is a D by K matrix (where K is just an
arbitrary number, like 100). The interpretation of this matrix is that each
row represents a document and each column represents a topic. In GloVe, we get
a V by K matrix where each row represents a word and is that word's embedding
(each column being a latent dimension).

~~~
eginhard
From LSA/SVD you get a V x K matrix as well - that's exactly what the
factorisation is doing.

The following two papers also go into detail about the mathematical
similarities between LSA and neural embeddings, and show that similar
performance can be achieved with both:

Levy, O. and Goldberg, Y. (2014). Neural word embedding as implicit matrix
factorization.
[https://www.cs.bgu.ac.il/~yoavg/publications/nips2014pmi.pdf](https://www.cs.bgu.ac.il/~yoavg/publications/nips2014pmi.pdf)

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional
similarity with lessons learned from word embeddings.
[http://www.anthology.aclweb.org/Q/Q15/Q15-1016.pdf](http://www.anthology.aclweb.org/Q/Q15/Q15-1016.pdf)

~~~
twelfthnight
True, you could use the right singular-vector matrix of the SVD on the
term-document matrix. As far as I know, though, those embeddings won't have the
same interpretation as the left matrix from the SVD of the word-context matrix.
(By LSA I mean SVD on the term-document matrix.)
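
To make the left/right distinction concrete, a small numpy sketch (dimensions
are illustrative; scaling the factors by the singular values is one common
convention):

    import numpy as np

    A = np.random.rand(5, 8)                      # D=5 documents x V=8 words
    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    doc_vectors  = U * S     # rows of U (left factor): one vector per document
    word_vectors = Vt.T * S  # rows of V (right factor): one vector per word
    print(doc_vectors.shape, word_vectors.shape)  # (5, 5) and (8, 5)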

Those are excellent papers, by the way.

------
ponderingHplus
A good read, thanks! This technique largely inspired a project we did for
school this year, a subreddit recommender system with an RNN learning an
embedding space for subreddits. I've just finished up exams and am starting
work on getting a minimal webapp up for people to play with, but links to the
final report and an interactive bokeh plot of the final embedding can be found
here: [http://cole-maclean.github.io/blog/RNN-Based-Subreddit-Recommender-System/](http://cole-maclean.github.io/blog/RNN-Based-Subreddit-Recommender-System/)

------
vonnik
This may be helpful, too:
[https://deeplearning4j.org/word2vec](https://deeplearning4j.org/word2vec)

