
The Illustrated Word2vec - jalammar
https://jalammar.github.io/illustrated-word2vec/
======
danieldk
_There are clear places where “king” and “queen” are similar to each other and
distinct from all the others. Could these be coding for a vague concept of
royalty?_

This is a common misunderstanding, unfortunately reinforced by the
example of 'personality embeddings'. It is easy to understand intuitively why
this is normally not the case: if you rotate a vector/embedding space, all the
cosine similarities between words are preserved. Suppose that component 20
encoded 'royalty'; there is an infinite number of rotations of the vector
space in which 'royalty' is distributed across many dimensions. Consider,
e.g., a personality vector with openness and extroversion components taking
values between 0 and 1. Now we have two persons:

[0, 1] [1, 0]

These vectors are orthogonal, so the cosine similarity is 0. Now let's rotate
the vector space by 45 degrees:

[-0.707, 0.707] [0.707, 0.707]

The cosine similarity between the two vectors is still 0. Clearly, the direct
mapping of personality traits to vector components is lost (as can be seen in
the second vector).
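
To make the arithmetic concrete, here is a minimal NumPy sketch of the example
above (the rotation matrix is just the example's 45-degree rotation written out):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

theta = np.deg2rad(45)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # 45-degree rotation matrix

p1 = np.array([0.0, 1.0])  # person 1: openness=0, extroversion=1
p2 = np.array([1.0, 0.0])  # person 2: openness=1, extroversion=0

print(cosine(p1, p2))          # 0.0: the vectors are orthogonal
print(R @ p1, R @ p2)          # [-0.707  0.707] [ 0.707  0.707]
print(cosine(R @ p1, R @ p2))  # still 0.0, but components no longer map to traits
```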

Obviously, this is something that you generally do not want for
personality vectors. However, there is nothing in the word2vec objectives that
would prefer a vector space with meaningful dimensions. Take, e.g., the skip-
gram model, which maximizes the log-likelihood of a focus word f co-occurring
with a context word c. Shortened: p(1|w_f, w_c) = σ(w_f · w_c). So, the
objective in vanilla word2vec prefers vector spaces that maximize the inner
product of words that co-occur and minimize the inner product of words that do
not co-occur. Consequently, if we have an optimal parametrization of W and C
(the word and context matrices), any rotation of the vector space (applied to
both W and C) is also an optimal solution. Which rotation you actually get
depends on accidental factors, such as the initial (typically randomized)
parameters.
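
A small numerical check of this claim; the vocabulary size, dimensionality,
and the random orthogonal matrix below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 100, 20
W = rng.normal(size=(vocab, dim))  # word (focus) embeddings
C = rng.normal(size=(vocab, dim))  # context embeddings

# A random orthogonal matrix Q (from a QR decomposition) plays the role of the rotation.
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

scores = sigmoid(W @ C.T)                      # sigma(w_f . w_c) for every word/context pair
scores_rotated = sigmoid((W @ Q) @ (C @ Q).T)  # rotate both W and C with the same Q

# (Q w_f) . (Q w_c) = w_f . (Q^T Q) w_c = w_f . w_c, so the objective cannot tell them apart.
print(np.allclose(scores, scores_rotated))  # True
```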

Of course, it is possible to rotate the vector space such that dimensions
become meaningful (see e.g. [1]), but with word2vec's default objective
meaningful dimensions are purely accidental, and the meaning of a vector is
defined by its relation to other vectors / its neighborhood.

[1]
[https://www.aclweb.org/anthology/D17-1041](https://www.aclweb.org/anthology/D17-1041)

~~~
make3
No one is saying that the "royalty" direction should be aligned with an
axis, or that it should be in the same direction every time you train
word2vec, of course. That doesn't mean that such a direction doesn't exist, or
that word2vec doesn't code for such a royalty direction (or region)

~~~
danieldk
Well, obviously, all royalty are going to have similar vectors. Skip-gram
with negative sampling is just an implicit matrix factorization of a shifted
PMI matrix, and most royalty will have similar co-occurrences. My point is
that the vector components do not mean anything in isolation. There is no
dimension directly encoding such properties. The king vector means 'royalty'
because queen, prince, princess, etc. have similar directions.
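
For readers who want to see what 'implicit factorization of a shifted PMI
matrix' means concretely, here is a rough sketch in the spirit of Levy &
Goldberg's construction. The toy co-occurrence counts are made up, and a real
pipeline would use sparse matrices and a truncated SVD:

```python
import numpy as np

# Toy co-occurrence counts; rows are words, columns are context words.
counts = np.array([[10., 2., 0.],
                   [ 3., 8., 1.],
                   [ 0., 1., 9.]])
k = 5  # number of negative samples in SGNS

total = counts.sum()
p_wc = counts / total
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total

with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
sppmi = np.maximum(pmi - np.log(k), 0.0)  # shift by log k, clip negatives to zero

# A low-rank factorization of the shifted (positive) PMI matrix yields word
# vectors comparable to what SGNS learns.
U, S, Vt = np.linalg.svd(sppmi)
word_vectors = U[:, :2] * np.sqrt(S[:2])
print(word_vectors)
```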

------
jalammar
Hello HN,

Author here. I wrote this blog post attempting to visually explain the
mechanics of word2vec's skipgram with negative sampling algorithm (SGNS). It's
motivated by:

1- The need to develop more visual language around embedding algorithms.

2- The need for a gentle on-ramp to SGNS for people who are using it for
recommender systems, a use-case I find very interesting (there are links in
the post to such applications).

I'm hoping it could also be useful if you wanted to explain to someone new to
the field the value of vector representations of things. Hope you enjoy it.
All feedback is appreciated!

~~~
Radim
Nice work jalammar! Author of gensim here. Quotes from Dune are always
appreciated :-)

Here's some more layman reading "from back when", for people interested in how
word2vec compares to other methods and works technically:

- [https://rare-technologies.com/making-sense-of-word2vec/](https://rare-technologies.com/making-sense-of-word2vec/) (my experiments with word2vec vs GloVe vs sparse SVD / PMI)

- [https://www.youtube.com/watch?v=vU4TlwZzTfU&t=3s](https://www.youtube.com/watch?v=vU4TlwZzTfU&t=3s) (my PyData talk on optimizing word2vec)

~~~
wyldfire
The Dune references aren't limited to this article. :)

The BERT article [1] has 'em too!

[1] [https://jalammar.github.io/illustrated-bert/](https://jalammar.github.io/illustrated-bert/)

~~~
jalammar
You're the first to point that one out! Nice catch!

------
stared
Well, I think it is important to remember that the dimensions of word2vec DO NOT
have any specific meaning (unlike Extraversion etc. in Big Five). All of it is
"up to a rotation", and using individual dimensions directly looks clunky at
best. To be fair, I may be biased, as I wrote a different intro to word2vec
([http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html](http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html)).

For implementation, I am surprised it leaves out
[https://adoni.github.io/2017/11/08/word2vec-pytorch/](https://adoni.github.io/2017/11/08/word2vec-pytorch/).
There are many others, including in NumPy and TF, but I find the PyTorch one
the most straightforward and didactic, by a large margin.

~~~
b_tterc_p
Well, sort of. They do have a meaning; it's probably just not a concept that
is easy for humans to find or understand. If you hypothetically had a large
labeled corpus for a bunch of different features, you could fit linear
regressions over the embedding space to find vectors that represent exactly
(perhaps not uniquely) the meaning you're looking for... and from that you
could imagine a function that transforms the existing embedding space into an
organized one with meaning.

~~~
sixo
You could still interchange the dimensions arbitrarily. You can't say
"dimension 1 = happiness"; a re-training would not replicate that, and would
not necessarily produce a dimension for "happiness" at all.

~~~
b_tterc_p
I’m not saying that. I’m saying you could identify a linear combination of
x,y,z that approximates happiness, and by doing this for many concepts,
transform the matrix into an ordered state where each dimension on its own is
a labeled concept.

People are quick to claim that embedding dimensions have no meaning, but if
that is your goal, and your embedding space is good, you’re not terribly far
from getting there.
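
A rough sketch of that idea (not something from the article): fit a
least-squares probe on hypothetical labeled data to recover a concept
direction as a linear combination of dimensions. The embeddings and
"happiness" labels below are random placeholders standing in for trained
vectors and a real labeled corpus:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_words = 50, 1000
embeddings = rng.normal(size=(n_words, dim))        # stand-in for trained word vectors
happiness_scores = rng.uniform(0, 1, size=n_words)  # stand-in for human "happiness" labels

# Least-squares probe: the coefficient vector is the linear combination of
# dimensions that best approximates the labeled concept.
coefs, *_ = np.linalg.lstsq(embeddings, happiness_scores, rcond=None)
happiness_direction = coefs / np.linalg.norm(coefs)

# Score any word by projecting its vector onto the learned direction.
print(embeddings[0] @ happiness_direction)

# Stacking many such concept directions as rows would give a transformation
# that maps the original space onto axes that each carry a labeled meaning.
```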

------
rmbryan
Excellent article, thank you. My snag in thinking about word2vec is how the
vector model stores information about words with multiple, significantly
different meanings, such as 'polish': Polish as in Eastern Europe, or polish
as in glistening clean.

~~~
physicsyogi
Word2vec doesn't really address multiple meanings (polysemy). There has been
some progress on this, though. Sebastian Ruder has been tracking the state of
the art here: [1].

[1]
[https://nlpprogress.com/english/word_sense_disambiguation.html](https://nlpprogress.com/english/word_sense_disambiguation.html)

Edit: formatting

------
DLA
Thank you very much for writing this and for making such excellent visuals.
This is the single best description of word2vec I've personally ever seen.
Well done!

------
siavosh
Is word2vec still the cutting edge of NLP?

------
451mov
fantastic explanation

