
There are clear places where “king” and “queen” are similar to each other and distinct from all the others. Could these be coding for a vague concept of royalty?

This is a common misunderstanding, unfortunately reinforced by the example of 'personality embeddings'. It is easy to see intuitively why this is usually not the case: if you rotate a vector/embedding space, all the cosine similarities between words are preserved. Suppose that component 20 encoded 'royalty'; there is an infinite number of rotations of the vector space in which 'royalty' is distributed across many dimensions. Consider e.g. a personality vector with an openness component and an extroversion component, each between 0 and 1. Now we have two people:

[0, 1] [1, 0]

These vectors are orthogonal, so the cosine similarity is 0. Now let's rotate the vector space by 45 degrees:

[-0.707, 0.707] [0.707, 0.707]

The cosine similarity between the two vectors is still 0. But the direct mapping of personality traits to vector components is lost: each rotated vector now mixes both traits.
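The rotation argument above can be checked numerically. A minimal sketch (numpy, toy 2-d vectors as in the comment):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two "personality" vectors: openness and extroversion as separate axes.
p1 = np.array([0.0, 1.0])
p2 = np.array([1.0, 0.0])

# 45-degree counter-clockwise rotation matrix.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

q1, q2 = R @ p1, R @ p2   # q1 ≈ [-0.707, 0.707], q2 ≈ [0.707, 0.707]

print(cosine(p1, p2))  # 0.0
print(cosine(q1, q2))  # 0.0 — rotation preserves all cosine similarities
```

The same holds for any orthogonal transformation in any dimension, which is why "component 20 encodes royalty" cannot be read off a single trained model.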

Obviously, this is something you generally do not want to do for personality vectors. However, nothing in the word2vec objectives prefers a vector space with meaningful dimensions. Take e.g. the skip-gram model, which maximizes the log-likelihood of a focus word f co-occurring with a context word c. Shortened: p(1|w_f,w_c) = 𝜎(w_f·w_c). So the vanilla word2vec objective prefers vector spaces that maximize the inner product of words that co-occur and minimize the inner product of words that do not. Consequently, if we have an optimal parametrization of W and C (the word and context matrices), any rotation of the vector space is also an optimal solution. Which rotation you actually get depends on accidental factors, such as the initial (typically randomized) parameters.
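The invariance claim follows directly from the loss depending only on inner products. A minimal numerical sketch (toy random matrices, not real training):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                      # toy vocabulary size and embedding dim
W = rng.normal(size=(V, d))      # word (focus) matrix
C = rng.normal(size=(V, d))      # context matrix

def sg_loss(W, C, pairs):
    # Negative log-likelihood of sigma(w_f . w_c) over positive pairs.
    s = np.array([W[f] @ C[c] for f, c in pairs])
    return float(-np.log(1 / (1 + np.exp(-s))).sum())

pairs = [(0, 1), (2, 3), (4, 0)]

# Random orthogonal matrix via QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Rotating both W and C leaves every inner product, and hence the loss,
# unchanged: (W @ Q) (C @ Q)^T = W Q Q^T C^T = W C^T.
print(np.isclose(sg_loss(W, C, pairs), sg_loss(W @ Q, C @ Q, pairs)))  # True
```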

Of course, it is possible to rotate the vector space so that dimensions become meaningful (see e.g. [1]), but with word2vec's default objective, meaningful dimensions are purely accidental; the meaning of a vector is defined by its relation to other vectors, i.e. its neighborhood.

[1] https://www.aclweb.org/anthology/D17-1041

This doesn't refute the point. If there's a royalty dimension and you rotate it, there's still a royalty direction; it just isn't a basis vector. In a blog post intended to introduce the idea, is that distinction really worth dwelling on? It could be a misunderstanding, or just a pedagogical simplification.

Very interesting. I'll read the paper to wrap my head around the concept. Thanks for the feedback!

I agree that there's no reason that these properties are axis-aligned.

Isn't the normal approach to look at whether

word2vec('king') - word2vec('man') ?= word2vec('queen') - word2vec('woman')

There's an entertaining investigation of this applied to Game of Thrones at https://towardsdatascience.com/game-of-thrones-word-embeddin...!
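The analogy test above can be illustrated with hand-built toy vectors (purely for illustration; real word2vec vectors are high-dimensional and learned, not constructed like this):

```python
import numpy as np

# Hypothetical 2-d "embeddings": first component a gender offset, second a
# royalty offset. The names and values are made up for this sketch.
vec = {
    "king":  np.array([ 1.0, 1.0]),   # male,   royal
    "queen": np.array([-1.0, 1.0]),   # female, royal
    "man":   np.array([ 1.0, 0.0]),   # male,   not royal
    "woman": np.array([-1.0, 0.0]),   # female, not royal
}

# king - man and queen - woman both isolate the same "royalty" offset.
print(vec["king"] - vec["man"])       # [0. 1.]
print(vec["queen"] - vec["woman"])    # [0. 1.]
```

Note that the analogy still holds after any rotation of all four vectors, so it does not require the royalty direction to be axis-aligned.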

No one is saying that the "royalty" direction should be aligned with an axis, or that it should point the same way every time you train word2vec. That doesn't mean the direction doesn't exist, or that word2vec doesn't encode such a royalty direction (or region).

Well, obviously, all royalty terms are going to have similar vectors. Skip-gram is just an implicit matrix factorization of a shifted PMI matrix, and most royalty terms will have similar co-occurrences. My point is that the vector components do not mean anything in isolation; there is no dimension directly encoding such properties. The king vector means 'royalty' because queen, prince, princess, etc. have similar directions.
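The "implicit factorization of a shifted PMI matrix" view (Levy & Goldberg's result) can be sketched explicitly: build a (shifted, positive) PMI matrix from co-occurrence counts and factorize it with a truncated SVD. The counts below are made up for illustration:

```python
import numpy as np

# Toy co-occurrence counts (rows = words, columns = context words).
counts = np.array([[0, 4, 3, 0],
                   [4, 0, 0, 1],
                   [3, 0, 0, 1],
                   [0, 1, 1, 0]], dtype=float)

total = counts.sum()
p_wc = counts / total
p_w = p_wc.sum(axis=1, keepdims=True)
p_c = p_wc.sum(axis=0, keepdims=True)

# PMI shifted by log(k), negatives clipped to 0 (shifted positive PMI);
# k corresponds to the number of negative samples in skip-gram.
k = 1.0
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(p_wc / (p_w * p_c)) - np.log(k)
ppmi = np.maximum(np.nan_to_num(pmi, neginf=0.0), 0.0)

# Truncated SVD yields low-dimensional word vectors, analogous to what
# skip-gram with negative sampling implicitly factorizes.
U, S, Vt = np.linalg.svd(ppmi)
d = 2
word_vecs = U[:, :d] * np.sqrt(S[:d])
print(word_vecs.shape)  # (4, 2)
```

Rows with similar co-occurrence patterns end up with similar vectors, which is exactly the "royalty terms have similar co-occurrences" point.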

Related to this is factor analysis, the technique predominantly used in psychology to extract meaningful factors (the analogue of components in principal component analysis).

Unlike PCA, it assumes meaningful "latent" factors (such as "royalty" above) and tries to find a rotation that best loads these factors onto the data. To achieve this, it does not attempt to encode the data perfectly but leaves room for per-variable error in the reduction to factors.
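The PCA-versus-factor-analysis contrast can be sketched with scikit-learn, which exposes both (this assumes a scikit-learn version with the `rotation` parameter, and uses synthetic data made up for the sketch):

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic data: 6 observed variables driven by 2 latent factors plus noise.
latent = rng.normal(size=(200, 2))
loading = rng.normal(size=(2, 6))
X = latent @ loading + 0.1 * rng.normal(size=(200, 6))

# PCA greedily captures variance; FactorAnalysis models latent factors plus
# per-variable noise, and a varimax rotation pushes loadings toward sparse,
# more interpretable, near-axis-aligned structure.
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
pca = PCA(n_components=2).fit(X)

print(fa.components_.shape)   # (2, 6): rotated factor loadings
print(pca.components_.shape)  # (2, 6): principal axes
```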

Has anyone tried word2vec-like training with an L1-norm regularizer to encourage sparse (and perhaps more interpretable) dimensions?
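Not an answer to the question, but for concreteness, here is one way such a regularizer could enter the update; a minimal sketch of a single SGD step on the skip-gram loss with an added L1 penalty (toy vectors, hypothetical hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d, lr, lam = 8, 0.1, 0.01   # embedding dim, learning rate, L1 strength
w_f = rng.normal(size=d)    # focus-word vector
w_c = rng.normal(size=d)    # context-word vector

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# (Sub)gradient of -log(sigma(w_f . w_c)) + lam * ||w_f||_1 w.r.t. w_f.
# The lam * sign(w_f) term pushes individual components toward exactly zero,
# i.e. toward sparser dimensions.
g = -(1 - sigmoid(w_f @ w_c)) * w_c + lam * np.sign(w_f)
w_f -= lr * g
print(w_f.shape)  # (8,)
```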
