Towards Understanding Linear Word Analogies
70 points by kawin 79 days ago | 9 comments

Hi, first author here! Feel free to ask any questions.

TL;DR: We prove that linear word analogies hold over a set of ordered pairs (e.g., {(Paris, France), (Ottawa, Canada), ...}) in an SGNS or GloVe embedding space with no reconstruction error when PMI(x,y) + log p(x,y) is the same for every word pair (x,y). We call this term the csPMI (co-occurrence shifted PMI). This has a number of interesting implications:

1. It implies that Pennington et al. (the authors of GloVe) had the right intuition about why these analogies hold.

2. Adding two word vectors together to compose them makes sense, because you're implicitly downweighting the more frequent word -- as TF-IDF or SIF would do explicitly.

3. Using Euclidean distance to measure word dissimilarity makes sense, because the Euclidean distance is a linear function of the negative csPMI.
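To make the csPMI term concrete, here is a minimal sketch of how one might compute it from raw co-occurrence counts. The counts below are made up purely for illustration; this is just the definition PMI(x,y) + log p(x,y), rewritten as log [p(x,y)^2 / (p(x) p(y))], not code from the paper.

```python
import math

def cspmi(count_xy, count_x, count_y, total):
    """csPMI(x, y) = PMI(x, y) + log p(x, y)
                   = log [ p(x, y)^2 / (p(x) p(y)) ].
    Counts are raw corpus counts; `total` is the total number of tokens/pairs."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy ** 2 / (p_x * p_y))

# Toy counts (purely illustrative, not from any real corpus):
print(cspmi(count_xy=50, count_x=1000, count_y=800, total=10**6))
```

Under the paper's result, linear analogies over a set of pairs hold exactly when this quantity is the same for every pair in the set.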
Based on a first glance, this looks like fabulous work. I've added it to my reading list. Thank you for sharing it!

The first question that comes to mind is whether this property and its implications might hold for deep "contextualized" word embeddings such as ELMo[a], which, as I'm sure you're aware, have proven superior to "shallow" word embeddings like Word2Vec/SGNS and GloVe in a growing range of NLP tasks.

A deep contextualized word embedding model maps words like "leaves" very differently depending on context. For example, the deep contextualized vector for the word "leaves" in the sentence "In the Fall, children love to play in the leaves" will be closer to the vector for "foliage" than to the vector for "leaves" in the sentence "Children don't like it when their father leaves for work," which will be closer to the vector for "departs."

I strongly suspect the csPMI property and its implications would hold for the pair (vector("leaves"), vector("foliage")) in the first case and for the pair (vector("leaves"), vector("departs")) in the second case.

What are your (speculative) thoughts on this?
Interesting question! I think you have the right idea: the GloVe or SGNS vector for a word is some composition of the word-sense representations. The number of senses for a word isn't necessarily finite either -- one could argue that each possible context a word could appear in denotes a unique word sense.

I suspect that ELMo (and others) work by mapping a word vector to a word-sense vector conditioned on the context, which is much larger than what is used in shallow embeddings like GloVe. If GloVe and SGNS are implicitly factorizing word-context matrices containing a co-occurrence statistic like PMI, then ELMo might be implicitly factorizing a (word sense)-(context sense) matrix containing the same co-occurrence statistic. If this is true, then we'd expect the csPMI property to hold at the word-sense level as well. It'd be much harder to prove, though, due to the relative complexity of ELMo compared to GloVe/SGNS.
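For readers unfamiliar with the "implicit matrix factorization" view this comment builds on (Levy & Goldberg, 2014): below is a minimal sketch that constructs a positive-PMI word-context matrix from toy co-occurrence counts and factorizes it with truncated SVD to get word vectors. The counts and dimensions are made up for illustration, and the scaling convention (singular values folded into the word side) is just one common choice.

```python
import numpy as np

# Toy word-context co-occurrence counts (rows: words, cols: contexts).
counts = np.array([
    [10.0, 2.0, 1.0],
    [ 8.0, 3.0, 1.0],
    [ 1.0, 9.0, 7.0],
])

total = counts.sum()
p_wc = counts / total                       # joint probabilities p(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)       # marginal p(w)
p_c = p_wc.sum(axis=0, keepdims=True)       # marginal p(c)

# Positive PMI, a common variant that clips negative values to zero.
pmi = np.log(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0.0)

# Truncated SVD: keep the top-k dimensions as the word vectors.
U, S, Vt = np.linalg.svd(ppmi)
k = 2
word_vecs = U[:, :k] * S[:k]
print(word_vecs)
```

SGNS is shown by Levy & Goldberg to implicitly factorize a *shifted* PMI matrix (PMI minus log of the number of negative samples); the PPMI + SVD pipeline above is the standard explicit stand-in for that idea.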
Thank you. Yes, exactly, that's my sense too (pun intended) :-)

Naturally, I'm wondering whether it might be possible somehow to approximate a (pretrained) ELMo (or similar) model with two simpler transformations: first a transformation to the space of word-sense compositions (e.g., via GloVe/SGNS), and then a transformation to a space that somehow encodes probabilities over word senses given the context. Hmmm...
There may be merit in tagging each of the words with their part of speech prior to fitting the model, in a similar way to sense2vec. Using your example above, you would then have two vectors: one for leaves|VERB and one for leaves|NOUN.
This is very cool! When I was a (philosophy) undergrad at UCLA, I took several graduate seminars (most of them in logic), one of which covered Rhetorical Structure Theory and Kehler's Theory of Grammar. In my term paper[1], §3.1, I posit that the "Parallel Relation" (analogies would be closely related to the PR, if not fall categorically under it) could possibly be modeled by Euclidean distance, such that dist(A, B) ≈ 0 when A and B retain "strong" linguistic coherence.

Anyway, this is really awesome stuff, and I'm happy to see more rigorous work done w.r.t. linguistic motifs.
I'm having trouble following the proof of Lemma 1, and I can't tell if it's just the notation being different from what I'm used to. Is the expression for gamma prime equal to the negative of the squared distance between x and y, that is, -||x - y||^2? I'm curious, because I would have expected a stronger property to be necessary for the parallelogram structure.
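For readers following along: this is not a reconstruction of the paper's Lemma 1, just a numeric illustration of the parallelogram structure being asked about. If an analogy a:b :: c:d holds exactly in the embedding space, then b - a = d - c, so the four points form a parallelogram and the opposite sides have equal length. The vectors below are made up for illustration.

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([1.0, 2.0])
c = np.array([3.0, 1.0])
d = c + (b - a)   # place d so the analogy holds with no reconstruction error

assert np.allclose(b - a, d - c)   # the defining parallelogram property
assert np.isclose(np.linalg.norm(a - b),
                  np.linalg.norm(c - d))   # opposite sides have equal length
print(d)
```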
 Neat to see this idea play out.
 Is anyone able to ELI5 this?