
Towards Understanding Linear Word Analogies - kawin
https://arxiv.org/abs/1810.04882
======
kawin
Hi, first author here! Feel free to ask any questions.

TL;DR: We prove that linear word analogies hold over a set of ordered pairs
(e.g., {(Paris, France), (Ottawa, Canada), ...}) in an SGNS or GloVe embedding
space with no reconstruction error when PMI(x,y) + log p(x,y) is the same for
every word pair (x,y). We call this term the csPMI (co-occurrence shifted
PMI). This has a number of interesting implications:

1. It implies that Pennington et al. (authors of GloVe) had the right
intuition about why these analogies hold.

2. Adding two word vectors together to compose them makes sense, because
you're implicitly downweighting the more frequent word -- like TF-IDF or SIF
would do explicitly.

3. Using Euclidean distance to measure word dissimilarity makes sense because
the Euclidean distance is a linear function of the negative csPMI.
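
To make the quantity concrete, here's a rough numpy sketch (my own
illustration, not code from the paper) of computing csPMI(x, y) = PMI(x, y) +
log p(x, y) from a symmetric word-word co-occurrence count matrix; the counts,
vocabulary, and any smoothing are whatever you choose:

```python
import numpy as np

def cspmi(counts, vocab, x, y):
    """counts: |V| x |V| symmetric co-occurrence counts; vocab: word -> row index."""
    total = counts.sum()                        # total co-occurrence events
    p_xy = counts[vocab[x], vocab[y]] / total   # joint probability p(x, y)
    p_x = counts[vocab[x]].sum() / total        # marginal p(x)
    p_y = counts[vocab[y]].sum() / total        # marginal p(y)
    pmi = np.log(p_xy / (p_x * p_y))            # pointwise mutual information
    return pmi + np.log(p_xy)                   # co-occurrence shifted PMI

# If csPMI(Paris, France) is (approximately) equal to csPMI(Ottawa, Canada) in
# the corpus, the result above says the analogy holds with no reconstruction
# error in the embedding space.
```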

~~~
cs702
Based on a first glance, this looks like fabulous work. I've added it to my
reading list. Thank you for sharing it!

The first question that comes to mind is whether this property and its
implications might hold for deep "contextualized" word embeddings such as
ELMo[a], which, as I'm sure you're aware, have proven superior to "shallow"
word embeddings like Word2Vec/SGNS and GloVe in a growing range of NLP tasks.

A deep contextualized word embedding model maps words like "leaves" very
differently depending on context. For example, the deep contextualized vector
for the word "leaves" in the sentence "In the Fall, children love to play in
the _leaves_ " will be closer to the vector for "foliage" than to the vector
for "leaves" in the sentence "Children don't like it when their father
_leaves_ for work," which will be closer to the vector for "departs."

I strongly suspect the csPMI property and its implications would hold for the
pair (vector("leaves"), vector("foliage")) in the first case and for the pair
(vector("leaves"), vector("departs")) in the second case.

What are your (speculative) thoughts on this?

[a] https://allennlp.org/elmo

~~~
kawin
Interesting question! I think you have the right idea: the GloVe or SGNS
vector for a word is some composition of the word sense representations. The
number of senses for a word isn't necessarily finite either -- one could argue
that each possible context a word could appear in denotes a unique word sense.

I suspect that ELMo (and others) work by mapping a word vector to a word-sense
vector conditioned on the context, which is much larger than the context used
in shallow embeddings like GloVe. If GloVe and SGNS are implicitly factorizing
word-context matrices containing a co-occurrence statistic like PMI, then ELMo
might be implicitly factorizing a (word sense)-(context sense) matrix
containing the same co-occurrence statistic. If this is true, then we'd expect
the csPMI property to hold at the word-sense level as well. It'd be much
harder to prove, though, due to the relative complexity of ELMo compared to
GloVe/SGNS.

~~~
cs702
Thank you. Yes, exactly, that's my sense too (pun intended) :-)

Naturally, I'm wondering whether it might be possible somehow to approximate a
(pretrained) ELMo (or similar) model with two simpler transformations: first a
transformation to the space of word-sense _compositions_ (e.g., via
GloVe/SGNS), and then a transformation to a space that somehow encodes
probabilities over _word senses_ given the context. Hmmm...

~~~
thesehands
There may be merit in tagging each word with its part of speech prior to
fitting the model, in a similar way to sense2vec. Using your example above,
you would then have two vectors: one for leaves|VERB and one for leaves|NOUN.
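
A rough sketch of what I mean, using spaCy for tagging and gensim's skip-gram
Word2Vec (i.e., SGNS) -- the toy corpus and parameters here are placeholders,
not anything from the paper:

```python
# Assumes spaCy with the en_core_web_sm model installed, and gensim 4.x.
import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_sm")

corpus = [
    "In the Fall, children love to play in the leaves",
    "Children don't like it when their father leaves for work",
]

# Turn each sentence into "token|POS" strings, e.g. "leaves|NOUN" vs "leaves|VERB",
# so each (word, part-of-speech) pair gets its own vector.
tagged = [[f"{tok.text}|{tok.pos_}" for tok in nlp(sent)] for sent in corpus]

# Train skip-gram with negative sampling on the tagged tokens.
model = Word2Vec(sentences=tagged, vector_size=50, window=5,
                 min_count=1, sg=1, negative=5)

# Inspect the learned keys; with real data you'd query e.g.
# model.wv.most_similar("leaves|NOUN") and get sense-specific neighbours.
print(model.wv.index_to_key)
```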

------
dvt
This is very cool! When I was a (philosophy) undergrad at UCLA, I took several
graduate seminars (most of them in logic), one of which covered Rhetorical
Structure Theory and Kehler's Theory of Grammar. In my term paper[1], §3.1, I
posit that the "Parallel Relation" (to which analogies are closely related, if
they don't fall under it categorically) could plausibly be modeled by
Euclidean distance, such that dist(A, B) ≈ 0 when A and B retain "strong"
linguistic coherence.

Anyway, this is really awesome stuff and I'm happy to see more rigorous work
done w.r.t. linguistic motifs.

[1] https://dvt.name/logic/prelation.pdf

------
pfedak
I'm having trouble following the proof of Lemma 1, and I can't tell whether
it's just that the notation differs from what I'm used to.

Is the expression for gamma prime equal to the negative of the squared
distance between x and y, that is, -<x-y, x-y>? I'm curious, because I would
have expected a stronger property to be necessary for the parallelogram
structure.

------
sjg007
Neat to see this idea play out.

------
padolsey
Is anyone able to ELI5 this?

