Essentially, it comes down to statements like 'this woman is like Queen Elizabeth' being different from 'Queen Elizabeth is like this woman': human perception of similarity is asymmetric. Tversky's suggestion is that we essentially think of things as collections of tags, and A is like B if A's tags are mostly a subset of B's tags. But if B has more tags (because it's a specific thing that we know a lot about, for example), then B's tags won't mostly be a subset of A's tags, creating the asymmetry.
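To make the asymmetry concrete, here's a minimal sketch of Tversky's ratio model over tag sets. The tag sets and the alpha/beta weights are invented for illustration, not taken from Tversky's paper:

```python
# Tversky's ratio model of similarity:
#   S(A, B) = |A & B| / (|A & B| + alpha*|A - B| + beta*|B - A|)
# With alpha > beta, the subject A is penalized more for its unmatched
# tags than the referent B is, producing the asymmetry described above.

def tversky_similarity(a, b, alpha=0.8, beta=0.2):
    a, b = set(a), set(b)
    common = len(a & b)
    return common / (common + alpha * len(a - b) + beta * len(b - a))

# A person we know little about (few salient tags):
woman = {"person", "female"}
# A famous figure we know a lot about (many tags):
queen = {"person", "female", "monarch", "british", "famous", "wealthy"}

# 'this woman is like Queen Elizabeth' scores high: her tags are a
# subset of the queen's. The reverse direction scores lower, because
# most of the queen's tags have no match.
print(tversky_similarity(woman, queen))
print(tversky_similarity(queen, woman))
```

With the toy numbers above, woman→queen comes out around 0.71 while queen→woman is around 0.38, mirroring the 'A is like B' vs 'B is like A' gap.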
This directly attacks the kind of similarity that word2vec relies on, and I'm wondering if there are critiques along these lines in the literature.
There are some interesting extensions of the basic model which include echoes of this 'tags' idea.
First, some projects have also modeled each word's vector as a sum-of-subvectors, based on the word's substrings-of-various-lengths. (Notably this is a feature of Facebook's FastText library.) This also learns useful vectors for subword features (like linguistic morphemes/roots), which often lets models bootstrap useful vectors for new words, not included in the training corpus, based on their similarity to known words. (While those are strictly composing 'tags' of orthographic representation, it would be natural to seek a way to decompose words into varying-sized collections of more primal concepts/denotations/connotations.)
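The sum-of-subvectors idea can be sketched roughly like this. This is a simplification of what FastText actually does (FastText hashes n-grams into a fixed number of buckets, and the vectors are learned, not random); everything below is illustrative:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    # FastText-style decomposition: wrap the word in boundary markers,
    # then collect all substrings of length n_min..n_max, plus the
    # whole (marked) word itself.
    w = f"<{word}>"
    grams = {w}
    for n in range(n_min, n_max + 1):
        grams.update(w[i:i + n] for i in range(len(w) - n + 1))
    return grams

rng = np.random.default_rng(0)
dim = 8
subword_vecs = {}  # in a real model these vectors are trained

def subword_vector(gram):
    if gram not in subword_vecs:
        subword_vecs[gram] = rng.normal(size=dim)
    return subword_vecs[gram]

def word_vector(word):
    # A word's vector is the sum of its subword vectors, so even an
    # out-of-vocabulary word gets a vector from the n-grams it shares
    # with known words.
    return sum(subword_vector(g) for g in char_ngrams(word))

v = word_vector("running")  # shares '<ru', 'run', ... with "runner"
```

The key property is that "running" and "runner" share many n-grams, so their composed vectors overlap even if one of them never appeared in training.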
Second, modeling longer texts (like sentences) as collections-of-word-vectors gives some interesting ways to calculate similarities between texts. A first, simple way is to treat a text as some (perhaps significance-weighted) average of all its words, thus getting a single vector, in the word-space, for the text. Another is to treat the text as having an invisible floating pseudo-word that tries to help in the same word-prediction task that's used to create word-vectors. When that pseudo-word is trained-up, as if it were a rare word, treat it as a vector for the text. (That's Mikolov/Le's 'Paragraph Vector', often referred to as 'Doc2Vec'.)
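The first (averaging) approach is simple enough to sketch directly. The word vectors and per-word weights below are made-up toys; in practice the vectors come from a trained model and the weights from something like inverse document frequency:

```python
import numpy as np

# Toy word vectors; real ones would come from a trained model.
vecs = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "sat": np.array([0.1, 0.9, 0.0]),
    "ran": np.array([0.2, 0.8, 0.1]),
}

# Hypothetical significance weights (e.g. IDF-like): rarer/more
# informative words count more toward the text's vector.
weights = {"cat": 1.0, "dog": 1.0, "sat": 0.5, "ran": 0.5}

def text_vector(words):
    # Significance-weighted average of the constituent word vectors,
    # yielding one vector in the same space as the words.
    total = sum(weights[w] * vecs[w] for w in words)
    return total / sum(weights[w] for w in words)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

a = text_vector(["cat", "sat"])
b = text_vector(["dog", "ran"])
sim = cosine(a, b)  # texts built from nearby words land near each other
```

The Paragraph Vector / Doc2Vec approach mentioned above is different: the text's vector is trained jointly with the prediction task, not computed from the word vectors after the fact.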
But following a practice borrowed from image-color-histogram comparisons, you can also treat the constituent word-vectors of a text as if they were 'piles of meaning' at their individual vector coordinates, and calculate the distance between texts as the sum-of-distances/effort to move one sentence's pile configuration to the other's. With text this has been called "Word Mover's Distance", and is analogous to "Earth Mover's Distance"/"Wasserstein Distance" in statistics/probability.
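Full Word Mover's Distance requires solving an optimal-transport problem, but the cheap "relaxed" lower bound from Kusner et al.'s WMD paper is easy to sketch: let each word's pile move entirely to its nearest word in the other text. Toy 2-D vectors again, purely for illustration:

```python
import numpy as np

# Toy word vectors; real ones would come from word2vec/GloVe/etc.
vecs = {
    "cat":    np.array([1.0, 0.0]),
    "kitten": np.array([0.9, 0.1]),
    "dog":    np.array([0.0, 1.0]),
    "puppy":  np.array([0.1, 0.9]),
}

def relaxed_wmd(text_a, text_b):
    # Relaxed lower bound on Word Mover's Distance: each word's
    # 'pile of meaning' moves entirely to its nearest counterpart,
    # skipping the matching constraints of full optimal transport.
    def one_way(src, dst):
        return sum(min(np.linalg.norm(vecs[s] - vecs[d]) for d in dst)
                   for s in src) / len(src)
    # The max over both directions is a tighter bound than either alone.
    return max(one_way(text_a, text_b), one_way(text_b, text_a))

near = relaxed_wmd(["cat", "dog"], ["kitten", "puppy"])  # little effort
far  = relaxed_wmd(["cat", "kitten"], ["dog", "puppy"])  # lots of effort
```

Texts whose word-piles sit near each other in the space need little "moving effort"; texts about different things need a lot.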
As this has been applied to texts – multi-word collections – I could easily see it also applied to decomposed words/concepts - multi-tag collections, per your summary of Tversky's model above. At some point, I'd also expect these models to account for words/sentences as large 'smears' in space, rather than single points, to better capture the full range of possible meanings, and then likely intended meanings, based on other relevant context.
What's interesting is that it works at all. My blog post on that: http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html
I implemented LSA a few years ago based on these papers, and word2vec just seems like LSA/LSI to me.
Just found this:
> In this sense we have come full circle to the methods presented earlier that rely on matrix factorization (such as LSA). Where LSA uses SVD to find the best fitting subspace in terms of the co-variances of words across documents, the GloVe model explicitly optimizes word vectors to reflect the likelihood of their co-occurrence. The point, however, stands that Word2Vec is not an entirely novel approach to arrive at word vectors that capture semantic meaning in their substructure.
With LSA, each document is transformed into a single vector whose length is the size of the vocabulary (the number of unique words across all documents). Each entry records whether the corresponding word appears in the document — a count or tf-idf weight in practice, or just 1 in the simplest binary variant — and 0 if it is absent. After this transformation, the corpus becomes a D by V matrix, where D is the number of documents and V is the vocabulary size.
GloVe (and word2vec, by equivalence) however, works on a different matrix. In this algorithm, the matrix is V by V, where V is the vocabulary size, and each cell counts the number of times a word appears next to another word in a document. That is, the matrix represents the counts of how often words are neighbors across all documents. (There are some other technicalities, but this is the thrust of GloVe.)
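A bare-bones version of that co-occurrence counting, with a made-up two-document corpus and a symmetric context window (window size is a free choice; GloVe additionally weights by distance, which is skipped here):

```python
def cooccurrence(docs, window=2):
    # Build a V x V matrix counting how often each pair of words
    # appears within `window` positions of each other, over all docs.
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in vocab]
    for doc in docs:
        for i, w in enumerate(doc):
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if j != i:
                    counts[index[w]][index[doc[j]]] += 1
    return vocab, counts

docs = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab, M = cooccurrence(docs)
# "the" and "sat" co-occur in both documents; "cat" and "dog" never do.
```

The resulting matrix is symmetric here because the window looks both ways; some formulations count left and right contexts separately.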
The similarity between LSA and GloVe is that once the respective matrix is built, it is followed by matrix factorization. By matrix factorization, I mean (truncated) SVD: a process which takes an N by M matrix and returns an N by K matrix, essentially squeezing the matrix into fewer columns (K is a free parameter that is smaller than M).
So ultimately, what we get with LSA is a D by K matrix (where K is just a chosen number, like 100). The interpretation of this matrix is that each row represents a document and each column represents a topic. With GloVe, we get a V by K matrix where each row is a word's embedding and the K columns are its dimensions.
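The truncated-SVD step is the same in both cases and fits in a few lines of NumPy. The tiny block-structured count matrix below is invented so the result is easy to check; the same code applied to a V by V co-occurrence matrix would yield word embeddings instead of document-topic vectors:

```python
import numpy as np

# A tiny document-term count matrix: D=4 documents, V=5 terms.
# Docs 0-1 use terms 0-1; docs 2-3 use terms 2-4 (two 'topics').
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

# Truncated SVD: keep only the K largest singular values/vectors,
# squeezing D x V down to D x K.
K = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_topics = U[:, :K] * s[:K]   # D x K: documents in topic space
term_topics = Vt[:K].T          # V x K: terms in the same space

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Documents sharing terms land together; the two topic blocks
# end up (nearly) orthogonal.
sim_within = cosine(doc_topics[0], doc_topics[1])
sim_across = cosine(doc_topics[0], doc_topics[2])
```

This is also the sense in which the Levy & Goldberg papers below connect word2vec to factorization: skip-gram with negative sampling implicitly factorizes a (shifted PMI) word-context matrix.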
The following two papers also go into detail about the mathematical similarities between LSA and neural embeddings and achieving similar performance with both:
Levy, O. and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization.
Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. http://www.anthology.aclweb.org/Q/Q15/Q15-1016.pdf
Those are excellent papers, by the way.