Here’s my understanding. It is intimidating to write a response to you, because you have an exceptionally clear writing style. I hope the more knowledgeable HN crowd will correct any errors of fact or presentation below.
Old school word embedding models (like Word2Vec) learn embeddings by predicting a word from its surrounding words (or vice versa), so each word ends up with a single static vector. You can embed a whole sentence by taking the average of all the word embeddings in the sentence.
There are many scenarios where this average fails to distinguish between multiple meanings of a word. For example “fine weather” and “fine hair” both contain “fine” but mean different things, yet a static model gives “fine” the exact same vector in both.
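To make that concrete, here’s a tiny sketch with made-up 4-dimensional vectors (not real Word2Vec weights): the static model hands “fine” the same vector in both sentences, so the two averages can never tell the senses apart.

    import numpy as np

    # hypothetical static word vectors (a real Word2Vec model would have ~300 dims)
    word_vectors = {
        "fine":    np.array([0.9, 0.1, 0.0, 0.3]),
        "weather": np.array([0.2, 0.8, 0.1, 0.0]),
        "hair":    np.array([0.0, 0.1, 0.9, 0.2]),
    }

    def sentence_embedding(tokens):
        # the sentence embedding is just the mean of the static word vectors
        return np.mean([word_vectors[t] for t in tokens if t in word_vectors], axis=0)

    # "fine" contributes exactly the same vector to both averages,
    # so the two senses are never distinguished
    print(sentence_embedding(["fine", "weather"]))
    print(sentence_embedding(["fine", "hair"]))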
Transformers produce better embeddings by considering context: they use the rest of the sentence to build a better representation of each word. BERT is a great model for this.
The problem is that if you want to use BERT by itself to compute relevance, you need to perform a lot of compute per query, because you have to concatenate the query text and the document text into one long sequence that is then run through BERT (see Figure 2c in the ColBERT paper [1]).
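Roughly, the Figure 2c setup looks like this in code. I’m using the sentence-transformers CrossEncoder wrapper with an example re-ranking model purely to illustrate the pattern; the point is that every (query, document) pair needs its own full BERT forward pass at query time.

    from sentence_transformers import CrossEncoder

    # example BERT-style re-ranker; any cross-encoder works the same way
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    query = "which children have fine hair"
    documents = [
        "Sarah has fine blond hair",
        "we're having fine weather today",
    ]

    # one full forward pass per (query, document) pair -- expensive at query time
    scores = model.predict([(query, doc) for doc in documents])
    print(scores)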
What ColBERT does is use the fact that BERT, through its attention heads, can draw on context from the entire sentence to produce a more nuanced representation of every token in its input. It does this once, offline, for all documents in its index. So for example (assuming “fine” was a token) it would embed the “fine” in “we’re having fine weather today” to a different vector than the “fine” in “Sarah has fine blond hair”. In ColBERT the output embeddings are usually much smaller (e.g. 128 dimensions) than BERT’s hidden size of 768 or 1024.
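Something like this is my mental model of the document-side encoding (the 128-dimensional projection below is a random stand-in for ColBERT’s learned linear layer, and the model name is just an example):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")
    # stand-in for ColBERT's learned projection: 768 -> 128, then L2-normalise
    project = torch.nn.Linear(bert.config.hidden_size, 128, bias=False)

    def encode_tokens(text):
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**inputs).last_hidden_state   # (1, seq_len, 768), contextual
        emb = project(hidden)                           # (1, seq_len, 128)
        return torch.nn.functional.normalize(emb, dim=-1).squeeze(0)

    # "fine" gets a different 128-d vector in each sentence because its context differs
    doc1 = encode_tokens("we're having fine weather today")
    doc2 = encode_tokens("Sarah has fine blond hair")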
Now, if you have a query, you can do the same and produce token level embeddings for all the tokens in the query.
Once you have these two sets of contextualised embeddings, you can check for the presence of a particular meaning of a word in the document using the dot product. For example the query “which children have fine hair” matches the document “Sarah has fine blond hair” because the token “fine” is used in a very similar sense in both the query and the document, so their dot product is high and gets picked up by the MaxSim operation.
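As I understand MaxSim: for each query token you take the maximum dot product against all document token embeddings, then sum those maxima over the query tokens. Reusing the encode_tokens sketch from above:

    def maxsim_score(query_emb, doc_emb):
        # query_emb: (num_query_tokens, 128), doc_emb: (num_doc_tokens, 128)
        sims = query_emb @ doc_emb.T           # all pairwise dot products
        return sims.max(dim=1).values.sum()    # best document match per query token, summed

    query_emb = encode_tokens("which children have fine hair")
    for doc in ["Sarah has fine blond hair", "we're having fine weather today"]:
        print(doc, "->", maxsim_score(query_emb, encode_tokens(doc)).item())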
If this is true, my understanding of vanilla token vector embeddings is wrong. My understanding was that the vector embedding was the geometric coordinates of the token in the latent space with respect to the prior distribution. So adding another dimension to make it a "multivector" doesn't (in my mind) seem like it would add much. What am I missing?
I think the important thing is that the first approach to converting complete sentences into an embedding was to average all the token embeddings in the sentence. What ColBERT does instead is store the embeddings of all the tokens, and then use dot products at query time to identify the document tokens most relevant to each query token. Another comment in this thread says the same thing in a different way. Feels funny to post a stack exchange reference, but this is a great answer!
[1] https://arxiv.org/pdf/2004.12832