> However, generating sentence embeddings through pooling token embeddings can potentially sacrifice fine-grained details present at the token level. ColBERT overcomes this by representing text as token-level multi-vectors rather than a single, aggregated vector. This approach, leveraging contextual late interaction at the token level, allows ColBERT to retain more nuanced information and improve search accuracy compared to methods relying solely on sentence embeddings.
I don't know what it is about ColBERT that affords such opaque descriptions, but this is sadly common. I find the above explanation incredibly difficult to parse.
If anyone wants to try explaining ColBERT without using jargon like "token-level multi-vectors" or "contextual late interaction" I'd love to see a clear description of it!
Here’s my understanding. It is intimidating to write a response to you, because you have an exceptionally clear writing style. I hope the more knowledgable HN crowd will correct any errors in fact or presentation style below.
Old school word embedding models (like Word2Vec) come up with embeddings by using masked word predictions. You can embed a whole sentence by taking the average of all the word embeddings in the sentence.
There are many scenarios where this average fails to distinguish between multiple meanings of a word. For example “fine weather” and “fine hair” both contain “fine” but mean different things.
Transformers are great at producing better embeddings by considering context, and using the words in the rest of the sentence to produce a better representation of each word. BERT is a great model to do this.
The problem is that if you want to use BERT by itself to compute relevance you need to perform a lot of compute per query because you have to concatenate the query and the document vector to produce a long sequence that can then be “embedded” by BERT. Figure 2c in the ColBERT paper [1]
What ColBERT does is to use the fact that BERT can use context from the entire sentence and its attention heads to produce a more nuanced representation of any token in its input. It does this once for all documents in its index. So for example (assuming “fine” was a token) it would embed the “fine” in the sentence “we’re having fine weather today” to a different vector than the fine in “Sarah has fine blond hair”. In ColBERT the size of the output embeddings are usually much smaller than the typical 1024 you might expect from a Word2Vec.
Now, if you have a query, you can do the same and produce token level embeddings for all the tokens in the query.
Once you have these two contextualised embeddings, you can check for the presence of the particular meaning of a word in the document using the dot product. For example the query “which children have fine hair” matches the document “Sarah had fine blond hair” because the token “fine” is used in the exact same context in both the query and the document and should be picked up by the MaxSim operation.
If this is true, my understanding of vanilla token vector embeddings is wrong. my understanding was that the vector embedding was the geometric coordinates of the token in the latent space with respect to the prior distribution. So adding another dimension to make it a "multivector" doesn't (in my mind) seem like it would add much. What am I missing?
I think the important thing is that the first approach to converting complete sentences to an embedding was done by averaging all the embeddings of the tokens in the sentence. What ColBERT does is store the embeddings of all the tokens before then using dot products to identify the most relevant tokens to the query. Another comment in this thread says the same thing in a different way. Feels funny to post a stack exchange reference, but this is a great answer!
Usually, we destructively compress (mean-pooling) both the query and the document, and then compare the two compressed forms.
With ColBERT, we compare first - at the full token level - for more detailed comparison. Then reduce the full set of comparisons to a single vector. Naturally this takes more memory and compute to do the more comparisons. The idea is it’s worth it because the more detailed comparisons lead to better results
Translation: typically in BERT a sentence embedding is an single embedding at the <cls> token. In other words, 768 floats.
ColBERT gives you a vector for each token in the sequence. For k tokens, your output shape is (k, 768). At the cost of more storage this lets you compare similarity of the query to your document at the token level.
Afaik “token level multi vectors” and “contextual late interaction” are not standard terms. They’re more like smashing concepts together in a random order, some sort of linguistic trapdoor function that only makes sense if you knew what it meant to begin with.
I share your animosity towards unclear writing. Researchers think it makes them sound smart but frankly I think the opposite. (Unfortunately for me, the converse is not true.)
What I don't understand is how you don't end up with a totally impractical number of vectors if you have one per token? Surely nobody is storing that many in any real system?
tadkar did a good job at explaining ColBERT. I understood ColBERT well in the context of where it lies on the spectrum of choices.
On one side of the spectrum, you reduce each of the documents as well as the query to a lower-dimensional space (aka embeddings) and perform similarity. This has the advantage that the document embeddings could be precomputed. At query time, you only compute the query embedding and compare its similarity with document embeddings. The problem is that the lower-dimensional embedding acts as a decent, but not great, proxy for the documents as well as for the query. Your query-document similarity is only as good as the semantics that could be captured in those lower-dimensional embeddings.
On the other side of the spectrum, you consider the query with each document (as a pair) and see how much the query "attends" to each of the documents. The power of trained attention weights means that you get a much reliable similarity score. The problem is that this approach requires you to run attention-forward-pass as many times as there are documents -- for each query. In other words, this has a performance issue.
ColBERT sits in the middle of the spectrum. It "attends" to each of the documents separately and captures the lower-dimensional embedding for each token in each document. This we precompute. Once we have done that, we captured the essence of how tokens within a given document attend to each other, and is captured in the token embeddings.
Then, at query time, we do the same for each token in the query. And we see which query-token embedding is greatly similar to which document-token embedding. If we find that there is a document which has more tokens that are found to be greatly similar to the query tokens, then we consider that to the best document match. (The degree of similarity between each query-document token is used to score the ranking - it is called Sum of MaxSim).
Obviously, attention based similarity, like in the second approach, is better than reducing to token embeddings and scoring similarity. But ColBERT avoids the performance hit compared to the second approach. ColBERT also avoids the lower fidelity of "reducing the entire document to a lower-dimensional space issue" because it reduces each token in the document separately.
By the way, the first approach is what bi-encoders do. The second approach is cross-encoding.
I don't know what it is about ColBERT that affords such opaque descriptions, but this is sadly common. I find the above explanation incredibly difficult to parse.
I have my own explanation of ColBERT here but I'm not particularly happy with that either: https://til.simonwillison.net/llms/colbert-ragatouille
If anyone wants to try explaining ColBERT without using jargon like "token-level multi-vectors" or "contextual late interaction" I'd love to see a clear description of it!
reply