Translation: typically in BERT a sentence embedding is a single embedding taken at the [CLS] token. In other words, 768 floats (for BERT-base).
ColBERT gives you a vector for each token in the sequence. For k tokens, your output shape is (k, 768). At the cost of more storage, this lets you compare the similarity of the query to your document at the token level.
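If it helps, here's roughly what that token-level comparison looks like. This is a toy numpy sketch with random vectors standing in for real BERT/ColBERT outputs, not the actual ColBERT code; the scoring rule (for each query token, take the max similarity over document tokens, then sum) is the "late interaction" MaxSim idea from the ColBERT paper.

```python
import numpy as np

def cosine(a, b):
    # Single-vector scoring: one cosine similarity between two [CLS] embeddings.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def maxsim(query_vecs, doc_vecs):
    # ColBERT-style late interaction: normalize, compute all pairwise token
    # similarities, take each query token's best match, and sum those maxima.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                      # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
cls_query, cls_doc = rng.normal(size=768), rng.normal(size=768)               # one vector each
tok_query, tok_doc = rng.normal(size=(8, 768)), rng.normal(size=(120, 768))   # (k, 768) each

print("single-vector score:", cosine(cls_query, cls_doc))
print("late-interaction score:", maxsim(tok_query, tok_doc))
```

(Real ColBERT also projects the token vectors down to a smaller dimension before storing them, but the scoring shape is the same.)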
Afaik “token level multi vectors” and “contextual late interaction” are not standard terms. They’re more like smashing concepts together in a random order, some sort of linguistic trapdoor function that only makes sense if you knew what it meant to begin with.
I share your animosity towards unclear writing. Researchers think it makes them sound smart but frankly I think the opposite. (Unfortunately for me, the converse is not true.)