Interesting, but this claim makes me double-take: "We demonstrate that Mistral-7B, when fine-tuned solely on synthetic data, attains competitive performance on the BEIR [40] and MTEB [27] benchmarks".
E5/BGE large are an order of magnitude smaller than Mistral-7B. So is this just "bigger model wins" in disguise?
I need to read the whole paper carefully, but this jumped out at me.
Agree, this is a nice example of generating synthetic data, and I believe the synthetic data is helpful for producing useful embeddings for RAG. But not including an ablation with a fine-tuned E5 or another commonly used embedding model (to control for the "bigger model wins" effect) is a glaring omission. This paper shares many authors with the E5 paper, so why didn't they compare on a fair basis?
I thought the main point was that this is a very fast way (in terms of wall time) to beat the state of the art, not a size-controlled comparison; if you made E5 bigger, it would only be slower to train.
Yes, but LLMs are not explicitly trained to encourage semantically similar texts to have similar representations; they are only trained for next-token prediction. Embedding models use a contrastive loss that minimizes the distance between pairs of semantically similar content and maximizes the distance to all other embeddings in the batch (see the sketch below).
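For concreteness, here is a minimal sketch of an InfoNCE-style contrastive objective with in-batch negatives, which is the kind of loss being described. The function name, temperature value, and batch shapes are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch: InfoNCE-style contrastive loss with in-batch negatives.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """query_emb, passage_emb: (batch, dim) embeddings of paired texts."""
    # Cosine similarity between every query and every passage in the batch.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature          # (batch, batch)
    # The matching pair sits on the diagonal; every other passage in the
    # batch acts as a negative, so cross-entropy pulls positives together
    # and pushes all other embeddings apart.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random vectors; in practice the embeddings would come from
# pooling the model's hidden states (e.g. last-token pooling for a
# decoder-only model like Mistral).
loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))
```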