Interesting, but this claim makes me double-take: "We demonstrate that Mistral-7B, when fine-tuned solely on synthetic data, attains competitive performance on the BEIR [40] and MTEB [27] benchmarks".
E5/BGE large are an order of magnitude smaller than Mistral-7B. So is this just "bigger model wins" in disguise?
I need to read the whole paper carefully, but this jumped out at me.
Agree, this is a nice example of generating synthetic data, and I believe the synthetic data is helpful for producing useful embeddings for RAG. But not including an ablation with a fine-tuned E5 or another commonly used embedding model (to control for the "bigger model wins" effect) is a glaring omission. This paper shares many authors with the E5 paper, so why didn't they compare on a fair basis?
I thought the main point was that this is a very fast way (in terms of wall time) to beat the state of the art, not a size-controlled comparison; if you made E5 bigger, it would only be slower to train.
Yes, but LLMs are not explicitly trained to encourage semantically similar texts to have similar representations; they are only trained for next-token prediction. Embedding models use a contrastive loss that minimizes the distance between pairs of semantically similar content and maximizes the distance to all other embeddings in the batch (see the sketch below).
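For concreteness, here is a minimal sketch of an InfoNCE-style contrastive objective with in-batch negatives, which is the kind of loss being described. The function name, temperature value, and batch shapes are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch: InfoNCE-style contrastive loss with in-batch negatives.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """query_emb, passage_emb: (batch, dim) embeddings of paired texts."""
    # Cosine similarity between every query and every passage in the batch.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature          # (batch, batch)
    # The matching pair sits on the diagonal; every other passage in the
    # batch acts as a negative, so cross-entropy pulls positives together
    # and pushes all other embeddings apart.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random vectors; in practice the embeddings would come from
# pooling the model's hidden states (e.g. last-token pooling for a
# decoder-only model like Mistral).
loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))
```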