Improving Text Embeddings with Large Language Models (arxiv.org)
48 points by cmcollier 5 months ago | 6 comments



Interesting, but this part made me do a double-take: "We demonstrate that Mistral-7B, when fine-tuned solely on synthetic data, attains competitive performance on the BEIR [40] and MTEB [27] benchmarks".

E5/BGE large are an order of magnitude smaller than Mistral-7B. So is this just "bigger model wins" in disguise?

I need to read the whole paper carefully, but this jumped out at me.


Agreed. This is a nice example of generating synthetic data, and I believe the synthetic data is helpful for producing useful embeddings for RAG. But not including an ablation with a fine-tuned E5 or another commonly used embedding model (to control for the "bigger model wins" effect) is a glaring omission. This paper shares many authors with the E5 paper; why did they not compare on a fair basis?


I thought the main point was that this is a very fast way (in terms of wall time) to beat state of the art, not a fair comparison by size; if one made E5 bigger, then E5 would be even slower to train.


> Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

I'm surprised they didn't put `Machine Learning (cs.LG)` and `Machine Learning (stat.ML)`.


I am confused: aren't LLMs already embeddings of text?


Yes, but they are not trained to explicitly encourage semantically similar texts to have similar representations, only to do next-token prediction. In embedding models, a contrastive loss is used to minimize the distance between pairs of semantically similar content and maximize the distance to all other embeddings.
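
For reference, the standard in-batch contrastive setup looks roughly like this. It's a minimal sketch of an InfoNCE-style loss in PyTorch; the temperature, batch size, and embedding dimension are illustrative assumptions, not the paper's exact recipe:

```python
# Minimal sketch of an in-batch contrastive (InfoNCE-style) loss.
# Hyperparameters and shapes are illustrative, not from the paper.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  passage_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """query_emb, passage_emb: (batch, dim); row i of each is a positive pair.
    All other rows in the batch act as in-batch negatives."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    # Cosine similarity between every query and every passage in the batch.
    logits = q @ p.T / temperature              # (batch, batch)
    # The matching passage for query i sits on the diagonal.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Random tensors standing in for encoder outputs, just to show the call.
queries = torch.randn(8, 768)
passages = torch.randn(8, 768)
loss = info_nce_loss(queries, passages)
```

Minimizing this loss pulls each positive pair together and pushes every other pairing in the batch apart, which is what makes the resulting vectors useful for similarity search, unlike raw next-token-prediction representations.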



