Cosine similarity and top-k RAG feel so primitive to me, like we are still in the semantic dark ages.
The article is right to point out that, in most cases, cosine similarity is more of an accidental property of the data than anything else (but IIUC there are newer embedding models that are deliberately trained with cosine similarity as the similarity measure). The author's bootstrapping approach is interesting, especially because of its ability to map relations other than the identity, but it seems like more of a computational optimization or shortcut (you could just run inference on the input) than a way to correlate unstructured data.
After trying out some RAG approaches and becoming disillusioned pretty quickly, I think we need to solve the problem at a much deeper level, by structuring models so that they can perform RAG during training. Prompting typical LLMs with RAG gives them input that is dissimilar from their training data and relies on heuristics (like the data format) and thresholds (like topK) that live outside the model itself. We could probably greatly improve this by having models define the embeddings, formats, and retrieval processes that best help them model their training data (i.e. each model learns its own multi-step or "agentic" RAG while it learns everything else).
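To make the "lives outside the model" point concrete, here is a minimal sketch of the usual cosine + top-k pipeline. Everything in it is made up for illustration: the toy 3-dimensional vectors stand in for real embeddings, and `top_k`, `min_score`, and `build_prompt` are hypothetical names, not any library's API. Note how `k`, the score cutoff, and the prompt format are all hand-picked and none of them is learned.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, docs, k=2, min_score=0.0):
    """Return up to k (score, text) pairs, best first.

    k and min_score are hand-tuned thresholds chosen outside the model.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in docs]
    scored = [s for s in scored if s[0] >= min_score]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]

def build_prompt(question, hits):
    """Hand-written context format: another heuristic the model never trained on."""
    context = "\n".join(f"- {text}" for _, text in hits)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Toy "embeddings", invented for this example.
DOCS = [
    ("cats are mammals", [1.0, 0.0, 0.0]),
    ("dogs are mammals", [0.9, 0.1, 0.0]),
    ("stocks fell today", [0.0, 0.0, 1.0]),
]

hits = top_k([1.0, 0.0, 0.0], DOCS, k=2, min_score=0.5)
prompt = build_prompt("what are cats?", hits)
print(prompt)
```

Changing `k` or `min_score` changes what the LLM sees, but the LLM has no say in either choice, which is the gap I'd like training to close.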
I'm not an AI researcher, though, and I assume the real problem is that getting the right structure to train properly and efficiently is rather difficult.