I've been working on various RAG (Retrieval Augmented Generation) projects, and I'm curious whether you all are seeing any generalizable patterns in building the most performant RAG for a given dataset. For example, is it even possible to say that in “most” cases the best retriever setup is going to be a combination of semantic search (embeddings) + keyword search (BM25) + some xyz technique?
My hypothesis is that there’s no one-size-fits-all RAG design: every dataset is unique and every use-case is nuanced, so each requires its own optimized RAG pipeline. And it’s practically impossible to find the optimal RAG setup for your dataset by manual trial and error, because the number of possible configurations grows exponentially with the number of parameters you tune. For example, if you could choose from 5 different chunking strategies, 5 different chunk sizes, 5 different embedding models, 5 different retrievers, 5 different re-rankers, 5 different prompts, and 5 different LLM settings, that’s 5^7 = 78,125 different RAG configurations, far too many to try exhaustively.
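To make the arithmetic concrete, here is a minimal sketch that counts the grid. The parameter names and the placeholder option values are purely illustrative assumptions, not taken from any particular tool.

```python
from itertools import islice, product

# Hypothetical search space: 7 parameters, 5 placeholder options each.
search_space = {
    "chunking_strategy": list(range(5)),
    "chunk_size": list(range(5)),
    "embedding_model": list(range(5)),
    "retriever": list(range(5)),
    "reranker": list(range(5)),
    "prompt": list(range(5)),
    "llm_setting": list(range(5)),
}

def count_configs(space):
    """Multiply the number of options per parameter to get the grid size."""
    total = 1
    for options in space.values():
        total *= len(options)
    return total

total = count_configs(search_space)
print(total)  # 5**7 = 78125

# The full grid can be enumerated lazily; here we only peek at the first 3 configs.
names = list(search_space)
first_three = [dict(zip(names, combo))
               for combo in islice(product(*search_space.values()), 3)]
print(first_three)
```

Adding an eighth parameter with 5 options would multiply the count by 5 again, which is the exponential growth in question.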
I’d love to hear from people who work extensively on RAG-based use-cases: is my hypothesis above flawed? If so, what’s been your approach to building an optimal RAG pipeline, and how much time and effort has it taken you?
I’m asking because I’m working on a project [0] that performs automatic hyperparameter optimization over the various RAG parameters. You basically just bring your dataset, and RAGBuilder evaluates multiple configurations and helps you identify the best chunking strategy, the best combination of retrievers, etc. for your dataset.
[0]: https://github.com/KruxAI/ragbuilder
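For anyone unfamiliar with the general idea of searching a configuration space automatically instead of by hand, here is a rough sketch using plain random search. This is not RAGBuilder's API or algorithm. The search space, the option values, and the `toy_score` function are hypothetical stand-ins; in practice the score would come from building each pipeline and evaluating it on your own dataset.

```python
import random

# Hypothetical search space; option values are illustrative only.
SPACE = {
    "chunk_size": [256, 512, 1024, 2048],
    "retriever": ["embeddings", "bm25", "hybrid"],
    "reranker": ["none", "cross_encoder"],
    "top_k": [3, 5, 10],
}

def toy_score(config):
    """Stand-in for a real evaluation (assumption: higher is better).

    A real version would build the pipeline described by `config`, run it
    on a test set, and return a quality metric.
    """
    score = 0.0
    score += 1.0 if config["retriever"] == "hybrid" else 0.5
    score += 0.5 if config["reranker"] == "cross_encoder" else 0.0
    score -= abs(config["chunk_size"] - 512) / 2048
    score -= abs(config["top_k"] - 5) / 10
    return score

def random_search(space, score_fn, n_trials, seed=0):
    """Sample n_trials random configs and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {name: rng.choice(options) for name, options in space.items()}
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best_config, best_score = random_search(SPACE, toy_score, n_trials=30)
print(best_config, best_score)
```

Random search is only the simplest baseline here. Smarter strategies (e.g. Bayesian optimization) try to reach a good configuration in fewer evaluations, which matters when each evaluation means running a full RAG pipeline.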
example similar comments to your submission text: https://hn.garglet.com/similar/comment/41727287
example random query: https://hn.garglet.com/form/textSearch?input6=We+got+good+en...
example, similar users to me: https://hn.garglet.com/similar/users/naveen99