Understanding What Matters for LLM Ingestion and Preprocessing

doctorpangloss · 2024-04-21T23:28:31 1713742111

Does anyone have any non-toy, fully open source examples (data and code both) of open-weights LLMs fine-tuned on non-instruct-style data where the resulting instruct-style queries actually work? Are there any fully open source examples of working RAG that, to the end user, are obviously superior to the most sophisticated open source full-text search, or even Google indexing?

In creative art, there is a thriving use for fine-tuning. Much of it is reproducible. There are specific guides with specific results. But where is the guide for the “corporate knowledge base”? I get the feeling that “it,” meaning “fine-tune an LLM or use RAG,” is inferior to sophisticated open source full-text search, but so many people are invested in pretending otherwise because of all the dollar signs.