Main thing is to curate a good set of documents to start with. Garbage in (like the raw Bing/Google search results this study used) --> garbage out.
On the technical side, the biggest mistake people make is abstracting the whole process away with LangChain and the like instead of hyper-optimizing every step with trial and error.
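For reference, the un-abstracted version is only a handful of lines, and every knob (chunking, k, scoring, prompt template) stays visible. A rough sketch; `embed` here is a toy hashing embedding and `complete` a stand-in for whatever model client you actually use, not any real library's API:

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashing bag-of-words embedding; swap in a real embedding model."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def complete(prompt: str) -> str:
    """Placeholder for your actual LLM call."""
    raise NotImplementedError

def retrieve(query: str, docs: list[str], k: int = 5) -> list[str]:
    # Each stage (embedding, scoring, k, upstream chunking) can be tuned in isolation.
    doc_vecs = np.stack([embed(d) for d in docs])
    scores = doc_vecs @ embed(query)  # cosine similarity, since vectors are unit-norm
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

def answer(query: str, docs: list[str]) -> str:
    context = "\n---\n".join(retrieve(query, docs))
    return complete(f"Using only this context:\n{context}\n\nQuestion: {query}\nAnswer:")
```

The point isn't this exact code; it's that when retrieval quality drops, you can see exactly which stage to poke.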
> When the next token is a URL, and the URL does not match the preceding anchor text.
> Additional layers of these 'LLMs' could read the responses and determine whether their premises are valid and their logic is sound as necessary to support the presented conclusion(s), and then just suggest a different citation URL for the preceding text
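A rough sketch of that kind of second pass (everything here is illustrative: `complete()` stands in for whatever model client you use, and the prompts are just placeholders):

```python
def complete(prompt: str) -> str:
    """Placeholder for your actual LLM call."""
    raise NotImplementedError

def verify_citation(sentence: str, cited_url: str, candidates: dict[str, str]) -> str:
    """Return cited_url if its text supports the sentence, otherwise the
    model's pick from the retrieval pool (or 'NONE')."""
    verdict = complete(
        "Does the SOURCE support the CLAIM? Answer SUPPORTED or UNSUPPORTED.\n\n"
        f"CLAIM: {sentence}\nSOURCE: {candidates.get(cited_url, '')}"
    )
    if verdict.strip().upper().startswith("SUPPORTED"):
        return cited_url
    # Premise check failed: ask for a better-supported source from the same pool.
    listing = "\n".join(f"[{u}] {t[:300]}" for u, t in candidates.items())
    return complete(
        "Which URL's source text best supports the CLAIM? Reply with the URL, or NONE.\n\n"
        f"CLAIM: {sentence}\n\nSOURCES:\n{listing}"
    ).strip()
```

Running this per generated citation costs an extra model call per sentence, so it isn't free, but it's exactly the kind of step you can only add and tune when the pipeline isn't hidden behind an abstraction.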
Simple similarity search is not enough - a chunk can be topically similar to a claim without actually supporting it, and embeddings miss exact-match signals like names and URLs.
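One cheap step beyond pure cosine similarity is to fuse a lexical ranking (BM25 or similar) with the vector ranking. Reciprocal rank fusion is a common way to combine the two, since it needs no score calibration; a minimal sketch:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion over several rankings (lists of doc ids, best first)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = rrf([bm25_ranking, embedding_ranking])
```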