There is a 5 month old thread [1] on this, but it might be already outdated.
What is the best approach for feeding custom set of documents to LLM and get non-halucinating and decent result in Dec 2023?
UPD: The question is generally about how to "teach" LLM answer questions using your set of documents (not necessarily train your own, so approaches like RAG counts)
[1] https://news.ycombinator.com/item?id=36832572
You still do RAG. Llamaindex is still the best option that I know of. Most of the startups that have working products are likely using llamaindex. All of the ones that say they are training on documents are actually using RAG.
Test it out. If it really and truly doesn't work, search for a script that creates question and answer pairs automatically with gpt-4. Then try using that for qLoRA. I have never heard of anyone successfully using that for a private document knowledgebase though. Only for skills like math, reasoning, Python, etc. I think the issue is that you need a LOT of data and it needs to repeat concepts or any facts you need to learn many, many times in different supporting ways.
What absolutely does not work is trying to just feed a set of documents into fine tuning. I personally have proven that dozens of times because I had a client who is determined to do it. He has been mislead.
What it will do is learn the patterns that are in those documents.