Main thing is to curate a good set of documents to start with. Garbage in (like the raw Bing/Google search results this study used) --> garbage out.
On the technical side, the biggest mistake people make is abstracting the whole process away with LangChain and the like instead of hyper-optimizing every step with trial and error.
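For reference, the un-abstracted version is only a handful of lines, and every knob (chunking, k, scoring, prompt template) stays visible. A rough sketch; `embed` here is a toy hashing embedding and `complete` a stand-in for whatever model client you actually use, not any real library's API:

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashing bag-of-words embedding; swap in a real embedding model."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def complete(prompt: str) -> str:
    """Placeholder for your actual LLM call."""
    raise NotImplementedError

def retrieve(query: str, docs: list[str], k: int = 5) -> list[str]:
    # Each stage (embedding, scoring, k, upstream chunking) can be tuned in isolation.
    doc_vecs = np.stack([embed(d) for d in docs])
    scores = doc_vecs @ embed(query)  # cosine similarity, since vectors are unit-norm
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

def answer(query: str, docs: list[str]) -> str:
    context = "\n---\n".join(retrieve(query, docs))
    return complete(f"Using only this context:\n{context}\n\nQuestion: {query}\nAnswer:")
```

The point isn't this exact code; it's that when retrieval quality drops, you can see exactly which stage to poke.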
> When the next token is a URL, and the URL does not match the preceding anchor text.
> Additional layers of these 'LLMs' could read the responses and determine whether their premises are valid and their logic is sound as necessary to support the presented conclusion(s), and then just suggest a different citation URL for the preceding text
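A rough sketch of that kind of second pass (everything here is illustrative: `complete()` stands in for whatever model client you use, and the prompts are just placeholders):

```python
def complete(prompt: str) -> str:
    """Placeholder for your actual LLM call."""
    raise NotImplementedError

def verify_citation(sentence: str, cited_url: str, candidates: dict[str, str]) -> str:
    """Return cited_url if its text supports the sentence, otherwise the
    model's pick from the retrieval pool (or 'NONE')."""
    verdict = complete(
        "Does the SOURCE support the CLAIM? Answer SUPPORTED or UNSUPPORTED.\n\n"
        f"CLAIM: {sentence}\nSOURCE: {candidates.get(cited_url, '')}"
    )
    if verdict.strip().upper().startswith("SUPPORTED"):
        return cited_url
    # Premise check failed: ask for a better-supported source from the same pool.
    listing = "\n".join(f"[{u}] {t[:300]}" for u, t in candidates.items())
    return complete(
        "Which URL's source text best supports the CLAIM? Reply with the URL, or NONE.\n\n"
        f"CLAIM: {sentence}\n\nSOURCES:\n{listing}"
    ).strip()
```

Running this per generated citation costs an extra model call per sentence, so it isn't free, but it's exactly the kind of step you can only add and tune when the pipeline isn't hidden behind an abstraction.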
Simple similarity search is not enough - a chunk can be topically similar to a claim without actually supporting it, and embeddings miss exact-match signals like names and URLs.
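One cheap step beyond pure cosine similarity is to fuse a lexical ranking (BM25 or similar) with the vector ranking. Reciprocal rank fusion is a common way to combine the two, since it needs no score calibration; a minimal sketch:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion over several rankings (lists of doc ids, best first)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = rrf([bm25_ranking, embedding_ranking])
```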