> … which is kind of the rub with current LLMs to begin with, right?
No, the bigger problem with current LLMs is that even with high quality factual training data, they often generate seemingly plausible nonsense (e.g. citing nonexistent websites/papers as their sources).
This is by design imo; they’re trained to generate ‘likely’ text, and they do that extremely well. There’s no guarantee for faithful retrieval from a corpus.
An important addition to your partially right statement that "they’re trained to generate ‘likely’ text": they are trained to produce the most probable next word, so that the current context ends up looking as "similar" to the training data as possible. And "similar" is not the same as "equal".
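To make that concrete, here is a toy sketch of greedy next-token selection; the vocabulary, logits, and context are invented for illustration and don't come from any real model. The point is that the selection step only asks which continuation is most probable given the context, never whether the resulting claim or citation actually exists.

```python
# Toy sketch of next-token selection, not any specific model's code.
# The vocabulary, logits, and context below are made up for illustration.
import math

def softmax(scores):
    """Turn raw scores (logits) into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits a model might assign to candidate next tokens
# after the context "The paper was published in".
vocab = ["2019", "Nature", "arXiv", "the", "a"]
logits = [2.1, 1.7, 1.5, 0.3, 0.1]

probs = softmax(logits)

# Greedy decoding: pick the single most probable token.
# Nothing here checks whether the resulting citation actually exists;
# the model only knows which continuation looks most like its training data.
best = max(range(len(vocab)), key=lambda i: probs[i])
print(vocab[best], round(probs[best], 3))
```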