"Calibration" in a knowledge context means having estimated_p(correct) ~ p(correct), and it turns out that LLMs are reasonably good at this. Also a core reason why LLM-as-a-judge works so well: quality evaluation is vastly easier than generation.
In my experience, self-validation performance and confidence ratings are both poor. I think the problem is that these sorts of formats just aren't that common in the training data. What does help is to ask a series of structured questions pertaining to quality and to aggregate those, but it's still often not that helpful.
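For what it's worth, the "structured questions" pattern looks roughly like this; a rough sketch assuming an OpenAI-style chat client, with the questions, model name, and helper names all being my own illustrative choices:

```python
# Sketch: ask several narrow yes/no quality questions and aggregate the verdicts,
# rather than asking for a single overall confidence rating.
from openai import OpenAI

client = OpenAI()

QUESTIONS = [
    "Does the answer directly address the question? Reply yes or no.",
    "Is every claim in the answer supported by the provided context? Reply yes or no.",
    "Is the answer free of internal contradictions? Reply yes or no.",
]

def ask(question: str, answer: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Answer to evaluate:\n{answer}\n\n{question}",
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def quality_score(answer: str) -> float:
    # Fraction of quality checks that passed, in [0, 1].
    verdicts = [ask(q, answer) for q in QUESTIONS]
    return sum(verdicts) / len(verdicts)
```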
I admit I'm a bit confused by the reward function: as given, it seems to provide the same score regardless of correctness because of the squaring? And even if that's a mistake and it's supposed to be negative for incorrect answers, the policy that optimizes that reward is to output 1 for anything with less than a 50% chance of being true and 10 for anything over 50%. Is that how RL is typically done?
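To spell out that argument: if the reward is +c² for a correct answer and -c² for an incorrect one, where c is the stated confidence on a 1-10 scale (my reading of what was intended, not what the article states), then the expected reward is (2p - 1)·c², so the optimizer pushes to the ends of the scale:

```python
# Assumed reward: +c^2 if correct, -c^2 if incorrect (c = stated confidence, 1-10).
def expected_reward(p_correct: float, c: int) -> float:
    return p_correct * c**2 - (1 - p_correct) * c**2   # = (2p - 1) * c^2

for p in (0.3, 0.5, 0.7):
    best_c = max(range(1, 11), key=lambda c: expected_reward(p, c))
    print(p, best_c)   # 0.3 -> 1, 0.5 -> 1 (all c tie at 0), 0.7 -> 10
```

That's why truthful confidence reporting usually calls for a proper scoring rule (e.g. Brier or log score) rather than a reward of this shape.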
While the training data may be wrong, LLMs still go off the rails and output things that don't seem like they would have been part of the training material in the first place.
I think if the training data has errors or outdated information, then an LLM reproducing that information isn't called hallucination; that's a different problem. It seems that it's just very difficult for an LLM to know whether it's stating something that's part of (or based on) its training data, regardless of whether it's true.
https://arxiv.org/abs/2207.05221
"Calibration" in a knowledge context means having estimated_p(correct) ~ p(correct), and it turns out that LLMs are reasonably good at this. Also a core reason why LLM-as-a-judge works so well: quality evaluation is vastly easier than generation.