SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales (arxiv.org)
28 points by jasondavies 10 months ago | 9 comments



Excellent background on knowledge calibration from Anthropic:

https://arxiv.org/abs/2207.05221

"Calibration" in a knowledge context means having estimated_p(correct) ~ p(correct), and it turns out that LLMs are reasonably good at this. Also a core reason why LLM-as-a-judge works so well: quality evaluation is vastly easier than generation.


<not this paper's approach>

This is one of the key prompting techniques in a lot of enterprise use cases. You can currently prompt LLMs to add a confidence score along with their responses.

Especially when you are using LLMs for downstream NLP tasks.

The confidence score can also be a great indicator for applying a two-tier model approach, sketched below!
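A rough sketch of that two-tier routing, assuming the OpenAI Python SDK; the model names, prompt format, and threshold are illustrative, not anything from the paper:

    import json
    from openai import OpenAI

    client = OpenAI()
    PROMPT = ('Answer the question and rate your confidence from 0 to 1. '
              'Reply with bare JSON: {"answer": ..., "confidence": ...}\n\nQuestion: %s')

    def ask(model, question):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT % question}],
            temperature=0,
        )
        # Assumes the model actually returns bare JSON; add stripping/retries in practice.
        return json.loads(resp.choices[0].message.content)

    def answer_two_tier(question, threshold=0.7):
        result = ask("gpt-4o-mini", question)   # cheap model first
        if result["confidence"] < threshold:
            result = ask("gpt-4o", question)    # escalate only when confidence is low
        return result

Of course, the escalation threshold is only as useful as the stated confidence is calibrated, which is what the thread (and the paper) are about.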


In my experience, self-validation performance and confidence ratings are both poor. I think the problem is that these sorts of formats just aren't that common in the training data. What does help is to ask a series of structured questions pertaining to quality and to aggregate those (sketched below), but it's still often not that helpful.
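Concretely, the structured-question approach looks something like this (a sketch assuming the OpenAI Python SDK; the checklist questions and yes/no parsing are made up for illustration):

    from openai import OpenAI

    client = OpenAI()

    # Task-specific quality checklist; these three are just examples.
    CHECKS = [
        "Does the answer directly address the question? Reply yes or no.",
        "Is every factual claim in the answer supported by the question's context? Reply yes or no.",
        "Is the answer free of internal contradictions? Reply yes or no.",
    ]

    def quality_score(question, answer):
        # Fraction of checklist questions the judge model answers "yes" to.
        yes = 0
        for check in CHECKS:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user",
                           "content": f"Question: {question}\nAnswer: {answer}\n\n{check}"}],
                temperature=0,
            )
            if resp.choices[0].message.content.strip().lower().startswith("yes"):
                yes += 1
        return yes / len(CHECKS)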


The challenge is that the confidence scores can often be confabulations themselves.


What is a confidence score like that based on?


Yeah, but they say in the paper that that isn't very accurate; they suggest a specifically fine-tuned model for it, not just prompting it for a score?


I admit I'm a bit confused by the reward function: as given, it seems to provide the same score independent of correctness due to the squaring. And I think even if that's a mistake and it's supposed to be negative for incorrect answers, a policy that optimizes for that reward would output 1 for anything with less than a 50% chance of being true and 10 for anything over 50%. Is that how RL is typically done?
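Here's a quick numeric check of that intuition, using a generic linear reward and a Brier-style squared reward (confidence scaled to [0, 1] for simplicity; I'm not claiming either is the paper's exact formulation):

    import numpy as np

    def best_confidence(p_correct, reward):
        # Grid-search the confidence in [0, 1] that maximizes expected reward.
        cs = np.linspace(0, 1, 21)
        expected = p_correct * reward(cs, 1) + (1 - p_correct) * reward(cs, 0)
        return cs[np.argmax(expected)]

    linear = lambda c, correct: c if correct else -c   # +c when right, -c when wrong
    brier = lambda c, correct: -(c - correct) ** 2     # proper scoring rule

    for p in (0.3, 0.6, 0.9):
        print(p, best_confidence(p, linear), best_confidence(p, brier))
    # Linear reward: the optimum jumps to 0 or 1 as soon as p crosses 0.5.
    # Brier reward: the optimal stated confidence equals p itself.

So yes, a reward that's linear in the stated confidence (even with the sign fixed) is optimized by going to the extremes; avoiding that is exactly what proper scoring rules are for.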


it is nice that you posted datasets




While the training data may be wrong, LLMs still go off the rails and output things that don't seem like they would have been part of the training material in the first place.

I think if the training data has errors or outdated information, then an LLM reproducing that information isn't what's called hallucination; that's a different problem. It seems that it's just very difficult for an LLM to know whether it's stating something that is part of, or based on, its training data, regardless of whether it's true.

For example, I think the goal is to refuse this prompt: https://chatgpt.com/share/943944d2-9b25-47a5-a4eb-23f040273b...



