
I don’t buy LeCun’s argument. Once you get good RL going (as we are now seeing with reasoning models), you can give the model a reward function that rewards a correct answer most highly, rewards an “I’m sorry, but I don’t know” less highly than that, penalizes a wrong answer, and penalizes a confidently wrong answer even more severely. As the RL learns to maximize reward, I would expect it to discover the strategy of saying it doesn’t know whenever it can’t find an answer it deems to have a high probability of being correct.
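
A minimal sketch of that reward shaping, in Python. Everything here is illustrative: the reward values, the regex-based abstain/confidence detection, and the function name are my assumptions, not any lab's actual recipe.

    import re

    # Hypothetical reward values; the ordering is what matters:
    # correct > abstain > wrong > confidently wrong.
    REWARD_CORRECT = 1.0            # right answer: highest reward
    REWARD_ABSTAIN = 0.2            # "I don't know": small positive reward
    PENALTY_WRONG = -1.0            # wrong answer: penalized
    PENALTY_CONFIDENT_WRONG = -2.0  # confidently wrong: penalized hardest

    ABSTAIN = re.compile(r"\b(i don't know|i'm not sure)\b", re.I)
    CONFIDENT = re.compile(r"\b(definitely|certainly|without a doubt)\b", re.I)

    def reward(answer: str, is_correct: bool) -> float:
        """Score one model answer against ground truth for RL training."""
        if ABSTAIN.search(answer):
            return REWARD_ABSTAIN
        if is_correct:
            return REWARD_CORRECT
        # Wrong answer: punish confident phrasing harder than a hedged guess.
        if CONFIDENT.search(answer):
            return PENALTY_CONFIDENT_WRONG
        return PENALTY_WRONG

Under a scheme like this, abstaining strictly dominates guessing whenever the model's estimated probability of being correct is low enough, which is exactly the behavior the parent comment is describing.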


How do you define the "correct" answer?


Certainly not possible in all domains, but just as certainly possible in some. There’s not much controversy about the height of the Eiffel Tower or how to concatenate two numpy arrays.
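
And the numpy case isn’t just uncontroversial, it’s mechanically checkable. A quick sketch using the standard np.concatenate call:

    import numpy as np

    a = np.array([1, 2, 3])
    b = np.array([4, 5, 6])

    # np.concatenate joins arrays along an existing axis; there is
    # one verifiably correct result, so a grader can check it exactly.
    c = np.concatenate([a, b])
    print(c)  # [1 2 3 4 5 6]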


Obviously the truth is whatever is most popular. /s



