
You punish it if parts of the answer can be found in its training data, and reward it otherwise.





But the whole point of the training is that you reward it if it correctly reproduces the next token.
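To make the tension concrete, here is a minimal sketch, not anyone's actual training setup. It assumes a toy whitespace-tokenized corpus; `next_token_loss` is the usual per-token cross-entropy, while `overlap_penalty` and its n-gram matching are purely hypothetical stand-ins for "punish it if parts of the answer can be found in its training data".

```python
import math


def next_token_loss(probs, target):
    """Cross-entropy at one position: low when the model assigns
    high probability to the token that actually comes next."""
    return -math.log(probs[target])


def ngrams(tokens, n):
    """Set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_penalty(output_tokens, training_tokens, n=3):
    """Hypothetical penalty: fraction of the output's n-grams
    that also occur verbatim in the training text."""
    out = ngrams(output_tokens, n)
    if not out:
        return 0.0
    return len(out & ngrams(training_tokens, n)) / len(out)


# Toy training text (an assumption for illustration only).
training = "the cat sat on the mat".split()

# Two made-up predictive distributions for the token after "... on the".
memorised = {"mat": 0.9, "dog": 0.1}
hedging = {"mat": 0.1, "dog": 0.9}

# Next-token training prefers the model that reproduces the data...
print(next_token_loss(memorised, "mat") < next_token_loss(hedging, "mat"))

# ...while the proposed penalty is highest for exactly that behaviour.
print(overlap_penalty(training, training))
print(overlap_penalty("a dog ran off".split(), training))
```

On the same sequence the two signals pull in opposite directions, which is the objection being raised: a verbatim continuation minimises the first loss and maximises the second.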


