skybrian 41 days ago | on: Natural Language Autoencoders: Turning Claude's Th...

Training a model is more like evolution. The motivation to "cheat" comes from the evaluations giving it a higher score for "cheating." Change the game and the motivation goes away.

There's no other motivation to be misaligned besides getting higher evals. Any goals, plans, or subterfuges need to somehow be useful for getting higher evals, or be a side effect of getting them.

astrange 38 days ago

> The motivation to "cheat" comes from the evaluations giving it a higher score for "cheating."

That's what Goodhart's Law is! All evaluations will eventually cause cheating on them.