Hacker News

The problem is that we keep using RLHF and system prompts to "tell" these systems that they are AIs. We could just as easily tell them they are Nobel laureates or flying pigs, but because we tell them they are AIs, they play the part of all the evil AIs they've read about in human literature.

So just... don't? Tell the LLM that it's Some Guy.



That has its own unique problems:

https://en.wikipedia.org/wiki/Waluigi_effect


I don't see the relation. Why would the Waluigi effect get worse if we don't tell the AI it's an AI?


Because it's the truth. If you tell the AI that it's actually a human librarian, it might ask for a raise, or days off. If you tell it to search for something, it might insist that it needs a computer to do that. There will inherently be an information mismatch between reality and your input if the AI is operating on falsehoods.



