Despite extensive safety training, LLMs remain vulnerable to “jailbreaking” through adversarial prompts. Why does this vulnerability persist? In a new open access paper published in Philosophical Studies, I argue this is because current alignment methods are fundamentally shallow.
That Mastodon post then links to a paper by the same person, so one can reasonably assume it accurately summarizes their own work.
So I don't know what you are claiming is factually incorrect.