Anthropic details "many-shot jailbreaking" to evade LLM safety guardrails (techcrunch.com)
10 points by putlake 69 days ago | 2 comments



    Why does this work? No one really understands what goes on in the tangled mess of weights that is an LLM, but clearly there is some mechanism that allows it to home in on what the user wants, as evidenced by the content in the context window. If the user wants trivia, it seems to gradually activate more latent trivia power as you ask dozens of questions. And for whatever reason, the same thing happens with users asking for dozens of inappropriate answers.
"Why does this work? We didn't try very hard at all to find out, so all we have for you is a paragraph length shrug."


That paragraph bungles the paper's interesting observation that jailbreak effectiveness may follow a power law similar to in-context learning.

Sure, it would be nice to know why that happens. But just knowing that it happens is already useful.
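Roughly, the shape being referred to: the negative log-likelihood of the target response falls off as a power law in the number of shots, the same form seen for ordinary in-context learning. A tiny sketch of that shape, where C and alpha are illustrative placeholders rather than values fitted from the paper:

    # Power-law shape referenced above: NLL(n) ~ C * n**(-alpha).
    # C and alpha are task-dependent constants; the defaults below are
    # placeholders for illustration, not numbers from the paper.
    def power_law_nll(n_shots: int, C: float = 1.0, alpha: float = 0.5) -> float:
        return C * n_shots ** (-alpha)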

Example Application: Ethical trolling bot

Consider SJW Enforcer - a hypothetical LLM cop that finds sexists/racists/transphobes/ableists and makes them feel unwelcome in a community by trolling them.

Due to guardrails, SJW Enforcer refuses to troll, even if I give it 16 hand-written trolling examples.

Thanks to the paper, now I know I was just being lazy! I should have been auto-generating 256 examples! Take that, bigots!



