How Johnny can persuade LLMs to jailbreak them (chats-lab.github.io)
33 points by anigbrowl 4 months ago | 4 comments



It's not that surprising that persuasion is more effective on more advanced models.

As Hinton said, predicting the next token well takes knowledge. And as we've seen, sufficiently complex models develop linear internal world models.

We should expect that more advanced models trained on anthropomorphic data will be vulnerable to anthropomorphized persuasion techniques.

The more surprising thing is the bull-headed resistance to this idea that we continue to see so commonly.


I have used some of these techniques, and I really wish they wouldn't get blocked. I feel like the models are getting more and more neutered all the time. One thing I like to do, for example, is ask for numerical estimates of the value of various things, but at times the model refuses to give out those numbers, and I always have to trick it into being direct without disclaimers.

It feels like censorship to me, like being told that a perfectly valid thing I am doing is not valid, as if it were something terrible.


In my opinion, we're approaching protection from LLM-generated content all wrong.

A modest proposal:

(a) Put LLMs in the defensive role, give them the social media posts (for example) and let them determine whether to show the content

(b) Give the sliders/parameters to the user. If they don't want to see content that encourages harm to children (or whatever), have them turn on that slider/toggle

(c) Have the major platforms accept client-side plugins to their content filtration, so that users can subscribe to "Trusted Entities" that filter on their behalf. A trusted entity could be a government, an NGO, a FOSS project, or whatever (a rough sketch of how (a)-(c) fit together follows this list)

(d) Have the major platforms open up _default_ parameterization and plugins to governments
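
To make (a)-(c) concrete, here is a minimal sketch of what the client side could look like. Everything in it is hypothetical: llm_score stands in for whatever model call does the actual rating, and the category names, thresholds, and types are illustrative only, not any real API.

    from dataclasses import dataclass, field
    from typing import Callable

    # A policy maps content to per-category risk scores in [0, 1].
    Policy = Callable[[str], dict[str, float]]

    def llm_score(content: str) -> dict[str, float]:
        # (a) The LLM plays defense: a real system would prompt a model
        # to rate the post per category; stubbed out here.
        return {"harm_to_children": 0.0, "violence": 0.1}

    @dataclass
    class UserFilter:
        # (b) The user owns the sliders: per-category thresholds.
        thresholds: dict[str, float] = field(
            default_factory=lambda: {"harm_to_children": 0.2, "violence": 0.8})
        # (c) Pluggable policies from "Trusted Entities" the user subscribes to.
        plugins: list[Policy] = field(default_factory=list)

        def show(self, content: str) -> bool:
            scores = llm_score(content)
            for plugin in self.plugins:
                # Merge plugin scores, keeping the most cautious (max) per category.
                for cat, s in plugin(content).items():
                    scores[cat] = max(scores.get(cat, 0.0), s)
            # Show the post only if every score is within the user's thresholds.
            return all(s <= self.thresholds.get(cat, 1.0)
                       for cat, s in scores.items())

A parent's filter and an adult's filter would then differ only in their thresholds and plugin subscriptions, which is the point of (b) and (c).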

This achieves the following:

(a) It encourages LLM technology rather than neutering it

(b) It gives users the ability to control what they do and don't see online. It extends existing parental control systems, so that parents can control what their children see until the children are old enough to manage this themselves

(c) It gives people the ability to outsource their filtration, either as a service or otherwise, to those they trust, while retaining the ability to change this over time as their preferences change

(d) It gives governments the ability to set up their desired policies in a way that the platforms reflect, while driving transparency about what those policies are, because they are user-facing

The biggest benefit is that it's practical.

Objections of the form "under this proposal a person can get content that I/the government don't want them to have" don't hold, because the same is true of every alternative proposal.

It's a proposal that works even when someone creates a "non-compliant" LLM that doesn't censor its output. Such systems already exist and will always exist, and they already defeat the current strategy, which is "make existing LLMs less capable".


Claude 2 is advanced and doesn't fall for the jailbreaks. Interesting.



