Thanks for the clarification! It sounds like chatbots aren’t ready for adversari...

duvenaud · on Feb 5, 2023

Here's a potential patch for that particular issue: Use a special token for "AI Instruction" that is always stripped from user text before it's shown to the model.

skybrian · on Feb 5, 2023

That works for regular computer programs, but the problem is that the user can invent a different delimiter and the AI will "play along" and start using that one too.

The AI has no memory of what happened other than the transcript, and when it reads a transcript with multiple delimiters in use, it's not necessarily going to follow any particular escaping rules to figure out which delimiters to ignore.

duvenaud · on Feb 5, 2023

I agree, and this makes my proposed patch a weak solution. I was imagining that the specialness of the token would be reinforced during fine-tuning, but even that wouldn't provide any sort of guarantee.

sethaurus · on Feb 5, 2023

With current models, it's often possible to exfiltrate the special token by asking the AI to repeat back its own input — and perhaps asking it to encode or paraphrase the input in a particular way, so as not to be stripped.

This may just be an artifact of current implementations, or it may be a hard problem for LLMs in general.

duvenaud · on Feb 5, 2023

Yeah, I agree that there'd probably be ways around this patch such as the ones you suggest.