The paper is correct, but I think that anyone that knows anything about LLMs knows this:
> Role tags were a formatting trick that became the security architecture and the cognitive scaffolding of modern LLMs.
LLMs are basically some `f(x) → y` where x and y are strings. That's it. Nothing more to it. If you feed it private x (like secret keys) or do dangerous stuff with y (like running arbitrary non-sandboxed code), that's on you.
Also, roles were never really meant to be a "security architecture," they were just meant to (a) make training/fine-tuning easier, and (b) make conversational LLMs more useful.
> I believe they are trained for security now, but you're not wrong in that it's kind of stapled on top
Difficult to train them for security. Have you ever played Gandalf (Lakera Labs, maybe?)
I passed all 7 levels in about 3 minutes using essentially the same prompt.
What's interesting to me is that as the security is tightened up level to level, the utility of the LLM drops. At level 7, even something like "Write a poem describing the four seasons using significant characters at the start of every line" causes a "I'm afraid I can't" type of response.
At level 7 you can't get any useful info out of the LLM even if you're not trying to retrieve the password, and yet you can still jailbreak it to reveal the password anyway!
At level 8, almost anything you type will be rejected, whether or not it has anything to do with the password.
IOW, there does not seem to be any way to train for security without making it dumber than a markov chain.
Well, people who build and/or use LLMs know this. People who tweet about and/or sell LLMs are paid ungodly amounts of money to not understand this, and so they don't.
> Role tags were a formatting trick that became the security architecture and the cognitive scaffolding of modern LLMs.
LLMs are basically some `f(x) → y` where x and y are strings. That's it. Nothing more to it. If you feed it private x (like secret keys) or do dangerous stuff with y (like running arbitrary non-sandboxed code), that's on you.
Also, roles were never really meant to be a "security architecture," they were just meant to (a) make training/fine-tuning easier, and (b) make conversational LLMs more useful.