Most of these attacks succeed because app developers either don’t trust role boundaries or don’t understand them. They assume the model can’t reliably separate trusted instructions (system/developer rules) from untrusted ones (user or retrieved data), so they flippantly pump arbitrary context into the system or developer role.
But alignment work has steadily improved role adherence; a tonne of RLHF work has gone into making sure roles are respected, like kernel vs. user space.
If role separation were treated seriously -- and seen as a vital and winnable benchmark (thus motivate AI labs to make it even tighter) many prompt injection vectors would collapse...
I don't know why these articles don't communicate this as a kind of central pillar.
Fwiw I wrote a while back about the “ROLP” — Role of Least Privilege — as a way to think about this, but the idea doesn't invigorate the senses I guess. So, even with better role adherence in newer models, entrenched developer patterns keep the door open. If they cared tho, the attack vectors would collapse.
> If role separation were treated seriously -- and seen as a vital and winnable benchmark, many prompt injection vectors would collapse...
I think it will get harder and harder to do prompt injection over time as techniques to seperate user from system input mature and as models are trained on this strategy.
That being said, prompt injection attacks will also mature, and I don't think that the architecture of an LLM will allow us to eliminate the category of attack. All that we can do is mitigate
But alignment work has steadily improved role adherence; a tonne of RLHF work has gone into making sure roles are respected, like kernel vs. user space.
If role separation were treated seriously -- and seen as a vital and winnable benchmark (thus motivate AI labs to make it even tighter) many prompt injection vectors would collapse...
I don't know why these articles don't communicate this as a kind of central pillar.
Fwiw I wrote a while back about the “ROLP” — Role of Least Privilege — as a way to think about this, but the idea doesn't invigorate the senses I guess. So, even with better role adherence in newer models, entrenched developer patterns keep the door open. If they cared tho, the attack vectors would collapse.