Security in the age of LLMs (mufeedvh.com)
42 points by mufeedvh on Dec 10, 2022 | 6 comments



Isn't the threat model here pretty similar to running untrusted code, i.e. what browsers deal with in their JavaScript sandboxes?

In this threat model, no one tries to pre-verify that the code doesn't do anything bad (thanks to the halting problem, we know that's generally impossible). Instead, the usual approach is to sandbox the JavaScript interpreter itself and ensure it can only access pre-approved resources.

I think a similar approach would be reasonable for LLMs. Trying to teach boundaries to the model itself is always going to be an error-prone cat-and-mouse game. It seems much more practical to restrict the model's I/O and treat its outputs the way you'd treat untrusted, user-provided input in a conventional system.
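
A minimal sketch of that approach in Python, assuming a hypothetical call_llm() helper and two made-up tools; the point is that the model's output is parsed like untrusted input and can only ever reach a pre-approved allowlist:

    import json

    def call_llm(prompt: str) -> str:
        # Placeholder for a real model call; returns a canned "tool request".
        return '{"tool": "search_docs", "args": {"query": "prompt injection"}}'

    # Pre-approved resources, like a browser sandbox exposing only a fixed
    # set of APIs to untrusted JavaScript.
    def search_docs(query: str) -> str:
        return f"(results for {query!r})"

    def get_weather(city: str) -> str:
        return f"(weather for {city})"

    ALLOWED_TOOLS = {"search_docs": search_docs, "get_weather": get_weather}

    def run_model_action(prompt: str) -> str:
        raw = call_llm(prompt)
        # Treat the output exactly like untrusted user input: parse strictly,
        # reject anything that isn't a known tool with string arguments.
        try:
            action = json.loads(raw)
            tool = ALLOWED_TOOLS[action["tool"]]
            args = {str(k): str(v) for k, v in action["args"].items()}
            return tool(**args)
        except Exception as exc:
            return f"rejected model output: {exc!r}"

    print(run_model_action("find docs about prompt injection"))

Even if the model is prompt-injected, the worst it can request is one of the whitelisted tools with string arguments.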


This exactly. The simplest policy is to treat LLM outputs like untrusted inputs, which means building a policy layer with explainable logic that scrutinizes, validates, and decides what to do with them (a toy version of such a layer is sketched after this comment).

The policy above is good advice for a ton of ML models with poorly understood behavior, like biased image-recognition nets. LLMs are simply harder to trust because their behavior varies so much with their inputs.

Prompt injection is an interesting species of attack, but it doesn't really change the threat surface. Prompt programming isn't reliable enough to be depended on for guarantees in the first place, and outputs can be dangerous with or without injection.
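
One way to read the "policy layer with explainable logic" above is as a set of plain, auditable rules sitting between the model's suggestion and any real action. The action names and the refund limit below are invented for illustration:

    import json

    MAX_REFUND_USD = 50.0                      # illustrative policy limit
    ALLOWED_ACTIONS = {"refund", "escalate", "no_action"}

    def apply_policy(llm_output: str) -> dict:
        """Validate a model-suggested action with explicit, explainable rules."""
        try:
            suggestion = json.loads(llm_output)
        except json.JSONDecodeError:
            return {"decision": "reject", "reason": "unparseable output"}
        action = suggestion.get("action")
        if action not in ALLOWED_ACTIONS:
            return {"decision": "reject", "reason": f"unknown action {action!r}"}
        if action == "refund":
            amount = suggestion.get("amount_usd")
            if not isinstance(amount, (int, float)) or amount <= 0:
                return {"decision": "reject", "reason": "invalid amount"}
            if amount > MAX_REFUND_USD:
                # The model can suggest anything; the policy decides.
                return {"decision": "escalate", "reason": "refund above limit"}
            return {"decision": "refund", "amount_usd": float(amount)}
        return {"decision": action}

    print(apply_policy('{"action": "refund", "amount_usd": 500}'))
    # {'decision': 'escalate', 'reason': 'refund above limit'}

Every rejection or escalation carries a reason a human can audit, which is what makes the layer explainable.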


The article focuses on human overrides, but I think the more obvious and gaping security issue is the lack of any meaningful way to verify output correctness, whether it's intentionally gamed or not.

I predict nearly all of the upcoming LLM products will end up being fancy autocomplete: suggestions a user then has to feed into a more constrained system, with some sort of manual confirmation or tweaking.
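
A sketch of that human-in-the-loop pattern; draft_sql() is a made-up stand-in for whatever the model suggests (a query, an email, a config change):

    def draft_sql(request: str) -> str:
        # Stand-in for a model call that drafts a suggestion from the request.
        return "SELECT name, email FROM users WHERE last_login < '2022-01-01';"

    def apply_with_confirmation(request: str) -> None:
        suggestion = draft_sql(request)
        print("Model suggests:\n" + suggestion)
        answer = input("Run this against the read-only replica? [y/N] ")
        if answer.strip().lower() != "y":
            print("Discarded.")
            return
        # Only here does the suggestion enter the constrained system,
        # and only after a human has looked at it.
        print("(executing against read-only replica...)")

    apply_with_confirmation("list users who haven't logged in this year")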


Verifying language models is going to make the difference between useful and useless. I predict that within 12 months we'll have a fact-checking neural network, possibly backed by an additional text index of verified facts (a toy version of that index idea is sketched after this comment).

And all these hacks will go away in the next point release; they just need to collate them and add them to the training set. There will still be adversarial attacks, though. Those are hard to guard against, and they won't be created manually; we'll need algorithms to find them.
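
Purely speculative, but here is a toy version of the verified-facts-index idea from the comment above; the facts and the exact-match lookup are placeholders for real retrieval over a large corpus:

    VERIFIED_FACTS = {
        "boiling point of water at sea level": "100 degrees Celsius",
        "speed of light in vacuum": "299,792,458 m/s",
    }

    def fact_check(claim_topic: str, claim_value: str) -> str:
        # Look the claim up in the index of verified facts.
        known = VERIFIED_FACTS.get(claim_topic)
        if known is None:
            return "unverifiable: no entry in the index"
        if known == claim_value:
            return "supported"
        return f"contradicted: index says {known}"

    print(fact_check("speed of light in vacuum", "300,000 km/s"))
    # contradicted: index says 299,792,458 m/s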


Here's a fact-checking LLM for the LLMs: https://arxiv.org/abs/2210.08726


This is basically unsolvable except by secure containerisation, because the models themselves are very much a black box. You can't protect what you can't understand, except by putting it in an impenetrable cage.
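
A minimal sketch of that cage, assuming Docker is installed locally; the image name and resource limits are illustrative, not a hardened configuration:

    import subprocess

    def run_untrusted(code: str) -> str:
        """Run model-generated Python inside a locked-down container."""
        cmd = [
            "docker", "run", "--rm",
            "--network", "none",     # no network access
            "--read-only",           # no writes to the container filesystem
            "--memory", "256m",      # cap memory
            "--cpus", "0.5",         # cap CPU
            "--pids-limit", "64",    # cap process count
            "python:3.11-slim", "python", "-c", code,
        ]
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
        return result.stdout

    print(run_untrusted("print(2 + 2)"))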



