Hacker News new | past | comments | ask | show | jobs | submit login

Because we already knew those were vulnerable. This new research is about a model where you can look at the weights and even re-train them, but the backdoor persists.



I am confused by what youre saying. Anthropic's research and Karpathy's speculation about attack paths based on the idea of "sleeper agents" are both (as far as I know) new ideas and apply to closed models just as much as they do to open ones. There is nothing about this research that is specific to either open or closed source models.

This spin from ars technica and others seems to just be anti-open source AI editorializing aimed at increased regulatory capture for the big players.


I guess you could do this with a closed-source model, but there are much easier ways of backdooring those. That's why this attack isn't really relevant. If a company wanted to mess with API responses, they could just do that directly or swap out the backend model whenever they wanted. A model release with a restrictive license that forbids further training could already be trained with whatever behavior the attacker wanted. So the only case where a backdoor surviving retraining is interesting is in an open model.


I am totally confused by what youre saying still. First what are the easier ways of backdooring closed source models? There is some other way to get a model to output known vulnerable code in 2024 but not 2023 or to exfil data? As far as I am aware this is a novel capability of this specific type of vulnerability.

This paper from Anthropic and subsequent speculation are not just about this vulnerability surviving retrains though that is an important facet. It is also demonstrating that unwanted behavior can be hidden and then triggered. This is IMO the most important part of the research and the part that is emphasized the most in the paper: that a model can perform perfectly well until some trigger happens and then start doing unwanted behavior in a wide variety of ways.

>"If a company wanted to mess with API responses, they could just do that directly or swap out the backend model whenever they wanted."

What company are we talking about here? The company hosting the model as a service or a company trying to attack the model?


I remember when Ars was a legitimately independent tech site in the /. days, before Condé Nast bought it. Now it's a megaphone for Big Tech and has been hijacked by political activism, a shadow of its former self. As are most tech publications these days.


They criticize big tech companies to the point that some writers get flack on every article, and they’ve never been shy about covering topics with political ramifications so you might also want to consider whether your own political opinions have shifted in ways which leads you more sensitive to views you now identify as belonging to opponents.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: