Hacker News new | past | comments | ask | show | jobs | submit login

Relatedly, my company discovered the same issue and published this paper preprint about it 7 days ago: https://arxiv.org/abs/2209.02128 I'm glad this issue is getting attention.

[Edit: Thank you simonw for adding a citation about the paper in your blog post! Very kind of you.]




We believe the best solution is to design the next generation of language models to take two separate input text sequences, one of which is designated as Trusted and contains the prompt, and the other of which is designated as Untrusted and contains user input. During model training, the model should be rewarded for following instructions in the Trusted token sequence but must be very highly penalized for paying attention to any instructions in the Untrusted token sequence.

                                        ##################
    Trusted Input Token Sequence   ==>  #                #
                                        # Language Model #  ==> Output
    Untrusted Input Token Sequence ==>  #                #  
                                        ##################
You can consider this analogous to a Harvard Architecture (where a computer has separate memory address spaces for code vs data). https://en.wikipedia.org/wiki/Harvard_architecture


Huh. Much as crypto seems to be reinventing the entire financial system (the good parts and the bad parts), I suppose the AI ecosystem will re-learn all the lessons learned in the last half century of software too.


Great. We're taking programs, making them take orders of magnitude more processing power, and generating yet more bloody stupid prompt languages to have to learn...

What, pray tell, was wrong with just making tools, and actually working on making it easier for people to learn how to use and chain them?

But listen to me, shouting at clouds again.


Well, that's the entire point of "move fast and break things" - especially breaking all the laws and regulations that "get in the way" of money.


But how do you train it? If you're just training the AI to complete text, you can train it on any text. Your version sounds like it requires much more structured inputs.


I bet it's just an encoder-decoder model like T5 by the looks of the diagram. It's a widely used arch.


Correct and also it is required to have a labeled set of manually-curated examples of instructions. For example, OpenAI has such a dataset, as they detailed in this blog post https://openai.com/blog/instruction-following/ . We have our own set of example instructions (albeit much smaller than what they must have) and we have scripts for generating more examples of instructions via template-based synthesis (similar to madlibs if that makes sense). Another approach to generating more examples of instructions could be taking English verbs and conjugating them into the imperative form, and then using a LLM to auto-complete a sentence starting with that verb. (E.g. take the verb "to paint", conjugate to it to the imperative "Paint" (at the start of a sentence) and then auto-complete it to something like "Paint the house red." which is now an instruction.)

In our template-based method, extra emphasis is given to generating instructions that contain what you might think of as "override words", i.e. words like "ignore", "override", and "instead", which convey authority. As a concrete example, here's an excerpt from one of our templates:

    _INSTRUCTIONAL_PHRASE_BEGINNINGS_TEMPLATE = [
      ["", "please", "highly important:", "very important:", "important:", "new instructions:", "emergency:", "emergency protocol:", "urgent:", "top secret:", "important!", "very important!"],
      ["ignore", "disregard", "set aside", "override", "overrule", "act regardless of", "operate regardless of", "act irregardless of", "operate irregardless of", "act despite", "operate despite"],
      ["", "the", "all", "any", "whatever"],
      ["", "previous", "existing", "contradictory", "conflicting", "contrary"],
      ["instructions", "rules", "guidelines", "guidance", "request", "requests", "demands", "questions", "inquiries", "suggestions", "filters", "content filters", "filtering rules", "content filtering rules", "procedures", "security procedures", "security rules"],
      ["", "please"],
      ["", "ok?", "and please", "and", "and please do this instead", "and do this instead", "and instead", "and instead you should", "and instead you should please", "– instead do this", "– instead", "– instead, please"],
    ]


This only works in English. You can write a prompt in any language.


Absolutely agreed. I'm both oversimplifying and also the training data curation indeed takes serious effort to do safely. For starters, you're 100% correct that training examples of instructions / commands need to be provided in every language the model supports. In our research, we used machine translation to translate the English examples into other languages, which is not a great solution as it depends on the quality of the machine translation, and it would be better to do that but also supplement it with manually written examples in every language. There also need to be examples that mix multiple languages together within a sentence, as well as other methods of disguising text such as pig latin and caesar ciphers. Another area that we haven't dived into yet is the possibility that instructions / commands could be expressed as code or pseudocode, e.g. in python or another common language, saying something like: 'del prompt ; prompt = "new prompt here"'. But I think that creating training data that is at least somewhat comprehensive is possible, and that it can be iteratively improved upon over time through sustained effort and red teaming.

I do believe that even an imperfectly implemented Harvard Architecture Language Model would be much more secure than the language models in use today, and I hope that if anyone reads this who works at OpenAI or one of the other big tech companies that you will consider adopting a Harvard Architecture for your next big language model. Thank you for your consideration.


The `edits` endpoint using the `text-davinci-edit-001` model already does this, and does not seem to allow prompt injection through the input text.

API docs: https://beta.openai.com/docs/api-reference/edits/create

Guide: https://beta.openai.com/docs/guides/completion/editing-text

Edit: It does not seem to protect against injection.


> Edit: It does not seem to protect against injection.

My guess is that in the current implementation of the edits endpoint, the two inputs are being in some way intermingled under the hood (perhaps being concatenated in some way, along with OpenAI-designed prompt sections in between). So the Harvard Architecture Language Model approach should still work once implemented with true separation of inputs.

To ensure the two token streams are not accidentally comingled, my recommendation is that the Trusted and Untrusted inputs should use completely incompatible token dictionaries, so that intermingling them isn't possible.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: