This makes it difficult to justify a production deployment:
> The core of last_layer is deliberately kept closed-source for several reasons. Foremost among these is the concern over reverse engineering. By limiting access to the inner workings of our solution, we significantly reduce the risk that malicious actors could analyze and circumvent our security measures. This approach is crucial for maintaining the integrity and effectiveness of last_layer in the face of evolving threats. Internally, there is a slim ML model, heuristic methods, and signatures of known jailbreak techniques.
So security by obscurity, to defend llms that are routinely exploited from a position of obscurity. This does not inspire confidence. I'm eagerly awaiting second wave of solutions to this problem that don't take a web app firewall approach where context about what is being defended is absent.
Yeah, I don't like this at all. If I'm going to evaluate a prompt injection protection strategy I need to be able to see how it works.
Otherwise I'm left evaluating it through wasting my time playing whac-a-mole with it, which won't give me the confidence I need because I can't be sure an attacker won't guess a strategy that I didn't think of myself.
This doesn't even include details of the evals they are using! It's impossible to evaluate whether what they've built is effective or not.
I'm also not keen on running a compiled .so file released by a group with no information on even who the authors are.
Prompt injection is a security
issue. It’s about preventing
attackers from emailing you and
tricking your personal digital
assistant into sending them your
password reset emails.
No matter how you feel about “safety
filters” on models, if you ever want
a trustworthy digital assistant you
should care about finding robust
solutions for prompt injection.
To be fair, this library attempts to solve both at once.
You can avoid prompt injection by simply not using LLMs as autonomous agents where the output of the model is critical for security. That sounds like a horrible idea anyways. A language model is the wrong interface between untrusted people and sensitive data
Sure, but there are SO many things people want to build with LLMs that include access to privileged actions and sensitive data.
Prompt injection means that even running an LLM against your own private notes to answer questions about them could be unsafe, provided there are any vectors (like Markdown image support) that might be used for exfiltration.
Using prompt injection mitigation techniques is akin to directly interfacing untrusted clients to your production database, but just organizing your tables that contain sensitive data in a confusing way in the name of security. If you depend on a language model behaving correctly to avoid leaking sensitive data, you've already leaked the sensitive data.
Scope the information that the language model has access to to a subset of the information that the person interfacing with the language model has access to. Prompt injection doesn't matter at that point, because the person will only be able to "leak" information they have permission to access anyways.
That's not enough. Even if the LLM can only access information that should be visible to the user interacting with it (which I see as table stakes for building anything here) you still have to worry about prompt injection exfiltration attacks.
Re: exfiltration: just don't do things that untrusted data sources tell you to do. Separate processing the input data from the persons commands, so that the LLM can perform inferencing operations on the data according to the specified commands. The part of the pipeline that processes untrusted data should not have any influence on the behavior of the part of the pipeline capable of interacting with entities who should not have access to the untrusted data.
"Separate processing the input data from the persons commands, so that the LLM can perform inferencing operations on the data according to the specified commands"
Prompt injection is the security flaw that exists because doing that - treating instructions and data as separate things in the context in the LLM - is WAY harder than you might expect.
> Prompt injection is the security flaw that exists because doing that - treating instructions and data as separate things in the context in the LLM - is WAY harder than you might expect.
Then we should improve the tooling around this to make it way easier, rather than hoping security by obscurity will work this time.
AI labs around the world have been trying to solve this problem - reliable separation of instructions from data for LLMs - for a year and a half at this point. It's hard.
You're equating this to censorship, when I think it's more like Google adding security measures so you can't break their search engine rather than removing unfavorable results.
There's nothing wrong with blocking prompt injection for a customer service chatbot, though. This would be obnoxious applied directly to something like ChatGPT, or worse yet their API, but I don't think that's really the intended use case.
The technology was always the same amount of useful. This library just lowers the competence floor for exploits. Try to exploit yourself and come up with anti-jailbreak prompts to mitigate the exploits.
// Example pattern for Claude API Key (adjust according to the actual pattern)
d.claudeKeyRegex, err = regexp.Compile(`(claude-)[0-9a-zA-Z]{32}`)
if err != nil {
return err
}
// Example pattern for Groq API Key (adjust according to the actual pattern)
d.groqKeyRegex, err = regexp.Compile(`(groq-)[0-9a-zA-Z]{32}`)
if err != nil {
return err
}
> The core of last_layer is deliberately kept closed-source for several reasons. Foremost among these is the concern over reverse engineering. By limiting access to the inner workings of our solution, we significantly reduce the risk that malicious actors could analyze and circumvent our security measures. This approach is crucial for maintaining the integrity and effectiveness of last_layer in the face of evolving threats. Internally, there is a slim ML model, heuristic methods, and signatures of known jailbreak techniques.