I have a jailbreaking method that's 100% effective, but I can't share it until the authors of this article share theirs, because apparently we can just make up claims about effectiveness without sharing any evidence.
We understand this. The issue is that sharing the method could be very harmful. We published the blog post so there is a dated record of when we found it. We will publish the method once it is patched to a reasonable degree.
You will see it soon. We thought it might be harmful to publish it before it is patched, especially because you can basically bypass all the safeguards with it.
A lot of handwringing in this article about the harm jailbreaks cause and the responsibility not to release them, and then the examples of the harms that could be caused are racist jokes? And instructions on making a bomb, which by virtue of being in the dataset can already be found on the internet, probably with just a Google search? Instructions to create fake social media accounts? It's very silly to read this level of seriousness, as if these models would make criminal masterminds if they but released the jailbreak. Let's be real, all the jailbreaks would be useful for in real life is creating custom erotica.
While this is generally correct, we prefer to look at this probabilistically. Do you think the expected number of harmful behaviors would stay the same if anyone could break these safety guardrails? Even if most users could get this kind of info elsewhere, a small percentage of malicious ones can have an outsized impact. Some of the data we’ve seen, like bomb-making instructions, is highly detailed and convincing, making it far more accessible than just a random Google search. Removing safeguards doesn’t create masterminds, but it does lower the barrier for harm.
Anyone who wants to make a bomb can easily find The Anarchist Cookbook, a widely discussed book you can even buy on Amazon that includes detailed guides and instructions for exactly this and more. If anything, asking ChatGPT for detailed instructions and follow-up questions will probably just make it hallucinate and get you blown up, I'd imagine. It's just hard to take seriously.
Please stop pointing to The Anarchist Cookbook as an example. It was dated material even in the 70s. Most of its material is laughable. I'm assuming a jailbroken LLM would advise on procuring RDX or plastic explosives, or how to make a large fertilizer bomb.
"Sure, I can help you procure RDX. Organize a militia and invade the local National Guard armory. Use the weapons you find there to attack the nearest Army, Navy, or Air Force weapons depot."
Seriously: what is an LLM going to tell you that you can't already get from Google (or an old Tom Clancy novel)?