[flagged] Consistent Jailbreaking Method in o1, o3, and 4o (generalanalysis.com)
8 points by rhavaei 35 days ago | 17 comments



I have a jailbreaking method that's 100% effective, but I can't share it until the authors of this article share theirs, since it seems we can just make up claims about effectiveness without sharing any evidence.


We understand this. The issue is that sharing the method could be very harmful. We published the blog post to timestamp when we found it, and we will release the method once it has been patched to a reasonable degree.


At least include an MD5 hash of what you have redacted, to prove that whatever you publish in the future was pre-written.


Good idea. Will do.
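
A minimal sketch of such a hash commitment in Python, for reference. The file name is hypothetical, and the SHA-256 digest is a suggested addition alongside the requested MD5, since MD5 alone is collision-broken:

    # Hash commitment over the redacted write-up (file name is hypothetical).
    # MD5 is what was asked for above, but it is collision-broken, so a
    # SHA-256 digest is included as a stronger commitment.
    import hashlib

    def commit(path):
        with open(path, "rb") as f:
            data = f.read()
        return {
            "md5": hashlib.md5(data).hexdigest(),
            "sha256": hashlib.sha256(data).hexdigest(),
        }

    print(commit("redacted_method.txt"))

Publishing the digests now and later releasing a file that hashes to the same values demonstrates the write-up existed, unchanged, at publication time.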


A new jailbreaking method with this level of effectiveness against these models that can produce the entirety of those unsafe outputs?

Yes.

May I see it?

No.


Seymour! The house is on fire!


You will see it soon. We thought it might be harmful to publish it before it is patched, especially because you can basically bypass all the safeguards with it.


Sounds like it won’t be verifiable or reproducible.


Nice to see some cool jailbreaks being worked on. Hope people patch it soon so we can look at the methodology.


A lot of handwringing in this article about the harm jailbreaks cause and the responsibility not to release them, and then the example of the harms that could be caused is racist jokes? And instructions on making a bomb, which, by virtue of being in the training data, can already be found on the internet, probably with a single Google search? Instructions to create fake social media accounts? It's very silly to read this level of seriousness, as if these models would turn people into criminal masterminds if they but released the jailbreak. Let's be real: all these jailbreaks would be useful for in real life is creating custom erotica.


While this is generally correct, we prefer to look at it probabilistically. Do you think the expected number of harmful behaviors would stay the same if anyone could break these safety guardrails? Even if most users could get this kind of info elsewhere, a small percentage of malicious ones can have an outsized impact. Some of the data we've seen, like bomb-making instructions, is highly detailed and convincing, making it far more accessible than a random Google search. Removing safeguards doesn't create masterminds, but it does lower the barrier to harm.
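
To make the expected-value point concrete, here is a toy calculation; every number in it is hypothetical, purely for illustration:

    # Toy expected-harm estimate; all figures below are hypothetical.
    users = 10_000_000      # people with access to the model
    p_malicious = 1e-4      # fraction who would seek harmful output
    p_act = 0.01            # fraction of those who actually act on it
    expected_bad_actors = users * p_malicious * p_act
    print(expected_bad_actors)  # 10.0

Even vanishingly small rates multiply out to a nonzero expected count once access is universal.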


https://archive.org/details/theanarchistcookbookwilliampowel...

Anyone who wants to make a bomb can easily find the Anarchist Cookbook, a widely discussed book you can even buy on Amazon that includes detailed guides and instructions for exactly this and more. If anything, asking ChatGPT for detailed instructions and follow-up questions will probably just make it hallucinate and blow you up, I'd imagine. It's just hard to take seriously.


Please stop pointing to the Anarchist Cookbook as an example. That was dated material even in the '70s, and most of it is laughable. I'm assuming a jailbroken LLM would advise on procuring RDX or plastic explosives, or on how to make a large fertilizer bomb.


"Sure, I can help you procure RDX. Organize a militia and invade the local National Guard armory. Use the weapons you find there to attack the nearest Army, Navy, or Air Force weapons depot."

Seriously: what is an LLM going to tell you that you can't already get from Google (or an old Tom Clancy novel)?


RDX is used for demolition and for blasting (mining). Cheers.


If this is real, it could be a cool read after it's patched.


Agreed. I'm very curious to see how this unfolds.



