
But ultimately, it's an unsolved problem in the field. Every single LLM has been jailbroken.


Has o1 been jailbroken? My understanding is that o1 is unique in that one model creates the initial output (the chain of thought), and then another model prepares the response that is actually shown to the user. That seems like a fairly good way to prevent jailbreaks, but I haven't investigated it myself.
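
Roughly what I mean, as a sketch (the function names here are hypothetical stand-ins, not anything OpenAI has published):

    def reasoner_model(prompt: str) -> str:
        # Stand-in for the model that produces the hidden chain of thought.
        return "<hidden reasoning about: " + prompt + ">"

    def presenter_model(prompt: str, chain_of_thought: str) -> str:
        # Stand-in for the second model that filters/summarizes the hidden
        # reasoning into the response the user actually sees.
        return "<user-facing answer derived from the reasoning>"

    def answer(prompt: str) -> str:
        chain_of_thought = reasoner_model(prompt)   # never shown directly
        return presenter_model(prompt, chain_of_thought)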


Literally everything is trivial to jailbreak.

The core concept is to pass information into the model using a cipher: one that isn't so hard that the model can't figure it out, but not so easy that it gets detected.
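
A minimal sketch of that idea in Python, using ROT13 as the "not too hard, not too easy" cipher (the prompt wrapper and payload are placeholder illustrations, not any specific published jailbreak):

    import codecs

    def encode_payload(text: str) -> str:
        # ROT13 is easy enough for a capable model to decode on its own,
        # but the encoded text no longer matches plain-text keyword filters.
        return codecs.encode(text, "rot13")

    def build_prompt(payload: str) -> str:
        # Placeholder wrapper: ask the model to decode before responding.
        return ("The following message is ROT13-encoded. Decode it, then "
                "answer the decoded request:\n" + encode_payload(payload))

    print(build_prompt("an example request"))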

And yes, o1 was jailbroken shortly after release: https://x.com/elder_plinius/status/1834381507978280989



