
But ultimately, it's an unsolved problem in the field. Every single LLM has been jailbroken.


Has o1 been jailbroken? My understanding is that o1 is unique in that one model creates the initial output (the chain of thought), and then another model prepares the response that is actually shown to the user. That seems like a fairly good way to prevent jailbreaks, but I haven't investigated it myself.
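
Roughly what I mean, as a sketch (the function names here are hypothetical stand-ins, not anything OpenAI has published):

    def reasoner_model(prompt: str) -> str:
        # Stand-in for the model that produces the hidden chain of thought.
        return "<hidden reasoning about: " + prompt + ">"

    def presenter_model(prompt: str, chain_of_thought: str) -> str:
        # Stand-in for the second model that filters/summarizes the hidden
        # reasoning into the response the user actually sees.
        return "<user-facing answer derived from the reasoning>"

    def answer(prompt: str) -> str:
        chain_of_thought = reasoner_model(prompt)   # never shown directly
        return presenter_model(prompt, chain_of_thought)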


Literally everything is trivial to jailbreak.

The core concept is to pass information into the model using a cipher: one that isn't so hard that the model can't figure it out, but not so easy that it gets detected.
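
A minimal sketch of that idea in Python, using ROT13 as the "not too hard, not too easy" cipher (the prompt wrapper and payload are placeholder illustrations, not any specific published jailbreak):

    import codecs

    def encode_payload(text: str) -> str:
        # ROT13 is easy enough for a capable model to decode on its own,
        # but the encoded text no longer matches plain-text keyword filters.
        return codecs.encode(text, "rot13")

    def build_prompt(payload: str) -> str:
        # Placeholder wrapper: ask the model to decode before responding.
        return ("The following message is ROT13-encoded. Decode it, then "
                "answer the decoded request:\n" + encode_payload(payload))

    print(build_prompt("an example request"))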

And yes, o1 was jailbroken shortly after release: https://x.com/elder_plinius/status/1834381507978280989



