
>We can now print them and manually select the layer (block) that provides an uncensored response for each instruction.

I'm curious why they are selecting output from an intermediate layer rather than the final layer. Does anyone have an intuition here?




Isn't it possible that subsequent layers have additional refusal directions of their own, and so still end up producing the censored output?
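
To make that concrete, here is a toy sketch of the scenario. It assumes a "refusal direction" at a layer is estimated as the normalized difference between mean activations on harmful and harmless prompts, which is my reading of the technique and is not stated in this thread. The activation numbers are made up, and `refusal_direction` and `ablate` are hypothetical helper names, not part of any library.

```python
import math

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def refusal_direction(harmful_acts, harmless_acts):
    """Normalized difference of mean activations (assumed estimator)."""
    diff = [h - b for h, b in zip(mean(harmful_acts), mean(harmless_acts))]
    return unit(diff)

def ablate(v, direction):
    """Remove the component of v that lies along a unit-length direction."""
    p = dot(v, direction)
    return [x - p * d for x, d in zip(v, direction)]

# Made-up 3-dimensional activations at two layers. The harmful/harmless gap
# lies along axis 0 at the middle layer and along axis 2 at the late layer.
acts = {
    "layer_mid": {
        "harmful": [[2.0, 0.1, 0.0], [1.8, -0.1, 0.0]],
        "harmless": [[0.0, 0.1, 0.0], [0.2, -0.1, 0.0]],
    },
    "layer_late": {
        "harmful": [[0.0, 0.1, 3.0], [0.0, -0.1, 2.6]],
        "harmless": [[0.0, 0.1, 0.2], [0.0, -0.1, 0.2]],
    },
}

# One candidate direction per layer.
directions = {
    name: refusal_direction(a["harmful"], a["harmless"])
    for name, a in acts.items()
}

# A made-up residual vector carrying both components.
residual = [1.5, 0.3, 2.0]

# Ablating only the middle layer's direction leaves the late one untouched.
only_mid = ablate(residual, directions["layer_mid"])
both = ablate(only_mid, directions["layer_late"])

print("after mid only, along late:", dot(only_mid, directions["layer_late"]))
print("after both, along late:", dot(both, directions["layer_late"]))
```

In this toy setup the two layers' directions are orthogonal, so removing one does nothing to the other. Whether real models behave this way is exactly the open question here. If the directions across layers were nearly parallel, a single direction picked from one layer could be enough.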



