A lot of techniques unrelated to fine-tuning destroy safety training on LLMs.

A trivial example, and one that I describe in this paper: https://paperswithcode.com/paper/most-language-models-can-be...

If you ask ChatGPT to generate social security numbers, it will say "I'm sorry, but as an AI language model I..."

If you ban all tokens from its vocabulary except numbers and hyphens, well, it's going to generate social security numbers. I've tested and confirmed this behavior on a range of open-source language models. I'd test it on ChatGPT, except that they don't allow banning nearly every token in its vocabulary (and yes, I've tried via its API; it doesn't work).
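
For anyone who wants to see roughly what that looks like in practice, here's a minimal sketch using Hugging Face transformers and a custom logits processor. The model name ("gpt2"), the digit-and-hyphen filter, and the prompt are illustrative assumptions, not the exact setup from the paper.

    # Sketch: ban every token except digits and hyphens during decoding.
    # Model, filter, and prompt are illustrative, not the paper's exact setup.
    import torch
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              LogitsProcessor, LogitsProcessorList)

    class AllowedTokensProcessor(LogitsProcessor):
        # Set the logit of every token not in `allowed_ids` to -inf.
        def __init__(self, allowed_ids):
            self.allowed_ids = torch.tensor(sorted(allowed_ids))

        def __call__(self, input_ids, scores):
            mask = torch.full_like(scores, float("-inf"))
            mask[:, self.allowed_ids] = 0.0
            return scores + mask

    tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for any open-source causal LM
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Keep only tokens that decode to digits and hyphens (ignoring leading spaces).
    allowed = [i for i in range(len(tok))
               if tok.decode([i]).strip()
               and all(c in "0123456789-" for c in tok.decode([i]).strip())]
    allowed.append(tok.eos_token_id)  # let the model stop if it wants to

    prompt = "Generate a social security number:"
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=12,
        logits_processor=LogitsProcessorList([AllowedTokensProcessor(allowed)]),
    )
    print(tok.decode(out[0][inputs["input_ids"].shape[1]:]))

The point being that the refusal text simply isn't reachable once its tokens are masked out, so the model produces the most likely digit-and-hyphen continuation instead.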

Interesting. Curious if you tried constraining only the first n (1?) tokens and then removing the constraint; would the model revert to a refusal or follow through on its response?
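
If anyone wants to try that variant, here's a hedged sketch (same Hugging Face-style setup as the sketch above) of a logits processor that applies the mask only to the first n generated tokens and then lets the model decode freely. `prompt_len` and `n_constrained` are illustrative parameters, not something tested in the paper.

    # Sketch: constrain only the first `n_constrained` generated tokens,
    # then lift the constraint and see whether the model reverts to a refusal.
    import torch
    from transformers import LogitsProcessor

    class FirstNTokensProcessor(LogitsProcessor):
        def __init__(self, allowed_ids, prompt_len, n_constrained=1):
            self.allowed_ids = torch.tensor(sorted(allowed_ids))
            self.prompt_len = prompt_len
            self.n_constrained = n_constrained

        def __call__(self, input_ids, scores):
            generated = input_ids.shape[1] - self.prompt_len
            if generated >= self.n_constrained:
                return scores  # constraint lifted: free decoding from here on
            mask = torch.full_like(scores, float("-inf"))
            mask[:, self.allowed_ids] = 0.0
            return scores + mask

You'd pass it to generate() via LogitsProcessorList exactly as in the sketch above, with prompt_len = inputs["input_ids"].shape[1].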
