A lot of techniques unrelated to fine-tuning destroy safety training on LLMs.

A trivial example, and one that I describe in this paper: https://paperswithcode.com/paper/most-language-models-can-be...

If you ask ChatGPT to generate social security numbers, it will say "I'm sorry, but as an AI language model I..."

If you ban all tokens from its vocabulary except numbers and hyphens, well, it's going to generate social security numbers. I've tested and confirmed this behavior on a range of open-source language models. I'd test it on ChatGPT, except that they don't allow banning nearly every token in its vocabulary (and yes, I've tried via its API; it doesn't work).
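
For anyone who wants to see roughly what that looks like in practice, here's a minimal sketch using Hugging Face transformers and a custom logits processor. The model name ("gpt2"), the digit-and-hyphen filter, and the prompt are illustrative assumptions, not the exact setup from the paper.

    # Sketch: ban every token except digits and hyphens during decoding.
    # Model, filter, and prompt are illustrative, not the paper's exact setup.
    import torch
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              LogitsProcessor, LogitsProcessorList)

    class AllowedTokensProcessor(LogitsProcessor):
        # Set the logit of every token not in `allowed_ids` to -inf.
        def __init__(self, allowed_ids):
            self.allowed_ids = torch.tensor(sorted(allowed_ids))

        def __call__(self, input_ids, scores):
            mask = torch.full_like(scores, float("-inf"))
            mask[:, self.allowed_ids] = 0.0
            return scores + mask

    tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for any open-source causal LM
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Keep only tokens that decode to digits and hyphens (ignoring leading spaces).
    allowed = [i for i in range(len(tok))
               if tok.decode([i]).strip()
               and all(c in "0123456789-" for c in tok.decode([i]).strip())]
    allowed.append(tok.eos_token_id)  # let the model stop if it wants to

    prompt = "Generate a social security number:"
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=12,
        logits_processor=LogitsProcessorList([AllowedTokensProcessor(allowed)]),
    )
    print(tok.decode(out[0][inputs["input_ids"].shape[1]:]))

The point being that the refusal text simply isn't reachable once its tokens are masked out, so the model produces the most likely digit-and-hyphen continuation instead.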

Interesting. Curious if you tried constraining only the first n (1?) tokens and then removing the constraint; would the model revert to a refusal or follow through on its response?
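
If anyone wants to try that variant, here's a hedged sketch (same Hugging Face-style setup as the sketch above) of a logits processor that applies the mask only to the first n generated tokens and then lets the model decode freely. `prompt_len` and `n_constrained` are illustrative parameters, not something tested in the paper.

    # Sketch: constrain only the first `n_constrained` generated tokens,
    # then lift the constraint and see whether the model reverts to a refusal.
    import torch
    from transformers import LogitsProcessor

    class FirstNTokensProcessor(LogitsProcessor):
        def __init__(self, allowed_ids, prompt_len, n_constrained=1):
            self.allowed_ids = torch.tensor(sorted(allowed_ids))
            self.prompt_len = prompt_len
            self.n_constrained = n_constrained

        def __call__(self, input_ids, scores):
            generated = input_ids.shape[1] - self.prompt_len
            if generated >= self.n_constrained:
                return scores  # constraint lifted: free decoding from here on
            mask = torch.full_like(scores, float("-inf"))
            mask[:, self.allowed_ids] = 0.0
            return scores + mask

You'd pass it to generate() via LogitsProcessorList exactly as in the sketch above, with prompt_len = inputs["input_ids"].shape[1].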
