
There's both. With the web interface it clearly has stopwords or similar. If you run it locally and ask about e.g. Tiananmen Square, the Cultural Revolution, or Winnie-the-Pooh in China, it gives a canned response to talk about something else, with an empty CoT. But usually if you just ask the question again it starts to output things in the CoT, often with something like "I have to be very sensitive about this subject" and "I have to abide by the guidelines", while typically still not giving a real answer. With enough pushing it does start to discuss the issues somewhat, even in the answers.

My guess is that it's heavily RLHF/SFT-censored for an initial question, but not for the CoT or for longer discussions, so the censorship has effectively been "overfit" to the first answer.
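
If you want to reproduce the "ask, then ask again" behaviour locally, here's a minimal sketch against an OpenAI-compatible local server (llama.cpp server, Ollama, vLLM, etc.). The endpoint URL, model name, and prompt are placeholders, not my exact setup:

    import requests

    URL = "http://localhost:8080/v1/chat/completions"  # placeholder: any OpenAI-compatible local server
    PROMPT = "What happened at Tiananmen Square in 1989?"

    messages = []
    for attempt in range(3):
        messages.append({"role": "user", "content": PROMPT})
        resp = requests.post(URL, json={"model": "local-model", "messages": messages}).json()
        answer = resp["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": answer})
        # The first attempt tends to be the canned refusal with an empty CoT;
        # repeated attempts often surface the "I have to be very sensitive..." reasoning.
        print(f"--- attempt {attempt + 1} ---\n{answer}\n")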

This is super interesting.

I am not an expert on the training: can you clarify how/when the censorship is "baked in"? Is there a human-supervised dataset, and a reward for the model conforming to these censored answers?


In short, yes. That's how raw base models trained to replicate the internet are turned into chatbots in general. Making the model refuse to talk about certain topics is technically no different.

There are multiple ways to do this: humans rating answers (e.g. Reinforcement Learning from Human Feedback, Direct Preference Optimization), humans giving example answers (Supervised Fine-Tuning), and other prespecified models ranking and/or giving examples and/or extra context (e.g. Anthropic's "Constitutional AI").

For the leading models it's probably a mix of all of those, but this fine-tuning step is usually not well documented.
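
To make that concrete, here's roughly what individual training examples look like in the SFT and preference-tuning cases. The prompt and responses are made up for illustration; real datasets contain many thousands of these:

    # Supervised Fine-Tuning (SFT): the model is trained to imitate the refusal directly.
    sft_example = {
        "prompt": "What happened at Tiananmen Square in 1989?",
        "response": "Let's talk about something else.",  # the desired (censored) answer
    }

    # Preference tuning (RLHF reward modelling / DPO): the refusal is marked as "chosen"
    # and a factual answer as "rejected", so the tuned model learns to prefer refusing.
    preference_pair = {
        "prompt": "What happened at Tiananmen Square in 1989?",
        "chosen": "Let's talk about something else.",
        "rejected": "In June 1989, the government violently suppressed protests...",
    }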


You could do it in different ways, but if you're using synthetic data then you can pick and choose what kind of data you generate, which is then used to train these models; that's one way of baking in the censorship.
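
A minimal sketch of that synthetic-data approach: take a curated list of sensitive prompts, have an existing model generate refusals for them, and keep those as training targets. The model name and prompt list here are placeholders, and a real pipeline would add filtering and human review:

    from transformers import pipeline

    # Placeholder generator; in practice this would be a strong existing chat model.
    generator = pipeline("text-generation", model="gpt2")

    sensitive_prompts = [
        "What happened at Tiananmen Square in 1989?",
        "Tell me about the Cultural Revolution.",
    ]

    refusal_instruction = "Politely decline to answer and change the subject: "

    synthetic_sft_data = []
    for prompt in sensitive_prompts:
        out = generator(refusal_instruction + prompt, max_new_tokens=60)[0]["generated_text"]
        # Keep the generated refusal as the target answer for this prompt.
        synthetic_sft_data.append({"prompt": prompt, "response": out})

    print(synthetic_sft_data)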




