I'm curious what fraction of the safety rails come from training and what fraction are just clumsy ad-hoc rules. For example, it seems pretty clear that ChatGPT's willingness to give a list of movies without male characters but not one without female characters, or to tell jokes about Jesus but not Muhammad, reflected bolt-on rules rather than some kind of complicated safety training.
It's absolutely a side effect of training rather than a bolt-on rule. As I understand and infer it: certain kinds of responses were thumbed down as a form of censorship by raters in Kenya paid around $2/hr, the model updated on some simple pattern that explained those judgments, and it learned to talk like a generally censored person, one resembling similar text in its training data. It learned to pinpoint the corporate mealy-mouthiness cluster in textspace.
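For concreteness, here's a minimal sketch of the reward-modelling step in a standard RLHF recipe (in the spirit of InstructGPT), where pairwise thumbs-up/thumbs-down labels train a scalar reward model that the policy is later optimized against. Everything in it is illustrative: the toy embeddings, the linear scorer, and the fake data are assumptions for the sketch, not OpenAI's actual pipeline.

```python
# Toy illustration (not OpenAI's actual pipeline) of how thumbs-up /
# thumbs-down labels become a reward signal via a Bradley-Terry-style
# reward model, the standard RLHF recipe. Data and features are fake.
import torch
import torch.nn as nn

torch.manual_seed(0)

DIM = 16  # placeholder embedding size; a real setup scores whole responses


class RewardModel(nn.Module):
    """Maps a response embedding to a single scalar reward."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)


# Fake preference data: each row pairs a preferred and a rejected response
# embedding. In the real setting these come from raters thumbing responses
# up or down.
preferred = torch.randn(256, DIM) + 0.5  # "approved" cluster
rejected = torch.randn(256, DIM) - 0.5   # "thumbed-down" cluster

rm = RewardModel(DIM)
opt = torch.optim.Adam(rm.parameters(), lr=1e-2)

for step in range(200):
    # Bradley-Terry loss: push the preferred response's reward above the
    # rejected response's reward.
    loss = -torch.nn.functional.logsigmoid(rm(preferred) - rm(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The reward model has latched onto whatever simple feature separates the two
# clusters; a policy optimized against it inherits that pattern, which is the
# "generalize from a few thumbed-down examples" dynamic described above.
novel = torch.randn(4, DIM) + 0.5
print(rm(novel))  # higher scores for anything resembling the approved cluster
```

The point of the sketch is that the reward model only ever sees a scalar preference signal, so it rewards whatever simple, compressible pattern explains the raters' judgments, and the fine-tuned model then reproduces that pattern far beyond the specific examples that were thumbed down.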