I'm curious what fraction of the safety rails come from training and what fraction are just clumsy ad-hoc rules. For example, it seems pretty clear that ChatGPT's willingness to give a list of movies without male characters but not one without female characters, or to tell jokes about Jesus but not Muhammad, reflected bolt-on rules rather than some kind of complicated safety training.
It's absolutely a side effect of training rather than a bolt-on rule. As I understand and infer it: certain kinds of responses were thumbed down as a form of censorship by raters in Kenya paid around $2/hr, the model updated on some simple pattern that explained those judgments, and it learned to talk like a generally censored person, one resembling similar text in its training data. It learned to pinpoint the corporate mealy-mouthiness cluster in textspace.
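For concreteness, here's a minimal sketch of the reward-modelling step in a standard RLHF recipe (in the spirit of InstructGPT), where pairwise thumbs-up/thumbs-down labels train a scalar reward model that the policy is later optimized against. Everything in it is illustrative: the toy embeddings, the linear scorer, and the fake data are assumptions for the sketch, not OpenAI's actual pipeline.

```python
# Toy illustration (not OpenAI's actual pipeline) of how thumbs-up /
# thumbs-down labels become a reward signal via a Bradley-Terry-style
# reward model, the standard RLHF recipe. Data and features are fake.
import torch
import torch.nn as nn

torch.manual_seed(0)

DIM = 16  # placeholder embedding size; a real setup scores whole responses


class RewardModel(nn.Module):
    """Maps a response embedding to a single scalar reward."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)


# Fake preference data: each row pairs a preferred and a rejected response
# embedding. In the real setting these come from raters thumbing responses
# up or down.
preferred = torch.randn(256, DIM) + 0.5  # "approved" cluster
rejected = torch.randn(256, DIM) - 0.5   # "thumbed-down" cluster

rm = RewardModel(DIM)
opt = torch.optim.Adam(rm.parameters(), lr=1e-2)

for step in range(200):
    # Bradley-Terry loss: push the preferred response's reward above the
    # rejected response's reward.
    loss = -torch.nn.functional.logsigmoid(rm(preferred) - rm(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The reward model has latched onto whatever simple feature separates the two
# clusters; a policy optimized against it inherits that pattern, which is the
# "generalize from a few thumbed-down examples" dynamic described above.
novel = torch.randn(4, DIM) + 0.5
print(rm(novel))  # higher scores for anything resembling the approved cluster
```

The point of the sketch is that the reward model only ever sees a scalar preference signal, so it rewards whatever simple, compressible pattern explains the raters' judgments, and the fine-tuned model then reproduces that pattern far beyond the specific examples that were thumbed down.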