The important difference between the LM and the content moderation system (itself built on top of an LM) is their training objective. The LM is doing next-word prediction (or human-preference prediction with RLHF), whereas the content moderation system is likely fine-tuned to explicitly identify hate speech and the like.
So while the LM is not supposed to output "truth", the content moderation system should correctly classify "hate", because that is exactly what it was trained to do.
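To make the contrast concrete, here is a minimal sketch of the two objectives; the function names and tensor shapes are hypothetical, but the losses are the standard ones for causal language modeling vs. supervised classification:

```python
import torch
import torch.nn.functional as F

# Causal LM objective: predict the next token at every position.
# `lm_logits`: (batch, seq_len, vocab_size), `tokens`: (batch, seq_len).
def lm_loss(lm_logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    # Position t is trained to predict token t+1; factual "truth" is never
    # part of the target, only the likelihood of the next word.
    return F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )

# Moderation objective: classify whole texts against explicit labels
# (e.g. 0 = benign, 1 = hate). Here the label itself is the training target.
def moderation_loss(cls_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # `cls_logits`: (batch, num_classes), `labels`: (batch,).
    return F.cross_entropy(cls_logits, labels)
```

The point is that only the second loss directly penalizes getting "hate" wrong; the first just penalizes improbable continuations.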