
The important difference between the LM and the content moderation system (itself built on top of an LM) is their training objective. The LM is doing next-word prediction (or human-preference prediction with RLHF), whereas the content moderation system is likely fine-tuned to explicitly identify hate speech, etc.

So while the LM is not supposed to output "truth", the content moderation system should correctly classify "hate", because that is precisely its training objective.
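To make the distinction concrete, here is a minimal PyTorch sketch (my own toy illustration, not any actual system): the same backbone can be trained with a next-token prediction loss, or fine-tuned with an explicit classification loss over moderation labels such as "hate" / "not hate".

  import torch
  import torch.nn.functional as F

  vocab_size, hidden, num_classes = 1000, 64, 2   # toy sizes, purely illustrative

  # Shared toy "LM" backbone: embeds tokens into hidden states.
  embed = torch.nn.Embedding(vocab_size, hidden)
  lm_head = torch.nn.Linear(hidden, vocab_size)   # predicts the next token
  clf_head = torch.nn.Linear(hidden, num_classes) # predicts a moderation label

  tokens = torch.randint(0, vocab_size, (4, 16))  # batch of 4 sequences, length 16
  states = embed(tokens)                          # (4, 16, hidden)

  # 1) LM objective: predict token t+1 from the state at position t.
  lm_logits = lm_head(states[:, :-1])             # (4, 15, vocab_size)
  lm_targets = tokens[:, 1:]                      # targets shifted by one
  lm_loss = F.cross_entropy(lm_logits.reshape(-1, vocab_size),
                            lm_targets.reshape(-1))

  # 2) Moderation objective: pool the sequence, predict an explicit label.
  labels = torch.randint(0, num_classes, (4,))    # e.g. 0 = ok, 1 = hate
  clf_logits = clf_head(states.mean(dim=1))       # (4, num_classes)
  clf_loss = F.cross_entropy(clf_logits, labels)

  print(lm_loss.item(), clf_loss.item())

The point of the sketch is only that the two losses supervise different things: the first rewards guessing plausible next words, the second directly rewards getting the "hate" label right.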



