Hacker News

How do you detect the ground truth for training the model? Do you manually label it?


Yes, simple classification. Nothing fancy.

Basically, I pulled the database into a CSV file, and anything that was published before the bad content appeared was classified as HAM.

We had content that was known to be OK, so we marked it as HAM, and then all of our new bad content was marked as SPAM.

When it was first deployed to production, for some hours HAM content got wrongly marked, and the model got trained on those misclassifications as well, which caused a lot of confusion. But the problem was taken care of once the model was properly tuned and it was safe to let it run automated.
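The approach described above (export labeled rows, train a simple text classifier) can be sketched roughly like this. This is only an illustrative toy, not the commenter's actual setup: the data, labels, and the use of a tiny Naive Bayes classifier are all assumptions standing in for whatever model they really trained.

```python
# Toy sketch: label exported rows HAM/SPAM, train a tiny multinomial
# Naive Bayes classifier on them. Hypothetical data, not the real system.
import math
from collections import Counter

# Rows as they might look after exporting the DB to CSV: (text, label).
# Anything published before the spam wave is HAM; the new bad content is SPAM.
rows = [
    ("great article thanks for sharing", "HAM"),
    ("interesting discussion on databases", "HAM"),
    ("buy cheap pills now click here", "SPAM"),
    ("free money click this link now", "SPAM"),
]

# Count words per class and documents per class.
word_counts = {"HAM": Counter(), "SPAM": Counter()}
doc_counts = Counter()
for text, label in rows:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Return the class with the highest log-probability for `text`."""
    scores = {}
    for label in ("HAM", "SPAM"):
        total = sum(word_counts[label].values())
        # Log prior: fraction of training docs in this class.
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        for w in text.split():
            # Laplace smoothing so unseen words don't zero out the score.
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("click here for free pills"))   # → SPAM on this toy data
```

Note the failure mode the thread mentions: if mislabeled production content is fed back into `rows`, the counts drift and the classifier reinforces its own mistakes, which is why the feedback loop had to be tuned before automating it.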


Hmm, I wonder if it picked up timestamps as its initial filter.



