Hacker News
Why existing profanity detection libraries suck and how I built a better one (medium.com/victorczhou)
12 points by vzhou842 on Feb 5, 2019 | hide | past | favorite | 17 comments


Most of the words on the profanity list are in common use in polite company these days. If the goal is to identify hate speech and sexual speech, doesn't there need to be a semantic component? I wonder why a profanity filter is needed--what is the use case?


Hey, author here. I run a couple web games that allow in-game chat between players. One of the biggest complaints from players is that some people like to troll and spam post racial slurs / hate speech / threats. I wanted the profanity detection so I could just filter out the most egregiously profane messages and potentially kick/ban players who continue trying to send obviously profane messages.


Seems like a reasonable use of it as long as the sensitivity is set so you have a reasonably low false positive rate (or have a human in the loop).


that's the idea!


Do none of the winners of the Kaggle competition he linked release their code?


Only as anecdata: I remember a technical forum/discussion board whose word filter had issues, a few years ago, with the word Matsushita (Panasonic was fine ;)).


This is called the Scunthorpe Problem. https://en.m.wikipedia.org/wiki/Scunthorpe_problem
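The Matsushita case is a textbook instance: a naive substring filter flags the word because of an embedded four-letter string, while a whole-word match does not. A minimal Python sketch (the word list and function names are my own illustration, not from any library discussed in the thread):

```python
import re

BAD_WORDS = ["shit"]

def naive_filter(text):
    # Substring matching: flags any message containing a bad word anywhere,
    # including inside innocent words -- the Scunthorpe problem.
    lower = text.lower()
    return any(word in lower for word in BAD_WORDS)

def word_boundary_filter(text):
    # Match only whole words, so embedded substrings don't trigger it.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, BAD_WORDS)) + r")\b",
        re.IGNORECASE,
    )
    return bool(pattern.search(text))

print(naive_filter("Matsushita"))          # True  -- false positive
print(word_boundary_filter("Matsushita"))  # False -- whole-word match avoids it
```

The trade-off cuts both ways: whole-word matching fixes Matsushita but misses deliberately run-together profanity, which is part of why this stays a hard problem.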


I have never meant to say the word “ducking” in my entire life...


Apropos the subjectivity of profanity: the word "slut" in Danish means "finished", so it's often found at the end of long bodies of text.


Same in Swedish, and probably also Norwegian; both have a pretty substantial online gaming presence.


Yup, valid points; that's definitely a weakness here.


There is no sociologically valid use case for a profanity detection library. This is a misuse of technology. Get over your shit.


Is there a "sociologically valid use case" for other software you've seen announced on HN?


Not even making services age-appropriate?


Firstly, it won't work: designing a profanity filter can only ever be an eternal game of cat and mouse - there will always be new words, or new ways of r3wr1t1ng the same ones. And that's before taking into account the problem of multiple languages.
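The "r3wr1t1ng" substitutions can be partially caught by normalizing common character swaps before filtering. A hypothetical sketch (the substitution table is my own, and it illustrates the cat-and-mouse point: it only covers swaps you already know about):

```python
# Map common leetspeak digit/symbol substitutions back to letters
# before running a word filter. Any substitution not in this table
# will slip straight through.
LEET_MAP = str.maketrans({
    "3": "e", "1": "i", "0": "o", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

def normalize(text):
    # Lowercase and undo known substitutions.
    return text.lower().translate(LEET_MAP)

print(normalize("r3wr1t1ng"))  # "rewriting"
```

A filter would then run its word matching on the normalized text rather than the raw message.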

Secondly, age-appropriate is a far more complex challenge than simply filtering out "bad" words. Anything can be reworded. Families differ in which words they consider acceptable at each age. If you really want to make an online service safe for young children, you probably shouldn't be matching them with unapproved strangers to talk to in any form.


Are you familiar with the maxim that people behave better if they believe they are being observed?

Would you agree that people would be less likely to “go off on one” if a draft message displayed a “this message could be construed as offensive” warning before they sent it?


But what defines "offensive"? If you used a "profane" word, I'd likely not even notice. But if you said something negative about me or my family, yes, that'd likely be offensive. Even that isn't foolproof: a friend saying something negative about me might not be offensive (especially if it's a truthful statement), but from a stranger it likely always would be. There's so much nuance here that I feel it's a losing battle.



