
Why existing profanity detection libraries suck and how I built a better one - vzhou842
https://medium.com/@victorczhou/building-a-better-profanity-detection-library-with-scikit-learn-3638b2f2c4c2
======
drallison
Most of the words on the profanity list are in common use in polite company
these days. If the goal is to identify hate speech and sexual speech, doesn't
there need to be a semantic component? I wonder why a profanity filter is
needed--what is the use case?

~~~
vzhou842
Hey, author here. I run a couple web games that allow in-game chat between
players. One of the biggest complaints from players is that some people like
to troll and spam post racial slurs / hate speech / threats. I wanted the
profanity detection so I could just filter out the most egregiously profane
messages and potentially kick/ban players who continue trying to send
obviously profane messages.

~~~
IshKebab
Seems like a reasonable use of it as long as the sensitivity is set so you
have a reasonably low false positive rate (or have a human in the loop).
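
A rough sketch of what "setting the sensitivity" could look like, assuming the classifier gives you a profanity score per message and you have a small hand-labeled sample. `pick_threshold`, the scores, and the labels are all hypothetical and are not taken from the article or its library:

```python
def pick_threshold(scores, labels, max_fpr):
    """Return the lowest score threshold whose false positive rate is <= max_fpr.

    scores: predicted probability that each message is profane (hypothetical)
    labels: 1 if a human marked the message profane, else 0
    A message counts as flagged when its score is >= the threshold.
    Returns None if no observed score meets the target.
    """
    clean_scores = [s for s, y in zip(scores, labels) if y == 0]
    if not clean_scores:
        raise ValueError("need at least one clean (label 0) example")
    for threshold in sorted(set(scores)):
        false_positives = sum(1 for s in clean_scores if s >= threshold)
        if false_positives / len(clean_scores) <= max_fpr:
            return threshold
    return None


# Made-up validation data, for illustration only.
scores = [0.1, 0.4, 0.35, 0.8, 0.9]
labels = [0, 0, 1, 1, 1]

strict = pick_threshold(scores, labels, max_fpr=0.0)
lenient = pick_threshold(scores, labels, max_fpr=0.5)
```

A stricter false positive target pushes the threshold up, so fewer clean messages get flagged but more profane ones slip through; that is where the human in the loop comes in.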

~~~
vzhou842
that's the idea!

------
IshKebab
Did none of the winners of the Kaggle competition he linked release their code?

------
jaclaz
Purely as anecdata: I remember a (technical) forum/discussion board that a few
years ago had issues with its word filter and the word Matsushita (Panasonic
was fine ;)).

~~~
WalterGR
This is called the Scunthorpe Problem.
https://en.m.wikipedia.org/wiki/Scunthorpe_problem
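
For anyone who hasn't run into it, here is a minimal sketch of how a substring filter trips over the Matsushita example while a word-boundary match does not. The one-word `BLOCKLIST` and both function names are made up for illustration; nothing is known about how that forum's filter actually worked:

```python
import re

# Hypothetical one-entry blocklist, chosen to reproduce the Matsushita case.
BLOCKLIST = ["shit"]


def naive_match(text):
    """Flag text if any blocked term appears anywhere as a substring."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


# \b requires a word boundary on both sides, so the term must stand alone.
_WORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in BLOCKLIST) + r")\b",
    re.IGNORECASE,
)


def word_match(text):
    """Flag text only if a blocked term appears as a whole word."""
    return _WORD_RE.search(text) is not None
```

The substring version flags "Matsushita" and the whole-word version does not, although the whole-word version is then trivially evaded by gluing words together.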

------
Overtonwindow
I have never meant to say the word “ducking” in my entire life...

------
bryanrasmussen
Apropos the subjectivity of profanity: the word "slut" in Danish means
"finished", and it is often found at the end of long bodies of text.

~~~
vector_spaces
Same in Swedish and probably also Norwegian, which both have a pretty
substantial online gaming presence.

~~~
vzhou842
yup valid points, that’s definitely a weakness here

------
rebornshellfish
There is no sociologically valid use case for a profanity detection library.
This is a misuse of technology. Get over your shit.

~~~
ashleyn
Not even making services age-appropriate?

~~~
rebornshellfish
Firstly, it won't work because designing a profanity filter can only ever be
an eternal game of cat and mouse - there are always going to be new words or
ways of r3wr1t1ng the same words. This doesn't take into account the problem
of multiple languages.
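
To make the cat-and-mouse point concrete, here is a small sketch of the obvious countermeasure: undoing character substitutions before matching. The `LEET_MAP` table and `normalize` helper are assumptions for illustration only, and the table is nowhere near complete:

```python
# Hypothetical substitution table; real evasion uses far more variants.
LEET_MAP = str.maketrans({
    "0": "o",
    "1": "i",
    "3": "e",
    "4": "a",
    "5": "s",
    "7": "t",
    "@": "a",
    "$": "s",
})


def normalize(text):
    """Lowercase the text and map common look-alike characters to letters."""
    return text.lower().translate(LEET_MAP)


decoded = normalize("r3wr1t1ng")
```

Even this tiny table shows the problem: "1" can stand for "i" or "l", every legitimate number in a message gets mangled too, and each new spelling trick needs another entry.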

Secondly, age-appropriate is a far more complex challenge than simply
filtering out "bad" words. Anything can be reworded. Families differ in which
words they consider acceptable at each age. If you really want to make an
online service safe for young children, you probably shouldn't be matching
them with unapproved strangers to talk to in any form.

~~~
beobab
Are you familiar with the maxim that people behave better if they believe they
are being observed?

Would you agree that people would be less likely to “go off on one” if a draft
message displayed a “this message could be construed as offensive” warning
before they sent it?

~~~
jsjohnst
But what defines “offensive”? If you used a “profane” word, I’d likely not
even notice. But if you said something negative about me or my family, yes,
that’d likely be offensive. Even that isn’t foolproof: a friend saying
something negative about me might not be offensive (especially if it’s a
truthful statement), but from a stranger it likely always would be. There’s so
much nuance here that I feel it’s a losing battle.

