

Show HN: A naive classifier to figure out if a sentence contains dirty words - gauravm
https://github.com/reddragon/is-dirty

======
kristiandupont
A friend of mine told me about how they once had to dig through their code to
figure out why their site was classified as adult by some filter. After days
of searching, they found this comment at the bottom of a javascript file:

// Slut.

Which is Danish for "the end".

~~~
renekooi
Another false positive anecdote! My parents used to have their ISP's adult
content filter enabled. One day I couldn't visit DeviantArt, because it said
"Mature Content Filter Enabled" somewhere on the page, and "mature content"
triggered the ISP's filter.

------
spoiler
I don't find this very useful. It's _too_ naïve for real-world use cases.

I didn't look at the implementation, but the "classy party" example suggests it
simply matches the byte sequence 'a', 's', 's' anywhere in a string.

It would be better if it tokenized the sentence using punctuation and white-
space as delimiters. Then it would detect `big-ass sandwich` and `smart-ass
person` but not `classy party` or `bass instrument`.
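
A minimal sketch of what I mean, in Python (the word list here is made up, not
the project's actual list):

    import re

    # Hypothetical dirty-word list -- a stand-in for the project's real one.
    DIRTY = {"ass"}

    def is_dirty(sentence):
        # Split on punctuation and whitespace, so "classy" no longer
        # matches "ass", but "big-ass" (tokens "big" + "ass") still does.
        tokens = re.split(r"[\W_]+", sentence.lower())
        return any(token in DIRTY for token in tokens)

With this, `big-ass sandwich` matches while `classy party` and `bass
instrument` don't.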

Furthermore, it would be cool if you created a configuration format for this
kind of thing, so one could do something like this (excuse the config format,
I realise it's probably shit and problematic):

    
        [smart][big][fat]ass
        !sex[ual]+education
    

which would detect all of the following: smartass, bigass, fatass, _and_ ass
itself. The second rule would _not_ flag a `sex(?:ual)` token followed by an
`education` token. You get the idea.
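
One way such a rule might compile down (just a sketch of the semantics I have
in mind -- each bracket group becomes an optional alternation):

    import re

    # "[smart][big][fat]ass" -> optional prefixes, then "ass", whole token only.
    rule = re.compile(r"\b(?:smart|big|fat)?ass\b")

    def rule_matches(text):
        return bool(rule.search(text))

so `smartass`, `bigass`, `fatass`, and plain `ass` all match, while `classy`
and `bass` don't.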

These are just some heat-of-the-moment ideas, because I think this is exciting
and could be useful. :-)

~~~
gauravm
Thanks. This quick idea worked for my cases because there were few potential
false positives. But your idea of using a regex-style matcher is a good one.

------
nopuremore
With a little effort you can Google-Translate your dirty words into Spanish
(copy-paste the whole list) and obtain a filter for Spanish; add synonyms for
stronger filtering.

Perhaps "gay" is not a dirty word? (It is included in your dirty-word list,
but gay people might think otherwise.)

~~~
spoiler
I'm gay, but I don't consider it offensive that the word is in there.

A lot of people use the term "gay" in conversation as a synonym for "that
sucks"; a friend of mine does it _all_ the time. I don't think they mean
anything by it.

To differentiate between "I am gay," and "Oh, that's gay. I'm sorry that
happened," you'd need an NLP system with some notion of tone and politeness.

~~~
gauravm
Sorry, no offense intended, if anyone took it. In my use case, words such as
'gay' and 'lesbian' were, in almost all cases, used in explicit documents.

This is a very naive implementation to quickly get a handle on the number of
porny documents. I intend to do some more work on clustering porny words. I
think understanding sentiment would be hard and would involve a lot of labeled
data, but it is a potentially very useful project.

~~~
spoiler
It's okay! I wasn't offended. :-)

I didn't realise this was meant to filter out pornographic vocabulary; it
makes more sense now.

------
radio4fan
I inherited a (dreadful) application which had a hilariously lame 'rude words
filter'. It checked for words on a banned list.

The full list is here:
[http://pastebin.com/raw.php?i=1Pv4v8j7](http://pastebin.com/raw.php?i=1Pv4v8j7)

It contains such gems as "cockburger", "penispuffer", and -- the pièce de
résistance -- "twatwaffleunclefucker".

------
natch
What is the use case for such a classifier?

~~~
gauravm
I wanted something easy to use, to quickly get an idea of how much explicit
content we could be dealing with. The main challenge was dealing with a multi-
lingual database. I couldn't even find a naive classifier for it.

Though I don't have the time/RoI to improve this, potential ideas are to use
labeled data to cluster porny words and get a probabilistic metric of the
porniness of a sentence.
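
A toy version of that probabilistic metric could look like this (the weights
are invented for illustration; real ones would be learned from the labeled
data):

    # Toy porniness score: average per-token probability.
    # Weights are made up; a real version would learn them
    # (e.g. naive-Bayes style) from labeled documents.
    WEIGHTS = {"xxx": 0.9, "sandwich": 0.05}

    def porniness(sentence, default=0.1):
        tokens = sentence.lower().split()
        if not tokens:
            return 0.0
        return sum(WEIGHTS.get(t, default) for t in tokens) / len(tokens)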

