
That's actually really useful.

Having it there but tagged gets you halfway to filtering it out. Not having it means that when you merge this set with another one, you won't be able to remove the porn.

And it also allows you to use it as a training set for classifiers.
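
A rough sketch of the filtering half of that, assuming each record is a dict with an optional "tags" list. The record layout, the "adult" tag name, and the merge_and_filter helper are all made up for illustration, not anything the dataset actually defines:

```python
def merge_and_filter(tagged, untagged, blocked_tag="adult"):
    """Merge two record lists and drop records carrying blocked_tag.

    Hypothetical helper: records without a "tags" key cannot be
    filtered this way, so they always pass through.
    """
    merged = list(tagged) + list(untagged)
    return [r for r in merged if blocked_tag not in r.get("tags", ())]


# Placeholder records; the URLs and tags are invented.
tagged_set = [
    {"url": "a.example", "tags": ["adult"]},
    {"url": "b.example", "tags": ["news"]},
]
untagged_set = [
    {"url": "c.example"},  # no tag information at all
]

kept = merge_and_filter(tagged_set, untagged_set)
kept_urls = [r["url"] for r in kept]
```

Only the tagged record gets dropped. Everything from the untagged set survives the filter whatever it contains, which is where a classifier trained on the tagged examples would come in.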




"And it also allows you to use it as a training set for classifiers."

One could imagine a project on Common Crawl which auto-generated a list of slang terms for porny things by creating a list of n-grams from the words used in documents tagged as porn.
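
A minimal sketch of what that could look like, assuming the documents are already split into a tagged list and an untagged list of plain strings. The function names, the whitespace tokenizer, and the count ratio used for scoring are my own assumptions, not anything Common Crawl provides:

```python
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams of text, lowercased and split on whitespace."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def distinctive_ngrams(tagged_docs, other_docs, n=2, min_count=2, top=10):
    """Rank n-grams that are frequent in tagged_docs and rare in other_docs.

    Scoring is a crude ratio of counts (an assumption for illustration);
    a real project would likely use something like log-odds or TF-IDF.
    """
    tagged = Counter(g for doc in tagged_docs for g in ngrams(doc, n))
    other = Counter(g for doc in other_docs for g in ngrams(doc, n))
    scored = []
    for gram, count in tagged.items():
        if count >= min_count:
            # +1 avoids division by zero for n-grams unseen in other_docs
            scored.append((count / (other[gram] + 1), gram))
    # Highest score first; ties broken alphabetically for stable output
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [gram for _, gram in scored[:top]]


# Neutral placeholder text standing in for real documents.
tagged_docs = ["foo bar baz", "foo bar qux", "the cat foo bar"]
other_docs = ["the cat sat", "the cat ran"]
terms = distinctive_ngrams(tagged_docs, other_docs)
```

The min_count cutoff is there to throw away n-grams that only show up once, which would otherwise dominate the list on a crawl-sized corpus.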



