
> We removed any page that contained any word on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”. [https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and...]

Looking at that list, I wonder what the unintended consequences of a decision like this are. If you want to build something related to sentiment analysis, the swear words you discarded are a useful signal, not noise, right? If you wanted to use the dataset for your tour guide business in Austria, how does it handle the village called Fucking? Does T5 understand the British colloquialism for cigarettes? Can ornithologists talk to it about penguins and eagles, but not about yellow-bellied tits and blue-footed boobies?
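To make the concern concrete, here is a rough sketch of what a "drop any page containing a listed word" filter might look like. The three-word blocklist is a stand-in taken from the examples above, not the actual LDNOOBW contents, and the function names are made up for illustration; I don't know how the real pipeline tokenizes or matches.

```python
import re

# Hypothetical stand-in blocklist built from the examples in this comment;
# NOT the actual contents of the LDNOOBW list.
BLOCKLIST = {"fucking", "tits", "boobies"}


def contains_blocked_word(text, blocklist=BLOCKLIST):
    """Return True if any lowercase alphabetic token in text is on the blocklist."""
    words = re.findall(r"[a-z]+", text.lower())
    return any(word in blocklist for word in words)


def filter_pages(pages, blocklist=BLOCKLIST):
    """Keep only the pages that contain no blocklisted word."""
    return [page for page in pages if not contains_blocked_word(page, blocklist)]


pages = [
    "Penguins and eagles are popular with birdwatchers.",
    "Blue-footed boobies are seabirds.",
    "The village of Fucking is in Austria.",
]

kept = filter_pages(pages)
print(kept)
```

Under these assumptions only the penguin page survives; the bird page and the village page are dropped even though neither is obscene. A filter like this sees words, not context.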




That made me think of something along the lines of "backdooring a dataset" by introducing some hard-to-find but easy-to-trigger failure modes, or fingerprinting, for any application built on top of it.


Sounds like an awesome idea: put some Easter eggs in the Common Crawl to compromise the future of NLP.


...not to mention Rehnquist, Brownmiller, Potter Stewart, etc.



