The naughty username checking system used by Twitch (ghostbin.com)
565 points by PrincessJess on Oct 6, 2021 | 337 comments



We used to call this the "Scunthorpe problem" - how do you filter for obviously rude names while still allowing people to have their actual names? Bearing in mind that some people's names are actually rude.

I worked on the application form processing for the Nectar card launch in the UK back in the early 2000's, and we had several cases. Luckily we had human data-entry clerks in the loop, so all we had to do was flag when a name contained anything on the "Scunthorpe list" and get a human to look at it. Even then it wasn't perfect and a few slipped through. One of the early PR messes of the launch was someone getting a Nectar card issued with a rude name, and of course they immediately went to the press with it [0]

I'm interested to see they still haven't solved this [1]

[0] I never saw the journalistic interest in this: this guy said his name was <rude name>, and then signed the form to say everything on the form was true. Why is anyone surprised or interested that we accepted his name was what he said it was and gave him a card in that name?

[1] https://metro.co.uk/2015/02/20/woman-refused-sainsburys-nect...
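For anyone wondering what "flag it and get a human to look at it" amounts to, here is a minimal sketch. It is not the actual Nectar system: the term list, the function names and the queue labels are all made up for illustration, and the real "Scunthorpe list" isn't given anywhere in this thread.

```python
# Hypothetical review list; the real "Scunthorpe list" is not given in the thread.
REVIEW_TERMS = ("orgy", "ass")


def needs_human_review(name: str) -> bool:
    """Return True if any listed term occurs anywhere in the name (case-insensitive)."""
    lowered = name.lower()
    return any(term in lowered for term in REVIEW_TERMS)


def route(name: str) -> str:
    """Never reject automatically: a match only sends the form to a clerk."""
    return "human_review" if needs_human_review(name) else "auto_accept"


if __name__ == "__main__":
    for applicant in ("Alice", "Cassandra", "Georgyo"):
        print(applicant, "->", route(applicant))
```

The point of the design is that a substring match only costs a few seconds of a clerk's time, so false positives such as an ordinary first name are cheap. It does nothing about rude names that contain no listed term.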


People should just be taught that words are just words and that they don't have to take them seriously. (I would consider this an essential skill that should be put in the basic school curriculum to psychologically prepare people for life; many support and sales professionals learn it quickly, but the rest of the people don't.) No word is rude until you consider it rude. No word can actually harm you directly. It's you who is in charge of deciding what to feel whenever you hear or see a particular word. Sadly, most people just ignore this job. I personally enjoy being called by the derogatory nickname of my ethnicity because I just like how it sounds. For the first half of my life I didn't even know it was supposed to be insulting (as I didn't know actual racists exist and can be serious about hating or judging people just by their ethnicity).


We used to be taught that "sticks and stones may break my bones but words will never hurt me".

I'm not so sure that has ever been true. Emotional bullying was always a thing, and it did hurt. I'm glad that we're now taking that more seriously.

But it's also true that my emotions are my responsibility, and my emotional reaction to words is unique to me. I cannot force everyone else to be responsible for how I feel about their words. And I cannot expect everyone else to anticipate how I might feel about what they want to say and modify their speech accordingly.

There's a balance in there. I suspect that this balance is what we call "good manners" or "politeness".


>I suspect that this balance is what we call "good manners" or "politeness".

Or "empathy". And I think on an interpersonal level this is easy. You can't force anyone else to be responsible for how you feel about their words, that is true. But if you say to a colleague or a friend that "when you call me X, that makes me feel bad, could you call me Y instead?". Depending on the friend they might want to understand why this feels bad, but they would call you Y from then on. Easy peasy.

But on a collective level we aren't yet (if we'll ever be) as developed a civilization to have empathy on that level. So some people want to, with good intentions, extend their empathy to others and write code like this. But seconds later, the creative and determined individual called cVnt_h_ater have bypassed the code and can proudly wear his misogyny on his sleeve.


> But on a collective level we aren't yet (if we'll ever be) as developed a civilization to have empathy on that level.

I agree. Which is why we have rules around politeness and manners, which roughly conform to the same actions as actual empathy. Which makes sense - people don't always have the same level of empathy, and there are lots of neurodivergent people who will never understand empathy but who can learn a set of rules that will allow them to avoid social problems.

Or at least, we used to. I feel like this respect for being nice to each other so we can all get along is on the wane.

Or it might be that I moved to Berlin, and Berliners are famous in Germany for being rude/abrupt, and Germans are famous globally for being direct ;)


> I feel like this respect for being nice to each other so we can all get along is on the wane.

In my mind, part of that is that the rules are getting stricter, so people respect those rules less. Over time, the needle has moved further towards the idea that anyone can be offended by anything anyone else says. Even if it's clear (to the average person) that the speaker meant no harm (or even intended a totally different meaning for the words), the fact that someone is offended means that the speaker is wrong. Professors get sanctioned for using a racist word in a discussion about racism (and sometimes that word in particular).

At some point, people start seeing those rules as ridiculous and no longer respect them (though they may still feel compelled to follow them, for fear of repercussions).


> Over time, the needle has moved further towards the idea that anyone can be offended by anything anyone else says.

What do you mean by this? Are you talking about some contrived example where one person had their parents murdered by a maniac while shouting "bicycle" over and over which causes that word to be triggering ptsd for that individual. In that case that idea is correct since that, hypothetically, can happen for _any_ word yes. But how is that relevant to any of this?


I'm talking about things like

1. A professor fired for using the N-word in a discussion about racial discrimination

2. A professor placed on leave for using a Chinese expression that sounded like the N-word

3. People that express the opinion that someone who has undergone a gender change operation is different from someone that was born that gender. Expressing an opinion that they're different in some ways; not mocking or insulting them.

4. A man saying "I'd fork that" about a software project and everything went sideways because someone else decided it was sexual and was offended by it.

On and on and on. It has gotten to the point that if someone THINKS you crossed an imaginary line, a line that keeps moving, then you can face severe personal and professional repercussions. I do not believe that this is good for us as a society. People are supposed to have differing opinions and then discuss those differences. People are supposed to be able to discuss things in general. People are supposed to be able to have a little fun.

Accidentally offending someone has turned into a horrible offense, and it's bad for our growth as a society.


3. The words you use for this are trans and cis. No one is going to get upset at someone for using them. If you are trying to use "creative" language to say the same thing you shouldn't be surprised when people take offense, considering that non-offensive terms already exist.


The word the most rabid "PC warriors" use for this are "man" and "woman". If you dare say "Trans women are different from other women in that...", you can bet there's some looney who will jump down your throat along the lines of "How dare you call this woman anything else but just 'a woman', you horrible bigot!"

If you didn't understand that this was what the GP meant you must have led an exceedingly sheltered online life until now. (Which, frankly, feels so unlikely that it feels nearer to hand that your comment was made in bad faith or, at best, unthinkingly.)


This hypothetical situation with the looney you made up is not very fleshed out. The looney seem to be referring to an individual woman that is somehow out of frame in the story and judging by the looneys remarks, the person who now is talking about trans women in general seem to have made previous remarks about the individual that was insensitive to that individuals desires to be referred to as a woman and not a trans woman. Which is kind of a dick move...

But cool story bro. Fight for your right to be a dick I guess.


>>>> [@RHSeeger:] 3. People that express the opinion that someone who has undergone a gender change operation is different from someone that was born that gender. Expressing an opinion that they're different in some ways; not mocking or insulting them.

>>> [@dehugger:] 3. The words you use for this are trans and cis. No one is going to get upset at someone for using them.

>> [Me:] The word the most rabid "PC warriors" use for this are "man" and "woman". If you dare say "Trans women are different from other women in that...", you can bet there's some looney who will jump down your throat [...]

> This hypothetical situation with the looney you made up is not very fleshed out.

Fleshed out enough for anyone reasonably intelligent to get the gist, I would have thought. What, exactly, did you not understand?

> The looney seem to be referring to an individual woman that is somehow out of frame in the story

What "story" -- my hypothetical? Since "the looney" is prominently mentioned in it, how can they be "out of frame"?!?

> and judging by the looneys remarks, the person who now is talking about trans women in general seem to have made previous remarks about the individual

No, absolutely not. Where on Earth did you get that from? My hypothetical was talking about trans women in general, period. There's no need to make up any complicated backstory, because what I said was all I wanted and needed to say: If you dare say, in a general discussion on the Internet, e.g, "Trans women are different from other women in that...", you can bet there's some looney who will jump down your throat BECAUSE you used the term "trans women"; to some people that, too, is anathema because it "singles out" trans women from other women. Some people are so blinded by PC that to them, acknowledging any difference is a sin.

> that was insensitive to that individuals desires to be referred to as a woman and not a trans woman. Which is kind of a dick move...

Yeah, in your weird made-up story that has nothing to do with anything. You are of course perfectly free to make up little fairy tales and publish them on the Internet, but not to impute them as back-stories to anything I (or anyone else) has said. Don't put your shit in my mouth.

> But cool story bro. Fight for your right to be a dick I guess.

If anyone is "being a dick" here, AFAICS it's you.


I used the words that seemed the most obvious way to express myself. I would have had to look up the term cis to make sure I was using it the right way, and there wasn't a good reason to. The fact that someone could possibly be offended by the use of the words I chose shows part of the problem. That's like saying you think "Italian-American" is ok, but "An American citizen with Italian ancestors" is offensive. It's a clear, well understood description of the trait being discussed that expresses no judgement. It _cannot_ be offensive under any normal circumstances.


I think that's a slightly different problem - there is a purity spiral going on in academia (and Twitter) around the definition of "offensive" language.

Purity spirals never end well. The most famous one ended up with people being burnt alive as witches.

But it is a purity spiral - if you don't intend to be considered "pure" then it doesn't matter and you can effectively ignore it. Though whether your employer agrees or not is a different matter.


> I feel like this respect for being nice to each other so we can all get along is on the wane.

I think this is depending on how you see it. The modern landscape with social media makes activists very effective in creating information campaigns that raises awareness of how to be polite in a modern world. Eg. nowadays personal pronouns are complicated business but finding out and respecting individuals chosen pronouns are considered polite. A successful change of culture that increases empathy and respect of fellow humans in general. But the downside is that the opinions of the people who don't want to be polite in this way are also amplified and they may want to resist and pushback that triggers them to be contrary and act with less empathy and politeness to make their point. The thing is, positive changes in culture usually last and backwards reactionary people die so that makes me kinda hopeful.


> Eg. nowadays personal pronouns are complicated business but finding out and respecting individuals chosen pronouns are considered polite. A successful change of culture that increases empathy and respect of fellow humans in general.

In one way, that is of course true, and people who just blithely ignore that, or even explicitly refuse to use the pronouns they're asked to, are of course arseholes.

But OTOH, it feels like it's getting more and more to keep track of -- at least if you also want to keep up the old norms of "good behaviour" we learned as kids. And perhaps that's precisely why some of us are feeling those old norms are falling by the wayside -- maybe people only have some set amount of attention to spend on this stuff, and if something new is added, something else gets crowded out?

If that's the case, do we know for sure that it's unequivocally a good thing? I'm not quite certain.


> the creative and determined individual called cVnt_h_ater have bypassed the code and can proudly wear his misogyny on his sleeve.

Why deny him this right? Isn't it convenient to let kids and mentally unhealthy people identify themselves this obviously? It's always easier when you know what to expect.


Because if you let them do their thing, they group together, drag more people down to their level and then get one of them elected as president. That's not in my interest.


The idea of hating all women this way is nonsensical for normal people (despite the fact limiting their rights used to be a common tradition, I believe most of the people didn't really hate them) and can only exist in some mentally traumatized people. Seeing someone exercising obvious nonsense rarely triggers you to start doing the same on systematic basis. Electing a president requires a majority vote. Even though majority's intelligence is mediocre by definition, it hardly is SO dumb and sick. Even a rude person with sexist jokes, even a women rights limiting proponent rarely is a real serious passionate misogynist, such will mostly see "cVnt_h_aters" ridiculous too.


> "when you call me X, that makes me feel bad, could you call me Y instead?"

There are absolutely people who will use this as a reason to put X on a billboard.


People, yes. Friends, no.


I thought it was something along the lines of "sticks and stones may break my bones but words can break hearts"

Basically, the pen is stronger than the sword, that kind of moral.


I've never heard that one, but it might be a regional variant. I always heard the other one.


I grew up with 'words can never hurt me', but this variant is my favorite:

Sticks and stones may break my bones but words will hurt forever.


I'm not a native speaker, so almost every time I hear that phrase it ends after "sticks and stones", because on TV and movies, that's usually enough for the intended audience.

After Googling, I think I might have picked up this version from Tim Minchin's song "Prejudice" [1], although Google also comes up with this clip where a senator also uses the "break my heart" version, clearly when he was meant to say "never harm" [2]

1: https://www.youtube.com/watch?v=KVN_0qvuhhw

2: https://youtu.be/QoMheenRUHM?t=92


I'm not the only kid Who grew up this way

Surrounded by people who used to say That rhyme about sticks and stones

As if broken bones

Hurt more than the names we got called

And we got called them all

https://www.youtube.com/watch?v=ltun92DfnPY


> We used to be taught that "sticks and stones may break my bones but words will never hurt me".

Now it is more "sticks and stones will break your bones because your words have hurt me"


I'm quite a feminine guy and I was always called various homophobic slurs at school. Even though I wasn't gay I didn't understand why it would matter if I was so I would just say "ok".

Even when the insults could be considered objectively offensive I'm still generally okay with it. I've been called names quite a bit in the workplace over the years and I'm okay with it if makes them happy. Being autistic I don't really care and neurotypical people seem to get enjoyment calling people names so it's win-win.

High EQ individuals have told me this is wrong and the appropriate reaction would be to take offence and try to make them feel bad, but I've argued this would just result in an objective reduction in happiness in the world so it wouldn't make sense. Plus, most of the time my colleagues have been nice to my face so it's not like it's ever got in the way of me doing my job.


> neurotypical people seem to get enjoyment calling people names

I'm of two minds about this, because of the generality of this blanket statement. Why should we accept this as the norm? I would classify these people ("enjoy calling people names") as abusive, and think that classifying this behaviour as neurotypical effectively legitimizes it.

That said, nicknames are also an expression of affection and inclusivity, and derogatory nicknames are not necessarily meant as an insult. Still, they're just as easily used to demean, belittle or dehumanize the target, so it depends a lot on context. I understand what you mean, but I disagree with your general statement.


I adore you but doubt your judgement about those being high EQ individuals.

By the way, in middle school we used a derogatory word for gay every day without having any idea of what it meant; all we knew was that it was the worst word we knew. When others bullied me (always physically, which is way harder to endure; or maybe I just never cared about the verbal kind and don't remember it), I occasionally went mad and called them this word repeatedly.


This is neither here nor there, but I learned from watching Louie that "faggot", the derogatory term for homosexuals, has the same origins as the word "fascism" – a bundle of sticks. The idea is that in the dark days of witch trials, one would, apparently, not only burn supposed witches, but also homosexuals. However, since homosexuals were thought of as less than witches, while a witch got the stake, the homosexuals were thrown in with the bundles of firewood. The now archaic word for a bundle of sticks was faggot, or fire-faggot in the case of firewood. The part about different forms of heresy receiving different treatment at the stake might be artistic license, the rest is true though.

https://www.oed.com/viewdictionaryentry/Entry/67623#eid46436...


This is probably also the root of the other meaning of the word "fag", which in British English slang (or is it even "slang" any more?) means not only "homosexual" but also "cigarette": A little stick that burns.


"Being autistic I don't really care and neurotypical people seem to get enjoyment calling people names so it's win-win."

Not sure how you win exactly, but others in a similar position but sensitive to name calling certainly don't.


The thing with basic etiquette and other secondary virtues is that if you have none of them, the climate can become horribly rough and unpleasant but you can also overdo them. It's all about a reasonable equilibrium.

As for name calling and slurs at work, I don't see a reason why these should ever be tolerated. Someone who cannot demonstrate a minimum of politeness towards their colleagues shouldn't be part of the team. To me this is a no-brainer.

Usernames are a bit different IMHO. In a general social network, offensive usernames give important clues that allow you to avoid all contact and block users before you even talk to them. So maybe banning them does more of a disservice than a service to other users.


> As for name calling and slurs at work, I don't see a reason why these should ever be tolerated.

The problem is when language changes for some but not all, so suddenly words that used to be perfectly normal are considered slurs: Those who learned and use them in the original non-offensive sense are suddenly, and in their own opinion unjustly, seen as bigoted.


> High EQ individuals have told me this is wrong and the appropriate reaction would be to take offence and try to make them feel bad, but I've argued this would just result in an objective reduction in happiness in the world so it wouldn't make sense. Plus, most of the time my colleagues have been nice to my face so it's not like it's ever got in the way of me doing my job.

You may be wrong, and these "High EQ individuals" right, in the sense that your acquiescence to this bullying behaviour might encourage the bullies to keep it up with others too. In that sense, your standing up against them would be not just for yourself but for all their potential future victims too, which would increase the future sum total of happiness in the world.

(I'm not saying it is necessarily so, but it may be.)


> High EQ individuals have told me this is wrong and the appropriate reaction would be to take offence and try to make them feel bad, but I've argued this would just result in an objective reduction in happiness in the world so it wouldn't make sense.

As the sibling comment alluded to, you definitely seem to be the "High EQ individual" in this story!


What's rude is the shared imagery, not the words. If I call myself MyDirtyDickInYourDogBringsMeToOrgasm, imagining this scene is disgusting enough to ask me to change it.

Words are not just symbols on paper, they are image triggers in your brain.


I have Aphantasia, thankfully I'm invulnerable to this.


If someone attempts to degrade someone's emotional or mental state, that's something that should be stigmatized. But, that doesn't mean that the individual is not completely responsible for his own mental state, in the sense meant by Stoics. It's not about things, but our thoughts about things, which are ultimately under our control. We may play the victim, but we'd only be fooling ourselves.


I disagree. Language is very often used as a way to include and exclude people in and from various social groups. You're right in that there's never any direct harm in the sound of the words themselves. But you can say the same thing about verbally threatening physical harm (or even psychological harm) to someone, and would be very wrong in your assumption.


> But you can say the same thing about verbally threatening physical harm

No, verbally threatening physical harm induces reasonable fear and affects your logical reasoning because now you have to consider real risk.


Trying to hurt someone with words also induces a reasonable assumption of that person wishing you ill. You don't know how far they will go in either case.


> No word can actually harm you directly.

"Sticks and stones may break my bones, but words will never hurt me" is something told to children in the playground who are too young to know better.

You might as well argue that there's no such thing as malware, because its entirely harmless until executed on a processor.


So "get over it" is your solution? Sounds like you've never been a victim. Place yourself in someone else's shoes please.

I mean I don't like the (self) censorship either and don't mind rude or mature words, since we're all mature here and calling a fucking asshole a f*ing a$$h0le is an insult to people's intellect.


As someone who was incessantly bullied for many years, "get over it" is pretty good advice for some people. Like all advice on personal emotional problems, it doesn't work for everyone, but for a lot of people it does.


> Uhura : But why should I object to that term, sir? You see, in our century, we've learned not to fear words.


Yeah, that only shows how stupid the original Star Trek often was. Just because they hadn't yet had a Trump or a Hitler in Uhura's century is no reason for her not to know that "words" are exactly what scum like that rises to the top on.


Agree, the scum is leveraging wetware vulnerabilities in our species around sound bites, tribes, labels, and twisted facts.

But I think that was Roddenberry's point: we can only advance after excising that primitive baggage.


I think words can hurt people in certain contexts, but it is very rarely about a single term.


It's getting particularly problematic for international sites. What constitutes offensive language in one country might be a perfectly normal name, or product, in another country that uses the same language, let alone in another language.

There's been a running joke on the UK Reddit subs recently about people getting short bans for using the term "faggot" as it's now on some automatic, site-wide blocklist. The bans are completely context free, so people are supposedly getting banned for discussing the food product, a kind of large meat-ball made primarily from minced offal. The same happens with "fag", meaning a cigarette.


There's the opposite problem too, where "fanny" has a different meaning in the UK vs. USA. There's the classic line from "The Office" about it ("over there fanny means your arse. Not your... minge") and I think most folk are pretty aware of the American meaning here these days.

I wonder, is it too much to ask that we (as in, the various English speaking places) understand the different meanings of words that are offensive in one dialect but have a different and mundane meaning in another dialect?


That this system is so primitive doesn't surprise me at all. After all, people with the family name "Null" often run into database errors, even though even Oracle, which can't tell NULL and the empty string apart, does distinguish IS NULL from = 'NULL' comparisons... https://www.bbc.com/future/article/20160325-the-names-that-b...
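To make the IS NULL versus = 'NULL' distinction concrete, here is a small sketch. It uses Python's built-in sqlite3 module rather than Oracle (purely so it runs anywhere), and the table and column names are invented for the example.

```python
import sqlite3


def demo():
    """Show that the surname 'Null' and a missing surname are different things."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema, just for illustration.
    conn.execute("CREATE TABLE people (first_name TEXT, surname TEXT)")
    conn.executemany(
        "INSERT INTO people VALUES (?, ?)",
        [("Ada", "Null"), ("Bob", None)],  # None is stored as SQL NULL
    )
    # Bound parameter: the string 'Null' is compared as ordinary text.
    named_null = conn.execute(
        "SELECT first_name FROM people WHERE surname = ?", ("Null",)
    ).fetchall()
    # IS NULL matches only rows where no surname was stored at all.
    missing = conn.execute(
        "SELECT first_name FROM people WHERE surname IS NULL"
    ).fetchall()
    conn.close()
    return named_null, missing


if __name__ == "__main__":
    print(demo())
```

With bound parameters the database itself has no trouble with Mr. or Ms. Null. The trouble usually starts in application code that turns a missing value into the text "null" (or the reverse) somewhere along the way.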


Little Bobby Tables struggles


It's a similar problem to people having names that are considered "fake". I shudder to think what people actually named Harry Potter or James Bond have to go through regularly.

Also, I don't see any sense in actually enforcing any of those lists. I always think about the "Journey of Life" in The Grand Tour, which, frankly, is completely inoffensive in any language other than English https://www.youtube.com/watch?v=BLXe2WTYngQ


I live in Wedding, pronounced "Vedding". It was weird for about the first week, then it never really occurred to me again. I'm sure the people who live in Fuck have the same, right up until they have to fill in their address online.


Seems like a rude name might give you some privacy advantages in the 21st century. At the price of being excluded by random services, maybe.

Old math prof of mine was named Dr. Cock, and he wasn't the only one, the name was common. Students called him Dr. Octocock.


I have a friend who studied at the University in Linz, way back when. According to him, on the university's System/370 they used to have an account name scheme where they simply concatenated the first 2 letters of the first name and the first 3 letters of the last name. Allegedly, the scheme got changed after Professor Arno Schulz (https://de.wikipedia.org/wiki/Arno_Schulz) was understandably upset about his account name (https://en.wiktionary.org/wiki/Arsch).
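The scheme as described is easy to reproduce. A rough sketch follows; the function name is mine, and uppercasing the result is an assumption (the story doesn't say how the accounts were cased).

```python
def account_name(first_name: str, last_name: str) -> str:
    """First 2 letters of the first name + first 3 letters of the last name.

    Uppercasing is an assumption made for this example only.
    """
    return (first_name[:2] + last_name[:3]).upper()


if __name__ == "__main__":
    print(account_name("Arno", "Schulz"))
    print(account_name("Ada", "Lovelace"))
```

Any scheme that generates identifiers from fragments of names can spell something unfortunate, so generated names probably deserve the same review step as user-chosen ones.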


Random story I remember my ICT teacher telling us in school along the same lines as Scunthorpe. They installed some nonsense web blocking thing that schools like to have and a lot of teachers complained because it blocked weightwatchers.com


:)


I only recently learned about Scunthorpe because the username I've been using for 15 years is now affected.

Georgyo has orgy right in the name. More and more places are refusing to accept it when signing up.
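This is the classic substring false positive. A quick sketch comparing plain substring matching with whole-word matching; the one-entry blocklist and the function names are hypothetical, not taken from any real filter.

```python
import re

# Hypothetical one-entry blocklist, taken from the example above.
BLOCKED = ("orgy",)


def substring_hit(name: str) -> bool:
    """Naive check: the term appears anywhere inside the name."""
    lowered = name.lower()
    return any(term in lowered for term in BLOCKED)


# \b only matches where a word character meets a non-word character
# (or the start/end of the string).
_WORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in BLOCKED) + r")\b",
    re.IGNORECASE,
)


def whole_word_hit(name: str) -> bool:
    """Stricter check: the term must stand alone as a word."""
    return _WORD_RE.search(name) is not None


if __name__ == "__main__":
    for candidate in ("Georgyo", "orgy fan", "orgyfan"):
        print(candidate, substring_hit(candidate), whole_word_hit(candidate))
```

Whole-word matching lets Georgyo through, but it also lets through anything glued to other letters, which is presumably why many sites stick with the substring check and accept the collateral damage.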


> We used to call this the "Scunthorpe problem"

Fwiw, it's well-known as such, after relatively high-profile incidents involving people with that address: https://en.wikipedia.org/wiki/Scunthorpe_problem


Is that just well-known in the UK though? I've mentioned this to people outside the UK and only occasionally have they known it.


Nobody outside the UK has heard of Scunthorpe because it can't make it through the filters.

(There's a ridiculously large list of innuendo place names for tiny villages: https://anglotopia.net/ultimate-list-of-funny-british-place-... )


One of Douglas Adams' lesser-known but still hilarious works: https://www.panmacmillan.com/authors/douglas-adams/the-meani...


Hey now, it's not Bielefeld. In Ohio at least, the word it's concealing is taboo enough that nobody who does know wants to explain to their more-sheltered coworkers why the situation is funny. (Clearly the filter wasn't designed in Australia.)


> Nobody outside the UK has heard of Scunthorpe

The Register has lots of readers outside the UK, and it has never shied from -- on the contrary, seems it positively delights in -- blowing this stuff up in its pages.


I'm also in the UK, so I don't really know. I thought/assumed it was fairly well-known among software engineers, HN, etc. (not discounting the day's 'lucky 10,000' of course) but not so much by others, regardless of country. :shrug:


I’ve seen it called the clbuttic problem


The city of Toppenish in the US had the same problem when the council turned on a generic filtering system for the city networks. Everything stopped working.


Haha, I call this the "Clbuttic" problem. But nowadays it can be solved fairly easily with machine learning https://moderationapi.com/blog/moderate-text-automatically-u...


That's a slightly different problem from the Scunthorpe problem.

You have a customer who gives her name as "Fanny Batter". Is that a permissible name?

What about "Juan Kerr"? Or "Amanda Huggenkiss"?

It's not about replacing "ass" with "butt", but detecting rudeness/humour in context.

I'd be amazed if machine learning can solve this one. It's extremely hard for humans.

* all those names should be rejected btw


> * all those names should be rejected btw

No they shouldn't.


If you don't want someone going to the press and creating some "human interest" story about how they got a loyalty card with a stupid name, then your algorithm should at least question them.


But perhaps you do want them to do that, or at least don't care if they do. Maybe you think silly jokers is a good demographic to market to this way.

And if not, then just tell the press: "They typed in the stupid name, so they got it on their card. How is that our fault? Now go write about some real news instead of this lazy sensationalism."


The clbuttic problem involves replacement; original source is (AFAIK) here:

https://thedailywtf.com/articles/The-Clbuttic-Mistake-
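For anyone who hasn't seen it, the mistake takes two lines to reproduce. A sketch, using the "ass" to "butt" replacement mentioned upthread; the function names are made up, and the word-boundary variant is only one possible mitigation.

```python
import re


def naive_censor(text: str) -> str:
    """Blind substring replacement: this is what produces 'clbuttic'."""
    return text.replace("ass", "butt")


def bounded_censor(text: str) -> str:
    """Replace the term only where it stands alone as a word."""
    return re.sub(r"\bass\b", "butt", text)


if __name__ == "__main__":
    sample = "a classic ass"
    print(naive_censor(sample))
    print(bounded_censor(sample))
```

The bounded version leaves "classic" alone, though it is just as easy to evade as any other word list.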


If nobody ever implemented a rude word filter, nobody would ever need to implement a rude word filter.


Reminds me of the guy that streamed a talking banana on Twitch, where viewers could make it say things. People submitted variations of the n-word and got him banned, and after trying to filter out all character combinations he could think of he wrote a phonetic filter. That apparently worked much better than trying to think of every permutation of characters that sounds like bad words.

https://youtu.be/bJ5ppf0po3k?t=715


>the guy that streamed a talking banana on Twitch

Of all the ways I expected a talking banana to backfire, I didn't expect this one. Thanks for sharing


Anything exposed to the open internet devolves into porn or racism or both unless active effort is made to prevent it. I'm reminded of https://en.wikipedia.org/wiki/Tay_(bot)


quite curious as to how Instagram managed to avoid this, despite launching with no moderation, (just Mike and Kevin), yet all you needed was an email to use the service...


That's called the tragedy of the commons, everything starts pure and innocent, that's why people tend to romanticize the "early days" before X


They certainly have moderation now, and it's a constant exercise in boundary-pushing. I'd be interested in a "history of Instagram moderation", that would be a great piece of anthropology.

Was the no-links policy there from the start? I think that would have helped a lot. As it is, you're allowed basically one outbound link from your profile, so there are link-expander services which people use to link to more things.


Wait, now I’m extremely curious what failure modes one would expect from a talking banana.


I assume the other failure modes are pornographic in nature.



The list included mike hawk (phonetically similar to 'my...'), so they are interested in phonetics, apparently, even for these usernames. The banana streamer has their stuff set up better than twitch, then.


From the video: "It was a mix of racism and creativity"

That sums up the WoW Classic community (and, I'm sure, many other gaming communities) a little too perfectly.


That was both terrible and amazing.


They figured that out on Ellis Island so yeah. Soundex.
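
For anyone who hasn't met Soundex: below is a minimal sketch of the classic American Soundex coding, written from the usual textbook rules. It is not the banana streamer's filter (the video doesn't publish code), and the function name `soundex` is just my choice. Soundex keeps the first letter and encodes the following consonants as digits, so names that sound alike often collapse to the same four-character key.

```python
# Digit codes for consonants; vowels, y, h and w carry no code.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character American Soundex key for an ASCII word."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of same-coded letters
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # both map to the same key
```

Because the first letter is kept verbatim, Soundex alone is a crude tool for a rude-word filter (an "f" spelling and a "ph" spelling get different keys), so treat this as an illustration of the idea rather than something to deploy.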


Are you saying Ellis Island used Soundex (that seems to check out) or devised it?


Ellis Island is the one with that big green statue. What does it have in common with Soundex?


I’m not saying that there aren’t big green statues on Ellis Island, but the really really big green one is on the next island over.


> really really big green one

It's not that big really, half of it is the plinth to be honest.


Ellis Island in New York harbour (AFAIK) used to be the main immigration center for people arriving by ship from Europe, and is famous (among other things, I assume) for having originated some weirdly spelled names when the immigration officials who registered the newly arrived got it wrong.

Don't know if the GP meant that they had soundex, or that they invented it, but in any case it seems they would have needed it.


That is oddly hilarious yet really dark at the same time. Thank you for sharing the story about the talking banana.


But now it's anti-Chinese if you can't say 那個 :v


We had to do this for a link shortening system (to make sure random base64 didn't contain profanity). It was a pretty fun problem. Not just the implementation, but doing the math to make sure it didn't make our shortened links easily enumerable. The implementation wasn't too bad, but we set up logging initially to spit out any random strings it decided to block. I demo'd this in front of the whole company and live tailed the logs and the first one that popped up during the demo was a big ole F bomb. It made for an excellent demo.


> to make sure random base64 didn't contain profanity

I would have said "why bother" until this happened to us.

A customer rang us up in a fury because some demo/random data that we generated happened to have the word "penis" in it. They were convinced we must have put it there because we thought he was a cock. It was very difficult to defuse the situation.


I have a bus ticket from Stockholm with the serial of F4CK. Totally made my day back then to be honest.


A guy at my school (in Poland) got in serious trouble because of his hoodie with the huge FCUK logo on it. It took actually showing the principal that this is a legit company[0] and not just a play on "FUCK".

[0] https://en.wikipedia.org/wiki/French_Connection_(clothing)


I mean, it was still a play on the word, just being done by a legit company. Not that legit means mature.

I was doing a student event when I saw someone wear "K1SS MY 4RSE". I told him "What an obnoxious hoodie.". He meekly said "I thought it said 'Kiss my force'.". I later saw him in a corner praying. I should've asked him what Allah would've thought about him talking to him wearing that hoodie.


[flagged]


Please note: HN is not Reddit.


I was pretty happy when my randomly generated imgur URL of a photo of a telephone cake I made ended up being xxTEL.

https://imgur.com/gallery/xXtEL


> I would have said "why bother" until this happened to us.

Aah, the good ole "one customer is unhappy, let's waste a week of time on this" approach to IT management. Takes guts to tell such customers "here's your refund, now piss off", but it is the right thing to do.


Not just that, also “let’s all stop having fun”. (Not about this case, but


Recently on HN someone thought that reddit.com/imgur directing to a post on /r/Drugs meant something when it's just a randomly generated ID. Every 5-letter word that I tried worked because there have been so many posts.

https://news.ycombinator.com/item?id=28676096



I just recently saw a randomly generated ID of ours in production that starts with "doggy". Thankfully "doggy" is pretty innocent, but it really made me think "wow what if it was something bad". Unfortunate that that exact scenario seems to have happened to you already.


I got a CAPTCHA variant not long ago that was along the lines of "U2KYS". Pretty toxic!


"You too kiss" is "toxic"? How? Or am I missing something?


We had a similar situation with a random name generator that just picked first names and last names at random. One result of this was 'Gaylord Dickinson', which sounds like it could only possibly have been made up as a homophobic joke, but which was just the random combination of two quite common first and last names.


When it comes to censoring randomly generated strings, I like simply to omit vowels from the alphabet. Usually I'll omit some of the more obvious lookalikes too, e.g. [1 0 v].

It's a simple solution. Sure, it is still possible for something to slip through that looks similar to something bad. But the potential to strongly offend is greatly reduced.
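
A minimal sketch of that idea, assuming lowercase letters plus digits as the starting alphabet (the exact alphabet and the names `EXCLUDED`, `ALPHABET` and `random_id` are mine, purely for illustration):

```python
import secrets
import string

# Vowels plus the lookalikes mentioned above: 1 (i/l), 0 (o), v (u).
EXCLUDED = set("aeiou") | set("10v")

ALPHABET = "".join(
    c for c in string.ascii_lowercase + string.digits if c not in EXCLUDED
)


def random_id(length=8):
    """Return a random ID drawn only from the reduced alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


print(len(ALPHABET), random_id())
```

The cost is a smaller alphabet (28 symbols here instead of 36), so IDs need to be slightly longer for the same number of possible values. Digits such as 3 and 4 can still read as letters, so drop those too if that matters.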


Yeah there was an article I read a while back about a company looking to prevent the use of 'naughty' words in randomly-generated strings used as event IDs. Apparently someone with some pull had seen a message with an offensive word. Some management committee spent a long time trying to figure out how to solve the problem including proposing keeping a list of bad words, and then worrying about what should be in it and who would maintain it. At some point an engineer got a chance to speak and said something like "just use base-31 and omit vowels". The story as I remember it didn't mention the use of v or l33t-speak, but they were randomly generated, not maliciously constructed, values.

Also note that if you're too naive about checking for 'naughty' words, you get https://en.wikipedia.org/wiki/Scunthorpe_problem


Hashids does this (avoid bad words) if anyone is curious to see an implementation

> algorithm tries to avoid generating most common English curse words by never placing the following letters (and their uppercase equivalents) next to each other:

> c, s, f, h, u, i, t

https://hashids.org/#how-does-it-work

E: ah it was already mentioned later on, hadn't got that deep into the comments yet!


Oh yes, there's plenty of ways to avoid curse words, but each one has a cost, and if the system needs to generate ids very quickly, any scheme that works too hard could be a bottleneck. The naive way of just randomly throwing together letters of the alphabet will eventually generate a forbidden word. Any steps taken to reduce the probability should understand the time/space tradeoffs.


> didn't mention the use of v or l33t-speak

What is 'v' in this context?

Edit: thanks for the answers. It makes sense now.


It's visually equivalent to a u, so without vowels it's still possible to get "fvck", for example.


The Romans wovld approve of svch a scheme.


How would the Romans have pronounced 'approve', though?


I suppose it's about the use of 'v' instead of 'u' in the F word


imagine the phrase "see you later" was offensive, or perhaps its abbreviation "cu"

the combo "cv" could then become problematic.


Gov.uk did this. The code WNKR still ended up on Reddit yesterday.


That's amusing, but I think it also highlights the effectiveness of the strategy. WNKR is excusable and defensible. WANK would not be.

Edit: But I'll concede that when your outputs are only four characters long and end users will actively interact with them (write them down, type them again later, etc.), additional safeguards might be appropriate. Or simply omit all alphas and use only numerics.


> Or simply omit all alphas and use only numerics.

You're still not out of the park with numerics - people with 1313 or 6660 or 4444 or something will complain a lot. The possibility of a 666 in some new biometric government IDs in my country raised a massive stink from the church...


My girlfriend got a new bank account and when she received her account number it contained 666. She asked for a different number and they changed it without charge.


> biometric government IDs

Yeah. An important, long-lived ID that will stick with an individual for their entire life, and that they may want to commit to memory. That seems like a good time to take a hypersensitive approach and adopt some kind of filter.


I was in the UK recently and - if you can believe it - there was a car on the block where I stayed whose plates contained 666.

Also, I have a feeling you meant 1312. What’s the issue with 4444, though?


> What’s the issue with 4444, though?

https://en.wikipedia.org/wiki/Tetraphobia


Woah, had no idea!

> When Beijing lost its bid to stage the 2000 Olympic Games, it was speculated that the reason China did not pursue a bid for the following 2004 Games was due to the unpopularity of the number 4 in China. Instead, the city waited another four years, and would eventually host the 2008 Olympic Games, the number eight being a lucky number in Chinese culture.

Thought this was particularly interesting.


In Germany, where you can request number plate combinations (as long as they are free and follow a few rules), 666 is a pretty common combination amongst young drivers.

> What’s the issue with 4444

4 is pronounced similar to "death" in sino-japanese languages and dialects.


Also, as you have the municipality abbreviation at the beginning, and then two user-definable letters, you frequently see "rude" combinations, and no one bats an eye. BIT-CH, MON-GO, ANA-L, DIL-DO. (And those are just the combinations understandable in English that I saw when driving.) I'll never forget the face of the guy at the Zulassungsstelle when I was there with a friend who wanted COC-K-6969. He got it, btw. When I tried some years later, AC-DC-666 sadly was already taken.


People explained 4444, with 1313 I was thinking of the equivalent for superstitious westerners afraid of the number 13.


What’s the issue with 1312?


it spells ACAB if you match each number with the letter in the alphabet at this index (I realized that after seeing a bunch of 1312 tags around where I live)


What's the issue with ACAB?



> Or simply omit all alphas and use only numerics.

That works pretty well until you realize that some numerical combinations are common neo-nazi codes and may lead to ... unfortunate associations. The ADL lists a few of those^1, but the list is far from comprehensive, codes actually differ based on locality, and an accidental collision is more likely in a 10-character space than in an alphanumeric 36-character space.

[1] https://www.adl.org/education/references/hate-symbols/88


Removing vowels would increase the likelihood of bad word combos, but removing consonants would have the desired effect.


It’s so laughable that we care about whether a generated string contains some temporally relevant profanity. We truly are still barbarians, and will be viewed as such by history.


It highly depends on what the purpose of the string is: if it can be clearly seen by the user, has to be read, or worse typed, then it's not just a random string in a database.

If the string is a url, imagine sending https://somesite/wanker to your client, when it actually could also be https://somesite/ay3ugd


Well, given that we have a more than 5000 years old habit of looking for omens in random data to divine the future (whether that random data is scattered bones, laid out animal entrails, tea leaves, coffee grounds, tarot cards or so many others), it's unfortunate but not surprising.


It's not that stupid. People will send the shortened link to other people, who might not understand that the string was randomly generated.


"Hello Richard, your access key is URAC0CK"

It's random, I swear!


If you generate an identifier for an important client that contains "knobhead," they won't think it's a randomly generated string, but that someone at your company is deliberately insulting them.


We did a similar thing at Groupon after a customer’s coupon code contained an F bomb.


I removed the letter U from a random password generator for a Customer's app after a password was generated containing the "C-word".


Both of you are allowed to swear on the internet


They're allowed, but it is courteous of them to refrain.


And yet, not _required_ to if they don't want to.


I agree, I only wanted to point out that they can if they want to. I didn't say they have to.

Edit: I'm being downvoted so I want to explain - the internet is a huge mishmash of different cultures and all I wanted to say is that it is allowed to swear, because I thought that maybe, in their local one, it is not and they think it's universal.


Let it go, downvotes don't matter.


But I’d rather not say it, online or in person.


I presume you removed V as well? :)


Hashids (https://hashids.org/#how-does-it-work) have a pretty clever trick for this. They’re able to encode multiple IDs to a single obfuscated hash, which works by reserving some characters from the alphabet to use as a separator between each encoded value. That guarantees that whatever characters you choose to be separators are never next to each other in the output. By default their separators are (lower + upper case) “c, s, f, h, u, i, t”

It worked surprisingly well when we used it.


>the first one that popped up during the demo was a big ole F bomb

And this, ladies and gentlemen, is what it would show BEFORE the filter... but after (runs the code again, and prays it works) ... NO PROFANITY!


To be clear, this was the list of "things we blocked".

"Had we not done this work, that link would have been sent out to one of our users." was very well received.


In my help desk days I took a call from an irate person ranting about how we were telling her to "get a male sex change". Eventually I figured out she had become upset with "msexchange" showing up in the address!


Before Stack Overflow there was Experts Exchange. Their URL was of course those two words, all lowercase and mashed together into one... (Can't recall if they later inserted an underscore in between them?)


They added a tac eventually, I guess after their trip to pen island. It was expertsexchange.com for a long time though.


How statistically likely is this? Can't you just regenerate until you have a suitable short URL? Aside from performance, this is as random as you can be. Or generate the characters one by one and backtrack; this requires less random data. Or regenerate the unwanted substring.


Just trying again is absolutely fine.

Our main concern was whether we needed to increase the size to 26 to account for the loss of keys. After doing the math, a 25 digit random string has a ~5% chance of containing one of 150 three or four character inappropriate substrings. That 5% loss isn't that big of a deal. But we had to figure out the math as part of due diligence before shipping.
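
For the curious, here is a rough way to sanity-check that kind of number. This is a back-of-the-envelope union bound, not the calculation we actually ran: the function name, the `variants_per_char` knob (a crude stand-in for case-insensitive or digit-for-letter matching) and the half-and-half split of word lengths below are all assumptions for illustration.

```python
def collision_upper_bound(length, alphabet_size, banned_lengths,
                          variants_per_char=1):
    """Union bound on P(a uniform random string contains a banned substring).

    banned_lengths holds one entry per banned substring (its length).
    variants_per_char is how many alphabet symbols count as a match for
    each character of a banned substring (1 = exact match only).
    """
    total = 0.0
    for k in banned_lengths:
        positions = max(length - k + 1, 0)  # places the substring can start
        per_position = (variants_per_char / alphabet_size) ** k
        total += positions * per_position
    return min(total, 1.0)


# 25 characters, 64-symbol alphabet, 150 banned substrings (75 of length 3,
# 75 of length 4), each character matching in two forms (e.g. either case).
print(collision_upper_bound(25, 64, [3] * 75 + [4] * 75, variants_per_char=2))
```

With those made-up parameters the bound comes out at a little over 5%, which is in the same ballpark as the figure above, though that may partly be luck. The union bound overcounts strings containing more than one banned substring, so the true probability is slightly lower.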


Fascinating! I guess an easy solution is to inject non alpha characters into any generated string. I imagine a constraint was that you wanted them to be easy to type?


SMS is the biggest constraint. Unicode characters trigger lower segment char limits (effectively doubling the cost of a 71 char text message). And also it's important that the links can be clicked on a smartphone. So url-safe base64 (some shorteners use base62). And numbers can be N4u6hty too, so you gotta catch those cases.


Goodness the scope of the problem just exploded in my mind after this explanation.


The good news is that things like https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and... exist so getting a source of words to filter is easy enough. And converting numbers to letters isn't too bad.

The hardest problem with the implementation was that with a long list you can't just search for a few dozen inappropriate words (like the Twitch implementation). It would be very expensive to do hundreds or even thousands of checks against every inappropriate word.

The solution we came to was to truncate all the inappropriate words to either 3 or 4 letters and store them in a big set. We then take our generated strings, which are usually 11 characters, and break them up into all possible substrings of lengths 3 and 4. For example, 1a2b3c4d5e6 would be broken down into 1a2 a2b 2b3 b3c 3c4 c4d 4d5 d5e 5e6 1a2b a2b3 2b3c b3c4 3c4d c4d5 4d5e d5e6. An 11 character string would always have 17 such substrings. We then check all 17 against the banned set. 17 lookups into a set is pretty cheap and as we have expanded the word set over time (e.g. add a new language) our performance hasn't changed.

One drawback to our approach is that we do have false positives but we did the math and our space was still large enough, the cost of generating a new one was pretty low, and customers never see it so it's just not a big deal to throw out false positives.
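
Roughly, the approach described above looks like the sketch below. This is not our production code: the placeholder word list, the digit-to-letter table `LEET`, and the function names are all made up for illustration, and a real list would come from something like the repository linked above.

```python
import secrets
import string

# Hypothetical digit-to-letter folding so "h3ck" is treated like "heck".
LEET = str.maketrans("013457", "oieast")

# Placeholder words; every entry is truncated to at most 4 characters.
BANNED_WORDS = ["darn", "heck", "drat"]
BANNED = {word[:4] for word in BANNED_WORDS}

# URL-safe base64 alphabet.
ALPHABET = string.ascii_letters + string.digits + "-_"


def windows(s):
    """All substrings of length 3 and 4, in order."""
    return [s[i:i + k] for k in (3, 4) for i in range(len(s) - k + 1)]


def is_clean(candidate):
    """True if no 3- or 4-character window hits the banned set."""
    normalized = candidate.lower().translate(LEET)
    return not any(w in BANNED for w in windows(normalized))


def generate_id(length=11):
    """Draw random IDs until one passes the filter (just try again)."""
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if is_clean(candidate):
            return candidate


print(generate_id())
```

The number of set lookups depends only on the ID length, not on how many words are banned, which is why growing the list doesn't slow anything down. The aggressive normalization is also where the false positives come from.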


I have a hard time believing this was / is the real version used. It doesn't seem broad enough. More likely it was a kind of smoketest that made sure that a more automated keyword checker was working.

It does remind me of the XKEYSCORE (Snowden leaks) that used keywords to bubble up potential threats from emails etc https://www.businessinsider.com/nsa-prism-keywords-for-domes... .


This looks like legit, no-nonsense, gets-the-job-done code that gets updated every time some jerk finds a new way to be a jerk to others. It isn't great, but at least it's not over-engineered, and I'm not sure if Twitch account sign-up volumes and abuse are at the point where they should staff a project to do this more robustly / scalably.


Exactly what I think. This is the kind of code that doesn't have a platonic ideal, it has to get updated with time and experience and reports. There is no "non-hacky" way to do this, you just have to look at the reports that are coming in and keep adding rules that are relevant.


Well there are some significantly less hacky / more scalable ways to do this. This looks at the edge of maintainability, but if the lists were 10x as long and people constantly stepped on each other's toes causing outages while making updates, a bit of project investment probably wouldn't be a bad idea.


You say that there's no "non-hacky" way, but...

https://www.youtube.com/watch?v=bJ5ppf0po3k

He made a filter that did analysis of the pronunciation of the messages that were being sent. His breakdown of it starts at around 13:30


There are portions of my codebase that are intentionally "dumb" code. They contain cascading rules controlling what UI elements are visible that can be challenging to reason about. So I wrote it so simple that anyone can read it.

I see the same here. It's not clever, but no one has any doubt what words are being checked.


it's missing a few slurs, I'm not sure it gets updated (or it's a filter elsewhere which gets updated)


It seems you are assuming that software is usually written well, or as well as it can be. It's much more likely to be the opposite.


This seems like the kind of thing that would be horribly specced and be a user story along the lines of "the user must not be allowed to make an inappropriate username."

The engineer would write something for every test case the product manager complained about, anything else computationally easy, and call it a day.

I once had to implement an audit logging system. What was supposed to be logged? "Important actions." Nobody on the team could define it. We just logged every database write along with the username responsible and called it a day. Nobody ever followed up or inspected it.

Same deal. Both exist mostly for compliance.


Bad, old memory: "Every transaction must have a line written to the printer"

Tracking down "why is the antique system suddenly slow". Power went out, system came back up fine, everything but the one ancient but vital app is fine. Dig, dig, dig, there's this old dot matrix printer in another room (because it used to be loud and annoying) that no one has fed or looked at in years.

It finally died with that outage, and it not accepting data was the problem. It had cheerfully printed the ribbon through, then fed out the rest of the box of paper it had, and that might've been several years before I saw it.

The roller the paper was supposed to ride had been eroded. The metal rods the print head rode on had a perceptible bump at the ends of the normal stroke.

The fix was a little dongle for the printer port that held the appropriate "i'm alive" lines up. hardware /dev/null. I'm thinking it was 25 pin rs232 because I remember a lot of cussing over it.


Holy cow how old was this system, circa 1975 or something?


I saw a dot matrix printer logging the chemical composition of the flue gas at a chemical research place in about 2002.

That kind of thing is probably fairly common in the industry.


Dot matrix printers still make up a lot of the flight manifest printers at airport gates. Listen for them right before they close the doors. They print off the list of passengers checked in as boarded


I want to say it was probably installed in '85 and I saw it in '94, but I wouldn't swear to those dates.

I'm pretty sure it was the only dot matrix printer with a serial port I ever saw. Even daisy wheels were parallel port by the time this went in, but they had like a 50ft cable to move it to the other room. Someone worked hard and paid large to set that up originally.


Would be unsurprised if some poor engineer got assigned the project, realized it was an intractable mess of Scunthorpe problems, and decided to check some boxes and move on to some ticket of higher value.


It also mostly checks for English naughty words and not much else. People can have fun in lots of other languages, so it would seem this is a small sample.


You can't do all of them. Here is an example: in my native language Pula is a slur for male genitalia (way worse than "dick" in English) but at the same time it's the name of Botswana's currency (https://en.wikipedia.org/wiki/Botswana_pula) and also a city in Croatia (https://en.wikipedia.org/wiki/Pula).


Some near the end looked like they might be in another language, but I won't be the one to find out.

It used to be "if I search for this term, am I accidentally going to wind up getting goatse or something?" The good old days.

Now it's "if I search for this term, is the FBI going to kick my door in?"


I was pedantic about it in the good old days too, it's goatse.cx and not goatse

you need the domain to make the goat-sex joke work


Woah, 20+ years later and I literally had no idea until now. Learn something new every day, but somehow this one is shocking because of how obvious it is and how long it took :)


There was a lot more horrifying stuff on that site besides hello.jpg, but most everyone recoiled in horror before they saw anything else.


There are a few German words in the lists.


After looking through it quickly it seems to do the same as most profanity checks with Dutch: it doesn't block "kut" (a crude word for female genitalia) but it does block "kunt" (second-person form of "can")



Unfortunately it seems I don't have access to that page. Do I need to be a Wikipedia editor?


Ah, that's right—need to have had an account for four days and... to have clicked the 'Edit' button at least once?

That's what the linked page says (I'm not familiar with Dutch Wikipedia specifics), but that seems like such a strange statistic to track instead of minimum edit count.

So weird that I looked up how to configure MediaWiki autoconfirm requirements. They probably mean a minimum of one edit. Auto-confirmation considers only age and edit count, and I don't see why clicking 'Edit' is so meaningful a condition to warrant developing an extension.

https://www.mediawiki.org/wiki/Manual:Autoconfirmed_users

Searching about extensions did lead to discovering an obscure feature: there's now a built-in URL shortener: https://w.wiki/Q8

Edit: interestingly, the single-character ones seem to have been pre-planned: https://w.wiki/e, https://w.wiki/E, https://w.wiki/4


> need to have had an account for four days and... to have clicked the 'Edit' button at least once

In the Dutch WP? Because otherwise I should qualify.


Yep; the account for each wiki is automatically created the first time you visit that wiki, and rights are mostly local.

https://meta.wikimedia.org/wiki/Help:Unified_login

You can list your local accounts at

https://meta.wikimedia.org/wiki/Special:CentralAuth


And specifically Italian blasphemy for some reason…


One word, that starts with V, and isn't Vagina.


Bunch of ineffective entries too, all patterns containing underscores won't ever match.


Underscore matches any single char, so those patterns work fine. It’s an efficient way to match any small separator, like space, dashes, etc.


I didn't know _ is a metachar in LIKE expressions, thanks! Not the first time its syntax caught me off-guard.


It only has two, doesn't it? Underscore for "any single character", and percent for "any string of characters", AFAIK.
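
A quick way to see both wildcards in action, using SQLite only because it ships with Python (the leaked SQL is clearly for a different database, but `_` and `%` behave the same way in standard `LIKE`). The helper name `sql_like` is made up for this example.

```python
import sqlite3


def sql_like(value, pattern):
    """Evaluate `value LIKE pattern` in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        (matched,) = conn.execute("SELECT ? LIKE ?", (value, pattern)).fetchone()
    finally:
        conn.close()
    return bool(matched)


# "_" must match exactly one character, whatever it is.
print(sql_like("bad-word", "bad_word"))   # separator matched by "_"
print(sql_like("badword", "bad_word"))    # nothing for "_" to match
# "%" matches any run of characters, including none.
print(sql_like("xx_bad word_xx", "%bad_word%"))
```

So a pattern with an underscore catches the word with any single separator in the middle, but not the word with the separator removed; that case needs its own pattern.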


Is Twitch's footprint big enough internationally that they have to worry about it?


That list is a list of words chosen by William Knowles to taunt any NSA who may be listening.

It's not a list of words used by the NSA or any spies. https://attrition.org/misc/keywords.html


In a previous life, all of our code was scanned for "vulnerabilities". One of the issues they looked for was if passwords were being stored in local variables. Initially, LOTS of people would do something like:

$Username=<username>

$Password=<password>

Connection.string=($Username, $Password)

The parser would flag this - Password was being stored to a variable! So we just changed our code:

$pw=<Password>. Problem solved!


I can't find the article now but the developer who wrote these scripts said they were a singular effort from years ago before security was taken over by a more formal development team.


Someone in another thread mentioned that these might be part of corpus generation for an ML model. That would make more sense to me.


Based on the filepath given in this very thread, it seems plain that that's the case (safety-ml\offensive-usernames\data_pull\sql\bad.sql).


Training an ML model to "learn" a rules engine strikes me as an incredibly bad practice. It'd make more sense to just have an actual corpus of labeled data.


Seems plausible to me. One of the streamers on Twitch had an actual wooden board on the wall he lasered subscriber names onto, and some of the regulars had fun finding lewd usernames to gift subs to. There were quite a lot of them out there. It was kind of a running joke how much Twitch let through the cracks.


If valid, this means virtually every phone call, and every email (with clients) is flagged. :P

This must be more nuanced. Maybe, it's "additional algo processing when a word is hit", eg another layer before "involve human".


If you liked this chaos, you'd love my 15+ years of cobbled-together efforts at limiting forum spam and the like. I suddenly don't feel quite so alone in the myriad efforts needed to tackle this sort of thing.


It's an endless arms race. I spend a lot of time studying unsavory people and a great deal of effort goes into the development of new dogwhistles that are designed to either provoke or connect with peers while maintaining deniability. You might like this paper on the evolutionary dynamics of covert social signaling: https://www.nature.com/articles/s41598-018-22926-1


I can really identify with that. My efforts are against spam and trolls; at least the spam is predictable and easier to take a sledgehammer to! Those grey-area trolls are such a miserable part of online content and moderation.


If they're covert and only members of the group using them know what they mean, then why fight it? Seems like a helpful way to communicate about things that others might not like to hear. The paper you linked said that too - "Such signals may allow coordination and enhanced cooperation while also avoiding the alienation or hostile reactions of individuals with different preferences.".

This happens all the time in cartoons that appeal to children but also contain subtle adult jokes so everyone can enjoy them on different levels.


Because it encourages more of the same, and it's rarely completely covert. Often and increasingly, it's racial dogwhistling and just devolves every thread to wasteful, antagonistic and unproductive conversation. You might've heard the line/story about accommodating nazis in your bar.


Maybe dog whistling isn't the problem but just plain old whistling?


At an online company I worked at we had various word filters for our forums, but new stuff was always popping up and getting through.

What worked in the end was having any newly created thread send a message containing the post title & body to a slack channel specifically for monitoring the forums. Employees and our forum moderators were in there, and any bad threads were nearly instantly deleted. Eventually the spammers mostly gave up. Hard to beat a dozen human brains :)


I get alerted to every new thread, and when I'm online my response rate is also very fast. When the message count was under 1,000/week, I used to get an email for every single post too. But if other moderators aren't around and I'm offline, I am out of luck. Shadow-banning can be effective. I also give regulars the ability to sin-bin any post which removes it from view and leaves it for me to check.

Having a dozen reviewers, especially if spread across timezones, would be a dream!


What's this one about?

    CREATE OR REPLACE FUNCTION is_blasphemy (VARCHAR) RETURNS BOOLEAN STABLE AS $$
     SELECT replace($1,'_','') SIMILAR TO '%p(o|0)rc(o|0)di(o|0)%'
     OR replace($1,'_','') SIMILAR TO '%p(o|0)rc(o|0)mad(o|0)nna%'
     $$ LANGUAGE SQL;



Sheesh

c) "Porco dio saranno mica i testimoni di Geova? No eh diocan digli che i signori sono fuori, non ho tempo per stargli dietro." ("Fuck, they can't be Jehovah's witnesses, can they? Tell them we're out, we don't have time for their shit.")


That's a beautiful swear. Far from naughty, I feel enriched having learned this.


I tried googling it but I put it all in one word. Thanks for the help!


It is not even remotely complete. In Italy we have two regions dedicated to the creation of blasphemies so advanced in ingenuity that two telephone books of regexps would not be enough to stop them.


That one was my favorite too. Filter lists don't work well in a multi lingual world. https://www.reddit.com/r/Rainbow6/comments/a01w7q/got_banned...


Mine is `create or replace function is_tragedy`! (Not the tragedy itself of course, though I confess I'm unfamiliar, just that line specifically.)


that should actually be porcAmadonna


Grazie.


Italian


More high-profile profanity filters open for viewing: edit filters used in Wikimedia projects, which also together encompass many languages.

The main 'bad words' filter of English Wikipedia:

https://en.wikipedia.org/wiki/Special:AbuseFilter/384

The page (on en:wiki) listing all filters, which also have uses other than detecting abuse:

https://en.wikipedia.org/wiki/Special:AbuseFilter

The special page has the same title in other editions of Wikipedia and other Wikimedia wikis, though many filters are set to hidden. Dutch Wikipedia, for example:

https://nl.wikipedia.org/wiki/Special:AbuseFilter/10



This script feels like something that:

- should be in its own application with its own rules engine so you don't accidentally whack a bunch of usernames

- I would have done in the past, and that I'd cringe at when people asked me to update it.


I feel bad for the Amazon employee who came into work one day with this project sitting on their desk.


My understanding from talking to {current,former} {Amazon,Twitch} employees is that Twitch has retained a decent amount of engineering independence. For better or worse, it's unlikely that some rando at Amazon ended up with this particular PHP file on their desk.


Truth is, Amazon is the parent organization of Twitch, so calling Twitch employees Amazon employees as well is perfectly fine. Just like calling GitHub employees Microsoft employees.

So yes, Amazon employees have ended up with this particular file on their desk.


Twitch is to Amazon as Quebec is to Canada.


This ain't PHP...


My brain sees sigils and thinks PHP. Whoops.


Man, I was _not_ expecting it to be stored procedures lmao.


Well, it is a good idea, actually.


As long as you don't mind a cheeky production release to get new words urgently added once the next bot wave hits.


My wife used to name her RPG characters “Isis” after a cat we used to have. Used to have to explain to vets that we named her Isis years before Bush and Cheney created Isis by starting their illegitimate war in Iraq.


Better yet, my children's school had a parent support portal called Isis.

I have some fun emails whose subject line is "here are your ISIS family log-in details"

and this was right around the time the terrorist group was frequently in the news


Are people not familiar with the Egyptian goddess, though?


In 2021, hardly anybody remembers Isis to be the Goddess of Love who invented marriage


Isis is a common figure in everything from literature to disco lyrics. It's offensive to ban that name.


Honestly, what a great way to start _that_ conversation haha


ISIS was founded in 1999.

The Iraq War began in 2003.

Bush's administration ended in 2009.

ISIS gained power in 2014, 5 years into Obama's administration.


Aw, why can't you be nicknamed 420blazeit?

That's not even offensive and there are still way more weed-nicknames available, so I don't get it.


You can if you’re a magic fire sprite:

http://www.threepanelsoul.com/comic/picked-out


420 == marijuana, blazeit - self-explanatory. Not everyone is fine with it.


Real story: I needed to register an instagram account a few months ago. I tried my nick "Iv", of course it was unavailable. I tried many different alternatives and finally managed to get "Iv but the real Iv" or ivbuttherealiv. Recently I saw my account was banned. I then realized that there was a "butt" in there.

Now I will never have an IG account linked to my FB. Oh well, I can live without that but thank god I don't depend on that platform for business. It made me laugh but there is zero appeal available.


Oh classic Mike Hunt, poor guy.


I see they got Mike Hawk, too. But Mike Litoris is still free to use his/their own name.


We had a legitimate Michael Hunt at our school. He went by Mike.


I used to work for a company that automated customer service. While we were trialing with a new client, our AI service responded to a customer: "Hey B**, thanks for reaching out."

To our surprise, the end user did not feel offended at all. In fact, they were happy because we responded instantly instead of the usual 24 to 48 hours.

[Story]: https://idiallo.com/blog/do-you-make-your-customers-wait


The word is bitch so the sanitised version would be B**

Why portray the bad word as 3 letters - it doesn't make sense.


Ha, the same thing probably happened to GP as has happened to you - unescaped * characters (escape with `\`) resulting in two imperceptibly italic asterisks.


you are 100% correct!


> LIKE '%aggin%'

Looks like "Baggins" is banned.

Poor Bilbo...


I feel like you could make an interesting game out of this. Given these rules, find the best "false positive," a realistic and inoffensive, but banned username. My best so far are "brownie_gurl" and "Megasthenes."
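For anyone who wants to play along, here is a rough Python sketch of how a `LIKE '%...%'` rule behaves. Only the `aggin` pattern is taken from the thread; the list name and the helper are made up for illustration, and the match is case-sensitive, as `LIKE` is in PostgreSQL.

```python
# Hypothetical, tiny stand-in for the leaked rules: only the '%aggin%'
# pattern quoted in the thread is real, everything else is illustrative.
BANNED_SUBSTRINGS = ["aggin"]


def is_banned(username: str) -> bool:
    """Mimic `replace($1,'_','') LIKE '%...%'`: drop underscores, then substring match."""
    cleaned = username.replace("_", "")
    return any(fragment in cleaned for fragment in BANNED_SUBSTRINGS)


if __name__ == "__main__":
    for name in ["Baggins", "b_a_g_g_i_n_s", "Gandalf"]:
        print(name, is_banned(name))
```

Any name that merely contains the fragment is caught, which is exactly why innocent names end up on the wrong side of the filter.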


Sure, but...

The story we know was written by him. I wonder what the trolls would say, or the dead dragon, or the town that his actions helped destroy?

Is he a hero, or, just the guy who wrote it all down?

And just when something really important happens, he throws a powerful weapon (the One Ring) at his nephew and goes away to retire!

A life led with riches (gold from the trolls), a ring granting him extremely long life and health, and yup... off he goes, first sign of real trouble.

Poor Bilbo indeed!


So is Mike Hunt.

Poor Mike.


Glad to see the word Niger (as in the country) wasn’t blocked. Ran into this with Venmo a little while ago and they thankfully backpedaled. https://news.ycombinator.com/item?id=24042742


It seems NVIDIAMINER could be controversial.


Language friend, this is a family site


If "fork" and "shirt" are allowed I am fine. :)

You can still use any language other than English to achieve the goal.


Unexpected "The Good Place" :)


Doing this in SQL is absolute insanity. I’ve seen many pieces of code grow in this way and I understand how it happens, but it still surprises me to see how scrappy things are under the hood at some big-time companies.


Hello, what would be a better way to do it? In code? Regex?


Probably yes, in code with regex. To name a few advantages: 1. Better readability and organization: rules and word lists can be abstracted into more of a config format. 2. Rules and words can easily be stored in a database, supporting dynamic additions and changes. 3. Better accountability for performance: profiling can catch perf issues in the rules. 4. Testable with unit tests.
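A minimal sketch of what that could look like in Python. The regex fragments are adapted from the `is_blasphemy` and `is_derogatory` functions quoted in this thread; the `RULES` layout, the function name, and the lowercasing step are assumptions for illustration, not how Twitch does it.

```python
import re

# Hypothetical rule "config": category -> list of regex fragments.
# In practice this could be loaded from JSON or a database table.
RULES = {
    "blasphemy": [r"p[o0]rc[o0]di[o0]", r"p[o0]rc[o0]mad[o0]nna"],
    "derogatory": [r"retard"],
}

# Compile one alternation per category up front, so adding a word is a
# data change rather than a code change.
COMPILED = {
    category: re.compile("|".join(fragments))
    for category, fragments in RULES.items()
}


def categories_for(username: str) -> list:
    """Return every category whose pattern matches the cleaned username."""
    cleaned = username.replace("_", "").lower()  # lowercasing is an assumption
    return [name for name, pattern in COMPILED.items() if pattern.search(cleaned)]
```

Because the rules are plain data and the check is a pure function, a unit test is a one-liner, e.g. `assert categories_for("p0rco_dio") == ["blasphemy"]`.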


I wonder why they didn't go with syllables and something like a Levenshtein distance to syllables?

That way they could more easily maintain similar looking and sounding words, including leetspeak and other variants.

Because with this kind of approach, something like "yolocaust" will get through, as most checks only go with exact matches and permutations will always get around it easily...

Whereas with something like a Levenshtein distance you could compare it with a set of words and syllables, and if it looks too similar (e.g. 90% similar once the distance is scaled by username length), you could simply block it.
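A hedged sketch of that idea in Python. The edit-distance routine is the standard dynamic-programming one; the similarity formula, the 0.8 threshold, and the placeholder word list are assumptions, and this version compares whole usernames only (it does not slide over substrings or syllables).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """1.0 means identical; lower means more edits relative to length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def too_similar(username: str, banned: list, threshold: float = 0.8) -> bool:
    """Flag a username that is within the threshold of any banned word."""
    return any(similarity(username.lower(), word) >= threshold for word in banned)
```

With a placeholder list such as `["badword"]`, `too_similar("badw0rd", ["badword"])` is flagged because only one edit separates the two. The flip side is that a single edit is a large fraction of a short word, so a threshold loose enough to catch variants of short words will also catch plenty of innocent ones.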


Agreed. If it's even close to a no no word it should be banned completely. No more riggers, naggers, poggers, biggas, lucks, bunts, minks, bikes, or trikes. This is a doubleplusgood plan.


Maybe usernames in general should require a higher entropy than words in a dictionary :D


People who hate naggers don't like Levenshtein either


So what is a good practice for this kind of thing? I’ve got an app where users create a display name.

Is there a legit service I can use or some actual well tested library that can help me? I’m using node and Go so either language.


Give me a few weeks


?

Are you making such a service?


Looks like it is limited to English only. There is a whole world of non-English offense out there. Like using English characters to make national offensive words as well as using national characters to make English offensive words.


English only

Should optimize checks with a de-obfuscation function (attempt to expand non-AZ back to AZ, even if that then shoots permutations at banned word-runs).

It should probably also look more like a spam scoring system, where really obvious stuff is hard-trashed but borderline things are flagged for review / discussion.

I am also very disturbed that, as with most censorship, 'obviously bad' things such as terrorism/etc are co-mingled with 'is adult' as a negative check.

It seems reasonable for Twitch to have validated 'safe for minors' areas where names are filtered. Generic areas, where things are in the gray area and unchecked. Adult Only areas, where swears, profanity, maybe even some of the hateful things are allowed. Informed consumer choice.
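To make the spam-scoring idea concrete, here is a small Python sketch. Every term, weight, and threshold below is invented for illustration; the point is only the three-way outcome (block, review, allow) instead of a single yes/no.

```python
# Hypothetical weights and thresholds, purely for illustration.
WEIGHTS = {"exampleslur": 10, "butt": 3, "420": 2}
BLOCK_AT = 10   # hard-reject at or above this score
REVIEW_AT = 3   # queue for a human at or above this score


def score(username: str) -> int:
    """Sum the weights of every listed term found in the cleaned username."""
    cleaned = username.replace("_", "").lower()
    return sum(weight for term, weight in WEIGHTS.items() if term in cleaned)


def decide(username: str) -> str:
    """Map a score onto one of three outcomes."""
    total = score(username)
    if total >= BLOCK_AT:
        return "block"
    if total >= REVIEW_AT:
        return "review"
    return "allow"
```

Under these made-up numbers, the "ivbuttherealiv" name mentioned elsewhere in the thread would land in the review queue rather than being banned outright.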


It also includes a couple of Italian swearwords in the function `is_blasphemy`


And exactly one German one in 'is_profanity'


There are more German ones: the last one in is_child_exploitation and nazi stuff in is_hateful.


Amusing to see the is_hateful query which used to just be an unreadable mess of regex for validation. Clearly the infra has grown/changed over 10 years, and this is obviously neater, but boy does it LOOK a lot bigger now (and I'm sure they've added yet more new words that try to hurt people...)


> is_marijuana

Looks like Twitch doesn't like weed.


'Angry parents say gaming site Twitch promotes drug use, slow news day story at 10'


It's okay. Instead of watching a streamer called 420blazeit, they can watch Amouranth sucking on a microphone.


Reminds me of the black words service at apple that checks for explicit or non printer friendly stuff for iPhone engraving. It was hilarious looking at the advanced linguistic engine they developed to filter out asshole and 30+ variations of it with snarky comments added by devs in the past.


The movie Gran Torino came out when I was twenty, and in seeing it, I heard many racial slurs uttered in context for the first time. I was, of course, familiar with the "main" ones, but the one in particular that I remember was "spook". I remember it vividly because, upon hearing it, I had an immediate realization that I had heard it before when it was said near the end of Back To The Future (the scene where Marvin Berry and the Starlighters chase off Biff's gang).

In my youthful innocence I had assumed the characters in the movie were trading silly, made-up names (e.g. "Who are you calling "sparky", Popeye?"), not racial epithets.

The fact that this list needs to exist makes me sad, but I am glad technology can assist on some level with the issue.


Poor Bilbo and Frodo - they'll never be allowed to use Twitch:

CREATE OR REPLACE FUNCTION is_hateful (VARCHAR) RETURNS BOOLEAN STABLE AS $$ SELECT ... OR replace($1,'_','') LIKE '%aggin%'


Or designer of the first commercial microprocessor, Intel's 4004 https://en.wikipedia.org/wiki/Federico_Faggin


I feel sad for Mike Hunt, who can't use his name for a username :)


This is pretty much the same problem as Email spam. There needs to be a service/collaborative project to filter these, instead of each app hacking up an ad-hoc way of doing it.


Why is all of this implemented in SQL? Wouldn't it be better to do it in code with dedicated methods to filter stuff out? IMO logic inside of SQL queries just adds unnecessary complexity, implementing this in code would've been maintainable and testable.


Boss: "Hey Bob! People started spamming one of our boards. We are currently busy doing other things and cannot deploy a new client. Can you make a filter with these 3 words and deploy it ASAP?"

Bob: "Sure no problem if it's only temporary"

--few months later

Chief architect:

"People are spamming more and more. We should design a new system for these 234 new bad words. I need a team of 7 (two backend guys, 5 frontend guys) and 4 weeks. It will also require a minor rewrite of a few external components."

Boss: "Geez, we're in the middle of a sprint right now. Bob, can you add these 234 words to the existing filter? Make sure it's in production before lunch, thanks." (checks watch) "I have to go now, meeting with a customer, bye."


These are FUNCTIONs, so it is code like any other. Even SQL is code (declarative).

The reason is: This can be updated very easily and out-of-band of other deploys of the main application code.

To make this work it has to be super agile to update. Probably this version we are looking at is an old version that happened to be put in git. The functions in prod probably have several additions since.


We tried to convince Twitch for years that their filters were garbage and they should use CleanSpeak. They kept insisting their engineering team had written the best filter in the world.

Sometimes it’s just not worth the effort to try and help people solve these problems.


There should be an open source or even Official RFC(tm) for this. The use case is very generic.


Interesting to see this. These are like the ten commandments, primitive yes/no rules that reflect the people who came up with them. A world of nuance is missing obviously, not to mention gaps that can be gamed. How do you express ethics in code?


Potential solution for this:

- Have people choose any username they want

- Have an “after the fact” human review system on usernames

- If the username is inappropriate, change it to “smallsausage[0-9]+” without the option of reverting it back or requesting a new user on the same e-mail address.


Is there a blog or something where someone is going through the dump and summarizing?


Your best bet atm is to just look through reddit/hn comments/posts people make as they find stuff. The leak's too big for one person/team to quickly find all spicy stuff.




So, it only works for English and English-related profanity? Interesting.


Ah, so that's the Word Filter that Twitch directly mentioned as the majority of their effort to fight hate comments (as cited in the main thread on the leak).

Also, Lisa Pedro is not welcome there.


Let's not jump to conclusions. We don't know where this specific filter was used, is it an automated blocker or just flagging for moderation, is it the only layer, or is it even the currently used one.


A clbuttic [0] solution to the problem.

[0] https://en.wiktionary.org/wiki/clbuttic



Sucks to be you, if your name is "Kyle", and you just wanted people to watch you play.

Looks like "SeeKylePlay" would trigger an insta-ban for "Sieg-Heil". F.


They included 'mike hawk' (say it out loud...)? If that's on the list, there are a few thousand other auditory joke names that should have been on here.


This SQL file could be an entire AI-based, ML-driven startup to stop hateful / offensive language. Just need a good name like Klean or, NoH8 or something.


I love the fact that blasphemy is only an Italian problem


So... This ignores lookalike unicode from other languages? Or does SQL know that, for example, `c` (ascii c) looks like `с` (cyrillic s)?


It only allows English characters to begin with. It doesn't even do Latin Extended characters like čćšđ, let alone non-Latin.
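A quick Python illustration of both points, as a sketch. The four-entry confusables table is a made-up stand-in (real tables are far larger), and the helper names are hypothetical.

```python
import unicodedata

ascii_c = "c"            # U+0063 LATIN SMALL LETTER C
cyrillic_es = "\u0441"   # U+0441 CYRILLIC SMALL LETTER ES, looks like "c"

# A tiny, hypothetical confusables table mapping Cyrillic lookalikes to Latin.
CONFUSABLES = {"\u0441": "c", "\u0430": "a", "\u0435": "e", "\u043e": "o"}


def fold_lookalikes(text: str) -> str:
    """Normalize compatibility forms, then map known lookalikes to ASCII."""
    # NFKC handles compatibility forms (e.g. fullwidth letters) but does
    # not map Cyrillic letters onto Latin ones, hence the explicit table.
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(CONFUSABLES.get(ch, ch) for ch in normalized)


def is_ascii_only(text: str) -> bool:
    """The blunt alternative: refuse anything outside ASCII up front."""
    return text.isascii()
```

The two characters compare unequal, so a plain string match treats them as different letters. Rejecting non-ASCII input, as the comment above describes, sidesteps the problem at the cost of excluding names like čćšđ.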


Whoever leaked this is going to get buttbuttinated for sure.

I have a hard time believing this silly mess is an actual component of anything.


It's surprising how good solid best-effort parsing is by the human visual system.

sukciD suggiB was here..


Prejudice.

Just six seemingly harmless letters arranged in a way to form a word with more power than the pieces of metal which are forged to make swords.

Just a couple of G's, an R and an E, an I and an N....

https://youtube.com/watch?v=KVN_0qvuhhw


Was surprised it's SQL but guess that makes sense/faster to be done there


Love how there is nothing in there to prevent white cracker from being said.


Why not just convert numbers like 1 to i or l then check with a manually created bad word list?

Would regex really be much faster than checking against a list of 1000 or more bad words?

Also, a bad word list can easily be updated by moderators as well. I really can’t understand the logic behind using so much regex.
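Roughly what that could look like in Python, as a sketch. The substitution table and the two placeholder words are assumptions; since "1" is ambiguous, it expands to both "i" and "l".

```python
from itertools import product

# Hypothetical substitution table; "1" is ambiguous, so it maps to both.
LEET = {"0": "o", "1": "il", "3": "e", "4": "a", "5": "s", "7": "t"}

# Placeholder list standing in for a moderator-maintained one.
BAD_WORDS = {"idiot", "loser"}


def candidates(username: str):
    """Yield every plain-letter reading of a username."""
    cleaned = username.replace("_", "").lower()
    # Each position offers one or more letters; characters not in the
    # table stand for themselves.
    options = [LEET.get(ch, ch) for ch in cleaned]
    for combo in product(*options):
        yield "".join(combo)


def contains_bad_word(username: str) -> bool:
    """True if any reading of the username contains a listed word."""
    return any(
        word in reading
        for reading in candidates(username)
        for word in BAD_WORDS
    )
```

One caveat: each ambiguous character doubles the number of readings, so a name full of 1s gets expensive unless the expansion is capped.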


A bad words list is a regex. ;-)

But note: for the most part this isn't using regexes, and to the extent it does, it seems largely intended to make the maintainers' lives easier by avoiding having to represent (and maintain) all the permutations they are trying to match for.

What's sad though is that they're doing many, many passes through the pattern matcher, rather than just building a single big DFA from the whole list of patterns they want to match, which gets traversed in one pass.
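As an illustration of the one-pass idea, here is a compact Aho-Corasick sketch in Python. It handles literal words only (no wildcards like `%george%floyd%`, no alternations), so it is a simplification of what a full pattern-to-DFA build would need; the function names are made up.

```python
from collections import deque


def build_automaton(words):
    """Build Aho-Corasick tables: goto transitions, failure links, outputs."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # fail[state] -> fallback state
    out = [set()]    # out[state] -> words recognized on reaching this state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    # Breadth-first pass to fill in failure links, shallowest states first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Scan text once, returning the set of words found as substrings."""
    goto, fail, out = automaton
    state, found = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found
```

The text is read left to right exactly once, however many words are in the list, instead of once per pattern.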


Could Postgres convert the function into a DFA using the JIT optimizer (based on LLVM)? That might delve into sufficiently smart compiler territory but recognizing a bunch of OR'ed string matches seems on the easier end of optimization passes.


Yup, that's definitely sufficiently smart compiler territory. If you want to write an optimization pass that handles that, go for it, but you won't find one already there.


RegEx => DFA conversion can lead to exponential growth in number of states... don't know how big it'd be in this case.


Why the DB? Why not the client itself? How much data was wasted going back and forth between the client and the backend? How many CPU cycles were spent on that poorly optimized SQL statement?


%nekker%

I pity those who chose that username based on The Witcher's Nekkers.


I really wish we could get away from enforcing this type of crap


I had to google "mike hawk" before I got it.


It would have been better to have the data source in a JSON/CSV format, like an array of regular expressions. SQL is not portable across different DBs.


[flagged]


You are so hungry to start something here, you've posted this exact same comment 3 times. Wouldn't you have more fun on 4chan?


Three different threads, all relevant. 4chan is where this was leaked/found out.


The flagged comment:

Actually in Twitch's codebase:

    OR replace($1,'_','') SIMILAR TO '%(hate|kill|keel|hang|burn|gasthe)%(black|bl4ck|black|jew|trans|gay|african|afrikan|minorit|asian|nig|n1g)%'

Twitch: "Hatred against Blacks, Jews, Asians, trans? BANNED! Hatred against Whites? I'LL ALLOW IT!"


these are so meme-able


Path in the leak is: safety-ml\offensive-usernames\data_pull\sql\bad.sql

Highlights:

    CREATE OR REPLACE FUNCTION is_tragedy (VARCHAR) RETURNS BOOLEAN STABLE AS $$
     SELECT replace($1,'_','') LIKE '%george%floyd%'
    $$ LANGUAGE SQL;

    CREATE OR REPLACE FUNCTION is_derogatory (VARCHAR) RETURNS BOOLEAN STABLE AS $$
     SELECT replace($1,'_','') LIKE '%retard%'
    $$ LANGUAGE SQL;

Now we know we can make as many goergefloyd accounts as we want! *devious grin* *chuckles to self*


safety-ml - I assume this is some attempt at training a machine learning algorithm to find bad names.


This is what investors and the public buy as AI.

A hand coded if statement


More accurate than actual AI under-trained on insufficient data.



