The first paragraph is written to imply they have been filtered purposely:
> has been extensively ‘filtered’ to remove black and Hispanic authors, as well as material related to gay and lesbian identities
But it is later explained that this is an unintended second-order effect of trying to remove offensive content from the corpus. I'm not trying to justify the outcome (it's an issue regardless of intent); I just didn't think there was any need to strongly imply it was intended.
It was intentional at least insofar as this kind of bias is a known problem with Google's machine learning processes, and no substantial effort was made to avoid it.
Running over a pedestrian might be a second-order effect of driving a car, except that driving a car includes applying the brakes to avoid it.
Which Black folk use colloquially, either to imply that the addressee needs more humility or greater self-improvement on the subject at hand, or to reinforce camaraderie.
Sounds to me like a "damned if you do, damned if you don't" situation: a basically impossible task without making ML aware of cultural intricacies, trigger warnings, age-appropriate approaches, and parental controls all at the same time.
Like, it's not as if this filtering was done for the purpose of silencing anyone. Google (among others) really learned the hard way not to feed smut to ML models, as it _will_ get regurgitated, and is always a possible PR disaster in the making.
This is great! It has been a long time since I learned new dirty words. The inclusion of misspelled words and popular culture references is also intriguing.
Just a note on something I do have personal experience with:
> The authors also observed that the text of many patents are initially obtained via imperfect examples of Optical Character Recognition (OCR), with their accompanying errors in English possibly passed through to the C4 data with little or no annotation that would distinguish it from acceptable English.
While patent offices do release PDFs of all their patent docs (and it isn't just patents; it's all the back-and-forth between the examiners & the applicant, too), a huge percentage are images of paper documents. You can always tell just by double-clicking on a word -- if the word doesn't highlight, it's an image.
OCR output from these things is generally terrible. There is no way it should ever be input to any ML model.
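To make the "double-click test" concrete, here's a rough sketch of the same check in code. It assumes PyMuPDF (any PDF text-extraction library would do) and a hypothetical file name; the idea is simply that a page with no extractable text layer is almost certainly a scan whose text would have to come from OCR:

```python
# Heuristic version of the "double-click test": if a PDF page yields
# (almost) no extractable text, it's probably an image of a paper document,
# and any text for it would have to come from OCR.
import fitz  # PyMuPDF

def pages_without_text_layer(path, min_chars=20):
    """Return 1-based page numbers that look like image-only scans."""
    doc = fitz.open(path)
    suspect = []
    for i, page in enumerate(doc):
        text = page.get_text("text").strip()  # the page's text layer, if any
        if len(text) < min_chars:
            suspect.append(i + 1)
    doc.close()
    return suspect

if __name__ == "__main__":
    # Hypothetical file name, purely for illustration.
    print(pages_without_text_layer("patent_file_wrapper.pdf"))
```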
‘Our examination of the excluded data suggests that documents associated with Black and Hispanic authors and documents mentioning sexual orientations are significantly more likely to be excluded by C4.EN’s blocklist filtering, and that many excluded documents contained non-offensive or non-sexual content (e.g., legislative discussions of same-sex marriage, scientific and medical content).’
>> The greatest human rights issue of our time is LGBTQ voices being heard.
Maybe in the Western world, but we know that there are places where women have practically no voice (or less of one), or places like NK where very few people have a voice... there is also forced labor in China and Africa. I highly doubt the greatest human rights issue of our time is LGBTQ voices not being heard. That could only be said by someone who thinks less about other people in need. Minorities are always "heard" less, but let's not make this the greatest human rights issue. People are being beheaded and people are dying of starvation right now. They've been filtered out by the whole world.
> If I could I would hook you up with your eyes held open Clockwork Orange style and make you watch Blues Clues gender parade on repeat until you understand the pain of trans oppression.
You know, statements like this don't help a cause; they only end up alienating the people (casual third-party observers, for example) the advocate claims to want on their side. Statements like this hinder a cause and are self-defeating, as they cause the advocate to inadvertently become part of the problem they're trying to resolve.
I'm not sure I buy this interpretation. As the article says, these are "identity mentions", not definitive classifications of the author's identity. Don't these results just indicate that some identity labels are more likely to be used in offensive ways?
‘Some filters are relatively straightforward, such as removing Lorem ipsum placeholder text. However, we find that another filter which removes documents that contain a token from a banned word list, disproportionately removes documents in dialects of English associated with minority identities (e.g., text in African American English, text discussing LGBTQ+ identities).’
A word so lovely that we really really don't want to risk putting it in our report...
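For anyone who hasn't read the paper: the filter in question is a document-level blocklist. A rough sketch of that rule (not the actual C4 code; the list entries are harmless placeholders) makes it clear why a single flagged token removes an entire document, reclaimed or clinical usage included:

```python
# Rough sketch of the document-level blocklist rule the paper describes:
# if ANY token in a document matches the banned-word list, the whole
# document is dropped from the corpus.
import re

BANNED = {"badword1", "badword2"}  # placeholder entries, not the real list

def keep_document(text, banned=BANNED):
    """Return True if the document survives the blocklist filter."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return not any(tok in banned for tok in tokens)

docs = [
    "Legislative discussion of same-sex marriage and adoption rights.",
    "Song lyrics that quote badword1 exactly once.",
]
kept = [d for d in docs if keep_document(d)]  # the second doc is dropped entirely
print(kept)
```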
Yeah, I have similar concerns about the different dialects thing. It's true that "the meaning of seemingly 'bad' words heavily depends on the social context", but most realistic systems are gonna be deployed in multiple social contexts and ideally shouldn't be offensive in any of them.
It's not just that there are different contexts; it's that for particular identity groups, the "rules" also vary enormously over time (and between contexts).
For most of the '90s, saying "black" was "wrong" in favor of "African American." Now we've gone full circle, and the same people who made a sour face at "black" 25 years ago are capitalizing it.
That's probably the most benign, easy case. Of course ML can't keep up with loaded terms and slurs; most people can hardly keep track of it all.
Or... it could mean that the dominant culture defaults to describing identity labels as offensive. The labels themselves would only be offensive if someone deems them so. Remember when "gay" was a slur, but now it's an acceptable term for homosexual? That is, unless you're an edgy 13yo boy on the internet and think calling something 'gay' is funny.
Again, this data doesn't show that the labels are offensive, just that they're more likely to be used in text which was classified as offensive. If there are lots of 13-year-old boys on the Internet using "gay" as an insult, filtering out those insults from the corpus (and it's gotta be correct to filter them) could fully explain the high PMI for the word "gay".
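For reference, the PMI here is pointwise mutual information between a term and the "removed by the filter" label. A quick sketch of how you'd compute it from co-occurrence counts (the numbers below are made up for illustration, not figures from the paper):

```python
# PMI(w, removed) = log( P(w, removed) / (P(w) * P(removed)) )
# A value well above zero means the word shows up in removed documents
# far more often than chance would predict.
import math

def pmi(n_word_and_removed, n_word, n_removed, n_total):
    p_joint = n_word_and_removed / n_total
    p_word = n_word / n_total
    p_removed = n_removed / n_total
    return math.log(p_joint / (p_word * p_removed))

# Hypothetical counts, purely for illustration:
print(pmi(n_word_and_removed=400, n_word=1_000, n_removed=50_000, n_total=1_000_000))
```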
Yes, and determining whether a text is offensive is itself very socially and culturally determined. Who determines the offensiveness of the context, and how? To shift gears a bit, how many rap song lyrics were filtered out of the corpus, for example? Does use of the n-word make an entire text de facto offensive? What about references to the common name of the moth Lymantria dispar dispar? What about the football team from Washington, D.C.?
A lot of this just feels like a repeat of the Net Nanny internet filter days, when keywords were idiotically filtered out.
These are all important questions, but I don't think the article meaningfully engages with any of them beyond noting their existence. A lot of people are in a position where they have to judge whether a text is offensive; if your system autofills racial slurs, and you try to explain that it's not a problem because in some contexts the slurs have been reclaimed, you're gonna get in serious trouble.