
A.I. Doesn't Get Black Twitter - jnordwick
https://www.inverse.com/article/21316-a-i-doesn-t-get-black-twitter-yet
======
panarky
This is an interesting ethical challenge.

If the training corpus for a machine learning model contains stereotypes and
biases, then the output of the model will reflect those prejudices.

When that model is used in large-scale applications, it will not just repeat
those biases, it will amplify them. It propagates biased language, which feeds
back into the model in a self-reinforcing vicious cycle of increasing bias.

Should we attempt to detect prejudices in machine learning models and actively
counter them, creating a self-reinforcing virtuous cycle of balance and
tolerance?
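
That feedback loop can be sketched with a toy simulation (the `retrain` helper, the amplification factor, and all numbers are invented for illustration, not taken from any real system):

```python
# Toy model of the feedback loop: a model trained on skewed text
# over-produces the majority pattern, and its output re-enters the
# training corpus at the next retraining cycle.
def retrain(bias_fraction, amplification=1.05, generations=5):
    """Each generation, the skew grows by `amplification`x (capped at 1.0)
    because the model's own biased output becomes training data."""
    history = [bias_fraction]
    for _ in range(generations):
        history.append(min(1.0, history[-1] * amplification))
    return history

# A 60% initial skew drifts upward with every retraining cycle.
print(retrain(0.60))
```

Countering the bias would amount to applying a deflating factor at each cycle instead, which is exactly the "virtuous cycle" question.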

----

As an aside, for an article about language, this one has a lot of typos.
Doesn't anyone proofread before posting?

~~~
bmmayer1
Especially as it relates to this: "This means that blogs or websites that
employ African-American language could actually be pushed down in search
results because of Google’s language processing."

Isn't the ideal system one that assumes nothing about the content it's
analyzing, and instead bases its model on behavioral patterns? In the case of
Google ranking, either a site has traffic/low bounce/backlinks/social cred or
it doesn't -- why is Google attempting to 'read' and analyze the content
itself? (Other than to serve ads, of course.)

~~~
panarky
> Isn't the ideal system to not assume anything about the content it's
> analyzing ...

Because Google no longer just searches for literal strings of text, ranked by
links.

They now infer your meaning and find results that match your intent -- even if
the exact search string doesn't exist in the search results.

For more on this, read about Hinton's work on "thought vectors" [1].

So if Google can't determine the meaning of African American language as well
as it can standard English, then that content is likely to be less visible in
search results.

[1] [https://www.theguardian.com/science/2015/may/21/google-a-
ste...](https://www.theguardian.com/science/2015/may/21/google-a-step-closer-
to-developing-machines-with-human-like-intelligence)
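
A toy sketch of how vector-based matching works (the words and vectors below are invented for illustration; real systems learn embeddings from massive corpora, which is exactly where under-represented dialects lose out):

```python
from math import sqrt

# Made-up embedding vectors. In a real system these are learned, so a
# dialect word that is rare in the training corpus gets a poor vector.
embeddings = {
    "finna":    [0.9, 0.1, 0.2],   # AAVE: "fixing to" / "about to"
    "gonna":    [0.8, 0.2, 0.3],
    "going-to": [0.7, 0.3, 0.3],
    "banana":   [0.0, 0.9, 0.1],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# A query matches by vector similarity, not literal strings -- so results
# can match intent even when the exact search string never appears.
query = embeddings["finna"]
ranked = sorted(embeddings, key=lambda w: cosine(embeddings[w], query),
                reverse=True)
print(ranked)  # near-synonyms rank above the unrelated word
```

If "finna" never got a good vector in training, nothing in this pipeline would place it near "gonna", and content using it becomes less findable.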

~~~
ams6110
I can't say I would think that Google does a particularly good job at
interpreting the intent of my search queries. I often need to reformulate them
to get what I'm after. But I doubt that my searches are similar to those of
the general public so that is understandable.

~~~
visarga
I get the same feeling - that Google is often missing the meaning and giving
me unrelated keyword matches.

~~~
alphonsegaston
It's even worse than that - it explicitly ignores quoted keywords in many
cases. The only intent they care about is what guides you towards their
interests.

~~~
rasz_pl
My favorite is from today: searching for "comparison prohibited benchmarks",
Google decided to give me results for "comparison ALLOWED benchmarks".

------
aorth
I bet it doesn't get Scottish Twitter either. I saw this the other day and
thought it was pretty funny:

[https://mobile.twitter.com/MarkHamiIl/status/778141129564905...](https://mobile.twitter.com/MarkHamiIl/status/778141129564905472)

~~~
pjc50
Obligatory Burnistoun:
[https://www.youtube.com/watch?v=sAz_UvnUeuU](https://www.youtube.com/watch?v=sAz_UvnUeuU)

I thought most people would be able to get written Scots dialect, but I did
struggle with Walter Scott's Rob Roy where one of the characters speaks it
phonetically.

There's also a guy doing a newspaper column in Scots in The National, causing
quite a bit of controversy.

Spoken language is another matter. I've had to _interpret_ between an
Ulsterman and an Afrikaner, both of whom were nominally speaking English as a
first language.

------
fallous
This, I suspect but do not know, may be a failing of the current direction of
NL AI solutions. Selecting and using datasets to train NLP systems is hard,
especially if you truly treat language as a pattern to be learned rather than
as an enhancement of an underlying description of the particular language
you're trying to learn.

Are we headed toward a time when deeply trained networks seem to work in an
acceptably large number of cases, but how and why are unknown and susceptible
to fatal flaws that rarely but catastrophically reveal themselves? I suspect
that is the case with the current path of machine learning, but I hope I am
wrong. It seems to me that the inevitable results of certain efforts in
machine learning are magic boxes that "work" but that no one understands how
or why. That smacks of medicine before the germ theory of disease, when
certain efforts at sanitation, motivated by a belief in "ill humors," did
improve health but were based on utterly false underlying theories -- and
those successes tended to reinforce other false solutions that did not improve
health and were in fact harmful.

~~~
earljwagner
Yes, there's growing awareness of this problem.

This blog post summarizes the paper "Machine Learning: The High-Interest
Credit Card of Technical Debt": [https://blog.acolyer.org/2016/02/29/machine-
learning-the-hig...](https://blog.acolyer.org/2016/02/29/machine-learning-the-
high-interest-credit-card-of-technical-debt/)

------
Animats
The article refers to another article about "Black Twitter"[1], but the
examples are all about hashtags. It doesn't seem to be a dialect; it's about
subject matter of interest to the community.

[1] [http://www.theatlantic.com/technology/archive/2015/04/the-
tr...](http://www.theatlantic.com/technology/archive/2015/04/the-truth-about-
black-twitter/390120/)

------
gcr
For context, Olga Russakovsky is the lead author behind the ImageNet Large
Scale Visual Recognition Challenge, one of the most popular computer vision
benchmarks in the world. She has expertise dealing with issues of dataset
bias.

A large number of ImageNet classes are dogs and birds, so most of the
representational power in CNNs trained on this set is devoted to
distinguishing between similar-looking dog breeds and bird species.
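
A toy sketch of checking a label set for that kind of skew (the labels and counts are invented for illustration, not the actual ImageNet distribution):

```python
from collections import Counter

# Invented example labels: a dataset dominated by fine-grained
# dog-breed and bird-species categories.
labels = (
    ["beagle", "basset", "borzoi", "whippet", "saluki"] * 40   # dog breeds
    + ["robin", "jay", "magpie", "chickadee"] * 30             # bird species
    + ["laptop", "teapot", "bridge", "volcano"] * 10           # everything else
)

counts = Counter(labels)
fine_grained = {"beagle", "basset", "borzoi", "whippet", "saluki",
                "robin", "jay", "magpie", "chickadee"}
share = sum(counts[c] for c in fine_grained) / len(labels)
print(f"fine-grained share of examples: {share:.0%}")
```

A model trained on a distribution like this spends most of its capacity on breed/species distinctions, whether or not those matter downstream -- the same structural problem as training NLP on one newspaper's English.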

------
wRastel27
Is this a result of the AI being written by a bunch of non-black people? It
seems to me that this problem is more of a reflection of the people working on
the code.

~~~
a3n
There may be some of that, but I think it has more to do with AI still being
young; it's easier to work with less ambiguous text for learning. That, and
everyday speech moves much faster than formal writing. Now that my son has
gone off to college and is less accessible as a resource, I expect to almost
totally lose touch with the speech of "these kids today."

~~~
gugagore
Do you think written Standard English is less ambiguous than other varieties
of English, or than written language generally? Because I don't see why that
would be the case.

~~~
Dylan16807
>Do you think written Standard English is less ambiguous than any other
English

Definitely. Things are clearer when you write formally and strictly adhere to
grammatical rules. When I'm speaking casually I omit words and (ab)use
punctuation quite differently, and while that's perfectly acceptable, it is
more complex and ambiguous to parse.

As far as I'm aware the same pattern exists in a lot of languages and
dialects, and it's possible to define formal AAVE too.

~~~
Dylan16807
To your comment that was detected as a duplicate:

>I don't think there is anything "formal" about Standard English.

"Standard" English as she is spoke? No. But a3n was specifically talking about
the formal version often called "Standard written English". It is very
prescriptive and less flexible, which makes it easier to parse.

> Can you give an example of what you're talking about? Which grammatical
> rules do you have in mind?

Let's start with "every rule involving punctuation".

~~~
gugagore
I'm not trying to be dense, but I still don't see it. Punctuation helps with
identifying the boundaries of clauses, whether they're dependent or not, and
whether a sentence is a question. Is that what NLP struggles with? I don't
think so. One of the big unsolved problems is resolving the referents of
pronouns (and other proforms, like "I will [do so] too"). Style guides and
prescriptivism generally do not help with that.

Another kind is parse ambiguity, as in "I saw the girl with the telescope."
What does "formal" English say about that?
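
That PP-attachment ambiguity can be made concrete with a toy CYK parser that counts distinct parses (the grammar is a minimal invention covering just this sentence, not a real NLP pipeline):

```python
from collections import defaultdict

# Tiny grammar in Chomsky normal form: word -> tags, plus A -> B C rules.
lexicon = {
    "I": {"NP"}, "saw": {"V"}, "the": {"Det"},
    "girl": {"N"}, "telescope": {"N"}, "with": {"P"},
}
binary = [  # A -> B C
    ("S", "NP", "VP"), ("VP", "V", "NP"), ("VP", "VP", "PP"),
    ("NP", "NP", "PP"), ("NP", "Det", "N"), ("PP", "P", "NP"),
]

def count_parses(words, goal="S"):
    n = len(words)
    # chart[i][j][A] = number of ways nonterminal A derives words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for tag in lexicon[w]:
            chart[i][i + 1][tag] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for a, b, c in binary:
                    chart[i][j][a] += chart[i][k][b] * chart[k][j][c]
    return chart[0][n][goal]

# Two parses: the PP attaches to the verb ("saw with a telescope")
# or to the noun ("the girl who has a telescope").
print(count_parses("I saw the girl with the telescope".split()))  # → 2
```

No amount of formality removes this ambiguity; resolving it requires world knowledge, which is exactly where statistical NLP comes in.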

~~~
a3n
That's where AI comes in. Unless there's a human at the ready.

------
eplanit
This reminds me of the Ebonics controversy of the 1990s[1].

[1]
[https://en.wikipedia.org/wiki/Oakland_Ebonics_controversy](https://en.wikipedia.org/wiki/Oakland_Ebonics_controversy)

------
Kapura
The best part about black twitter is how dynamic and innovative it is. There
is a certain playfulness with language which is frowned upon in many academic
settings & especially by linguistic prescriptivists, but healthy languages
change over time.

And "more diverse datasets" isn't enough. I get that you need to train an AI
on a set of old data, but any truly good NLP system would be able to adapt to
new terms and phrases as they're invented. I'm not saying it's an easy problem
to solve, but many people I've spoken to don't even seem to be aware that
language recognition is necessarily a moving target.

------
rasz_pl
Is this really advocating AI blackface? Will work just as well as this
[https://www.youtube.com/watch?v=oCWqlDK52c4](https://www.youtube.com/watch?v=oCWqlDK52c4)

------
yepperz
It would be helpful if the article would cite an example of what it is
purporting to report. Otherwise it's pointless.

~~~
savanaly
This should help you get the gist of it?
[https://www.reddit.com/r/blackpeopletwitter/](https://www.reddit.com/r/blackpeopletwitter/)

------
xiaoma
"Approximately 8 of the 319 million people in the United States read the Wall
Street Journal, about 2 percent of the population. If you look at the language
— standardized English — being fed into many natural language processing
units, it’s based on the language of that 2 percent. "

It's hard to take an article seriously when it opens with a logical fallacy.
Yes, the WSJ uses standard English and only 8 million people read (subscribe
to?) it. That does _not_ mean that only 8 million people in the US use so-
called standard English. There are also a lot of people who don't subscribe to
that particular publication but still speak or write in "standard" English and
that group is much larger than the group of WSJ readers.

~~~
mattdeboard
Nah. It'd be wrong in the way you suggest if they said it's the language of
only that 2%, I think.

It IS "the" language of those 8 million. Just not ONLY those 8 million.

~~~
lqdc13
Then why bring up the WSJ at all? What about all the other papers? And what
about all the other sources of training data for NLP research?

The WSJ dataset is not the only such dataset. NLP is not very good with
standard English yet and usually doesn't generalize from topic to topic.
Dialects and other languages -- especially those without formal rules -- will
come when we can deal with standard English.

~~~
gordonguthrie
All language has rules -- there is no language "without formal rules".

~~~
morgante
What are the "formal rules" of AAVE?

~~~
alphonsegaston
Here's a good overview. The reference section at the end has a pretty clear
delineation of these rules, as well as links to further reading.

[http://s3.amazonaws.com/academia.edu.documents/41737546/The_...](http://s3.amazonaws.com/academia.edu.documents/41737546/The_Grammar_of_Urban_African_American_Vernacular_English.pdf?AWSAccessKeyId=AKIAJ56TQJRTWSMTNPEA&Expires=1474774127&Signature=UPP6KchH%2BO1ShoMCrbawjXUesJM%3D&response-
content-
disposition=inline%3B%20filename%3DThe_Grammar_of_Urban_African_American_Ve.pdf)

~~~
morgante
That link says access denied.

~~~
alphonsegaston
Guess it went down. This one lacks the outlined breakdown at the end, but try
the pdf link here:

[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.516....](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.516.25)

EDIT:

If you want a really deep dive, you can also check out "African American
English: A Linguistic Introduction" from Cambridge University press:

www.cambridge.org/us/academic/subjects/languages-linguistics/sociolinguistics/african-american-english-linguistic-introduction?format=PB&isbn=9780521891387

~~~
morgante
Thanks for the link. I found the regional breakdowns interesting.

That being said, I think this research emphasizes that AAVE is anything but
_standardized_. That's not meant as a pejorative statement: it's just
acknowledging that, like most languages in history, AAVE has not gone through
a process of codification and standardization to formalize it.

~~~
alphonsegaston
Sure, but formal rules are not equivalent to a prescriptive grammar. AAVE has
formal, consistent rules, described by linguists. You can make mistakes in
AAVE just as in standard English (see: "African American Vernacular English Is
Not Standard English with Mistakes" [https://web.stanford.edu/~zwicky/aave-is-
not-se-with-mistake...](https://web.stanford.edu/~zwicky/aave-is-not-se-with-
mistakes.pdf) )

 _like most languages in history, AAVE has not gone through a process of
codification and standardization to formalize it_

I'm not sure what you're getting at - prescriptive grammars of the codified
form you're describing are the products of their political and economic
circumstances. There isn't some Hegelian trajectory of linguistic validity,
where all variants aspire towards legalism.

~~~
morgante
If not the process of standardization, what differentiates formal and informal
rules?

~~~
alphonsegaston
Formal rules, if present, can be derived equally well from observation and
description of usage, using linguistic analysis of the kind found in the
articles I linked above. Language, viewed in total, has shared elements and
patterns that supersede the narrow focus of prescriptive impositions. That's
why linguists can observe, for example, that the dropped copula exists in both
AAVE and Russian.

------
kriaq
Exactly why Google's filtering system will fail.

------
danharaj
There is nothing about a computer algorithm that renders it immune to the
power structures around it. The people who train these systems don't even
realize they are creating a biased model. Borrowing from postcolonial theory,
this is related to the concept of epistemic violence. The way we organize
human information, the way we categorize and evaluate viewpoints and ways of
understanding the world is a battleground for imperialism. Intent does not
matter. The algorithm has no intent, the researchers don't intend to create a
vehicle of oppression, and yet it happens anyway. They have created a
difference in economic utility between standard, hegemonic English, and
marginalized dialects of English which inevitably will have social
ramifications. If this study weren't conducted, would we have ever found out?

