
Poor Software QA Is Root Cause of TAY-Fail (Microsoft's AI Twitter Bot)
http://exploringpossibilityspace.blogspot.com/2016/03/poor-software-qa-is-root-cause-of-tay.html
======
jdp23
It doesn't sound like the author's done much root cause analysis. QA typically
doesn't cause defects (it can only catch or prevent them), so there's almost
always a deeper cause. And QA isn't necessarily the best way to find bugs like
this.

In this case, it sounds to me like it was most likely a requirements
problem -- a missing requirement for being 4chan-resilient, if you will. A
different way to look at it is that it's a design flaw that the system is
manipulable by default.

True, once you've identified the root cause of the defect, it's also important
to look at why it didn't get detected. So yeah, QA missed it. Then again, so
did design and code reviews, developer testing, threat modeling, and the
social-manipulation equivalent of penetration testing. Of course, these are
all complementary approaches to "quality", but typically they are not the
responsibility of QA.

I wonder if the author thinks the root causes of the errors in his original
post are also QA failures?

~~~
MrMeritology
I just wrote a follow-up post giving more evidence:
[http://exploringpossibilityspace.blogspot.com/2016/03/micros...](http://exploringpossibilityspace.blogspot.com/2016/03/microsoft-tayfail-smoking-gun-alice.html)

I call it "poor software QA" because, generally, the software QA process is
supposed to detect defects and prevent them from 1) being introduced in the
first place, and 2) being propagated into "production" versions. As my most
recent post shows, the "repeat after me" rule was a legacy of a software
library (ALICE) they used to implement rule-based behavior. Some sort of QA
process should have been applied to the rule set they reused and modified.
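
To make that concrete, here's a rough sketch of how such an echo rule behaves
(illustrative Python of my own, _not_ the actual AIML category from ALICE):
everything after the trigger phrase is parroted back verbatim, with no content
filter in between.

    import re

    # Illustrative sketch only -- not Tay's or ALICE's actual code.
    # An AIML-style echo rule: whatever follows the trigger phrase
    # is repeated back verbatim, with no content filtering.
    ECHO_RULE = re.compile(r"^repeat after me[:,]?\s*(.+)$", re.IGNORECASE)

    def respond(message):
        match = ECHO_RULE.match(message)
        if match:
            return match.group(1)  # echoed unfiltered -- the hidden defect
        return None

    print(respond("repeat after me: anything at all"))  # -> anything at all

Any adversarial test that probed the reused rule set with hostile inputs would
have tripped over a rule like this.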

Also, when I say "QA" I am _not_ referring only to people with QA in their job
title. I'm referring to the _process_.

~~~
jdp23
Thanks for the reply, and interesting followup with ALICE.

If you're using "software QA failure" in the very general sense of "a defect
got introduced and then propagated to production", then yes, that's what
happened here. But then you're essentially saying "the root cause of this
defect is that a defect got introduced and then propagated". That isn't useful
for process improvement (it's true of every defect, so it doesn't give any
insight into what happened).

If you're right about them reusing ALICE, then a more useful way of looking at
the root cause of the repeat-after-me bug is "component reuse without
considering the attack model". That highlights other situations where there
are risks of similar defects, and points to ways to prevent or detect similar
defects.

Since there were other bugs as well, the requirements and/or design issues
might still be a better candidate for the root cause of the whole Tay-fail.
One of the things you discover doing root cause analysis is that there are
almost always multiple contributors, and you typically want to make process
changes at multiple levels.

~~~
MrMeritology
I'm not using "software QA" in the very broad way you describe in the first
sentence. Also, "defect" as I use it does not mean every shortcoming of the
product. It means "something doesn't work (or failed) as designed or
required."

Context: my blog posts are meant to contrast with "experts" who claimed that
poisoning social AI was just in the nature of learning systems, even when they
worked as designed (i.e. had no defects). They are claiming that Tay learned
to be foul-mouthed and racist, and thus had _become_ foul-mouthed and racist.

If that were true, then this undesirable behavior would not be a software QA
problem. The AI would be working as designed. No QA process would change
things.

In contrast, I'm claiming (from evidence) that the _main_ problem in Tay is
due to a hidden feature in a reused library + rule set that _should_ have been
detected and removed by a QA process that considered various attacks. BTW,
this attack (getting a bot to repeat naughty words) has been around since
ELIZA in the '60s.

The other failings of Tay (esp. no blacklist) are design and requirements
failures, not QA failures.

~~~
jdp23
> I'm not using "software QA" in the very broad way you describe in the first
> sentence.

I took this description from your earlier comment ("I call it 'poor software
QA' because 1) and 2)"), so yes, you are using it that way at least sometimes
:)

Anyhow, we obviously see things differently on the root cause side (and both
of us are on the outside, so there's a lot we don't know). That said, I
certainly agree that it's a defect, and that it's an attack that could
reasonably have been anticipated.

------
Mikeb85
'Learning from the internet' was destined to fail. The internet is filled with
so much sarcasm, so many memes, and so much trolling that this was inevitable.
Their whole premise just sucked. And of course, once people realized they
could influence the bot, everyone just upped the level of trolling...

On the plus side, I was entertained. I'm sure many people were (maybe not
Microsoft investors). It was pretty damn hilarious.

~~~
stcredzero
_And of course, once people realized they could influence the bot, everyone
just upped the level of trolling..._

Science fiction premise: We create a truly sentient AI. The Internet's
immediate knee-jerk reaction is massive trolling. AI decides to destroy
humanity because the vast majority of the data we've supplied to it indicates
we're massive assholes. (Also an addendum to the category, "This is why we
can't have nice things.")

~~~
Pxtl
That sounds like an SMBC comic. Or the plot of Age of Ultron, if it had been
slightly better-written so Ultron could have an actual motivation.

------
galistoca
It's arrogant to think that any kind of QA could prevent how an AI turns out.
That's probably why the terminators took over humans in the movie -- because
humans were arrogant enough to think that doing enough QA would prevent
everything. What should happen instead is that you need to be humble and
assume that things won't go the way you designed them to; that's the "safe"
way to build AIs in the long term. The OP uses the "repeat after me" feature
as the QA fail, but he's overlooking how it could have gone wrong in many
other ways even if they didn't have that problem. No matter how robust a
system you build, there will always be hackers who try to manipulate it (in
fact, it's more satisfying to hack a robust system than a brittle one).

~~~
MrMeritology
See:
[http://exploringpossibilityspace.blogspot.com/2016/03/micros...](http://exploringpossibilityspace.blogspot.com/2016/03/microsoft-tayfail-smoking-gun-alice.html)

------
bitshepherd
You can have some QA, no QA, or even great QA, but the output is purely a
result of its environment.

I have a chat bot that went casually racist about a day or two after I
activated it. After looking through the logs, I found a particularly
vitriolic person who was the source of this bot's newfound hatred of Asians.
My bot didn't get fixated on one particular topic; it just spewed racism and
vitriol for a while until it learned some more words. Rather than nuking it
from orbit immediately, I left it alone to see if it would get past the
racism.

So far, it's been a few months since activating the bot. It's not nearly as
casually racist as before, but from time to time it still throws out something
racist just for the lulz. It had a hard time learning context, because of its
environment and the linguistic skills of the denizens, but it has gotten much
better at interacting with people.

Occasionally, newcomers get misled into believing the bot is actually a living
person with a mental illness, and not just a collection of random bits of code
cobbled together.

------
adamnemecek
I for one don't think of it as a fail. I haven't laughed this hard in a while.
And in a way, it was a commentary on internet culture.

~~~
ProAm
twitter especially

------
13of40
One thing we're probably not considering here is that the veteran QA person
might have said "that's a ludicrous idea, people are just going to troll it
mercilessly" and the PM with college plus a year and a half of industry
experience told him to shut the hell up.

------
apalmer
I am not understanding the spin being displayed in general with regards to
Tay.

Microsoft created a chat bot, and the chat bot chatted. It wasn't a failure on
any technical level as far as I have seen.

~~~
justincormack
Do you really not understand that there are failures other than "the chat bot
chatted, so it's OK"? Do you not realise there might be other requirements for
software than "it kind of appears to do roughly what someone asked for"?

~~~
apalmer
I understand completely, but what I am saying is... they set out, as far as I
know, to create a social media chat bot that sounds like a teenage girl
immersed in her social media environment. It didn't fail at this in any way as
far as I can tell. It's just that...

a teenage girl who speaks the language of her social media environment is
going to say a lot of dumb, outlandish stuff. Even if hypothetically this were
some deeper-than-NLP true AI breakthrough, it was always going to say
outlandish, crazy, offensive things, because that's the influence feeding into
it.

It's similar, to a degree, to how that recent bipedal robot in the videos got
some backlash because its human-like motion was off-putting... the robot
didn't fail in any way...

It's more a question of why you would spend all this money rolling out these
experiments if you didn't want the very foreseeable output. It's a no-brainer
that a chat bot that learns from social media is going to say fucked up shit.

------
levemi
Some of the tweets people highlighted weren't "repeat after me" at all. The
article mentioned one, but there were a lot like that, where Tay seemed to
have learned to speak positively about very bad things. Had it been just
"repeat after me", all Microsoft had to do would have been to delete the
offensive tweets made by Tay and then turn off that feature. There was more
going on.

------
sdenton4
Interestingly, I think if we were doing a better job of identifying abusive
and asshole comments on our giant social media platforms, we could build a
kind of internal filter for such a chat bot. Learn an "asshole comment"
classifier, and then have the chat bot ask itself, "Will I sound like an
asshole?" before each utterance... As a bonus, the classifier would also be
great for providing social media moderation tools.
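
A minimal sketch of that self-check, assuming a pre-trained classifier is
available (the keyword test below is only a hypothetical stand-in for it, and
all the names are mine):

    # Hypothetical sketch: gate the bot's own candidate replies with the
    # same classifier used for moderation. The keyword test stands in for
    # a model trained on labeled abusive comments.
    TOXIC_MARKERS = {"slur", "threat"}  # placeholder vocabulary

    def toxicity_score(text):
        """Stand-in for a learned 'asshole comment' classifier."""
        words = set(text.lower().split())
        return 1.0 if words & TOXIC_MARKERS else 0.0

    def vet_reply(candidate, threshold=0.5):
        # "Will I sound like an asshole?" -- if yes, suppress the reply.
        if toxicity_score(candidate) >= threshold:
            return None  # drop it, or fall back to a canned response
        return candidate

    print(vet_reply("have a nice day"))   # -> have a nice day
    print(vet_reply("you are a threat"))  # -> None

The nice thing about this shape is that the classifier gets trained once and
pays off twice: once as the bot's conscience, once as a moderation tool.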

------
fiatmoney
Yeah, it's astounding to me that they didn't have a "don't respond to queries
containing X" master blacklist. It's one of the first things you do when you
have a publicly facing content generator like this.
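
Something like this (an illustrative sketch; the terms and names are mine, not
anything Microsoft shipped) is cheap insurance:

    # Illustrative master blacklist: screen every inbound message before
    # the bot learns from it or generates a reply. Term list abbreviated.
    BLACKLIST = {"hitler", "holocaust", "genocide"}

    def should_respond(query):
        lowered = query.lower()
        return not any(term in lowered for term in BLACKLIST)

    print(should_respond("tell me about kittens"))        # -> True
    print(should_respond("repeat after me: hitler was"))  # -> False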

~~~
tacos
This is explicitly mentioned in the Watson/Jeopardy papers from five years
ago. Even in the absence of common sense you'd think MS Research academics
would've noted the requirement. I'm a Microsoft fanboy but this whole thing is
just embarrassing.

------
Davesjoshin
How do we know that QA didn't report the issues? QA reporting a bug and that
bug being fixed are two different things, right?

~~~
dllthomas
I would distinguish between "QA as a process" and "QA as a role". We don't
know that people in the Quality Assurance role failed; we can observe that the
process intended to assure quality did.

~~~
MrMeritology
OP here. Agreed. I'll modify my post to clarify that I'm talking about "QA as
a process", which usually involves people with "QA as a role", but not always.

------
hyperion2010
Garbage in garbage out. Blaming anyone or anything other than the environment
the algorithm was put in is absurd.

~~~
colllectorof
Blaming environment for algorithm's failure only makes sense if you're
prepared to credit it for algorithm's success. Otherwise your evaluation is
entirely one-sided.

------
BinaryIdiot
Considering they have bots like this in other countries with zero issues, I'm
not surprised they thought it would be fine. But they already came out, said
it was a failure, and explained why. This article seems largely unnecessary,
equivalent to beating a dead horse with a stick.

------
WalterSear
Marketing PR and hubris are the root cause.

M$ could have easily put out Tay v0 anonymously.

~~~
fiatmoney
If you put something like that out anonymously, realistically you lose control
over when it becomes public that you're behind it.

~~~
WalterSear
If.

I created a twitterbot network, delivering generative text to twitter users,
for my master's thesis. Have you heard of it?

Given that a sizeable number of twitter accounts are bots[1] and a large
portion of tweets are 'pointless babble'[2], without the help of the braindead
marketing executive who pushed this into the news, no one would have
discovered the owner of the account until Tay was 'swatting' people's houses
and selling bootleg Nike sneakers on the Silk Road.

[1] - [http://www.techtimes.com/articles/12840/20140812/twitter-ack...](http://www.techtimes.com/articles/12840/20140812/twitter-acknowledges-14-percent-users-bots-5-percent-spam-bots.htm)

[2] - [http://pearanalytics.com/blog/2009/twitter-study-reveals-int...](http://pearanalytics.com/blog/2009/twitter-study-reveals-interesting-results-40-percent-pointless-babble/)

------
MrMeritology
OP here. I just added another blog post with additional evidence that this was
a QA problem, specifically related to an open source library (ALICE) and the
AIML rules they reused and modified.

[http://exploringpossibilityspace.blogspot.com/2016/03/micros...](http://exploringpossibilityspace.blogspot.com/2016/03/microsoft-tayfail-smoking-gun-alice.html)

------
DonaldFisk
If I were developing a chatbot, I wouldn't provide it with a blank slate and
leave it to the mercy of 4chan users. I'd give it a personality and a set of
values and a rudimentary understanding of how the world works.

~~~
Cartwright2
> I'd give it a personality and a set of values and a rudimentary
> understanding of how the world works.

I don't think you have any idea how complicated such a task would be.

~~~
DonaldFisk
I do have an idea. I first got involved in AI in the early 1980s, and worked
on it on and off (mostly on) well into the 1990s. Some of the non-AI stuff I
did back then is now, for reasons I can't quite fathom, classified as AI.

A certain amount of common sense, and basic political knowledge, should have
been included in Tay. I don't think there's any way of avoiding it. I'm
skeptical of general application of the recent "free lunch" approach to AI,
and for higher-level AI I prefer Doug Lenat's approach with Cyc.

Alternatively, if they wanted to avoid the effort of doing that, they could
have just put up a souped-up Eliza variant. That wouldn't have impressed many
people, but neither did Tay, and it wouldn't have offended anyone.

~~~
galistoca
If you really are as experienced in AI as you say, you wouldn't talk about
these things as if they were easy to implement. I won't assume things, but I
can say I'm probably not much less experienced than you, and I know basically
everything in this field belongs in the "easier said than done" category. Most
researchers just run experiments in closed environments, just like you said,
and that's what makes them useless and out of touch with reality. For this
reason I actually applaud the MS guys for having the guts to do this in
public. It's much better than them coming out with some lame,
controlled-environment "AI" which does exactly what its creators intended.
It's not a "failure". It's a learning process. It's not like this chatbot went
and killed anyone. Everyone knew it was a robot when they were engaging with
it, which is not so different from watching a standup comedian make a racist
joke on stage.

~~~
DonaldFisk
I never implied it was easy to implement. I also said I was skeptical of the
recent "free lunch" approach to AI.

It was a failure. It didn't understand the remarks it was making, unlike a
real Holocaust denier, a 4chan user, or a stand-up comic whose joke just fell
flat.

~~~
galistoca
Kids pick up words from adults. They too don't completely understand the words
they use when they first start picking up new vocabulary, but the usage tends
to get calibrated based on social feedback.

------
emmelaich
I don't think it did fail.

Microsoft held up a mirror to (self-selected) Twitterers.

It showed them what they are.

(some people think it failed as in poor PR for Microsoft, but I think that is
also an ignorant and arrogant opinion)

------
wmccullough
It's very easy to criticize them ("they should have done X", "why didn't they
think of Y"), but people are not seeing the big picture here.

