
Alibaba neural network defeats human in global reading test - ClintEhrlich
http://www.zdnet.com/article/alibaba-neural-network-defeats-human-in-global-reading-test/
======
cs702
This is not quite human-level question-answering in the everyday sense of
those words. The ZDNet headline is too clickbaity for my taste.

The answer to every question in the test is a preexisting snippet of text, or
"span," from a corresponding reading passage shown to the model. The model has
only to select which span in the reading passage gives the best answer --
i.e., which sequence of words already in the text best answers the
question.[a]

Actual current results:

[https://rajpurkar.github.io/SQuAD-explorer/](https://rajpurkar.github.io/SQuAD-explorer/)

Paper describing the dataset and test:

[https://arxiv.org/abs/1606.05250](https://arxiv.org/abs/1606.05250)

[a] If this explanation isn't entirely clear to you, it might help to think of
the problem as a challenging classification task in which the number of
possible classes for each question is equal to the number of possible spans in
the corresponding reading passage.
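To make that framing concrete, here's a toy sketch (my own illustration, not the Alibaba system): treat every contiguous span of the passage as a candidate class and pick the highest-scoring one.

```python
passage = "ABC launched as a radio network on October 12, 1943."
question = "What kind of network was ABC when it first began?"

tokens = passage.split()

# Every contiguous span of the passage is one candidate "class".
candidates = [" ".join(tokens[i:j])
              for i in range(len(tokens))
              for j in range(i + 1, len(tokens) + 1)]

# A real system scores spans with a neural network; this trivial
# word-overlap score is only a stand-in to show the structure.
def score(span):
    q = set(question.lower().split())
    s = set(span.lower().split())
    return len(q & s) / (1 + len(s))

prediction = max(candidates, key=score)
# n tokens yield n*(n+1)/2 candidate spans: 55 here for 10 tokens.
```

A real model replaces the overlap score with a learned scoring function (typically predicting start and end positions), but the output space is the same: some span of the given passage, never free-form text.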

~~~
ClintEhrlich
Agreed, but it was the least clickbaity headline I saw about this result.

Compare: "ROBOTS CAN NOW READ BETTER THAN HUMANS, PUTTING MILLIONS OF JOBS AT RISK" [http://www.newsweek.com/robots-can-now-read-better-humans-pu...](http://www.newsweek.com/robots-can-now-read-better-humans-putting-millions-jobs-risk-781393)

~~~
cs702
Jeez...

Before you can blink, there will be MBA types working on PowerPoint proposals
with detailed cost-benefit analyses for using those new AI machines they've
heard can read better than human beings. Needless to say, the technology will
fall far short of expectations.

This is why there have been two AI winters already.

~~~
noobermin
I think it's incumbent on people like you to get the word out that ML isn't
going to put everyone's jobs at risk. Between this and self-driving cars,
local governments are beginning to weigh spending tax dollars on these
boondoggles instead of on proven modes of transit like public transport.

The futurist writers peddling this stuff need to take a moment to chill and
learn about the actual state of the underlying technology.

~~~
pishpash
It's not in their interest to chill and learn. It's in their interest to hype
and sell.

~~~
Eliezer
There are very few people in the whole ecosystem who take home a bigger
paycheck if they chill and learn. Earth gets what Earth pays for.

------
mark_l_watson
Great result. At my job I manage a machine learning team, so I'm pretty much
all-in on deep learning for solving practical problems.

That said, I think the path to 'real' AGI lies in some combination of DL,
probabilistic graph models, symbolic systems, and something we have not even
imagined yet. BTW, Judea Pearl just released a good paper on the limitations
of DL:
[https://arxiv.org/abs/1801.04016](https://arxiv.org/abs/1801.04016)

~~~
jacquesm
> That said, I think the path to 'real' AGI lies in some combination of DL,
> probabilistic graph models, symbolic systems, and something we have not even
> imagined yet.

Well, that really made it much clearer to me ;)

~~~
ianamartin
He just means all the things that we already know don't work + something that
we don't know about yet that will. (emphasis on the part we have no fucking
clue about).

------
Jach
It would be interesting to know how well some of the entries on the SQuAD
leaderboard do on the Winograd Schema Challenge
([https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS....](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html)).
Does anyone know if any of the systems have been tested on that as well?

------
cwyers
I am always annoyed at claims in supervised learning that a machine predictor
is better than humans. Humans obviously are the ones that scored the dataset
to begin with. If you read the paper, it goes on to say, in regards to human
evaluation:

> Mismatch occurs mostly due to inclusion/exclusion of non-essential phrases
> (e.g., monsoon trough versus movement of the monsoon trough) rather than
> fundamental disagreements about the answer.

I would call that ambiguity rather than error. In other words, there is more
than one possible answer to the questions under these criteria -- English
isn't a formal grammar where there's always one and only one answer. For
instance, here's one of the questions from the ABC Wikipedia page:

> What kind of network was ABC when it first began?

> Ground Truth Answers: "radio network"; "radio"; "radio network"

> Prediction: October 12, 1943

Because the second human said "radio" instead of "radio network," I believe
this counts as a human miss. But the answer is factually correct. Meanwhile,
the prediction from the Stanford logistic regression (not the more
sophisticated Alibaba model in the article, for which I don't think results
are published at this level of detail) is completely wrong. No human would
make that mistake. And yet the EM metric treats these as equally flawed
answers.
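For reference, SQuAD-style Exact Match works roughly like this (a sketch loosely modeled on the official evaluation script's normalization, not the exact code; human EM is, as I understand it, computed by holding one annotator's answer out against the rest):

```python
import re
import string

def normalize(text):
    # Official-style normalization: lowercase, drop punctuation
    # and articles, collapse whitespace.
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truths):
    # Credit if the prediction matches ANY ground-truth answer exactly.
    return any(normalize(prediction) == normalize(gt) for gt in ground_truths)

# The held-out human answer "radio" misses against the other two:
print(exact_match("radio", ["radio network", "radio network"]))   # False
# A nonsense machine answer scores exactly the same zero:
print(exact_match("October 12, 1943", ["radio network", "radio"]))  # False
# Only trivial variation (articles, punctuation) is forgiven:
print(exact_match("a radio network", ["radio network"]))  # True
```

Under this metric a factually-correct paraphrase and a completely wrong date are indistinguishable, which is the asymmetry being described above.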

Yet this gets headlined as "defeats humans," not "learns to mimic human
responses well."

~~~
rmellow
> I am always annoyed at claims in supervised learning that a machine
> predictor is better than humans. Humans obviously are the ones that scored
> the dataset to begin with

For some problems, sure. For prediction tasks, on the other hand, you have an
actual ground truth that can be compared against a priori human predictions.

Neural-net NLP results are rarely about actual intelligence or clever use of
latent variables the model has figured out, and more about "pattern matching"
-- which explains why its errors are so different from human errors. It
doesn't actually understand the problem; it's finding tricks for answering
the questions -- regularities in the dataset that we humans can't really see.

------
cscurmudgeon
How well do these do on Winograd challenges?

[https://aaai.org/Conferences/AAAI-18/aaai18winograd/](https://aaai.org/Conferences/AAAI-18/aaai18winograd/)

~~~
nl
This is extractive question answering rather than reasoning, so the Winograd
schemas will be challenging for it.

Nevertheless, most extractive systems learn some degree of co-reference
resolution.

I have a less advanced system than the Alibaba one, and it got both example
questions correct:

 _The trophy would not fit in the brown suitcase because it was too big._ What
was too big?

and

 _The town councilors refused to give the demonstrators a permit because they
feared violence._ Who feared violence?

------
pegasos1
This is clickbait. Unless models are invariant to adversarial examples on
SQuAD, such as those described here:
[https://arxiv.org/abs/1707.07328](https://arxiv.org/abs/1707.07328), a model
doing really well on SQuAD doesn't mean much.
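The attack in that paper appends a distractor sentence, generated by mutating the question, that lures models leaning on surface word overlap. A toy illustration (the sentences and the naive overlap scorer here are my own, not from the paper):

```python
import re

def words(text):
    # Crude tokenizer: lowercase and strip punctuation.
    return set(re.sub(r"[^\w\s]", " ", text.lower()).split())

def overlap(sentence, question):
    # Stand-in for a real model's span scorer: count shared words.
    return len(words(sentence) & words(question))

question = "What city did Tesla move to in 1880?"
gold = "In 1880, Tesla went to Prague."
# Distractor built by mutating the question's entities, in the
# spirit of the paper's AddSent attack:
distractor = "The city Tadakatsu did move to in 1881 was Chicago."

print(overlap(gold, question))        # 4
print(overlap(distractor, question))  # 5 -- the distractor wins
```

A human is never fooled by the distractor, but an overlap-driven system gets pulled straight to the wrong sentence, which is why robustness to these examples is a better test than raw SQuAD accuracy.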

~~~
typon
Can't they simply include adversarial examples in the unreleased test set?

------
nl
At NIPS 2017 there was a system which beat humans in a college QuizBowl
competition. In many ways I think that was more impressive than excellent
performance on SQuAD.

------
wanghq
Kudos to my colleagues. The iDST team is based in Bellevue, WA and is hiring
more people. Let me know if you're interested.

Also, Alibaba Cloud is looking for engineers. Please check
[https://careers.alibaba.com/positionDetail.htm?positionId=b7...](https://careers.alibaba.com/positionDetail.htm?positionId=b7kSeJ8J2XQ3ynkotvAhPw%3D%3D)

------
Xeoncross
@syllogism, have you thought about a demo combining spaCy + ____ to tackle
SQuAD ([https://rajpurkar.github.io/SQuAD-explorer/](https://rajpurkar.github.io/SQuAD-explorer/))?

~~~
nl
DrQA can use spaCy as a tokenizer and scores about 2 points lower on SQuAD.
[https://github.com/facebookresearch/DrQA#tokenizers](https://github.com/facebookresearch/DrQA#tokenizers)

------
stablemap
A counterpoint from Yoav Goldberg:

[http://u.cs.biu.ac.il/~yogo/squad-vs-human.pdf](http://u.cs.biu.ac.il/~yogo/squad-vs-human.pdf)

------
anorphirith
is this still impressive in 2018? I honestly don't know

~~~
js8
Yeah, I don't think it is. Year 2017 brought out some really bad humans, and
it turns out, you don't really need better algorithms to beat them in
linguistics. See for yourself in this video:
[https://www.youtube.com/watch?v=L0eY5TGEK2I](https://www.youtube.com/watch?v=L0eY5TGEK2I)

------
spiderfarmer
Cool. An AMP page. Makes it look like Google published this article.

~~~
hughes
Seeing (google.com) in the title definitely influenced my click.

~~~
freehunter
People who think AMP is a solution and not just another problem need to take
this effect into account. Google isn’t just acting like a CDN here, they’re
being intentionally misleading.

~~~
MR4D
Normally I'd agree, but I just got a redirect to the ZDNet page.

Did someone change the link?

------
msla
Real link:

[http://www.zdnet.com/article/alibaba-neural-network-defeats-...](http://www.zdnet.com/article/alibaba-neural-network-defeats-human-in-global-reading-test/)

~~~
jnordwick
Flagged and voted this up. HN should scan for AMP links and ask the submitter
to fix the link.

~~~
ClintEhrlich
I realized my mistake after posting, but couldn't edit the URL. Sorry about
that.

