
Natural language benchmarks don’t measure AI models’ general knowledge well - optimalsolver
https://venturebeat.com/2020/08/12/natural-language-benchmarks-dont-measure-ai-models-general-knowledge-well-research-shows/
======
ninjin
Second author here with a link to the arXiv paper:

[https://arxiv.org/abs/2008.02637](https://arxiv.org/abs/2008.02637)

Although I have to say, VentureBeat did much better than most media outlets I
have seen writing about current research; what they write is not only
accurate but also largely devoid of hype. Perhaps we actually managed to “keep
the hype down” as we intended when writing this piece?

I will check in on this post now and then if you have questions, and I will
see if the first author is interested in joining when he wakes up, as he
really did all the legwork for this one.

~~~
nl
Nice paper and important work.

Given that nearest-neighbor outperforms the closed-book models, is it
reasonable to suspect the model is doing NN itself internally (which would
explain the good performance on close duplicates)?

And if this is the case, do you think training-time processing of data to
convert it into question/answer _form_ data, rather than raw data, would be a
reasonable approach towards tackling this?

~~~
patrick-lewis
Hi, first author here.

> Given that nearest-neighbor outperforms the closed-book models, is it
> reasonable to suspect the model is doing NN itself internally (which would
> explain the good performance on close duplicates)?

I think this is definitely the case for the BART model. It is essentially
acting as a QA-pair memorizer over the training data, and at test time, it
just matches the question onto those seen at training time. Note that the
T5-11B+SSM closed-book model was able to do a little better on NQ, so very
large models with task-specific pretraining objectives do seem to do something
slightly more interesting than just NN, but still really struggle in some
settings.
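
To make the "QA-pair memorizer" point concrete, here is a rough sketch of what
such a nearest-neighbour baseline does. TF-IDF similarity here is just a
stand-in for the BERT embedding similarity the actual baseline uses, and the
example QA pairs are from our overlap analysis:

    # Nearest-neighbour QA: answer a test question with the answer of its
    # most similar training question. (TF-IDF stands in for the BERT
    # similarity used in the real baseline.)
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    train_qs = ["who does max voice in a goofy movie",
                "where are cone cells located in the eye"]
    train_as = ["Jason Marsden", "retina"]

    vec = TfidfVectorizer().fit(train_qs)
    train_mat = vec.transform(train_qs)

    def nn_answer(test_q):
        sims = cosine_similarity(vec.transform([test_q]), train_mat)
        return train_as[sims.argmax()]

    print(nn_answer("who plays max voice in a goofy movie"))  # Jason Marsden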

> And if this is the case, do you think training-time processing of data to
> convert it into question/answer form data, rather than raw data, would be a
> reasonable approach towards tackling this?

Great question! Converting sentences into a series of QA pairs is something
we're really interested in. The T5-11B+SSM model we evaluate in the paper uses
a special "Salient span masking" pretraining objective that does this to some
extent (only mask words at pretraining time that are likely to be "answers" to
factual questions), so in essence the pretraining task becomes pretty standard
cloze-question answering, and they find that leads to better downstream
results ([https://arxiv.org/abs/2002.08910](https://arxiv.org/abs/2002.08910)).
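
For concreteness, here is a toy sketch of what SSM-style data preparation
looks like. The real objective uses a trained named-entity and date tagger to
pick salient spans; the regex here is only a crude stand-in to show the shape
of the resulting cloze examples:

    import re

    # Toy salient span masking: mask spans likely to be "answers"
    # (capitalised names and 4-digit years stand in for real NER output).
    SALIENT = re.compile(r"\b(?:[A-Z][a-z]+(?: [A-Z][a-z]+)*|\d{4})\b")

    def to_cloze(sentence):
        for m in SALIENT.finditer(sentence):
            yield sentence[:m.start()] + "[MASK]" + sentence[m.end():], m.group()

    for cloze, answer in to_cloze(
            "Francisco Pizarro led the conquest of the Incas in 1532."):
        print(cloze, "->", answer)
    # e.g. "[MASK] led the conquest of the Incas in 1532." -> Francisco Pizarro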

~~~
nl
Thanks for the response.

> [BART] is essentially acting as a QA-pair memorizer over the training data,
> and at test time, it just matches the question onto those seen at training
> time.

Not super-surprising.

> The T5-11B+SSM model we evaluate in the paper uses a special "Salient span
> masking" pretraining objective that does this to some extent (only mask
> words at pretraining time that are likely to be "answers" to factual
> questions), so in essence the pretraining task becomes pretty standard
> cloze-question answering

This seems an obvious approach for cloze-type questions, but it seems
non-obvious how to extend beyond this.

Are you aware of any work probing the differences in the representation using
this style of masking vs a more normal language model objective? It would seem
to me that this is the key to significant progress here (and of course one
would speculate that a representation that works well for this would also work
well for all kinds of KB-related tasks).

Thinking about this for a few minutes: things like masking names, colors, and
numbers (the things that neural representations often confuse) and then asking
questions based on them might be interesting. I wonder if bAbI could be
extended for this?

------
nextos
Judea Pearl has been bringing up the lack of causal knowledge in ML very
often. He has even posted lots of interesting comments in Andrew Gelman's
blog, e.g.:
[https://statmodeling.stat.columbia.edu/2009/07/05/disputes_a...](https://statmodeling.stat.columbia.edu/2009/07/05/disputes_about/)

I tend to think that lots of solutions could come from topics like those
discussed in this book, with a lot of further development:
[http://probmods.org/](http://probmods.org/)

~~~
Isinlor
Understanding of causality is very likely an emergent property. While
extremely important, it's unlikely that we have some hard-coded low-level
architecture for causal inference in our brains. It will probably just arise
as a necessity of grounded understanding of the world.

~~~
mjburgess
I'm not exactly sure what you mean.

The conditions for causal inference being _possible_ are pretty clear and have
to do with the intentional modification of the local environment.

The intention to achieve some new environmental state, and _your_ action to
bring it about, is a dynamical activity that enables "deep" model building.

Causal inference is not going to be some "module" of the brain... it requires
a body. When you place your hand on a hot surface, _once_, you immediately
understand that it is hot. It does not require "induction" (as Hume supposed).
That is because our _body_ identifies causes.

It's therefore pretty trivial to observe that no NLP system understands
language, or even can understand language, because it lacks this capacity to
acquire language semantics via participation in environmental exploration. It
has no body.

I.e., you need to have experienced "on top", "green", etc. to know what "green
leaves grow on top of trees" _means_. There is no meaning in the frequency
co-incidence of symbols in text.

So no matter how much you are able to reproduce these patterns, they contain
no content. The content is in the _reader_.

~~~
s_brady
Your comment is interesting but seems to conflate two separate though related
areas. The need for a body arises from the embodied cognition school of AI,
which suggests that intelligence is fundamentally embodied, hence the need for
a robot equivalent of a body for truly understanding language.

However, this does not necessarily have to be related to causality and
counterfactual statements about a causal model. The math behind
counterfactuals and causality is actually well understood now (see any of
Pearl's books). It does not actually require that a system be embodied, just
that the system have some suitable (and correct) causal model of the world.

It would of course be amazing to have both in one system, but that is not
required. An AI system that understood causality and language could be
bootstrapped from causal models supplied by humans - or even other AIs :)

~~~
mjburgess
I'm conflating them because they are deeply connected.

Causal analysis can be performed, via Pearl, on datasets _collected_ for
causal analysis.

You still need some mechanism to collect the data, i.e., the scientist. This
requires solving the "relevance" (/framing) problem -- which, in my view,
cannot be solved under a cognitivist (/computational) theory of mind.

"Data" which is _relevant_ to a causal hypothesis isn't selected via
inference, the body "selects" it.

E.g., when my hand is on a hot surface, its temperature isn't "chosen as the
relevant causal variable".

The body is the primary solution to the relevance problem. So you can't just
"shove causal math" into a computational system and expect it to grasp
anything.

I also don't think bootstrapping will take you very far: causal models of,
e.g., dogs are _very deep_. I.e., we understand their 2d, 3d, skeletal,
behavioural, color, sound, etc. "dimensions".

To say, "the dog was well behaved" requires an extraordinarily deep model of
"dog".

The only way I see this being built is via play, i.e., via hypothetical
interaction with an environment -- as we do -- with bodies capable of
discerning relevance.

~~~
nextos
> Causal analysis can be performed, via Pearl, on datasets collected for
> causal analysis.

There's a lot of past and current work on causal inference on observational
data too.

------
skybrian
It seems like there are not many common-knowledge questions that you can't
find answers to on the Internet? Most of us rely on web search these days.

Also, _making up answers_ isn't necessarily a good skill to train for if
accuracy is needed. It would be more useful to quote and cite the source.

------
tsimionescu
I'm always amazed that some people think performing statistical analysis
(training neural networks) on text can lead to actual intelligence and
knowledge about the world, if only we increase the number of model parameters
by a few more orders of magnitude (GPT-3 and the likely strategy for GPT-4).

Of course, all this without any model of how likely it is that the knowledge
is embedded in the text, such as trying to train simple models on descriptions
of a simple world. I think it's very likely a model couldn't even learn simple
arithmetic (addition, subtraction, multiplication, division on the rational
numbers, let's say) given all of human writing on arithmetic, never mind being
able to reason about the entire world.

~~~
IfOnlyYouKnew
It amazes me that people think their brains work differently.

It's pretty close to believing in body/mind dualism, the only thing in
neuroscience more outdated than Freud.

Your brain works within the same laws of physics as the outside world. We
don't know how the brain works exactly. But once we understand it, it is
unlikely to be qualitatively different from a neural network.

On the other side of the equation, the emergent behaviour of deep networks
makes them fundamentally different from the sort of statistics that came
before. So even if the building blocks are old and boring, the larger system
isn't.

~~~
mjburgess
> it is unlikely to be qualitatively different from a neural network

I'm sorry but this shows a profound misunderstanding of what a NN is, and what
the brain is.

There are no "neural network"s. The NN algorithm is a method for optimizing
the parameters of a piece-wise linear regression model.

These regression models have _no_ homology to any brain structure and the
process of producing them ("training") has no neurological analogue either.
They can be produced with a variety of algorithms.

Here's one: f(x) = max(0, 2max(0, 3x - 0.1) - 0.5)
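
To make that concrete, here's that exact function written both as nested maxes
and as a two-layer "neural network" (a quick sketch):

    import numpy as np

    # The function above, as nested maxes...
    def f(x):
        return max(0, 2 * max(0, 3 * x - 0.1) - 0.5)

    # ...and the same function drawn as a "network": weight, bias, ReLU, twice.
    def relu(x):
        return np.maximum(0, x)

    def net(x):
        h = relu(3 * x - 0.1)     # "hidden layer": w=3, b=-0.1
        return relu(2 * h - 0.5)  # "output layer": w=2, b=-0.5

    assert all(abs(f(x) - net(x)) < 1e-12 for x in np.linspace(-1, 1, 101))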

The phrase "neural network" is, as peddled by the media and poorly informed
lecturers, a lie. It is neither neural nor a network. It's just gradient
descent with more parameters.

A model of the brain would model, at least: neuroplasticity, biochemical
signalling, activation frequency, etc.

There is _nothing_ about a piece-wise linear regression model which does any
of that. No matter how many circles and lines you draw. (NB: essentially any
mathematical function can be drawn as a "neural network". The "network" is
just a way of diagramming function application & dot-products.)

Aside from all that, the body-brain system of animals is a physical process
whose properties are not abstract. The reason we are intelligent is because we
have bodies capable of causal analysis; and that capability is biophysical.

To put it another way, no algorithm which runs on a digital computer will turn
it into gold. Not even one called "the Midas algorithm".

~~~
plutonorm
Causal. Imagine a video. Each frame is plotted in space, so instead of a
sequence of 2d images, you have a cube. The z axis of the cube shows each
frame of the video in turn. This is a representation equivalent to the one we
are used to; we have just switched the time dimension for a physical
dimension. What you are reverently calling causality is only correlation along
the z spatial dimension.

In the same way an image can be compressed due to the similarities across its
two dimensions, so can a 3d representation of video be compressed along all
three of its axes. An appreciation of causality is only induction, and
induction is only a correlation: if a, then probably b. Finding correlations
in one of the dimensions you are more familiar with, like the pixels of an
image, is no different to finding them in time.

Within language the elements of time are encoded, just as they are within the
video represented as a cube. "The boy threw the ball, the ball landed and then
rolled." Just because the time dimension is represented to the neural network
within the input parameters of a single iteration, that does not make it any
less able to understand correlations across time.

~~~
mjburgess
So this is to repeat Hume and mistake causal analysis for a kind of induction
or inference.

It isn't.

Our bodies are the primary site of "causal analysis". E.g., when I touch the
hot surface of a stove, I do not _infer_ that temperature is a cause of my
hand being hot.

Such an inference, as the basis for our models of the world, would be -- as
you/hume/etc. say -- deeply insufficient.

The operation of the world _on our bodies_ is already laden with causal
information. A hand striking a face does not just create a "painful sense
impression"... rather the body encodes it as _caused_ by the object you also
_saw_.

It is the action of the world upon the body that is the bedrock of our causal
model building. Scientific/inferential processes sit _on top of that_ to
disambiguate between possible distal causes.

A machine, like a person, given "mere data", cannot hope to do much.

~~~
hackinthebochs
None of this explains how modelling time sequences of events does not
sufficiently approximate the kind of causal knowledge you mention.

~~~
mjburgess
Suppose I place a pot on a stove and the water boils.

Now I feed into the machine what data?

Here, it gets this: all gravitational, electromagnetic, etc. data within 1km;
all geometric information about all objects within 1km (and _all_ of their
properties, etc.).

Now, machine, what caused the pot to boil?

It has no clue. There are an infinite number of antecedent temporal events.

The problem isn't the mathematics of causal inference. The problem is
relevance. That isn't an inferential problem.

~~~
hackinthebochs
> It has no clue. There are an infinite number of antecedent temporal events.

Right, determining the cause of the pot boiling given only local information
about this one event is impossible. But that's not a good representation of
learning from a real-world data set. A real training corpus might also have an
example of a pot boiling over a campfire, a water heater heating up water
using fire, someone touching an open flame and going "ouch!", fire burning
down a house, etc. All these examples taken together can reasonably lead one
to infer that fire causes things to become hot. These are the kinds of
regularities found in real-world datasets, and a good learning algorithm will
extract them in the course of predicting the dataset.

~~~
mjburgess
Who prepares the dataset?

You are just shifting the relevance problem to the human to solve. That's my
point.

The "regularity" isn't very hard to find when the data which exposes it is
_already_ selected.

~~~
hackinthebochs
No, I don't see your point. No one "prepared" common crawl to contain multiple
instances of relevant examples for GPT-3 to learn from. It's just that a
sufficiently large training corpus will naturally have this sort of regularity
in it. _My_ point is that a general learning algorithm trained on a
sufficiently large and representative corpus will capture causal regularity
without the sort of fine-tuning of the algorithm or the training data you are
suggesting. You haven't given any reason to think otherwise.

~~~
mjburgess
Identifying causes cannot be done statistically, as a fact of statistics.

Events A then B then C do not imply that C is caused by B, which is caused by
A.

This problem gets _worse_ the more data you have, as with my example above of
giving the machine _every_ event within a 1km radius of a pot boiling.

Identifying a cause is a dynamic experimental process. A machine can only
accept a highly prepared dataset which _has already been chosen_ because the
associations _are known_ to be causal.

~~~
hackinthebochs
But this problem isn't specific to machines. The fact that my hand hurts after
touching an open flame on a stove doesn't _entail_ that the flame caused my
pain. No amount of first-hand experience with flames can entail that flames
cause pain. All we can do is increase the likelihood of this model.
Experimental processes are in the same boat, except that the statistical power
is greater. Our lived experiences with flames and pain and all other stimuli
are a sort of low-power ongoing experiment. But with enough of these poor
experimental runs we can converge on an approximately true model. The same
goes for a statistical learning algorithm and a large training corpus.

~~~
mjburgess
Yes, but the body solves the problem of concluding that the stove caused the
pain.

The body is a mechanism of relevance. The more data you feed a machine, the
worse it performs.

The body isn't necessarily right about causation (though it often is), but it
provides a minimal mechanism of relevance to enable inference.

Data itself contains no causal information.

~~~
plutonorm
You are just pushing the solution to the problem of relevance (which isn't
actually a real problem) down the stack, into the body. How does the body
magically create this new kind of relation called cause? You are mistaking the
image in your mind for reality. The image you see in your mind is the product
of an extraordinarily large neural network that has done all the statistical
analysis for you.

------
Veedrac
The article doesn't give examples, but the paper does. Here are a few examples
of question overlap, for context.

    
    
        Test Question:  who plays max voice in a goofy movie
        Train Question: who does max voice in a goofy movie
        Answer:         Jason Marsden
    
        Test Question:  when will the 2018 oscar nominations be announced
        Train Question: when are the oscar nominations for 2018 announced
        Answer:         January 23 2018
    
        Test Question:  who has scored more goals in the premier league
        Train Question: most goals scored by a premier league player
        Answer:         Alan Shearer
    
        Test Question:  where are the cones in the eye located
        Train Question: where are cone cells located in the eye
        Answer:         retina
    
        Test Question:  who led the conquest of the incas in south america
        Train Question: conquistador who defeated the incan empire in peru
        Answer:         francisco pizarro
    

It makes sense that questions like these are especially susceptible to brute
memorization. These are certainly bugs in the benchmark—and, in fact, about a
third of questions in each dataset have this issue.

The paper also considers answer overlap; that is, when the answer to the
question also occurs in the training set. This alone does not imply
memorization, but it does open the door for some shortcuts. Some examples from
the paper are:

    
    
        Open Natural Questions
        Duplicated: Phil Simms, Brian Johnson, 8, the Indians, the 1830s
        Unique:     Cloves, Matt Monro, 1,020 – 1,080 kg, Hermann Ebbinghaus, Matt Flinders
    
        TriviaQA
        Duplicated: David Bowie, Battle of camlann, Heligoland, Henry VII, Niagra Falls
        Unique:     Death in the afternoon, Clash of the Titans, ice-cream sundae, Camshaft, Cumberland
    
        WebQuestions
        Duplicated: Harvard, Alderaan, India, 2011, Zeus
        Unique:     Queen Victoria, Brasília, Paddington, Tom Corbett, Gary
    

It's less obvious how to treat answer overlap. In particular, removing answer
overlap might bias the dataset towards harder questions. Some of the reduction
in model scores is presumably because models have to work harder to understand
the question, as merely looking for topical similarities will be ineffective,
but it also removes questions whose answers are general enough to apply to
many questions, like ‘8’, ‘the 1830s’, ‘Harvard’, ‘2011’, etc. This means that
reductions in the score don't clearly say how much cheating happened. However,
the BERT-based Nearest Neighbor model, which retrieves the answer of the most
semantically similar fine-tuning sample, scores similarly to BART, which seems
much too high for comfort.
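
For reference, the answer-overlap check itself is mechanically simple. Here is
a sketch, using a simplified version of the usual QA answer normalization
(lowercase, drop punctuation and articles):

    import re, string

    # Does a normalised test answer appear among the normalised train answers?
    def normalize(ans):
        ans = "".join(c for c in ans.lower() if c not in string.punctuation)
        ans = re.sub(r"\b(a|an|the)\b", " ", ans)
        return " ".join(ans.split())

    train_answers = {normalize(a) for a in ["Phil Simms", "the Indians"]}

    def answer_overlaps(test_answer):
        return normalize(test_answer) in train_answers

    print(answer_overlaps("The Indians"))         # True
    print(answer_overlaps("Hermann Ebbinghaus"))  # False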

It should be noted that a lot of the discussions about causal knowledge and
intelligence aren't too relevant to these questions, as they are largely tests
of memory or retrieval. It is expected that these models answer the questions
by searching their training data or their document index. The issue is that
their training data isn't meant to contain copies of the test questions, just
the information sufficient to answer them.

~~~
patrick-lewis
Hi, first author here.

Thanks for the comment, for adding the examples, and for your nuanced
discussion of the answer overlap split.

My position is that these datasets are still useful for QA, but what was
lacking was an analysis of how easy or hard the questions in them were, and
what kind of modelling was needed to do well. These overlap phenomena are
maybe less like "bugs" and more like poorly understood features.

We need models that can accurately recall QA pairs they have seen before, so
scoring well on "memorizable" QA pairs is still important, but we also want
models that can do more than that. A single accuracy number on a leaderboard
cannot capture all the behavioural information we need to properly understand
the capabilities of these models.
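
For what it's worth, the kind of breakdown we would like to see on
leaderboards is cheap to report. A toy sketch (the split labels follow our
analysis; the numbers are made up for illustration):

    from collections import defaultdict

    # Accuracy per overlap split instead of a single headline number.
    def accuracy_by_split(items):  # items: (split_label, correct) pairs
        hits, totals = defaultdict(int), defaultdict(int)
        for split, correct in items:
            totals[split] += 1
            hits[split] += correct
        return {s: hits[s] / totals[s] for s in totals}

    results = [("question_overlap", 1), ("question_overlap", 1),
               ("answer_overlap_only", 1), ("answer_overlap_only", 0),
               ("no_overlap", 1), ("no_overlap", 0)]
    print(accuracy_by_split(results))
    # {'question_overlap': 1.0, 'answer_overlap_only': 0.5, 'no_overlap': 0.5}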

