
NLP's Clever Hans Moment Has Arrived - pgodzin
https://thegradient.pub/nlps-clever-hans-moment-has-arrived/
======
unhammer
Great article, and some great links from it too. The list of models "following
the letter but not the spirit" at
[https://docs.google.com/spreadsheets/u/1/d/e/2PACX-1vRPiprOa...](https://docs.google.com/spreadsheets/u/1/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml)
has some hilarious and creative examples that one would hope never make it
into production:

> AI trained to classify skin lesions as potentially cancerous learns that
> lesions photographed next to a ruler are more likely to be malignant.

> Agent pauses the game indefinitely to avoid losing

> A robotic arm trained to slide a block to a target position on a table
> achieves the goal by moving the table itself.

> Evolved player makes invalid moves far away in the board, causing opponent
> players to run out of memory and crash

> Genetic algorithm for image classification evolves timing attack to infer
> image labels based on hard drive storage location

> Deep learning model to detect pneumonia in chest x-rays works out which
> x-ray machine was used to take the picture; that, in turn, is predictive of
> whether the image contains signs of pneumonia, because certain x-ray
> machines (and hospital sites) are used for sicker patients.

> Creatures bred for speed grow really tall and generate high velocities by
> falling over

> Neural nets evolved to classify edible and poisonous mushrooms took
> advantage of the data being presented in alternating order, and didn't
> actually learn any features of the input images

~~~
fauigerzigerk
Some of these approaches would probably be called genius or at least very
creative had they been found by a human :)

~~~
mlthoughts2018
Yes, some hedge fund would probably be happy to hire these AIs.

~~~
inimino
Past performance is not indicative of future results.

~~~
ethbro
That's why successful hedge funds hedge their own pay from clients with legal
clauses to limit liability in the case of losses.

Aka "heads, we win; tails, I don't lose"

------
s_Hogg
> _"they obviously are not smart enough to import tensorflow as tf"_

Yeah. This is part of why DistilBERT (and the fact that you can do pretty well
without BERT) is interesting to me. It seems like for a very long time there
have been people complaining about certain individuals and orgs throwing
cash/compute at problems to look good rather than solve anything. The
difference is nowadays it's starting to be less of a fringe view, thank
heavens.

The most interesting thing about NLP (as someone who works in it) is
precisely that it is very, very hard to get anywhere. And that in turn is why
the field keeps turning up so many new NN designs: the flip side, as the
author rightly points out, is that this has to happen for data as well, if we
aren't to fool ourselves about our progress.

Great read.

~~~
dlkf
> The difference is nowadays it's starting to be less of a fringe view, thank
> heavens.

Well put. The timing is good too, because NLU-is-just-around-the-corner hype
is starting to have some really negative social consequences. Yesterday a
disturbing article about automated test scoring in the States was trending on
HN.

~~~
lonelappde
Mass production essay grading was a disaster before AI showed up. AI is just
automating the mistakes of the past.

------
mirekrusin
What if this is the answer? What if our intellect is nothing more than a
network of Clever Hanses in our heads, a few orders of magnitude more
performant, developed over years? Look at history - it's actually hard to find
something that is not biased - false, biased beliefs are spread throughout the
timeline of humankind. Given enough time, our knowledge gets closer to the
truth - but maybe it's nothing more than a time-trimmed tree of Clever Hanses?
Like a child constantly creating false, biased interpretations of the
surrounding reality, discarding most of them, keeping the bits that pass the
test of time.

~~~
foldr
We know that "this" can't be the answer because we know that humans aren't
fooled by the inputs that were crafted to fool the language model.

~~~
almostarockstar
In that specific example, yes. But in general, I think the "we are all Clever
Hans" idea is right, at least to some degree.

~~~
foldr
Why? The idea that human intelligence is also "fake" is just a convenient
excuse for the lack of any real progress in AI. We hear the same thing at the
tail end of every overblown AI hype cycle. Well, gee, maybe humans aren't that
smart anyway!

~~~
pas
Why are we equating brute force with fake?

The human mind has many faculties, and the brain has correspondingly many
functionally different elements. Plus years of brute force learning.

Compared to GPT-2, AlphaZero, and whatever else, we are waay more complex and
have had a lot more training done.

Probably the super secret sauce is in the hyperparameters that determine how
to cobble all those functional components together to get something more than
the simple sum of them. (But they are also learned and brute forced through
evolution.)

AI needs better brains. (So R&D can proceed faster. We can test new ideas
cheaper, we can optimize algorithms, so they'll learn and perform better.)

~~~
foldr
I'm not sure that I am equating brute force with fake.

------
vnorilo
From the paper discussed in the article:

> Our main finding is that these results are not meaningful and should be
> discarded.

As often happens, ML found a way to exploit a trivial bias in the dataset. Tip
of my hat to the researchers for actually doing a good job! Also really
enjoyed this read.

------
nopinsight
In natural language, the number of unique words is large but some of them tend
to be highly correlated, which means significant implicit redundancy.

Example: For years, they have strived to make a ....... in the community.

Only a few possibilities out of 30,000-100,000 English words in use can fit
properly in the blank.

Thus, many tasks, even those designed to test for real understanding, can be
“solved” pretty well by putting together a number of relatively shallow cues.
BERT and similar models learn from huge datasets (billions of bytes) and they
probably capture millions of those correlations in the model. (These models
are indeed wonderful accomplishments and are very useful for many things, but
not ultimate solutions to true natural language understanding.)
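
You can see this redundancy in action by asking a masked language model to
fill the blank itself. A quick sketch, assuming the HuggingFace transformers
library is available (the model choice is just illustrative, and the actual
candidates it prints will vary):

```python
# Ask BERT which words fit the blank in the example sentence above.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for candidate in unmasker("For years, they have strived to make a [MASK] in the community."):
    print(candidate["token_str"], round(candidate["score"], 3))
```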

For more info, see:
[https://super.gluebenchmark.com/](https://super.gluebenchmark.com/)

I agree with the author that we need even better datasets, evolved under
public scrutiny. Some of the datasets we currently have are already designed
very well by their authors, but the problem of designing datasets that can
withstand correlation detection by DNNs (while remaining amenable to a
standardized evaluation method) could be too challenging for any single team
under limited time.

~~~
mikekchar
I don't know if I'm just not wired up like most people, but I have real
difficulty in fill in the blank tests. These kinds of tests are extremely
common for language proficiency tests and I fail at them even for my native
English language. For example:

For years, they have strived to make a _pizza_ in the community. (But since
they lack flour for the crust, it has not ended well). For me, I can fit
nearly any noun into that sentence and imagine a viable scenario where it
would be reasonable. I honestly don't know what you were getting at. For
others, I am sure that it is obvious and they can tell you the few words you
were imagining.

My English language ability is quite good, so I wonder why I can't perform at
these kinds of tests. I also wonder if knowing why I can't do this is useful
for NLP.

~~~
inimino
"difference" is the word there.

If you are allowed two words, "real difference".

The way to find it is to find candidate words and try them one by one until
you hear something that sounds like you've heard it a hundred times before.

> I also wonder if knowing why I can't do this is useful for NLP.

Yes. I wonder if the inability is acquired or learned or innate? Could you
learn to do it? Do you have an aversion to catch phrases and well-used
(hackneyed) clichés? Do you prefer to weave your own words into sentences?

I notice you use some odd prepositions in odd orders. For example in your
second sentence most people would say "I have real difficulty with fill-in-
the-blank tests."

You also wrote "perform at these kinds of tests," which is a place where
almost all American native English speakers would use "on" rather than "at".

If I had to guess, you're much less sensitive than the average person to
slight variations in word-pair frequencies, and you could certainly devise a
test of this hypothesis. For example, you could take n-gram likelihood data
derived from well-written English texts and write a program to measure your
ability to distinguish high-frequency from low-frequency word pairs. Your
score should presumably be lower than that of other people in your peer group.
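
Such a program is easy to sketch. A minimal interactive version, with made-up
bigram counts standing in for real n-gram frequency data (in practice you
would load counts from a corpus such as Google Books Ngrams):

```python
import random

# Hypothetical bigram counts; real ones would come from an n-gram corpus.
bigram_counts = {
    ("real", "difference"): 120_000,
    ("real", "pizza"): 800,
    ("strong", "coffee"): 95_000,
    ("powerful", "coffee"): 1_200,
    ("make", "progress"): 150_000,
    ("make", "advancement"): 2_500,
}

pairs = list(bigram_counts.items())
score, trials = 0, 10
for _ in range(trials):
    (a, count_a), (b, count_b) = random.sample(pairs, 2)
    answer = input(f"Which is more common? 1) '{' '.join(a)}'  2) '{' '.join(b)}': ")
    picked_a = answer.strip() == "1"
    if picked_a == (count_a > count_b):  # did you pick the higher-frequency pair?
        score += 1
print(f"You spotted the higher-frequency bigram in {score}/{trials} trials.")
```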

------
DoctorOetker
What if most humans exploit the very same Clever Hans effect? A translation
could be deemed correct by the submitting human translator(s) even where the
translation implies fewer or more ramifications than the original sentence.

If one thinks about the loose associations that newcomers make compared to
experts in a domain, this seems very similar to me.

When designing a vector graphic one can enable "snap to grid"; similarly, at
some point we will have to "snap to makes sense" by means of verifiers or
provers. "Why"-type questions ultimately ask for a proof or derivation, which
in the past (before the advent of logic, to which a philosopher would have to
carefully adhere) was not objectively verifiable, but at some point it is
foreseeable that neural networks will be asked to construct (or append to) a
formal belief system, predict a conclusion, and justify it by supplying a
proof.

~~~
perl4ever
"When designing a vector graphic one can enable "snap to grid", similarily at
some point we will have to "snap to makes sense" by means of verifiers or
provers."

Yes, I think this is how the human mind works, kind of, but I don't think the
part that provides the grid is like the verifiers or provers we have
implemented. There is something that provides a substructure, and I think that
you can see in mental illness where it's not functioning properly, but even
when healthy, it's not _that_ logical. When humans do logical reasoning,
that's a very high level activity, superimposed on top of the other layers,
IMO.

I think that the "snap to" part is going to be very difficult to develop
because we are not conscious of it. Where to even begin? It might be fruitful
to study instances where it isn't working properly - like thought disorders in
schizophrenia.

So yeah, I think human thinking may have severe flaws that are rather similar
to the ones discussed in the article, but that doesn't prove that current AI
has all the components needed to match humans.

~~~
DoctorOetker
> So yeah, I think human thinking may have severe flaws that are rather
> similar to the ones discussed in the article, but that doesn't prove that
> current AI has all the components needed to match humans.

I wholeheartedly agree; there are probably a lot of generally applicable
(i.e. not domain-specific) "tricks" or "implicit insights" that mammal brains
use which we haven't discovered yet.

Another good example is how it's often stated that humans have no trouble
learning things "in one shot", but on closer inspection that may or may not be
true and is very hard to verify:

Consider how, when we have a clear negative or positive experience (say,
being thrown out of class or standing in a corner facing the wall, versus
getting a compliment on your work), we afterwards typically replay the
sequence of events and how it led to the current situation. It is unclear
whether this replaying occurs only for the highest-level abstract thoughts,
possibly centralized into a couple of regions recording and replaying episodic
memory, or whether it actually happens in a more decentralized way throughout
the brain while we are only subjectively aware of the episodic aspect at the
conscious level. If this self-replaying of the most recent locally applicable
lessons is spread out and operating independently throughout the brain, could
it be an explanation of brain wave patterns? Is this the origin of the "aha"
signals, or is that just a fantastic concoction of the reproducibility crisis?

Local feedback signals (numbering on the order of the number of synapses) vs
global feedback signals (numbering on the order of the number of kinds of
hormones, diffusing neurotransmitters, blood sugar, etc.) ===

We still know very little about the feedback mechanisms in the brain: is it a
low number of global feedback signals, or a large number of local feedback
signals?

A) global feedback: the hardcoded feedback mechanism is primitive wetware
(listening to the low number of chemical signals), while the high-speed
feedback is learned, with neurons influencing each other in the prograde
direction as emergent behaviour resulting from this primitive wetware (the
only feedback being through axons literally feeding back, as opposed to
feedforward networks); or

B) local feedback: the high-speed feedback is hardcoded wetware too, i.e.
retrograde signalling of adjoint derivatives across the synapse, which would
mean that the reverse-accumulation automatic differentiation we use is closer
to biology than currently accepted.

For example, in A), suppose there is a low-bandwidth (low-frequency) feedback
reward signal, say blood sugar rising after eating a given sweet for the first
few times. Then the shortest path between the taste buds and the chewing and
swallowing motor neurons would improve its weights for intake, a first level
of anticipation; but later, seeing or touching the sweet, or actively seizing
one, might cause the previously trained neurons to train the newly recruited
signal paths via some other neurotransmitter reward, by anticipation.

Fast-approach-but-imprecise adaptation vs Slow-approach-but-precise fine-
tuning ===

Another facet is that we currently have "1 training regime" for digital
neural networks, by which I mean: we use gradient descent both to 1) adapt the
weights from a totally randomized initial state to a somewhat acceptable
state, and then, after a nonexistent pause, 2) to continue improving the
weights near the top of the hill (or bottom of the valley...) of the scoring
landscape. We have no guarantee that nature uses a single regime. Let me
concoct a hypothetical (and thus improbable) guess: perhaps it uses gradient
descent to get the weights approximately where they should be, but then a
per-synapse lock-in amplifier correlates the positive/negative feedback with
its own positive or negative variation of the weight (multiplying the two
signals over time while summing), and smooths this product-sum correlation
signal (low-pass filtering) to get a feedback signal with much higher
precision than a single instantaneous "sample" of feedback (local or global).
Throughout the brain's synapses, some would be closer to the
instantaneous-feedback regime (for responding to new lessons, or for
short-term memory), while others would be closer to the LIA regime, for
longer-term memory and/or precision. That would be trillions of lock-in
amplifiers in a single human brain...

EDIT: in fact SGD (stochastic gradient descent) can already be seen as a
global LIA mechanism, but with all weights/synapses using the same low-pass
filter. In theory we could give each synapse/weight not only a weight but also
its own timescale (or example-count scale: the tau in the (alpha) and
(1-alpha) multiplier factors when expressed as a filter), and adapt the
timescale depending on the local feedback signal (the adjoint derivative) in
the backpropagation algorithm. To vary the weight, just use the weight times
(1 + 0.001 times the random bit r), and multiply the variation with the
feedback signal (the component of the usual gradient, i.e. the derivative that
corresponds to this synapse/weight). A similar trick could be used to vary the
timescale. One could also hardcode the timescale for certain synapses as a
hyperparameter, to forcibly locate short-term vs long-term memory pathways.
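
To make this concrete, here is a minimal numpy sketch of such a per-weight
lock-in amplifier, with a toy quadratic loss standing in for the feedback
signal. It uses an additive perturbation of size 0.001 rather than the
multiplicative form above, to keep the estimate well-behaved near zero
weights; every name and constant is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10
w = rng.normal(size=n)                      # the "synaptic weights"
target = rng.normal(size=n)
loss = lambda v: np.sum((v - target) ** 2)  # toy stand-in for global feedback

alpha = rng.uniform(0.01, 0.5, size=n)  # each weight gets its own timescale
corr = np.zeros(n)                      # per-weight lock-in correlation signal
eps = 0.001                             # perturbation size

for step in range(5000):
    r = rng.choice([-1.0, 1.0], size=n)   # each weight's own random bit
    delta = loss(w + eps * r) - loss(w)   # one scalar of global feedback
    # Lock-in step: correlate each weight's perturbation with the feedback,
    # then low-pass filter with that weight's own alpha.
    corr = (1 - alpha) * corr + alpha * (r * delta / eps)
    w -= 0.05 * corr                      # descend along the smoothed signal

print(np.abs(w - target).max())  # should end up small
```

Small-alpha weights average over many perturbations (long-term, higher
precision), while large-alpha weights track the most recent feedback
(short-term), matching the short-term/long-term split below.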

EDIT2: As an example of the function of weights that would settle (or were
forced) at long vs short timescales at low or high levels:

low level, short term: adapting to reverb entering or leaving a church,
adapting to noise when entering or leaving a party,

low level, long term: associating spectral peaks among frequency bins (the
fundamental and higher harmonics for sounds of different timbres would remain
in roughly the same place, regardless of background noise)

high level, short term: unimportant small talk, or important but quickly dealt
with information (did I already pay this drink or can I walk away?)

high level, long term: what is my pin code?

~~~
inimino
> Another facet is that we currently have "1 training regime" for digital
> neural networks, with which I mean: we use gradient descent

Speak for yourself! There are plenty of other ways. Gradient descent is just
currently in fashion.

~~~
DoctorOetker
Sure, there are other means of training; I was highlighting that nearly all
approaches use only one improvement approach during training (gradient
descent, genetic algorithms, ...), and if multiple are used, the whole model
uses the same balance of those approaches.

~~~
inimino
Ah, then agreed. My take: if you have a feedforward network, add time delays,
because synapses aren't running in lockstep; add connections arbitrarily
between any two neurones, limited only by the physical distance between them;
then add in every kind of diffuse effect of neurotransmitters and all the
brain chemicals we haven't discovered yet; and then figure out how to train
it... you still won't have a human brain equivalent, because it is harder than
that. But you might get something that can learn! All you need is a dopamine
reward button, and some way to whack it with a stick every now and then.

I think we're more likely to create AGI than to understand how the brain works
in our lifetimes.

------
PeterStuer
Great article. Should be required reading for test designers, as humans also
tend to pick up on the implicit heuristic clues that bypass the need for
knowledge and understanding; as prep schools teach: "Don't know the answer?
These 5 simple rules will improve your guess."

~~~
tonypace
The standardized test organizations give briefings to teachers on the most
common Hans cues for students. In order, they are differing answer length,
answers that mirror the question's grammar, and the infamous letter C.

~~~
PeterStuer
I had to Google for the 'infamous letter C'. In case others were wondering,
it seems to stem from still-perpetuated folk wisdom that, on a blind guess,
the third answer in a multiple-choice test with 4 or more options is more
likely to be correct than the others.

The linked article suggests this does not hold (for the ACT), but does suggest
that picking a single letter and sticking with it for all blind guesses would
outperform a purely random guessing strategy.

[https://blog.prepscholar.com/most-common-answer-on-act](https://blog.prepscholar.com/most-common-answer-on-act)

------
js8
I have said it before and I will say it again - the problem is that humans
are MORE than just pattern recognizers. We build models that we are trying to
make logically consistent, while the function that provides pattern
recognition doesn't have to be.

To see that better, consider what I call a "simplified Chinese room". It's a
variation on the traditional Chinese room where, inside the room, there is
only a pattern recognizer, which basically matches the input against
arbitrarily many (but not all possible) inputs it learned before and chooses
the output for the best match.

Now imagine I want to train this "simplified Chinese room" to solve the
satisfiability problem. In that problem, an arbitrarily small change in the
input (introducing a contradictory clause) can completely change the output.
It is therefore impossible, I believe, to learn the concept of satisfiability
just by using pattern recognition (storing previously seen inputs and their
correct outputs and comparing them, according to some metric, against new
ones). Instead, you need to build a mental model which is internally
self-consistent with these example pairs.
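
The brittleness is easy to demonstrate. A tiny brute-force SAT checker (toy
encoding of my own: a clause is a tuple of nonzero ints, where k means
variable k and -k its negation) shows how an edit that barely changes the
surface of a formula flips the answer:

```python
from itertools import product

def satisfiable(clauses, n_vars):
    # Brute force: try every assignment of n_vars booleans.
    return any(
        all(any(assign[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses)
        for assign in product([False, True], repeat=n_vars)
    )

base = [(1, 2), (-1, 3)]          # a satisfiable formula over x1..x3
poisoned = base + [(2,), (-2,)]   # add one contradictory pair of unit clauses

print(satisfiable(base, 3))      # True
print(satisfiable(poisoned, 3))  # False: a tiny edit flipped the answer
```

A best-match pattern recognizer would see `poisoned` as nearly identical to
`base` and confidently give the wrong answer.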

~~~
visarga
> Instead, you need to build a mental model

<rant>That's where graph neural nets come into play. They can learn
relations, scaling to a large number of objects, where a traditional approach
would have to learn all possible combinations, hitting combinatorial
explosion. Graph neural nets can solve problems such as shortest path,
sorting, and dynamic programming. I think in the future, if we are to get
closer to human level, we need graphs as the intermediate representation.
Graphs could represent the objects in an image/phrase and their relations,
then answer about the attributes of an object or the relation between two
objects, or classify the graph itself. All simulators are evolving graphs as
well, and code/automata could be represented and executed as a graph. The
transformer could be considered an implicit graph where the adjacency matrix
is computed from the nodes at each iteration; see the sketch below. The
closest thing to AGI, in my view, would be model-based RL implemented with
graphs.</rant>
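
The transformer-as-implicit-graph view is easy to make concrete. A minimal
numpy sketch (all dimensions and weights made up for illustration):
self-attention recomputes a soft adjacency matrix from the node features,
then does one step of message passing along those edges.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_nodes, d = 5, 8                    # e.g. 5 tokens/objects with 8-dim features
X = rng.normal(size=(n_nodes, d))    # node features
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
A = softmax(Q @ K.T / np.sqrt(d))    # soft adjacency, recomputed from the nodes
X_next = A @ V                       # aggregate neighbour messages along edges
```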

~~~
ozymandias12
Amazing rant.

But what bothers me is that the human animal is trying to create an
artificial human mind, the most amazing piece of meat we have lying around.

If we could first create the reasoning of a dog, and then scale up to more
complex logic, I'd be more confident research is going places (as in, we'd
have biologically evidenced building blocks to understand first).

------
runT1ME
Wonderful article that both talks about overall problems and at the same time
gives concrete, easy-to-understand ways to test your model.

------
miltondts
I suspect a human trained exactly like this, and with zero previous
knowledge, would have the same or worse performance. What is the surprise
here? That deep learning is not magic? Do people working in A.I. not know how
animals and humans learn?

------
draw_down
> Without getting into a Chinese Room argument about what it means to
> _understand_ something,

Yep, probably best to avoid. I still don’t think I’ve seen a convincing
rebuttal.

