
NLP’s generalization problem, and how researchers are tackling it - onuralp
https://thegradient.pub/frontiers-of-generalization-in-natural-language-processing/
======
YeGoblynQueenne
To be fair, poor generalisation is not a problem of NLP, but of NLP using deep
neural networks, which is a very recent phenomenon, following from the success
of deep learning for image and speech processing.

If you asked me (you shouldn't, I'm biased), I'd tell you that we're never
going to get deep neural nets to behave well enough to learn meaning. Those
damn things are way too smart for us. They're so smart that they can always
find the easiest, dumbest way to map inputs to outputs- by simply overfitting
to their training set (or their entire data set, if cross-validation is
thorough enough). You can see lots of examples of that in the article above.

The big promise of deep learning was (and is) that it would free us from the
drudgery of feature engineering- but the amount of work you need to do to
convince a deep net to learn what you want it to, rather than what it wants to
learn, is starting to approach that of hand-crafting features.

And we still don't have language models that make any sense at all.

~~~
canjobear
What are the methods you're thinking of that have better generalization
properties?

~~~
YeGoblynQueenne
Actually, I wasn't thinking of any specific methods, but now that you
mentioned it, Inductive Logic Programming (the subject of my PhD - see my
comment about being biased) is a fine example.

For a slightly more impartial opinion, here's a DeepMind paper that performs
neural ILP: [https://deepmind.com/blog/learning-explanatory-rules-noisy-data/](https://deepmind.com/blog/learning-explanatory-rules-noisy-data/)

The authors begin by extolling the virtues of ILP, including its
generalisation abilities, as follows:

 _Second, ILP systems tend to be impressively data-efficient, able to
generalise well from a small handful of examples._ [1]

You can find more references to the generalisation power of ILP algorithms
sprinkled throughout that text. In any case, the entire paper is about getting
the "best of both worlds": ILP's generalisation, interpretability, capacity
for transfer learning and data efficiency, plus deep learning's robustness to
noise and handling of non-symbolic data (I disagree with the authors about
those last two, but, OK).

For my part, below is an example of learning a general form of the (context-
free) a^nb^n grammar from 4 positive and 0 negative examples, using the Meta-
Interpretive Learning system Metagol (a state-of-the-art ILP learner,
referenced in the DeepMind paper; my PhD research is based on Metagol). You
can clone metagol from its github page:

[https://github.com/metagol/metagol](https://github.com/metagol/metagol)

Metagol is written in Prolog. To run the example, you'll need a Prolog
interpreter, either Yap [2] or Swi-Prolog [3]. And Metagol.

Copy the code below into a text file, call it something like "anbn.pl" and
place it, e.g. in the "examples" directory in metagol's root directory.

    
    
      % Load metagol
      :-['../metagol']. % e.g. place in metagol/examples.
      
      % Second-order metarules providing inductive bias
      metarule([P,Q,R], ([P,A,B]:- [[Q,A,C],[R,C,B]])).
      metarule([P,Q,R], ([P,A,D]:- [[Q,A,B],[P,B,C],[R,C,D]])).
      
      % Grammar terminals, provided as background knowledge
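      % (Terminals are DCG-style difference lists: 'A'(S0, S) consumes
      % one a from the front of list S0, leaving the rest in S.)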
      'A'([a|A], A).
      'B'([b|A], A).
      
      % Terminals actually declared as background knowledge primitives
      prim('A'/2).
      prim('B'/2).
      
      % Code to start training
      learn_an_bn:-
      	% Example sentences in the a^nb^n language
      	Pos = ['S'([a,a,b,b],[])
      	      ,'S'([a,b],[])
      	       % ^^ Place second to learn clauses in terminating order
      	      ,'S'([a,a,a,b,b,b],[])
      	      ,'S'([a,a,a,a,b,b,b,b],[])
      	      ]
      	% You can actually learn _without_ any negative examples.
      	,Neg = []
      	,learn(Pos, Neg).
    

Load the file into Prolog with the following query:

    
    
      [anbn].
    

Finally, start training by calling learn_an_bn:

    
    
      ?- learn_an_bn.
      % learning S/2
      % clauses: 1
      % clauses: 2
      'S'(A,B):-'A'(A,C),'B'(C,B).
      'S'(A,B):-'A'(A,C),'S'(C,D),'B'(D,B).
      true .
    

That should take a millisecond or two, on an ordinary laptop.

You can test the results by copy/pasting the two clauses of the predicate
'S'/2 into a Prolog file (anbn.pl will do fine), (re)loading it and running a
few queries like the following:

    
    
      ?- 'S'(A,[]). % Run as generator
      A = [a, b] ;
      A = [a, a, b, b] ;
      A = [a, a, a, b, b, b] ;
      A = [a, a, a, a, b, b, b, b] ;
      A = [a, a, a, a, a, b, b, b, b|...] ;
      
      ?- 'S'([a,a,b,b],[]). % Run as acceptor
      true .
      
      ?- 'S'([a,a,b,b,c],[]). % Run as acceptor with invalid string
      false.
      
      ?- 'S'([a,a,b,b,c],Rest). % Split the string to valid + suffix (Rest)
      Rest = [c] .
    

Note that the learned grammar is a general form of a^nb^n; for example, it
accepts strings it's never even seen in testing (let alone training):

    
    
      ?- 'S'([a,a,a,a,a,a,a,a,a,a,b,b,b,b,b,b,b,b,b,b],[]).
      true .
    

In any case, it's just a couple of first-order rules so it can be readily
inspected to judge whether it's as general an a^nb^n grammar as can be, or
not.

I guess you might not be much impressed by mere learning of a puny little
grammar of a's and b's. You might be slightly more impressed if you know that
learning a Context-Free language from only positive examples is actually
impossible [4]. Metagol learns it thanks to the strong inductive bias provided
by the two second-order metarules, at the start of the example. But, that's
another huge can of worms. You asked me about generalisation :)
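
To make the role of those metarules concrete, here is how Metagol's two
learned clauses arise from them (a hand-worked illustration of the
substitutions, matching the output above; this is not Metagol output):

    
    
      % Chain metarule: P(A,B) :- Q(A,C), R(C,B), where P, Q and R are
      % second-order variables ranging over predicate symbols.
      % Substituting P = 'S', Q = 'A', R = 'B' gives the first clause:
      %
      %   'S'(A,B) :- 'A'(A,C), 'B'(C,B).
      %
      % The second, recursive metarule with the same substitution gives:
      %
      %   'S'(A,D) :- 'A'(A,B), 'S'(B,C), 'B'(C,D).
    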

btw, no, you can't learn a^nb^n with deep learning- or anything else I'm aware
of. The NLP people here should be able to confirm this.

_________________________

[1]
[https://arxiv.org/pdf/1711.04574.pdf](https://arxiv.org/pdf/1711.04574.pdf)

[2] [http://www.dcc.fc.up.pt/~vsc/Yap/](http://www.dcc.fc.up.pt/~vsc/Yap/)
(Yap is fastest)

[3] [http://www.swi-prolog.org/](http://www.swi-prolog.org/) (Swi has more
features)

[4]
[https://scholar.google.gr/scholar?hl=en&as_sdt=0%2C5&q=language+identification+in+the+limit+mark+e+gold+&btnG=](https://scholar.google.gr/scholar?hl=en&as_sdt=0%2C5&q=language+identification+in+the+limit+mark+e+gold+&btnG=)

Well, actually, it _is_ possible - but you need infinitely many examples or an
oracle that already knows the language.

~~~
canjobear
LSTMs can learn a count mechanism that lets them recognize a^n b^n and a^n b^n
c^n:
[https://arxiv.org/pdf/1805.04908.pdf](https://arxiv.org/pdf/1805.04908.pdf)

~~~
YeGoblynQueenne
LSTMs can't learn a^nb^n or a^nb^nc^n, neither can they learn to count, and
that paper shows why (because they generalise poorly).

From the paper (section 5, Experimental Results):

>> 2. These LSTMs generalize to much higher n than seen in the training set
(though not infinitely so).

The next page, under the heading Results, further explains that on a^nb^n the
LSTM generalises "well" up to n = 256, after which it accumulates a deviation
that makes it reject a^nb^n but recognise a^nb^(n+1) for a while, until the
deviation grows too large.

In other words- the LSTM in the paper fails to learn a _general_
representation of the a^nb^n language, i.e. one that holds for unbounded n.

This is typical of attempts to learn to count with deep neural nets- they
learn to count up to a few numbers above their largest training example. Then
they lose the thread.

You can test the grammar learned by Metagol on arbitrarily large numbers using
the following query:

    
    
      ?- _N = 100_000, findall(a, between(1,_N,_), _As), findall(b, between(1,_N,_),_Bs), append(_As,_Bs,_AsBs), 'S'(_AsBs,[]).
      true .
    

You can set _N to the desired size. Obviously, expect a bit of a slowdown for
larger numbers (or a mighty crash for lack of stack space).
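
If you do hit the stack limit, recent versions of Swi-Prolog let you raise it
with the stack_limit flag, given in bytes (this is a Swi-specific and fairly
recent setting; older versions use command-line stack options instead, so
check your version's documentation):

    
    
      ?- set_prolog_flag(stack_limit, 4_000_000_000).
      true.
    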

Again, note that Metagol has learned the entire language from 4 examples. The
LSTM in the paper learned a limited form from 100 samples.

Results for the LSTM are similar for a^nb^nc^n. The GRU in the paper does much
worse.

Btw, note that we basically have to take the authors' word for what their
networks are actually learning. They say they're learning to count - OK, no
reason not to believe them- but it remains their word. The first-order theory
learned by Metagol is easy to inspect and verify. The DeepMind paper I quoted
above makes that point about interpretability also (that you don't have to
speculate about what your model is actually representing, because you can
just, well, read it).

I have an a^nb^nc^n Metagol example somewhere. I'll dig it up if required.

~~~
mooneater
Thanks for the great insight.

Am I right in thinking Metagol requires all training examples to be flawless?
An LSTM can presumably handle some degree of error in its training examples.

The ideal learning system would combine these properties: sample efficiency
like Metagol's, but also some of deep learning's tolerance to errors in the
training data.

~~~
YeGoblynQueenne
Yes, classification noise is an issue, but there are ways around it and
they're not particularly complicated. For instance, the simplest thing you can
do is repeated random subsampling, which is not a big deal given the high
sample efficiency and the low training times (seconds, rather than hours, let
alone days or weeks).
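
As a rough sketch of the idea (hypothetical helper name; it assumes
Swi-Prolog's random_permutation/2 and Metagol's learn/3, which binds the
learned program to its last argument instead of printing it):

    
    
      % Sketch: learn from a random subsample of K positive examples.
      % Call repeatedly and keep the hypothesis that covers the most
      % examples overall; mislabelled examples are unlikely to survive
      % many subsamples.
      subsample_learn(Pos, Neg, K, Prog):-
          random_permutation(Pos, Shuffled)
          ,length(Sample, K)
          ,append(Sample, _Rest, Shuffled)
          ,learn(Sample, Neg, Prog).
    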

See for instance this work, where Metagol is trained on noisy image data by
random subsampling:

[https://www.doc.ic.ac.uk/~shm/Papers/logvismlj.pdf](https://www.doc.ic.ac.uk/~shm/Papers/logvismlj.pdf)

The DeepMind paper flags up ILP's issues with noisy data as a showstopper,
but like I say in my comment above, I disagree. The ILP community has found
various ways to deal with noise over the years, going back to the '90s.

If you are wondering what the downsides are of Meta-Interpretive Learning, the
real PITA with Metagol for me is the need to hand-craft inductive bias. This
is not different to choosing and fine-tuning a neural net architecture, or
choosing Bayesian priors etc, and in fact might be simpler to do in Metagol
(because inductive bias is clearly and cleanly encoded in metarules) but it's
still a pain. A couple of us are working on this currently. It's probably
impossible to do any sort of learning without _some_ kind of structural bias-
but it may be possible to figure the right kind of structure out
automatically, in some cases, under some assumptions etc etc.

I think there's certainly an "ideal system" that is some kind of "best of both
worlds" between ILP and deep learning, but I wouldn't put my money on some
single algorithm doing both things at once, like the δILP system in the
DeepMind paper. I'd put my money (and research time) on combining the two
approaches as separate modules, perhaps a deep learning module for "low-level"
perceptual tasks and a MIL module for "high-level" reasoning. That's what each
system does best, and there's no reason to try to add screw-driving
functionality to a hammer, or vice-versa.

------
DanielBMarkham
I'm not sure we read something as much as we develop a shared mental model and
associated new language with another human. Most of the time these models are
just slightly different from before, so the change is so subtle as to be
invisible. We have the appearance of a universal language called, say,
"English". We don't actually have one. It's close enough.

If you take a look around, printed words don't exist in most languages. Most
languages are spoken. Printing is an extremely new thing we've only had for a
very short amount of time. The only thing print can do is present a stilted,
over-formal version of what listening to a monotone person give a speech in a
dark room might be like. That's good enough for most cases, since the brain
makes up the stuff it needs. But I don't think it's language, at least not in the
same way real spoken languages are language. Real languages are a messy and
confusing affair, even more than printed words can be. The way it works is
through the interaction over time, not over definitions. (Something something
language games)

I wish the guys luck. I'm not so sure we understand the problem yet. We may
end up creating a machine that, after reading, makes up answers to our
questions in a way that sounds like a real person doing the work. That's cool -- but
it'd be a horrible disaster for humanity if something like that started being
the primary interaction point with people. Over time it would make us a
horribly stupid and unimaginative species. Like everything else in tech, we
have good intentions and endless optimism. We're going to solve a problem
whether we understand what it is or not, dang it. plus ça change... (And no, I
don't think giving up the idea is any good. I'm just encouraging more
understanding of the real goal versus the apparent goal)

~~~
falcor84
On the one hand I agree, but on the other hand, I'd like to sprinkle a bit
more of that endless optimism and assume that embodied cognition is the next
logical step. Perhaps once these NLP agents are put into robotic bodies and
begin interacting with our world more fully, they will "evolve" to attach
"real meaning" to their language processing.

~~~
DanielBMarkham
I love the optimism, I'd just like to see tech folks clearly delineate "This
is what we're doing. This is what it might _look like_ we're doing but we're
not." The hype cycle in tech, combined with a tech-illiterate public, isn't
such a good thing, especially in a democracy. Too often the only way we figure
things out is by getting screwed. That's a recurring behavior pattern that has
to end. Somehow.

~~~
pX0r
> Too often the only way we figure things out is by getting screwed. That's a
> recurring behavior pattern that has to end. Somehow.

That's just it - the exploration vs. exploitation trade-off. I don't believe
it can be done away with.

------
MAXPOOL
"a mouth without a brain" analogy is good one. Current NLP is impressive but
there are limits.

People have spatiotemporal model of the world, different physical models,
social and behavioral models of the world, organizational model of the
society, economic model, etc. Humans parse the language and transform it into
multiple models of the world where many indented meanings and semantics are
self-evident and it becomes "a common sense". They have crude understanding of
how fabrics, paper, gas, liquid, rubber, iron, rock, etc. behave and they
understand written text based on this more complete model zoo.

There is a similar limit in computer vision. Humans reason about 2D images
using an internal 3D model. Even if they see a completely new object shape,
they can usually infer what the other side of the object looks like using
basic symmetries and physical models.

Image understanding must eventually transform into a spatiotemporal + physical
model, and there are several approaches underway. NLP has a much harder
problem, because the problem is more abstract and complex.

------
binalpatel
Related article on the same website: [https://thegradient.pub/nlp-imagenet/](https://thegradient.pub/nlp-imagenet/)

NLP (or specifically NLP using deep learning) seems to be having a breakout
moment in the last year or so where there have been large advancements back to
back.

Generalization is hard - you're often tuning millions of parameters at once,
and often the easiest way for the model to drive the loss down is rote
memorization. It'll be interesting to see what comes about from this
discussion.

~~~
romaniv
_> NLP (or specifically NLP using deep learning) seems to be having a breakout
moment in the last year or so where there have been large advancements back to
back._

I don't know about that. Everything with deep learning seems to attract so
much hype that it's hard to measure the actual progress without being a
researcher.

On the other hand, "classic" AI projects seem to get no recognition, even when
they deliver astounding results.

For example, how many people here heard about MIT's Genesis project?

([http://groups.csail.mit.edu/genesis/](http://groups.csail.mit.edu/genesis/))

If you don't know your past, everything seems like progress.

------
topicseed
Great article! It is very true that NLP is amongst the most lagging divisions
within machine learning, mainly because text content is highly unstructured
everywhere, let alone consistent across different languages.

It is fascinating to see how things got better over the last couple of years
though!

~~~
misterman0
"Mainly because text content is very unstructured everywhere"

Encoded into this somewhat unstructured data is the very thing NLP is after:
the meaning.

If you cannot teach a machine a language by talking to it with "unstructured"
language, how are you supposed to make it understand it, grok it, and speak
it?

We have no trouble teaching small kids about history and math using
unstructured language.

Language is maybe unstructured, but the internets has lots of it, so to me,
the state of NLP today is a disappointment. I'm convinced though that general
AI will be achieved through NLP. I mean, most of what I know I have either
been told or I have read somewhere. And my parents didn't use annotated
data much.

~~~
GolDDranks
As a (former) linguist, I'm baffled by talk of language as "unstructured".
Sure, it isn't _rigidly_ structured, but "un"? There's a lot of structure to
it; it's just so complex (and so connected to extralinguistic structures) that
our ML models don't grok it!

~~~
killjoywashere
As a (never) linguist, I think the edge-case successes demonstrate how far we
really have to go in ML generally. Driving is an entirely engineered
phenomenon. Down to the species of things that might run out in front of a
car and every atmospheric perturbation, we can define every event that's
going to affect it, and cars have been designed over more than 100 years to
deal with them.

Medical imaging: another area where highly trained humans have thought
rigorously (for about the same amount of time) about what this all means.

What ML is lacking is the data and labels a baby bootstraps from into the
"common world". How many ML models have been trained on the taste of breast
milk, the smell of mother, the smell of dirt, the upward view of everything
(think about how much time babies, since the stone age, have spent laid on
their backs, looking up)?

These things have to be segregated in a fairly unsupervised way using little
more than reflexes (cry, suck, fencing, grasp, etc.) for a while: the smell of
mom sometimes comes with warm milk, but not always. Associating this warm body
with the smell, not just food. The sound of mom doesn't always come with the
smell of mom.

------
andreyk
If nothing else, you should open up this article to look through the images
just below the intro - it's often surprising, especially to non-AI-researchers,
how unintelligent these learned NLP models actually are.

------
marcoperaza
This isn't an NLP problem, it's an AI problem. Despite all the hype, we
haven't actually made any progress toward solving strong AI, human-like
general intelligence. NLP, computer vision, and many other AI problems are
probably AI-complete, meaning that solving them entails solving strong AI. And
the inability to generalize is exactly what separates the partial solutions we
currently have from strong AI.

------
akshayB
Recently there has been a lot of interesting research in the field of NLP.
Lots of new models, algorithms and techniques have been developed, along with
lots of published research papers. Personally, I feel that NLP is now heading
toward the kind of aha moment we had with ImageNet.

------
taeric
A thought I had the other day was how much of a driver our need for rhythm
and rhyme was in the development of synonyms. Which led me to thinking how
much language is driven by fun, as much as by any real logic. It seems word
choice for any given topic is dominated by early movers and then just plain
pleasant sound.

To that end, any system that is just looking for the logic of an underlying
system is going to be stymied by the fundamental lack of it in language. There
is the appearance of logic. And it can actually get you quite far. However, I
question if it can encompass all of it. At least until the next best seller
comes out.

To that end, I do not find it surprising that the best model will be an
ensemble model. More, as long as we are dealing with English (a non-
pictographic and a non-phonetic language), then we are ultimately trying to
build a model on top of a malleable base where appeals to the logic of
yesterday in a language have a non-zero chance of failing today. (Edit to make
this point clearer, it is amusing to me how many lines in classic Shakespeare
no longer pop in the same way for current language. "Now is it Rome indeed,
and room enough". That a pair of words could cease to be homophones is not
something that seems at all obvious. Of course, I have not studied this in
decades, so mayhap my grade school lied to me.)

------
jostmey
We basically expect deep neural networks to acquire language without a body
they can use to interact with an environment. I believe you cannot separate
language from the body or the environment. A corpus of text is simply not
holistic enough to capture the meaning behind the symbols.

~~~
wodenokoto
How would you design a training scenario with virtual agent(s)?

------
mnsc
Many miles down the path where hours (slowly) lay to rest, I can imagine a
vivid portrait being hung in a tree outside a dungeon. Of course created with
non-existing poisonous paint. A painting of good natured, voluptuous beings
huddled down and whispering to each other. Whispers that should then be set
free to travel the invisible rails of the web trainset everyone plays with
almost everywhere. The trainset that the strict parents watch without
blinking their tiny hidden trinoculars. Small talk carefully elaborated and
twisted to do with the ultimate meaning as you did with the dead cat, in the
backyard, but not only buried within but hidden with fragrant leaves that
haven't been seen, smelled or heard before. Later, another image can be
conjured of the train with heavy soldiers that haven't gotten a single flag on
the journey, embarking, and upon inspection of their long seen captain,
immediately recognized for their true unchanging self.

