Actually, I wasn't thinking of any specific methods, but now that you
mentioned it, Inductive Logic Programming (the subject of my PhD; see my
comment about being biased) is a fine example.
For a slightly more impartial opinion, here's a DeepMind paper that performs
neural ILP: https://deepmind.com/blog/learning-explanatory-rules-noisy-d...
The authors begin by extolling the virtues of ILP, including its
generalisation abilities, as follows:
Second, ILP systems tend to be impressively data-efficient, able to generalise
well from a small handful of examples. [1]
You can find more references to the generalisation power of ILP algorithms
sprinkled throughout that text, and in any case the entire paper is about
getting the "best of both worlds" between ILP's generalisation,
interpretability, ability for transfer learning and data efficiency, and deep
learning's robustness to noise and handling of non-symbolic data (I disagree
with the authors about these last two bits, but, OK).
For my part, below is an example of learning a general form of the
(context-free) a^nb^n grammar from 4 positive and 0 negative examples, using
the Meta-Interpretive Learning system Metagol (a state-of-the-art ILP learner,
referenced in the DeepMind paper; my PhD research is based on Metagol). You
can clone Metagol from its github page:
https://github.com/metagol/metagol
Metagol is written in Prolog. To run the example, you'll need a Prolog
interpreter, either Yap [2] or Swi-Prolog [3]. And Metagol.
Copy the code below into a text file, call it something like "anbn.pl" and
place it, e.g. in the "examples" directory in metagol's root directory.
% Load metagol
:-['../metagol']. % e.g. place in metagol/examples.
% Second-order metarules providing inductive bias
metarule([P,Q,R], ([P,A,B]:- [[Q,A,C],[R,C,B]])).
metarule([P,Q,R], ([P,A,D]:- [[Q,A,B],[P,B,C],[R,C,D]])).
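% The first metarule above is the standard 'chain' metarule, P(A,B):- Q(A,C),R(C,B);
% the second is a recursive variant that lets P call itself between Q and R.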
% Grammar terminals, provided as background knowledge
'A'([a|A], A).
'B'([b|A], A).
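% These are difference-list definitions: 'A'/2 consumes a single a from the front
% of a list, e.g. 'A'([a,b],Rest) binds Rest = [b]; 'B'/2 does the same for b.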
% Terminals actually declared as background knowledge primitives
prim('A'/2).
prim('B'/2).
% Code to start training
learn_an_bn:-
        % Example sentences in the a^nb^n language
        Pos = ['S'([a,a,b,b],[])
              ,'S'([a,b],[])
              % ^^ Place second to learn clauses in terminating order
              ,'S'([a,a,a,b,b,b],[])
              ,'S'([a,a,a,a,b,b,b,b],[])
              ]
        % You can actually learn _without_ any negative examples.
        ,Neg = []
        ,learn(Pos, Neg).
Load the file into Prolog, then start training by calling learn_an_bn, as in
the session sketched below. That should take a millisecond or two, on an
ordinary laptop.
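Roughly, starting Swi-Prolog from inside metagol/examples, the session goes
like this (the two 'S'/2 clauses below are the expected result for this
example; the exact form of the printout may vary between Metagol versions):
?- [anbn].
true.
?- learn_an_bn.
'S'(A,B):-'A'(A,C),'B'(C,B).
'S'(A,B):-'A'(A,C),'S'(C,D),'B'(D,B).
true.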
You can test the results by copy/pasting the two clauses of the predicate
'S'/2 into a Prolog file (anbn.pl will do fine), (re)loading it and running
a few queries like the following:
?- 'S'(A,[]). % Run as generator
A = [a, b] ;
A = [a, a, b, b] ;
A = [a, a, a, b, b, b] ;
A = [a, a, a, a, b, b, b, b] ;
A = [a, a, a, a, a, b, b, b, b|...] ;
?- 'S'([a,a,b,b],[]). % Run as acceptor
true .
?- 'S'([a,a,b,b,c],[]). % Run as acceptor with invalid string
false.
?- 'S'([a,a,b,b,c],Rest). % Split the string to valid + suffix (Rest)
Rest = [c] .
Note that the learned grammar is a general form of a^nb^n; for example, it
accepts strings it has never seen in testing (let alone training):
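(Assuming the two learned 'S'/2 clauses are loaded; n = 6 appears in neither
the four training examples nor the test queries above.)
?- 'S'([a,a,a,a,a,a,b,b,b,b,b,b],[]).
true .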
In any case, it's just a couple of first-order rules so it can be readily
inspected to judge whether it's as general an a^nb^n grammar as can be, or not.
I guess you might not be much impressed by mere learning of a puny little
grammar of a's and b's. You might be slightly more impressed if you know that
learning a Context-Free language from only positive examples is actually
impossible [4]. Metagol learns it thanks to the strong inductive bias provided
by the two second-order metarules, at the start of the example. But, that's
another huge can of worms. You asked me about generalisation :)
btw, no, you can't learn a^nb^n with deep learning, or anything else I'm
aware of. The NLP people here should be able to confirm this.
> You might be slightly more impressed if you know that learning a Context-Free language from only positive examples is actually impossible.
Isn’t this kind of obvious, since there’s no way to distinguish the true grammar from the grammar accepting all strings (and thus any positive examples)?
LSTMs can't learn a^nb^n or a^nb^nc^n, nor can they learn to count, and that paper shows why (because they generalise poorly).
From the paper (section 5, Experimental Results):
>> 2. These LSTMs generalize to much higher n than seen in the training set
(though not infinitely so).
The next page, under the heading Results, further explains that on a^nb^n the
LSTM generalises "well" up to n = 256, after which it accumulates a deviation
that makes it reject a^nb^n but accept a^nb^(n+1) for a while, until the
deviation grows.
In other words, the LSTM in the paper fails to learn a general representation
of a^nb^n, i.e. one that holds for unbounded n.
This is typical of attempts to learn to count with deep neural nets: they
learn to count up to a few numbers above their largest training example, then
they lose the thread.
You can test the grammar learned by Metagol on arbitrarily large numbers
using the following query:
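For example, a query along these lines (any query that builds a list of _N
a's followed by _N b's and hands it to 'S'/2 will do; the underscore-prefixed
variables keep Swi-Prolog from printing two huge lists):
?- _N = 10000, length(_As, _N), maplist(=(a), _As), length(_Bs, _N), maplist(=(b), _Bs), append(_As, _Bs, _S), 'S'(_S, []).
true .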
You can set _N to the desired size. Obviously, expect a bit of a slowdown for larger numbers (or a mighty crash for lack of stack space).
Again, note that Metagol has learned the entire language from 4 examples. The LSTM in
the paper learned a limited form from 100 samples.
Results for the LSTM are similar for a^nb^nc^n. The GRU in the paper does much worse.
Btw, note that we basically have to take the authors' word for what their
networks are actually learning. They say the networks learn to count, and
there's no reason not to believe them, but their word is still all we have to
go on.
The first-order theory learned by Metagol is easy to inspect and verify. The
DeepMind paper I quoted above also makes that point about interpretability
(that you don't have to speculate about what your model is actually
representing, because you can just, well, read it).
I have an a^nb^nc^n Metagol example somewhere. I'll dig it up if required.
Am I right in thinking Metagol requires all training examples to be flawless?
An LSTM can presumably handle some degree of erroneous training examples.
The ideal learning system would combine these properties: sample efficiency
like Metagol's, but also some tolerance to errors in the training data, like
deep learning's.
Yes, classification noise is an issue, but there are ways around it and
they're not particularly complicated. For instance, the simplest thing you can
do is repeated random subsampling, which is not a big deal given the high
sample efficiency and the low training times (seconds, rather than hours, let
alone days or weeks).
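To make the idea concrete, here is a minimal sketch of repeated random
subsampling (my own illustration, not the code from the work mentioned
below), assuming Swi-Prolog and Metagol's learn/2 from the anbn example
above:
% Learn from K random subsamples of size N drawn from a (possibly noisy) set
% of positive examples, printing each learned program; hypotheses that keep
% coming back across subsamples are the ones to trust.
:- use_module(library(random)). % random_permutation/2
subsample_learn(Pos, Neg, K, N):-
        forall(between(1, K, _)
              ,(random_permutation(Pos, Shuffled)
               ,length(Sample, N)
               ,append(Sample, _Rest, Shuffled)
               ,ignore(learn(Sample, Neg)) % skip subsamples with no consistent program
               )
              ).
e.g. subsample_learn(Pos, [], 20, 3) would run twenty learning attempts on
three-example subsamples of Pos.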
See for instance this work, where Metagol is trained on
noisy image data by random subsampling:
The DeepMind paper flags up ILP's issues with noisy data as a show stopper,
but like I say in my comment above, I disagree. The ILP community has found
various ways to deal with noise over the years since the '90s.
If you are wondering what the downsides of Meta-Interpretive Learning are, the
real PITA with Metagol, for me, is the need to hand-craft inductive bias. This
is no different from choosing and fine-tuning a neural net architecture, or
choosing Bayesian priors, etc., and in fact it might be simpler to do in
Metagol (because inductive bias is clearly and cleanly encoded in metarules),
but it's still a pain. A couple of us are working on this currently. It's
probably impossible to do any sort of learning without some kind of structural
bias, but it may be possible to figure the right kind of structure out
automatically, in some cases, under some assumptions, etc.
I think there's certainly an "ideal system" that is some kind of "best of both
worlds" between ILP and deep learning, but I wouldn't put my money on some
single algorithm doing both things at once, like the δILP system in the
DeepMind paper. I'd put my money (and research time) on combining the two
approaches as separate modules, perhaps a deep learning module for "low-level"
perceptual tasks and a MIL module for "high-level" reasoning. That's what each
system does best, and there's no reason to try to add screw-driving
functionality to a hammer, or vice-versa.
_________________________
[1] https://arxiv.org/pdf/1711.04574.pdf
[2] http://www.dcc.fc.up.pt/~vsc/Yap/ (Yap is fastest)
[3] http://www.swi-prolog.org/ (Swi has more features)
[4] https://scholar.google.gr/scholar?hl=en&as_sdt=0%2C5&q=langu...
Well, actually, it is possible, but you need infinitely many examples or an oracle that already knows the language.