Actually, I wasn't thinking of any specific methods, but now that you
mentioned it, Inductive Logic Programming (the subject of my PhD; see my
comment about being biased) is a fine example.
For a slightly more impartial opinion, here's a DeepMind paper that performs
neural ILP: https://deepmind.com/blog/learning-explanatory-rules-noisy-d...
The authors begin by extolling the virtues of ILP, including its
generalisation abilities, as follows:
Second, ILP systems tend to be impressively data-efficient, able to generalise
well from a small handful of examples. [1]
You can find more references to the generalisation power of ILP algorithms
sprinkled throughout that text, and in any case the entire paper is about
getting the "best of both worlds" between ILP's generalisation,
interpretability, ability for transfer learning and data efficiency, and deep
learning's robustness to noise and handling of non-symbolic data (I disagree
with the authors about these last two bits, but, OK).
For my part, below is an example of learning a general form of the
(context-free) a^nb^n grammar from 4 positive and 0 negative examples, using
the Meta-Interpretive Learning system Metagol (a state-of-the-art ILP learner,
referenced in the DeepMind paper; my PhD research is based on Metagol). You
can clone Metagol from its github page:
https://github.com/metagol/metagol
Metagol is written in Prolog. To run the example, you'll need a Prolog
interpreter, either Yap [2] or Swi-Prolog [3]. And Metagol.
Copy the code below into a text file, call it something like "anbn.pl" and
place it, e.g. in the "examples" directory in metagol's root directory.
% Load metagol
:-['../metagol']. % e.g. place in metagol/examples.
% Second-order metarules providing inductive bias
metarule([P,Q,R], ([P,A,B]:- [[Q,A,C],[R,C,B]])).
metarule([P,Q,R], ([P,A,D]:- [[Q,A,B],[P,B,C],[R,C,D]])).
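% The first metarule above is the standard 'chain' metarule, P(A,B):- Q(A,C),R(C,B);
% the second is a recursive variant that lets P call itself between Q and R.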
% Grammar terminals, provided as background knowledge
'A'([a|A], A).
'B'([b|A], A).
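% These are difference-list definitions: 'A'/2 consumes a single a from the front
% of a list, e.g. 'A'([a,b],Rest) binds Rest = [b]; 'B'/2 does the same for b.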
% Terminals actually declared as background knowledge primitives
prim('A'/2).
prim('B'/2).
% Code to start training
learn_an_bn:-
        % Example sentences in the a^nb^n language
        Pos = ['S'([a,a,b,b],[])
              ,'S'([a,b],[])
              % ^^ Place second to learn clauses in terminating order
              ,'S'([a,a,a,b,b,b],[])
              ,'S'([a,a,a,a,b,b,b,b],[])
              ]
        % You can actually learn _without_ any negative examples.
        ,Neg = []
        ,learn(Pos, Neg).
Load the file into Prolog, then start training by calling learn_an_bn, as in
the session sketched below. That should take a millisecond or two, on an
ordinary laptop.
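Roughly, starting Swi-Prolog from inside metagol/examples, the session goes
like this (the two 'S'/2 clauses below are the expected result for this
example; the exact form of the printout may vary between Metagol versions):
?- [anbn].
true.
?- learn_an_bn.
'S'(A,B):-'A'(A,C),'B'(C,B).
'S'(A,B):-'A'(A,C),'S'(C,D),'B'(D,B).
true.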
You can test the results by copy/pasting the two clauses of the predicate
'S'/2 into a Prolog file (anbn.pl will do fine), (re)loading it and running
a few queries like the following:
?- 'S'(A,[]). % Run as generator
A = [a, b] ;
A = [a, a, b, b] ;
A = [a, a, a, b, b, b] ;
A = [a, a, a, a, b, b, b, b] ;
A = [a, a, a, a, a, b, b, b, b|...] ;
?- 'S'([a,a,b,b],[]). % Run as acceptor
true .
?- 'S'([a,a,b,b,c],[]). % Run as acceptor with invalid string
false.
?- 'S'([a,a,b,b,c],Rest). % Split the string to valid + suffix (Rest)
Rest = [c] .
Note that the learned grammar is a general form of a^nb^n; for example, it
accepts strings it has never seen in testing (let alone training):
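(Assuming the two learned 'S'/2 clauses are loaded; n = 6 appears in neither
the four training examples nor the test queries above.)
?- 'S'([a,a,a,a,a,a,b,b,b,b,b,b],[]).
true .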
In any case, it's just a couple of first-order rules so it can be readily
inspected to judge whether it's as general an a^nb^n grammar as can be, or not.
I guess you might not be much impressed by mere learning of a puny little
grammar of a's and b's. You might be slightly more impressed if you know that
learning a Context-Free language from only positive examples is actually
impossible [4]. Metagol learns it thanks to the strong inductive bias provided
by the two second-order metarules, at the start of the example. But, that's
another huge can of worms. You asked me about generalisation :)
btw, no, you can't learn a^nb^n with deep learning, or anything else I'm
aware of. The NLP people here should be able to confirm this.
> You might be slightly more impressed if you know that learning a Context-Free language from only positive examples is actually impossible.
Isn’t this kind of obvious, since there’s no way to distinguish the true grammar from the grammar accepting all strings (and thus any positive examples)?
LSTMs can't learn a^nb^n or a^nb^nc^n, nor can they learn to count, and that paper shows why (because they generalise poorly).
From the paper (section 5, Experimental Results):
>> 2. These LSTMs generalize to much higher n than seen in the training set
(though not infinitely so).
The next page, under the heading Results, further explains that on a^nb^n the
LSTM generalises "well" up to n = 256, after which it accumulates a deviation
that makes it reject a^nb^n but accept a^nb^(n+1) for a while, until the
deviation grows.
In other words, the LSTM in the paper fails to learn a general representation
of a^nb^n, i.e. one that holds for unbounded n.
This is typical of attempts to learn to count with deep neural nets: they
learn to count up to a few numbers above their largest training example, then
they lose the thread.
You can test the grammar learned by Metagol on arbitrarily large numbers
using the following query:
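For example, a query along these lines (any query that builds a list of _N
a's followed by _N b's and hands it to 'S'/2 will do; the underscore-prefixed
variables keep Swi-Prolog from printing two huge lists):
?- _N = 10000, length(_As, _N), maplist(=(a), _As), length(_Bs, _N), maplist(=(b), _Bs), append(_As, _Bs, _S), 'S'(_S, []).
true .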
You can set _N to the desired size. Obviously, expect a bit of a slowdown for larger numbers (or a mighty crash for lack of stack space).
Again, note that Metagol has learned the entire language from 4 examples. The LSTM in
the paper learned a limited form from 100 samples.
Results for the LSTM are similar for a^nb^nc^n. The GRU in the paper does much worse.
Btw, note that we basically have to take the authors' word for what their
networks are actually learning. They say the networks learn to count, and
there's no reason not to believe them, but their word is still all we have to
go on.
The first-order theory learned by Metagol is easy to inspect and verify. The
DeepMind paper I quoted above also makes that point about interpretability
(that you don't have to speculate about what your model is actually
representing, because you can just, well, read it).
I have an a^nb^nc^n Metagol example somewhere. I'll dig it up if required.
Am I right in thinking Metagol requires all training examples to be flawless?
An LSTM can presumably handle some degree of erroneous training examples.
The ideal learning system would combine these properties: sample efficiency
like Metagol's, but also some tolerance to errors in the training data, like
deep learning's.
Yes, classification noise is an issue, but there are ways around it and
they're not particularly complicated. For instance, the simplest thing you can
do is repeated random subsampling, which is not a big deal given the high
sample efficiency and the low training times (seconds, rather than hours, let
alone days or weeks).
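To make the idea concrete, here is a minimal sketch of repeated random
subsampling (my own illustration, not the code from the work mentioned
below), assuming Swi-Prolog and Metagol's learn/2 from the anbn example
above:
% Learn from K random subsamples of size N drawn from a (possibly noisy) set
% of positive examples, printing each learned program; hypotheses that keep
% coming back across subsamples are the ones to trust.
:- use_module(library(random)). % random_permutation/2
subsample_learn(Pos, Neg, K, N):-
        forall(between(1, K, _)
              ,(random_permutation(Pos, Shuffled)
               ,length(Sample, N)
               ,append(Sample, _Rest, Shuffled)
               ,ignore(learn(Sample, Neg)) % skip subsamples with no consistent program
               )
              ).
e.g. subsample_learn(Pos, [], 20, 3) would run twenty learning attempts on
three-example subsamples of Pos.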
See for instance this work, where Metagol is trained on
noisy image data by random subsampling:
The DeepMind paper flags up ILP's issues with noisy data as a show stopper,
but like I say in my comment above, I disagree. The ILP community has found
various ways to deal with noise over the years since the '90s.
If you are wondering what the downsides of Meta-Interpretive Learning are, the
real PITA with Metagol, for me, is the need to hand-craft inductive bias. This
is no different from choosing and fine-tuning a neural net architecture, or
choosing Bayesian priors, etc., and in fact it might be simpler to do in
Metagol (because inductive bias is clearly and cleanly encoded in metarules),
but it's still a pain. A couple of us are working on this currently. It's
probably impossible to do any sort of learning without some kind of structural
bias, but it may be possible to figure the right kind of structure out
automatically, in some cases, under some assumptions, etc.
I think there's certainly an "ideal system" that is some kind of "best of both
worlds" between ILP and deep learning, but I wouldn't put my money on some
single algorithm doing both things at once, like the δILP system in the
DeepMind paper. I'd put my money (and research time) on combining the two
approaches as separate modules, perhaps a deep learning module for "low-level"
perceptual tasks and a MIL module for "high-level" reasoning. That's what each
system does best, and there's no reason to try to add screw-driving
functionality to a hammer, or vice-versa.
_________________________
[1] https://arxiv.org/pdf/1711.04574.pdf
[2] http://www.dcc.fc.up.pt/~vsc/Yap/ (Yap is fastest)
[3] http://www.swi-prolog.org/ (Swi has more features)
[4] https://scholar.google.gr/scholar?hl=en&as_sdt=0%2C5&q=langu...
Well, actually, it is possible, but you need infinitely many examples or an oracle that already knows the language.