If you asked me (you shouldn't, I'm biased), I'd tell you that we're never going to get deep
neural nets to behave well enough to learn meaning. Those damn things are way
too smart for us. They're so smart that they can always find the easiest,
dumbest way to map inputs to outputs- by simply overfitting to their training
set (or their entire data set, if cross-validation is thorough enough). You
can see lots of examples of that in the article above.
The big promise of deep learning was (and is) that it would free us from the
drudgery of feature engineering- but the amount of work you need to do to
convince a deep net to learn what you want it to, rather than what it wants to
learn, is starting to approach that of hand-crafting features.
And we still don't have language models that make any sense at all.
My understanding is that current language models are very good at capturing the syntactic structure of the sentences, but fail when modeling would require common sense or additional knowledge sources. Is this correct?
For a slightly more impartial opinion, here's a DeepMind paper that performs
neural ILP: https://deepmind.com/blog/learning-explanatory-rules-noisy-d...
The authors begin by extolling the virtues of ILP, including its
generalisation abilities, as follows:
Second, ILP systems tend to be impressively data-efficient, able to generalise
well from a small handful of examples. 
You can find more references to the generalisation power of ILP algorithms
sprinkled throughout that text; in any case, the entire paper is about
getting the "best of both worlds": ILP's generalisation, interpretability,
capacity for transfer learning and data efficiency, and deep learning's
robustness to noise and handling of non-symbolic data (I disagree with the
authors about those last two, but OK).
For my part, below is an example of learning a general form of the (context-free) a^nb^n
grammar from 4 positive and 0 negative examples, using the
Meta-Interpretive Learning system Metagol (a state-of-the-art ILP learner,
referenced in the DeepMind paper; my PhD research is based on Metagol). You can clone Metagol from its GitHub page.
Metagol is written in Prolog. To run the example, you'll need a Prolog
interpreter, either Yap or Swi-Prolog (links at the end of this comment). And Metagol itself.
Copy the code below into a text file, call it something like "anbn.pl" and
place it, e.g. in the "examples" directory in metagol's root directory.
% Load metagol
:-['../metagol']. % e.g. place in metagol/examples.

% Second-order metarules providing inductive bias
metarule([P,Q,R], ([P,A,B]:- [[Q,A,C],[R,C,B]])).
metarule([P,Q,R], ([P,A,D]:- [[Q,A,B],[P,B,C],[R,C,D]])).

% Grammar terminals, provided as background knowledge
'A'([a|T],T).
'B'([b|T],T).

% Terminals actually declared as background knowledge primitives
prim('A'/2).
prim('B'/2).

% Code to start training
:-  % Example sentences in the a^nb^n language
    Pos = ['S'([a,a,b,b],[])
          ,'S'([a,b],[])
          % ^^ Place second to learn clauses in terminating order
          ,'S'([a,a,a,b,b,b],[])
          ,'S'([a,a,a,a,b,b,b,b],[])
          ]
    % You can actually learn _without_ any negative examples.
    ,Neg = []
    ,learn(Pos,Neg).

Consulting the file starts training; Metagol prints the learned program:

% learning S/2
% clauses: 1
% clauses: 2
'S'(A,B):-'A'(A,C),'B'(C,B).
'S'(A,B):-'A'(A,C),'S'(C,D),'B'(D,B).
You can test the results by copy/pasting the two clauses of the predicate
'S'/2 into a Prolog file (anbn.pl will do fine), (re)loading it and running
a few queries like the following:
?- 'S'(A,[]). % Run as generator
A = [a, b] ;
A = [a, a, b, b] ;
A = [a, a, a, b, b, b] ;
A = [a, a, a, a, b, b, b, b] ;
A = [a, a, a, a, a, b, b, b, b|...] ;

?- 'S'([a,a,b,b],[]). % Run as acceptor
true .

?- 'S'([a,a,b,b,c],[]). % Run as acceptor with invalid string
false.

?- 'S'([a,a,b,b,c],Rest). % Split the string to valid prefix + suffix (Rest)
Rest = [c] .
I guess you might not be much impressed by mere learning of a puny little
grammar of a's and b's. You might be slightly more impressed if you know that
learning a Context-Free language from only positive examples is actually
impossible. Metagol learns it thanks to the strong inductive bias provided
by the two second-order metarules, at the start of the example. But, that's
another huge can of worms. You asked me about generalisation :)
btw, no, you can't learn a^nb^n with deep learning- or anything else I'm aware
of. The NLP people here should be able to confirm this.
 http://www.dcc.fc.up.pt/~vsc/Yap/ (Yap is fastest)
 http://www.swi-prolog.org/ (Swi has more features)
Well, actually, it is possible - but you need infinite examples or an Oracle
already knowing the language.
Isn’t this kind of obvious, since there’s no way to distinguish the true grammar from the grammar accepting all strings (and thus any positive examples)?
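To make that concrete, here is the trivial all-accepting grammar, written in the same difference-list style as the learned 'S'/2 above (the predicate name 'T' is made up); it covers any set of positive examples you could ever provide:

'T'(S,S).
'T'([a|Xs],R) :- 'T'(Xs,R).
'T'([b|Xs],R) :- 'T'(Xs,R).

?- 'T'([a,a,b,b],[]). % accepts every a^nb^n string...
true .

?- 'T'([b,a],[]). % ...but also every other string over {a,b}.
true .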
From the paper (section 5, Experimental Results):
>> 2. These LSTMs generalize to much higher n than seen in the training set
(though not infinitely so).
The next page, under the heading Results, further explains that on a^nb^n the
LSTM generalises "well" up to n = 256, after which it accumulates a deviation
making it reject a^nb^n but recognise a^nb^n+1 for a while, until the deviation
grows too large.
In other words, the LSTM in the paper fails to learn a general
representation of the a^nb^n language, i.e. one that holds for unbounded n.
This is typical of attempts to learn to count with deep neural nets- they learn to count up to a few numbers above their largest training example. Then they lose the thread.
You can test the grammar learned by Metagol on arbitrarily large numbers
using the following query:
?- _N = 100_000, findall(a, between(1,_N,_), _As), findall(b, between(1,_N,_),_Bs), append(_As,_Bs,_AsBs), 'S'(_AsBs,[]).
true .
Again, note that Metagol has learned the entire language from 4 examples. The LSTM in
the paper learned a limited form from 100 samples.
Results for the LSTM are similar for a^nb^nc^n. The GRU in the paper does much worse.
Btw, note that we basically have to take the authors' word for what their
networks are actually learning. They say they're learning to count; OK, no
reason not to believe them, but you do have to take them at their word.
The first-order theory learned by Metagol is easy to inspect and verify. The
DeepMind paper I quoted above makes that point about interpretability also
(that you don't have to speculate about what your model is actually
representing, because you can just, well, read it).
I have an a^nb^nc^n Metagol example somewhere. I'll dig it up if required.
Am I right in thinking Metagol requires all training examples to be flawless? An LSTM presumably can handle some degree of error in its training examples.
The ideal learning system would combine these properties: sample efficiency more like Metagol but also some degree of tolerance to errors in training data like deep learning.
See for instance the work where Metagol is trained on noisy image data by
random subsampling.
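The rough idea, as a minimal sketch (my paraphrase, not the actual code of that work; it assumes Metagol's learn/3, the variant of learn/2 that binds the learned program instead of printing it, and Swi-Prolog's random_permutation/2 and clumped/2): learn from many small random subsamples of the noisy examples and keep the hypothesis that recurs most often, since programs induced from noise-free subsamples should keep turning up:

% Hypothetical helper: learn K times from random size-N subsamples of
% (possibly noisy) positive examples; the most frequently learned
% program wins.
subsample_learn(Pos,Neg,K,N,Prog) :-
    findall(P
           ,(between(1,K,_)
            ,random_permutation(Pos,Pos_)
            ,length(Sub,N)
            ,append(Sub,_,Pos_)    % take the first N shuffled examples
            ,learn(Sub,Neg,P)
            )
           ,Progs)
    ,msort(Progs,Sorted)
    ,clumped(Sorted,Counts)        % run-length encode into Prog-Count pairs
    ,sort(2,@>=,Counts,[Prog-_|_]).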
The DeepMind paper flags up ILP's issues with noisy data as a show stopper,
but like I say in my comment above, I disagree. The ILP community has found
various ways to deal with noise over the years since the '90s.
If you are wondering what the downsides are of Meta-Interpretive Learning, the
real PITA with Metagol for me is the need to hand-craft inductive bias. This
is no different from choosing and fine-tuning a neural net architecture, or choosing
Bayesian priors etc, and in fact might be simpler to do in Metagol (because
inductive bias is clearly and cleanly encoded in metarules) but it's still a
pain. A couple of us are working on this currently. It's probably impossible
to do any sort of learning without some kind of structural bias- but it may
be possible to figure the right kind of structure out automatically, in some
cases, under some assumptions etc etc.
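To give a flavour of what that hand-crafting involves: in Metagol the inductive bias lives entirely in the metarules. The chain metarule in my a^nb^n example corresponds to Chomsky normal form productions P -> Q R, so even dropping the second, recursive metarule wouldn't make a^nb^n inexpressible; it would just force Metagol to invent a predicate to stand in for the middle recursion, at the cost of a larger search. The target theory under chain-only bias would look like this ('S_1' is a made-up name; Metagol names invented predicates automatically):

% a^nb^n acceptor using only chain-shaped clauses, with one
% invented predicate replacing the doubly-recursive clause:
'S'(A,B) :- 'A'(A,C), 'B'(C,B).
'S'(A,B) :- 'A'(A,C), 'S_1'(C,B).
'S_1'(A,B) :- 'S'(A,C), 'B'(C,B).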
I think there's certainly an "ideal system" that is some kind of "best of both
worlds" between ILP and deep learning, but I wouldn't put my money on some
single algorithm doing both things at once, like the δILP system in the
DeepMind paper. I'd put my money (and research time) on combining the two
approaches as separate modules, perhaps a deep learning module for "low-level"
perceptual tasks and a MIL module for "high-level" reasoning. That's what each
system does best, and there's no reason to try to add screw-driving
functionality to a hammer, or vice-versa.
If you take a look around, printed words don't exist in most languages. Most languages are spoken. Printing is an extremely new thing we've only had for a very short amount of time. The only thing print can do is present a stilted, over-formal version of what listening to a monotone person give a speech in a dark room might be like. That's good enough for most cases, since the brain makes up the stuff it needs. But I don't think it's language, at least not in the same way real spoken languages are language. Real languages are a messy and confusing affair, even more than printed words can be. The way it works is through interaction over time, not through definitions. (Something something language games)
I wish the guys luck. I'm not so sure we understand the problem yet. We may end up creating a machine that, after reading, makes up answers to our questions such that it sounds like a real person is doing the work. That's cool -- but it'd be a horrible disaster for humanity if something like that became the primary interaction point with people. Over time it would make us a horribly stupid and unimaginative species. Like everything else in tech, we have good intentions and endless optimism. We're going to solve a problem whether we understand what it is or not, dang it. Plus ça change... (And no, I don't think giving up the idea is any good. I'm just encouraging more understanding of the real goal versus the apparent goal.)
That's just it: the exploration vs. exploitation trade-off. I don't believe this can be brushed aside.
We create meaning through human interaction. It's a very dynamic and fuzzy experience. We make it out to be a lot more concrete than it actually is. (Part of that is because of the rise of written language, which gives the illusion that if you can point at text on a page, that text somehow represents some concept in an absolute fashion. It does not.) Add programming to the mix, which is an almost entirely mathematical construct, and it's quite easy to get the wrong idea.
This is the kind of thing programmers have blind spots about, and frankly it drives a lot of people crazy. So it's exactly the kind of domain where well-intentioned people can totally fuck things up for large portions of humanity. I'm not really crazy about seeing another social-media-level "oops!" in my lifetime.
People have a spatiotemporal model of the world, different physical models, social and behavioral models of the world, an organizational model of society, an economic model, etc. Humans parse language and transform it into multiple models of the world in which many intended meanings and semantics are self-evident, and it becomes "common sense". They have a crude understanding of how fabrics, paper, gas, liquid, rubber, iron, rock, etc. behave, and they understand written text based on this more complete model zoo.
There is a similar limit in computer vision. Humans reason about 2D images using
an internal 3D model. Even if they see a completely new object shape, they can usually infer what the other side of the object looks like using basic symmetries and physical models.
Image understanding must eventually transform into spatiotemporal + physical model and there are several approaches underway. NLP has much harder problem, because the problem is more abstract and complex.
NLP (or specifically NLP using deep learning) seems to be having a breakout moment in the last year or so where there have been large advancements back to back.
Generalization is hard - you're often tuning millions of parameters at once, and often the most "sane" thing for the loss function to do is rote memorization. It'll be interesting to see what comes about from this discussion.
I don't know about that. Everything with deep learning seems to attract so much hype that it's hard to measure the actual progress without being a researcher.
On the other hand, "classic" AI projects seem to get no recognition, even when they deliver astounding results.
For example, how many people here heard about MIT's Genesis project?
If you don't know your past, everything seems like progress.
It is fascinating to see how things got better over the last couple of years though!
Encoded into this somewhat unstructured data is the very thing NLP is after: the meaning.
If you cannot teach a machine a language by talking to it with "unstructured" language, how are you supposed to make it understand it, grok it, and speak it?
We have no trouble teaching small kids about history and math using unstructured language.
Language is maybe unstructured, but the internets has lots of it, so to me, the state of NLP today is a disappointment. I'm convinced though that general AI will be achieved through NLP. I mean, most of what I know I have either been told or I have read somewhere. And my parents didn't use annotated data much.
Medical imaging: another area where highly trained humans have thought rigorously (for about the same amount of time) about what this all means.
What ML is lacking is the data and labels a baby bootstraps from into the "common world". How many ML models have been trained on the taste of breast milk, the smell of mother, the smell of dirt, the upward view of everything? (Think about how much time babies, since the stone age, have spent lying on their backs, looking up.)
These things have to be segregated in a fairly unsupervised way using little more than reflexes (cry, suck, fencer, grasp, etc) for a while: smell of mom sometimes comes with warm milk, but not always. Associating this warm body with the smell, not just food. Sound of mom doesn't always come with smell of mom.
It's a (imprecise) term in Computer Science which may not refer to the same thing you are thinking about. Hence, the confusion.
The problem is that we don't know how meaning is encoded into language utterances and we don't know how meaning is represented, once it's decoded from those utterances (i.e. in our minds). It's very unlikely that these elements, the encoding process that turns meaning into language and back again and the representation of meaning, are carried around in language utterances themselves. And yet we keep trying to figure out both, the encoding process and the representation, just by looking at the encoded utterances.
Imagine having a compressed string and trying to figure out a) the compression algorithm and b) the uncompressed string, without having ever seen examples of either. That's what natural language understanding from raw, unstructured text is like.
 Edit: Is that even possible? Is it possible to send an encoded message including its own encoding procedure, so that the message can be decoded even when the procedure is not known beforehand? Wouldn't that require that the procedure is somehow possible to decode independently of the message? Is there another way?
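Here is a toy version of that underdetermination in Prolog (both "decoders" are made up for illustration): two different decoding procedures are consistent with the same observed utterance, and nothing in the utterance itself tells you which one produced it.

% The utterance [a,b] "means" different things under different decoders.
decode_identity(Utterance,Utterance).    % meaning = the string itself
decode_double([],[]).
decode_double([X|Xs],[X,X|Ys]) :-        % meaning = each symbol doubled
    decode_double(Xs,Ys).

% ?- decode_identity([a,b],Meaning).  Meaning = [a,b]
% ?- decode_double([a,b],Meaning).    Meaning = [a,a,b,b]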
>> we keep trying to figure out both, the encoding process and the representation, just by looking at the encoded utterances
I think what we're trying to do or what I'm trying to do at least is to find a model that would produce the same interpretations of an utterance as a human would. I don't see why we couldn't find such a model pretty soon, given the vast amount of data out there, however unstructured it might be.
When you say that we have "vast amounts of data" you mean that we have vast amounts of text- but by modelling text we will not model meaning, we will only model text. We have not observed "meaning" and we have no examples of meaning turning into text and back again.
If I may be allowed the simile, training on text to model meaning is a bit like looking at a screen hiding a figure of a person and trying to learn something about the person from the screen they're hiding behind.
You can't model phenomena you can't observe.
Crucially, it's unstructured data plus interaction. You could use the word "cat" in a sentence hundreds of times and hope the kid learns about cats, but it's much easier to just let the kid play with a cat.
I think you're wrong. General AI is much more related to reinforcement learning and embodiment. Language without embodiment is ungrounded.
To that end, any system that is just looking for the logic of an underlying system is going to be stymied by the fundamental lack of it in language. There is the appearance of logic. And it can actually get you quite far. However, I question if it can encompass all of it. At least until the next best seller comes out.
To that end, I do not find it surprising that the best model will be an ensemble model. More, as long as we are dealing with English (a non-pictographic and a non-phonetic language), we are ultimately trying to build a model on top of a malleable base, where appeals to the logic of yesterday in a language have a non-zero chance of failing today. (Edit to make this point clearer: it is amusing to me how many lines in classic Shakespeare no longer pop in the same way in current language. "Now is it Rome indeed, and room enough". That a pair of words could cease to be homophones is not something that seems at all obvious. Of course, I have not studied this in decades, so mayhap my grade school lied to me.)