If you think about audio/visual data, deep nets make sense: if you tweak a few pixel values in an image, or shift every pixel value by some amount, the image still retains basically the same information. In this context, linearity (weighting values and summing them up) makes sense. It's not clear whether it makes sense in language. On the other hand, deep methods are state of the art on most NLP tasks, but their improvement over other methods isn't the huge gap we see in computer vision. And while we know there are tight similarities between lower-level visual features in deep nets and the initial layers of the visual cortex, the justification for deep learning in NLP is simpler and less specific: what I see is the fact that networks have huge capacity to fit to data and are deep (rely on a hierarchy of features). My guess is we may need a fundamental breakthrough in a newfangled hierarchical learning system that is better suited for language to “solve” NLP.
I think there are similar limitations with control and inference. In AlphaGo, the deep learning component is responsible for estimating the value of the game state; the planning is done with older methods (Monte Carlo tree search). This is much more speculative, but in the work on Atari games, for example, I suspect that most of what is being learned (and solved) is perception of useful features from the raw game images. I wonder whether the features for deducing the game-state score are actually complex.
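The division of labor described above can be sketched in miniature (a toy illustration, not AlphaGo's actual algorithm): a classical depth-limited search, here plain negamax on a hypothetical take-1-or-2 Nim game, that hands off to a pluggable value estimate at its frontier. That frontier slot is where AlphaGo plugs in its learned value network.

```python
def negamax(n, depth, value):
    """Depth-limited negamax for a toy Nim game: a pile of n stones,
    players alternately take 1 or 2, and taking the last stone wins.
    `value` stands in for a learned evaluation at the search frontier."""
    if n == 0:
        return -1.0  # the previous player took the last stone: we lost
    if depth == 0:
        return value(n)  # hand off to the (learned) value estimate
    return max(-negamax(n - m, depth - 1, value) for m in (1, 2) if m <= n)

# With enough depth the search is exact and the value estimate is never
# consulted: piles that are multiples of 3 are losses for the player to move.
print(negamax(6, 10, lambda n: 0.0))  # -1.0 (losing position)
print(negamax(7, 10, lambda n: 0.0))  # 1.0 (winning position)
```

The point of the sketch is that the search scaffolding is entirely classical; the learned component only supplies numbers at the leaves.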
I think what I'm trying to say is that when we look at the success of deep learning, we have to separate out how much of it is due to deep learning being the go-to black-box classifier, and how much is due to the systems we use actually being a good model for the problem. If the model isn't good, does it merely need to be tweaked from what we currently use, or does it have to change completely?
Arguing from the other direction, neural networks have already proven able to deal with very sharp features. For example, the value and policy networks in AlphaGo are able to pick up on subtle changes in the game position. The placement of a single stone can change a Go position drastically, and this is by no means handled only by the Monte Carlo tree search. Without MCTS, AlphaGo still wins ~80% of its games against the best hand-crafted Go program. The value and policy networks have effectively evolved a bit of Boolean logic, simply from the gradient provided by the smoothness that comes from averaging over a lot of training data.
I have a pet theory that the discovery of sharp features and Boolean programs might rely heavily on noise. If the error surface becomes too discrete, we basically need to fall back to pure random optimization (i.e., try a random direction, and keep the step if it is better). That lets us skip down the energy surface even without a gradient. Of course, such noise can also lead to forgetting, but it seems that elsewhere the gradient will be non-zero again, so any mistakes will be corrected by further learning (or the step simply leads to further improvement if it went in the right direction). Surely our episodic memory helps in the absence of gradient information as well. If we encounter a complex, previously unknown Go strategy, for example, it will likely not smoothly improve all our Go-playing abilities by a small amount. Instead, we store a discrete chain of states and actions as an episodic memory, which allows us to reuse that knowledge simply by recalling it at a later point in time.
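The "pure random optimization" fallback can be sketched as follows (a toy illustration under my own assumptions, not anyone's actual training procedure): a staircase loss whose gradient is zero almost everywhere, descended by proposing random perturbations and keeping any that are no worse.

```python
import random

def random_search(loss, x, n_iters=2000, sigma=0.5, seed=0):
    """Gradient-free descent: try a random direction, keep it if no worse.
    Accepting ties lets the search drift across flat plateaus."""
    rng = random.Random(seed)
    best = loss(x)
    for _ in range(n_iters):
        cand = [xi + rng.gauss(0.0, sigma) for xi in x]
        c = loss(cand)
        if c <= best:
            x, best = cand, c
    return x, best

# A "too discrete" error surface: piecewise constant, zero gradient everywhere.
staircase = lambda x: sum(int(abs(xi)) for xi in x)

x, best = random_search(staircase, [5.0, -3.0])
print(best)  # far below the starting loss of 8, despite no usable gradient
```

The tie-accepting rule is what matters here: on a flat step of the staircase it degenerates into a random walk, which is exactly the noise-driven exploration the pet theory is about.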
Isn't that basically Monte Carlo?
It is funny how every AI post on HN turns into a speculative discussion forum full of the words "I think", "likely", "I suspect", "my guess", etc., when all the research is available for free and everyone is free to download and read it to get a real understanding of what's going on in the field.
>what I see is the fact that networks have huge capacity to fit to data and are deep (rely on a hierarchy of features).
Actually, recurrent neural networks like LSTMs are Turing-complete, i.e. for every halting algorithm it is trivial to construct an RNN that computes it.
It is non-trivial to learn those parameters from the algorithm's I/O data, but for many tasks it is possible as well.
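As a concrete (hand-constructed, not learned) illustration that recurrent nets can represent genuine computation: a tiny RNN with fixed weights and hard-threshold units that computes the parity of a bit sequence of unbounded length, something no fixed-size single feedforward pass can do.

```python
def step(z):
    """Hard-threshold activation."""
    return 1 if z > 0 else 0

def rnn_parity(bits):
    """A tiny RNN with hand-set weights: two hidden threshold units per
    time step implement XOR between the carried state and the next bit,
    so the final state is the parity of the whole sequence."""
    h = 0
    for x in bits:
        a = step(h + x - 0.5)   # fires if h OR x
        b = step(h + x - 1.5)   # fires if h AND x
        h = step(a - b - 0.5)   # a AND NOT b == h XOR x
    return h

print(rnn_parity([1, 0, 1, 1]))  # 1 (odd number of ones)
print(rnn_parity([1, 1, 0, 0]))  # 0 (even number of ones)
```

Constructing these weights by hand is trivial; the hard part, as noted above, is learning them from I/O examples, since the threshold units kill the gradient.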
>I suspect that most of what is being learned (and solved) is perception of useful features from the raw game images.
It is not that simple: deep enough convnets can represent computations, and the consensus is that the middle and upper layers of convnets represent useful computation steps. Also note that the human brain can only perform so many computation steps when answering questions in dialogue, due to time and speed limits.
>My guess is we may need a fundamental breakthrough in a newfangled hierarchical learning system that is better suited for language to “solve” NLP.
This is being worked on; see the first link for Memory Networks, as well as Stack RNNs, DeQue RNNs, and Tree RNNs. Deep learning is a very generic term: there are dozens of feedforward and recurrent architectures that are fully differentiable. The full potential of such models is nowhere near being reached, and maybe language understanding will be solved in the coming years (again, the first link suggests it is in the process of being solved).
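For intuition about what a Stack RNN is aiming at, here is the discrete stack machine that such architectures approximate with soft, differentiable push/pop operations: recognizing balanced parentheses (the Dyck-1 language), a classic example of a task that needs a stack rather than a fixed-size state.

```python
def balanced(s):
    """The discrete computation a Stack RNN approximates with soft
    push/pop operations: a stack machine for the Dyck-1 language."""
    stack = []
    for c in s:
        if c == '(':
            stack.append(c)          # push
        elif c == ')':
            if not stack:
                return False         # pop from empty stack: reject
            stack.pop()              # pop
    return not stack                 # accept iff the stack is empty

print(balanced("(()())"))  # True
print(balanced("(()"))     # False
```

The research question those architectures address is whether a net can learn this push/pop discipline from examples alone, rather than having it hard-coded as above.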
EDIT (reply to below): in general, these statements are either vague and nonspecific, or perfectly correct but uninformative; comments that don't have much to do with my original point.
>Turing-completeness is quite broad and nonspecific, like I said.
It is, but feedforward models (and almost every Bayesian/statistical model) don't possess it even in theory, while RNNs do.
>Doing "some computation" is an obvious statement that doesn't add any information.
Let me be more specific: researchers currently think that the later stages of CNNs do something that is better interpreted as computation than as mere pattern matching. Our world doesn't require a 50-level feature hierarchy, yet resnets with 50+ layers do well, likely because they learn some non-trivial computation.
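One way to make "layers as computation steps" concrete (a toy analogy of my own, not a claim about what any trained resnet actually does): unroll an iterative algorithm so that each residual update h = h + f(h) plays the role of one layer. Here 20 identical residual "layers" implement Newton's method for a square root.

```python
def residual_sqrt(x, n_layers=20):
    """Unrolled iteration as depth: each 'layer' applies a residual
    update h <- h + f(h), here a Newton step converging to sqrt(x)."""
    h = 1.0
    for _ in range(n_layers):
        h = h + 0.5 * (x / h - h)   # residual form of h <- (h + x/h) / 2
    return h

print(round(residual_sqrt(9.0), 6))  # 3.0
```

In this reading, depth buys you iterations of a computation, not levels of a perceptual hierarchy, which is one reason 50+ layers can pay off even when the world's feature hierarchy is shallow.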
>the jury is still out on whether any of those RNN approaches will be the needed breakthrough.
Sure, we'll see. Maybe there won't be a need for any breakthrough, just incremental improvement of models. And even current models, when scaled up to next-gen hardware (see Nervana), may surprise us again with their performance.
My skepticism is about success in the sense of commercially useful systems that can process language and function "off the leash" of human supervision without being dominated by unacceptably bad results.
Look at the Xbox One Kinect vs. the Xbox 360 Kinect. On paper the newer product is much better than the old one, but neither is any easier or more fun to use than picking up the gamepad. In the current paradigm, researchers can keep putting up better and better numbers without ever crossing the threshold to something anybody can make a living off of.
This is probably due to the fact that the field is very interesting and has lots of undefined boundaries, so people like to take educated guesses based on the knowledge they might have and on their intuition. Fair enough for this discussion.
> maybe language understanding will be solved in the coming years
OK, here comes my guess: I think reasoning about and producing computer programs should be easier than reasoning about and producing natural language. So if that's possible (a big if), it should come first. And then maybe NLP will be solved with the help of code-writing computers. Or maybe just by code-writing computers, and nobody here has a job anymore :)
I wonder if it's just a different kind of "noise". Higher level, more structured.
> My guess is we may need a fundamental breakthrough in a newfangled hierarchical learning system that is better suited for language to “solve” NLP.
It seems fairly evident that there are many hierarchies inside the brain, each level working with outputs from lower-level processing units. In a sense, something like AlphaGo is hierarchy-poor: it has a few networks loosely coupled to a decision mechanism.
But the brain probably implements a "networks upon networks" model, that may also include hierarchical loops and other types of feedback.
I think that to have truly human-level NLP, we'd have to simulate reasonably closely the whole hierarchy of meaning, which in turn is given by the whole hierarchy of neural aggregates.
EX: "How long do stars last?" Means something very different in a science class than a tabloid headline. Is that tabloid talking divorce or obscurity? Notice how three sentences in I am clarifying last.
EDIT: a combination of noise, I should say, and paucity of information.
Asking a computer to solve all the ambiguity in human language perfectly is asking it to solve it far better than any human can.
For human-level NLP, you need to model the mechanism by which the relationship network is generated, and ground it in a set of experiences - or some digital analogue of experiences.
Naive statistical methods are not a good way to approach that problem.
So no, Wikipedia will not provide enough context, for all kinds of reasons - not least of which is the fact that human communications include multiple layers of meaning, some of which are contradictory, while others are metaphorical, and all of the above can rely on unstated implication.
Vector arithmetic is not a useful model for that level of verbal reasoning.
For AIs to determine their own goals, well, now you get into awareness ... consciousness. At a fundamental mathematical level, we still have no idea how these work.
We can see electrical signals in the brain using tools and know it's a combination of chemicals and pulses that somehow makes us do what we do ... but we are still a long way from understanding how that process really works.
I'd actually just say that we've not really defined these very well, and so arguing about how far along the path we are to them isn't that productive.
In a similar context they would probably end up parsed to the same question, assuming correct inflection, posture, etc. Spoken conversation is messy, but it also has redundancy and pseudo-checksums. Written language tends to be more formal because it's a much narrower channel and you don't get as much feedback.
PS: It's also really common for someone to ask a question when they don't have enough context to understand what question they should be asking.
A further comment on deep methods being state of the art currently:
I wonder how well these tasks really measure progress in natural language understanding (I really don't like isolating that term as some distinct subdiscipline of broader AI goals, but so be it). Some of Chris Manning's students have at least started down the path of examining some of these new-traditional tasks in language, and found that perhaps they are not as hard as they appear to be.
A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task. Chen, Bolton & Manning [https://arxiv.org/abs/1606.02858]
IME as a chatbot developer, people don't talk to them in conversational English so much as spit out what they want them to do.
But something about the very use of hierarchy in trying to solve NLP makes me queasy. I think it's more (poetically-metaphorically) like Reed-Solomon codes than hierarchies (to the extent that those don't actually overlap). There is Unexplained Conservation of Information That Really Isn't There To Start With.