>> Over the last few decades, innate priors have gone out of fashion, and today Deep Learning research prizes closely-supervised end-to-end learning (supported by big-data and big-compute) as the dominant paradigm.
I don't get why deep learning researchers are so hung up on learning everything from scratch. The trend towards ever more compute and data is just unsustainable. For problems like natural language, where the set of distinct events may be effectively infinite, you can keep throwing data at the problem and you'll never make a dent in it. There are problems that grow at a pace that cannot be matched by any computer, no matter how powerful.
What's more, as a civilisation we have nothing if not knowledge about the world. We have been accumulating it for thousands of years. It's what makes the difference between an intelligent human and an educated intelligent human. And it's a big difference. So, if we have all this background knowledge, why not use it, and make our lives easier?
I think it's pretty well established in learning theory that the combination of neural network topology + the training procedure sets up an implicit prior (it establishes a bias towards the type of things the network will learn more easily from the input data). In fact, this is a corollary of the No Free Lunch Theorem.
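To make this concrete, here's a toy sketch (my own illustration, nothing from the koan or the thread): just sample functions from freshly initialised MLPs and you already get a very particular, structured family of curves. That spread is the implicit prior; "randomly wired" does not mean "assumption-free".

    # Sample functions from randomly initialised 2-layer tanh MLPs.
    # The family of curves you get is the architecture's implicit prior
    # over functions, before any training data is seen.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 200).reshape(-1, 1)   # 1-D inputs on a grid

    def random_mlp(x, hidden=64):
        """One draw from the 'function prior': a freshly initialised MLP."""
        W1 = rng.normal(0.0, 1.0, size=(1, hidden))
        b1 = rng.normal(0.0, 1.0, size=hidden)
        W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=(hidden, 1))
        return np.tanh(x @ W1 + b1) @ W2

    samples = [random_mlp(x) for _ in range(10)]
    # Every sample is a smooth, bounded curve: far from "no assumptions".
    # Change the depth, width, activation or init scale and the set of
    # likely functions visibly changes -- that is the prior being changed.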
The problem is that these are extremely high-dimensional spaces, and even narrowing down the space within which the prior might be defined (much less finding the prior itself) is mathematically difficult. Hence progress is slow and halting: it has mostly consisted of investigating things like which representations fully-trained neural networks find within the early activation layers, which means the question is being studied empirically, without the benefit of a fully-fledged theory.
Interestingly, the 'nature vs. nurture' debate can be viewed as basically a discussion of learning priors in the context of human beings. It turns out the truth is pretty subtle: for many problem domains, humans (and animals) can be made to learn a very large set of things with repeated training, but we also have strong priors, so some things can be taught much more easily and quickly than others.
Our brain networks are primed to learn certain things, e.g. ducklings have a prior that the first animate object they see is their mother. It's possible to force a duckling to unlearn its maternal imprinting, but only with repeated and large amounts of negative conditioning. It seems pretty likely that this prior is embedded in both the topology and neurochemical functioning of a normally developing brain.
> the combination of neural network topology + the training procedure sets up an implicit prior [...]
That's what I got from the koan. Sussman thinks that a randomly wired NN has no prior, but that's false. Same as Minsky "thinks" that the room is empty when he closes his eyes.
EDIT: actually if you read the source (spetharrific.tumblr.com) it spells this out.
That's certainly the moral. I think what's changed with the new deep learning paradigm is a willingness to accept that implicit prior, not so much because it's good as because we don't know how to improve it consistently.
>> I don't get why deep learning researchers are so hung up on learning everything from scratch.
It is easier to formulate the problem when you are dealing with only one specific task. Transfer learning is useful, but in its current form it is limited and can only borrow features from tasks that are similar. Meta-learning is trying to tackle this problem specifically; promising results have been presented, but there is still a long way to go before it is useful in real-world production.
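For example, the standard transfer-learning recipe looks roughly like the sketch below (a hypothetical PyTorch setup; the backbone and the 10-class head are placeholders). It only works to the extent that the borrowed features happen to suit the new task, which is the limitation mentioned above.

    # Reuse an ImageNet-pretrained backbone, train only a new task head.
    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(weights="IMAGENET1K_V1")   # pretrained backbone

    for p in model.parameters():                       # freeze borrowed features
        p.requires_grad = False

    model.fc = nn.Linear(model.fc.in_features, 10)     # new head for a 10-class task

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    # Only the head is updated; the features are "borrowed" wholesale,
    # which is exactly why this breaks down when the tasks are dissimilar.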
In order NOT to learn from scratch, we need to understand how knowledge is represented. For neural networks, the knowledge is the learnt parameters. However, those parameters are tied to the task you trained your network on, so we are back to square one. I doubt continuous, differentiable tensors are the ONLY representation of human knowledge. They might represent the hard-to-explain, intuitive part of it; however, a large, probably larger, part of our knowledge is better represented by categories/hierarchies/graphs/rules that are not easily differentiable, which current deep learning techniques struggle to model.
So, IMO, it is not really that researchers are obsessed with supervised learning, it is probably that they have no better alternatives.
"Do not try and bend the spoon, that's impossible. Instead, only try to realize the truth...there is no spoon. Then you'll see that it is not the spoon that bends, it is only yourself."
―Spoon Boy to Neo
Categories/hierarchies/graphs/rules are important, but I suspect only our conscious/social minds work in that abstraction.
But the representation isn't static. We can learn new symbol systems and adjust them all the time. Plus create rules of thumb to make re-processing easier.
And we go back and forth. Think of an elephant. Now an Asian elephant walking by a river. Your image probably started out pretty abstract, then got more vivid, then re-adjusted to add the river. So you went from symbols to a more detailed model, like a generative model, but you have probably only had a few examples of elephants to generate from.
I don't think this is actually true for most deep learning researchers. There is a lot of work now on incorporating knowledge bases and more domain info in learning, and it's likely to just get more and more common.
In theory, perhaps; in practice it's not feasible. I was having a similar conversation today with my thesis supervisor and he quoted something Turing had said in his original machine intelligence paper, that "There's not enough information in data to learn from nothing". This was in the context of machine learning, because Turing was wondering about the best way to create a learning machine. I think the bottom line is that he didn't think you could do it without some background knowledge of some sort.
Edit: I haven't read Turing's paper recently and I might be misquoting him, but I do think that was his general intuition, that you can't just learn the world by observing the world.
The AlphaZero paper mentions that Go has especially simple rules and that chess and shogi are more complicated. But the rules are still pretty clear. To some extent, they must at least have hard-coded into the training data who won a particular game, or the game rules have to be hard-coded. Something has to be hard-coded. That is far from a prior of "nothing"!
But the paper is cryptic to me, overall ("Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm").
Also, the parent comment is a paradox. Data is information, and that is something, so learning from data is learning from something.
If I remember correctly, in the "Mastering Chess etc" paper the structure of the neural net was mapped onto the chess board itself. The movement range of the different pieces was also hard-coded into the network architecture; for example, there was a vector encoding "queen moves" and another for "knight moves", which taken together cover all possible chess piece moves. The encoding of the moves was not used to generate moves per se; I don't remember the details very well, but that was a separate module that I believe fed moves to the MCTS algorithm. The move range was encoded as a kind of constraint, to keep the network from exploring unproductive moves.
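From memory, the policy output looked roughly like the sketch below; treat the exact plane counts as an assumption and check the paper.

    # Rough sketch of the move encoding described above (numbers from memory):
    # the policy head outputs an 8x8x73 stack of planes, one slot per
    # (from-square, move-type) pair.
    import numpy as np

    N_QUEEN_PLANES = 8 * 7     # 8 directions x up to 7 squares of travel
    N_KNIGHT_PLANES = 8        # the 8 knight jumps
    N_UNDERPROMO_PLANES = 9    # 3 piece types x 3 directions for underpromotions
    N_PLANES = N_QUEEN_PLANES + N_KNIGHT_PLANES + N_UNDERPROMO_PLANES  # 73

    policy_logits = np.zeros((8, 8, N_PLANES))   # what the network predicts

    def slot_index(from_square, plane):
        """Flat index of one (from-square, move-type) slot."""
        row, col = from_square
        return (row * 8 + col) * N_PLANES + plane

    # The rules of chess enter only through which of these slots get masked
    # as legal in a given position -- that part is hand-coded, not learned.
    print(policy_logits.size)   # 4672 possible slots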
So the network didn't have to learn anything about the rules of chess, shogi or go and indeed, it did not. That knowledge was given to it directly. As far as I know, this is the done thing with most game-playing systems, especially ones for classic board games that have usually very simple rules that are easy to hand-code (so that there's no real need to learn them from scratch).
It would be interesting to see if it's possible to machine-learn the _rules_ of a game (i.e. given a move, recognise it as legal or not). A quick scan of internet search results confirms my recollection that most published work in game-playing agents focuses on learning how to play well, rather than how to play in the first place (indeed "learning to play chess" is used to mean "learning to win" in many publications).
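A rough sketch of how that experiment could be set up, using the python-chess package purely to generate labels (the feature extraction and the classifier itself are left out):

    # Generate (position, move, is_legal) examples; any off-the-shelf
    # classifier could then be trained on them to "learn the rules".
    import random
    import chess

    def random_position(max_plies=40):
        board = chess.Board()
        for _ in range(random.randrange(max_plies)):
            moves = list(board.legal_moves)
            if not moves:
                break
            board.push(random.choice(moves))
        return board

    def labelled_examples(n=1000):
        data = []
        for _ in range(n):
            board = random_position()
            move = chess.Move(random.randrange(64), random.randrange(64))  # random from->to squares
            data.append((board.fen(), move.uci(), move in board.legal_moves))
        return data

    examples = labelled_examples()
    # A model fit on pairs like these would be learning to *play*, not to win;
    # the interesting question is how much data that actually takes.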
Well, AlphaZero doesn't use expert knowledge of Go, specifically, but it still uses Monte Carlo Tree Search, which is a very strong encoding of what matters in games in general, so I'd say, no. It's not learning from nothing.
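For reference, the selection rule is roughly the PUCT formula from the AlphaGo Zero / AlphaZero papers (written from memory, so treat the exact form and constant as an assumption):

    # Each search step picks the child maximising mean value plus an
    # exploration bonus scaled by the network's prior and the visit counts.
    import math

    def puct_score(q, prior, visits, parent_visits, c_puct=1.5):
        """q: child's mean value; prior: network's move probability;
        visits: child visit count; parent_visits: visits at the parent."""
        exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + visits)
        return q + exploration

    # Nothing here is Go- or chess-specific, but "search a game tree, trust
    # visit counts, trade off exploitation against exploration" is itself a
    # strong prior about what kind of problem you are solving.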
Well it's a specific algorithm, sure. If that's enough to dismiss it as 'not learning from nothing', then nothing could ever qualify as learning from nothing.
My understanding is that work on deep learning for malware classification is still pretty early. Having said that, it does seem to be getting closer.
Results in this field usually look way better than they would be in a production environment.
Regarding the articles you mentioned:
Using a ROC curve for evaluation in this case is a red flag because it doesn't take the data imbalance into account. A precision-recall curve is far more suitable. You can have great AUC on the ROC curve while precision is near zero in highly imbalanced problems such as malware detection. Precision is the probability that a positive detection is a true positive, which is usually the measure you are most interested in.
The problem is that precision changes if you change the class priors. Because of that, the results are always very dataset specific.
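A tiny worked example of both points, with made-up numbers: the same detector (same TPR/FPR operating point, so the same ROC curve) gives wildly different precision depending on how rare malware is in the evaluation set.

    def precision(tpr, fpr, prevalence):
        tp = tpr * prevalence          # fraction of all samples that are true positives
        fp = fpr * (1 - prevalence)    # fraction that are false positives
        return tp / (tp + fp)

    TPR, FPR = 0.95, 0.01              # looks excellent on a ROC curve
    for prev in (0.5, 0.01, 0.0001):
        print(prev, round(precision(TPR, FPR, prev), 3))
    # 0.5    -> 0.990
    # 0.01   -> 0.490
    # 0.0001 -> 0.009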
With that said, I'm not saying that machine learning or neural networks do not work for this task. They just don't work in the end-to-end manner where you feed raw binaries as input to some generic architecture, as we can do with images in some tasks.
In the video, both LeCun and Manning argue convincingly that this is not the case. LeCun believes that in some idealised future that is still far away, priors will not be necessary. Both seem to think that, right now, you can't do anything completely from scratch.
I wonder what examples you have in mind btw, other than self-play.
>As an example (28:57), he described how the human brain does not have any innate convolutional structure – but it doesn’t need to, because as an effective unsupervised learner, the brain can learn the same low-level image features (e.g. oriented edge detectors) as a ConvNet, even without the convolutional weight-sharing constraint.
I think the first part of the sentence should say that the brain doesn't have innate weight sharing (as stated at the end of the sentence), not that it is not convolutional. I believe the convolutional structure was actually copied from the visual cortex (but with no weight sharing, as far as we know).
Convolutions are mathematically defined by application of the same kernel at each possible position. If that kernel has finite support, you also get locality. The visual cortex has locality, but without weight-sharing between functionally identical neurons, it's not convolutional.
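A toy illustration of the difference (my own code): both layers below are local, but only the first reuses its kernel across positions, i.e. only the first is a convolution.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=16)            # a 1-D input signal
    k = 3                              # kernel (receptive field) size
    n_out = len(x) - k + 1

    # Convolutional: one shared kernel applied at every position.
    shared_kernel = rng.normal(size=k)
    conv_out = np.array([shared_kernel @ x[i:i + k] for i in range(n_out)])

    # Locally connected: same locality, but a separate kernel per position,
    # i.e. no weight sharing -- closer to what the cortex can actually do.
    per_position = rng.normal(size=(n_out, k))
    local_out = np.array([per_position[i] @ x[i:i + k] for i in range(n_out)])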
Visual cortex neurons not only have locality, they also learn/recognize similar features. This was the original inspiration for conv nets: it was a way to learn efficient local features and apply them to the whole image. To me it seems that cortex neurons and the initial layers of conv nets do similar work, but with the constraints of their implementations: biological neurons cannot share weights, and for artificial neurons it is more efficient to learn and compute a dense convolution.
Disclaimer: I'm learning deep learning mainly from HN comments and just want to provoke more insights. I have no idea what weight sharing is or how kernels are represented in networks, but I do know that e.g. a blur filter or an edge filter is represented as a convolution matrix.
It is dangerously confusing to reapply neural-net terminology to neuronal nets, isn't it? The weight of a kernel of biological neurons: what is that supposed to mean?
If you haven't stopped reading yet, please consider: if, as I have to assume, you mean there is a specific ensemble of neurons that represents a kernel of given weights corresponding to exactly one area of the retina, then isn't sharing between "pixels" achieved simply by the eye's jittering?
For better or worse, assume I'm the adversary in a GAN and ignore me if it doesn't make sense.
I just started watching the video, so they may mention this, but one thing I find fascinating is that some recent work suggests the optimization algorithm (usually stochastic gradient descent) or the complexity of the loss surface (i.e. having lots of local minima that are almost as good as the global minimum) may actually induce a kind of regularization prior.
I.e. these seemingly very complex models are actually biased to find simpler solutions that generalize well, in a way that often turns out to work better than trying to explicitly learn a simple model.
Can anyone point to any reference where models depart from "closely-supervised end-to-end learning"?
In the article, LeCun & Manning argue this paradigm has some limitations, and I do agree. I'm thinking the field will evolve to systems becoming a combination of probabilistic logic-based engines (which represent formal causal reasoning) plus lots of deep models (which represent intuition, hypothesis generation and specialized tasks like vision).
Generative Adversarial Networks (GANs) and autoencoders are the ones that come to mind. There are also all the models used in the research leading up to the "deep learning" craze: the Helmholtz machine, the Boltzmann machine, deep belief networks.
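E.g. a minimal autoencoder sketch (toy code): the training signal is reconstruction of the input itself, so no labels are involved at all.

    import torch
    import torch.nn as nn

    autoencoder = nn.Sequential(
        nn.Linear(784, 64), nn.ReLU(),   # encoder: compress to a 64-d code
        nn.Linear(64, 784),              # decoder: reconstruct the input
    )
    opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

    x = torch.rand(32, 784)              # stand-in for a batch of flattened images
    loss = nn.functional.mse_loss(autoencoder(x), x)
    loss.backward()
    opt.step()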
There are lots of examples. In the NLP space, Word2Vec and other embedding models learn analogy relationships in an unsupervised way. Things like the OpenAI unsupervised sentiment detection model[1] show there is a lot more that can be done in this area.
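E.g. the classic analogy demo with gensim (the vectors file below is a placeholder; any pretrained word2vec-format vectors will do):

    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    # "king" - "man" + "woman" -> hopefully "queen"; the vectors were learned
    # with no labels at all, just by predicting words from their contexts.
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))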