>> Our work has connections to existing work not only in deep learning, but also
to various other fields
Perhaps I'm, er, biased, but the most obvious similarity that immediately
jumped out to me when I started reading this article is with Inductive
Programming, the broader field of machine learning algorithms that learn
programs, that includes Inductive Functional Programming and Inductive Logic
Programming (which I study; hence, "biased").
In fact, I'd go as far as to say that this is a new kind of Inductive
Programming. Perhaps it should be called Inductive Continuous Programming or
Inductive Neural Programming or some such. But what's exceedingly clear to me
is that what the article describes is a search for a program that is composed
of continuous functions, guided by a strong structural bias.
The hallmarks of Inductive Programming include an explicit encoding of strong
structural biases, including the use of a library of primitives from which
the learned model is composed; a strong Occamist bias; and learned models that
are programs. This work ticks all the boxes. The authors might benefit from
looking up the relevant bibliography.
A couple of starting pointers:
Magic Haskeller, an Inductive Functional Programming system:
http://nautilus.cs.miyazaki-u.ac.jp/%7Eskata/MagicHaskeller....
Metagol, an Inductive Logic Programming system:
https://github.com/metagol/metagol
ILASP, a system for the Inductive Learning of Answer Set Programs:
http://ilasp.com/
> In fact, I'd go as far as to say that this is a new kind of Inductive Programming.
I wouldn't. Traditional NNs are widely called function approximators, or said to do function induction. That's because they're not Turing-complete, and in this case don't even have memory. This makes them decidedly not programs, just functions.
Of course, they could be modified with recurrence (a suggestion for future work), or even more complete memory systems (e.g. equipping a neural controller with a tape), making them Turing-complete.
Keep in mind the function approximation power isn't increased by architecture search alone (there's no restriction from doing weight training). What seems novel here is how significant architecture alone is (I wouldn't have guessed it would be this powerful), and this possibly has implications for the connection with the human brain -- in which my vague impression is that connections themselves are much more tuned than weights; also possibly giving greater emphasis to architecture training.
[sorry- splitting this off 'cause it's getting a bit large and to make the
whole thing easier to read]
>> What seems novel here is how significant architecture alone is (I wouldn't
have guessed it would be this powerful), and this possibly has implications for
the connection with the human brain -- in which my vague impression is that
connections themselves are much more tuned than weights; also possibly giving
greater emphasis to architecture training.
You are surprised by how significant architecture is. I am not surprised at
all. I think it's been very clear for a while now that the success of deep
neural networks is entirely due to their architectures that are fine-tuned to
specific tasks. On the one hand, it's obvious that this is the case if you
look at the two major successes of deep learning: CNNs for vision and LSTMs
(and variants) for sequence learning. Both are extremely precise, extremely
intricate architectures with an intense focus on one kind of task, and that
kind of task, alone. On the other hand, the vast majority of neural net
publications are specifically about new architectures, or, rather, tweaks to
existing architectures.
In fact, in my boldest moments I'd go as far as to suggest that weight
training with gradient optimisation has kept deep neural nets back. Gradient
optimisation gets stuck in local minima, and it will always get stuck in
local minima, and therefore always be dumb as bricks. Which is evident in the
way deep neural nets overfit and are incapable of extrapolating outside their
training datasets.
> Gradient optimisation gets stuck in local minima, and it will always get stuck in local minima, and therefore always be dumb as bricks. Which is evident in the way deep neural nets overfit and are incapable of extrapolating outside their training datasets.
Calling them 'dumb as bricks' seems extremely harsh, given the vast successes of deep neural nets (there are too many examples of extrapolation outside training datasets to count[1]). They have proven generalization capability when care is taken against overfitting (almost any model has this ability). Also, there are results showing that getting stuck in local minima is unlikely in high-dimensional parameter spaces (the curvature in a large number of directions needs to coincide), and in any case you're rarely seeking the global minimum because of said overfitting issues.
It is also not controversial that gradient optimisation gets stuck in local
optima. A good exercise to convince yourself of this is to implement a simple
gradient descent algorithm and observe its behaviour.
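If it helps, here is a minimal sketch of that exercise (the function and starting points are my own arbitrary choices, nothing from the article): plain gradient descent simply settles into whichever basin it starts in.

    # Plain gradient descent on the non-convex function f(x) = x^4 - 3x^2 + x.
    # It has two basins; descent converges to the minimum of whichever basin
    # the starting point lies in, not necessarily the global one.

    def grad_f(x):
        return 4 * x**3 - 6 * x + 1   # derivative of x^4 - 3x^2 + x

    def gradient_descent(x0, lr=0.01, steps=2000):
        x = x0
        for _ in range(steps):
            x -= lr * grad_f(x)
        return x

    print(gradient_descent(-2.0))  # ~ -1.30, the global minimum
    print(gradient_descent(+2.0))  # ~ +1.13, stuck in the local minimum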
>> [1] It should be self-evident because they would hardly be of any use if
unable to extrapolate at all.
That is why deep neural nets are trained with such vast amounts of data
and computing power: it's to compensate for their inability to generalise to unseen
instances.
That, too, is not controversial. It is common knowledge that the recent
success of deep learning is due to the availability of larger datasets and
more powerful computers than in past years. For example, see this interview with Geoff Hinton:
Geoffrey Hinton: I think it’s mainly because of the amount of computation and
the amount of data now around but it’s also partly because there have been
some technical improvements in the algorithms. Particularly in the algorithms
for doing unsupervised learning where you’re not told what the right answer is
but the main thing is the computation and the amount of data.
The algorithms we had in the old days would have worked perfectly well if
computers had been a million times faster and datasets had been a million
times bigger but if we’d said that thirty years ago people would have just
laughed.
I hope this is helpful. I understand there are misconceptions about the
strengths and limitations of deep neural nets that are not very easy to see
with the naked eye. But it's a good idea to approach such an overhyped
technology with a skeptical mind and, e.g., ask yourself: why is all that data
necessary, if they are so good at learning?
>> I wouldn't. Traditional NNs are widely called function approximators, or said to do function induction. That's because they're not Turing-complete, and in this case don't even have memory. This makes them decidedly not programs, just functions.
But what the article describes is not function approximation anymore. Their system doesn't optimise a set of parameters. The search for an optimal set of parameters has been replaced entirely with a search for an optimal architecture with a single, shared parameter.
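To make the single-shared-parameter point concrete, here is a toy sketch of my own (not the authors' code): a fixed topology in which every connection reuses one weight w, so the only thing left to check is how the architecture behaves across a sweep of w values.

    import math

    # A tiny fixed topology in which every connection shares the single weight w.
    # The architecture (wiring + activation functions) does all the work; there
    # is no per-connection parameter to fit, only a sweep over the shared value.
    def tiny_net(x, w):
        h1 = math.tanh(w * x)    # hidden unit 1: tanh activation
        h2 = abs(w * x)          # hidden unit 2: absolute-value activation
        return w * h1 + w * h2   # output unit: the same shared weight again

    for w in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0):
        print(f"w = {w:+.1f}:", [round(tiny_net(x, w), 3) for x in (-1.0, 0.0, 1.0)])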
Note also that you don't need something to be Turing-complete for it to be a program. For instance, a regular automaton is a program, but it's not Turing complete. Turing completeness makes something, well, a Universal Turing Machine, which can compute any program. But a program is just a set of instructions.
> Note also that you don't need something to be Turing-complete for it to be a program. For instance, a regular automaton is a program, but it's not Turing complete. Turing completeness makes something, well, a Universal Turing Machine, which can compute any program. But a program is just a set of instructions.
A regular automaton is a program (and a function is a program), but not all programs can be expressed as finite (regular) automata (FAs). So FAs are a subset of programs. If we equate programs with mathematical algorithms, in general they really need the unlimited-memory aspect (or at least the capability of abstracting memory, which in practice is always finite, of course).
To make things clear, an example: a fixed function (or FA) cannot sort a list of numbers of variable size. You can train your function to sort any K numbers, which are the input variables. But because it has no memory, you cannot input variables sequentially (in the FA case you cannot input arbitrarily many variables). You've essentially created a sorting network[1].
Not only that, but for each increase in input size you need to re-train your architecture. The nice thing about program inference is that it hopefully captures the fundamental algorithm behind a program, which generalizes to arbitrary sizes. It's not that arbitrary sizes are necessary in practice (again, computers are always bounded); it's that this abstraction/generalization is extremely useful, saving data and allowing large (arbitrary) variation in input.
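For illustration, here is a fixed 4-input sorting network (a standard textbook construction, not something from the article): it sorts any four numbers, but a fifth input would require a different, larger network.

    # A hard-wired sorting network for exactly K = 4 inputs: a fixed sequence of
    # compare-exchange operations. It handles any 4 numbers, but cannot accept a
    # 5th; a larger input needs a different network.
    COMPARATORS = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

    def sort4(values):
        v = list(values)
        for i, j in COMPARATORS:
            if v[i] > v[j]:
                v[i], v[j] = v[j], v[i]   # compare-exchange
        return v

    print(sort4([3, 1, 4, 2]))  # [1, 2, 3, 4]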
> a Universal Turing Machine, which can compute any program
I feel there's a bit of confusion here. Turing Machines themselves can compute any program (i.e. you can represent any program as a TM) -- that follows essentially from the definition of algorithm (a set of well-defined steps given unlimited paper); again, the languages accepted by FAs are a restricted subset of those accepted by TMs (the former accept regular languages while the latter accept recursively enumerable languages). There are special Turing Machines that simulate arbitrary Turing Machines -- those are called Universal Turing Machines. It's a desirable characteristic of any computer, of course (the ability to input arbitrary programs).
We refer to program description languages that can describe any Turing Machine as Turing-complete. You could make a Universal Turing machine that accepts this kind of language directly or you could in principle just build a Turing machine that executes the given algorithm (for example using a gate array/FPGA to encode the state automaton, plus some large memory for the tape).
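A small, self-contained illustration of that restriction (my own example): a finite automaton is a perfectly good program, but its entire "memory" is its current state, so it can only decide regular properties such as "the input has an even number of 1s".

    # A two-state finite automaton deciding "even number of 1s". Its only memory
    # is the current state; it cannot, for example, check balanced parentheses,
    # which requires unbounded memory (a stack or tape).
    def even_ones(bits):
        state = 0              # 0 = even so far, 1 = odd so far
        for b in bits:
            if b == "1":
                state ^= 1     # flip state on every 1
        return state == 0

    print(even_ones("1001"))   # True  (two 1s)
    print(even_ones("1011"))   # False (three 1s)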
Yes, sloppy definition of Turing completeness on my part. I apologise unreservedly.
But, I still don't understand your objection. You agree that a regular automaton is a program, and that it's not Turing complete. So a program does not need to be Turing complete.
I didn't say, nor do I think, that the algorithm described in this article is Turing complete. I think it performs program induction. That's as far as I went. I still don't see where Turing completeness comes into it.
Loved how they implemented CartPole in JS so you can play with it.
I can't wait for more browser-based RL envs so people can get a feeling of how things work with 0 setup. Reminds me of DeepTraffic: https://selfdrivingcars.mit.edu/deeptraffic/
This might be a dumb question. How do, say, the discovered CartPoleSwingUp nets respond to changing the mass of the cart and/or the pole? That is, if I take the net found after 1024 generations and double the mass of the pole, would it still be able to upright the pole, assuming the motors could output the required force?
If we compare the WANN architecture evolved for MNIST to a simple one-layer NN (e.g. a linear classifier, where inputs are connected directly to the outputs with different weights), it seems to me that the evolved architecture uses activations and skip connections to simulate weights. Therefore, given enough variety in the arrangement of activation functions and number of hops, each pixel position ends up being weighted in a unique but consistent way. I wonder if the authors tried to establish the correspondence between the weights in a linear classifier and the path modulations from input pixel to output. To put it another way, is there an equivalent weight arrangement for a linear classifier to the WANN with a shared weight?
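One way to check a simplified version of that intuition (my own sketch, and only exact if the activations were linear): with a single shared weight w, each input-to-output path of length L contributes w^L, so the "effective weight" of an input is a polynomial in w determined entirely by the topology.

    # Hypothetical tiny DAG (node -> successors); every edge carries the shared
    # weight w. With linear activations, the effective weight of an input node
    # is the sum of w**len(path) over all paths from that node to the output.
    EDGES = {
        "in0": ["h1"],
        "in1": ["h1", "h2"],
        "h1":  ["out"],
        "h2":  ["h1", "out"],
        "out": [],
    }

    def effective_weight(node, w):
        if node == "out":
            return 1.0
        return sum(w * effective_weight(succ, w) for succ in EDGES[node])

    w = 0.5
    for inp in ("in0", "in1"):
        print(inp, "->", effective_weight(inp, w))   # in0 -> 0.25, in1 -> 0.625

With the non-linear activations a WANN actually uses, the correspondence is only approximate, which is presumably what makes the question interesting.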
I skimmed through the article. It's very accessible and well-written, compared to my experience with some of the other literature in ML.
As a bit of an outsider I'd like to ask, how does this compare to the traditional analogue? In what scenarios is this superior to the classic BP NNs?
I understand that the point of the paper is to show that shared-weight training is possible, but the results look promising, hence the question.
There was a mention that one of the test cases achieved results competitive with the current state of the art, but using 10 times fewer connections. That's pretty impressive.
We think this approach has the potential to develop networks useful in applications where we need to train the weights really quickly for the network to adapt to a given task.
We put in the discussion section that the ability to quickly fine-tune weights might find uses in few-shot learning and continual lifelong learning where agents continually acquire, fine-tune, and transfer skills throughout their lifespan.
The question is, which "super task" do you optimize your WANN for so that it is useful for many subtasks that you did not optimize for to begin with? We think perhaps optimizing a WANN to be good at exploration and curiosity might be a good start for it to develop good priors that are useful for new unseen tasks in its environment. There's a lot more work that can be done from here ...
If I understand correctly, the idea isn’t that you’d only use the architecture untrained, but that its goal is better architecture discovery for given tasks or groups of tasks.
So in the experimental results section of the article we also reported the performance increase when we fit the individual parameters of the network topology found, and compared it to the non-tuned parameters.
This reminds me a lot of the work on compressed neural networks from Jan Koutnik and his colleagues. They don't evolve the topology of a NN, but they learn the weights of a neural network in some compressed space. That seems to be very similar to weight sharing.
For example, in the case of the cart-pole (without swing-up) benchmark, a simple linear controller with equal positive weights is all that's required, which can easily be encoded with this approach (see the sketch below).
Another way to look at this is something analogous to a Fourier transform. Given a function, try a frequency and see what you get. So maybe some of the things that conceptually make that nicer might also have an analogue here?
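A rough rendering of that equal-weight cart-pole controller (my own sketch, using the standard 4-dimensional observation; whether this exact form solves the benchmark depends on the environment's scaling):

    # Equal positive weights on all four observation components, all set to one
    # shared weight W; the sign of the weighted sum picks the discrete action.
    W = 1.0  # the single shared, positive weight

    def policy(observation):
        x, x_dot, theta, theta_dot = observation
        force = W * (x + x_dot + theta + theta_dot)
        return 1 if force > 0 else 0   # 1 = push right, 0 = push left

    print(policy([0.0, 0.1, 0.05, 0.2]))   # -> 1 (push right)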
This paper really is a breath of fresh air, in my opinion.
To me this is a strangely written paper, although the results might be somewhat interesting. They just optimize over a different set of functions than you normally do in NN training and find solutions to various problems.
Their claim that the weights "are the solution" in normal NN training is wrong. Of course the weights AND the architecture are both important, since together they define the set of functions you search over. That you can find solutions by only searching over architectures is not surprising, since the architecture is the combination of different operations.
Just starting to read this. It reminds me of meiosis networks + the stochastic delta rule, where the weights were drawn from a random distribution (normal, I think).
I don't want to be dismissive, since this is pretty cool, but from just skimming the page it seems like this is basically a version of neural nets that is roughly equivalent to genetic programming?
Genetic programming is just one way to optimize candidates embedded in a high-dimensional space. Neural network research has been using population-based optimization techniques for a long time, that's not the innovation here.
This is an extension of NEAT (from 2002), which evolved neural network architectures and weights simultaneously. The interesting thing is by having a shared weight, they appear to be finding much more minimal networks, which can still be trained to have results close to the state of the art.
I really love the look and feel of this. Can anyone give a few pointers as to how to achieve something similar? I know it's rather simplistic, but it's really elegant.
We are currently cleaning up the code and it should be released soon on GitHub if you are interested to play with modifying the experiments described and try your own ideas.
The layout template is based on the distill.pub template I used earlier in another project. The GitHub repo has the instructions on how to build articles using it:
If a well-designed architecture can perform better than the learned weights of a general architecture, doesn't it make sense to design a system that can learn the architecture itself as it is learning the weights? Maybe that's the famous "Learning to learn" paper?