Perhaps I'm, er, biased, but the most obvious similarity that immediately
jumped out to me when I started reading this article is with Inductive
Programming, the broader field of machine learning algorithms that learn
programs, that includes Inductive Functional Programming and Inductive Logic
Programming (which I study; hence, "biased").
In fact, I'd go as far as to say that this is a new kind of Inductive
Programming. Perhaps it should be called Inductive Continuous Programming or
Inductive Neural Programming or some such. But what's exceedingly clear to me
is that what the article describes is a search for a program that is composed
of continuous functions, guided by a strong structural bias.
The hallmarks of Inductive Programming include an explicit encoding of strong
structural biases, often by means of a library of primitives from which the
learned model is composed; a strong Occamist bias; and learned models that are
programs. This work ticks all the boxes. The authors might benefit from
looking up the relevant bibliography.
A couple of starting pointers:
Magic Haskeller, an Inductive Functional Programming system.
Metagol, an Inductive Logic Programming system:
ILASP, a system for the Inductive Learning of Answer Set Programs:
I wouldn't. Traditional NNs are widely described as function approximators, or as performing function induction. That's because they're not Turing-complete, and in this case they don't even have memory. This makes them decidedly not programs, just functions.
Of course, they could be modified with recurrence (a suggestion of future work), or even with more complete memory systems (e.g. equipping a neural controller with a tape), making them Turing-complete.
Keep in mind the function approximation power isn't increased by architecture search alone (nothing prevents weight training on top of it). What seems novel here is how significant architecture alone is (I wouldn't guess it being this powerful), and this possibly has implications on the connection with the human brain -- in which my vague impression is that connections themselves are much more tuned than weights; also possibly giving greater emphasis on architecture training.
>> What seems novel here is how significant architecture alone is (I wouldn't
guess it being this powerful), and this possibly has implications on the
connection with the human brain -- in which my vague impression is that
connections themselves are much more tuned than weights; also possibly giving
greater emphasis on architecture training.
You are surprised by how significant architecture is. I am not surprised at
all. I think it's been very clear for a while now that the success of deep
neural networks is entirely due to their architectures that are fine-tuned to
specific tasks. On the one hand, it's obvious that this is the case if you
look at the two major successes of deep learning: CNNs for vision and LSTMs
(and variants) for sequence learning. Both are extremely precise, extremely
intricate architectures with an intense focus on one kind of task, and that
kind of task alone. On the other hand, the vast majority of neural net
publications are specifically about new architectures, or, rather, tweaks to
existing architectures.
In fact, in my boldest moments I'd go as far as to suggest that weight
training with gradient optimisation has kept deep neural nets back. Gradient
optimisation gets stuck in local minima, and it will always get stuck in
local minima, and therefore always be dumb as bricks. Which is evident in the
way deep neural nets overfit and are incapable of extrapolating outside their
training data.
This seems extremely harsh: calling deep neural nets 'dumb as bricks' despite their vast successes (there are too many examples of extrapolation outside datasets to count). They have proven generalization capability when care is taken against overfitting (almost any model has this ability). Also, there are results showing that getting stuck in local minima is unlikely in high-dimensional parameter spaces (the curvature would need to coincide in a large number of directions), and in any case you are rarely seeking the global minimum, because of said overfitting issues.
 It should be self-evident because they would hardly be of any use if unable to extrapolate at all. See GANs (https://www.thispersondoesnotexist.com/), AlphaGo (https://en.wikipedia.org/wiki/AlphaGo), etc.
For a high-level discussion, see the following article, by Francois Chollet,
the maintainer of the deep learning library Keras:
It is also not controversial that gradient optimisation gets stuck in local
optima. A good exercise to convince yourself of this is to implement a simple
gradient descent algorithm and observe its behaviour.
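That exercise is quick to set up. Here is a minimal Python sketch (the function and learning rate are arbitrary illustrative choices): a non-convex 1-D function where plain gradient descent converges to the global minimum or gets stuck in a shallower local minimum, depending purely on where it starts.

```python
# f(x) = x^4 - 3x^2 + x has a local minimum near x ~ 1.13
# and a deeper global minimum near x ~ -1.30.

def f(x):
    return x**4 - 3*x**2 + x

def grad(x):
    return 4*x**3 - 6*x + 1

def descend(x, lr=0.01, steps=5000):
    # Vanilla gradient descent: follow the negative gradient from x.
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_global = descend(-2.0)  # converges to the global minimum (~ -1.30)
x_local = descend(2.0)    # gets stuck in the local minimum (~ 1.13)
print(x_global, f(x_global))
print(x_local, f(x_local))
```

Starting from x = 2.0, the descent never discovers that a much lower value of f exists on the other side of the hump at x ~ 0.17.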
>>  It should be self-evident because they would hardly be of any use if
unable to extrapolate at all.
That is why deep neural nets are trained with such vast amounts of data
and computing power: it's to compensate for their inability to generalise to unseen data.
That, too, is not controversial. It is common knowledge that the recent
success of deep learning is due to the availability of larger datasets and
more powerful computers than in past years. For example, see this interview with Geoff Hinton:
Geoffrey Hinton: I think it’s mainly because of the amount of computation and
the amount of data now around but it’s also partly because there have been
some technical improvements in the algorithms. Particularly in the algorithms
for doing unsupervised learning where you’re not told what the right answer is
but the main thing is the computation and the amount of data.
The algorithms we had in the old days would have worked perfectly well if
computers had been a million times faster and datasets had been a million
times bigger but if we'd said that thirty years ago people would have just laughed.
I hope this is helpful. I understand there are misconceptions about the
strengths and limitations of deep neural nets that are not very easy to see
with the naked eye. But it's a good idea to approach such an overhyped
technology with a skeptical mind and, e.g., ask yourself: why is all that data
necessary, if they are so good at learning?
But what the article describes is not function approximation anymore. Their system doesn't optimise a set of parameters. The search for an optimal set of parameters has been replaced entirely with a search for an optimal architecture with a single, shared parameter.
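To make that distinction concrete, here is a toy sketch (emphatically not the paper's actual code; the target function, names, and weight sweep are all made up for illustration): every connection shares one weight value, and a candidate architecture is scored by its average error over a sweep of shared-weight values, so the solution has to live in the wiring rather than in tuned weights.

```python
import itertools
import random

random.seed(0)

def forward(arch, w, x):
    # Single linear output node: sum of the selected inputs, all sharing w.
    # An "architecture" here is just the subset of inputs wired to the output.
    return sum(w * x[i] for i in arch)

def score(arch, samples, weight_sweep=(0.5, 1.0, 1.5, 2.0)):
    # Mean absolute error against the target x[0] + x[2], averaged over the
    # shared-weight sweep. Lower is better; no per-connection tuning allowed.
    err = 0.0
    for w in weight_sweep:
        for x in samples:
            err += abs(forward(arch, w, x) - (x[0] + x[2]))
    return err / (len(weight_sweep) * len(samples))

samples = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(200)]

# Exhaustive "architecture search" over all subsets of the four inputs.
subsets = [frozenset(s) for r in range(5)
           for s in itertools.combinations(range(4), r)]
best = min(subsets, key=lambda a: score(a, samples))
print(sorted(best))  # should recover the wiring {0, 2} of the target
```

No subset of inputs is perfect across the whole weight sweep, but the wiring that mirrors the target's structure is the clear winner on average, which is the spirit of the shared-weight search.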
Note also that you don't need something to be Turing-complete for it to be a program. For instance, a regular automaton is a program, but it's not Turing complete. Turing completeness makes something, well, a Universal Turing Machine, which can compute any program. But a program is just a set of instructions.
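For instance, here is a two-state deterministic finite automaton, written as a handful of lines of Python: a perfectly good program that is nowhere near Turing-complete. It accepts binary strings containing an even number of 1s.

```python
# Transition table of a two-state DFA over the alphabet {0, 1}.
TRANSITIONS = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}

def accepts(s):
    # Run the DFA: start in "even", follow transitions, accept in "even".
    state = "even"
    for ch in s:
        state = TRANSITIONS[(state, ch)]
    return state == "even"

print(accepts("1001"))  # True: two 1s
print(accepts("10"))    # False: one 1
```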
A regular automaton is a program (and a function is a program), but not all programs can be expressed as finite (regular) automata (FAs). So FAs are a subset of programs. If we equate programs with mathematical algorithms, in general they really need the unlimited-memory aspect (or at least the capability of abstracting memory, which in practice is always finite, of course).
To make things clear, an example: a fixed function (or FA) cannot sort a list of numbers of variable size. You can train your function to sort any K numbers, which are the input variables. But because it has no memory, you cannot input variables sequentially (in the FA case you cannot input arbitrarily many variables). You've essentially created a sorting network.
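A sorting network for a fixed size is easy to write down. Here is the standard five-comparator network for exactly four inputs: a finite, memoryless sequence of compare-and-swap steps that sorts any four numbers, but is useless for any other input size.

```python
# The classic optimal 4-input sorting network: five fixed comparators.
COMPARATORS_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

def sort4(xs):
    # Apply each compare-and-swap in order; no loops over data size,
    # no memory -- the "program" is frozen into the wiring.
    xs = list(xs)
    for i, j in COMPARATORS_4:
        if xs[i] > xs[j]:
            xs[i], xs[j] = xs[j], xs[i]
    return xs

print(sort4([3, 1, 4, 2]))  # [1, 2, 3, 4]
```

Sorting five numbers requires a different network entirely, which is exactly the re-training problem described below.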
Not only that, but each time the input size grows you need to re-train your architecture. The nice thing about program inference is that it hopefully captures the fundamental algorithm behind a program, which generalizes to arbitrary sizes. It's not that arbitrary sizes are necessary in practice (again, computers are always bounded), it's that this abstraction/generalization is extremely useful, saving data and allowing large (arbitrary) variation in input.
> a Universal Turing Machine, which can compute any program
I feel there's a bit of confusion here. Turing Machines themselves can compute any program (i.e. you can represent any program as a TM) -- that follows essentially from the definition of algorithm (a set of well-defined steps given unlimited paper). Again, the languages accepted by FAs are a restricted subset of those accepted by TMs (the former accept regular languages while the latter accept recursively enumerable languages). There are special Turing Machines that simulate arbitrary Turing Machines -- those are called Universal Turing Machines. That's a desirable characteristic of any computer, of course (the ability to input arbitrary programs).
We refer to program description languages that can describe any Turing Machine as Turing-complete. You could make a Universal Turing machine that accepts this kind of language directly or you could in principle just build a Turing machine that executes the given algorithm (for example using a gate array/FPGA to encode the state automaton, plus some large memory for the tape).
But, I still don't understand your objection. You agree that a regular automaton is a program, and that it's not Turing complete. So a program does not need to be Turing complete.
I didn't say, nor do I think, that the algorithm described in this article is Turing complete. I think it performs program induction. That's as far as I went. I still don't see where Turing completeness comes into it.
I can't wait for more browser-based RL envs so people can get a feeling for how things work with 0 setup. Reminds me of DeepTraffic: https://selfdrivingcars.mit.edu/deeptraffic/
The car demo looks pretty cool; maybe I can use that for another project.
This might be a dumb question. How do, say, the discovered CartPoleSwingUp nets respond to changing the mass of the cart and/or the pole? That is, if I take the net found after 1024 generations and double the mass of the pole, would it still be able to upright the pole, assuming the motors could output the required force?
As a bit of an outsider I'd like to ask, how does this compare to the traditional analogue? In what scenarios is this superior to the classic BP NNs?
I understand that the point of the paper is to show that shared-weight training is possible, but the results look promising, hence the question.
There was a mention that one of the test cases achieved results competitive with the current state of the art, but using 10 times fewer connections. That's pretty impressive.
We think this approach has the potential to develop networks useful in applications where we need to train the weights really quickly for the network to adapt to a given task.
We put in the discussion section that the ability to quickly fine-tune weights might find uses in few-shot learning and continual lifelong learning where agents continually acquire, fine-tune, and transfer skills throughout their lifespan.
The question is, which "super task" do you optimize your WANN for so that it is useful for many subtasks that you did not optimize for to begin with? We think perhaps optimizing a WANN to be good at exploration and curiosity might be a good start for it to develop good priors that are useful for new unseen tasks in its environment. There's a lot more work that can be done from here ...
Is that accurate?
In the experimental results section of the article, we also reported the performance increase when we fine-tune the individual weights of the discovered network topology, compared to leaving the parameters untuned.
Here are some related papers:
- original idea: http://people.idsia.ch/~tino/papers/koutnik.gecco10.pdf
- vision-based TORCS: http://repository.supsi.ch/4548/1/koutnik2013fdg.pdf
- backpropagation with compressed weights: http://www.informatik.uni-bremen.de/~afabisch/files/2013_NN_...
For example, in the case of the cart-pole (without swing-up) benchmark, a simple linear controller with equal positive weights is required, which can easily be encoded with this approach.
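A sketch of what such an equal-weight linear controller looks like (the state-vector layout and the bang-bang action encoding are assumptions matching the classic benchmark, not code from these papers):

```python
def cartpole_action(state, w=1.0):
    # state = (cart position, cart velocity, pole angle, pole angular velocity)
    # Every input shares the same positive weight w, so the controller is
    # fully determined by its wiring: push in the direction of the weighted sum.
    s = sum(w * v for v in state)
    return 1 if s > 0 else 0  # 1: push right, 0: push left
```

Because every weight is identical, the whole controller collapses to its connectivity pattern, which is exactly why it is trivial to encode with a shared- or compressed-weight representation.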
Thanks for the references. The GECCO paper on compressed network search has been a big influence on previous projects I worked on, see:
it’s a small community!
I have a couple astrophysics CNN problems for which I am not sure the best architecture and I am now curious to try this out.
This paper really is a breath of fresh air, in my opinion.
> a randomly-initialized LSTM  with a learned linear output layer can predict time series where traditional RNNs fail
This is an extension of NEAT (from 2002), which evolved neural network architectures and weights simultaneously. The interesting thing is by having a shared weight, they appear to be finding much more minimal networks, which can still be trained to have results close to the state of the art.
We are currently cleaning up the code and it should be released soon on GitHub, if you are interested in modifying the experiments described and trying your own ideas.