
Weight-Agnostic Neural Networks - hardmaru
https://weightagnostic.github.io
======
YeGoblynQueenne
>> Our work has connections to existing work not only in deep learning, but
also to various other fields

Perhaps I'm, er, biased, but the most obvious similarity that immediately
jumped out to me when I started reading this article is with Inductive
Programming, the broader field of machine learning algorithms that learn
programs, which includes Inductive Functional Programming and Inductive Logic
Programming (which I study; hence, "biased").

In fact, I'd go as far as to say that this is a new kind of Inductive
Programming. Perhaps it should be called Inductive Continuous Programming or
Inductive Neural Programming or some such. But what's exceedingly clear to me
is that what the article describes is a search for a _program_ that is
composed of continuous functions, guided by a strong structural bias.

The hallmarks of Inductive Programming include an explicit encoding of strong
structural biases, for instance via a library of primitives from which the
learned model is composed; a strong Occamist bias; and learned models that are
themselves programs. This work ticks all the boxes. The authors might benefit
from looking up the relevant bibliography.
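
To make the analogy concrete, here is a toy sketch of that kind of search (my
own illustration, nothing to do with the paper's actual method or the systems
below): enumerate compositions drawn from a library of primitives, shortest
programs first (the Occamist bias), and return the first one consistent with
the examples.

    # Toy inductive-programming search: enumerate compositions of primitives,
    # shortest first (Occamist bias), returning the first program that fits.
    from itertools import product

    PRIMITIVES = {
        "inc": lambda x: x + 1,
        "dbl": lambda x: 2 * x,
        "neg": lambda x: -x,
    }

    def compose(names):
        """Build a unary function by applying the named primitives in order."""
        def f(x):
            for name in names:
                x = PRIMITIVES[name](x)
            return x
        return f

    def induce(examples, max_len=4):
        """Return the shortest composition consistent with all (x, y) pairs."""
        for length in range(1, max_len + 1):          # shortest programs first
            for names in product(PRIMITIVES, repeat=length):
                f = compose(names)
                if all(f(x) == y for x, y in examples):
                    return names
        return None

    # Recovers the "program" ('inc', 'dbl') from examples of x -> 2 * (x + 1).
    print(induce([(1, 4), (3, 8), (10, 22)]))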

A couple of starting pointers:

Magic Haskeller, an Inductive Functional Programming system:

[http://nautilus.cs.miyazaki-u.ac.jp/%7Eskata/MagicHaskeller....](http://nautilus.cs.miyazaki-u.ac.jp/%7Eskata/MagicHaskeller.html)

Metagol, an Inductive Logic Programming system:

[https://github.com/metagol/metagol](https://github.com/metagol/metagol)

ILASP, a system for the Inductive Learning of Answer Set Programs:

[http://ilasp.com/](http://ilasp.com/)

~~~
darkmighty
> In fact, I'd go as far as to say that this is a new kind of Inductive
> Programming.

I wouldn't. Traditional NNs are widely described as function approximators, or
as performing function induction. That's because they're not Turing-complete,
and in this case don't even have memory. This makes them decidedly not
programs, just functions.

Of course, they could be modified with recurrence (a suggestion of future
work), or even with more complete memory systems (e.g. equipping a neural
controller with a tape), making them Turing-complete.

Keep in mind the function approximation power isn't increased by architecture
search alone (nothing stops you from also training the weights). What seems
novel here is how significant _architecture alone_ is (I wouldn't have guessed
it would be this powerful), and this possibly has implications for the
connection with the human brain -- where my vague impression is that the
connections themselves are tuned much more than the weights -- and possibly
argues for greater emphasis on architecture training.

~~~
YeGoblynQueenne
[sorry- splitting this off 'cause it's getting a bit large and to make the
whole thing easier to read]

>> What seems novel here is how significant architecture alone is (I wouldn't
have guessed it would be this powerful), and this possibly has implications for
the connection with the human brain -- where my vague impression is that the
connections themselves are tuned much more than the weights -- and possibly
argues for greater emphasis on architecture training.

You are surprised by how significant architecture is. I am not surprised at
all. I think it's been very clear for a while now that the success of deep
neural networks is entirely due to their architectures that are fine-tuned to
specific tasks. On the one hand, it's obvious that this is the case if you
look at the two major successes of deep learning: CNNs for vision and LSTMs
(and variants) for sequence learning. Both are extremely precise, extremely
intricate architectures with an intense focus on one kind of task, and that
kind of task alone. On the other hand, the vast majority of neural net
publications are specifically about new architectures, or, rather, tweaks to
existing architectures.

In fact, in my boldest moments I'd go as far as to suggest that weight
training with gradient optimisation has kept deep neural nets back. Gradient
optimisation gets stuck in local minima, and it will _always_ get stuck in
local minima, and therefore always be dumb as bricks. Which is evident in the
way deep neural nets overfit and are incapable of extrapolating outside their
training datasets.

~~~
darkmighty
> Gradient optimisation gets stuck in local minima, and it will always get
> stuck in local minima, and therefore always be dumb as bricks. Which is
> evident in the way deep neural nets overfit and are incapable of
> extrapolating outside their training datasets.

Calling them 'dumb as bricks' seems extremely harsh given the vast successes of
deep neural nets (there are too many examples of extrapolation outside training
datasets to count[1]). They have proven generalization capability when care is
taken against overfitting (almost any model has this ability). Also, there are
results showing that getting stuck in local minima is unlikely in high-
dimensional parameter spaces (the curvature in a large number of directions
would need to coincide), and in any case you're rarely seeking the global
minimum, because of said overfitting issues.

[1] It should be self-evident because they would hardly be of any use if
unable to extrapolate at all. See GANs
([https://www.thispersondoesnotexist.com/](https://www.thispersondoesnotexist.com/)),
AlphaGo
([https://en.wikipedia.org/wiki/AlphaGo](https://en.wikipedia.org/wiki/AlphaGo)),
etc.

~~~
YeGoblynQueenne
That neural nets overfit to their training data and are incapable of
extrapolation is not controversial.

For a high-level discussion, see the following article, by Francois Chollet,
the maintainer of the deep learning library, Keras:

[https://blog.keras.io/the-limitations-of-deep-learning.html](https://blog.keras.io/the-limitations-of-deep-learning.html)

It is also not controversial that gradient optimisation gets stuck in local
optima. A good exercise to convince yourself of this is to implement a simple
gradient descent algorithm and observe its behaviour.
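
For instance, a few lines of plain gradient descent on a toy non-convex
function (my own example) show how the outcome depends entirely on the
starting point:

    # Plain gradient descent on f(x) = x^4 - 3x^2 + x, which has two minima.
    # Which basin you end up in depends only on where you start.
    def grad(x):
        return 4 * x**3 - 6 * x + 1          # derivative of x^4 - 3x^2 + x

    def descend(x, lr=0.01, steps=2000):
        for _ in range(steps):
            x -= lr * grad(x)
        return x

    print(descend(-2.0))   # converges near x ~ -1.30, the global minimum
    print(descend(+2.0))   # converges near x ~ +1.13, a strictly worse local minimum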

>> [1] It should be self-evident because they would hardly be of any use if
unable to extrapolate at all.

That is why deep neural nets are trained with such vast amounts of data and
computing power: it's to compensate for their inability to generalise to
unseen instances.

That, too, is not controversial. It is common knowledge that the recent
success of deep learning is due to the availability of larger datasets and
more powerful computers than in past years. For example, see this interview
with Geoff Hinton:

[http://techjaw.com/2015/06/07/geoffrey-hinton-deep-learning-...](http://techjaw.com/2015/06/07/geoffrey-hinton-deep-learning-in-baby-steps-and-the-future-of-google-and-ai/)

 _Geoffrey Hinton: I think it’s mainly because of the amount of computation
and the amount of data now around but it’s also partly because there have been
some technical improvements in the algorithms. Particularly in the algorithms
for doing unsupervised learning where you’re not told what the right answer is
but the main thing is the computation and the amount of data.

The algorithms we had in the old days would have worked perfectly well if
computers had been a million times faster and datasets had been a million
times bigger but if we’d said that thirty years ago people would have just
laughed._

I hope this is helpful. I understand there are misconceptions about the
strengths and limitations of deep neural nets that are not very easy to see
with the naked eye. But it's a good idea to approach such an overhyped
technology with a skeptical mind and, e.g., ask yourself: why is all that data
necessary, if they are so good at learning?

------
MasterScrat
Loved how they implemented CartPole in JS so you can play with it.

I can't wait for more browser-based RL envs so people can get a feeling for how
things work with zero setup. Reminds me of DeepTraffic:
[https://selfdrivingcars.mit.edu/deeptraffic/](https://selfdrivingcars.mit.edu/deeptraffic/)

~~~
hardmaru
Thanks, glad you liked it!

The car demo looks pretty cool; maybe I can use that for another project.

------
magicalhippo
Very nice and interesting article!

This might be a dumb question. How do, say, the discovered CartPoleSwingUp
nets respond to changing the mass of the cart and/or the pole? That is, if I
take the net found after 1024 generations and double the mass of the pole,
would it still be able to swing the pole upright, assuming the motors could
output the required force?
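
For what it's worth, here is roughly how I'd imagine probing that on the plain
CartPole task (a sketch assuming the classic gym API and its `masspole` /
`masscart` attributes; the hand-coded `policy` below is only a stand-in for
the evolved net, and the paper's swing-up environment differs):

    import gym

    def policy(obs):
        # Stand-in for the evolved net: a hand-coded balance heuristic,
        # pushing right when the pole is leaning/falling to the right.
        return 1 if obs[2] + obs[3] > 0 else 0

    def rollout(env, episodes=10):
        total = 0.0
        for _ in range(episodes):
            obs, done = env.reset(), False      # classic gym API assumed
            while not done:
                obs, reward, done, _ = env.step(policy(obs))
                total += reward
        return total / episodes

    env = gym.make("CartPole-v1")
    print("original pole mass:", rollout(env))

    cp = env.unwrapped                          # expose the raw CartPoleEnv
    cp.masspole *= 2                            # double the pole mass ...
    cp.total_mass = cp.masscart + cp.masspole   # ... and refresh derived values
    cp.polemass_length = cp.masspole * cp.length
    print("doubled pole mass: ", rollout(env))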

------
p1esk
If we compare the WANN architecture evolved for MNIST to a simple one-layer
NN (e.g. a linear classifier, where inputs are connected directly to the
outputs with different weights), it seems to me that the evolved architecture
uses activations and skip connections to simulate weights. Therefore given
enough variety in the arrangement of activation functions and number of hops,
each pixel position ends up being weighted in a unique but consistent way. I
wonder if the authors tried to establish the correspondence between the
weights in a linear classifier, and path modulations from input pixel to the
output. To put it another way, is there an equivalent weight arrangement for a
linear classifier to the WANN with a shared weight?
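
For the regions where the activations behave linearly, that correspondence is
easy to write down (a back-of-the-envelope sketch of my own, assuming identity
activations, which the WANN's varied activation functions of course are not):
with a shared weight w, each input's effective coefficient on an output is the
sum over paths of w raised to the path length, so the wiring alone fixes a
per-pixel weighting.

    # Toy sketch: with identity activations and a single shared weight w, the
    # effective coefficient of an input on an output is sum over paths of
    # w ** path_length, so the topology alone determines each input's weighting.
    def effective_weight(edges, src, dst, w):
        """edges maps each node to its list of successors (a DAG)."""
        def path_lengths(node):
            if node == dst:
                return [0]
            return [1 + rest for nxt in edges.get(node, []) for rest in path_lengths(nxt)]
        return sum(w ** length for length in path_lengths(src))

    # Input 'x' reaches 'out' both directly and through a hidden node 'h':
    edges = {"x": ["h", "out"], "h": ["out"]}
    print(effective_weight(edges, "x", "out", w=0.5))   # 0.5 + 0.5**2 = 0.75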

------
dmos62
I skimmed through the article. It's very accessible and well-written, compared
to my experience with some of the other literature in ML.

As a bit of an outsider I'd like to ask, how does this compare to the
traditional analogue? In what scenarios is this superior to the classic BP
NNs?

I understand that the point of the paper is to show that shared-weight
training is possible, but the results look promising, hence the question.

There was a mention that one of the test cases achieved results competitive
with the current state of the art, but using 10 times fewer connections. That's
pretty impressive.

~~~
hardmaru
Hi, thanks for the feedback.

We think this approach has the potential to develop networks useful in
applications where we need to train the weights really quickly for the network
to adapt to a given task.

We put in the discussion section that the ability to quickly fine-tune weights
might find uses in few-shot learning and continual lifelong learning where
agents continually acquire, fine-tune, and transfer skills throughout their
lifespan.

The question is, which "super task" do you optimize your WANN for so that it
is useful for many subtasks that you did not optimize for to begin with? We
think perhaps optimizing a WANN to be good at exploration and curiosity might
be a good start for it to develop good priors that are useful for new unseen
tasks in its environment. There's a lot more work that can be done from here
...

~~~
stochastic_monk
If I understand correctly, the idea isn’t that you’d only use the architecture
untrained, but that its goal is better architecture discovery for given tasks
or groups of tasks.

Is that accurate?

------
_k8si
This works for NLP too:
[https://arxiv.org/abs/1901.10444](https://arxiv.org/abs/1901.10444)

------
brechmos
Given a network topology found using this technique, I wonder how it would
compare with the same topology with fitted parameters?

~~~
hardmaru
Hi,

In the experimental results section of the article, we also report the
performance increase when we fit the individual parameters of the discovered
network topology, compared with the non-tuned parameters.

~~~
afabisch
This reminds me a lot of the work on compressed neural networks from Jan
Koutnik and his colleagues. They don't evolve the topology of a NN, but they
learn the weights of a neural network in some compressed space. That seems
very similar to weight sharing.

Here are some related papers:

- original idea:
[http://people.idsia.ch/~tino/papers/koutnik.gecco10.pdf](http://people.idsia.ch/~tino/papers/koutnik.gecco10.pdf)

- vision-based TORCS:
[http://repository.supsi.ch/4548/1/koutnik2013fdg.pdf](http://repository.supsi.ch/4548/1/koutnik2013fdg.pdf)

- backpropagation with compressed weights:
[http://www.informatik.uni-bremen.de/~afabisch/files/2013_NN_...](http://www.informatik.uni-bremen.de/~afabisch/files/2013_NN_LCS.pdf)

For example, in the case of the cart pole (without swing-up) benchmark, a
simple linear controller with equal positive weights is all that is required,
which can easily be encoded with this approach.
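
To make that last point concrete, here is a rough sketch of the kind of
controller meant (my own illustration on the standard CartPole state, not the
exact benchmark or state encoding used in those papers): equal positive
weights on the pole terms, scaled by a single shared parameter and
thresholded into an action, so the whole policy amounts to one weight plus
wiring.

    # Sketch of a one-parameter linear cart-pole controller: a single shared
    # positive weight w on the pole angle and angular velocity, thresholded.
    # The state encoding in the papers above differs; this is only illustrative.
    def policy(obs, w=1.0):
        cart_pos, cart_vel, pole_angle, pole_vel = obs
        return 1 if w * (pole_angle + pole_vel) > 0 else 0   # push right / left

    # Pole leaning right but swinging back toward upright -> push left (0).
    print(policy((0.0, 0.0, 0.05, -0.2)))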

~~~
hardmaru
Hi,

Thanks for the references. The GECCO paper on compressed network search has
been a big influence on previous projects I worked on, see:

[https://news.ycombinator.com/item?id=16694153](https://news.ycombinator.com/item?id=16694153)

[https://news.ycombinator.com/item?id=14883694](https://news.ycombinator.com/item?id=14883694)

it’s a small community!

------
s_Hogg
Another way to look at this is something analogous to a Fourier transform.
Given a function, try a frequency and see what you get. So maybe some of the
things that conceptually make that nicer might also have an analogue here?

This paper really is a breath of fresh air, for mine.

------
raindeer3
To me this is a strangely written paper, although the results might be
somewhat interesting. They just optimize over a different set of functions
than you normally do in NN training and find solutions to various problems.
Their claim that the weights "are the solution" in normal NN training is wrong.
Of course the weights AND the architecture are both important since they
define the set of functions you search over. That you can find solutions by
only searching over architecture is not surprising since the architecture is
the combination of different operations.

------
s_Hogg
Is there meant to be a direct link between this sort of thing and the
randomly-initialised approach to Echo State Networks?

~~~
s_Hogg
apologies, reading further they refer to:

> a randomly-initialized LSTM [12] with a learned linear output layer can
> predict time series where traditional RNNs fail

~~~
hardmaru
Yeah, though I think we could have referenced a few more works in the
reservoir computing area. Will keep that in mind for the next draft.

~~~
theSage
Just starting to read this. It reminds me of meiosis networks + the stochastic
delta rule, where the weights were drawn from a random distribution (normal, I
think).

~~~
hardmaru
Added a few cites to these works, thanks.

------
amelius
Isn't this similar to a neural network where the weights are binary: 0 or 1?

------
drabiega
I don't want to be dismissive since this is pretty cool, but from just
skimming the page it seems like this is basically a version of neural nets
that is roughly equivalent to genetic programming?

~~~
Scaevolus
Genetic programming is just one way to optimize candidates embedded in a high-
dimensional space. Neural network research has been using population-based
optimization techniques for a long time; that's not the innovation here.

This is an extension of NEAT (from 2002), which evolved neural network
architectures and weights simultaneously. The interesting thing is that, by
having a shared weight, they appear to find much more minimal networks, which
can _still be trained_ to achieve results close to the state of the art.
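
As I read it, the evaluation loop is roughly the following (a sketch of my
reading of the paper, not the authors' code; `build_network` and `rollout` are
made-up placeholders): each candidate topology gets scored across a whole
sweep of shared-weight values rather than at one trained setting, so good
topologies are ones that work for almost any weight.

    # Sketch of weight-agnostic evaluation (placeholders, not the real code):
    # rank a candidate topology by its performance averaged over a sweep of
    # values for the single shared weight used by every connection.
    import statistics

    SHARED_WEIGHTS = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]

    def evaluate_topology(topology, env):
        scores = []
        for w in SHARED_WEIGHTS:
            net = build_network(topology, shared_weight=w)   # hypothetical constructor
            scores.append(rollout(net, env))                 # hypothetical episode return
        # The paper ranks candidates by mean score, peak score, and number of
        # connections; only the mean is shown here.
        return statistics.mean(scores)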

------
ahpearce
I really love the look and feel of this. Can anyone give a few pointers as to
how to achieve something similar? I know it's rather simplistic, but it's
really elegant.

~~~
hardmaru
Hi,

We are currently cleaning up the code, and it should be released soon on
GitHub if you are interested in playing with the experiments described and
trying your own ideas.

~~~
ahpearce
Thanks for that! I was originally referring to the layout of the article
itself, but I'd be interested in both, if possible!

~~~
hardmaru
The layout template is based on the distill.pub template I used earlier in
another project. The GitHub repo has the instructions on how to build articles
using it:

[https://github.com/designrl/designrl.github.io](https://github.com/designrl/designrl.github.io)

------
msoad
If a well-designed architecture can perform better than the learned weights of
a general architecture, doesn't it make sense to design a system that can
learn the architecture itself as it is learning the weights? Maybe that's the
famous "Learning to learn" paper?

~~~
p1esk
Yes it does:
[https://arxiv.org/abs/1905.01392](https://arxiv.org/abs/1905.01392)

