Weight-Agnostic Neural Networks

YeGoblynQueenne · on June 13, 2019

>> Our work has connections to existing work not only in deep learning, but also to various other fields

Perhaps I'm, er, biased, but the most obvious similarity that immediately jumped out to me when I started reading this article is with Inductive Programming, the broader field of machine learning algorithms that learn programs, that includes Inductive Functional Programming and Inductive Logic Programming (which I study; hence, "biased").

In fact, I'd go as far as to say that this is a new kind of Inductive Programming. Perhaps it should be called Inductive Continuous Programming or Inductive Neural Progamming or some such. But what's exceedingly clear to me is that what the article describes is a search for a program that is composed of continuous functions, guided by a strong structural bias.

The hallmarks of Inductive Progamming include an explicit encoding of strong structural biases, including by the use of a library of primitives from which the learned model is composed; a strong Occamist bias; learned models that are programs. This work ticks all the boxes. The authors might benefit from looking up the relevan bibliography.

A couple of starting pointers:

Magic Haskeller, an Inductive Functional Programming system.

http://nautilus.cs.miyazaki-u.ac.jp/%7Eskata/MagicHaskeller....

Metagol, an Inductive Logic Programming system:

https://github.com/metagol/metagol

ILASP, a system for the Inductive Learning of Answer Set Programs:

http://ilasp.com/

darkmighty · on June 13, 2019

> In fact, I'd go as far as to say that this is a new kind of Inductive Programming.

I wouldn't. Traditional NNs are widely called funciton approximators, or function induction. That's because they're not Turing-complete, and in this case don't even have memory. This makes them decidedly not programs, just functions.

Of course, they could be modified with recurrence (a suggestion of future work), or even more complete memory systems (e.g. equipping a neural controller with a tape), making it Turing-complete.

Keep in mind the functional approximation power isn't increased by architecture search alone (there's no restriction from doing weight training). What seems novel here is how significant architecture alone is (I wouldn't guess it being this powerful), and this possibly has implications on the connection with the human brain -- in which my vague impression is that connections themselves are much more tuned than weights; also possibly giving greater emphasis on architecture training.

YeGoblynQueenne · on June 13, 2019

[sorry- splitting this off 'cause it's getting a bit large and to make the whole thing easier to read]

>> What seems novel here is how significant architecture alone is (I wouldn't guess it being this powerful), and this possibly has implications on the connection with the human brain -- in which my vague impression is that connections themselves are much more tuned than weights; also possibly giving greater emphasis on architecture training.

You are surprised by how significant architecture is. I am not surprised at all. I think it's been very clear for a while now that the success of deep neural networks is entirely due to their architectures that are fine-tuned to specific tasks. On the one hand, it's obvious that this is the case if you look at the two major successes of deep learning: CNNs for vision and LSTMs (and variants) for sequence learning. Both are extremely precise, extremely intricate architectures with an intense focus one one kind of task, and that kind of task, alone. On the other hand, the vast majority of neural net publications are specifically about new architectures, or, rather, tweaks to existing architectures.

In fact, in my boldest moments I'd go as far as to suggest that weight training with gradient optimisation has kept deep neural nets back. Gradient optimisation gets stuck in local minima, and it will always get stuck in local minima, and therefore always be dumb as bricks. Which is evident in the way deep neural nets overfit and are incapable of extrapolating outside their training datasets.

darkmighty · on June 14, 2019

> Gradient optimisation gets stuck in local minima, and it will always get stuck in local minima, and therefore always be dumb as bricks. Which is evident in the way deep neural nets overfit and are incapable of extrapolating outside their training datasets.

This seems extremely harsh, given the vast successes of deep neural nets calling them 'dumb as bricks' (there are too many examples of extrapolation outside datasets to count[1]). They have proven generalization capability when care is taken against over fitting (almost any model has this ability). Also there are results showing getting stuck in local minima is unlikely in highly-dimensional parameter spaces (the curvature in a large number of directions needs to coincide), and in any case rarely you're seeking the global minimum because of said over-fitting issues.

[1] It should be self-evident because they would hardly be of any use if unable to extrapolate at all. See GANs (https://www.thispersondoesnotexist.com/), AlphaGo (https://en.wikipedia.org/wiki/AlphaGo), etc.

YeGoblynQueenne · on June 15, 2019

That neural nets overfit to their training data and are incapable of extrapolation is not controversial.

For a high-level discussion, see the following article, by Francois Chollet, the maintainer of the deep learning library, Keras:

https://blog.keras.io/the-limitations-of-deep-learning.html

It is also not controversial that gradient optimisation gets stuck in local optima. A good exercise to convince yourself of this is to implement a simple gradient descent algorithm and observe its behaviour.

>> [1] It should be self-evident because they would hardly be of any use if unable to extrapolate at all.

That is why deep neural nets are trained with such vast amounts of data and computing power: it's to compensate for their inability to generalise to unseen instances.

That, too, is not controversial. It is common knowledge that the recent success of deep learning is due to the availability of larger datasets and more powerful computers than in past years. For example, see this interview with Geoff Hinton:

http://techjaw.com/2015/06/07/geoffrey-hinton-deep-learning-...

Geoffrey Hinton: I think it’s mainly because of the amount of computation and the amount of data now around but it’s also partly because there have been some technical improvements in the algorithms. Particularly in the algorithms for doing unsupervised learning where you’re not told what the right answer is but the main thing is the computation and the amount of data.

The algorithms we had in the old days would have worked perfectly well if computers had been a million times faster and datasets had been a million times bigger but if we’d said that thirty years ago people would have just laughed.

I hope this is helpful. I understand there are misconceptions about the strengths and limitations of deep neural nets that are not very easy to see with the naked eye. But it's a good idea to approach such an overhypoted technology with a skeptical mind and, e.g., ask yourself: why is all that data necessary, if they are so good at learning?

YeGoblynQueenne · on June 13, 2019

>> I wouldn't. Traditional NNs are widely called funciton approximators, or function induction. That's because they're not Turing-complete, and in this case don't even have memory. This makes them decidedly not programs, just functions.

But what the article describes is not function approximation anymore. Their system doesn't optimise a set of parameters. The search for an optimal set of parameters has been replaced entirely with a search for an optimal architecture with a single, shared parameter.

Note also that you don't need something to be Turing-complete for it to be a program. For instance, a regular automaton is a program, but it's not Turing complete. Turing completeness makes something, well, a Universal Turing Machine, which can compute any program. But a program is just a set of instructions.

darkmighty · on June 14, 2019

> Note also that you don't need something to be Turing-complete for it to be a program. For instance, a regular automaton is a program, but it's not Turing complete. Turing completeness makes something, well, a Universal Turing Machine, which can compute any program. But a program is just a set of instructions.

A regular automaton is a program (and a function is a program), but all programs cannot be expressed as finite (regular) automatons (FAs). So FAs are a subset of programs. If we equate programs with mathematical algorithms, in general they really need the unlimited memory aspect (or at least the capability of abstracting memory, which in practice is always finite of course).

To make things clear, an example: a fixed function (or FA) cannot sort a list of numbers of variable size. You can train your function to sort any K numbers, which are the input variables. But because it has no memory, you cannot input variables sequentially (in the FA case you cannot input arbitrarily many variables). You've essentially created a sorting network[1].

Not only that, but for each input growth you need to re-train your architecture. The nice thing about program inference is that it hopefully captures the fundamental algorithm behind a program, which generalizes to arbitrary sizes. It's not that arbitrary sizes are necessary in practice (again computers are always bounded), it's that this abstraction/generalization is extremely useful, saving data and allowing large (arbitrary) variation in input.

[1] https://en.wikipedia.org/wiki/Sorting_network

> a Universal Turing Machine, which can compute any program

I feel there's a bit of confusion here. Turing Machines themselves can compute any program (i.e. you can represent any program as a TM) -- that follows essentially from the definition of algorithm (a set of well defined steps given unlimited paper); again languages accepted by FAs are a restricted subset of TMs (the former accepts a regular language while the latter recursively enumerable languages). There are special Turing Machines that simulate arbitrary Turing Machines -- those are called Universal Turing machines. It's a desirable characteristic of any computer, of course (the ability to input arbitrary programs).

We refer to program description languages that can describe any Turing Machine as Turing-complete. You could make a Universal Turing machine that accepts this kind of language directly or you could in principle just build a Turing machine that executes the given algorithm (for example using a gate array/FPGA to encode the state automaton, plus some large memory for the tape).

YeGoblynQueenne · on June 15, 2019

Yes, sloppy definition of Turing completenes on my part. I apologise unrservedly.

But, I still don't understand your objection. You agree that a regular automaton is a program, and that it's not Turing complete. So a program does not need to be Turing complete.

I didn't say, nor do I think, that the algorithm described in this article is Turing complete. I think it perfoms program induction. That's as far as I went. I still don't see where Turing completeness comes into it.

MasterScrat · on June 13, 2019

Loved how they implemented CartPole in JS so you can play with it.

I can't wait for more browser-based RL env so people can get a feeling of how things works with 0 setup. Reminds me of DeepTraffic: https://selfdrivingcars.mit.edu/deeptraffic/

hardmaru · on June 13, 2019

Thanks, glad you liked it!

The car demo looks pretty cool maybe I can use that for another project.

magicalhippo · on June 13, 2019

Very nice and interesting article!

This might be a dumb question. How does say the discovered CartPoleSwingUp nets respond to changing the mass of the cart and/or the pole? That is, if I take the net found after 1024 generations, and doubled the mass of the pole. Would it still be able to upright the pole, assuming the motors could output the required force?

p1esk · on June 13, 2019

If we compare the WANN architecture evolved for MNIST, to a simple one layer NN (e.g. a linear classifier, where inputs are connected directly to the outputs with different weights), it seems to me that the evolved architecture uses activations and skip connections to simulate weights. Therefore given enough variety in the arrangement of activation functions and number of hops, each pixel position ends up being weighted in a unique but consistent way. I wonder if the authors tried to establish the correspondence between the weights in a linear classifier, and path modulations from input pixel to the output. To put another way, is there an equivalent weight arrangement for a linear classifier to the WANN with a shared weight?

dmos62 · on June 13, 2019

I skimmed through the article. It's very accessible and well-written, compared to my experience with some of the other literature in ML.

As a bit of an outsider I'd like to ask, how does this compare to the traditional analogue? In what scenarios is this superior to the classic BP NNs?

I understand that the point of the paper is to show that shared-weight training is possible, but the results look promising, hence the question.

There was a mention that one of the test cases achieved results competitive with current state of the art, but using 10 times less connections. That's pretty impressive.

hardmaru · on June 13, 2019

Hi, thanks for the feedback.

We think this approach has the potential to develop networks useful in applications where we need to train the weights really quickly for the network to adapt to a given task.

We put in the discussion section that the ability to quickly fine-tune weights might find uses in few-shot learning and continual lifelong learning where agents continually acquire, fine-tune, and transfer skills throughout their lifespan.

The question is, which "super task" do you optimize your WANN for so that it is useful for many subtasks that you did not optimize for to begin with? We think perhaps optimizing a WANN to be good at exploration and curiosity might be a good start for it to develop good priors that are useful for new unseen tasks in its environment. There's a lot more work that can be done from here ...

stochastic_monk · on June 13, 2019

If I understand correctly, the idea isn’t that you’d only use the architecture untrained, but that its goal is better architecture discovery for given tasks or groups of tasks.

Is that accurate?

_k8si · on June 13, 2019

This works for NLP too: https://arxiv.org/abs/1901.10444

brechmos · on June 13, 2019

Given a network topology found using this technique, I wonder how it would compare with the same topology with fit parameters?

hardmaru · on June 13, 2019

Hi,

So in the article we also reported in the experimental results section, the performance increase when we fit the individual parameters using the network topology found, and compared it to the non-tuned parameters.

afabisch · on June 13, 2019

This reminds me a lot of the work on compressed neural network from Jan Koutnik and his colleagues. They don't evolve topology of a NN, but they learn weights of a neural network in some compressed space. That seems to be very similar to weight sharing.

Here are some related papers:

- original idea: http://people.idsia.ch/~tino/papers/koutnik.gecco10.pdf

- vision-based TORCS: http://repository.supsi.ch/4548/1/koutnik2013fdg.pdf

- backpropagation with compressed weights: http://www.informatik.uni-bremen.de/~afabisch/files/2013_NN_...

For example, in the case of the cart pole (without swing up) benchmark a simple linear controller with equal positive weights is required which can easily be encoded with this approach.

hardmaru · on June 13, 2019

Hi,

Thanks for the references. The GECCO paper on compressed network search has been a big influence on previous projects I worked on, see:

https://news.ycombinator.com/item?id=16694153

https://news.ycombinator.com/item?id=14883694

it’s a small community!

brechmos · on June 13, 2019

Awesome, I was just reading the paper when you sent this. It looks like a really interesting direction of work.

I have a couple astrophysics CNN problems for which I am not sure the best architecture and I am now curious to try this out.

s_Hogg · on June 13, 2019

Another way to look at this is something analogous to a Fourier transform. Given a function, try a frequency and see what you get. So maybe some of the things that conceptually make that nicer might also have an analogue here?

This paper really is a breath of fresh air, for mine

raindeer3 · on June 13, 2019

To me this is a strangely written paper although the results might be some what interesting. They just optimize over a different set of functions than you normally do in NN training and find solutions to various problems. That they claim that the weights "are the solution" in normal NN training is wrong. Of course the weights AND the architecture are both important since they define the set of functions you search over. That you can find solutions by only searching over architecture is not surprising since the architecture is the combination of different operations.

s_Hogg · on June 13, 2019

Is there meant to be a direct link between this sort of thing and the randomly-initialised approach to Echo State Networks?

s_Hogg · on June 13, 2019

apologies, reading further they refer to:

> a randomly-initialized LSTM [12] with a learned linear output layer can predict time series where traditional RNNs fail

hardmaru · on June 13, 2019

Yeah, though I think we could have referenced a few more works in the reservoir computing area. Will keep that in mind for the next draft.

theSage · on June 13, 2019

just starting to read this. It reminds me of meiosis networks + stochastic delta rule where the weights were drawn from a random distribution (normal I think).

hardmaru · on June 14, 2019

Added a few cites to these works, thanks.

amelius · on June 13, 2019

Isn't this similar to a neural network where the weights are binary: 0 or 1?

drabiega · on June 13, 2019

I don't want to be dismissive since this is pretty cool, but from just skimming the page it seems like this is basically a version to neural nets that is roughly equivalent to genetic programming?

Scaevolus · on June 13, 2019

Genetic programming is just one way to optimize candidates embedded in a high-dimensional space. Neural network research has been using population-based optimization techniques for a long time, that's not the innovation here.

This is an extension of NEAT (from 2002), which evolved neural network architectures and weights simultaneously. The interesting thing is by having a shared weight, they appear to be finding much more minimal networks, which can still be trained to have results close to the state of the art.

ahpearce · on June 13, 2019

I really love the look and feel of this. Can anyone give a few pointers as to how to achieve something similar? I know it's rather simplistic, but it's really elegant.

hardmaru · on June 13, 2019

Hi,

We are currently cleaning up the code and it should be released soon on GitHub if you are interested to play with modifying the experiments described and try your own ideas.

ahpearce · on June 13, 2019

Thanks for that! I was originally referring to the layout of the article itself, but I'd be interested in both, if possible!

hardmaru · on June 14, 2019

The layout template is based on the distill.pub template I used earlier in another project. The GitHub repo has the instructions on how to build articles using it:

https://github.com/designrl/designrl.github.io

msoad · on June 13, 2019

if a well designed architecture can perform better than learned weights of a general architecture, doesn't it make sense we design a system that can learn to architecture itself as it is learning the weights? Maybe that's the famous "Learning to learn" paper?

p1esk · on June 13, 2019

Yes it does: https://arxiv.org/abs/1905.01392