
Genetic Algorithms for Training Deep Neural Networks for Reinforcement Learning - magoghm
https://arxiv.org/abs/1712.06567
======
jeffclune
Hello all. I am one of the paper's coauthors. Thank you for your interest in
this work! We hope you enjoy it. Just an FYI, this paper is part of a cluster
of five papers released on Tuesday morning, detailed in this blog post:
[https://eng.uber.com/deep-neuroevolution](https://eng.uber.com/deep-neuroevolution)
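
For readers who want a concrete feel for the approach: the core method in the paper is a very simple GA (truncation selection plus Gaussian mutation of the weight vector, no crossover). Here is a rough, runnable sketch; the fitness function, hyperparameters, and names below are toy stand-ins of mine, not the paper's (there, fitness would be the episode return of a deep-net policy):

```python
import random

# Toy stand-in for an RL episode: fitness of a weight vector.
# Made-up quadratic so the sketch is runnable; optimum is all weights at 0.5.
def fitness(weights):
    return -sum((w - 0.5) ** 2 for w in weights)

def mutate(weights, sigma=0.1):
    # Gaussian perturbation of every weight -- the GA's only variation operator.
    return [w + random.gauss(0, sigma) for w in weights]

def simple_ga(n_params=10, pop_size=50, n_elite=10, generations=50, seed=0):
    random.seed(seed)
    population = [[random.gauss(0, 1) for _ in range(n_params)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[:n_elite]  # truncation selection
        # Next generation: carry the best individual over unchanged (elitism),
        # fill the rest with mutated copies of random elite parents.
        population = [elite[0]] + [mutate(random.choice(elite))
                                   for _ in range(pop_size - 1)]
    return max(population, key=fitness)

best = simple_ga()
print(fitness(best))
```

Thanks to elitism, the best fitness can only improve from generation to generation.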

------
frisco
Without reading this specific paper, I think this is the future. If not
genetic search, then some other metaheuristic. I think this is the only really
credible road to AGI in the near term, as it seems to be largely a hardware
limitation that will be alleviated in the next few years.

Through the history of deep learning, the frontier has been networks that can
be trained, against some feedback signal of whether they're working or not, in
~6-8 days. Within a few years there will almost certainly be ASICs that allow
us to train hundreds of thousands or millions of networks in several weeks,
versus the several weeks it takes to train just one such network in 2017. This
will let us start seriously exploring things like evolving networks.

Fundamentally, we know neural networks can instantiate general intelligence,
and we know genetic search is capable of finding the right neural networks.
There are big differences between the CS and biological versions of each, but
it's striking that the big breakthrough in "AI" was deep neural networks and
not anything else.

When I think about the difference between AlphaZero and human intelligence, I
don't think it's "more intelligence." AlphaZero seems perfectly intelligent to
me: I think the difference is more about the selective pressures that produced
us. AlphaZero is a reflection of its environment and the process by which it
developed in that environment. I would be shocked if the future of deep
learning continued to be hand-design from human intuition.

Edit: looking at the link, I want to caveat the above by saying this paper may
or may not be "it," but directionally I think the idea is as underrated and
erroneously out of favor now as neural networks were in 2009. Ken Stanley
(a coauthor on the linked paper) in particular has been hung up since around
2002 on one particular approach, NEAT, which is kind of interesting but
definitely not the be-all and end-all.

~~~
irremediable
> Fundamentally, we know neural networks can instantiate general intelligence

This isn't at all the case! We know neuronal networks, of the human brain
variety specifically, can host this vague thing we're calling general
intelligence.

It's not at all clear that artificial neural networks of the deep learning
variety can do everything a neuronal network can do.

------
zackmorris
Note that using genetic algorithms to come up with initial weights for neural
networks was the state of the art in the late 90s/early 2000s. So this paper
is not as novel as it seems, but it's good to have it for reference.

My feeling is that since shallow networks can be made to have accuracy
equivalent to deep networks, the real challenge isn't topology but training.
Hobbyists have access to so much processing power with GPUs now that
they can explore techniques that weren't practical for experts 20 years ago.
So we may see training speed increase by a few orders of magnitude using
techniques besides gradient descent (maybe quantum computing someday, who
knows).

The big question though is how to combine networks into hierarchies so that
the number of behaviors that can be learned is no longer limited (since
pattern recognition is largely a solved problem). I think the way GAs fit in
is that they make it much easier to understand and build simple NNs, and
possibly train hierarchies or discover topologies that aren't immediately
obvious.

~~~
frisco
Deep learning wasn't novel in 2012 either - it was the removal of a hardware
limitation that made it compelling again. I think the same is true for
evolving DNNs, but I don't know if the available compute power is there yet.

> My feeling is that since shallow networks can be made to have accuracy
> equivalent to deep networks, the real challenge isn't topology but training.

This is not really true though... even very shallow neural networks can be
universal function approximators in a trivial sense because they can be lookup
tables, but they are really not expressive enough to generalize well and lack
a lot of the expressivity of deep networks.

~~~
argonaut
> shallow neural networks can be universal function approximators in a trivial
> sense because they can be lookup tables, but they are really not expressive
> enough to generalize well

You've got it flipped. If anything, shallow neural networks (of an equivalent
number of parameters) are more "expressive" than deep networks, BUT that
expressivity just makes them overfit. This is the bias vs. variance tradeoff.
If anything, deep networks encode our prior belief that there is a hierarchy
of features / a compressed representation, which limits the learned model to
one conforming to those priors.

------
erokar
In 2005 I took a sub-symbolic AI course where one of the homework assignments
was to evolve a neural network using genetic algorithms. So that in itself is
nothing new. The novelty of the paper seems to be the scale of the solution.
Personally, I hope we'll see a bit of a revival of genetic
algorithms/programming -- I think it's promising in design (cars, aeroplanes,
architecture, etc.), since it's good at finding novel solutions to
optimization problems.

~~~
fao_
Overall the idea is about 20 years old. The earliest reference I can think of
is Karl Sims's work. It's pretty impressive considering how old it is:
[http://www.karlsims.com/evolved-virtual-creatures.html](http://www.karlsims.com/evolved-virtual-creatures.html)

~~~
cameldrv
This was the paper that got me excited about neural networks in the nineties,
from '92:
[http://liacs.leidenuniv.nl/assets/PDF/boers-kuiper.92.pdf](http://liacs.leidenuniv.nl/assets/PDF/boers-kuiper.92.pdf)

------
bertil
The paper looks really solid, but the title had me gasp “Bingo!” for a second.

More seriously, thinking about metaheuristics, and which approaches work at
which scale (i.e. machine learning architecture search): that's the future.

------
tree_of_item
I'm not well-versed in machine learning, so I just want to make sure I
understand the general point here: these researchers trained a competitive
deep neural network _without_ using gradient descent? Does this mean neural
networks might become more useful for discrete optimization problems without
gradient information, like program synthesis?

~~~
Yajirobe
Neuroevolution is nothing new. I don't know why people think this is something
big/innovative.

~~~
tree_of_item
Because as far as I know, genetic algorithms perform poorly compared to
basically every other optimization technique. I'm under the impression that
they're not actually used for anything interesting these days, now that the
initial hype about implementing natural selection on a computer has died down.

Stochastic hill climbing as a baseline method for evaluating genetic
algorithms: [http://papers.nips.cc/paper/1172-stochastic-hillclimbing-as-a-baseline-method-for-evaluating-genetic-algorithms.pdf](http://papers.nips.cc/paper/1172-stochastic-hillclimbing-as-a-baseline-method-for-evaluating-genetic-algorithms.pdf)

When will a genetic algorithm outperform hill climbing?
[http://web.cecs.pdx.edu/~mm/nips93.pdf](http://web.cecs.pdx.edu/~mm/nips93.pdf)

A GA being competitive with modern gradient based methods is very surprising
to me.

~~~
Yajirobe
Try optimizing for the shape of an antenna using hill climbing.

The thing about evolutionary approaches is that they can evolve the topology,
which is usually just decided by the people implementing NNs (and not
guaranteed to be anywhere near optimal). I don't see why, given enough time,
GAs (or EAs) could not outperform simple backpropagation based on
differentiating simple cost functions.

~~~
tree_of_item
Why wouldn't you be able to do that with hill climbing?

~~~
ramgorur
Because you can't establish any gradient information from the design variables
(antenna topology) to the objective function (i.e. antenna gain). So most of
the time you might end up climbing in the wrong direction (i.e. into a local
optimum).

~~~
tree_of_item
Hill climbing doesn't use a gradient; that's gradient descent. Hill climbing
is pretty much the same as a GA with a population of 1. There's no reason you
can't design an antenna with it, and before this paper I'd have been convinced
hill climbing would actually do it better.

The person I was replying to was also advocating the use of GAs to solve the
antenna problem, and GAs don't use gradient information either.
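
To make the "population of 1" point concrete: with a single individual and mutation as the only operator, a GA degenerates into stochastic hill climbing. A toy sketch (the objective and all parameters are made up, nothing from the paper):

```python
import random

def objective(x):
    # Made-up smooth objective with its maximum at x = 0.7.
    # Note: no gradient is ever computed below.
    return -(x - 0.7) ** 2

def hill_climb(steps=2000, sigma=0.05, seed=1):
    random.seed(seed)
    x = random.random()
    for _ in range(steps):
        candidate = x + random.gauss(0, sigma)   # "mutate" the lone individual
        if objective(candidate) > objective(x):  # keep only improvements
            x = candidate
    return x

x = hill_climb()
print(x)  # converges close to 0.7
```

The only difference from a GA here is the missing population and crossover: each step mutates the current point and keeps the better of the two, exactly like truncation selection over a population of size one.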

------
m3kw9
Maybe backprop is actually a form of evolution, a straightforward one where
you iterate towards a better form.

~~~
sythe2o0
Better isn't always straightforward, sometimes unintuitive steps need to be
taken to reach a better fitness value. The advantage of evolutionary methods
over backprop is that evolutionary methods can take steps "backwards" and
avoid local minima.
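
A toy illustration of that point (the bimodal function and all parameters here are made up): a greedy single-point climber that never accepts a worse candidate stays on the nearest peak, while a population with large mutations can get an individual across the valley, after which selection pulls everyone to the higher peak.

```python
import math
import random

def f(x):
    # Two peaks: a local one near x = -1 (height ~1), a global one near x = 2
    # (height ~2), separated by a low-fitness valley.
    return math.exp(-(x + 1) ** 2) + 2 * math.exp(-(x - 2) ** 2)

def greedy(x, steps=3000, sigma=0.05, seed=2):
    random.seed(seed)
    for _ in range(steps):
        c = x + random.gauss(0, sigma)
        if f(c) > f(x):  # never accepts a worse point
            x = c
    return x

def population_search(steps=300, pop=40, sigma=1.0, seed=2):
    random.seed(seed)
    xs = [-1.0] * pop  # everyone starts on the local peak
    for _ in range(steps):
        children = [x + random.gauss(0, sigma) for x in xs]
        # Truncation selection over parents + children: one lucky large
        # mutation can land past the valley, enter the pool, and then its
        # descendants take over the population.
        xs = sorted(xs + children, key=f, reverse=True)[:pop]
    return xs[0]

print(greedy(-1.0))        # stays near the local peak at -1
print(population_search()) # typically finds the global peak near 2
```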

~~~
chillee
But local minima aren't a problem for modern neural networks that are
optimizing in very high dimensional space.

~~~
Houshalter
It definitely is a problem. For instance, there was a recent post about trying
to teach a robot to put a board with a hole in it onto a peg. It would just
learn to shove the board next to the peg.

~~~
chillee
Sorry, saw this late.

It's an empirically shown (and partially proven) result that deep neural
networks don't fall into bad local minima in very high dimensional parameter
spaces.

GANs and reinforcement learning are different. Research on getting those to
converge to good local minima is still much more in its infancy. I don't
particularly consider those just a "neural network", but sorry, I should have
been more clear.

~~~
Houshalter
This thread is about reinforcement learning, which definitely suffers from
local minima.

But even vanilla supervised nets suffer from local minima. Anyone who's played
with them has encountered it. Here you can mess around with a neural net live
in the browser and it very easily gets stuck if you try more than 3 layers
(especially try the spiral dataset):
[http://playground.tensorflow.org/](http://playground.tensorflow.org/)

~~~
chillee
That's why I said high dimensional neural networks. There's been a lot of
literature explaining why local minima aren't a problem in very high dimension
loss surfaces.

Check any of the literature on this subject:
[https://arxiv.org/abs/1611.06310v2](https://arxiv.org/abs/1611.06310v2)

[https://arxiv.org/abs/1406.2572](https://arxiv.org/abs/1406.2572)

Local minima are something that people thought was gonna be a problem,
especially back in the 2000s. They played around with small neural nets on toy
examples such as yours, and thought it was intractable. It's the entire reason
why neural nets fell out of fashion in the early 2000s, and people moved
towards techniques like SVMs.

These toy examples don't generalize to high dimensions, and if you take a look
at the literature, you'll see that the consensus agrees with my statement.

~~~
Houshalter
Ehh, these theoretical results have questionable application to real life. Sure,
it might be very easy to learn simple correlations like "this patch of pixels
correlates highly with the output '8'". But it's trivial to construct examples
where neural nets get stuck in local minima. For instance, try training a net
to multiply two binary numbers.

Maybe with a billion neurons, just by random chance some of them would
correspond to the correct algorithm and get reinforced by backprop. But very
few NNs have layers larger than a thousand neurons. Because the cost of layers
that big grows quadratically. And the chance of random weights finding the
solution decreases exponentially.

One of the biggest reasons things like stochastic gradient descent and dropout
are used is that they help escape local minima.

~~~
chillee
The statement "deep neural networks are not affected by poor local minima" is
not really a personal opinion/theory at this point; it's the dominating
consensus in the research community.

These are not just theoretical results. They're theory papers trying to
explain the empirical result of why neural nets don't get stuck at local
minima.

> Given that deep networks are highly nonlinear systems optimized by local
> gradient methods, why do they not seem to be affected by bad local minima?

And other such results.

As I said above, neural nets are obviously able to get stuck in local minima
in toy examples. If you read my above comment, you'll see that that has no
bearing on my initial statement.

Dropout's main motivation is not to break local minima. It's to achieve better
generalization. If it were the case that it was meant to break bad minima,
we'd have better training loss upon adding dropout, which is obviously not
true.

As for SGD, we used to think that it was mainly for computational purposes.
That is, we're unable to batch our entire training set at once, so we have to
split into mini batches.

Modern theory states more that SGD is good for avoiding sharp minima, as well
as some other desirable properties.

I'm not sure you're really reading my comments thoroughly nor checking out the
links, so if you're actually interested in understanding what's really going
on, please do some proper research on the topic.

------
sigi45
i wonder if something like this is the abstraction where we as humans stop
understanding it.

I do know that we define the cost function / the goal of it, but from using
evolutionary Technic to build the network, there is only x layers left to add
to create any net.

------
monfrere
I'm surprised Uber is funding general AI research given their reported
financial situation.

------
hamilyon2
Well, this is surprising. The paper doesn't mention image recognition or voice
recognition at all. Is that because they aren't interesting for new research,
or because in these applications the results were worse than those of gradient
methods?

