
Evolution Strategies as a Scalable Alternative to Reinforcement Learning - melqdusy
https://blog.openai.com/evolution-strategies/
======
Seanny123
I feel like this is more an argument against the efficiency of flat (non-
hierarchical, non-model-based) Deep Reinforcement Learning than an argument
for Genetic/Evolutionary Algorithms. As in, if you're not solving a task more
efficiently than Evolutionary Algorithms, or a task more difficult than they
can handle, you're doing something wrong. It's similar to how Deep PCA and Deep
Random Forests blew ConvNets out of the water on basic benchmarks like MNIST
but couldn't compete on larger datasets, which indicates the kind of proof we
should require before getting excited about a new technique.

~~~
argonaut
Not sure where you're getting this from, but CNNs are definitely still state
of the art on MNIST. Papers often cite outdated CNN numbers; in fact the best
published CNN accuracy numbers are probably lower than they could be - vanilla
MNIST for supervised learning is a pretty useless benchmark for computer
vision researchers now.

~~~
Seanny123
My bad, I should have been more specific. The Deep PCA paper shows superior
performance not on baseline MNIST, but on the MNIST variations.

I also agree that vanilla MNIST is pretty useless for Computer Vision
researchers, and I was trying (awkwardly) to support that idea by showing how
these other non-Deep-Learning techniques performed equally well.

~~~
argonaut
The Deep PCA paper shows no such thing. They've cherry-picked bad CNN
baselines. The red flag for me is that the "state of the art" methods they
compare to achieve > 1% test error on vanilla MNIST, which is totally wrong
since CNNs routinely achieve < 0.5% test error (even by the standards of 2014,
when the paper was published).

------
alfa02
Policy search is very effective. This is not a new finding, contrary to what
this article seems to suggest in the abstract.

Variants of ES have been used for years
([http://dl.acm.org/citation.cfm?id=1645634](http://dl.acm.org/citation.cfm?id=1645634)).
The article seems to ignore almost all the work in robotics, e.g. from Jan
Peters' research group ([http://www.jan-peters.net/](http://www.jan-peters.net/) -> publications).

The good thing is that we now have one more paper that justifies this research
direction, plus a little more public attention.

------
mark_l_watson
Great write up.

I had a similar example in my 20-year-old book 'C++ Power Paradigms', in which
I used a genetic algorithm to train the weights in a recurrent network. As a
performance hack, weights were initially represented by just a few bits, and
the bit length was gradually increased, which greatly enlarged the search
space. I never got this to scale past small networks, but I have thought about
revisiting my old code, since I have a lot more computing power available now
than I did 20 years ago.
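
A minimal sketch of that idea, assuming a toy single-neuron "network" and a
made-up encoding of my own (this is not the book's actual code): encode each
weight as a few bits, decode to floats to evaluate fitness, and rerun with a
wider encoding to expand the search space.

    import numpy as np

    def decode(bits, n_bits, lo=-2.0, hi=2.0):
        # Interpret each group of n_bits as an unsigned int, then map to [lo, hi].
        ints = bits.reshape(-1, n_bits).dot(1 << np.arange(n_bits)[::-1])
        return lo + (hi - lo) * ints / (2 ** n_bits - 1)

    def fitness(w, xs, ys):
        # Toy stand-in for the recurrent network: one tanh unit; higher is better.
        return -np.mean((np.tanh(w[0] * xs + w[1]) - ys) ** 2)

    rng = np.random.default_rng(0)
    xs = np.linspace(-1, 1, 30)
    ys = np.tanh(0.7 * xs + 0.2)          # data the GA should fit

    n_weights, pop_size = 2, 40
    for n_bits in (4, 6, 8):              # the "hack": gradually widen the encoding
        # Each stage restarts from a random population for brevity; re-encoding
        # the best individuals at the higher precision would carry progress over.
        pop = rng.integers(0, 2, (pop_size, n_weights * n_bits))
        for _ in range(100):
            scores = np.array([fitness(decode(ind, n_bits), xs, ys) for ind in pop])
            parents = pop[np.argsort(scores)[-pop_size // 2:]]          # keep best half
            kids = parents[rng.integers(0, len(parents), pop_size - len(parents))].copy()
            kids[rng.random(kids.shape) < 0.05] ^= 1                    # bit-flip mutation
            pop = np.vstack([parents, kids])
        print(n_bits, "bits -> best fitness", round(float(scores.max()), 5))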

------
vn0m
I'm not an expert, but it seems to me that the proposed approach doesn't
compute the value function. By directly optimizing the policy function, aren't
we losing some key ingredient for generalization? I mean, could this algorithm
be used in a fully non-deterministic environment? Like human vs. machine?

~~~
alfa02
I am not sure at the moment. I guess the main problem for ES in a complex non-
deterministic environment is that it would average over multiple local minima
that occur within one generation, which would result in a non-optimal solution.
There are policy search methods that address this problem (e.g. VIPS
[https://scholar.google.com/citations?view_op=view_citation&h...](https://scholar.google.com/citations?view_op=view_citation&hl=de&user=GL360kMAAAAJ&citation_for_view=GL360kMAAAAJ:IjCSPb-OGe4C), hierarchical REPS
[https://scholar.google.com/citations?view_op=view_citation&h...](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=PxgVi0cAAAAJ&citation_for_view=PxgVi0cAAAAJ:u5HHmVD_uO8C)).

Temporal credit assignment is another problem: the policy is updated only after
a full episode, so there is no way to use information about which step was
responsible for which reward. Policy search usually works well when the value
function is very complex but the optimal policy is simple.
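
To make that concrete, the gradient estimator in the OpenAI paper weights each
sampled perturbation only by the scalar return of its whole rollout (in my
notation: sigma is the noise scale, n the population size, and F_i the total
episode return obtained with perturbation epsilon_i):

    \nabla_\theta \, \mathbb{E}_{\epsilon \sim N(0, I)}\big[F(\theta + \sigma\epsilon)\big] \approx \frac{1}{n\sigma} \sum_{i=1}^{n} F_i \, \epsilon_i

No per-step quantity appears in this update, and returns from quite different
behaviors within one generation are averaged into a single step.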

------
shpx
> we were able to solve one of the hardest MuJoCo tasks (a 3D humanoid) using
> 1,440 CPUs across 80 machines in only 10 minutes. As a comparison, in a
> typical setting 32 A3C workers on one machine would solve this task in about
> 10 hours.

So using 80 times more machines makes you 60 times faster (assuming those are
the same machines) "while performing better on 23 games tested, and worse on
28"[0]?

[0] The paper for this blog post
[https://arxiv.org/pdf/1703.03864.pdf](https://arxiv.org/pdf/1703.03864.pdf)

Asynchronous advantage actor critic (A3C)
[https://arxiv.org/pdf/1602.01783.pdf](https://arxiv.org/pdf/1602.01783.pdf)

------
pebblexe
"A Field Guide to Genetic Programming" ([https://www.amazon.com/Field-Guide-
Genetic-Programming/dp/14...](https://www.amazon.com/Field-Guide-Genetic-
Programming/dp/1409200736)) is one of the better books I've read on the matter
(but I've yet to read anything by Koza). W. Langdon's webpage
([http://www0.cs.ucl.ac.uk/staff/W.Langdon/](http://www0.cs.ucl.ac.uk/staff/W.Langdon/))
has a lot of great information.

~~~
symstym
Evolution Strategies is not an example of Genetic Programming, and that book
by Langdon doesn't cover anything directly relevant to the linked article.

~~~
pebblexe
You're right, I'm sorry.

------
rawoke083600
Reading the code, I don't get: 1. How to model the different actions

2. How to tie future rewards to the action at this timestep...

Can anyone help with a better nitty-gritty explanation?

~~~
symstym
The example code inline in the article just illustrates the basic idea of
Evolution Strategies (ES), not their new work in applying ES.

The behavior of agents is determined by a "policy function". This function
takes in inputs (e.g. what the agent sees) and outputs actions (e.g. what the
agent does). The policy function has a set of internal parameters that
determines the precise mapping from inputs to outputs.

In their work, they used a neural network as the policy function. The
parameters are just all the weights of the network.

In a simple version, you start with some random weights for the NN. Then you
make many copies of the network, each with a slight random variation made to
the weights. For each of these altered networks, you use them to control an
agent for a while, and see how well the agent performs during that trial
period. Based on how well the different variations do during their trial runs,
you adjust the weights of the network a small amount. You adjust the weights
to be more similar to the variations that did well. Then you repeat the
process indefinitely (generate new variations, test them, etc.).
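
A minimal numpy sketch of that loop, in the spirit of the article's inline
example (the quadratic "reward" and the hyperparameter values below are
placeholders of mine, not OpenAI's actual setup):

    import numpy as np

    def reward(w):
        # Stand-in for "run the agent for an episode and score it";
        # here just a toy function with a known optimum.
        target = np.array([0.5, 0.1, -0.3])
        return -np.sum((w - target) ** 2)

    npop = 50      # number of perturbed copies of the network per step
    sigma = 0.1    # standard deviation of the weight perturbations
    alpha = 0.02   # learning rate
    w = np.random.randn(3)                # initial "network weights"

    for step in range(300):
        noise = np.random.randn(npop, len(w))                   # one perturbation per copy
        returns = np.array([reward(w + sigma * n) for n in noise])
        normed = (returns - returns.mean()) / (returns.std() + 1e-8)
        # Nudge the weights toward the perturbations that scored well.
        w = w + alpha / (npop * sigma) * noise.T.dot(normed)

    print(w)   # ends up close to the target above

Their actual experiments replace the toy reward with a full environment rollout
and the three numbers with all the weights of a policy network, with the
perturbed copies evaluated in parallel across many workers.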

~~~
rawoke083600
Very good, sir! Makes more sense now! Thank you.

------
Drdrdrq
This is a fantastic summary of both RL and ES (also known as genetic
algorithms). Kudos to the authors!

~~~
dimatura
While they're in the same general family of black box optimization algorithms,
ES and GAs are not really the same, and in fact were developed independently.
The historical roots of both are actually quite interesting.

~~~
Drdrdrq
Thanks for correcting me, I appreciate it! This is a nice summary of
differences: [0].

[0] [http://stackoverflow.com/questions/7787232/difference-betwee...](http://stackoverflow.com/questions/7787232/difference-between-genetic-algorithms-and-evolution-strategies)

