
Population-based training of neural networks - Tenoke
https://deepmind.com/blog/population-based-training-neural-networks/
======
knexer
Two things stick out to me after a first read:

First, this actually learns a schedule for each hyperparameter, not just a
good set of fixed values, automatically discovering learning rate annealing
and related techniques. This seems incredibly powerful. It is also learning
hyperparameter schedules specific to a single training run - which seems
interesting but not obviously helpful, especially since many of the learned
schedules fairly closely match the baseline hand-tuned ones.

Second, it seems like they're optimizing against their validation metric
directly; isn't that basically 'cheating' (i.e. defeats much of the point of
having a separate validation metric in the first place)? It also seems
completely orthogonal to their technique - could they not have optimized for
the same loss function as the network itself? Is this an improvement over
state of the art, or is it just overfitting to the validation metric?

~~~
gwern
Well, they consider RL problems extensively, and as the joke goes, in RL it's
OK to overfit to your validation set - if you can.

As for regular supervised learning: it's no worse than, say, early stopping
based on validation scores. It should be wrong but in practice NNs generalize
anyway, and since this paper implies that Google Brain & DM are doing this
hyperparameter optimization routinely now for everything, I figure that they
would have noticed any overfitting problems by now (either when the methods
fail to outperform on one of Google's private internal huge databases, or when
they rolled out the translator).

------
0x63_Problems
This is really cool! I haven't read through the full paper yet, but it's very
impressive that this method doesn't incur a significant performance cost. I
had assumed that using a genetic-style algorithm would be costly, since you
would need to train a large number of networks individually, but in hindsight
giving every variant equal training time seems naive. Allocating the training
time using intelligent exploration and exploitation is an awesome way to fix
this.

~~~
kirillseva
This is a fairly well-established technique in Bayesian hyperparameter
optimization, where you train a meta-model that keeps track of the
parameter space. Kind of like a manager model, if you will, that learns to
intelligently trade off exploration vs exploitation so that a team of workers
will arrive at the global optimum.

Back when Yahoo! was a real company they used a technique called "multi-armed
bandits" to learn which ads to show. [1]

More recently, there are a number of off-the-shelf packages that you can
trivially integrate into your ML pipeline to optimize the hyperparameters of
your models; I'll include the links (and a small usage sketch) below.

[1] multi-armed bandits
[https://www.theregister.co.uk/2011/09/23/yahoo_core_personal...](https://www.theregister.co.uk/2011/09/23/yahoo_core_personalization/)

[2] tree of parzen estimators
[https://jaberg.github.io/hyperopt/](https://jaberg.github.io/hyperopt/)

[3] hyperband - what google uses in their internal ML toolkits AFAIK
[https://arxiv.org/pdf/1603.06560.pdf](https://arxiv.org/pdf/1603.06560.pdf)

[4] (shameless plug) gaussian process based hyperparameter optimization
service [https://github.com/avantoss/loop](https://github.com/avantoss/loop)
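
To give a flavor of how little integration work these packages need, here's a
minimal sketch using hyperopt's tree of Parzen estimators ([2] above); the toy
`objective` and the search space are made up for illustration and stand in for
your own train-and-validate routine:

    # Hyperopt TPE: minimize a (stand-in) validation loss over a search space.
    from hyperopt import fmin, tpe, hp

    def objective(params):
        # In practice: train a model with these hyperparameters and return
        # its validation loss. Here a toy quadratic stands in for training.
        return (params['lr'] - 0.01) ** 2 + (params['dropout'] - 0.3) ** 2

    space = {
        'lr': hp.loguniform('lr', -8, 0),           # roughly e^-8 .. 1
        'dropout': hp.uniform('dropout', 0.0, 0.7),
    }

    best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100)
    print(best)   # best hyperparameters found, e.g. {'lr': ..., 'dropout': ...}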

~~~
gwern
This isn't your standard MAB or GP hyperparameter optimization; those
typically require you to train each NN to convergence before further
exploration is done (i.e. each 'round' is training a NN). Skimming the paper, OP
is closer to freeze-thaw or reversible-backpropagation hyperparameter
optimization, or Net2Net meta-RL: the hyperparameter optimization is
monitoring the loss curve of each trained NN, switching between them based on
promisingness like in freeze-thaw, but also switching hyperparameters on the
fly and reusing the trained weights to avoid starting from scratch, Net2Net
style. Each NN being trained is periodically updated to either clone & tweak a
new hyperparameter set to continue training the current NN's parameters, or
clone & tweak the best NN's parameters while keeping the old hyperparameters.
(They only clone the full NN, so they can't do architecture search, but
there's no reason they couldn't use Net2Net or other recent approaches which
similarly recycle the trained weights to avoid the huge computational burden
of training from scratch.)
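
To make the exploit/explore loop concrete, here's a rough sketch of the kind
of step being described (not DeepMind's code; the `Worker` class, the 20%
truncation cutoff, and the 0.8/1.2 perturbation factors are illustrative
assumptions):

    import copy
    import random

    class Worker:
        def __init__(self, hypers, weights):
            self.hypers = hypers      # e.g. {'lr': 0.01, 'momentum': 0.9}
            self.weights = weights    # the network parameters
            self.score = float('-inf')

    def perturb(hypers):
        # "Explore": jitter each hyperparameter up or down on the fly.
        return {k: v * random.choice([0.8, 1.2]) for k, v in hypers.items()}

    def pbt_step(population, train_some_steps, evaluate):
        for w in population:
            train_some_steps(w)       # ordinary SGD for a while
            w.score = evaluate(w)     # e.g. validation accuracy
        population.sort(key=lambda w: w.score)
        cutoff = max(1, len(population) // 5)
        for loser in population[:cutoff]:
            winner = random.choice(population[-cutoff:])
            # "Exploit": clone the better worker's trained weights...
            loser.weights = copy.deepcopy(winner.weights)
            # ...and "explore" by perturbing its hyperparameters, so training
            # continues from the copied weights instead of from scratch.
            loser.hypers = perturb(winner.hypers)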

------
margorczynski
I think this is what Google and the others aim for - no hand-tuning. You
simply specify the problem (some function to optimize) and the data.
Everything in a nice concise package running on Google Cloud using their
custom software and hardware.

------
giacaglia
This seems similar to what Jeff Dean was working on with AutoML:
[https://research.googleblog.com/2017/11/automl-for-large-scale-image.html](https://research.googleblog.com/2017/11/automl-for-large-scale-image.html).
Is DeepMind collaborating with the Google Brain team, and how connected are
the teams? It seems the efforts may be duplicated in some areas...

~~~
billysbeanes
Isn't it good that efforts are duplicated? It commoditizes the work and
results, and provides more jobs, so there are more people who understand this
field. It's unlikely each approach will be exactly the same.

Similarly, take a look at the deep learning library market: caffe (out of
Berkeley), tensorflow (Google), pytorch (Facebook)... each has different
strengths, but I'm sure glad the pytorch people pushed ahead, even though
Google put a ton of marketing effort into TF, simply because now we have more
awesome things :).

Once a market or product is mature, I can see the "duplicates are wasteful"
argument. But a nascent, exploratory field like ML/DL needs as many different
approaches as possible.

Now, if only we could gradient descent to find the optimal approach ;).

~~~
epmaybe
Should I move from theano to tensorflow? I didn't realize that theano was no
longer being developed when I first started playing with keras.

~~~
taneq
Does Theano meet your needs? Then no. Does TensorFlow meet them better, enough
to justify the cost in switching? Then yes. "Actively developed" is a silly
metric. Focus on features, flexibility, robustness etc.

~~~
nl
For neural network libraries this isn't sensible.

For many (most?) users outside of Google and Facebook the most important
feature is "is there an off-the-shelf implementation of new technique XXX or
do I have to build it myself?"

For most users the sensible choice comes down to Keras+Tensorflow or PyTorch.

~~~
kirillseva
Depends on what you do. If you're starting a new project, picking Theano
indeed isn't a very good choice, for the reasons you've mentioned. However, if
you already have a stable piece of software that does what you want it to do,
then migration won't add much value and you could spend that time on something
more important, like improving documentation or having dinner with your family
and friends.

However, it's worth pointing out that Theano's API is fairly similar to
TensorFlow's, so migrating shouldn't be too hard and should be fairly easy to
test.
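
As a rough illustration of why the migration tends to be mechanical (an
assumption-laden sketch, not from the thread: both snippets just compute a
sigmoid of a matrix product), the build-graph-then-run pattern maps across
almost line for line between Theano and TensorFlow 1.x graph mode:

    import numpy as np

    # Theano: declare symbolic inputs, build the graph, compile, then call.
    import theano
    import theano.tensor as T

    x = T.dmatrix('x')
    w = theano.shared(np.random.randn(3, 1), name='w')
    y = T.nnet.sigmoid(T.dot(x, w))
    predict = theano.function([x], y)
    print(predict(np.random.randn(2, 3)))

    # TensorFlow (1.x graph mode): same pattern, with a Session standing in
    # for the compiled function.
    import tensorflow as tf

    x_tf = tf.placeholder(tf.float64, shape=(None, 3))
    w_tf = tf.Variable(np.random.randn(3, 1))
    y_tf = tf.sigmoid(tf.matmul(x_tf, w_tf))
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(y_tf, feed_dict={x_tf: np.random.randn(2, 3)}))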

------
hmm_really
Looks like a genetic algorithm to me, just framed differently than usual
(which is clever): the hyperparameters are the DNA, and the network is the
environment they're evaluated in.

------
bdod6
Seems very similar to using RL to tune the hyperparameters. Surely that means
there are hyperparameters for PBT that need to be set, such as the exploration
vs exploitation tradeoff.

------
deepnotderp
So basically SGD based NEAT?

~~~
YaxelPerez
NEAT just uses GA to generate a network topology and weights. From what I
read, I think this is just a fancy way to parallelize searching for optimal
hyperparameters.

~~~
deepnotderp
Well, okay, what I meant is that it's literally applying evolutionary
algorithms for hyperparameter optimization. Calling it some souped up BS like
"population based training" seems like Deepmind marketing is getting out of
hand...

~~~
nametube
Evolutionary Algorithms are a subset of Population Based heuristics. I don't
think it's "souped up BS" to use the term.

