
Gradient-Based Hyperparameter Optimization Through Reversible Learning - gwern
http://arxiv.org/abs/1502.03492
======
taliesinb
> The last remaining parameter to SGD is the initial parameter vector.
> Treating this vector as a hyperparameter blurs the distinction between
> learning and meta-learning. In the extreme case where all elementary
> learning rates are set to zero, the training set ceases to matter and the
> meta-learning procedure exactly reduces to elementary learning on the
> validation set. Due to philosophical vertigo, we chose not to optimize the
> initial parameter vector.

Comedy gold.

------
duvenaud
Authors here, feel free to ask us anything!

~~~
zackmorris
First off, thanks for your efforts!

From what I've followed of machine learning, I feel like there are areas that
are not very rigorous, for example where humans have to tune weights and
parameters. Does your work address those places?

Specifically, do you think you can train on enough parameters, and compare
enough results, to begin to estimate how that tuning can be done
automatically?

~~~
duvenaud
Yes, gradient-based hyperparameter optimization is one option for taking the
human out of the loop to some extent. Of course, we can't avoid having to
specify hyper-hyper-parameters!

More generally, automatic hyperparameter tuning (without gradients) is already
a standard part of many researchers' pipelines.

It's also part of good scientific practice, since it makes it harder to bias
the results towards a particular method.
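
A minimal sketch of the idea (not the paper's memory-efficient reversal scheme, and with made-up toy data): differentiate the validation loss through a short, explicitly unrolled gradient-descent run using the authors' autograd library, which yields a hypergradient with respect to the learning rate.

```python
import autograd.numpy as np
import autograd.numpy.random as npr
from autograd import grad

# Toy data: linear regression with a train/validation split.
rng = npr.RandomState(0)
X_train, y_train = rng.randn(50, 3), rng.randn(50)
X_val, y_val = rng.randn(20, 3), rng.randn(20)

def loss(w, X, y):
    return np.mean((np.dot(X, w) - y) ** 2)

training_grad = grad(loss)  # gradient of the training loss w.r.t. the weights

def validation_loss(learn_rate):
    # Unroll a short full-batch gradient-descent run so that the final
    # weights, and hence the validation loss, are a differentiable
    # function of the learning rate.
    w = np.zeros(3)
    for _ in range(20):
        w = w - learn_rate * training_grad(w, X_train, y_train)
    return loss(w, X_val, y_val)

# Hypergradient: d(validation loss) / d(learning rate).
hypergrad = grad(validation_loss)
print(hypergrad(0.1))  # could drive a meta-level gradient step on the learning rate
```

The paper's contribution is making this kind of computation tractable for long training runs by exactly reversing SGD-with-momentum, rather than storing every intermediate weight vector as the naive unrolling above does.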

------
thearn4
Just gave a brief run through the paper, and I'm curious: is reverse-mode
differentiation in this context similar in concept to adjoint-type methods
used in computational fluid dynamics (and increasingly, in design optimization
in other fields)?

Backward propagation of a gradient calculation to allow for extremely high-
dimensional parameter spaces is a trick that has been around for a while in
some circles, but it also seems to have been missed in a lot of other
disciplines. I'm seeing a lot of publications making use of it in the last few
years, and it's pretty exciting to see it used in more places. My hope is that
more scientific and engineering analysis codes will expose derivative
interfaces for use in numerical optimization.

Here are some short papers with background for those interested in the topic
from a physical engineering perspective:

[http://www.piercelab.caltech.edu/assets/papers/ftc00.pdf](http://www.piercelab.caltech.edu/assets/papers/ftc00.pdf)

[http://www.nt.ntnu.no/users/skoge/prost/proceedings/npcw09/A...](http://www.nt.ntnu.no/users/skoge/prost/proceedings/npcw09/A-N-Ringset-R-1.pdf)

~~~
duvenaud
As far as I understand, the adjoint method computes gradients for functions
that obey hard constraints (such as fluid solvers), and the main advantage is
that it avoids differentiating through iterative constraint satisfaction
procedures.

I had some success with naively differentiating through fluid solvers, though.
Here is a fluid field whose initial velocities have been chosen so that it
will end up matching a given image (i.e., blowing a fancy smoke ring):

[https://github.com/HIPS/autograd/blob/master/examples/fluids...](https://github.com/HIPS/autograd/blob/master/examples/fluidsim/animated.gif)

and here's a free-form wing shape in the middle of being optimized to maximize
lift-to-drag ratio:

[https://github.com/HIPS/autograd/blob/master/examples/fluids...](https://github.com/HIPS/autograd/blob/master/examples/fluidsim/wing.png)

I too find it baffling when I see engineers doing gradient-free optimization
of simulated objective functions, although it's not always easy to compute
gradients, especially when using massive Fortran codebases or very large-scale
simulations.
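
For readers wondering what "naively differentiating through" a solver looks like, here is a much smaller sketch in the same spirit as the autograd examples linked above (the grid size, step count, and target profile are invented for illustration): a toy 1-D diffusion simulation is unrolled, and the mismatch with a target field is differentiated with respect to the initial condition.

```python
import autograd.numpy as np
from autograd import grad

n_steps, diffusion = 100, 0.1

def simulate(initial_field):
    field = initial_field
    for _ in range(n_steps):
        # Explicit diffusion step with periodic boundaries.
        field = field + diffusion * (np.roll(field, 1) - 2 * field + np.roll(field, -1))
    return field

# Invented target: we want the final field to look like one period of a sine wave.
target = np.sin(np.linspace(0, 2 * np.pi, 64))

def objective(initial_field):
    return np.sum((simulate(initial_field) - target) ** 2)

# Gradient of the final-state mismatch w.r.t. every initial grid value,
# obtained by reverse-mode differentiation through all 100 steps.
objective_grad = grad(objective)

init = np.zeros(64)
for _ in range(200):  # plain gradient descent on the initial condition
    init = init - 0.05 * objective_grad(init)
```

A real fluid solver adds advection and a pressure projection, but the reverse-mode pattern is the same: one gradient evaluation costs a small constant factor more than the forward simulation, no matter how many initial-condition values are being optimized.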

------
johntb86
Do we now need hypervalidation sets?

~~~
duvenaud
Right, once we start heavily hyper-parameterizing our models, there's the
potential to overfit the hyperparameters. I think most people agree that has
already happened on common datasets such as MNIST.

However, the current method of avoiding hyperparameter overfitting is just to
have very few hyperparameters. This is kind of like avoiding overfitting of
parameters by only having 10 parameters in your model - barbaric!

That being said, for hyper-gradients to really be useful, someone needs to
develop hyperparameter optimization schemes such as BayesOpt that can
condition on gradient information, to allow us to try different hyperparameter
settings in parallel. I know at least one group is working on this, but as far
as I know it's not ready for prime time yet.
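
A minimal sketch of the "hypervalidation" idea raised above, with an invented three-way split and a toy ridge-regression model purely for illustration: the third set is never seen by the hyperparameter search, so hyperparameter overfitting shows up as a gap between the validation error and the hypervalidation error.

```python
import numpy as np

rng = np.random.RandomState(0)
X, y = rng.randn(300, 10), rng.randn(300)

# Three-way split: train fits the parameters, valid tunes the hyperparameter,
# and hypervalid is an untouched final check on the chosen hyperparameter.
X_tr, y_tr = X[:180], y[:180]
X_va, y_va = X[180:240], y[180:240]
X_hv, y_hv = X[240:], y[240:]

def ridge_fit(lam):
    d = X_tr.shape[1]
    return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

# Tune the regularization strength on the validation set only.
lambdas = np.logspace(-3, 3, 50)
best_lam = min(lambdas, key=lambda lam: mse(ridge_fit(lam), X_va, y_va))

# Report the hypervalidation error, not the score the search optimized.
print(best_lam, mse(ridge_fit(best_lam), X_hv, y_hv))
```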

~~~
bmh100
Could overfitting also be addressed by sharing and optimizing hyperparameters
over multiple datasets? For example, could initialization be shared across
multiple handwriting sets?

------
davidwihl
Will this research continue at HIPS or is it all going to Twitter?

