
Gradient Descent: The Ultimate Optimizer - tosh
https://arxiv.org/abs/1909.13371
======
samatman
Can’t pass on the opportunity to share one of my favorite (and topical!) paper
titles:

[https://arxiv.org/abs/1606.04474](https://arxiv.org/abs/1606.04474)

------
sdan
> We propose to instead learn the hyperparameters themselves by gradient
> descent, and furthermore to learn the hyper-hyperparameters by gradient
> descent as well, and so on ad infinitum.

Made me laugh. Can't believe this hasn't been done yet. Maybe I should stop
putting ML research on a pedestal (regardless, OpenAI's PPO and TRPO papers
were amazing, not to mention their Learning with Dexterity one).

~~~
gwern
No, it's been done. For example, "Gradient-based Hyperparameter Optimization
through Reversible Learning"
[https://arxiv.org/abs/1502.03492](https://arxiv.org/abs/1502.03492) ,
Maclaurin et al 2015 (one of their cites). The idea is pretty obvious: of
course you'd like to learn the hyperparameters like they were any other
parameter. But that's easier said than done.

The problem is that, well, to backpropagate through a _hyper_ parameter, you
would need to, say, track how it affects every iteration throughout the entire
training run, rather than simply tracking one parameter through a single
iteration on a single datapoint. And it's difficult enough to do gradient
descent on a single hyperparameter, so it hardly helps to start talking about
doing entire stacks! If you can't really do one, doing ad infinitum probably
isn't going to work well either.
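
To make that concrete (a rough sketch in my own notation, not the paper's):
for plain SGD, w_{t+1} = w_t - α·∇f(w_t), the exact gradient of the final
loss L(w_T) with respect to α unrolls over the entire trajectory:

    \frac{dL(w_T)}{d\alpha}
      = -\nabla L(w_T)^\top \sum_{t=0}^{T-1}
        \Big[ \prod_{s=t+1}^{T-1} \big( I - \alpha \nabla^2 f(w_s) \big) \Big]
        \nabla f(w_t)

Every term drags in curvature information from every step of the run, so
memory and compute grow with the length of training; avoiding that blow-up is
the whole point of the reversible-learning machinery in Maclaurin et al.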

If you look at their experiment, they're doing a 1 hidden-layer FC NN on
MNIST. (Honestly, I'm a little surprised that stacking hyperparams works even
on that small a problem.)

~~~
Mehdi2277
Reading the paper: instead of computing the gradient across the whole
training run, they compute the gradient for a hyperparameter after each batch
and update the hyperparameter right then. That keeps the computational cost
pretty light. Stacking gradient descent 50 times on itself only costs about
double normal gradient descent. That's specific to their experiment, but on
bigger models the added cost of computing the hyperparameter derivatives
should become an even smaller fraction.
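
For the simplest case (plain SGD with one extra level that learns the
learning rate), the per-step hypergradient works out to a dot product of
consecutive gradients. A toy NumPy sketch of that one-level idea on a
quadratic (my own example; the paper derives the same per-step quantity via
autodiff and stacks it arbitrarily deep):

    import numpy as np

    # Toy objective: f(theta) = 0.5 * ||theta||^2, so grad f(theta) = theta.
    def grad(theta):
        return theta

    theta = np.array([5.0, -3.0])
    alpha = 0.01   # learning rate, itself learned by gradient descent
    kappa = 1e-3   # step size for updating the learning rate

    prev_grad = np.zeros_like(theta)
    for step in range(100):
        g = grad(theta)
        # Per-step hypergradient: since theta_t = theta_{t-1} - alpha * g_{t-1},
        # d f(theta_t) / d alpha = -g_t . g_{t-1}. Descending that means alpha
        # grows while consecutive gradients keep pointing the same way.
        alpha = alpha + kappa * np.dot(g, prev_grad)
        theta = theta - alpha * g   # not in-place, so g / prev_grad stay intact
        prev_grad = g

    print("theta:", theta, "alpha:", alpha)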

I'm surprised by the lack of any bigger experiments (ImageNet would be nice,
but even CIFAR-10 would help) given how computationally light this was. Also
surprised that an 11-ish-author Stanford paper ran its experiments on one CPU.

------
jupp0r
“Gradient Descent: The Ultimate Optimizer”

Did they get stuck in a local optimum?

------
Straw
Unfortunately, the 1-step optimal learning rate often differs massively from
the long-horizon best choice:

[https://arxiv.org/abs/1803.02021](https://arxiv.org/abs/1803.02021)

This is because larger LRs can result in worse immediate performance but
better progress along low signal-to-noise-ratio directions.

~~~
0-_-0
I suspect that's because a higher LR leads to better exploration of the
optimization surface, i.e. it works as an implicit regularizer. The ideal
solution would be to develop better regularizers to go with the better
optimizers, instead of relying on the noise in the worse optimizer for
implicit regularization.

~~~
Straw
I completely agree; we shouldn't be depending on our optimizers to do
approximate Bayesian inference. An optimizer should only optimize.

However, I think it's a different effect: even purely in terms of optimizing
the training loss, on a quadratic with noisy gradients, the short-horizon
bias effect exists.
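
Quick version of what I mean (my own back-of-the-envelope, roughly the
setting in the linked paper): for a 1-D quadratic f(θ) = hθ²/2 with gradient
noise of variance σ², the greedy 1-step-optimal learning rate is

    \mathbb{E}[f(\theta_{t+1})]
      = \tfrac{h}{2}\left[(1-\alpha h)^2\theta_t^2 + \alpha^2\sigma^2\right]
    \quad\Rightarrow\quad
    \alpha^*_{\text{greedy}} = \frac{h\,\theta_t^2}{h^2\theta_t^2 + \sigma^2}

which collapses toward zero as soon as θ hits the noise floor, even though in
a multi-dimensional problem a larger constant rate would keep making progress
along the low-curvature directions.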

------
axiom92
Interesting!

Relevant work:

1. What's one of the most important hyperparameters? Architecture:
[https://arxiv.org/abs/1806.09055](https://arxiv.org/abs/1806.09055)

2. Dropout:
[https://papers.nips.cc/paper/5032-adaptive-dropout-for-training-deep-neural-networks.pdf](https://papers.nips.cc/paper/5032-adaptive-dropout-for-training-deep-neural-networks.pdf)

3. Learning activation thresholds:
[https://datascience.stackexchange.com/questions/18583/what-is-the-difference-between-leakyrelu-and-prelu](https://datascience.stackexchange.com/questions/18583/what-is-the-difference-between-leakyrelu-and-prelu)

~~~
shoeffner
Regarding 2: I saw a talk yesterday by Alex Hernández-García in which he
showed that dropout and weight decay can be replaced with data augmentation in
many cases. The relevant paper by him and König is
[https://arxiv.org/abs/1806.03852v2](https://arxiv.org/abs/1806.03852v2)

edit: changed to /abs/ instead of /pdf/ link

------
skunkworker
So we could see Adam optimization with hyperparameter tuning by Adam? But
IIRC this is never guaranteed to find the global minimum/maximum, so why
would it in this scenario? Or is this just a way to let your optimizer
self-optimize?

~~~
myrryr
Pretty much. Pick something stupid, but good enough.

What IS interesting is that the problems all end up in the same place after a
few levels of recursion.

------
mcguire
Because gradient descent never gets stuck in a suboptimal solution?

~~~
gHosts
If I understand what they are looking at, they are looking at problems that
are simple in the sense of having no local optima, but complex in the sense
of having very high dimension.

I.e., it's not a problem for the class of problems they are optimizing.

I've been out of that domain for a while... but when I did a lot of
optimizing / fitting in a high-dimensional space with a function that had
lots of local optima, I found the downhill simplex method with simulated
annealing to be the most effective / robust.

~~~
naniwaduni
which is great as long as you choose your problems based on what your tools
are good at

~~~
dodobirdlord
Local minima basically don't exist in high dimensional spaces, so in practice
the tools work for a wide variety of problems.

~~~
gHosts
Definitely not true. Not even close.

~~~
dodobirdlord
Not much of a response. Care to expand here? This is a well-understood
principle in the existing body of research on convergence in high dimensional
spaces.

------
linuxdude314
It is nice to see someone doing this!

This reminds me of using GD for tuning PID controllers (which has been done
for a while).

[https://www.researchgate.net/publication/287359696_PID_controller_tuning_optimization_using_gradient_descent_technique_for_an_electro-hydraulic_servo_system](https://www.researchgate.net/publication/287359696_PID_controller_tuning_optimization_using_gradient_descent_technique_for_an_electro-hydraulic_servo_system)

------
Rainymood
I couldn't find the code anywhere on the web (nicely formatted, easy to use),
so I tried to make the code more easily accessible here [1].

[1] [https://github.com/Rainymood/Gradient-Descent-The-Ultimate-Optimizer](https://github.com/Rainymood/Gradient-Descent-The-Ultimate-Optimizer)

------
bravura
There are some cites to similar recent work in this thread, but there is even
older work: Stochastic Meta-Descent.

This inspired my consulting company / ML Q&A site name, MetaOptimize, and its
motto: optimizing the process of optimizing the process of...

------
codesushi42
But then who tunes the hyperparameters for the gradient optimizer?

~~~
Jaxan
As said in the abstract: another gradient optimizer!

~~~
codesushi42
But then who tunes that optimizer?

~~~
jchook
Voom

------
mikepalmer
11 authors, impressive!

~~~
jchook
One for each gradient descent.

~~~
ur-whale
I thought it was, to quote "and so on, ad infinitum"?

Maybe they chopped the long tail of authors?

------
bigred100
Hopefully this isn’t useful for anything because it’s the dumbest thing I’ve
seen in my life

