
A Practical Guide to Hyperparameter Optimization - ole_gooner
https://blog.nanonets.com/hyperparameter-optimization/
======
iyaja
Hi everyone. I'm the author. There's one more thing I wanted to add: a good
reason you should try using some sort of hyperparameter search, even if you
think it's a complete waste of time and compute, is reproducibility.

This probably applies more to open-source academic contributions, where you're
trying to help your fellow practitioners recreate and use your models, as
opposed to a corporate setting, where reproducibility would be the equivalent
of getting fired.

Recently, I was trying to train a ResNet to beat the top Stanford DAWNBench
entry (spoiler alert: I did, but by less than a second). Initially, I blindly
tried manually tuning the learning rate, batch size, etc. without even reading
the original model's guidelines.

After actually going through a blog post written by David C. Page (the guy
with the top DAWNBench entry), I saw that he had tried varying the
hyperparameters himself, and that the defaults set in the code were what he
had found to be optimal.

That saved me a lot of time and let me focus on other things like what
hardware to use.

I think the lesson here is that if more researchers perform and publish the
results of some basic hyperparameter optimization, it would really save the
world a whole lot of epochs.

~~~
d__k
> The heavier the ball, the quicker it falls. But if it’s too heavy, it can
> get stuck or overshoot the target.

This explanation of momentum is somewhere between misleading and wrong.
Momentum is about inertia and acceleration, i.e., the _ability_ to quickly
change speed.
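
For intuition, here's a minimal sketch of the heavy-ball momentum update on a
toy 1-D quadratic (all names and constants are illustrative, not from the
article):

    # f(w) = w**2; momentum accumulates past gradients as "velocity" (inertia)
    lr, momentum = 0.1, 0.9
    w, v = 5.0, 0.0
    for _ in range(50):
        grad = 2 * w                  # gradient of f at the current w
        v = momentum * v - lr * grad  # velocity builds up -> faster progress
        w = w + v                     # too much momentum -> overshoot/oscillation
    print(w)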

~~~
polynomial
> The heavier the ball, the quicker it falls

Wasn't there a famous experiment about this someone once did?

~~~
wongarsu
A misleading experiment: "the heavier the feather, the quicker it falls" is
obviously true; steel feathers are useless. The same is true for balls (except
in the exceptional case of a perfect vacuum); it's just that drag and air
currents don't influence balls all that much at low speeds.

------
dfan
I realize this is classic old-man-yells-at-cloud, but I don't understand why
every online article these days, even the technical ones, needs to have a
giant "amusing" gif every two paragraphs. Do people not pay attention
otherwise?

~~~
andbberger
I noped out of there after seeing those and the Terminator reference in the
first paragraph. Maybe I'm not the target audience...

Seems like the vast majority of the DL articles that make it to the front of
HN are just fluff. Nothing for DL practitioners, just 'hey look I can import
tensorflow'.

~~~
polynomial
> Nothing for DL practitioners, just 'hey look I can import tensorflow'.

Which is literally the ironic reference in the first image.

------
ArtWomb
More essential background on Bayesian optimization in AutoML's HyperTune

[https://cloud.google.com/blog/products/gcp/hyperparameter-tuning-cloud-machine-learning-engine-using-bayesian-optimization](https://cloud.google.com/blog/products/gcp/hyperparameter-tuning-cloud-machine-learning-engine-using-bayesian-optimization)

~~~
Zephyr314
There are also several papers and blog posts diving into details and tradeoffs
of different Bayesian optimization approaches and components here [0].
Example: Covariance Kernels for Avoiding Boundaries [1]

[0]: [https://sigopt.com/research/](https://sigopt.com/research/)

[1]: [https://sigopt.com/blog/covariance-kernels-for-avoiding-boundaries/](https://sigopt.com/blog/covariance-kernels-for-avoiding-boundaries/)

------
mccourt
I always appreciate articles emphasizing the importance of hyperparameter
optimization; thank you for writing this. The discussion of learning rate is a
nice additional point to mention, though I find it a bit misleading -- earlier
in the discussion you mention a number of hyperparameters, but then the
learning rate is studied in a vacuum. If other hyperparameters were varied
along with the learning rate, I assume those graphics would look much more
complicated.

Additionally, practical circumstances for hyperparameter tuning using Bayesian
optimization often include complications: dealing with discrete
hyperparameters, large parameter spaces being unreasonably costly or poorly
modeled, accounting for uncertainty in your metric, balancing competing
metrics, and black-box constraints. Obviously, one cannot mention everything
in a blog post; I just wanted to bring up that outstanding researchers in
Bayesian optimization are pushing forward on all of these topics.

Regardless, thank you for continuing to hammer home the value of
hyperparameter optimization. If I may, a couple links, for anyone trying to
learn more:

My favorite BO intro -
[https://arxiv.org/abs/1807.02811](https://arxiv.org/abs/1807.02811)

AutoML from the Freiburg crew -
[http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning](http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning)

Some discussion on parallelism/high dimensions -
[https://bayesopt.github.io/papers/2017/3.pdf](https://bayesopt.github.io/papers/2017/3.pdf)

Strategies for warm starting -
[https://ml.informatik.uni-freiburg.de/papers/18-AUTOML-RGPE.pdf](https://ml.informatik.uni-freiburg.de/papers/18-AUTOML-RGPE.pdf)

------
wongarsu
I'm using NNI[1] with decent success for hyperparameter optimization. It
implements a number of different approaches, from a simple random search to a
Tree-structured Parzen Estimator (TPE) and specialized algorithms for
automatically designing networks.

It's very powerful and gives you a lot of freedom (it can minimize/maximize
the output of essentially any Python program). The main drawback is that you
are on your own to figure out which parameters go well together: for example,
using an assessor to stop underperforming attempts early is great for random
search, but devastating for TPE. You inevitably spend some time tuning your
hyperparameter tuner. It's still a big win in terms of human effort, at the
expense of doing a lot more computing.

1: [https://github.com/Microsoft/nni](https://github.com/Microsoft/nni)
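
For context, a rough sketch of what an NNI trial script looks like (assuming a
search space that defines "lr"; the toy objective here is made up, and a real
script would train a model instead):

    import nni

    def train_and_eval(lr):
        # stand-in for a real training run; returns a validation score
        return 1.0 / (1.0 + abs(lr - 0.01))

    params = nni.get_next_parameter()  # hyperparameters suggested by the tuner (e.g. TPE)
    score = train_and_eval(params["lr"])
    nni.report_final_result(score)     # the value the tuner tries to optimize

The experiment config then points NNI at this script and picks the
tuner/assessor combination.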

~~~
MasterScrat
I'm used to hyperopt; do you know how they compare?

~~~
Zephyr314
hyperopt also uses TPE [0]; NNI's implementation may be a variant/fork of that.

[0]:
[http://hyperopt.github.io/hyperopt/](http://hyperopt.github.io/hyperopt/)
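
For comparison, a minimal hyperopt/TPE sketch (the toy objective is made up
for illustration; a real one would train a model and return its validation
loss):

    from hyperopt import fmin, tpe, hp

    def objective(params):
        return (params["lr"] - 0.01) ** 2  # stand-in for a validation loss

    space = {"lr": hp.loguniform("lr", -10, 0)}  # log-uniform over the learning rate
    best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
    print(best)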

------
improbable22
Is there a good reason not to regard this as a standard few-parameter no-
gradient optimisation problem, and use something like Nelder-Mead on it?
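
For concreteness, a minimal sketch of that idea using SciPy's Nelder-Mead,
treating one training run as a black-box function of its hyperparameters (the
toy loss below is made up):

    from scipy.optimize import minimize

    def val_loss(x):
        lr, wd = x
        return (lr - 0.01) ** 2 + (wd - 1e-4) ** 2  # stand-in for a real validation loss

    res = minimize(val_loss, x0=[0.1, 1e-3], method="Nelder-Mead")
    print(res.x)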

~~~
bigred100
I think many people (including the DFO community) already do that. People
also consider the notion of multiple objectives important here, I believe.

~~~
improbable22
Thanks. What's DFO? And what do you usefully do with multiple objectives,
besides minimise some total?

~~~
jcagalawan
DFO is derivative-free optimization. With multiple objectives, you try to
find different solutions by weighting the objectives differently, trace out
the Pareto front, and pick one depending on the domain.

------
platz
I enjoy the f-you tone of this article

------
stunt
It requires a massive amount of computing power; otherwise, in theory, you
should be able to explore different optimizations automatically. Even then,
validation is still hard and time-consuming.

~~~
jjn2009
It sounds like an easy way to increase performance, but exploring the
hyperparameter space is likely done more efficiently manually at first, and
only automatically once you have figured out how to distribute the work.

~~~
manneshiva
Manual searching is time-consuming since you need to wait for the results of
each experiment. This becomes impossible when the number of hyperparameters is
more than 8-10, and you will probably end up tuning only the few that you
think are relevant. You'd also need a lot of experience in tuning
hyperparameters; otherwise, your tuning is as good as random.

Given these disadvantages of manual tuning, "Bayesian Optimization" seems like
the most promising technique: it needs far fewer "choose->train->eval" loops,
as it uses the information from previous runs to select the next set of
hyperparameters (similar to what humans would do).

~~~
dual_basis
Does it work in parallel though?

~~~
manneshiva
Sure, it does, though it's not trivial and tedious to implement yourself. You
could use Python libraries such as "scikit-optimize", which has an
implementation of parallel Bayesian optimization (based on Gaussian
processes); have a look at this:
[https://scikit-optimize.github.io/notebooks/bayesian-optimization.html](https://scikit-optimize.github.io/notebooks/bayesian-optimization.html)
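
A rough sketch of the ask-and-tell pattern scikit-optimize supports for
batch/parallel evaluation (the toy objective is made up; in practice each
candidate would be a training run on a separate worker):

    from skopt import Optimizer
    from skopt.space import Real

    def val_loss(params):
        (lr,) = params
        return (lr - 0.01) ** 2  # stand-in for a real validation loss

    opt = Optimizer([Real(1e-4, 1e-1, prior="log-uniform")], base_estimator="GP")
    for _ in range(5):
        candidates = opt.ask(n_points=4)            # a batch of suggestions
        losses = [val_loss(c) for c in candidates]  # evaluate in parallel in practice
        opt.tell(candidates, losses)                # update the GP surrogate

    print(min(opt.yi))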

