

Tuning machine learning models - Zephyr314
http://blog.sigopt.com/post/111903668663/tuning-maching-learning-models

======
mjw
Nice post, couple of bits of feedback:

When you talk about "fit" it sounds like you mean fit to the training data,
which would obviously be a bad thing to optimise hyperparameters for. From the
github repo it sounds like you are using a held-out validation set, but maybe
worth being clear about this (e.g. call it something like "predictive
performance on validation set").

When you've optimised over hyper-parameters using a validation set, you need
to hold out a further test set and report results of your optimised
hyperparameter settings on that test set, rather than just report the best
achieved metric on the validation set. Is that what you did here? Maybe worth
a mention.
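
For concreteness, this is the kind of protocol I mean, sketched with
scikit-learn (placeholder dataset, model and hyperparameter grid, nothing to do
with the post or SigOpt's actual method):

    # Tune on a validation set, then report the chosen configuration once
    # on a test set that never influenced the search.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, random_state=0)

    # Split off the test set first, then carve a validation set out of the rest.
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.25, random_state=0)

    best_score, best_params = -1.0, None
    for max_depth in [2, 4, 8, None]:  # stand-in for whatever search the optimiser runs
        model = RandomForestClassifier(max_depth=max_depth, random_state=0)
        model.fit(X_train, y_train)
        # "predictive performance on validation set"
        score = accuracy_score(y_val, model.predict(X_val))
        if score > best_score:
            best_score, best_params = score, {"max_depth": max_depth}

    # Refit with the winning settings and report once on the held-out test set.
    final = RandomForestClassifier(**best_params, random_state=0)
    final.fit(X_trainval, y_trainval)
    print("best validation accuracy:", best_score)
    print("test accuracy of that configuration:",
          accuracy_score(y_test, final.predict(X_test)))

The point is just that the number quoted at the end comes from data the search
never saw.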

A question about sigopt: how do you compare to open-source tools like
hyperopt, spearmint and so on? Do you have proprietary algorithms? Are there
classes of problems which you do better or worse on? Or is it more about the
convenience?

~~~
Houshalter
>When you've optimised over hyper-parameters using a validation set, you need
to hold out a further test set and report results of your optimised
hyperparameter settings on that test set, rather than just report the best
achieved metric on the validation set.

It is possible to overfit hyperparameters. However, that's beyond the scope of
these methods, whose only goal is to find the best settings for the validation
set. So comparing their validation scores is fair, and the test scores could
potentially be misleading.

~~~
mjw
Yeah I was thinking about this after I posted. Not entirely convinced though
-- I want the hyperparameters I learn to generalise to unseen data, just like
plain old parameters. If there are two methods for learning them then I'm
going to pick the one which performs best on unseen data and I'd like a metric
which helps me make that choice.

Sure, you can evaluate them purely as optimisation algorithms, but does it
follow that the better optimisation algorithm is necessarily better at picking
hyperparameters that generalise to unseen data?

One way that hyperparameter optimisation can overfit, which people don't always
think about, is by repeatedly evaluating high-variance metrics and picking the
best of N tries. This has burned me when optimising settings for stochastic
optimisation algorithms, for example. An algorithm that was very aggressive in
doing this might reach a better maximum on the validation set but wouldn't do
any better on held-out data.
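
A toy version of what I mean (just numpy, made-up numbers): every candidate is
actually equally good, but taking the max of N noisy validation measurements
reports a number that a fresh evaluation won't give back.

    import numpy as np

    rng = np.random.default_rng(0)
    true_score = 0.80   # every candidate configuration is actually equally good
    noise = 0.05        # std-dev of the validation metric
    n_tries = 50

    # N noisy validation evaluations; the optimiser keeps the best one.
    val_scores = true_score + noise * rng.standard_normal(n_tries)
    best = val_scores.max()

    # Re-evaluating the "winner" on held-out data is just another noisy draw
    # around the same true score, so the apparent gain disappears.
    heldout = true_score + noise * rng.standard_normal()

    print(f"best of {n_tries} validation scores: {best:.3f}")
    print(f"held-out score of that winner:  {heldout:.3f}")
    print(f"true score of every candidate:  {true_score:.3f}")

With numbers like these the reported maximum sits roughly two standard
deviations above the true score on average, and the held-out evaluation won't
reproduce it.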

There are things you can do to compensate for that, of course (variance
estimates for metrics are a good idea!), but evaluating on a held-out test set
usually doesn't hurt and seems like the safest option.

~~~
Houshalter
>If there are two methods for learning them then I'm going to pick the one
which performs best on unseen data and I'd like a metric which helps me make
that choice.

But both methods will converge to the exact same set of hyperparameters, the
ones that are optimal for the validation set. The only difference is that some
methods are faster.

------
Zephyr314
I'm one of the founders of SigOpt and I am happy to answer any questions about
this post, our methods, or anything about SigOpt. I'll be in this thread all
day.

~~~
pfhayes
Another founder here! If you have any questions about our engineering stack,
I'm happy to answer.

~~~
orsenthil
Can it solve the halting problem?

