
Optimization in the context of machine learning - bourbakis
https://blog.algorithmia.com/introduction-to-optimizers/
======
blt
> _To avoid getting stuck in local minima, we make sure we use the proper
> learning rate_

This sentence is extremely misleading. It implies that gradient descent always
reaches a global minimum if the "proper" learning rate is used. Guarantees of
that form are only available for restricted families of functions, such as
Lipschitz-smooth convex functions.
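A minimal sketch with a made-up one-dimensional function (my example, not from the article): no choice of learning rate rescues gradient descent here, because which minimum it finds is decided by the starting point, not the step size.

```python
# f has a global minimum near x = -1.03 and a local minimum near x = +0.96.
def f(x):
    return (x**2 - 1)**2 + 0.3 * x

def grad(x):
    return 4 * x * (x**2 - 1) + 0.3

x = 2.0            # start in the basin of the *local* minimum
lr = 0.01          # a perfectly reasonable, stable learning rate
for _ in range(2000):
    x -= lr * grad(x)

print(x)           # converges to ~0.96, the local minimum, not -1.03
```

Tuning `lr` up or down only changes the speed of convergence (or diverges); it never moves the iterate into the other basin.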

~~~
rq1
And you don’t necessarily want a global minimum of a universal
approximator...

------
taeric
This is... Misnamed? Optimization solvers are a different topic than what has
grown to dominate machine learning. Isn't it?

~~~
mkl
Yes, it's talking about optimisation in the context of machine learning, which
is a narrow slice of optimisation as a field.

~~~
dang
I've pinched that phrase to make a less misleading title. Hope that's ok!

~~~
mkl
Of course, no problem!

------
david_draco
I am surprised the SGD extensions of
[https://arxiv.org/abs/1704.04289](https://arxiv.org/abs/1704.04289) have not
been picked up yet (via
[https://www.inference.vc/everything-that-works-works-because-its-bayesian-2/](https://www.inference.vc/everything-that-works-works-because-its-bayesian-2/)).

~~~
moultano
It's interesting work, but I don't think it's relevant to real world models,
because real world models violate the conditions they assume at the outset.
Namely, real models are not run until anywhere near convergence on the
training set. Everyone uses early stopping to prevent them from overfitting,
which they happily do if left to train forever.
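The early-stopping loop being described is roughly this (a hypothetical training loop; `step` and `val_loss` are stand-ins for whatever your framework provides):

```python
# Stop when the *validation* loss has not improved for `patience`
# consecutive checks - long before the training loss would converge.
def train_with_early_stopping(step, val_loss, max_steps=10_000, patience=5):
    best, bad_checks = float("inf"), 0
    for t in range(max_steps):
        step()                    # one optimisation step on the training set
        loss = val_loss()         # held-out loss, *not* training loss
        if loss < best - 1e-6:
            best, bad_checks = loss, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                return t          # stop: validation loss has plateaued
    return max_steps
```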

~~~
sgt101
Has there been any progress in the art around these issues? The idea of
"optimise until it stops making sense to optimise" is terribly unsatisfying. I
think it means "don't optimise this, but move to another state in a way that
is approximately optimisation" or "optimise to this special state". Is there
art to define "this special state", i.e. the generalising classifier? There
wasn't anything (for sure) in 1997, although we thought that there were two
types of risk minimisation - structural risk (size of classifier) and
empirical risk (error on datasets) - but no one had an idea of how to use these
ideas well. I did some work where I pruned optimised ensembles using GAs to
get better generalisation vs direct training, but everyone hated it because it
was inefficient.

Then people looked at deep networks and the structural risk minimisation stuff
went out of the window I guess.

~~~
beagle3
In simpler settings, there's a variety of criteria such as Akaike's Information
Criterion (AIC) and the Bayesian Information Criterion (BIC). Neither applies
directly to NN optimization because they amount to "every additional parameter
costs _this_ much".

The more philosophical Minimal Description Length (MDL) principle and friends
might yield something applicable for NN training, but hasn't yet, AFAIK.
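Concretely, the "every additional parameter costs this much" accounting looks like this (illustrative numbers; the standard AIC/BIC formulas over log-likelihood `L`, parameter count `k`, and sample size `n`):

```python
import math

def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood            # fixed cost of 2 per parameter

def bic(log_likelihood, k, n):
    return k * math.log(n) - 2 * log_likelihood  # per-parameter cost grows with n

# Made-up example: a 3-parameter model vs a 10-parameter model with a
# slightly better fit, on n = 1000 observations.
small = aic(-520.0, 3)   # = 1046.0
big   = aic(-518.0, 10)  # = 1056.0 -> AIC prefers the smaller model
```

The per-parameter penalty is a flat tariff, which is exactly why it transfers poorly to NNs, where parameters are wildly unequal in what they buy.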

------
mlevental
Something I've never understood: all of the ML frameworks (TF, PyTorch, etc.)
are seemingly pretty good at optimizing at least a particular set of cost
functions (convex quadratic? though I know that with activation functions ML
cost functions aren't really convex quadratic), so why aren't there general
optimization libraries built on top of them? Like, why are GAMS and AMPL still
a thing? Is it simply because they're more general? And as a corollary,
how good are the optimizers baked into the ML frameworks relative to tools
like GAMS and AMPL?

~~~
tmyklebu
It might be instructive for you to try solving the linear programs in
netlib[0] with stochastic gradient descent. ;)

The short version of the story is that different optimisation problems have
different kinds of structure, and the user will want solutions of different
quality and have different amounts of patience depending on the context.
Practical optimisation is largely about finding ways to exploit the structure
of a problem to get solutions of some quality in some amount of time. No known
optimisation algorithm dominates all others.

In many machine learning contexts, you want a low-precision solution to a
problem with an objective that's expensive to compute but can be reasonably
approximated by random sampling. Often, when you use a linear program solver,
you want a solution to a linearly-constrained problem where the geometry of
the constraints may not be "easy" and any violation renders a solution
completely useless.

Good LP solvers can solve "typical" well-formulated LPs with hundreds of
thousands of variables and constraints in a matter of seconds on modern
hardware[1]. These algorithms _can't_ solve many of the problems you're going
to feed into your deep learning framework at all. Conversely, your deep
learning framework will have a hell of a time solving LPs.

As for why GAMS and AMPL are a thing, they're interfaces to solvers, not
solvers themselves. You use them because they can handle pulling the data out
of the database, translating your GAMS/AMPL program into a formulation suited
to the solver you choose, running the solver, and handing back the results.
It's straightforward but tedious to do this all on a case-by-case basis; GAMS
and AMPL automate much of the uninteresting work.

[0] [http://www.netlib.org/lp/](http://www.netlib.org/lp/) [1]
[http://plato.asu.edu/ftp/lpcom.html](http://plato.asu.edu/ftp/lpcom.html)
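To make the contrast concrete, here's a toy LP solved with a proper simplex/interior-point solver (scipy's `linprog` as a stand-in for a commercial solver): the answer lands exactly on a vertex and exactly satisfies the constraints, which an SGD-style iteration has no natural way to guarantee.

```python
from scipy.optimize import linprog

# minimise  -x - 2y   subject to  x + y <= 4,  x <= 3,  x >= 0, y >= 0
res = linprog(c=[-1, -2],
              A_ub=[[1, 1], [1, 0]], b_ub=[4, 3],
              bounds=[(0, None), (0, None)],
              method="highs")

print(res.x)    # optimum at a vertex of the feasible polytope: (0, 4)
print(res.fun)  # objective value -8
```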

------
david_draco
For a bit of a wider view, see for example the OpenOpt problem tree:
[https://web.archive.org/web/20150206060441/http://openopt.or...](https://web.archive.org/web/20150206060441/http://openopt.org:80/Problems)
Probably most interesting is "Non-Linear Problems (NLP)" there, and if you
click in, you can see algorithms.

------
DrNuke
It always depends on your problems, your targets and your metrics. For many
industrial applications the most critical aspect is false positives, so
extreme / redundant machine learning optimisation becomes cheaper for the
business in the long term.

------
aldanor
Wanted to share a wonderful addition to scipy I’ve recently discovered for
myself (since the topic kind of fits):

[https://docs.scipy.org/doc/scipy/reference/optimize.minimize-trustconstr.html](https://docs.scipy.org/doc/scipy/reference/optimize.minimize-trustconstr.html)

This was added in 1.0+, along with a few other trust-* methods; it's basically
a quite modern multivariate constrained optimization method, and it deals
pretty well with the noisy functions you typically encounter in ML.
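A quick usage sketch (my example, the classic Rosenbrock function constrained to the unit disc):

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

# minimise the Rosenbrock function subject to staying inside the unit disc
rosen = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
disc  = NonlinearConstraint(lambda x: x[0]**2 + x[1]**2, -np.inf, 1.0)

res = minimize(rosen, x0=[0.0, 0.0], method="trust-constr", constraints=[disc])
print(res.x)  # constrained optimum on the disc boundary, near (0.79, 0.62)
```

The unconstrained minimum (1, 1) lies outside the disc, so the solver correctly lands on the boundary.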

~~~
aldanor
Ooook thanks for downvotes then..

------
enthdegree
newton's method?

~~~
mkl
Newton's method finds zeros of functions, which is different from optimisation
(finding maxima or minima). If you use Newton's method on the derivative of a
function you're trying to optimise, it will find critical points, which may be
the optima you're looking for (or inflection points, saddles, etc.).
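For example (my toy function, f(x) = x^4 - 2x^2): running Newton's method on f' converges to whichever critical point the starting guess is near, minimum or not.

```python
# Newton's method for root finding, applied to g = f'.
def newton(g, dg, x, steps=50):
    for _ in range(steps):
        x -= g(x) / dg(x)
    return x

g  = lambda x: 4 * x**3 - 4 * x   # f'(x)  for f(x) = x^4 - 2x^2
dg = lambda x: 12 * x**2 - 4      # f''(x)

print(newton(g, dg, 1.5))   # -> 1.0, a minimum of f (f'' > 0 there)
print(newton(g, dg, 0.1))   # -> 0.0, also a critical point, but a local *maximum* of f
```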

~~~
snovv_crash
This is why you _actually_ do something like Levenberg-Marquardt, which
combines gradient descent with a quadratic (Gauss-Newton) step via a trust
region, blending between the two depending on whether the cost function is
decreasing or not. And of course it is iterative, so a single bad step doesn't
destroy everything. The advantage comes from faster convergence in practice.
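A bare-bones sketch of that damping idea (my own simplified version; real implementations like Ceres add much more machinery): small `lam` behaves like Gauss-Newton, large `lam` like a short gradient step, and `lam` is adjusted based on whether the step reduced the cost.

```python
import numpy as np

def lm(residual, jac, x, lam=1e-3, steps=200):
    for _ in range(steps):
        r, J = residual(x), jac(x)
        A = J.T @ J + lam * np.eye(len(x))      # damped normal equations
        step = np.linalg.solve(A, J.T @ r)
        if np.sum(residual(x - step)**2) < np.sum(r**2):
            x, lam = x - step, lam * 0.5        # step reduced cost: trust more
        else:
            lam *= 2.0                          # step failed: trust less
    return x

# Toy problem: fit y = a * exp(b * t) to noiseless data with a=2, b=-1
t = np.linspace(0, 2, 20)
y = 2.0 * np.exp(-1.0 * t)
residual = lambda p: p[0] * np.exp(p[1] * t) - y
jac = lambda p: np.stack([np.exp(p[1] * t), p[0] * t * np.exp(p[1] * t)], axis=1)

p = lm(residual, jac, np.array([1.0, 0.0]))
print(p)  # recovers approximately [2, -1]
```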

~~~
mi_lk
Do you know of any good literature on LM or trust region? I'm interested in
the details.

~~~
snovv_crash
I strongly recommend reading the documentation for Ceres Solver [1]. It is
intuitive and in-depth, and references everything you will need.

1\. [http://ceres-solver.org/nnls_solving.html#trust-region-methods](http://ceres-solver.org/nnls_solving.html#trust-region-methods)

