
Optimization Methods for Large-Scale Machine Learning - Bootvis
https://arxiv.org/abs/1606.04838
======
arcanus
Nocedal's text, 'Numerical Optimization', is the standard reference for that field.

As he notes, I've always been surprised that more techniques in ML do not
leverage the Hessian to get quadratic convergence rates.
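
(To be concrete about what I mean by leveraging the Hessian: a Newton step
solves against the Hessian rather than just stepping along the negative
gradient. A toy NumPy sketch on a made-up quadratic, not anything from the
paper:)

    import numpy as np

    def newton_step(grad, hess, x):
        # Solve H p = -g and step, instead of x -= lr * g.
        p = np.linalg.solve(hess(x), -grad(x))
        return x + p

    # Toy objective f(x) = 0.5 x^T A x - b^T x, so grad = A x - b, hess = A.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])
    x = newton_step(lambda x: A @ x - b, lambda x: A, np.zeros(2))
    print(x, np.linalg.solve(A, b))  # identical: Newton solves a quadratic in one step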

Nevertheless, the most interesting tidbit in this text, speaking as a
Computational Scientist, was:

'Much more could be said about this rapidly evolving field. Perhaps most
importantly, we have neither discussed nor analyzed at length the
opportunities offered by parallel and distributed computing'

The scalability of these algorithms, in particular across distributed-memory
systems (e.g. MPI) at extreme scale, will be a critically important question.
I'm very interested in attempting to scale these networks to tens or hundreds
of thousands of processing cores. With heroic-scale systems now often
eclipsing millions of cores, there is quite a bit of room to scale up, if the
algorithms are indeed robust.
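
(Concretely, the data-parallel version of this is just an allreduce over
per-rank gradients every step. A rough mpi4py sketch, with a fake
least-squares shard standing in for a real model and data loader:)

    # run with e.g.: mpirun -n 128 python sgd_allreduce.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, nprocs = comm.Get_rank(), comm.Get_size()

    dim, lr = 1000, 1e-3
    rng = np.random.default_rng(seed=rank)
    w = np.zeros(dim)
    comm.Bcast(w, root=0)  # every rank starts from the same weights

    def local_gradient(w):
        # Placeholder: gradient on this rank's shard of the data.
        X = rng.standard_normal((64, dim))
        y = rng.standard_normal(64)
        return X.T @ (X @ w - y) / 64

    for step in range(100):
        g_local = local_gradient(w)
        g_global = np.empty_like(g_local)
        comm.Allreduce(g_local, g_global, op=MPI.SUM)  # sum grads across ranks
        w -= lr * (g_global / nprocs)                  # average and take a step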

~~~
x0x0
The Hessian is too expensive and too big: a network with 10^7 parameters
already has a Hessian with 10^14 entries, roughly 800 TB in double precision.

L-BFGS is quite common for, e.g., regression without L1 penalties.
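
(A rough sketch of what that looks like with scipy's L-BFGS-B, on made-up
least-squares data standing in for a real regression problem:)

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 20))
    y = X @ rng.standard_normal(20) + 0.1 * rng.standard_normal(500)

    def loss_and_grad(w):
        r = X @ w - y
        return 0.5 * r @ r, X.T @ r  # smooth loss plus its analytic gradient

    res = minimize(loss_and_grad, np.zeros(20), jac=True, method="L-BFGS-B")
    print(res.success, res.fun)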

MPI is not great even at high hundreds of cores; it's too much work to build
in redundancy / retry / restart / clean failure handling. You really need a
framework that helps with this.

~~~
arcanus
> The Hessian is too expensive and too big.

Not necessarily, often derivatives are analytically known in ML.
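
(For instance, for logistic regression the Hessian has a closed form,
X^T S X with S the diagonal matrix of p(1 - p). A tiny NumPy sketch, not
something from the paper:)

    import numpy as np

    def logistic_hessian(X, w):
        # Closed-form Hessian of the logistic-regression log-loss: X^T S X.
        p = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted probabilities
        S = p * (1.0 - p)                   # diagonal weights
        return X.T @ (S[:, None] * X)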

> MPI is not great at even high hundreds of cores

? You realize that Sequoia, which I have run on, has codes that scale across
all of its roughly 1.6 million cores.

~~~
oliwaw
>Not necessarily, often derivatives are analytically known in ML.

The focus here is largely on deep neural networks. In that domain the full
Hessian cannot practically be computed or stored, and SGD (with minor
variants) continues to be the gold standard.
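
(For reference, "SGD with minor variants" means little more than this: a
mini-batch gradient plus momentum. A minimal NumPy sketch on a made-up
least-squares problem, standing in for a real network and loss:)

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((10_000, 100))
    y = X @ rng.standard_normal(100)

    w, v = np.zeros(100), np.zeros(100)
    lr, momentum, batch = 1e-2, 0.9, 64

    for step in range(2_000):
        idx = rng.integers(0, len(X), size=batch)     # sample a mini-batch
        g = X[idx].T @ (X[idx] @ w - y[idx]) / batch  # stochastic gradient
        v = momentum * v - lr * g                     # momentum: the "minor variant"
        w += v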

