
Natural Gradient Descent (2018) - ghosthamlet
https://wiseodd.github.io/techblog/2018/03/14/natural-gradient/
======
carlmcqueen
Not to spoil the article for anyone but..

A fairly in-depth article with a well-laid-out explanation of natural gradient
descent, demonstrated on a small pre-fit dataset, ending with the conclusion
that it's 'too computationally expensive' for the machine learning/big data world.

This is what I struggled with in school. You'd spend a week of class learning
some tough material only to be told 'this is no longer done; better methods are
used now.'

Sometimes the work is needed to allow you to understand why/how the new method
is used, but in many cases I didn't find that to be true.

~~~
imurray
I wouldn't recommend this to someone wanting to use the currently most
practical tools. But I'm glad some people want to know about this stuff.

Sometimes the natural gradient turns out to be easier to compute than the
gradient, and then it has made it into applications.

Variational Inference: A Review for Statisticians (Blei et al., 2016)
[https://arxiv.org/abs/1601.00670](https://arxiv.org/abs/1601.00670)

In this case, the gradient (51) has the Fisher Information in it and the
natural gradient gets rid of that hard-to-compute term. This stuff has been
used in a bunch of applications (section 5.1, p23).
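
To make the cancellation concrete: the natural gradient premultiplies the
ordinary gradient by the inverse Fisher information, so when the ordinary
gradient factors through the Fisher matrix (as it does for exponential-family
variational approximations), that matrix drops out. A rough sketch with
illustrative notation (mine, not copied from the paper):

    \tilde{\nabla}_\lambda \mathcal{L} = F(\lambda)^{-1} \nabla_\lambda \mathcal{L},
    \qquad
    \nabla_\lambda \mathcal{L} = F(\lambda)\, d(\lambda)
    \;\Longrightarrow\;
    \tilde{\nabla}_\lambda \mathcal{L} = d(\lambda)

Here F(\lambda) is the Fisher information of the variational family and
d(\lambda) is the remaining, easy-to-compute factor.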

You could argue that someone pragmatic might have tossed that positive
definite term anyway (without knowing about natural gradients). Or come up
with the same algorithm, as a heuristic improvement of existing algorithms.
But the connection is quite neat.

Another example of links to natural gradients helping to make an algorithm
cheaper: [https://arxiv.org/abs/1712.01038](https://arxiv.org/abs/1712.01038)
(Two related papers appeared at ICML 2018.)

------
dfan
[https://towardsdatascience.com/its-only-natural-an-excessive...](https://towardsdatascience.com/its-only-natural-an-excessively-deep-dive-into-natural-gradient-optimization-75d464b89dbb)
is another nice overview, complementary to this one.

------
xtacy
This series is well written. The speed-up is expected because you're
incorporating second-order information during the optimisation. For more
insight into second-order optimisation methods, take a look at Newton's
method:
[https://en.wikipedia.org/wiki/Newton%27s_method](https://en.wikipedia.org/wiki/Newton%27s_method).
The intuition, derivation, and proofs of correctness and convergence speed are
quite illuminating.
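
For a feel for what second-order information buys you, here's a minimal,
self-contained Newton's-method sketch on a toy separable objective (the
objective is my own choice, purely for illustration):

    import numpy as np

    # Toy objective f(x) = sum(exp(x_i) - x_i), minimised at x = 0.
    def grad(x):
        return np.exp(x) - 1.0        # gradient, per coordinate

    def hess(x):
        return np.diag(np.exp(x))     # Hessian (diagonal because f is separable)

    x = np.array([2.0, -1.5, 0.7])
    for _ in range(10):
        # Newton step: x <- x - H^{-1} g, rescaling the step by local curvature
        x = x - np.linalg.solve(hess(x), grad(x))

    print(x)  # close to the minimiser [0, 0, 0] after a few iterations

Swapping the Hessian for the Fisher information matrix gives the
natural-gradient update, which is why it behaves like a second-order method.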

~~~
wenc
Thanks for this insight.

I worked in deterministic optimization, where there are many well-known
techniques for speeding up convergence, from line searches to second-order
derivatives (Hessians). These are fairly standard techniques. However, I was
unfamiliar with stochastic optimization, so on a cursory reading I didn't
recognize these concepts until I realized that what people are doing here are
probabilistic analogues of standard optimization techniques, e.g. K-L
divergences in place of Euclidean norms.

In the discussion, the ADAM algorithm was mentioned, which sounds like a type
of quasi-Newton method but with much simplified properties, like constraining
the curvature estimate to the diagonal. This sounds akin to the sort of
simplification that Stochastic Gradient Descent (SGD) makes -- very crude but
practical at extremely large scale.

It's useful to note that for normal-scale problems (thousands to a million
variables), using analytical gradients (of which the natural gradient is one
type) and Hessians via automatic differentiation with a Newton-type method
provides much faster convergence. SGD/ADAM are essentially a compromise to make
things work at extremely large scale.
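
For reference, here's a minimal sketch of the standard Adam update (written
from memory), which shows the diagonal scaling: each coordinate of the gradient
is divided by a running estimate of its own second moment, rather than by a
full Hessian or Fisher matrix:

    import numpy as np

    def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        m = b1 * m + (1 - b1) * g       # running mean of the gradient
        v = b2 * v + (1 - b2) * g**2    # running mean of the squared gradient, elementwise
        m_hat = m / (1 - b1**t)         # bias corrections for the zero initialisation
        v_hat = v / (1 - b2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate (diagonal) rescaling
        return theta, m, v

Storing only this diagonal is what keeps the memory and compute cost linear in
the number of parameters.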

