Hacker News new | past | comments | ask | show | jobs | submit login
Natural Gradient Descent (2018) (wiseodd.github.io)
98 points by ghosthamlet 34 days ago | hide | past | web | favorite | 8 comments

Not to spoil the article for anyone but..

Pretty in depth article and well laid out explanation of natural gradient descent with a small pre-fit dataset for a conclusion of 'too computationally expensive for machine learning/big data world'.

This is what I struggled with in school. You'd spend a class week learning some tough stuff only to be told 'this is no longer done, better methods are now used.'

Sometimes the work is needed to allow you to understand why/how the new method is used, but in many cases I didn't find that to be true.

I wouldn't recommend this to someone wanting to use the currently most practical tools. But I'm glad some people want to know about this stuff.

Sometimes the natural gradient turns out to be easier to compute than the gradient, and then it has made it in to applications.

Variational Inference: A Review for Statisticians (Blei et al., 2016) https://arxiv.org/abs/1601.00670

In this case, the gradient (51) has the Fisher Information in it and the natural gradient gets rid of that hard-to-compute term. This stuff has been used in a bunch of applications (section 5.1, p23).

You could argue that someone pragmatic might have tossed that positive definite term anyway (without knowing about natural gradients). Or come up with the same algorithm, as a heuristic improvement of existing algorithms. But the connection is quite neat.

Another example of links to natural gradients helping to make an algorithm cheaper: https://arxiv.org/abs/1712.01038 (Two related papers appeared at ICML 2018.)

   but in many cases I didn't find that to be true. 
I wouldn't be so sure. Consider the "opposite" approach in some sense, a person who has only done a rapid training online course in current techniques. I give them a very specific task that matches their training well, they will probably do ok. If something unexpected happens though, they will mostly be unable to address it effectively. If a slightly different problem comes up, they won't know how to address that, either. They will have a shallow sense of "how", and very little sense of "why". Worse, they are ill prepared for adapting new techniques; i'd probably be just as well off finding a newer version to hire, who has had more recent training....

Sure, not all courses are as good as they could be (nor all lecturers) but the core of what they are trying to teach you is nothing as simple as methods. Methods are the easy part, after all.

Strong disagree. I started learning machine learning back in 2010, before neural networks really took off. The understanding of where the field was before and why DNNs were useful advancement puts me way ahead of peers who don't understand what problems they're solving. I have an appreciation for the ways that those problems can still creep back into neural network research, despite them apparently being a black box to many researchers.

> 'too computationally expensive for machine learning/big data world'

But not for normal-sized datasets...

I've noticed a shift in the past year where (new?) people are thinking ML is synonymous wth NN/Deep Learning models.

To me ML has always encompassed statistical learning techniques, most of which work very well on normal-sized datasets (thousand to a million rows, definitions differ). Most classical/statistical methods work just fine at this scale, including the method in the linked article.

Lately I've also been thinking: given certain patterns of regularity in data, and outside of certain domains involving images/sound/language, we don't really need large-scale datasets to train our models. Good models can be trained on carefully selected samples without significant loss of fidelity, which opens up the scope of the types of models that can be deployed.

https://towardsdatascience.com/its-only-natural-an-excessive... is another nice overview complementary to this one.

This series is well written. The speed up is expected due because you're incorporating second-order information during the optimisation. For more insight into second order optimisation methods, take a look at Newton's method: https://en.wikipedia.org/wiki/Newton%27s_method. The intuition, derivation, and proof of correctness and convergence speed are quite illuminating.

Thanks for this insight.

I worked in deterministic optimization and there are many well-known techniques for speeding up convergence from line searches to 2nd-order derivatives (Hessians). These are fairly standard techniques. However I was unfamiliar with stochastic optimization, so on cursory reading, I didn't recognize these concepts until I realize what people are trying to do were probabilistic analogues of standard optimization techniques, e.g. instead of Euclidean norms, K-L divergences are used.

In the discussion, the ADAM algorithm was mentioned, which sounds like a type of quasi-Newton method but with much simplified properties like constraining the elements to the diagonal. This sounds to me like the akin to the sort of simplification that Stochastic Gradient Descent (SGD) takes -- very crude but practical at extremely large scale.

It's useful to note that for normal scale problems (thousands to a million variables), using analytical gradients (of which the natural gradient is a type of) and Hessians via automatic differentiation with a Newton-type method provide much faster convergence. SGD/ADAM are essentially a compromise to make things work at extremely large scale.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact