Pretty in-depth article with a well-laid-out explanation of natural gradient descent on a small pre-fit dataset, ending with the conclusion that it's 'too computationally expensive for the machine learning/big data world'.
This is what I struggled with in school. You'd spend a week of class learning some tough stuff only to be told 'this is no longer done, better methods are now used.'
Sometimes the work is needed to allow you to understand why/how the new method is used, but in many cases I didn't find that to be true.
Sometimes the natural gradient turns out to be easier to compute than the gradient, and in those cases it has made it into applications.
Variational Inference: A Review for Statisticians (Blei et al., 2016)
In this case, the gradient in equation (51) has the Fisher information in it, and the natural gradient gets rid of that hard-to-compute term. This has been used in a bunch of applications (section 5.1, p. 23).
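To sketch why (paraphrasing the paper's setup in simplified notation, so take the exact symbols with a grain of salt): the ordinary gradient of the ELBO with respect to the global variational parameter already contains the Fisher information as a factor, and the natural gradient premultiplies by its inverse, so the hard part cancels:

    \nabla_\lambda \mathcal{L} = F(\lambda)\,(\hat\alpha - \lambda),
    \qquad
    \tilde\nabla_\lambda \mathcal{L} = F(\lambda)^{-1}\nabla_\lambda \mathcal{L} = \hat\alpha - \lambda

where \hat\alpha is the expected natural parameter of the complete conditional, so the natural-gradient update never has to form or invert the Fisher matrix at all.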
You could argue that someone pragmatic might have tossed that positive definite term anyway (without knowing about natural gradients), or come up with the same algorithm as a heuristic improvement of existing ones. But the connection is quite neat.
Another example of links to natural gradients helping to make an algorithm cheaper: https://arxiv.org/abs/1712.01038
(Two related papers appeared at ICML 2018.)
> but in many cases I didn't find that to be true.
Sure, not all courses are as good as they could be (nor all lecturers), but the core of what they are trying to teach you is not something as simple as methods. Methods are the easy part, after all.
But not for normal-sized datasets...
I've noticed a shift in the past year where (new?) people are thinking ML is synonymous with NN/Deep Learning models.
To me ML has always encompassed statistical learning techniques, most of which work very well on normal-sized datasets (a thousand to a million rows; definitions differ). Most classical/statistical methods work just fine at this scale, including the method in the linked article.
Lately I've also been thinking: given certain patterns of regularity in data, and outside of certain domains involving images/sound/language, we don't really need large-scale datasets to train our models. Good models can be trained on carefully selected samples without significant loss of fidelity, which opens up the scope of the types of models that can be deployed.
I worked in deterministic optimization, where there are many well-known techniques for speeding up convergence, from line searches to 2nd-order derivatives (Hessians). These are fairly standard techniques. However, I was unfamiliar with stochastic optimization, so on a cursory reading I didn't recognize these concepts until I realized that what people are trying to do is build probabilistic analogues of standard optimization techniques, e.g. using K-L divergences instead of Euclidean norms.
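To make the analogy concrete (this is the standard textbook connection, not anything specific to the article): plain gradient descent is steepest descent when step length is measured with a Euclidean norm, while the natural gradient is steepest descent when closeness between the distributions is measured with a KL divergence, whose second-order expansion is exactly a quadratic form in the Fisher information:

    \mathrm{KL}\big(p_\theta \,\|\, p_{\theta+\delta}\big) \approx \tfrac{1}{2}\,\delta^\top F(\theta)\,\delta,
    \qquad
    \tilde\nabla_\theta L = F(\theta)^{-1}\nabla_\theta L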
In the discussion, the ADAM algorithm was mentioned, which sounds like a type of quasi-Newton method but with much simplified properties, like constraining the elements to the diagonal. That sounds akin to the sort of simplification that Stochastic Gradient Descent (SGD) makes -- very crude but practical at extremely large scale.
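For what it's worth, here's a minimal sketch of the Adam update (the standard published form, with the usual default hyperparameters) that shows what "constraining to the diagonal" means in practice: the second-moment estimate is kept elementwise, so the preconditioner is only ever a per-coordinate scaling, never a full matrix:

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # m, v: running estimates of the 1st and 2nd gradient moments (same shape as theta)
        m = b1 * m + (1 - b1) * grad          # exponential moving average of gradients
        v = b2 * v + (1 - b2) * grad**2       # elementwise 2nd moment -- the "diagonal" part
        m_hat = m / (1 - b1**t)               # bias corrections for the zero-initialized averages
        v_hat = v / (1 - b2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate step, no matrix solve
        return theta, m, v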
It's useful to note that for normal-scale problems (thousands to a million variables), using analytical gradients (of which the natural gradient is a type) and Hessians via automatic differentiation with a Newton-type method provides much faster convergence. SGD/ADAM are essentially a compromise to make things work at extremely large scale.
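As a sketch of that workflow (assuming JAX for automatic differentiation and SciPy's trust-region Newton solver; the objective here is just a stand-in, not anything from the article):

    import jax, jax.numpy as jnp
    import numpy as np
    from scipy.optimize import minimize

    def loss(theta):
        # stand-in smooth objective; in practice this would be e.g. a negative log-likelihood
        return jnp.sum((theta - jnp.arange(theta.shape[0])) ** 2) + jnp.sum(jnp.cos(theta))

    grad = jax.grad(loss)        # exact gradient via automatic differentiation
    hess = jax.hessian(loss)     # exact Hessian via automatic differentiation

    x0 = np.zeros(100)
    res = minimize(
        lambda x: float(loss(jnp.asarray(x))),
        x0,
        jac=lambda x: np.asarray(grad(jnp.asarray(x))),
        hess=lambda x: np.asarray(hess(jnp.asarray(x))),
        method="trust-ncg",      # Newton-type trust-region method that uses the exact Hessian
    )
    print(res.fun, res.nit)

At this scale the Hessian is cheap to form and the solver converges in far fewer iterations than a first-order method would.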