
Computing Higher Order Derivatives of Matrix and Tensor Expressions [pdf] - jkam
http://www.matrixcalculus.org/matrixcalculus.pdf
======
DonHopkins
[https://donhopkins.com/home/catalog/lem/WonderfulPoems.html](https://donhopkins.com/home/catalog/lem/WonderfulPoems.html)

Klapaucius witnessed the first trial run of Trurl's poetry machine, the
Electronic Bard. Here are some of the wonderful poems it instantly composed
to Klapaucius's specifications:

A love poem, lyrical, pastoral, and expressed in the language of pure
mathematics. Tensor algebra mainly, with a little topology and higher
calculus, if need be. But with feeling, you understand, and in the cybernetic
spirit.

    
    
        Come, let us hasten to a higher plane,
        Where dyads tread the fairy fields of Venn,
        Their indices bedecked from one to n,
        Commingled in an endless Markov chain!
        Come, every frustum longs to be a cone,
        And every vector dreams of matrices.
        Hark to the gentle gradient of the breeze:
        It whispers of a more ergodic zone.
    
        In Riemann, Hilbert or in Banach space
        Let superscripts and subscripts go their ways. 
        Our asymptotes no longer out of phase,
        We shall encounter, counting, face to face.
    
        I'll grant thee random access to my heart,
        Thou'lt tell me all the constants of thy love;
        And so we two shall all love's lemmas prove,
        And in our bound partition never part.
    
        For what did Cauchy know, or Christoffel,
        Or Fourier, or any Boole or Euler,
        Wielding their compasses, their pens and rulers, 
        Of thy supernal sinusoidal spell?
    
        Cancel me not -- for what then shall remain?
        Abscissas, some mantissas, modules, modes,
        A root or two, a torus and a node:
        The inverse of my verse, a null domain.
    
        Ellipse of bliss, converse, O lips divine!
        The product of our scalars is defined!
        Cyberiad draws nigh, and the skew mind
        cuts capers like a happy haversine.
    
        I see the eigenvalue in thine eye,
        I hear the tender tensor in thy sigh.
        Bernoulli would have been content to die,
        Had he but known such a squared cosine 2 phi!
    

[https://en.wikipedia.org/wiki/The_Cyberiad](https://en.wikipedia.org/wiki/The_Cyberiad)

------
mjw
This is very neat. That said, the reason these methods haven't received much
attention so far is that relatively few people actually need to compute
Jacobians or Hessians directly.

Often only Hessian-vector products or Jacobian-vector products are required,
and these can be computed via more standard autodiff techniques, usually a lot
more efficiently than computing the Hessian or Jacobian directly.
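For example (my own sketch, not from the paper), a Hessian-vector product only
needs nested reverse-mode sweeps and never materialises the full Hessian; in
JAX it can look like this:

    # Hessian-vector product via reverse-over-reverse autodiff (illustration only).
    import jax
    import jax.numpy as jnp
    
    def f(x):                                   # some scalar objective
        return jnp.sum(jnp.sin(x) * x ** 2)
    
    def hvp(f, x, v):
        # Differentiate the directional derivative grad(f)(x) . v; only vectors
        # of the parameter size are ever stored, never an n x n matrix.
        return jax.grad(lambda x: jnp.vdot(jax.grad(f)(x), v))(x)
    
    x = jnp.arange(4.0)
    v = jnp.ones(4)
    print(hvp(f, x, v))                         # matches jax.hessian(f)(x) @ v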

Also, for models with lots of parameters, the Jacobian and Hessian are usually
impractically large to realise in memory (N^2 entries in the number of
parameters N).

Nevertheless, the symbolic tensor calculus approach is very appealing to me.
For one thing, it could make it a lot easier to see, in a more readable
symbolic notation, what the gradient computations look like in standard
backprop, and it could perhaps make it easier to implement powerful symbolic
optimizations.

~~~
SoerenL
True, for large-scale problems Hessian-vector products are often the way to go
(or ignoring second-order information altogether). However, first computing a
symbolic expression for the Hessian and then taking its product with a vector
is still more efficient than using autodiff for Hessian-vector products. That
comparison is not in the paper, and the gain is rather small (a factor of two
or so). But true, computing full Hessians or Jacobians is only useful for
problems involving up to a few thousand parameters.
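As a toy illustration of the point (a sketch written here, not code from the
paper): for f(x) = xᵀAx the symbolic Hessian is A + Aᵀ, so once that expression
has been derived, each Hessian-vector product is a single mat-vec, whereas an
autodiff HVP repeats two differentiation sweeps per product:

    # Symbolic Hessian once, then cheap products (illustration only).
    import numpy as np
    
    A = np.random.randn(5, 5)
    v = np.random.randn(5)
    
    H = A + A.T          # Hessian of f(x) = x^T A x, obtained symbolically
    print(H @ v)         # each Hessian-vector product is now one mat-vec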

------
aabajian
This is a fascinating paper. It demonstrates how rooted deep learning is in
classical mathematics, and how libraries like TensorFlow/Keras/PyTorch/etc.
have both helped and _hurt_ its progress. A software developer can build and
train a neural network in under an hour with these libraries, with little
knowledge of the underlying math. We take for granted that the implementation
is optimized. Imagine if the built-in Python sorted function took O(n^3)
instead of O(n log n). It wouldn't take long for someone to point out a better
approach. There's a huge need for people who excel at math _and_ can program
_and_ have time to do open-source.

------
riskneutral
This seems important, if true. 3X speedup of forward and backward automatic
differentiation on GPUs.

~~~
bmc7505
Not 3X. Three _orders of magnitude_ faster. If true, this is indeed an
exciting result. Maybe as exciting as Strassen’s algorithm was for matrix
multiplication - this enables a bunch of powerful optimization methods that
were previously thought intractable for modern DNNs.

~~~
SoerenL
As one of the authors: yes, it is true, three orders of magnitude (so about a
factor of 1000 on GPUs). But please be careful, this holds only for
higher-order derivatives (like Hessians or Jacobians), as stated in the paper
and as the title says. For gradients, the speedup is rather limited.

~~~
nl
Can this be used for faster earth mover distance calculation?

~~~
SoerenL
If you need Hessians or Jacobians for computing the earth mover distance, then
a definite yes. Otherwise, I would doubt it (though I do not know for sure).

------
bmc7505
It would be interesting to explore whether directly evaluating ∇²f(x) for
higher-order SGD methods is now feasible for smaller DNNs, and whether this
leads to faster convergence during training. Most methods like Newton or
Gauss-Newton were thought intractable for DNNs. Also curious whether the
descent direction prescribed by quasi-Newton methods is empirically close in
angle to the true second-order step, and whether those approximations are
useful in wall-clock time.
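Something like the following is what I have in mind (pure speculation on my
part, assuming the full Hessian of a tiny model fits in memory): a damped
Newton step built from jax.hessian.

    # Damped Newton updates for a toy least-squares model (speculative sketch).
    import jax
    import jax.numpy as jnp
    
    def loss(w):
        X = jnp.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
        y = jnp.array([1.0, 2.0, 3.0])
        return jnp.mean((X @ w - y) ** 2)
    
    w = jnp.zeros(2)
    for _ in range(5):
        g = jax.grad(loss)(w)
        H = jax.hessian(loss)(w)          # only feasible for small parameter counts
        # Levenberg-style damping keeps the solve well-posed if H is near-singular.
        w = w - jnp.linalg.solve(H + 1e-4 * jnp.eye(2), g)
    
    print(w, loss(w))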

~~~
jkam
What kind of work are you referring to when you say higher-order SGD may _now_
be feasible for deep learning? I only find results that try to approximate
second order information.

~~~
bmc7505
Not sure what you mean. The paper above claims 1000x speedups for computing
second-order derivatives. I have not tested their claims, but was speculating
that such an improvement, if true, would make computing Hessians for small
networks feasible. This is what I am referring to.

------
jkam
What was the gift from Google?

