Computing Higher Order Derivatives of Matrix and Tensor Expressions [pdf] 95 points by jkam 3 months ago | 20 comments

 https://donhopkins.com/home/catalog/lem/WonderfulPoems.html

 Klapaucius witnessed the first trial run of Trurl's poetry machine, the Electronic Bard. Here are some of the wonderful poems it instantly composed to Klapaucius's specifications:

 A love poem, lyrical, pastoral, and expressed in the language of pure mathematics. Tensor algebra mainly, with a little topology and higher calculus, if need be. But with feeling, you understand, and in the cybernetic spirit.

    Come, let us hasten to a higher plane,
    Where dyads tread the fairy fields of Venn,
    Their indices bedecked from one to n,
    Commingled in an endless Markov chain!

    Come, every frustum longs to be a cone,
    And every vector dreams of matrices.
    Hark to the gentle gradient of the breeze:
    It whispers of a more ergodic zone.

    In Riemann, Hilbert or in Banach space
    Let superscripts and subscripts go their ways.
    Our asymptotes no longer out of phase,
    We shall encounter, counting, face to face.

    I'll grant thee random access to my heart,
    Thou'lt tell me all the constants of thy love;
    And so we two shall all love's lemmas prove,
    And in our bound partition never part.

    For what did Cauchy know, or Christoffel,
    Or Fourier, or any Boole or Euler,
    Wielding their compasses, their pens and rulers,
    Of thy supernal sinusoidal spell?

    Cancel me not -- for what then shall remain?
    Abscissas, some mantissas, modules, modes,
    A root or two, a torus and a node:
    The inverse of my verse, a null domain.

    Ellipse of bliss, converse, O lips divine!
    The product of our scalars is defined!
    Cyberiad draws nigh, and the skew mind
    Cuts capers like a happy haversine.

    I see the eigenvalue in thine eye,
    I hear the tender tensor in thy sigh.
    Bernoulli would have been content to die,
    Had he but known such a squared cosine 2 phi!

 https://en.wikipedia.org/wiki/The_Cyberiad
 This is very neat. That said, the reason these methods haven't received much attention so far is that relatively few people actually need to compute Jacobians or Hessians directly. Often only Hessian-vector products or Jacobian-vector products are required, and these can be computed via more standard autodiff techniques, usually a lot more efficiently than if you were to compute the Hessian or Jacobian directly. Also, for models with lots of parameters, the Jacobian and Hessian are usually impractically large to realise in memory (N^2 in the number of parameters). Nevertheless, the symbolic tensor calculus approach is very appealing to me. For one thing, it could make it a lot easier to see in a more readable symbolic notation what the gradient computations look like in standard backprop, and could perhaps make it easier to implement powerful symbolic optimizations.
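 To make the Hessian-vector product point concrete, here is a minimal NumPy sketch (my own illustration, not from the paper): it uses a finite difference of the gradient in place of the grad-of-grad trick that autodiff frameworks use, but shows the same key property, namely that Hv never requires materializing the n-by-n Hessian.

```python
import numpy as np

def grad_f(x, A):
    # gradient of f(x) = x' A x is (A + A') x
    return (A + A.T) @ x

def hvp(grad, x, v, A, eps=1e-6):
    # H v ~ (grad(x + eps*v) - grad(x)) / eps
    # Only two gradient evaluations; the n-by-n Hessian is never formed.
    return (grad(x + eps * v, A) - grad(x, A)) / eps

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
v = rng.standard_normal(n)

# For this quadratic the Hessian is A + A', so H v is known in closed form.
print(np.allclose(hvp(grad_f, x, v, A), (A + A.T) @ v, atol=1e-4))
```

 Since the gradient here is linear in x, the finite difference is exact up to rounding; for general functions, autodiff gives the exact product without the eps.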
 True, for large-scale problems Hessian-vector products are often the way to go (or completely ignoring second order information). However, first computing a symbolic expression for the Hessian and then taking the product with a vector is still more efficient than using autodiff for Hessian-vector products (this is not in the paper, and the gain is rather small, just a factor of two or so). But true, computing full Hessians or Jacobians is only useful for problems involving up to a few thousand parameters.
 This is a fascinating paper. It demonstrates how rooted deep learning is in classical mathematics, and how libraries like TensorFlow/Keras/PyTorch/etc. have both helped and hurt its progress. A software developer can build and train a neural network in under an hour with these libraries, with little knowledge of the underlying math. We take for granted that the implementation is optimized. Imagine if the built-in Python sorted function took O(n^3) instead of O(n log n). It wouldn't take long for someone to point out a better approach. There's a huge need for people who excel at math and can program and have time to do open-source.
 This seems important, if true. 3X speedup of forward and backward automatic differentiation on GPUs.
 Not 3X. Three orders of magnitude faster. If true, this is indeed an exciting result. Maybe as exciting as Strassen's algorithm was for matrix multiplication: this enables a bunch of powerful optimization methods that were previously thought intractable for modern DNNs.
 As one of the authors: Yes, it is true, 3 orders of magnitude (so about a factor of 1000 on GPUs). But please be careful, this holds only for higher order derivatives (like Hessians, or Jacobians), as stated in the paper and as the title says. For gradients, the speedup is rather limited.
 Can you briefly explain what the approach is? I've briefly skimmed the paper and I feel like I am missing something. Is it autodiff on the component-wise expressions, then somehow figuring out a way to evaluate those expressions using matrices?
 That's what is usually done: autodiff on the component-wise expressions. We don't do that here. Instead, we really compute on the matrix and tensor level and compute derivatives there directly. Let me give you a simple example to illustrate it: consider the function f(x) = x'Ax. Then its Hessian is A + A'. This expression is what we compute. And evaluating this expression is orders of magnitude faster than what TF, PyTorch, etc. do, namely autodiff on the components. Hope this helps; if not, please let me know.
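 A quick NumPy check of that example (my own sketch, not the paper's code): the closed-form Hessian A + A' of f(x) = x'Ax agrees with a brute-force finite-difference Hessian, and evaluating the closed form is a single matrix add rather than n^2 derivative evaluations.

```python
import numpy as np

def f(x, A):
    # quadratic form f(x) = x' A x
    return x @ A @ x

def hessian_symbolic(A):
    # closed-form Hessian of x' A x
    return A + A.T

def hessian_fd(x, A, eps=1e-4):
    # brute-force finite-difference Hessian, entry by entry (O(n^2) work)
    n = x.size
    H = np.zeros((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            a, b = eps * I[i], eps * I[j]
            H[i, j] = (f(x + a + b, A) - f(x + a, A)
                       - f(x + b, A) + f(x, A)) / eps**2
    return H

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)
print(np.allclose(hessian_symbolic(A), hessian_fd(x, A), atol=1e-4))
```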
 Do you think this could end up being implemented in TF and Pytorch?
 Not in its current formulation. It uses a different representation of the tensors. However, a new version/algorithm that will be available in a few months can be used in TF and PyTorch.
 Are you referring to the Ricci notation when you are saying it uses a different representation of the tensors? Do you also plan to add non-differentiable functions like relu?
 Yeah, I am referring to Ricci notation. TF and PyTorch don't use it. The new version already has relus. This version is targeted at standard formulas (it has abs as a non-differentiable function); the next version works for deep learning. Never wanted to work on this, but there is just no way around it.
 Can this be used for faster earth mover distance calculation?
 If you need Hessians or Jacobians for computing the earth mover distance, then a definite yes. Otherwise, I would doubt it (though I do not know exactly).
 Wait, isn’t the Jacobian just a bunch of gradients?
 Would be interesting to explore whether directly evaluating ∇²f(x) for higher-order SGD methods is now feasible for smaller DNNs, and whether this leads to faster convergence during training. Most methods like Newton or Gauss-Newton were thought intractable for DNNs. Also curious whether the descent direction prescribed by quasi-Newton methods is empirically close to the true second-order step, and whether these approximations are useful in wall-clock time.
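 For context, the kind of update that cheap explicit Hessians would enable is the full Newton step below. This is a toy sketch on an invented objective (names and function are hypothetical, not from the paper or any DNN), just to show the solve against an explicit Hessian that quasi-Newton methods avoid.

```python
import numpy as np

# Toy objective: f(w) = 0.5 * w'w + 0.25 * sum(w^4), minimized at w = 0.
def grad(w):
    return w + w**3

def hessian(w):
    # explicit (here diagonal) Hessian: I + diag(3 w^2)
    return np.eye(w.size) + np.diag(3 * w**2)

w = np.array([1.0, -2.0, 0.5])
for _ in range(10):
    # full Newton step: solve H d = g, then w <- w - d
    w = w - np.linalg.solve(hessian(w), grad(w))

print(np.allclose(w, 0.0, atol=1e-8))  # quadratic convergence to the minimum
```

 For real networks the solve itself is the next bottleneck, which is why even a 1000x cheaper Hessian would mainly help small models, as the author notes above.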
 What kind of work are you referring to when you say higher-order SGD may _now_ be feasible for deep learning? I only find results that try to approximate second order information.
 Not sure what you mean. The paper above claims 1000x speedups for computing second-order derivatives. I have not tested their claims, but was speculating that such an improvement, if true, would make computing Hessians for small networks feasible. This is what I am referring to.