I’ll add some things that relate to this perspective: there is a follow-up paper to the "Deep Learning via Hessian-free Optimization" paper by James Martens that develops a variant of AD which computes a special curvature quantity useful for efficient second-order optimization. This emphasizes that AD is just one of a family of program transformations, most of which are probably waiting to be discovered (a minimal example of such a transformation is sketched below). For example, would it be useful to have an AD-like pass that computed trust regions for gradient updates? And: why don't we do more iteration during DL in which we interleave forward and backward passes for a single layer only? (Rhetorical question; there are many reasons, but asking it forces you to think deeply about a lot of things.)
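To make the "family of transformations" point concrete, here is a minimal JAX sketch (my own toy illustration, not code from either paper) of the Hessian-vector-product trick that underlies Hessian-free optimization: composing forward-mode over reverse-mode AD yields a new transformation that extracts curvature information without ever materializing the Hessian.

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # Toy least-squares loss; stands in for any differentiable program.
    pred = jnp.tanh(x @ w)
    return jnp.mean((pred - y) ** 2)

def hvp(f, w, v):
    # Forward-over-reverse AD (Pearlmutter's trick): differentiate the
    # gradient of f along direction v, giving H @ v without forming H.
    return jax.jvp(jax.grad(f), (w,), (v,))[1]

kx, ky = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(kx, (32, 4))  # made-up shapes, for illustration
y = jax.random.normal(ky, (32,))
w = jnp.zeros(4)
v = jnp.ones(4)

f = lambda w: loss(w, x, y)
print(hvp(f, w, v))  # curvature along v, at a couple-of-gradients cost
```

The point is that `hvp` is neither plain forward nor plain reverse mode: it's a new transformation built by composing the two, and there's no reason to think it's the last useful member of that family.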
Also, if you generalize VAEs to arbitrary differentiable programs, you immediately run into versions of message passing. For example, vector-quantized VAEs are precisely a soft version of the EM step in k-means (a toy comparison is sketched below). So how would we develop more advanced variational programs in which this process is better managed? There may be very different forms for the inference and generation phases that are not related by a program transformation, or ARE related, but only in a high-level way best captured by functional programming primitives.
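To see the k-means connection concretely, here is a toy JAX sketch (my own illustration with made-up shapes, not code from the VQ-VAE paper): the quantization step is the k-means assignment step, and relaxing the argmin into a softmax over distances gives a soft version.

```python
import jax
import jax.numpy as jnp

def assignments(z, codebook, temperature=None):
    # z: (batch, d) encoder outputs; codebook: (K, d) code vectors.
    # Squared distance from each output to every code, as in k-means.
    d2 = jnp.sum((z[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    if temperature is None:
        # Hard nearest-code assignment: the VQ step / k-means E-step.
        return jnp.argmin(d2, axis=-1)
    # Soft responsibilities: a relaxed, differentiable assignment.
    return jax.nn.softmax(-d2 / temperature, axis=-1)

kz, kc = jax.random.split(jax.random.PRNGKey(0))
z = jax.random.normal(kz, (8, 16))
codebook = jax.random.normal(kc, (4, 16))

print(assignments(z, codebook))             # hard indices, shape (8,)
print(assignments(z, codebook, 0.5).shape)  # responsibilities, (8, 4)
```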
Any other people nerding out about this stuff, please say hi!
Deep Learning via Hessian-free Optimization: http://www.cs.toronto.edu/~jmartens/docs/Deep_HessianFree.pd...
Optimizing Neural Networks with Kronecker-factored Approximate Curvature: http://arxiv.org/abs/1503.05671
James Martens' list of publications with links to sample code for the above two papers, slides/condensed conference versions, etc: http://www.cs.toronto.edu/~jmartens/research.html
Pretty neat stuff
To answer my own question (about trust regions), yes: http://papers.nips.cc/paper/7112-scalable-trust-region-metho...
Meh, this is not entirely accurate. Sure, it doesn't expand the program into analytic functions whose derivatives are easy to compute, but it still handles the program symbolically. So, in a way, it still transforms the program into a known set of mathematical primitives, from which it can then construct a program that computes the derivative at compile time (see the toy sketch below).
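Here's a toy Python sketch of that idea (purely illustrative; no real AD system is this naive): once the program is an expression tree over a known set of primitives with known derivative rules, producing the derivative is an ordinary compile-time tree transformation.

```python
from dataclasses import dataclass

# A tiny expression language over a known set of primitives.
@dataclass
class Var:
    name: str

@dataclass
class Const:
    value: float

@dataclass
class Add:
    left: object
    right: object

@dataclass
class Mul:
    left: object
    right: object

def deriv(e, x):
    # Rewrite expression e into a new expression that computes de/dx.
    # A purely compile-time tree transformation: no values flow yet.
    if isinstance(e, Var):
        return Const(1.0 if e.name == x else 0.0)
    if isinstance(e, Const):
        return Const(0.0)
    if isinstance(e, Add):
        return Add(deriv(e.left, x), deriv(e.right, x))
    if isinstance(e, Mul):  # product rule
        return Add(Mul(deriv(e.left, x), e.right),
                   Mul(e.left, deriv(e.right, x)))
    raise TypeError(f"unknown primitive: {e}")

# d/dx (x*x + 3) yields another program, not a number:
print(deriv(Add(Mul(Var("x"), Var("x")), Const(3.0)), "x"))
```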
As a side note, when modern compilers optimize abstract syntax trees, I'm pretty sure they perform operations similar to the message-passing algorithm described. And they work great, albeit for specialized purposes.
In any case, compare and contrast with "Compiling to categories" http://conal.net/papers/compiling-to-categories/