
From automatic differentiation to message passing [video] - adamnemecek
https://www.youtube.com/watch?v=NkJNcEed2NU
======
taliesinb
This is a perspective I’ve been taking recently as I’ve been investigating
next-gen DL frameworks like Julia’s Zygote and Swift for TensorFlow. So it’s
very nice to see this articulated so well here by someone who has been
thinking deeply about it for so long.

I’ll add some things that relate to this perspective: there is a follow-up
paper to the DL via Hessian-free optimization paper by James Martens that
develops a variant of AD which calculates a special curvature quantity that
is useful for efficient second-order optimization. This emphasizes that AD is
just one of a family of program transformations, most of which are probably
waiting to be discovered. For example, would it be useful to have an AD-like
pass that calculated trust regions for gradient updates? And: why do we not
do more iteration during DL in which we interleave forward and backward
passes for a single layer only? (Rhetorical question; there are many reasons,
but asking it forces you to think deeply about a lot of things.)
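
To make that "curvature quantity" idea concrete, here is roughly what a
Hessian-vector product looks like when you compose a forward-mode pass over a
reverse-mode pass (a minimal JAX sketch of Pearlmutter's trick, not the
transformation from Martens' paper; the loss function is a stand-in):

```python
import jax
import jax.numpy as jnp

def hvp(f, w, v):
    # Forward-over-reverse AD: differentiate grad(f) along direction v,
    # giving H(w) @ v at roughly the cost of a few gradient evaluations,
    # without ever materializing the Hessian.
    return jax.jvp(jax.grad(f), (w,), (v,))[1]

def loss(w):
    # toy stand-in for a network loss
    return jnp.sum(jnp.tanh(w) ** 2)

w = jnp.array([0.5, -1.0, 2.0])
v = jnp.array([1.0, 0.0, 0.0])
print(hvp(loss, w, v))
```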

Also, if you generalize VAEs to be arbitrary differentiable programs you
immediately run into versions of message passing. For example, vector
quantized VAEs are precisely a soft version of the EM step in k-means. So
how would we develop more advanced variational programs in which this process
is better managed? There may be very different forms for the inference and
generation phases that are not related by a program transformation, or ARE
related but only in a high-level way best captured by functional programming
primitives.
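
To make the k-means connection concrete: the quantization step can be written
so that the usual hard nearest-code assignment is the zero-temperature limit
of EM-style soft responsibilities (a toy JAX sketch with made-up shapes; the
temperature knob is purely illustrative, not from the VQ-VAE paper):

```python
import jax.numpy as jnp
from jax.nn import softmax, one_hot

def assignments(z, codebook, temperature=0.1):
    # squared distances from each encoder output z[n] to each codebook vector
    d2 = jnp.sum((z[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    soft = softmax(-d2 / temperature, axis=-1)                  # EM-style responsibilities
    hard = one_hot(jnp.argmin(d2, axis=-1), codebook.shape[0])  # VQ-style nearest code
    return soft, hard

z = jnp.array([[0.1, 0.2], [0.9, 1.1]])         # two encoder outputs
codebook = jnp.array([[0.0, 0.0], [1.0, 1.0]])  # two code vectors
soft, hard = assignments(z, codebook)
# as temperature -> 0, the soft responsibilities collapse onto the hard assignment
```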

Any other people nerding out about this stuff please say hi!

~~~
proofofconcept
>there is a follow-up paper to the DL via Hessian-free optimization paper by
James Martens that develops a variant of AD which calculates a special
curvature quantity that is useful for efficient second-order optimization

Deep Learning via Hessian-free Optimization:
[http://www.cs.toronto.edu/~jmartens/docs/Deep_HessianFree.pdf](http://www.cs.toronto.edu/~jmartens/docs/Deep_HessianFree.pdf)

Optimizing Neural Networks with Kronecker-factored Approximate Curvature:
[http://arxiv.org/abs/1503.05671](http://arxiv.org/abs/1503.05671)

James Martens' list of publications with links to sample code for the above
two papers, slides/condensed conference versions, etc:
[http://www.cs.toronto.edu/~jmartens/research.html](http://www.cs.toronto.edu/~jmartens/research.html)

Pretty neat stuff

~~~
taliesinb
Haha thanks for finding links! Was on a bus on my phone, so didn't have the
patience...

------
carapace
> Automatic differentiation is an elegant technique for converting a
> computable function expressed as a program into a derivative-computing
> program with similar time complexity. It does not execute the original
> program as a black-box, nor does it expand the program into a mathematical
> formula, both of which would be counter-productive. By generalizing this
> technique, you can produce efficient algorithms for constraint satisfaction,
> optimization, and Bayesian inference on models specified as programs. This
> approach can be broadly described as compiling into a message-passing
> program.

[https://www.microsoft.com/en-us/research/video/from-automatic-differentiation-to-message-passing/](https://www.microsoft.com/en-us/research/video/from-automatic-differentiation-to-message-passing/)

~~~
gnulinux
> nor does it expand the program into a mathematical formula.

Meh, this is not _entirely_ accurate. Sure, it doesn't expand the program into
analytic functions whose derivatives are easy to compute; but it still handles
the program symbolically. So, in a way, it still transforms the program into a
known set of mathematical primitives, from which it can then construct a
program that computes the derivative at compile time.
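
A concrete way to see this (JAX here, only because it exposes its intermediate
representation; it may not reflect how the system in the talk works): the
derivative-computing program that AD constructs is a new program over a fixed
set of primitives with known derivative rules, not a closed-form formula and
not a black-box re-execution of the original function:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x) * x + jnp.exp(x)

# make_jaxpr prints the program that grad(f) actually runs: a composition of
# primitive ops (sin, cos, exp, mul, add, ...) wired up by the AD transform.
print(jax.make_jaxpr(jax.grad(f))(2.0))
```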

~~~
carapace
It seems that way to me too, but without more details about the implementation
I'm not sure.

------
aqi
At the end of the presentation the presented mentions that this is
structurally identical to loopy belief propagation... Isn't that a big issue,
since they inherit many of its tractability issues with regards to training
and inference? Modern DL models are far too interconnected for inference to be
tractable in general, so the best we can hope for is that we can make
simplifying assumptions that make loopy belief propagation feasible.
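
For reference, the fixed-point iteration being referred to looks roughly like
this (a toy sum-product sketch on a three-variable binary cycle with made-up
potentials; on loopy graphs exactly this iteration is where the convergence
and tractability caveats bite):

```python
import jax.numpy as jnp

psi = jnp.array([[2.0, 1.0], [1.0, 2.0]])  # shared pairwise potential
edges = [(0, 1), (1, 2), (2, 0)]
# messages[(i, j)] is the message from variable i to variable j
messages = {(i, j): jnp.ones(2) for i, j in edges + [(j, i) for (i, j) in edges]}

for _ in range(20):  # iterate toward a fixed point; convergence is not guaranteed on loopy graphs
    new = {}
    for (i, j) in messages:
        incoming = jnp.ones(2)
        for (k, l) in messages:
            if l == i and k != j:  # product of messages into i, excluding the one from j
                incoming = incoming * messages[(k, l)]
        m = psi @ incoming         # sum-product update through the pairwise factor
        new[(i, j)] = m / m.sum()
    messages = new

belief = jnp.ones(2)
for (k, l) in messages:
    if l == 0:
        belief = belief * messages[(k, l)]
print(belief / belief.sum())       # approximate marginal of variable 0
```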

As a side note, when modern compilers optimize abstract syntax trees, I'm
pretty sure they do operations that are similar to the message-passing
algorithm described. And they work great, albeit for specialized purposes.

------
carapace
It seems to me that the message-passing aspect is kind of an implementation
detail.

In any case, compare and contrast with "Compiling to categories"
[http://conal.net/papers/compiling-to-categories/](http://conal.net/papers/compiling-to-categories/)

------
keithalewis
Another approach to AD is:
[https://github.com/keithalewis/epsilon](https://github.com/keithalewis/epsilon).

