
Automatic Differentiation in Machine Learning: A Survey [pdf] - sonabinu
http://jmlr.org/papers/volume18/17-468/17-468.pdf
======
edflsafoiewq
AD reminds me of another idea: in cryptography, you can determine, e.g., how
bit dependencies propagate through a state vector by running the regular code
for your compression function or whatever, but on "numbers" that actually track
the dependencies of bits. Details vary, but a simple idea is to use as a
"number" a vector of symbolic equations, one for each bit of the number, in
terms of symbolic variables for the function's input bits. You can then easily
calculate items of interest, e.g., after this many rounds, has every input bit
affected every output bit?
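
For concreteness, here's a rough sketch of the coarsest version of this idea,
tracking per output bit only the set of input-bit labels it can depend on,
rather than full boolean equations. The word size, the toy round function, and
all names are made up for illustration:

    WIDTH = 32

    class DepWord:
        """A 'word' whose bits carry the set of input bits they depend on."""
        def __init__(self, deps):
            self.deps = tuple(deps)  # one frozenset of labels per bit position

        @staticmethod
        def symbolic(name):
            # Fresh input word: bit i depends only on the input bit (name, i).
            return DepWord(frozenset([(name, i)]) for i in range(WIDTH))

        def __xor__(self, other):
            # XOR: output bit i depends on the union of both operands' bit-i deps.
            return DepWord(a | b for a, b in zip(self.deps, other.deps))

        __and__ = __xor__  # same over-approximation for AND in this sketch

        def rotl(self, n):
            # Rotate left by n: the dependency sets just move with the bits.
            return DepWord(self.deps[(i - n) % WIDTH] for i in range(WIDTH))

    def toy_round(x, y):
        # A made-up round, purely to show the analysis driving ordinary code.
        x = x ^ y.rotl(7)
        y = (y & x.rotl(13)) ^ x
        return x, y

    x, y = DepWord.symbolic("x"), DepWord.symbolic("y")
    everything = {("x", i) for i in range(WIDTH)} | {("y", i) for i in range(WIDTH)}
    for r in range(1, 9):
        x, y = toy_round(x, y)
        full = all(d == everything for d in x.deps + y.deps)
        print(f"round {r}: every output bit touched by every input bit? {full}")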

So, jets instead of numbers for differentiation, systems of boolean equations
instead of numbers for answering some crypto questions... anyone know of other
tricks in this vein?

~~~
profquail
What you’re describing sounds like Abstract Interpretation[1], a technique
used in static analyzers and compilers (e.g. for bounds-check elimination).
You can pick any abstract domain you want, but the more complex the domain —
i.e., the more complex/precise your tracking of the possible values associated
with each variable — the longer it takes for the analysis to run.

[1]:
[https://en.m.wikipedia.org/wiki/Abstract_interpretation](https://en.m.wikipedia.org/wiki/Abstract_interpretation)
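
For a toy example of what an abstract domain looks like, here's an interval
domain sketched in Python; the "bounds-check elimination" at the end is purely
illustrative and not taken from any real analyzer:

    class Interval:
        def __init__(self, lo, hi):
            self.lo, self.hi = lo, hi

        def __add__(self, other):
            return Interval(self.lo + other.lo, self.hi + other.hi)

        def __mul__(self, other):
            p = [a * b for a in (self.lo, self.hi) for b in (other.lo, other.hi)]
            return Interval(min(p), max(p))

        def join(self, other):
            # Least upper bound, used where control-flow paths merge.
            return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

        def __repr__(self):
            return f"[{self.lo}, {self.hi}]"

    # If i is known to lie in [0, 9], then 2*i + 1 lies in [1, 19], so an access
    # into an array of length 20 needs no runtime bounds check.
    i = Interval(0, 9)
    index = i * Interval(2, 2) + Interval(1, 1)
    print(index, "in bounds for length 20:", index.lo >= 0 and index.hi < 20)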

~~~
edflsafoiewq
Abstract interpretation would be the equivalent of symbolic differentiation in
this analogy.

------
infogulch
Another paper related to AD was posted last month: The Simple Essence of
Automatic Differentiation [0]. Video presentation [1]. I found it pretty
awesome, just in case you hadn't seen it.

[0]: [https://news.ycombinator.com/item?id=18306860](https://news.ycombinator.com/item?id=18306860)

[1]:
[https://www.youtube.com/watch?v=ne99laPUxN4](https://www.youtube.com/watch?v=ne99laPUxN4)

~~~
YorkshireSeason
Also not mentioned in the Baydin et al. paper (which has been knocking about
for a while: it was submitted in 2017, so that's not surprising):

* F. Wang, X. Wu, G. Essertel, J. Decker, T. Rompf, _Demystifying Differentiable Programming: Shift/Reset the Penultimate Backpropagator_, see [1], and [4] for a presentation based on this paper.

* S. Laue, M. Mitterreiter, J. Giesen, _Computing Higher Order Derivatives of Matrix and Tensor Expressions_, see [2], discussed in [3].

Quite a bit of exciting work in better understanding backpropagation going on
right now.

[1] [https://arxiv.org/abs/1803.10228](https://arxiv.org/abs/1803.10228)

[2]
[http://www.matrixcalculus.org/matrixcalculus.pdf](http://www.matrixcalculus.org/matrixcalculus.pdf)

[3]
[https://news.ycombinator.com/item?id=18464003](https://news.ycombinator.com/item?id=18464003)

[4]
[https://www.youtube.com/watch?v=igRLKYgpHy0](https://www.youtube.com/watch?v=igRLKYgpHy0)

------
kxyvr
I skimmed the article and I enjoyed it. Recently, I was looking for a good
reference to cite that discussed the link between automatic differentiation
and backpropagation. Those I know in the field have known about the link for a
very, very long time, but I had a hard time finding a paper to cite. The
best I found was an article titled, "Backwards Differentiation in AD and
Neural Nets: Past Links and New Opportunities" by Paul Werbos, which can be
found in the book "Automatic Differentiation: Applications, Theory, and
Implementations" published in 2006 by Springer. That article is preceded by
one written by Louis Rall titled, "Perspectives on Automatic Differentiation:
Past, Present, and Future?", which also provides some good history of AD.
Anyway, if someone has a better paper to cite linking back propagation and AD,
I'd be interested to hear.

~~~
whatshisface
I thought the article itself did a good job of explaining it. Backpropagation
is just a special case of one way to do AD (the reverse mode), apparently. AD
prescribes a method
of taking derivatives, and if you use it on a neural network, there you go,
backpropagation.
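
For concreteness, here's a minimal reverse-mode AD sketch (toy code, not taken
from the paper): each node records the local partial derivatives to its inputs,
and the reverse sweep over a tiny sigmoid "neuron" produces exactly the
gradients backpropagation would.

    import math

    class Var:
        def __init__(self, value, parents=()):
            self.value = value
            self.parents = parents      # pairs of (parent Var, local partial)
            self.grad = 0.0

        def __add__(self, other):
            return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

        def __mul__(self, other):
            return Var(self.value * other.value,
                       [(self, other.value), (other, self.value)])

    def sigmoid(x):
        s = 1.0 / (1.0 + math.exp(-x.value))
        return Var(s, [(x, s * (1.0 - s))])

    def backward(output):
        # Reverse sweep: topologically order the graph, then push adjoints from
        # the output back toward the leaves (the chain rule, node by node).
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)
        visit(output)
        output.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += node.grad * local

    # Forward pass y = sigmoid(w*x + b), then the reverse sweep.
    w, x, b = Var(0.5), Var(2.0), Var(-1.0)
    y = sigmoid(w * x + b)
    backward(y)
    print(y.value, w.grad, x.grad, b.grad)   # y, dy/dw, dy/dx, dy/db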

------
adamnemecek
Dual numbers are a favorite idea of mine of late. Nilpotence seems underappreciated.

~~~
sgillen
Can you elaborate? What makes you think that nilpotence is underappreciated?

~~~
adamnemecek
Well, in this case, dual numbers have this nilpotent constant epsilon,
analogous to i, such that epsilon^2 = 0. It's the basis of automatic
differentiation. I don't fully understand it yet, but it seems that nilpotence
is useful for smoothness and differentiability.

~~~
gugagore
I think you are thinking about this the wrong way. Nilpotence doesn't relate to
smoothness (of what?) or differentiability (of what?). Dual numbers, the way I
see it, happen to _model_ forward-mode differentiation. There are a couple of
ways to see why eps^2 = 0 is a good model, but the easiest to get across is that
in many calculus derivations you take dx^2 to be 0, since dx is "a very small
number".
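
Here's a tiny sketch of what "dual numbers model forward mode" looks like in
code (illustrative only, not from the survey): carry the value and the eps
coefficient together, drop eps^2 terms, and the eps part is the derivative.

    import math

    class Dual:
        def __init__(self, a, b=0.0):
            self.a, self.b = a, b          # value and derivative part

        def __add__(self, other):
            return Dual(self.a + other.a, self.b + other.b)

        def __mul__(self, other):
            # (a1 + b1*eps)(a2 + b2*eps) = a1*a2 + (a1*b2 + b1*a2)*eps, since
            # the eps^2 term vanishes -- which is exactly the product rule.
            return Dual(self.a * other.a, self.a * other.b + self.b * other.a)

    def sin(x):
        return Dual(math.sin(x.a), math.cos(x.a) * x.b)

    # Differentiate f(x) = x*x + sin(x) at x = 1.5 by seeding the eps part with 1.
    x = Dual(1.5, 1.0)
    y = x * x + sin(x)
    print(y.a, y.b)   # f(1.5) and f'(1.5) = 2*1.5 + cos(1.5)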

------
mohancy
Great, but machine learning algorithms are changing too quickly these days.

