
Automatic Differentiation (2009) - Tomte
https://justindomke.wordpress.com/2009/02/17/automatic-differentiation-the-most-criminally-underused-tool-in-the-potential-machine-learning-toolbox/
======
davmre
At the time this post was written, there was no good autodiff implementation
for any language that anyone actually used for ML, meaning Python, Matlab, or
Java. That's changed now; for example, Python autograd [1] is quite seamless
and plays well with the entire NumPy stack.

The bigger trend, though, has been towards DSLs such as Theano or TensorFlow
(and Torch to a lesser degree) that include autodiff as a first-class design
consideration. The ability to build complicated networks out of modular parts,
and have the gradient calculations automatically fall out, has been a major
driver of the recent explosive progress in deep learning research, as has easy
portability between CPU and GPU execution. It'll be exciting to see what
happens as these tools mature and especially if they find applications in
other fields.

[1] [https://github.com/HIPS/autograd](https://github.com/HIPS/autograd)
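
For anyone curious, a minimal sketch of what that looks like with autograd
(assuming the HIPS/autograd package from [1]; the nested-sin function is just
the example from the linked post):

    import autograd.numpy as np   # thin NumPy wrapper that records operations
    from autograd import grad

    def f(x, depth=10):
        # the post's nested example: sin(x + sin(x + ... + sin(x)))
        y = np.sin(x)
        for _ in range(depth - 1):
            y = np.sin(x + y)
        return y

    df = grad(f)              # df/dx via reverse-mode AD, no symbolic expression
    print(f(1.0), df(1.0))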

------
TeMPOraL
Never heard of it before. Thanks, 'Tomte.

Some resources for Common Lisp I found:

\- AD in Lisp paper:
[http://www.cs.berkeley.edu/~fateman/papers/ADIL.pdf](http://www.cs.berkeley.edu/~fateman/papers/ADIL.pdf)

\- A library: [https://github.com/masonium/cl-autodiff](https://github.com/masonium/cl-autodiff)
(basics seem to work, but it definitely could use some love to grow)

------
mej10
TensorFlow has built-in automatic differentiation -- so it seems like at least
some progress has been made since this post.
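
For instance, something like this (a sketch, assuming the TensorFlow 1.x
graph-building API) gets the gradient nodes added for you:

    import tensorflow as tf

    x = tf.placeholder(tf.float32)
    y = tf.sin(x + tf.sin(x))          # build the computation graph
    dy_dx, = tf.gradients(y, [x])      # gradient nodes, added automatically

    with tf.Session() as sess:
        print(sess.run([y, dy_dx], feed_dict={x: 1.0}))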

~~~
jonas21
The Ceres Solver (also from Google) has great automatic differentiation
support as well.

[http://ceres-solver.org/index.html](http://ceres-solver.org/index.html)

------
throwaway39210
Can someone further clarify the distinction between automatic and symbolic
differentiation?

The author claims that using symbolic differentiation on his example
(f(x)=sin(x+sin(x+...+sin(x+sin(x))...))) builds a "huge expression that would
take much more time to compute" than automatic differentiation. At the top of
the article there's also an update noting an "explosion of great tools for
automatic differentiation" recently, which I assume is referring to
machine/deep learning frameworks like Theano, TensorFlow, etc. However, from
my understanding, building an entire computational graph to express the
gradient via the chain rule (backpropagation) is exactly what these frameworks
do, and by the author's brief description that seems to fall under symbolic
differentiation. So which one are these frameworks doing -- symbolic or
automatic differentiation? And, concretely, how would they differ if they were
instead doing the other one?

Edit: I've been doing some further reading trying to understand this, and it's
only confusing me further. Wikipedia has this diagram [1] which supposedly
illustrates the relationship between the two -- what the hell is going on
here? The only interpretation I can come up with from it is that automatic
differentiation takes code (written in a computer programming language) as
input and produces code to compute the gradient, whereas in symbolic
differentiation, the input and output are mathematical expressions. But this
doesn't seem like a meaningful distinction to me...both math and computer code
are languages; functions written in programming languages _are_ mathematical
functions (the ones that can actually be correctly called "functions",
anyway). Someone please help me out here; what am I missing?

[1]
[https://en.wikipedia.org/wiki/Automatic_differentiation#/med...](https://en.wikipedia.org/wiki/Automatic_differentiation#/media/File:AutomaticDifferentiationNutshell.png)

~~~
laughinghan
To the best of my understanding:

\- with numerical differentiation, you take an algorithm to approximate a
function, you repeatedly evaluate that, and you use the results to approximate
the derivative

\- with symbolic differentiation, you take a symbolic, _exact_ representation
of a function (which you could evaluate with floating-point numbers to
approximate its values if you wanted to use numerical differentiation
instead), and you apply rules like linearity and the product rule to get a
symbolic, exact representation of the derivative, which you evaluate with
floating-point numbers to approximate the derivative

\- with automatic differentiation, you take an algorithm to approximate a
function, which must be implemented using basic floating-point operations, and
you apply rules like linearity and the product rule _to the operations used to
implement the approximation_, and you get an algorithm that approximates the
function's derivative

Put another way, when people say they're using symbolic or automatic
differentiation to "calculate" the derivative of a function, "calculate" means
different things: in the one, they're "calculating" the exact, symbolic
_expression_ for the derivative (which can optionally be evaluated); in the
other, they're "calculating" the floating-point approximation of the _values_
of the derivative.
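
A rough sketch of the contrast on the post's nested-sin example (assuming
SymPy for the symbolic side; the forward-mode loop is hand-rolled, not any
particular library):

    import math
    import sympy as sp

    def f_sym(x, depth=5):
        y = sp.sin(x)
        for _ in range(depth - 1):
            y = sp.sin(x + y)
        return y

    x = sp.symbols('x')
    dfdx = sp.diff(f_sym(x), x)   # symbolic: an expression that grows with depth
    print(sp.count_ops(dfdx))     # gets huge as depth increases

    def f_and_df(x, depth=5):
        # "automatic": carry (value, derivative) through the same operations
        y, dy = math.sin(x), math.cos(x)
        for _ in range(depth - 1):
            y, dy = math.sin(x + y), math.cos(x + y) * (1.0 + dy)  # chain rule
        return y, dy

    print(f_and_df(1.0))          # derivative as a number, no big expression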

To me, automatic differentiation is GENIUS. My friend, who's written a Julia
package [1] based on some of the ideas behind automatic differentiation,
introduced me to them; unfortunately, according to him, numerical
differentiation is usually more practical. I don't remember why but I'm going
to ask him.

[1]:
[https://github.com/jwmerrill/PowerSeries.jl](https://github.com/jwmerrill/PowerSeries.jl)

~~~
gh02t
You can think of AD as differentiating a function that is defined in terms of
composition... e.g. f(g(h(x))), where (loosely) x is the initial state of your
function and f, g, and h are differentiable mappings from one state to
another. In this analogy, in a procedural language f, g, and h would be lines
in your source code, which modify the overall program state to produce some
final value. There are a few different ways to actually do it (including the
rather tricky concept of "dual numbers"), but ultimately it amounts to
differentiating the overall program f(g(h(x))) recursively using the chain
rule. It's [often] more efficient than direct symbolic differentiation because
of this recursive evaluation - think of it as something like tail-call
optimization.

It's not an approximate method, nor does it care [directly] about floating
point. If you have implemented some f' that approximates a mathematical
function f and you use AD to differentiate f', you are calculating the _exact_
derivative of f', not an approximation of the derivative of f. It's an
important distinction.
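
A bare-bones sketch of the dual-number idea (just an illustration, not tied to
any library): carry a value and a derivative together, apply the product and
chain rules at each operation, and the derivative of the composed program
f(g(h(x))) falls out.

    import math

    class Dual:
        def __init__(self, val, dot=0.0):
            self.val, self.dot = val, dot
        def __add__(self, other):
            return Dual(self.val + other.val, self.dot + other.dot)
        def __mul__(self, other):
            # product rule applied to the operation itself
            return Dual(self.val * other.val,
                        self.val * other.dot + self.dot * other.val)

    def sin(d):
        return Dual(math.sin(d.val), math.cos(d.val) * d.dot)  # chain rule

    def h(x): return x * x
    def g(x): return sin(x)
    def f(x): return x + sin(x)

    x = Dual(1.5, 1.0)       # seed dx/dx = 1
    y = f(g(h(x)))
    print(y.val, y.dot)      # value and exact derivative of the composed program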

~~~
laughinghan
I found dual numbers pretty intuitive, but I was a math major and studied
abstract algebra and stuff, so that way of thinking is definitely an acquired
taste.

To be clear, when you say f' you're not talking about Newton's notation for
the derivative of f, you're talking about a function from floats to floats
that is the approximation/algorithmic implementation of f (which is R->R),
right?

I don't really see why the distinction is particularly important. The reason
the exact derivative of f' that you get from AD is _useful_ is because it also
approximates the derivative of f, right? If it were possible to make some
pathological choice of f' (that is, algorithmic implementation of f) such that
the result from AD is a poor approximation of the derivative of f, that's not
really a successful use of AD on f, right? So when I say I'm gonna "use AD to
get a derivative of f", I'm really saying I'm going to use AD on a good choice
of an approximation of f to get a good approximation of the derivative of f,
right?

I suppose if infinite precision analog computing were possible, or in the
context of theoretical real-valued register machines (does anyone still study
that?), it matters that AD is exact on the algorithmic implementation, but
we're not in or anywhere near those contexts, are we?

~~~
gh02t
> To be clear, when you say f' you're not talking about Newton's notation for
> the derivative of f, you're talking about a function from floats to floats
> that is the approximation/algorithmic implementation of f (which is R->R),
> right?

Yep. I probably should have used f* instead of f' because of context. R->R
isn't necessary, however; it should generalize quite a bit.

As to why the distinction is important, it's because of sensitivity.
Approximating a function rarely also approximates its _derivative_ well unless
it is constructed to do so (e.g. k-th order Hermite polynomials). As an
example, consider the simplest approximation scheme, basic polynomial fitting.
It guarantees that your fit matches the true function at all sample points,
but the derivative might (and usually will) be _wildly_ different. You
have to be extremely careful differentiating approximate schemes and relating
them to the true derivative. Usually what you're interested in for AD is the
derivative of the _approximate_ scheme and not an approximate derivative for
the exact function (e.g., sensitivity analysis).
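
A quick numerical illustration of that point (assuming NumPy; Runge's function
and the degree-10 fit are arbitrary choices for the example): the interpolant
matches the values at every sample point, but not the derivative.

    import numpy as np

    f  = lambda x: 1.0 / (1.0 + 25.0 * x**2)            # Runge's function
    df = lambda x: -50.0 * x / (1.0 + 25.0 * x**2)**2   # its true derivative

    xs = np.linspace(-1, 1, 11)
    p  = np.polyfit(xs, f(xs), 10)    # degree-10 interpolating polynomial
    dp = np.polyder(p)

    print(np.max(np.abs(np.polyval(p, xs) - f(xs))))    # ~0: values match
    print(np.max(np.abs(np.polyval(dp, xs) - df(xs))))  # large: derivatives don't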

And I agree that dual numbers are intuitive to me as well, but I think it
requires a decent bit of background as to why they work.

------
ScottBurson
Automatic differentiation dates back at least to 1974, in a language called
PROSE that ran on the CDC 6600 family [0].

[0]
[https://en.wikipedia.org/wiki/PROSE_modeling_language](https://en.wikipedia.org/wiki/PROSE_modeling_language)

------
Animats
I had a need for this in an unrelated problem area - ragdoll physics. In
1996-1997 I came out with Falling Bodies, the first ragdoll physics system that
worked right. It was a spring-damper system, not an impulse-constraint system
like most game engines; this produces more accurate and better looking
collisions, but is more expensive computationally. So it was used as an
animation tool, not for games.

The rigid body dynamics system used Featherstone's algorithm. This models a
tree of links and joints, a skeleton. You put in forces and torques at each
joint, and the algorithm computes accelerations at each joint and the root
node. Those accelerations then have to be integrated to get velocities and
positions. As positions change, so do contact forces. If you don't want to get
interpenetration in collisions, the contact forces must be allowed to become
very large. We used exponential springs as the contact model to achieve this.

Numerically, this means integrating a very stiff system of differential
equations. Explicit integration, like Runge-Kutta 4, works, but you have to
monitor the error term and cut the time step when necessary. During the early
parts of a hard collision, the time step may have to drop into the microsecond
range. This is why this approach doesn't work well for games; it's not
constant-time. (If you don't cut the time step, your simulation goes "sprong",
and things go flying off into space. Some games still do this.)

An alternative is to use implicit integration, where you work backwards to
solve the next step to be consistent with the previous one. Implicit
integration works with larger timesteps on stiff systems. (There's still
error, but it acts in the direction of draining energy from the moving system,
which is much better than adding it.) This requires gradients of the function
that the Featherstone algorithm is computing. The brute-force approach is to
compute the gradients (the Jacobian) by perturbing each input by a small
amount, running the Featherstone algorithm, measuring the differences, and
obtaining the slopes numerically. This is slow.
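
A sketch of that brute-force scheme (dynamics() here is a hypothetical
stand-in for the Featherstone step, not code from SD/Fast):

    import numpy as np

    def numerical_jacobian(dynamics, state, eps=1e-6):
        # one extra dynamics evaluation per input dimension -- slow
        f0 = dynamics(state)
        J = np.empty((f0.size, state.size))
        for i in range(state.size):
            perturbed = state.copy()
            perturbed[i] += eps
            J[:, i] = (dynamics(perturbed) - f0) / eps   # one-sided difference
        return J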

Ideally, you'd like to have analytical Jacobians. That's where automatic
differentiation would help. Differentiating Featherstone's algorithm
analytically is hard, but might be possible today.

We used SD/Fast, which is a C code generator for Featherstone's algorithm.
Straightforward implementations of Featherstone have lots of switches and
conditionals as they deal with the topology of the skeleton being simulated.
SD/Fast took in the link and joint description and turned out C code for that
structure. This gets rid of most of the conditionals. It unrolled the code more
than you'd want to do today; there was much loop unrolling at the source
level. (Before superscalar CPUs, unrolling loops was a win; today, tiny tight
loops are a win.)

Anyway, it's possible that the tools being developed in machine learning could
be applied to game physics in that way. At the visual level, the difference is
that big objects bounce like big objects, with slower collision rebound times.
In impulse-constraint systems, all bounces are instantaneous, which is why, in
video games, everything seems to be too light.

------
scottlocklin
The caveats are worth reading before the rest of it. How long does it take to
write down the derivative of a loss function, compared to trusting that your
automatic diff did what you expected? I can tell you: not long.

------
hardmaru
Hi

Can you clarify the difference between autodiff and symbolic diff?

Thanks

~~~
klipt
Symbolic calculates the symbolic representation of the derivative; autodiff
just calculates the (numerical value of the) derivative by applying the chain
rule to the operations in the computation graph.

------
bawana
indeed, how did a post from 2009 get on the front page of HN?

~~~
gavazzy
You'd be surprised how many applications are using techniques for optimization
that not only work poorly but require more work. Many programmers, academics,
and students who browse HN may benefit from AD but don't know about it.

