
Dual Numbers and Automatic Differentiation (2014) - Kristine1975
http://blog.demofox.org/2014/12/30/dual-numbers-automatic-differentiation/
======
Atrix256
BTW this older article of mine has been extended with a newer one that shows
how to handle multiple variables (:
[http://blog.demofox.org/2017/02/20/multivariable-dual-numbers-automatic-differentiation/](http://blog.demofox.org/2017/02/20/multivariable-dual-numbers-automatic-differentiation/)

~~~
imurray
_> If I had to guess, I’d say that dual numbers may be slightly slower than
backpropagation_

The cost of dual numbers (a form of forward-mode differentiation) scales
linearly with the number of inputs you want derivatives for (just like finite
differencing, but more accurate). Backpropagation, or reverse-mode
differentiation, costs a constant factor times a function evaluation, however
many inputs there are. For neural nets with millions of parameters,
backpropagation is going to be millions of times faster than dual numbers.
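
To make the scaling concrete, here is a minimal Python sketch of forward mode
with first-order dual numbers (my own illustration, not code from the
article): every input you want a derivative for costs one extra pass through
the function, so a full gradient over n inputs costs n evaluations.

    class Dual:
        """First-order dual number a + b*eps, with eps^2 = 0."""
        def __init__(self, a, b=0.0):
            self.a, self.b = a, b  # value, derivative

        def __add__(self, o):
            o = o if isinstance(o, Dual) else Dual(o)
            return Dual(self.a + o.a, self.b + o.b)
        __radd__ = __add__

        def __mul__(self, o):
            o = o if isinstance(o, Dual) else Dual(o)
            # product rule: ac + (ad + bc)*eps, since eps^2 = 0
            return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
        __rmul__ = __mul__

    def f(x, y):
        return x * x * y + y

    # One forward pass per input: seed that input's eps-part with 1.
    df_dx = f(Dual(3.0, 1.0), Dual(2.0)).b  # 2xy     -> 12.0
    df_dy = f(Dual(3.0), Dual(2.0, 1.0)).b  # x^2 + 1 -> 10.0
    print(df_dx, df_dy)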

~~~
dnautics
[http://www.denizyuret.com/2015/02/beginning-deep-learning-with-500-lines.html](http://www.denizyuret.com/2015/02/beginning-deep-learning-with-500-lines.html)

This guy did neural nets with dual numbers in Julia and found them to take
~1.2 times as long as standard backpropagation, which suggests that if you had
dedicated hardware for it, it would be very nice.

~~~
imurray
I think you're assuming that "automatic differentiation" (AD) is the same
thing as "using dual numbers", but AD is a family of methods that includes
dual numbers. The blog post you link to says it uses
[https://github.com/denizyuret/AutoGrad.jl](https://github.com/denizyuret/AutoGrad.jl),
which does reverse-mode differentiation, not dual numbers. That means it
performs the same operations (or nearly the same) as standard
backpropagation.

Dual numbers are just the wrong approach for neural nets, which have many
parameters. The _amazing_ thing about backprop / reverse-mode differentiation
is that you get the derivatives wrt _all_ the weights in one reverse sweep
through the network.
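
To illustrate (a toy sketch of my own, not how any real library is
structured): reverse mode records local derivatives on the way forward, and a
single sweep back from the output accumulates d(output)/d(input) for every
input at once.

    class Var:
        """Toy reverse-mode AD node; real libraries use a tape, not recursion."""
        def __init__(self, value, parents=()):
            self.value = value
            self.grad = 0.0
            self.parents = parents  # pairs of (input node, local derivative)

        def __add__(self, o):
            return Var(self.value + o.value, [(self, 1.0), (o, 1.0)])

        def __mul__(self, o):
            return Var(self.value * o.value, [(self, o.value), (o, self.value)])

        def backward(self, seed=1.0):
            self.grad += seed
            for parent, local in self.parents:
                parent.backward(seed * local)

    x, y = Var(2.0), Var(3.0)
    out = x * y + x * x    # f(x, y) = xy + x^2
    out.backward()         # one reverse sweep from the output...
    print(x.grad, y.grad)  # ...gives df/dx = y + 2x = 7.0 and df/dy = x = 2.0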

~~~
dnautics
Huh. I stand corrected. I thought for sure the package implemented automatic
differentiation with dual numbers.

------
Fede_V
This is a great post. However, it didn't touch on the two main problems of AD:

\- Using dual numbers requires that every function you call into accepts
templated parameters. If you want to use GSL, BLAS, or any other mature math
library, you are probably out of luck.

\- Even if you are willing to port the code and modify the functions to
accept templated parameters, very highly optimized math libraries make
assumptions not just about the behaviour of numbers (their API, defined by how
they behave under addition/subtraction, etc.) but also about their ABI. For
example, a well-tuned BLAS/LAPACK implementation like OpenBLAS or MKL has
carefully tuned loop sizes to optimize cache behaviour, assuming that floats
are of a particular size.

~~~
imurray
If the function is analytic and supports complex numbers (as is usual in
Matlab/NumPy/Fortran/...), but you're only interested in the real part, a
quick hack is to abuse complex numbers and compute
imag(f(x+i*epsilon))/epsilon. That one-liner can often give the right answer
to machine precision for small epsilon.
[http://blogs.mathworks.com/cleve/2013/10/14/complex-step-differentiation/](http://blogs.mathworks.com/cleve/2013/10/14/complex-step-differentiation/)

For matrix-based code, a good note on derivative propagation is
[https://people.maths.ox.ac.uk/gilesm/files/NA-08-01.pdf](https://people.maths.ox.ac.uk/gilesm/files/NA-08-01.pdf)
(the symbols with dots on top are forward-propagated derivatives, like dual
numbers; the symbols with bars on top are back-propagated derivatives, which
is what you want when computing lots of derivatives at once). "Machine
learning" libraries like Theano, TensorFlow, and Autograd will compute the
reverse-mode operation for many linear algebra expressions automatically, and
use reasonable libraries like BLAS+LAPACK or Eigen under the hood.

------
petters
Using dual numbers and modern C++ you can write a library that can do this:

    
    
        auto lambda =
            [](auto x, auto y)
            {
                // The Rosenbrock function.
                auto d0 =  y[0] - x[0]*x[0];
                auto d1 =  1 - x[0];
                return 100 * d0*d0 + d1*d1;
            };
    
        // The lambda function will be copied and
        // automatically differentiated. The derivatives
        // are computed using templates, not numerically.
        //
        // No need to derive or compute derivatives
        // manually!
        auto func = make_differentiable<1, 1>(lambda);
    

"func" now has code for first-order, second-order derivatives all generated
and heavily optimized at compile-time. This is one reason C++ is so good for
(mathematical) optimization.

------
one-more-minute
Check out ForwardDiff.jl [1], which is a really impressive implementation of
this exact idea. The technique gets pretty magical results: you can apply it
to any function, complete with loops, linear algebra, etc., and it will
compute a derivative with something like 10% overhead. The standard approach
is finite differencing, which has many-times overhead, isn't exact, and can
easily blow up for common pathological cases like step functions.

Fede_V's points about the drawbacks of the technique are valid in C++, but
Julia's duck typing makes being generic the default (including in the
standard library). ForwardDiff works out of the box, for free, in a huge
number of cases.

[1]
[https://github.com/JuliaDiff/ForwardDiff.jl](https://github.com/JuliaDiff/ForwardDiff.jl)

~~~
dnautics
I hesitate to call what Julia does "duck typing" because types are usually
checked at _compile time_, even though it looks like run time because the
system holds off on compilation until it's needed. If you write your functions
correctly (with type stability), there is no active type checking done after
the first time you call the functions.

There certainly is run-time checking in Julia (true duck typing) if you are
not careful with how you write your functions, but I doubt that ForwardDiff,
for example, uses it.

~~~
one-more-minute
By duck typing, I mean that you can write:

    
    
        f(x) = x^2 + 2
    

for example, and the function will happily run with any type that supports
multiply and add. This is in contrast to languages where `x` has to explicitly
inherit from a given supertype or implement an interface to be valid. It's
true that Julia will specialise the function for ints or complex numbers or
whatever, but that's an implementation detail which doesn't change the meaning
of the program.

------
doublerebel
One of the libraries I'm using for computer vision depends on the internal
dual-number class Jet [0] from Google's ceres-solver to do automatic
differentiation.

It was worthwhile research reading through the implementation to understand
the applications.

[0]: [https://github.com/ceres-solver/ceres-solver/blob/master/include/ceres/jet.h](https://github.com/ceres-solver/ceres-solver/blob/master/include/ceres/jet.h)

------
divbit
Super cool. In this example, it's differentiation of a one-dimensional curve,
but one can use the dual numbers to compute tangent spaces of more interesting
algebraic objects as well. See, e.g., Remark 5.38 here:
[http://www.jmilne.org/math/CourseNotes/AG500.pdf](http://www.jmilne.org/math/CourseNotes/AG500.pdf)

------
jmount
Dual numbers are a blast, especially in languages with type templates. I
wrote a Scala implementation some time ago that I would like to share:
[http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/](http://www.win-vector.com/blog/2010/06/automatic-differentiation-with-scala/)

------
BoppreH
I implemented this in Python a while ago, also (ab)using operator overloading:
[https://github.com/boppreh/derivative](https://github.com/boppreh/derivative)

Not remarkable, but works.

    
    
        f = lambda x: x * 5 + x ** 2 - 2 / x + 3 / x ** 2
        print(derive(f, 6))

~~~
abecedarius
Similar:
[https://github.com/darius/sketchbook/blob/master/misc/autodiff.py](https://github.com/darius/sketchbook/blob/master/misc/autodiff.py)

------
zardo
An extension to multiple dimensions is mentioned. This would be exactly
geometric algebra, wouldn't it?

~~~
entelechy
It would be a subset of geometric algebra. Geometric algebra is more general,
as it defines a signature e_i * e_i := {-1, 0, 1} (where the e_i are unit
vectors) and defines an exterior algebra [1].

So a one-dimensional geometric algebra with the signature e_1 * e_1 := 0 is
isomorphic to the dual numbers.

[1]
[https://en.wikipedia.org/wiki/Exterior_algebra](https://en.wikipedia.org/wiki/Exterior_algebra)

------
inlineint
I don't like this use of dual-number notation for differentiation, because
division by a dual number is not always defined. For example, how would one
calculate ε^2/ε? If ε is the matrix [0,1;0,0], then it doesn't have an inverse
and thus the expression cannot be evaluated.

On the other hand, little-o notation [1] was invented for exactly this
purpose. It is easy to evaluate derivatives using it, for example (x+ε)^3 =
x^3 + 3x^2·ε + o(ε), and so ((x+ε)^3 - x^3)/ε = 3x^2 + o(1), and o(1) tends to
0 as ε tends to 0.

[1] [https://en.wikipedia.org/wiki/Big_O_notation#Little-o_notation](https://en.wikipedia.org/wiki/Big_O_notation#Little-o_notation)

~~~
dnautics
Epsilon squared divided by epsilon is most definitely epsilon.

~~~
gugagore
It is most definitely not. For starters, you could also have argued that ε^2/ε
= 0 since the numerator is 0.

------
frankohn
The article is interesting, but one has to be aware that dual numbers are not
a complete solution to automatic differentiation.

The reason is that in principle automatic differentiation needs to keep the
epsilon^n terms to any order, and dual numbers just take epsilon^2 = 0, which
is a strong limitation.

For example, with dual numbers you can say that:

limit(x->0) sin(x) / x = 1

just by setting x = epsilon, but they will fail with:

limit(x->0) (cos(x) - 1) / x^2 = -1/2

because the quadratic term you need goes to zero for x = epsilon.
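
A concrete way to see the failure (my own sketch of first-order dual
arithmetic, not code from the article): propagating cos through a dual number
seeded at x = epsilon keeps only the linear term, so the epsilon^2/2 that
encodes the -1/2 is truncated away.

    import math

    class Dual:
        """First-order dual a + b*eps, with eps^2 = 0."""
        def __init__(self, a, b=0.0):
            self.a, self.b = a, b

    def cos(x):
        # chain rule through cos: value cos(a), eps-part -sin(a)*b
        return Dual(math.cos(x.a), -math.sin(x.a) * x.b)

    eps = Dual(0.0, 1.0)
    c = cos(eps)
    print(c.a, c.b)  # 1.0, -0.0: cos(eps) - 1 is exactly zero to first order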

~~~
blablabla123
> The reason is that in principle automatic differentiation needs to keep the
> epsilon^n terms to any order, and dual numbers just take epsilon^2 = 0,
> which is a strong limitation.

Actually not. Differentiation, in the abstract sense, is the change of the
function, _approximated linearly_. This means that when you do ordinary
differentiation, you already throw away the squared terms. The Leibniz
notation df/dx for a differentiated function is really explicit about this
principle, as opposed to the prime notation f'(x).

What these people call dual-number automatic differentiation is a really
powerful tool. In a more abstract sense, having an object and adding a small
epsilon to it allows you to differentiate much more general objects. For
instance, I could ask: how does a functional S[f] change with respect to the
function f? I know this sounds like abstract nonsense, but you can, for
example, calculate the shape of a hanging rope with it by solving dS[f] = 0
for f.

> limit(x->0) sin(x) / x = 1

> just by setting x = epsilon

No, that's not the case. If you use L'Hôpital's rule, you throw away the
epsilon (it's a constant) and the derivatives of both numerator and
denominator don't change. Thus the two limits stay the same.

------
tehsauce
Anyone else know this guy from his shadertoy profile?

~~~
Atrix256
Too funny, what's your shadertoy handle? (this is demofox)

------
erichocean
I'm not a math guy, but I've gotten incredible mileage out of dual numbers
over the years. Highly recommended.

~~~
amelius
Could you elaborate?

Imho, the article is not very convincing, since dual numbers require roughly
double the amount of work per operation (making the computation of the
derivative not "free" or "automatic").

The concept, however, seems very interesting, so I'm wondering about other
applications.

~~~
erichocean
To follow up on what @zardo said, it's also not "double the amount of work"
in a language like C++ _on the coding side_, because you just replace `float`
or `double` with your dual-number type and read off the derivatives as needed.

So, the "work" is almost nothing vs. computing derivatives by hand, which
requires entirely different math. Furthermore, the actual CPU instruction
count is often lower with a dual implementation, usually because
transcendentals and the like aren't computed over and over again (although
this is obviously computation-dependent).

An example of a project that uses automatic differentiation in a really useful
way is the Ceres Solver.[0]

[0] [http://ceres-solver.org/](http://ceres-solver.org/)

------
empath75
How would this work with second, third, etc. derivatives?

~~~
wenc
This is where it gets complicated.

I'm not sure where the term "dual numbers" came from, but in the numerical
math community the closely related trick is known as complex-step
differentiation.
[https://en.wikipedia.org/wiki/Numerical_differentiation#Complex_variable_methods](https://en.wikipedia.org/wiki/Numerical_differentiation#Complex_variable_methods)

Complex step is only straightforward for first derivatives. For second order
and above, there are a number of competing techniques. One of them is
multicomplex numbers.

From a performance perspective, automatic differentiation pretty much
outperforms complex-step methods in most situations. The only reason to use
complex-step methods is when you have a very fast language with built-in
complex types (e.g., Fortran) and you need a quick-and-easy way of getting
first-order derivatives.

Otherwise, just use AD.
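
For intuition on higher orders, here is a minimal Python sketch (my own
illustration, not one of the libraries mentioned) of second-order forward
mode: each value carries (f, f', f'') and the product and chain rules are
applied termwise, so one pass yields an exact second derivative. Higher
orders extend the same idea with longer truncated Taylor series.

    import math

    class Dual2:
        """Truncated Taylor triple (f, f', f'') propagated through arithmetic."""
        def __init__(self, f, d1=0.0, d2=0.0):
            self.f, self.d1, self.d2 = f, d1, d2

        def __mul__(self, o):
            # Leibniz rule: (uv)'' = u''v + 2u'v' + uv''
            return Dual2(self.f * o.f,
                         self.d1 * o.f + self.f * o.d1,
                         self.d2 * o.f + 2 * self.d1 * o.d1 + self.f * o.d2)

    def sin(x):
        # chain rule: (sin g)'' = cos(g)*g'' - sin(g)*(g')^2
        return Dual2(math.sin(x.f),
                     math.cos(x.f) * x.d1,
                     math.cos(x.f) * x.d2 - math.sin(x.f) * x.d1 ** 2)

    x = Dual2(1.5, 1.0, 0.0)  # seed: dx/dx = 1, d^2x/dx^2 = 0
    y = x * x * sin(x)        # y = x^2 * sin(x)
    print(y.f, y.d1, y.d2)    # value, y', y'' at x = 1.5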

