
Exact numeric nth derivatives - jliszka
http://jliszka.github.io/2013/10/24/exact-numeric-nth-derivatives.html
======
jordigh
Am I missing something or is this begging the question?

For any function that is not a combination of polynomials, you need to have
its Taylor expansion up to the desired order of derivatives, so you can't just
take an "arbitrary" function and use this method to compute its derivative in
exact arithmetic.

So for anything other than polynomials, you just reword the problem of finding
exact derivatives to finding exact Taylor series, and in order to find Taylor
series in most cases, you have to differentiate or express your function in
terms of the Taylor series of known functions.

Edit: Indeed, take the only non-polynomial example here, a rational function
(division by a polynomial). In order to make this work, you have to know the
geometric series expansion of 1/(1-x). For each function that you want to
differentiate this way, you have to keep adding more such pre-computed Taylor
expansions.

~~~
crntaylor
The typical approach is to provide overloaded versions of primitive functions
(generally addition, multiplication, subtraction, division, powers, trig and
hyperbolic trig functions, exponential and logarithm) for which you explicitly
tell the program what the first N terms in the Taylor series are.

The magic is that you also tell it how to compute Taylor series of function
compositions, if the Taylor series of the functions being composed are already
known - then any arbitrary function composed out of the primitive functions can
have its Taylor series computed automatically!

For your example, the function 1/(1-x) is the composition of

    
    
        x -> -x
        x -> 1+x
        x -> 1/x
    

and so its Taylor series is already known as long as you have already defined
negation, addition and reciprocation.
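
A minimal sketch of that idea in Python (my own toy code for illustration, not the article's implementation; just the handful of overloads this example needs):

    
    
        # First-order dual numbers: (a, b) encodes a + b*eps with eps^2 = 0,
        # so the b slot automatically carries the derivative.
        class Dual:
            def __init__(self, a, b=0.0):
                self.a, self.b = a, b
            def __neg__(self):
                return Dual(-self.a, -self.b)
            def __rsub__(self, c):      # c - x for a constant c
                return Dual(c - self.a, -self.b)
            def __rtruediv__(self, c):  # c / x, via d/dx (c/x) = -c/x^2
                return Dual(c / self.a, -c * self.b / self.a ** 2)
    
        def f(x):
            return 1 / (1 - x)          # built purely from the primitives above
    
        y = f(Dual(0.5, 1.0))           # seed the derivative slot with 1
        print(y.a, y.b)                 # 2.0 and f'(0.5) = 1/(1-0.5)^2 = 4.0
    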

~~~
jordigh
So instead of implementing a full CAS that knows linearity of differentiation
along with the product and chain rules, you implement something that, for a
little less computation, can give you exact derivatives at any given point. Am
I understanding this correctly?

~~~
crntaylor
Yes, that's more or less it. The system you implement still knows about
linearity of differentiation, the product rule, the chain rule, etc., but it's
not a full-blown CAS. It can't give you the symbolic derivative of a function
(although I guess you can technically apply AD to any numeric type, including
a symbolic type...) but it can compute the numerical value of the derivatives
at arbitrary points.

------
crntaylor
Very neat. Presumably there is a more efficient method for implementing Nth
order automatic differentiation than encoding the dual numbers as NxN
matrices, though? To multiply the matrices takes O(N^3) time, whereas by
exploiting their known structure I think you should be able to do it in O(N^2)
time. Am I wrong?

~~~
jliszka
You're right, there's only O(n) information in these particular n x n
matrices, so you can multiply them in O(n^2).

The code provided memoizes cell values using the key (r - c), so only O(n^2)
work is required to read off the top row of the matrix. I'm planning to make
that more explicit in a follow-up post.
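
For the curious, a sketch of what exploiting that structure looks like (my own
construction, not the post's actual code): since each matrix is determined by
its top row, you can keep just the coefficient list and multiply with a
truncated Cauchy product:

    
    
        # Multiply two truncated Taylor coefficient lists in O(n^2),
        # instead of O(n^3) for the equivalent n x n matrix product.
        def mul_truncated(f, g):
            n = len(f)
            h = [0.0] * n
            for i in range(n):
                for j in range(n - i):   # drop terms of order >= n
                    h[i + j] += f[i] * g[j]
            return h
    
        # (1 + x)^2 truncated to 4 coefficients:
        print(mul_truncated([1, 1, 0, 0], [1, 1, 0, 0]))  # [1.0, 2.0, 1.0, 0.0]
    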

------
dhammack
There's an interesting Python library [1] which implements AD and has some
neat features like automatic compilation to optimized C. It's developed by the
AI lab at the University of Montreal, and is pretty popular in deep learning
circles. I've found it to be a huge time saver not to have to worry about
whether you screwed up your gradient calculations when doing exploratory
research!

[1]
[http://deeplearning.net/software/theano/](http://deeplearning.net/software/theano/)

~~~
lightcatcher
My personal favorite feature of Theano is the automatic compilation to CUDA
code (which would get about 8-15x speed-up over the optimized C code for the
deep learning research I was doing).

------
tel
See also the ad package [1] for Haskell, which has a number of interesting
features in this vein.

[1]
[http://hackage.haskell.org/package/ad](http://hackage.haskell.org/package/ad)

------
fdej
This seems to be essentially the same thing as "power series arithmetic"
(first-order "dual arithmetic" is equivalent to arithmetic in the ring of
formal power series modulo x^2, but you can make that x^n).

Encoding power series as matrices is sometimes convenient for theoretical
analysis (or, as here, educational purposes), but it's not very efficient. The
space and time complexities with matrices are O(n^2) and O(n^3), versus O(n)
and O(n^2) (or even O(n log n) using FFT) using the straightforward polynomial
representation (in which working with hundreds of thousands of derivatives is
feasible). In fact some of my current research focuses on doing this
efficiently with huge-precision numbers, and with transcendental functions
involved.
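
As a rough illustration of the O(n log n) variant (my sketch using NumPy, not
fdej's code, and it trades exactness for floating-point rounding): zero-pad
the coefficient vectors, multiply pointwise in the Fourier domain, and
truncate back to n terms.

    
    
        import numpy as np
    
        # Truncated power series product via FFT: O(n log n) instead of O(n^2).
        def mul_fft(f, g):
            n = len(f)
            m = 2 * n                  # room for the full product
            F = np.fft.rfft(f, m)
            G = np.fft.rfft(g, m)
            return np.fft.irfft(F * G, m)[:n]
    
        print(mul_fft([1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]))
        # ~ [1, 2, 1, 0], up to floating-point rounding
    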

------
BoppreH
I couldn't believe it would work, so I made a toy implementation in Python
using simple operator overloading:
[https://github.com/boppreh/derivative](https://github.com/boppreh/derivative)

All values tried so far agree with Wolfram Alpha, so color me surprised and
happy for learning something new.

------
backprojection
I think it's worth noting that the problem with numerical differentiation,
fundamentally, is that differentiation is an unbounded operator. In finite
differences (the more obvious approach), you assume that your data are samples
of some general function.

The problem, then, is that general functions have no (essential) band limit
[1]. Remember that differentiation acts as multiplication by a monomial in the
frequency domain [2]. Non-constant polynomials always eventually blow up away
from 0, so in differentiating, you're multiplying a function by something that
blows up in the frequency domain. This means that, in the result, higher
frequencies are going to dominate over lower frequencies, at a polynomial
rate.

Let me be clear: the problem with numerical differentiation is not just that
rounding errors accumulate; it's that differentiation is fundamentally
unstable, and not something you want to apply to real-world data.

It depends very much on what your application is. However, I think generally a
better approach to AD is to redefine your differentiation by composing it with
a low-pass filter. If designed properly, your low-pass filter will 'go to
zero' faster (in the frequency domain) than any polynomial, thus making this
new operator bounded, and hence numerically more stable. It's not a panacea,
but it begins to address the fundamental problem.

One example of such a filter is Gamma(n+1, n x^2)/Factorial[n], where Gamma is
the upper incomplete gamma function [3].

In Python:

    
    
        scipy.special.gammaincc(n+1, n*x**2)
        # or:
        mpmath.gammainc(n+1, n*x**2, regularized=True)
    

To see why this is a nice choice, notice item 2 in [4]: Q(n+1, t) is exactly
exp(-t) (a Gaussian, with t = n x^2) multiplied by the first n+1 terms of the
Taylor series of exp(+t) (the reciprocal of that Gaussian). Since this series
converges unconditionally everywhere, the filter converges to 1 at any fixed x
as n -> +infinity; however, since it's still a Gaussian times a polynomial, it
still converges to 0 as you increase x with n fixed.
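
That decomposition is easy to check numerically. A quick sketch, using the
identity from item 2 in [4], Q(n+1, t) = exp(-t) * sum_{k=0..n} t^k/k!:

    
    
        import math
        from scipy.special import gammaincc  # regularized upper incomplete gamma
    
        n, x = 5, 0.8
        t = n * x ** 2
        lhs = gammaincc(n + 1, t)
        rhs = math.exp(-t) * sum(t ** k / math.factorial(k) for k in range(n + 1))
        print(lhs, rhs)                      # agree to machine precision
    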

This is my area of research, so if anyone's interested I can give more
details.

[1] [https://en.wikipedia.org/wiki/Band-limit](https://en.wikipedia.org/wiki/Band-limit)
[2] [https://en.wikipedia.org/wiki/Fourier_transform#Analysis_of_differential_equations](https://en.wikipedia.org/wiki/Fourier_transform#Analysis_of_differential_equations)
[3] [https://en.wikipedia.org/wiki/Incomplete_gamma_function](https://en.wikipedia.org/wiki/Incomplete_gamma_function)
[4] [https://en.wikipedia.org/wiki/Incomplete_gamma_function#Special_values](https://en.wikipedia.org/wiki/Incomplete_gamma_function#Special_values)

~~~
jules
Read the article: it's not about computing derivatives of real-world data
(using finite differences or whatever), it's about exact derivatives of
rational functions specified by a computer program. While what you wrote is
interesting, none of it applies to this article.

~~~
acjohnson55
^ Totally this. It's not meant for differentiating a function known only by
its points. It's meant for evaluating nth-order derivatives of functions you
can _already evaluate at arbitrary values_. The functions don't even have to
be rational. As long as the function can be calculated by a computer by some
method, so can its derivatives.

This comes in handy in all sorts of situations where you've designed some
fancy kernel function for a process that needs both the value of the function
and the values of some of its derivatives, backpropagation for example [1].

As I was told by someone in the field, at one point people used to generate
machine learning publications simply by finding fancy mathematical tricks for
deriving closed-form derivatives of chosen functions so that they would be
usable in learning algorithms. But in many cases, this work is unnecessary if
you use automatic differentiation.

It's a really cool concept, applicable in specific situations. If you need to
know the derivative of a function that's not fully specified, you need
numerical differentiation [2]. If you need a closed-form expression for your
derivative function, that's when you need symbolic differentiation [3].

[1]
[http://en.wikipedia.org/wiki/Backpropagation](http://en.wikipedia.org/wiki/Backpropagation)
[2]
[http://en.wikipedia.org/wiki/Numerical_differentiation](http://en.wikipedia.org/wiki/Numerical_differentiation)
[3]
[http://en.wikipedia.org/wiki/Symbolic_differentiation](http://en.wikipedia.org/wiki/Symbolic_differentiation)

~~~
backprojection
> It's not meant for differentiating a function known only by its points. It's
> meant for evaluating nth-order derivatives of functions you can already
> evaluate at arbitrary values

This is correct. So my comment "I think generally a better approach to AD is
to redefine your differentiation" doesn't really apply to AD. Essentially, AD
is about differentiating functions, not data.

------
fusiongyro
Very neat article. This is essentially calculus with infinitesimals (also
called "nonstandard analysis") implemented on the machine. If you like the
approach, a more general and rigorous investigation can be had by reading H.
Jerome Keisler's book _Elementary Calculus_ , which is freely available online
in the 2nd edition here:

    
    
      http://www.math.wisc.edu/~keisler/calc.html
    

The third edition is now in print. I've been studying calculus with it off and
on for a while and I find the approach very intuitive. Though Spivak's
_Calculus_ is probably a better book, its "standard analysis" is a little less
intuitive (and now, evidently, harder to teach a machine).

------
mrcactu5
Congratulations, you have implemented the Zariski tangent space using
nilpotent matrices. Welcome to the beautiful theory of algebraic geometry and
schemes.

[http://math.stanford.edu/~vakil/0708-216/216class21.pdf](http://math.stanford.edu/~vakil/0708-216/216class21.pdf)

This really does fall in the realm of algebraic geometry, since this method
only works for rational functions - as he implemented it.

To numerically compute sin(x + ε) you need the Taylor series.
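
Concretely, only the first two Taylor terms survive for first-order duals,
since ε^2 = 0: sin(a + bε) = sin(a) + cos(a)bε. A hypothetical two-line helper
in Python:

    
    
        import math
    
        # sin on a dual number (a, b) ~ a + b*eps: the derivative slot
        # picks up cos(a) by the chain rule.
        def dual_sin(a, b):
            return (math.sin(a), math.cos(a) * b)
    
        print(dual_sin(1.0, 1.0))  # (sin 1, cos 1): value and derivative at 1
    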

------
gpsarakis
Nice analysis. Hope you don't mind me adding that by omitting terms of the
Taylor series you do have some loss of precision, however small. Also, solving
linear systems of equations may even introduce instability, as the following
property must be preserved:
[http://en.wikipedia.org/wiki/Diagonally_dominant_matrix](http://en.wikipedia.org/wiki/Diagonally_dominant_matrix)

~~~
crntaylor
The point is that when you're considering the Taylor series for a _dual
number_ argument, you don't lose any precision, because higher powers of the
"imaginary" part of the dual number vanish. The example he gives is

    
    
        f(a + be) = f(a) + f'(a)be + 0.5 * f''(a) b^2 e^2 + O(e^3)
                  = f(a) + f'(a)be
    

because e^n = 0 for all n > 1. This isn't an approximation - it's an exact
relationship for dual numbers!

You _will_ lose some precision by using floating point numbers instead of an
arbitrary-precision real number type, but this is a limitation of the machine
you're working on. The method is exact.

------
Pitarou
How does this technique compare to a computer implementation of the kinds of
techniques we learnt in high school? Is it easier to implement? More
efficient? Are there some situations where it isn't appropriate?

------
Bill_Dimm
There seems to be a typo at the beginning of the "Implementing dual numbers"
section. It says:

    
    
      The number a+bd can be encoded as...
    

Should be:

    
    
      The number a+b*epsilon can be encoded as...

~~~
jliszka
Thanks! fixing...

------
tomrod
This read nicely until I got to the code block. Does anyone else see it
rendered in yellow and gray syntax highlighting, such that it's virtually
impossible to read?

------
svantana
Is it just me, or is this article pretty naive? The headline's use of the word
"exact" would imply integer arithmetic only, but the computations are done
with floating point. So basically (s)he is trading one rounding error for
another, which seems to be small-ish in some particular cases. What about
discontinuities? And why forward derivatives only? I hope no one will use this
for any application that actually relies on exact derivatives.

~~~
Leszek
They're exact in the sense that they give the same value as if you had
calculated the value of the analytic derivative. This is different from
numerical differentiation, which approximates the derivative with finite
differences.
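
A quick sketch of the contrast (my example, not from the article): a forward
difference carries an O(h) truncation error that dual arithmetic simply
doesn't have.

    
    
        def f(x):
            return x ** 3
    
        a, h = 2.0, 1e-6
        fd = (f(a + h) - f(a)) / h  # ~ 12.000006: truncation error O(h)
        exact = 3 * a ** 2          # 12.0, what dual arithmetic returns
        print(fd, exact)
    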

------
Bahamut
As a former mathematician, the second sentence is an abomination:

"...but to give you an overview, the idea is that you introduce an algebraic
symbol ϵ such that ϵ≠0 but ϵ^2=0"

