First I want to say this article is actually a great intro to AD and shows off a taste of how well suited Julia is to writing this kind of thing. Mike Innes, linked in the article, has a great tutorial on AD that made me realize how awesome Julia is as well, almost too smart :)
Now, AD avoids truncation error by hard-coding the basic rules of differentiation and applying these exact rules to the input. Hence the article setting up the rules for +, -, *, / rather than defining numerical differentiation as the usual limit. There is no magic, AD works because it doesn't use an approximation.
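For anyone who hasn't seen it, the whole trick fits in a few lines. A toy sketch of my own (not the article's exact code), just duals carrying a value and a derivative through the hard-coded rules:

```julia
# Toy forward-mode AD: each Dual carries a value and its derivative,
# and the exact differentiation rules are hard-coded per operation.
struct Dual
    val::Float64   # primal value
    der::Float64   # derivative w.r.t. the input
end

Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:-(a::Dual, b::Dual) = Dual(a.val - b.val, a.der - b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.:/(a::Dual, b::Dual) = Dual(a.val / b.val, (a.der * b.val - a.val * b.der) / b.val^2)

# d/dx (x*x + x) at x = 3 is 2x + 1 = 7, exactly: no limits, no approximation.
x = Dual(3.0, 1.0)        # seed derivative dx/dx = 1
@show (x * x + x).der     # 7.0
```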
So if instead you use an approximation, like a Taylor series, for a function, and differentiate that, you don't get the derivative of the function, you get the derivative of the truncated series you wrote. The same would be true for any function. This does not feel surprising.
So I can only assume that the article is really intended to give a roundabout explanation of how AD works, rather than to uncover some revelation; the "revelation" is effectively a tautology, as the article itself points out.
So overall, valuable, but also a strange way of framing it IMO.
An obvious example here is when you expand e^x as its Taylor series using `n` terms. The derivative of that `n`-term expansion has the same form but is only `n-1` terms long, so you lose precision.
More generally, if you approximate a smooth function f with a truncated Taylor series of degree n around c, the error behaves as O((x−c)^(n+1)), and the error of the k-th derivative of that approximation, auto-diffed, will be of order O((x−c)^(n+1−k)).
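Concretely, a quick made-up check of the e^x example (assuming ForwardDiff.jl is available):

```julia
using ForwardDiff  # assuming ForwardDiff.jl; quick check of the e^x example above

# Taylor series of exp around c = 0, truncated at degree n.
taylor_exp(x, n) = sum(x^k / factorial(k) for k in 0:n)

x, n = 0.5, 5
@show abs(exp(x) - taylor_exp(x, n))                                  # ~ x^(n+1)/(n+1)!
@show abs(exp(x) - ForwardDiff.derivative(t -> taylor_exp(t, n), x))  # ~ x^n/n!, one power worse
```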
That's true, but this sort of problem would normally not happen with a standard AD library or subsystem, whatever language you're using. Such systems don't approximate functions and then differentiate the approximations.
Well, all functions implemented in finite-precision arithmetic deal with approximations. Errors of 1/2 ulp, like for the trig functions, are rather the exception. And machine learning is keen on using low-precision floating-point types, so errors are relatively large.
I think the example is supposed to be indicative of larger functions that users define in the system. That is, showing it with sin/cos was simply an example. Right?
Every real AD system has a primitive for sin and cos,
but they do not have every single operation you might ever implement (else what would be the point),
and it is unlikely they will have every operation that involves an approximation.
(Though there is a good chance they might have every named scalar operation involving an approximation that you do; depends on how fancy you are.)
I mean, I maintain a library of hundreds of custom primitives.
(One of the uses of this post was so I can point to it in response to people saying "why do you need all those primitives if AD can just decompose everything into + and *?")
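For reference, registering a custom primitive in the ChainRulesCore.jl style looks roughly like this. A sketch only, with a made-up `my_sin`, not code from my actual library:

```julia
using ChainRulesCore  # assuming ChainRulesCore.jl; `my_sin` is a made-up example

# A lossy approximation a user might have written themselves:
my_sin(x) = x - x^3/6 + x^5/120

# Registering rules tells AD to treat my_sin as a primitive whose derivative is cos(x)
# (the derivative of the thing it approximates), rather than differentiating the
# truncated series term by term.
function ChainRulesCore.frule((_, ẋ), ::typeof(my_sin), x)   # forward mode
    return my_sin(x), cos(x) * ẋ
end

function ChainRulesCore.rrule(::typeof(my_sin), x)           # reverse mode
    my_sin_pullback(ȳ) = (NoTangent(), cos(x) * ȳ)
    return my_sin(x), my_sin_pullback
end
```

AD systems that consume ChainRules definitions (Zygote etc.) then use the rule instead of decomposing `my_sin` into + and *.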
If the example had used sin/cos, the autodiff implementations in Julia would have called the native cos/-sin and not incurred such a "truncation error". However, the post illustrates the idea in a good way.
You could use a lazy representation of the Taylor series that will give you as many terms as you ask for and symbolically differentiate that. Then you'll get an accurate automatic differentiation. When you go to evaluate your approximation you'll get errors at that point, but you'll correctly get that the second derivative of sin(x) is exactly -sin(x).
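A rough sketch of the idea (no particular package; a series here is just a function from term index to coefficient, so it is lazy and exact):

```julia
# Lazy Taylor series around 0: a series is a function k -> a_k, so you can ask
# for as many terms as you like. Rational coefficients keep everything exact.
sin_coeff(k) = isodd(k) ? (-1)^((k - 1) ÷ 2) // factorial(big(k)) : 0 // big(1)

# Exact term-by-term differentiation: d/dx Σ aₖ xᵏ has coefficients (k+1)·aₖ₊₁.
differentiate(a) = k -> (k + 1) * a(k + 1)

# The second derivative of sin comes out as exactly -sin, coefficient by coefficient.
d2 = differentiate(differentiate(sin_coeff))
@show all(d2(k) == -sin_coeff(k) for k in 0:9)   # true, exactly
```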
Correct, which is mentioned at the end of the post.
I am not sure how well it generalizes, e.g. to multivariate functions that are defined as solutions of something like a differential equation.
You could do this, but I'm pretty sure that real AD just deals with all the common primitives directly. The derivative of sin(x) is cos(x), etc. AD just postpones the chain-rule calculations that would be involved in something like sin(x + x^2)^2, so the chain rule happens at run time instead of as a symbolic expansion at compile time.
I have used automatic differentiation extensively to "optimize" simulations. This problem is known under many names, such as parameter estimation, local sensitivity analysis, optimal control, ...
Having access to something that resembles the derivative of your code allows you to use optimization algorithms that converge faster (Newton's method or gradient descent vs. bisection methods), and you don't have to specify the derivatives by hand (which gets really bothersome when you have to specify a Hessian matrix (second-derivative information, including all the mixed terms)).
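A toy sketch of the shape of it (a made-up decay "simulation", assuming ForwardDiff.jl; the real thing is of course much bigger):

```julia
using ForwardDiff  # assuming ForwardDiff.jl; the decay "simulation" below is a made-up toy

simulate(p, ts) = [p[1] * exp(-p[2] * t) for t in ts]       # toy simulation: exponential decay
loss(p, ts, data) = sum(abs2, simulate(p, ts) .- data)      # squared error against observations

function fit(p, ts, data; steps = 500, η = 0.02)
    for _ in 1:steps
        g = ForwardDiff.gradient(q -> loss(q, ts, data), p) # derivative of the whole simulation
        p = p - η * g                                       # plain gradient descent step
    end
    return p
end

ts   = 0.0:0.5:3.0
data = simulate([2.0, 0.7], ts)       # pretend measurements with "true" parameters [2.0, 0.7]
@show fit([1.0, 1.0], ts, data)       # should head towards [2.0, 0.7]
```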
Would be cool to see what your code looked like. I have the same question as the start of the thread.
That is, I can see how the derivative can help optimise a simulation, but I'm not clear on how using an auto derivative really helps.
Edit: note that I also see a difference between being able to get the derivative of a function, versus writing code in a style where any function can be turned into its derivative.
> That is, I can see how the derivative can help optimise a simulation, but I'm not clear on how using an auto derivative really helps.
Finding a derivative by hand gets tiring fast, and is error-prone.
It used to be that a neural net paper doing anything novel spent 1-2 pages deriving the derivative of its cool new thing.
Now its derivative isn't even mentioned.
But does that require auto derivative stuff? I'm assuming most could have simply mentioned the main equation, said "use Mathematica" and proceeded with the derivative?
The main equation may not really be anywhere near so simple as that.
E.g. for the kind of stuff I do for work, the main equation is something like: output a vector which is the instructions for every power plant in Texas for how much power to produce, such that every person and business that needs power has power, subject to constraints on the capacity of each power line.
Differentiate the total emissions of this process with respect to demand.
It is an equation, yes.
But it's actually a multi-hundred-line program that includes thousands more lines of constants and a step that runs an internal linear optimization solver inside the problem.
Symbolic differentiation, as in Mathematica, tends to slow down and generate massive amounts of code once functions get complicated.
I am also not sure that it can handle dynamic-length loops.
Source-to-source AD systems, like Zygote (Julia), Tapenade (Fortran), JAX (Python), and Enzyme (LLVM), have a lot of what you might want, though.
Thanks to both answers here! I think this is finally clicking. Probably many more ticks before I get it, but such is progress.
Would love to see a blog or other post on solving this kind of problem. One with an inner LP would be amazing to see. I'm pretty sure some of the stuff I've looked at in the past on supply chain solutions could find relevance here.
I don't think Mathematica supports AD. It supports symbolic differentiation, which is similar, but different. Symbolic differentiation, given a function, determines the derivative for the whole function. AD focuses on running a function while carrying along enough information to find the derivative at one value. That evaluation may involve tons of branching along the way. You could have a program involving loops/conditionals that symbolic differentiation would have a painful time with and would be unlikely to pull off, while AD works just fine. AD does not try to determine what the derivative looks like everywhere, but really traces the function with some extra stuff (what stuff depends on forward/backward mode) to allow it to know what the derivative is.
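As a tiny illustration of the loop point (assuming ForwardDiff.jl; `compound` is a made-up example): symbolic differentiation would want a closed form, AD just differentiates whatever path actually executes.

```julia
using ForwardDiff  # assuming ForwardDiff.jl; `compound` is a made-up example

# The number of loop iterations depends on the input, so there is no single
# closed-form expression to differentiate symbolically. AD traces the path taken.
function compound(rate)
    balance = 1.0
    while balance < 2.0
        balance *= 1 + rate
    end
    return balance
end

@show ForwardDiff.derivative(compound, 0.05)   # derivative of the branch/loop path actually run
```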
An example of a weird use case outside ML is that AD allows you to differentiate through a raytracer or other complex program. Maybe your raytracer has a couple of parameters for lighting and you have a target image you want to get as close to as possible. You could use AD to optimize the lighting parameters. That's one problem that Mathematica will be very unlikely to be able to do. For a large application, if you want to AD the entire thing, either you are using a framework that supports AD everywhere, like tensorflow/pytorch, or you need language-level support, like in Julia. Pretty few languages have AD at the language level.
To try and rephrase: the techniques used are similar in both, in that it is all calculus. However, with AD, the focus is on reducing all calculations down to what was actually performed, in expressions that carry duals, and getting the derivative of that, as calculated, to use in making a choice.
Doing this symbolically would require the entire function space to be solved for the problem, in very difficult ways that are not easy to avoid.
That is, with AD, you don't get the total derivative of a problem, per se. Instead, you get the problem reduced to a method that can be evaluated at a tuple, where you also get the derivative at that tuple?
Extrapolating from this, even if a problem has piecewise spots where it is not differentiable, AD mostly works around this by focusing on the pieces?
I think the part I had trouble with was seeing why AD would give better results. I still don't fully grok it, but the idea that AD works not necessarily on the full equation, but on the equation as evaluated at a spot, is finally starting to click for me.
I don't get how it helps with some discontinuity problems, but I think I can see how those aren't as important in many contexts.
The most frequent application is machine learning. If my piece of code is implementing some function that I want to minimize, then taking the derivative (w.r.t. some vector of parameters) tells me in which direction I need to change the parameters in order to reduce the function: "gradient descent".
We've got a model, implemented in code. Since it's code, it can be differentiated - not sure how that works with branches, I guess that's the math. :) The model is generated from a set of input parameters.
We've got an error function, representing the difference between the model and reality.
If we differentiate the error function, we can choose which set of parameter changes is heading in the right direction, and then generate a new model? We check each nearby point and find the max benefit?
However, if everything is taking the parameters as input, is the derivative of the error function only generated once?
Is it saying that the derivative of the error function is independent of the parameters, so it doesn't matter what the model is, they all have the same error function, and that error function can be found by generating a single model?
Dealing with branches is indeed an interesting problem. Many AD systems can't accommodate branches. I think most of the Julia ones do.
The gist of it is that branches don't actually require that much fanciness, but they can introduce discontinuities, and AD systems will often do things like happily give you a finite derivative right at a discontinuity, where a calculus student would normally complain.
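E.g. something like this (assuming ForwardDiff.jl; the branchy functions are toy examples):

```julia
using ForwardDiff  # assuming ForwardDiff.jl; toy examples of branchy functions

heaviside(x) = x < 0 ? zero(x) : one(x)   # jump discontinuity at 0
relu(x)      = x < 0 ? zero(x) : x        # kink at 0, no true derivative there

@show ForwardDiff.derivative(heaviside, 0.0)   # 0.0: a finite "derivative" right at the jump
@show ForwardDiff.derivative(relu, 0.0)        # 1.0: whichever branch ran at exactly 0 wins
```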
Backward propagation computes the gradient of the loss function with respect to the weights. A gradient is a vector or tensor of derivatives. You need to differentiate the loss function somehow to evaluate this vector.
That's not strictly speaking true. You also calculate the gradient with respect to the input; otherwise, you wouldn't be able to backpropagate over more than one layer. It's just useful because you can do both easily, whereas with forward mode, getting derivatives with respect to the weights is considerably more expensive.
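Roughly (a toy one-layer example, assuming Zygote.jl):

```julia
using Zygote  # assuming Zygote.jl; toy one-layer example

W, x = randn(3, 3), randn(3)
layer_loss(W, x) = sum(tanh.(W * x))

# Reverse mode gives both in one pass: ∇W updates the weights,
# ∇x is what gets backpropagated to the layer below.
∇W, ∇x = Zygote.gradient(layer_loss, W, x)
```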
To expand, it tells you how you can change certain variables of a function to change its output.
In machine learning, you create a loss function that takes in model parameters and returns how accurate the model is. Then, you can optimize the function inputs for accuracy.
Automatic differentiation makes it much easier to code different types of models.
You want to take the derivative of a piece of code so that you don't have to either hand-write the derivative yourself, or rely on finite differencing, which can be slower in many circumstances and is almost universally less accurate.
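A quick illustration of the accuracy gap (assuming ForwardDiff.jl; the function is made up):

```julia
using ForwardDiff  # assuming ForwardDiff.jl; made-up comparison

f(x) = sin(x)^2
exact(x) = 2 * sin(x) * cos(x)
finite_diff(f, x; h = 1e-8) = (f(x + h) - f(x)) / h   # forward difference

x = 0.7
@show abs(finite_diff(f, x) - exact(x))              # ~1e-8: truncation vs round-off tradeoff
@show abs(ForwardDiff.derivative(f, x) - exact(x))   # ~1e-16: only ordinary floating-point error
```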
Anyone know why, of the 7 ADs tried (8 including the one implemented at the start), there are different answers?
0.4999999963909431 vs 0.4999999963909432 vs 0.4999999963909433
I assume it is some kind of IEEE math thing,
given that IEEE allows `(a+b)+c != a + (b + c)`,
but where is it occurring exactly?
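For reference, the kind of reordering effect I mean (this is only my guess as to where it comes from):

```julia
# The classic non-associativity example; a change in summation order is
# enough to move the last digit or two, which is the same scale as the spread above.
a, b, c = 0.1, 0.2, 0.3
@show (a + b) + c                  # 0.6000000000000001
@show a + (b + c)                  # 0.6
@show (a + b) + c == a + (b + c)   # false
```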
Occasionally, there are a few places where you want arbitrarily high derivatives.
But in general there are specialized solutions for those (though very few things implement them; e.g. JAX's jets and TaylorSeries.jl in Julia implement Taylor-mode AD, but those also need custom primitives).
More of a problem is first derivatives for which you need very high accuracy.
Which doesn't occur in ML but does occur sometimes in scientific computing.
(Doesn't occur in ML cos we just basically shake the thing about a bit anyway)
Article: "Bonus: will symbolic differentiation save me? Probably, but probably not in an interesting way."
This article seems to misunderstand AD.
Automatic differentiation doesn't incur more truncation error than symbolically differentiating the function and then calculating the symbolic derivative's value. Automatic differentiation is basically following the steps you'd follow for symbolic differentiation, but substituting a value for the symbolic expansion and so avoiding the explosion of symbols that symbolic differentiation involves. But it's literally the same sequence of calculations. The one way symbolic differentiation might help is if you symbolically differentiated and then rearranged terms to avoid truncation error, but that's a bit different.
The article seems to calculate sin(x) in a lossy fashion and then attribute the error to AD. That's not how it works.
> Automatic Differentiation doesn't incur more truncation error than symbolically differentiating the function and then calculating the symbolic derivative's value.
Yes. The article even says that.
_"The AD system is (as you might have surmised) not incurring truncation errors. It is giving us exactly what we asked for, which is the derivative of my_sin. my_sin is a polynomial. The derivative of the polynomial is:
[article lists the hand-derived derivative]"_
The reason symbolic might help is that symbolic AD is often used in languages that don't represent things with numbers, but with lazy expressions.
I will clarify that.
(Edit: I have updated that bit and I think it is clearer. Thanks.)
> The article seems to calculate sin(x) in a lossy fashion and then attribute to the error to AD. That's not how it works.
The important bit is that the accuracy lost from an accurate derivative of a lossy approximation is greater than the accuracy lost from a lossy approximation of an accurate derivative.
Is there a bit I should clarify more about that?
I tried to emphasize that at the end, before the bit mentioning symbolic.
I would suggest saying front and center, more prominently, that AD incurs the truncation errors that the equivalent symbolic derivative would incur. Since you're talking about AD, you should be giving people a good picture of what it is (since it's not that commonly understood). 'Cause you're kind of suggesting otherwise even if somewhere you're eventually saying this.
I mean, Griewank saying "Algorithmic differentiation does not incur truncation error" does deserve the caveat "unless the underlying system has truncation errors, which it often does". But you can give that caveat without bending the stick the other way.
> The important bit is that the accuracy lost from an accurate derivative of a lossy approximation is greater than the accuracy lost from a lossy approximation of an accurate derivative.
I understand you get inaccuracy from approximating sin, but I don't know what you're contrasting this with. I think real AD libraries deal with primitives and with combinations of primitives, so for such a library the derivative of sin(x) would be "symbolically" calculated as cos(x), since sin is a primitive (essentially, all the "standard functions" have to be primitives, and everything else is built from those; I doubt any library would apply AD to its approximation of a given function).
I haven't used such libraries but I am in the process of writing an AD subsystem for my own little language.