Implement your own source transformation AD with Julia (66 points by metalwhale on June 11, 2020, 27 comments)

 We have so many AD implementations in Julia now that we actually have infrastructure that separates the definition of primitive derivative rules from the AD mechanism itself, so the rules can be shared amongst all of them. Of course it would be better if there were just one AD to rule them all, but there are tradeoffs in the design space that make that hard. I think having all these different implementations has actually helped crystallize what the design space actually is, what choices need to be made, and what the interesting classes of applications are. I think that'll help with the next generation of these tools (disclaimer: I just started working on one of those next generation tools yesterday ;) ).
 Thanks for this. Glad to see the Julia community is going strong. At Zebrium we use Julia at the core of our log structuring engine, and talk to it over gRPC. Keep it up!
 Differentiating through control flow has never made sense to me. What does it mean to differentiate the following function: "f(x) = x > 0 ? x : -x"? If you plot this function you get a sharp corner at 0, which means it's not differentiable there: the limit from the left is -1 and the limit from the right is 1. Since 1 ≠ -1, the derivative does not exist at 0. So how are AD libraries claiming to differentiate such functions? Is there an implicit assumption that the user knows the derivative does not make sense at 0?

Edit: I just tried this and it gives the wrong answer without any hint that it's incorrect:

    julia> f(x) = x > 0 ? x : -x
    f (generic function with 1 method)

    julia> f(0), f'(0)
    (0, -1)

    julia> f'(1), f'(-1)
    (1, -1)
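You can see where the -1 comes from with a minimal dual-number sketch (a toy forward-mode illustration of my own, not Zygote's actual implementation, which is reverse-mode source transformation): AD only ever differentiates the branch the primal value actually takes, so at 0 the `x > 0` test is false and only `-x` gets differentiated.

```julia
# Toy forward-mode AD via dual numbers (illustrative sketch only).
struct Dual
    val::Float64  # primal value
    der::Float64  # tangent (derivative)
end

Base.:-(a::Dual) = Dual(-a.val, -a.der)
Base.:>(a::Dual, b::Real) = a.val > b  # control flow sees only the primal value

f(x) = x > 0 ? x : -x

derivative(g, x) = g(Dual(x, 1.0)).der  # seed the tangent with 1

derivative(f, 1.0)   # 1.0
derivative(f, -1.0)  # -1.0
derivative(f, 0.0)   # -1.0: `0 > 0` is false, so only the -x branch is differentiated
```

The point is that the branch condition is evaluated on the primal value alone, so the kink is invisible to the machinery: it returns the one-sided derivative of whichever branch ran.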
 Yes, you can just consider it a piecewise differentiable function. That's not usually a problem in practice if you think of the derivatives more as heuristics in a search problem than absolute truth. Obviously how well your optimization works does depend on the differentiability attributes of your underlying function and how well your optimization scheme can handle it.
 If subgradients are enough (-1 is a correct subgradient at 0 in your example), then there are valid approaches for computing subgradients with AD; see https://arxiv.org/abs/1809.08530
 Thanks for the link.
 if you want to be rigorous about it, you can often talk in terms of elements of the set of subgradients rather than gradients, and the convergence proofs for many of the popular optimization algorithms still go through.
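As a concrete sketch of that convergence behavior (my own toy example, not taken from any particular proof): the subgradient method on abs(x) with diminishing stepsizes 1/k drives the iterate to 0, which is the standard result for convex nonsmooth functions under diminishing, non-summable stepsizes.

```julia
# Subgradient method on abs(x) with diminishing steps L_k = 1/k (toy sketch).
subgrad(x) = x >= 0 ? 1.0 : -1.0  # a valid subgradient of abs at every point

function subgradient_method(x; iters = 10_000)
    for k in 1:iters
        x -= (1 / k) * subgrad(x)
    end
    return x
end

abs(subgradient_method(1.0))  # small: iterates hop across 0 with shrinking amplitude
</imports>
```

Once the iterate is within 1/k of 0, every later iterate stays within the (shrinking) current stepsize, so the error decays even though no gradient exists at the optimum.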
 Any references for sub-gradients and convergence proofs I can take a look at?
 Depending on the level you're looking for, I'm always a huge fan of Rockafellar [0] for the more mathematical expositions (they're lovely, clear, and very well thought through).

Even in this case, though, subgradients may not exist (they're only guaranteed to exist for convex functions, and any nonconvex function has, almost by definition, at least one point for which there exists no subgradient), in which case one talks about subderivatives instead. These always exist whenever a function is semicontinuous (such as the "if" statement example given in the GGP comment), even if it is not convex [1].

All of this is to say, the subject of general variational analysis is mathematically nice, but computationally extraordinarily difficult.

The first proofs of even local convergence of GD to local optima (not stationary points!) with high probability, even in the case of smooth, differentiable functions, only recently emerged, and the results are rather weak in the sense of limits [2]. Query-hardness has also been shown for many examples (even with unbounded computational time) in [3], even when the functions are smooth.

The non-smooth case becomes even harder: you have exponential-size query complexity in the number of dimensions even when you restrict the function so that it cannot "vary too much" locally (i.e., if it is required to be L-Lipschitz) [4].

-----

[0] For example, Variational Analysis covers a good amount of this stuff: https://sites.math.washington.edu/~rtr/papers/rtr169-VarAnal...

[1] Cf. chapter 8 in [0].

[3] https://arxiv.org/abs/1912.02365 in the stochastic case, and https://arxiv.org/abs/1710.11606 along with https://arxiv.org/abs/1711.00841 in the nonstochastic case.

[4] See, e.g., 1.1.3 in Nesterov's Introductory Lectures on Convex Optimization (can easily be found online).
 wow, the rockafellar book is fantastic. i can't believe i'd never run into it before. thanks!
 memexy on June 11, 2020: Thanks.
 The kinks correspond to a set of measure zero, which you will likely never hit during execution, so one can safely ignore the problem as not physically relevant. One way to think of it is that the cost function we're differentiating is approximate/fake, and is whatever it needs to be (at some special neighborhoods) to give us derivatives we consider sensible (in large regions).

After all, there's nothing so special about the ReLU... It would be very, very weird/unstable if our algorithms worked for ReLU but not for a kink-smoothed version of ReLU.
 Hmm... I'm not sure I agree.

All optimal points (for, say, optimizing a linear function) will lie on the extremal points of the feasible domain, many of which will be points where the constraint functions are not differentiable. In all cases you can turn nonlinear objective function optimization (say over f) into linear objective function optimization by adding a constraint f(x) ≤ t and moving t to the objective.

Now, I will agree that smooth optimization algorithms will work ok, but try optimizing abs(x) with GD; you'll find that the best possible error you can achieve (other than by sheer luck) will be ~O(L), where L is your stepsize.
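A quick sketch of that stall (my own illustration, not from any library): with a fixed step L, GD on abs(x) ends up hopping back and forth across 0, so the final error stays on the order of L no matter how many iterations you run.

```julia
# Fixed-step gradient descent on abs(x): once near 0, each step overshoots
# by up to L, so the iterate never settles closer than ~L (illustrative sketch).
g(x) = x >= 0 ? 1.0 : -1.0  # (sub)gradient of abs

function fixed_step_gd(x, L; iters = 1_000)
    for _ in 1:iters
        x -= L * g(x)
    end
    return x
end

abs(fixed_step_gd(1.03, 0.1))  # oscillates between ~0.03 and ~0.07, i.e. ~O(L)
```

Starting at 1.03 with L = 0.1, the iterate walks down to 0.03, then bounces between 0.03 and -0.07 forever; shrinking L shrinks the error, but for any fixed L it never vanishes.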
 > All optimal points

Sorry, I screwed up a bit when writing that: an optimal point exists which lies on the extremal points of the feasible domain. In many cases (when the objective is linear), it can be shown that an optimal point lies on the extremal points of the feasible domain.
 ssivark on June 12, 2020: Yes, but we're going to be restricted to O(L) final accuracy no matter what, for gradient descent (we could choose second-order optimizers, etc., but that's an orthogonal point; we're happy to get within an epsilon ball of the answer).
 > Yes, but we're going to be restricted to O(L) final accuracy no matter what

This is not, in general, true for smooth functions so long as L is small enough: you can reach arbitrary accuracy with GD if L is smaller than ~the reciprocal of the Lipschitz constant of the objective's gradient, and that bound need not be arbitrarily small.
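For contrast with the abs(x) case, a small sketch (my own example): on the smooth objective x^2 the gradient 2x is Lipschitz with constant 2, and any fixed stepsize below 1/2 contracts the iterate geometrically, so the error really does go to 0 rather than stalling at ~O(L).

```julia
# Fixed-step GD on the smooth objective x^2 (gradient 2x, Lipschitz constant 2):
# with stepsize 0.4, each step multiplies x by |1 - 2 * 0.4| = 0.2, so the
# error shrinks geometrically instead of stalling (illustrative sketch).
grad(x) = 2x

function gd_smooth(x, L; iters = 100)
    for _ in 1:iters
        x -= L * grad(x)
    end
    return x
end

abs(gd_smooth(1.0, 0.4))  # ≈ 0.2^100, effectively 0
```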
 memexy on June 11, 2020: My question wasn't about the theoretical aspects of measurability, since any countable set of points will have measure 0, but about all AD libraries sweeping this kind of issue under the rug. Where in the Zygote docs is it mentioned that the absolute value function will give the wrong answer when differentiated at 0?
 In the same way that the ReLU derivative is not defined at x=0. Most of the time, in practice, this all doesn't really matter and you can still get gradient descent to work in a useful way.
 the gradient doesn't exist but subgradients do exist (at points of non-differentiability) and that is still useful (case in point as someone else mentions ReLU)
 Thanks. That's a nice tutorial.
 Some more related info on different algorithmic differentiation approaches in Julia: https://github.com/MikeInnes/diff-zoo
 Thank you so much for sharing this great repo! I noticed that the source transformation notebook is not finished yet. Has it been completed since?