Implement your own source transformation AD with Julia (rogerluo.me)
66 points by metalwhale on June 11, 2020 | hide | past | favorite | 27 comments

We have so many AD implementations in Julia now that we actually have infrastructure that separates out the definition of primitive derivative rules from the actual AD mechanism itself, so the rules can be shared amongst all of them. Of course it would be better if there was just one AD to rule them all, but there are tradeoffs in the design space that make that hard. I think having all these different implementations has actually helped crystallize what the design space actually is, what choices need to be made, and what the interesting classes of applications are. I think that'll help with the next generation of these tools (disclaimer: I just started working on one of those next generation tools yesterday ;) ).
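The separation the parent describes (shared primitive rules, pluggable AD mechanisms) can be sketched in a few lines. This is a hypothetical Python toy, not any real Julia package's API: a rule registry that a minimal forward-mode evaluator consumes via dual numbers.

```python
import math

# Hypothetical rule registry: each primitive's (value, derivative) rule
# lives in one shared table, separate from any particular AD mechanism.
RULES = {
    "sin": lambda x: (math.sin(x), math.cos(x)),
    "exp": lambda x: (math.exp(x), math.exp(x)),
}

class Dual:
    """Minimal forward-mode carrier: a value plus a tangent."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def apply(self, name):
        # The mechanism only knows the chain rule; the primitive's
        # derivative comes from the shared registry.
        v, dv = RULES[name](self.val)
        return Dual(v, dv * self.tan)

    def __mul__(self, other):
        # Product rule for the tangent component.
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)

def derivative(f, x):
    return f(Dual(x, 1.0)).tan

# d/dx [sin(x) * exp(x)] = cos(x)e^x + sin(x)e^x, which is 1 at x = 0.
print(derivative(lambda x: x.apply("sin") * x.apply("exp"), 0.0))  # → 1.0
```

A reverse-mode or source-transformation mechanism could consume the same `RULES` table, which is the point of factoring the rules out.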

Thanks for this. Glad to see the Julia community is going strong. At Zebrium we use Julia at the core of our log structuring engine, and talk to it over gRPC. Keep it up!

Differentiating through control flow has never made sense to me. What does it mean to differentiate the following function: "f(x) = x > 0 ? x : -x"? If you plot this function you get a sharp corner at 0, which means it's not differentiable there: the slope from the left is -1 and the slope from the right is 1. Since 1 ≠ -1, the derivative does not exist at 0.

So how are AD libraries claiming to differentiate such functions? Is there an implicit assumption that the user knows the derivative does not make sense at 0?

Edit: I just tried this and it gives the wrong answer without any hint that it's incorrect:

    julia> f
    f (generic function with 1 method)

    julia> f(0), f'(0)
    (0, -1)

    julia> f'(1), f'(-1)
    (1, -1)

Yes, you can just consider it a piecewise differentiable function. That's not usually a problem in practice if you think of the derivatives more as heuristics in a search problem than absolute truth. Obviously, how well your optimization works does depend on the differentiability properties of your underlying function and how well your optimization scheme can handle them.

If subgradients are enough (-1 is a correct subgradient at 0 in your example), then there are valid approaches to subgradient AD; see https://arxiv.org/abs/1809.08530
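For intuition on why a subgradient is "enough": subgradient descent on |x| with the classic diminishing step size 1/k still converges to the minimizer, even though no gradient exists at 0. A toy Python sketch (an illustration of the general idea, not the linked paper's method):

```python
def subgrad_abs(x):
    # Any value in [-1, 1] is a valid subgradient of |x| at x = 0; pick 0.
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

x = 5.0
for k in range(1, 10001):
    x -= subgrad_abs(x) / k  # diminishing step size a_k = 1/k

print(abs(x))  # close to the minimizer 0
```

The diminishing steps are what make the non-differentiable point harmless: once the iterate starts hopping across 0, the hop size shrinks to zero.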

Thanks for the link.

If you want to be rigorous about it, you can often talk in terms of elements of the set of subgradients rather than gradients, and the convergence proofs for many of the popular optimization algorithms still go through.

Any references for sub-gradients and convergence proofs I can take a look at?

Depending on the level you're looking for, I'm always a huge fan of Rockafellar [0] for the more mathematical expositions (they're lovely, clear, and very well thought through).

Even in this case, though, subgradients may not exist (they're only guaranteed to exist for convex functions, and any nonconvex function has, almost by definition, at least one point for which there exists no subgradient), in which case one talks about subderivatives instead. These always exist whenever a function is lower semicontinuous (such as the "if" statement example given in the GGP comment), even if it is not convex [1].

All of this is to say, the subject of general variational analysis is mathematically nice, but computationally extraordinarily difficult.

The first proofs of even local convergence of GD to local optima (not stationary points!) with high probability, even in the case of smooth, differentiable functions, only recently emerged as well, and the results are rather weak in the sense of limits [2]. Query-hardness has also been shown for many examples (even with unbounded computational time) in [3] even when the functions are smooth.

The non-smooth case becomes even harder: query complexity grows exponentially in the number of dimensions, even when the function is restricted so it cannot "vary too much" locally (i.e., required to be L-Lipschitz) [4].


[0] For example, Variational Analysis covers a good amount of this stuff https://sites.math.washington.edu/~rtr/papers/rtr169-VarAnal...

[1] Cf. chapter 8 in [0].

[2] https://arxiv.org/abs/1602.04915

[3] https://arxiv.org/abs/1912.02365 in the stochastic case, and https://arxiv.org/abs/1710.11606 along with https://arxiv.org/abs/1711.00841 in the nonstochastic case

[4] See, e.g., 1.1.3 in Nesterov's Introductory Lectures on Convex Optimization (can be easily found online).

Wow, the Rockafellar book is fantastic. I can't believe I'd never run into it before. Thanks!


The kinks correspond to a set of measure zero, which you will likely never hit during execution, so one can safely ignore the problem as not physically relevant. One way to think of the problem is that the cost function we’re differentiating is approximate/fake, and whatever it needs to be (at some special neighborhoods) to give us derivatives we consider sensible (in large regions).

After all, there’s nothing so special about the ReLU... It would be very very weird/unstable if our algorithms worked for ReLU, but not the link-smoothed version of ReLU.

Hmm... I'm not sure I agree.

All optimal points (for, say, optimizing a linear function) will lie on the extremal points of the feasible domain, many of which will be points where the constraint functions are not differentiable. In all cases you can turn nonlinear objective function optimization (say over f) into linear objective function optimization by adding a constraint f(x) ≤ t and moving t to the objective.

Now, I will agree that smooth optimization algorithms will work ok, but try optimizing abs(x) with GD; you'll find that the best possible error you can achieve (other than by sheer luck) will be ~O(L) where L is your stepsize.
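That stall is easy to reproduce in a toy Python run (here L is the constant step size, as in the comment above):

```python
def grad_abs(x):
    # "Gradient" of |x|: the sign, with an arbitrary choice of +1 at 0.
    return 1.0 if x >= 0 else -1.0

L = 0.3  # constant step size
x = 1.0
for _ in range(1000):
    x -= L * grad_abs(x)

# With these values the iterates end up hopping between roughly 0.1 and
# -0.2 forever: the final error is O(L) no matter how long we run.
print(abs(x))
```

Each step has fixed length L regardless of how close we are to the optimum, so the iterate can only circle the kink at distance ~L, never settle on it.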

> All optimal points

Sorry, I screwed up a bit when writing that: an optimal point exists which lies on the extremal points of the feasible domain. In many cases (when the objective is linear), it can be shown that an optimal point lies on the extremal points of the feasible domain.

Yes, but we’re going to be restricted to O(L) final accuracy no matter what, for gradient descent (we could choose second order optimizations, etc, but that’s an orthogonal point — we’re happy to get within an epsilon ball of the answer).

> Yes, but we’re going to be restricted to O(L) final accuracy no matter what

This is not, in general, true for smooth functions so long as L is small enough: you can reach arbitrary accuracy with GD if L is smaller than ~ the reciprocal of the Lipschitz constant of the objective's gradient, and L need not be arbitrarily small.
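For contrast with the |x| example, here is the smooth case as a toy Python run: f(x) = x², whose gradient 2x is Lipschitz with constant 2, so any fixed step below 1/2 drives GD to arbitrary accuracy.

```python
step = 0.4  # below 1/2, the reciprocal of the gradient's Lipschitz constant
x = 1.0
for _ in range(200):
    x -= step * 2 * x  # gradient of x**2 is 2x

# Each update multiplies x by (1 - 0.8) = 0.2, so the error
# decays geometrically: no O(step) floor, unlike the |x| case.
print(abs(x))
```

The difference is that the gradient itself shrinks near the optimum, so a fixed step size automatically takes smaller and smaller moves.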

My question wasn't about the theoretical aspects of measurability (any countable set of points has measure 0), but about all AD libraries sweeping this kind of issue under the rug. Where in the Zygote docs is it mentioned that the absolute value function will give the wrong answer when differentiated?

In the same way that the ReLU derivative is not defined at x=0. Most of the time, in practice, this all doesn't really matter and you can still get gradient descent to work in a useful way.

The gradient doesn't exist, but subgradients do exist (at points of non-differentiability), and that is still useful (case in point, as someone else mentions: ReLU).


Thanks. That's a nice tutorial.

Some more related info on different algorithmic differentiation approaches in Julia: https://github.com/MikeInnes/diff-zoo

Thank you so much for sharing this great repo! I noticed that the source transformation notebook is not finished yet. Has it been completed since?

This article is way over my head right now. But I bookmarked it. Differentiable programming and probabilistic programming are among the things that motivated me to learn the language (still a beginner), aside from just brushing up and sharpening my math skills in a practical manner.

About that... One thing that I didn't expect but should have been obvious is that introductory content is often geared towards scientists/mathematicians rather than engineers, which makes sense given that this is the target audience.

They often explain the programming side and not the mathematical/scientific side. Which is fine, because they present the right vocabulary for me to explore from different sources.

This article seems to be very much engineering focused but there is a ton of vocabulary I'm not used to yet. I assume the reader is expected to have a solid understanding of the paradigm and at least a high level understanding of Zygote.

You are correct. Our technical documentation is mostly aimed at working scientists who want to start using these techniques in their work. That does sometimes lead to funny cases where a document assumes you know what a smooth manifold is but will explain try/catch blocks. We've started trying to put together more introductory-focused material at https://juliaacademy.com/. We don't currently have anything particularly AD focused (outside of the general ML courses), but I think that's a topic that's high on the interest list.

Thank you for pointing out this fantastic resource.

> mostly aimed at working scientists

My primary goals are to learn what (primarily) data-scientists do. In the sense of: How do they think and approach problems, what are the limitations and the prerequisites etc. (And as I said to improve my math skills.)

I think there is merit in engineers learning these things (within reasonable scope) because at some point there needs to be a system that provides and transforms data into a format that scientists/analysts can work with. And on the other hand, there are things that engineers can implement and learn to improve their systems. I'm excited about both and curious about how far I can get.

Bit of a detour, but being able to do things like this is a big part of what Lisp programmers are getting at when they bring up the advantages of 'syntax as data': being able to perform complete high-level runtime program introspection, transformation, and code generation. Good to see these kinds of techniques are available in other good languages that might appeal to the less parenthetically inclined, not that that is the only benefit of Julia.
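To make the "syntax as data" round trip concrete outside a Lisp, here is a toy Python sketch using the standard ast module (the rewrite itself is a dummy, chosen only for illustration): parse code into a tree, transform the tree, compile it back. A source-transformation AD does the same round trip, but inserts derivative propagation instead.

```python
import ast

# Source code as data: a string we will parse, rewrite, and recompile.
src = "def g(a, b):\n    return a - b\n"

class SubToAdd(ast.NodeTransformer):
    # Dummy transform: rewrite every subtraction into an addition.
    def visit_Sub(self, node):
        return ast.Add()

tree = ast.parse(src)
tree = ast.fix_missing_locations(SubToAdd().visit(tree))

ns = {}
exec(compile(tree, "<transformed>", "exec"), ns)
print(ns["g"](2, 3))  # the rewritten g computes 2 + 3 → 5
```

The key point is that between `ast.parse` and `compile`, the program is an ordinary data structure you can walk and rewrite, which is exactly the capability the parent comment is describing.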

Disclaimer: I'm not the author. Just interested in the article and want to share this awesome post.
