
Implement your own source transformation AD with Julia - metalwhale
http://blog.rogerluo.me/2019/07/27/yassad/
======
KenoFischer
We have so many AD implementations in Julia now that we actually have
infrastructure that separates the definition of primitive derivative rules
from the AD mechanism itself, so the rules can be shared among all of them. Of
course it would be better if there were just one AD to rule them all, but
there are tradeoffs in the design space that make that hard. I think having
all these different implementations has actually helped crystallize what the
design space actually is, what choices need to be made, and what the
interesting classes of applications are. I think that'll help with the next
generation of these tools (disclaimer: I just started working on one of those
next-generation tools yesterday ;) ).
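
For what it's worth, the shared-rule infrastructure being described works
roughly like the following sketch (assuming ChainRulesCore.jl is the package
meant; `myrelu` is a made-up primitive for illustration). A single `rrule`
definition returns the primal value plus a pullback, and any consuming AD can
pick it up:

```julia
using ChainRulesCore  # shared rule definitions, consumed by many ADs

myrelu(x) = max(zero(x), x)   # hypothetical primitive we want a rule for

# One rrule definition; multiple AD packages can all reuse it.
function ChainRulesCore.rrule(::typeof(myrelu), x)
    y = myrelu(x)
    # pullback: given the output cotangent ȳ, return cotangents for
    # (the function itself, each argument)
    myrelu_pullback(ȳ) = (NoTangent(), ȳ * (x > 0))
    return y, myrelu_pullback
end
```

The key design point is that the rule knows nothing about how any particular
AD represents its tape or transformed code; it only provides the primal and a
closure for the reverse pass.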

~~~
stochastimus
Thanks for this. Glad to see the Julia community is going strong. At Zebrium
we use Julia at the core of our log structuring engine, and talk to it over
gRPC. Keep it up!

------
memexy
Differentiating through control flow has never made sense to me. What does it
mean to differentiate the following function: "f(x) = x > 0 ? x : -x"? If you
plot this function you get a sharp corner at 0, which means it's not
differentiable there: the one-sided derivative from the left is -1 and from
the right is 1. Since 1 ≠ -1, the derivative does not exist at 0.

So how are AD libraries claiming to differentiate such functions? Is there an
implicit assumption that the user knows the derivative does not make sense at
0?

Edit: I just tried this and it gives the wrong answer without any hint that
it's incorrect:

"""
julia> f
f (generic function with 1 method)

julia> f(0), f'(0)
(0, -1)

julia> f'(1), f'(-1)
(1, -1)
"""
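
(To make the behavior above reproducible: the result at 0 is not arbitrary.
Tracing and source-transform ADs differentiate the branch that actually
executes, so at `x = 0` the condition `x > 0` is false, the `-x` branch runs,
and you get that branch's derivative. A sketch assuming Zygote and the same
`f` as above:

```julia
using Zygote

f(x) = x > 0 ? x : -x   # abs, written with control flow

# At 0 the condition is false, so AD differentiates the -x branch:
f'(0.0)    # -1.0, which is one valid subgradient of abs at 0
f'(1.0)    # 1.0
```

So the library isn't silently wrong so much as it is answering a slightly
different question: "what is the derivative of the code path that ran?")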

~~~
ssivark
The kinks correspond to a set of measure zero, which you will likely never hit
during execution, so one can safely ignore the problem as not physically
relevant. One way to think of the problem is that the cost function we’re
differentiating is approximate/fake, and whatever it needs to be (at some
special neighborhoods) to give us derivatives we consider sensible (in large
regions).

After all, there’s nothing so special about the ReLU... It would be very very
weird/unstable if our algorithms worked for ReLU, but not the link-smoothed
version of ReLU.
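
That intuition is easy to check numerically. Taking softplus as the smoothed
version of ReLU (an assumption here, since the parent doesn't name one), the
two derivatives agree closely everywhere except a small neighborhood of the
kink. Plain Julia, no AD needed:

```julia
relu(x)      = max(0.0, x)
softplus(x)  = log1p(exp(x))         # smooth approximation of relu
drelu(x)     = x > 0 ? 1.0 : 0.0     # derivative away from the kink
dsoftplus(x) = 1 / (1 + exp(-x))     # softplus' = logistic sigmoid

# The two derivatives differ only near the kink at 0:
abs(drelu(5.0) - dsoftplus(5.0))     # ≈ 0.0067
abs(drelu(0.1) - dsoftplus(0.1))     # ≈ 0.475
```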

~~~
LolWolf
Hmm... I'm not sure I agree.

All optimal points (for, say, optimizing a linear function) will lie on the
extremal points of the feasible domain, many of which will be points where the
constraint functions are not differentiable. In all cases you can turn
nonlinear objective function optimization (say over f) into linear objective
function optimization by adding a constraint f(x) ≤ t and moving t to the
objective.

Now, I will agree that smooth optimization algorithms will work _ok_, but try
optimizing abs(x) with GD; you'll find that the best possible error you can
achieve (other than by sheer luck) will be ~O(L), where L is your step size.
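
A concrete version of that claim (a toy sketch; the starting point and step
size are made up):

```julia
g(x) = sign(x)              # subgradient of abs(x)

function subgrad_descent(x0, L, iters)
    x = x0
    for _ in 1:iters
        x -= L * g(x)       # fixed step size, no decay
    end
    return x
end

# The iterate reaches the scale of L, then oscillates around 0 forever:
abs(subgrad_descent(1.05, 0.1, 1000))   # stalls near L/2 = 0.05
```

No matter how many iterations you run, the error never drops below the step
size scale, because the (sub)gradient magnitude stays 1 near the kink.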

~~~
ssivark
Yes, but we’re going to be restricted to O(L) final accuracy no matter what,
for gradient descent (we could choose second order optimizations, etc, but
that’s an orthogonal point — we’re happy to get within an epsilon ball of the
answer).

~~~
LolWolf
> Yes, but we’re going to be restricted to O(L) final accuracy no matter what

This is not, in general, true for smooth functions, so long as L is small
enough (you can reach arbitrary accuracy with GD if L is smaller than roughly
the reciprocal of the Lipschitz constant of the objective's gradient, and L
need not be arbitrarily small).
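
For contrast with abs(x), the same fixed-step scheme on a smooth objective
does converge geometrically. With f(x) = x², the gradient 2x is Lipschitz with
constant 2, and any step size η below 1/2 gives a contraction (parameters
below are illustrative):

```julia
df(x) = 2x                      # gradient of x^2; Lipschitz constant 2

function grad_descent(x0, η, iters)
    x = x0
    for _ in 1:iters
        x -= η * df(x)          # x ← (1 - 2η)·x, a contraction for η < 1/2... 1
    end
    return x
end

# With η = 0.25 each step halves x, so the error shrinks geometrically:
grad_descent(1.0, 0.25, 60)     # ≈ 8.7e-19
```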

------
kersny
Some more related info on different algorithmic differentiation approaches in
Julia: [https://github.com/MikeInnes/diff-zoo](https://github.com/MikeInnes/diff-zoo)

~~~
metalwhale
Thank you so much for sharing this great repo! I noticed that the source
transformation notebook isn't finished yet. Has it been completed since?

------
dgb23
This article is way over my head right now. But I bookmarked it.
Differentiable programming and probabilistic programming are among the things
that motivated me to learn the language (still a beginner), aside from just
brushing up and sharpening my math skills in a practical manner.

About that... One thing that I didn't expect but should have been obvious is
that introductory content is often geared towards scientists/mathematicians
rather than engineers, which makes sense given that this is the target
audience.

They often explain the programming side and not the mathematical/scientific
side. Which is fine, because they present the right vocabulary for me to
explore from different sources.

This article seems to be very much engineering focused but there is a ton of
vocabulary I'm not used to yet. I assume the reader is expected to have a
solid understanding of the paradigm and at least a high level understanding of
Zygote.

~~~
KenoFischer
You are correct. Our technical documentation is mostly aimed at working
scientists who want to start using these techniques in their work. That does
sometimes lead to funny cases where a document assumes you know what a smooth
manifold is but will explain try/catch blocks. We've started trying to put
together more introductory-focused material at
[https://juliaacademy.com/](https://juliaacademy.com/). We don't currently
have anything particularly AD focused (outside of the general ML courses), but
I think that's a topic that's high on the interest list.

~~~
dgb23
Thank you for pointing out this fantastic resource.

> mostly aimed at working scientists

My primary goals are to learn what (primarily) data-scientists do. In the
sense of: How do they think and approach problems, what are the limitations
and the prerequisites etc. (And as I said to improve my math skills.)

I think there is merit in engineers learning these things (within reasonable
scope), because at some point there needs to be a system that provides and
transforms data into a format that scientists/analysts can work with. And on
the other hand, there are things that engineers can implement and learn to
improve their systems. I'm excited about both and curious about how far I can
get.

------
cat199
Bit of a detour, but being able to do things like this is a big part of what
lisp programmers are getting at when they bring up the advantages of 'syntax
as data' - being able to perform complete high-level runtime program
introspection, transformation, and code generation. Good to see these kinds
of techniques are available in other good languages that might appeal to the
less parenthetically-inclined - not that that is the only benefit of Julia.
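
The code-as-data idea is directly visible in Julia, and it is what makes
source-transform ADs like the one in the article possible. A minimal sketch of
quoting and rewriting an expression:

```julia
ex = :(a + b)        # quote: code becomes an ordinary data structure
ex.head              # :call
ex.args              # [:+, :a, :b]

ex.args[1] = :*      # programs rewriting programs: turn + into *
eval(:(let a = 3, b = 4; $ex end))   # 12
```

A source-transform AD does essentially this at scale: walk the expression (or
the lowered IR), and splice in code that computes derivatives alongside the
original computation.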

------
metalwhale
Disclaimer: I'm not the author. Just interested in the article and want to
share this awesome post.

