
A Differentiable Programming System to Bridge ML and Scientific Computing - ChrisRackauckas
https://arxiv.org/abs/1907.07587
======
KenoFischer
Happy that this paper finally made it to arxiv. The biggest reason for writing
it was to try and showcase some of the breadth of applications we see for
really high quality first class AD support at the language level. There are
several communities that need this technology, so it makes sense to try and
build one system that can address all of them and share tricks. I'm also
hoping this gives people a sense of why our list of feature requirements for
this system is so extensive (pervasive custom gradients, higher order AD,
mixed mode, ultra fast scalar AD, etc). These days AD is often discussed only
in the context of deep learning, but as the frontiers of deep learning are
pushed and hybrid models become more and more popular, I'm expecting our focus
on generality to pay off here.

~~~
cs702
Keno: first of all, let me give a big public _thank you_ to you and your
colleagues. (For those here who don't know, Keno is listed as one of the
authors of the paper, and works closely with Mike Innes, lead author and also
lead developer of Zygote. Mike is also an active member of HN.)

Second, let me bring up what I think is a significant issue. My perception is
that most deep learning researchers and practitioners -- that includes me --
tend to iterate very rapidly over new ideas. We want to test ideas in code as
quickly as possible, and in practice we will not invest the time and effort
necessary to figure out how to write cache-optimized kernels (e.g., for GPUs)
every time we might need one. In fact, I'd say the default attitude is that it
doesn't even make sense for us to use (what we perceive as slow) automated
kernel-writing tools to search for and compile optimized kernels. In practice,
it always seems faster and easier (from a developer-time standpoint) to write
code that leverages existing, inflexible, prebuilt, hand-tuned kernels that
have already proven to work well, are fast, and are already nicely integrated
with frameworks like PyTorch and TensorFlow.

There was a good discussion of this topic in the forums a few weeks ago[a] in
response to a recent paper published by some folks at Google in which they
make a compelling case that this issue is holding back AI research.[b] As an
example, they use capsule networks[c], the implementations of which have
proven difficult to optimize (e.g., they copy a lot more data around than is
strictly necessary, in order to slice and reshape data in a manner that is
compatible with preexisting hardware-accelerator kernels).

What are your thoughts on this? Will Zygote provide any improvements or
advantages on this front?

--

[a]
[https://news.ycombinator.com/item?id=20301619](https://news.ycombinator.com/item?id=20301619)

[b]
[https://dl.acm.org/citation.cfm?id=3321441](https://dl.acm.org/citation.cfm?id=3321441)

[c] [https://arxiv.org/abs/1710.09829](https://arxiv.org/abs/1710.09829)

~~~
KenoFischer
Zygote is an orthogonal piece of technology on this front and relies on a good
optimizing compiler behind it to target actual hardware. Its focus is
primarily on expressiveness. We've been talking about automatic kernel
generation for a while (and when I say kernel generation, what I mean is
basically searching for access patterns), but note that it's not quite as bad a
problem in julia, because you can use higher order functions to get a lot of
composability (e.g. if there's a hand-optimized parameterized matmul, fusing
in inputs and outputs is just passing in extra functions). There's some
academic work at the JuliaLab that's promising. We're also in discussions with
commercial partners who care about this kind of thing to see if somebody is
willing to fund it, but so far existing solutions have been good enough while
we work on higher priority items. I do agree that being limited to a few hand-
tuned kernels is a significant drag on research, so I hope we can do something
here.
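
Keno's "fusing in inputs and outputs is just passing in extra functions" point
can be sketched in plain Python (the function `fused_matmul` and its
`pre`/`post` hooks are made up for illustration, not any real kernel library's
API):

```python
def fused_matmul(A, B, pre=lambda v: v, post=lambda v: v):
    """Toy matmul 'kernel' parameterized by element-wise transforms.

    `pre` is applied to each element of A as it is read and `post` to each
    output element as it is written, so fusing e.g. a scaling of the inputs
    or a relu on the outputs never materializes an intermediate matrix.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for l in range(k):
                acc += pre(A[i][l]) * B[l][j]
            C[i][j] = post(acc)
    return C

# Fusing "double every input" into the kernel is just passing a function:
doubled = fused_matmul([[1.0, 2.0], [3.0, 4.0]],
                       [[1.0, 0.0], [0.0, 1.0]],
                       pre=lambda v: 2.0 * v)
```

In Julia, the compiler specializes the kernel on the passed functions, so this
indirection compiles away; the sketch only shows the shape of the idea.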

~~~
_ks3e
In terms of trying to break free of dependence on hand-optimized kernels: a
few people, myself included, have been working on some theoretical approaches
to generating cache-efficient rearrangements for neural-net-like problems.
We've worked it out for convolution like problems [1] and have some upcoming
results generalizing these techniques to other problems. Please feel free to
email if you'd like to talk.

[1] [https://arxiv.org/abs/1802.06905](https://arxiv.org/abs/1802.06905)

------
jknz
This is huge and I have high hopes for Zygote.jl.

Observation/request: For higher-order derivatives (Hessian, Laplacian, etc),
AD libraries typically provide API shortcuts. I have found it difficult to
control or predict the memory footprint and the time complexity of these API
shortcuts.

The Laplacian is a case in point: it is sometimes computed by first computing the
Hessian by forward-over-reverse and then taking the trace. In most libraries
you would have to dig deep in the documentation or the code itself to
understand how intermediary values are accumulated in memory for instance.

I would love to have a summary table of the complexity of all these API
shortcuts, similar to
[https://wiki.python.org/moin/TimeComplexity](https://wiki.python.org/moin/TimeComplexity)
for data-structures, but for these AD shortcuts (laplacian, divergence,
hessian, etc). The table would show memory and time complexity in terms of the
number of the input/intermediary/output dimensions and the complexity of
evaluating the original function.
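
To make the cost question concrete, here is a minimal pure-Python sketch
(illustrative only, not any AD library's API) of a Laplacian shortcut built
from nested dual numbers, i.e. forward-over-forward rather than the
forward-over-reverse mentioned above. It makes the complexity visible: n
second-derivative passes, each a constant factor over one evaluation of f.

```python
class Dual:
    """Minimal dual number supporting + and *; nesting gives 2nd derivatives."""
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps
    def _lift(self, o):
        return o if isinstance(o, Dual) else Dual(o)
    def __add__(self, o):
        o = self._lift(o)
        return Dual(self.val + o.val, self.eps + o.eps)
    __radd__ = __add__
    def __mul__(self, o):
        o = self._lift(o)
        # Product rule carries the derivative in the eps slot.
        return Dual(self.val * o.val, self.val * o.eps + self.eps * o.val)
    __rmul__ = __mul__

def laplacian(f, x):
    """Sum of d^2 f / dx_i^2: one nested forward pass per coordinate."""
    total = 0.0
    for i in range(len(x)):
        args = list(x)
        # Perturb coordinate i at both nesting levels.
        args[i] = Dual(Dual(x[i], 1.0), Dual(1.0, 0.0))
        total += f(args).eps.eps  # second derivative w.r.t. x_i
    return total
```

For f(x) = x0^2*x1 + x1^3, the Laplacian is 2*x1 + 6*x1 = 8*x1, and
`laplacian` recovers it with n nested forward passes; a forward-over-reverse
implementation trades that for the reverse pass's intermediate storage.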

~~~
KenoFischer
Unfortunately, this isn't super easy, since the time complexity heavily depends
on what optimizations the compiler will apply and on how exactly the higher-order
AD is implemented (there are many ways to do it that all give you the same
answer, and you probably want the system to pick the best one for you). What we
could do fairly easily, however, is to have a compiler introspection tool that
runs through the computation without doing any of the work, and just records how
much work it would have done. That would easily give you an idea and let you
plot a parameter sweep over any dimension you care about, with actual results
corresponding to the capabilities of the system.
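
A crude stand-in for such a tool can piggyback on operator overloading (all
names here are hypothetical; a real introspection tool would hook the compiler
rather than actually executing the computation):

```python
class Counted:
    """Toy cost-model value: counts arithmetic ops as they happen."""
    ops = 0
    def __init__(self, v):
        self.v = v
    def _lift(self, o):
        return o.v if isinstance(o, Counted) else o
    def __add__(self, o):
        Counted.ops += 1
        return Counted(self.v + self._lift(o))
    __radd__ = __add__
    def __mul__(self, o):
        Counted.ops += 1
        return Counted(self.v * self._lift(o))
    __rmul__ = __mul__

def count_ops(f, *args):
    """Run f on counting values and report how much work it did."""
    Counted.ops = 0
    f(*[Counted(a) for a in args])
    return Counted.ops
```

Sweeping `count_ops` over problem sizes would give exactly the kind of
empirical complexity table the parent comment asks for.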

~~~
jknz
Thanks for the reply. Such an introspection tool would indeed be useful--it seems
already possible to do some inspection by hand with
[https://fluxml.ai/Zygote.jl/latest/adjoints/#Gradient-Reflection-1](https://fluxml.ai/Zygote.jl/latest/adjoints/#Gradient-Reflection-1)

Anyway, great work and I look forward to what happens next!

------
endoftime5
I’ve just given this a cursory read but wanted to say that this looks really
exciting. Julia in general seems like such an exciting language.

I’ve recently switched some Fortran simulations at work to a pure Julia
implementation and it is great fun to write. DifferentialEquations.jl is a
phenomenal piece of work!

~~~
6thaccount2
Is the performance equivalent in your scenario?

~~~
endoftime5
We have taken a slight performance hit, but the code is cleaner and much
easier to extend.

~~~
ChrisRackauckas
If there's any benchmarks on the DiffEq side, feel free to reach out to us
with them and we can take a crack at optimizing it. If it's something that's
allowed to be shared, we'd love to add it to our DiffEqBenchmarks.jl
repository to track the performance over time and understand the performance
characteristics of your domain.

~~~
6thaccount2
Is there a general way to do this with many of the libraries in Julia?

Any chance y'all could eventually use a test model and script from me in a
benchmark test suite?

Ex: I have a large synthetic test model from a university and plan on writing
some code that uses a bunch of the sparse matrix and linear solve
functionality of the language. This would be on a sparse ~70k row/column
square matrix.

~~~
ChrisRackauckas
At least with DiffEq we keep track of a lot of different benchmarks here:
[https://github.com/JuliaDiffEq/DiffEqBenchmarks.jl](https://github.com/JuliaDiffEq/DiffEqBenchmarks.jl)
and it's always growing. Some of the best benchmarks have come from users who
want us to track performance on their problem. Some other libraries have repos
around, but not as formalized. We are getting things automated so that we can
start helping others.

Your model sounds nice because it can use our automatic sparsity detection and
matrix coloring. I'd love to give it a try.

[https://github.com/JuliaDiffEq/SparseDiffTools.jl#automated-sparsity-detection](https://github.com/JuliaDiffEq/SparseDiffTools.jl#automated-sparsity-detection)

~~~
6thaccount2
Awesome! I wish I could give you a real industry one (even bigger), but a
synthetic model should adequately reflect the problem domain and they have
been the primary models used by researchers in my field for the past twenty
years or so.

Edit: which Julia group should I get into contact with over this?

------
GistNoesis
An alternative approach:
[https://www.microsoft.com/en-us/research/video/the-simple-essence-of-automatic-differentiation/](https://www.microsoft.com/en-us/research/video/the-simple-essence-of-automatic-differentiation/),
where you define the right functional
structures that allow a very compact and efficient way of expressing automatic
differentiation and then let the existing compiler do the heavy lifting.

~~~
KenoFischer
I've read the corresponding paper, and while I think it's a fun read, I'm not
sure it does particularly much to clear up the underlying confusion here.
Reverse-mode AD on straight-line primitives is describable in about three
lines of exposition. The tricky part is how you handle control flow
(equivalently recursion). The description thereof then depends on what your
underlying IR data structure is, but describing this generally isn't horrible
either. The real problem in the literature is that there is lots of confusion
over what is the AD algorithm and what are the workarounds applied to make it
work in the particular system described.

For example, all the ML frameworks generally combine a high level tracer with
a straight line AD transform, which then sometimes gets presented as a
fundamental aspect of AD (which it isn't, you're just using the tracer to get
an IR you can handle). As a result, this field has a lot of confused
terminology, but all the results have been known since the 70s.

The name of the game for all the latest generation tools is to cleanly
separate out the semantic AD transform from the rest of the system and then
just use an unmodified compiler thereafter. We do it by transforming Julia IR,
the Swift folks do it on SIL, and the Scala folks have an implementation using
shift/reset and LMS. If you do this, a bunch of traditional AD techniques
become just special cases of general purpose compiler transforms (DCE, CSE,
etc). See this talk I gave recently for some more details:
[https://juliacomputing.com/blog/2019/02/19/growing-a-compiler.html](https://juliacomputing.com/blog/2019/02/19/growing-a-compiler.html)

As an aside, a pet peeve of mine is people calling these algorithms "tape
free". High storage requirements, due to the need to remember (or,
alternatively, recompute) intermediate values, are a fundamental property of
reverse-mode AD, and you can't get rid of them. The best you can do is hide
them (e.g. in a compiler-managed stack or in closure environments), with all
the same fundamental challenges. This is probably a terminology clash, with
people wanting to use "tape" as a term for a much narrower kind of data
structure generated by a tracing AD system, but the requirement to have some
data structure like it, no matter how hidden, is fundamental to the algorithm.
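
The storage Keno describes is easy to see in a toy reverse-mode sketch (pure
Python; `Var`, `gradients`, and the explicit `tape` list are illustrative, not
any real system's API). The backward pass needs operand values recorded during
the forward pass, whether they live in an explicit list, a compiler-managed
stack, or closures:

```python
class Var:
    """Toy reverse-mode AD value recording onto an explicit tape."""
    def __init__(self, value, tape):
        self.value, self.grad, self.tape = value, 0.0, tape
        self.parents = []  # (parent Var, local partial derivative)
    def _record(self, value, parents):
        out = Var(value, self.tape)
        out.parents = parents
        self.tape.append(out)  # the intermediate is now stored
        return out
    def __add__(self, other):
        return self._record(self.value + other.value,
                            [(self, 1.0), (other, 1.0)])
    def __mul__(self, other):
        # The *values* of both operands are saved as partials --
        # this is exactly the storage that "tape free" systems merely hide.
        return self._record(self.value * other.value,
                            [(self, other.value), (other, self.value)])

def gradients(output, inputs):
    """Walk the tape backwards, accumulating adjoints."""
    output.grad = 1.0
    for node in reversed(output.tape):
        for parent, partial in node.parents:
            parent.grad += partial * node.grad
    return [v.grad for v in inputs]

tape = []
x, y = Var(3.0, tape), Var(4.0, tape)
z = x * y + x            # forward pass records two tape entries
dx, dy = gradients(z, [x, y])
```

For z = x*y + x at (3, 4) this yields dz/dx = y + 1 = 5 and dz/dy = x = 3;
deleting the tape (or the saved partials) makes the backward pass impossible,
which is the point.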

~~~
GistNoesis
I'm currently using TF 1.x for most of my AD needs, but I'm running into its
limits, so I'm always looking for neater solutions.

Conceptually, I enjoy the functional approach where the standard compiler does
the work. But I'm still using a graph-based approach, where the order of
operations is recorded on a "tape" and then replayed in reverse order.

The main reason is to be able to manage the memory and computation placement.
To distribute huge computations among machines, while limiting memory
transfers between machines/devices to a minimum.

Also some kind of automatic gradient accumulation to reduce the memory need
would be nice. Currently, using the graph-based approach, partitioning a big
tensor into smaller ones and doing a map-reduce to accumulate the gradient
works surprisingly well (except for the initial, very long graph-creation time).

Every time I do one of those manual optimizations I start dreaming: "that
should be done by the compiler". You tell the compiler your cluster definition
and devices, you write a few for loops as if the code ran on a single CPU, and
the compiler does the loop splitting, reordering, and parallelization across
devices in an optimal way given your available memory constraints.

~~~
eigenspace
Sounds like you’d love julia!

------
user1241320
Is this somehow similar to the new Swift Automatic Differentiation feature?
[https://github.com/tensorflow/swift/blob/master/docs/Automat...](https://github.com/tensorflow/swift/blob/master/docs/AutomaticDifferentiation.md)

~~~
StefanKarpinski
Yes, it's the same idea. But Julia's differentiable programming capabilities
are far more advanced and mature than Swift's. As far as I'm aware, Swift
still doesn't support differentiating code with control flow (branches or
loops), which, needless to say, eliminates pretty much all non-trivial
programs.

Compare that to the situation in Julia: ∂P works _today_ on arbitrary
programs—like the ray tracer and other examples in this paper—programs which
are highly non-trivial and use iteration, recursion, mutation and global
state. All of which Julia's ∂P can take derivatives through.

When you additionally consider Swift's essentially non-existent computational
and data science ecosystem, it's a bit hard (for me at least) to rationalize
the Swift ∂P effort. (Are we going to differentiate iPhone apps?) They're
attempting to bootstrap their computational/data ecosystem by allowing calling
Python code, but as soon as you call into Python, you lose the ability to take
derivatives, which only works for pure Swift code. So any program which relies
on Python to do some of the computation you want to take a derivative of won't
be differentiable, which kind of defeats the point of having ∂P in the first
place. We'll see how it pans out but the Swift effort has considerable
technical and social challenges to overcome.

~~~
billconan
How are if conditions handled? They are not differentiable, right? Same for
loops?

~~~
StefanKarpinski
There's a whole paper explaining how it all works :)

------
ta1234567890
Can't comment on the technical details/merit of the paper.

However, this is huge. From a conceptual perspective, everything in the
physical world can be modeled as relative rates of change, which are basically
differential equations. Having the power to more easily build, run and
understand these models will help tremendously in advancing the
computer/physical interface as well as all its potential applications.

Thank you to the authors for such a huge contribution. Looking forward to all
the cool things that will be made possible through this.

------
eggy
I keep orbiting Julia, because of developments like this one. I am an old
Lisper/Scheme guy. How does this compare to say R6RS-AD[1] or cl-autodiff[2]
or Deriva for Clojure/Java[3]?

I realize Julia has femtolisp inside, yet another Lisp bait for me!

[1] [https://github.com/qobi/R6RS-AD](https://github.com/qobi/R6RS-AD)

[2] [https://github.com/masonium/cl-autodiff](https://github.com/masonium/cl-autodiff)

[3] [https://github.com/lambder/Deriva](https://github.com/lambder/Deriva)

~~~
ddragon
Those are based on overloading functions to add the new behaviors (computing
the forward diff or building the graph within a special variable for the
reverse mode) which in Julia is implemented for example in [1], [2], [3] and
[4], while the one in OP is based on source analysis and transformation.

And the similarities with Lisp go beyond the parser being written in it:
Julia's programming paradigm is based on CLOS, everything is an expression,
there are hygienic macros and reader macros, and code is data. It mostly just
lacks s-expressions.

[1]
[https://github.com/JuliaDiff/ForwardDiff.jl](https://github.com/JuliaDiff/ForwardDiff.jl)

[2] [https://github.com/dfdx/Yota.jl](https://github.com/dfdx/Yota.jl)

[3]
[https://github.com/denizyuret/AutoGrad.jl](https://github.com/denizyuret/AutoGrad.jl)
(from the Knet framework)

[4]
[https://fluxml.ai/Flux.jl/stable/internals/tracker/](https://fluxml.ai/Flux.jl/stable/internals/tracker/)
(from the Flux framework)

~~~
eggy
Wow, thank you for such an informative reply!

I am going to try this out. One of my pet peeves with Julia is that my main
machine is a Windows 10 box, and things like CUDAnative.jl say they only
install on Mac and Linux. I have an old 2013 Linux box (Lenovo T430u laptop).

BTW, do you recommend the JuliaPro install, or vanilla Julia built up from
there, for more general technical programming, not just ML and DL?

~~~
ddragon
I prefer vanilla Julia since it has all the packages at the latest released
versions and it's the most used, so it's easier to get help. CUDAnative.jl's
documentation doesn't mention being Mac/Linux-only (the only such mention was
in the two-year-old preview release), and if it is, maybe you could try
another library like ArrayFire.jl.

And more general technical programming is Julia's specialty (it was built for
that, as a high-performance interactive language); the DL hype started after
the language was first released.

------
skybrian
This looks great! The language features of Julia seem to be exploited to good
effect. However, I'm curious: are there any aspects of Julia's design that, in
retrospect, make differential programming harder or more inconvenient?

~~~
KenoFischer
There are two competing currents here. On the one hand, Julia is extremely
powerful and dynamic, so it has very high expressiveness for any possible
differentiable programming you could think of. It also has a fairly simple
core, so as long as you know how to properly transform the core, you can get a
mathematically correct derivative. However, the reason Julia works so well
is that the compiler is able to understand and eliminate all the dynamism and
complexity for you and generate very tight code as a result. When
you add AD into the mix, the generated code becomes much, much more
complicated, so the compiler needs to work a lot harder. That can sometimes be
a cause for frustration for us, because we get something really cool working
with absolutely zero effort, but then need to go back and make the compiler
better to reclaim some performance (static languages have the opposite
problem, where you need to improve the type system in order to allow the thing
to happen, but once you have described it in a specific enough type system,
you usually also have good performance - if your language is any good that
is).

Overall I think Julia is a pretty great language to implement AD in (as
evidenced, I'd say, by the 15 or so different AD packages that people have
written for individual use cases before the latest push for a unified one),
but it still
is a very powerful language, so if you want your AD to handle the whole
language (as we do), then you're gonna have to do a bit of work.

~~~
skybrian
Thanks! So it sounds like what you're saying is that, as with other languages
that have complex optimizers, there can be performance cliffs when you get too
far off the beaten path?

For someone not working on the Julia compiler, how tricky is it to figure out
what to do to improve performance?

------
zitterbewegung
This looks neat. In Python, is this a project that does the same thing?
[https://github.com/HIPS/autograd](https://github.com/HIPS/autograd)

~~~
phrmoy
Would also be curious to understand how this compares. They seem to account
for recursion and if-branches as well.

~~~
eigenspace
They're not really all that close. For instance, that Autograd library requires
that you use its own version of NumPy instead of the regular version, because
it can't differentiate C. Because most non-trivial Python code is actually
written in C, there are almost no performant programs you can just differentiate
out of the box unless Autograd itself has a fork of that package.

My understanding is that the python Autograd library is also quite slow
(though it's been a while since I looked at it).

Zygote will give the same runtime performance as handwritten derivatives in
many cases, and works across almost the entire Julia ecosystem, since nearly
all Julia packages are truly written in Julia.

~~~
shoyer
JAX (github.com/google/jax), which is being developed by many of the authors
of Autograd, is probably a better comparison. At the cost of requiring you
to rewrite control flow in a functional way, it eliminates Python's overhead
by compiling NumPy code into XLA.

------
marmaduke
Does this support user defined data structures? This is a thing holding me
back in Stan and other frameworks with AD (they only want scalars or arrays)

~~~
ddragon
Yes

[https://github.com/FluxML/Zygote.jl](https://github.com/FluxML/Zygote.jl)

"Without compromising on performance, Zygote supports the full flexibility and
dynamism of the Julia language, including control flow, recursion, closures,
structs, dictionaries, and more."

~~~
marmaduke
Ah OK, the linked page here didn't mention structs.

------
kensai
Cool and amazing. This thing should be used to enrich base Julia, not simply
be an extension.

~~~
eigenspace
Julia is designed in such a way that third party 'extensions' (packages) are
first class citizens. Very little actually needs to be in Base or the standard
library. When you type `using Zygote`, you're not using a programming
language with a siloed AD extension; you've radically extended Julia into a
differentiable programming language.

The beauty of Zygote.jl being a package is that we don't force one AD approach
on the entire language ecosystem. Julia has several very promising AD systems
in development, each of which is preferable in certain situations, but they
are close enough in API that they're basically hot-swappable.

When one of them needs a change made to base julia to better support the sort
of compiler transformations they want to do, they make a PR and the new
functionality gets implemented very quickly and all the other AD / compiler
transformation packages benefit from the change.

~~~
KenoFischer
Yes, this is exactly correct. We try very hard to extend Julia with
"mechanisms" not "features", such that if you don't like the exact way we're
doing it, you can just do your own (with blackjack), while still re-using as
much of the rest of the system as possible. The same holds true for AD. We
think (and we hope) that Zygote (and related packages) will be an extremely
good, general purpose AD system that's a good default, but if you're in a
specialized domain, you may be able to do a lot better with a specialized AD
transform (e.g. your domain might want to do part of the computation in the
frequency domain automatically, or have domain-specific optimizations that
are not generally valid, etc.). Our user base is too varied to be able to
prescribe a one-size-fits-all approach. The only thing we can do is provide
the necessary mechanisms and high quality default implementations (as well as
good explanations as to how things work) and let it loose on the world.
Everything being in the same language helps enormously here, since even users
coming from the scientific side are well versed enough in the language to at
least be able to read and understand the semantics of code in the compiler
(even if they think it employs dark magic).

------
briantakita
Interesting project. I question the implications of using ML data shaping to
define models. A sufficiently optimized model may be missing fundamental
inputs & constraints, yet still perform better than a less optimized model
with the correct inputs & constraints. This would effectively create a local
maximum of understanding. It seems that, while the predictive capabilities of
complex systems may increase in some cases, the practice of fundamental
understanding & articulation of the system will decrease (i.e. "I don't know
how it works, the magic black box has all the answers").

Perhaps there are ways to mitigate these implications in the scientific
process.

~~~
ddragon
In this case it might be the exact opposite. Regardless of the structure (as
long as it has at least one hidden layer and enough parameters), neural
networks are universal approximators, which is why they are effectively black
boxes that can learn any model. So using more custom structure does not
increase the theoretical capability of the network to find more complex
models; instead, you're effectively adding more a priori information (by
embedding knowledge of the problem within its structure), which can heavily
restrict the search space and therefore make it somewhat less of a magic
black box.

An example might be attention models, which, while more complex than previous
models, give you more information about what the model does (by allowing you
to visualize which inputs the following layer is using).

------
roseandking
How is this different from JAX?

~~~
KenoFischer
JAX is a sophisticated and well implemented high level tracer, combined with a
backend compiler that makes use of XLA. High-level tracing has some
fundamental drawbacks for key features we want (efficient scalar AD, for
example), so the approach being advocated here is doing AD as a compiler
transform on the original source language. Mike has some details in an earlier
paper: [https://arxiv.org/abs/1810.07951](https://arxiv.org/abs/1810.07951).
As a shameless plug, you can plug (heh) Julia into XLA as well, e.g. for
targeting TPUs
([https://arxiv.org/abs/1810.09868](https://arxiv.org/abs/1810.09868)). In
fact, Julia and JAX are probably better at that than TensorFlow, because the
TF execution model is a bit of a mismatch for what XLA expects.

~~~
one-more-minute
The really big problem with tracing, `@jit` annotations, and similar
approaches is that you're no longer running Python, but something almost-but-
not-quite Python – modifying the language's semantics is necessary to get
performance.

In deep learning, tweaking some syntax and adding some annotations isn't a big
deal, but for the use cases we're interested in we really don't want to
rewrite every library to be AD compatible. Zygote is pretty unique (outside of
the scientific computing world) in being able to take libraries that were
written years before AD existed in Julia, and differentiate them correctly and
efficiently.

There are other, more subtle and technical, issues with those approaches, but
that's really the Big Deal.

------
adamnemecek
How can one get involved in this project? Is there a discord or something
where I can see what's happening?

~~~
ViralBShah
The Julia discourse
([https://discourse.julialang.org](https://discourse.julialang.org)) site is a
good place to engage. PRs, issues, etc. should go to the relevant package
repos on Github - Zygote.jl, Cassette.jl, ForwardDiff.jl, model zoo in FluxML
org and so on.

[https://github.com/FluxML/Zygote.jl](https://github.com/FluxML/Zygote.jl)

[https://github.com/jrevels/Cassette.jl](https://github.com/jrevels/Cassette.jl)

[https://github.com/JuliaDiff/ForwardDiff.jl](https://github.com/JuliaDiff/ForwardDiff.jl)

[https://github.com/FluxML](https://github.com/FluxML)

------
ummonk
Me reading the title: “doesn’t Julia already have that?”

Me reading the paper: “oh”

