A Differentiable Programming System to Bridge ML and Scientific Computing (arxiv.org)
381 points by ChrisRackauckas on July 19, 2019 | hide | past | favorite | 71 comments

Happy that this paper finally made it to arxiv. The biggest reason for writing it was to try and showcase some of the breadth of applications we see for really high quality first class AD support at the language level. There are several communities that need this technology, so it makes sense to try and build one system that can address all of them and share tricks. I'm also hoping this gives people a sense of why our list of feature requirements for this system is so extensive (pervasive custom gradients, higher order AD, mixed mode, ultra fast scalar AD, etc). These days AD is often discussed only in the context of deep learning, but as the frontiers of deep learning are pushed and hybrid models become more and more popular, I'm expecting our focus on generality to pay off here.

Keno: first of all, let me give a big public thank you to you and your colleagues. (For those here who don't know, Keno is listed as one of the authors of the paper, and works closely with Mike Innes, lead author and also lead developer of Zygote. Mike is also an active member of HN.)

Second, let me bring up what I think is a significant issue. My perception is that most deep learning researchers and practitioners -- that includes me -- tend to iterate very rapidly over new ideas. We want to test ideas in code as quickly as possible, and in practice we will not invest the time and effort necessary to figure out how to write cache-optimized kernels (e.g., for GPUs) every time we might need one. In fact, I'd say the default attitude is that it doesn't even make sense for us to use (what we perceive as slow) automated kernel-writing tools to search for and compile optimized kernels. In practice, it always seems faster and easier (from a developer-time standpoint) to write code that leverages existing, inflexible, prebuilt, handtuned kernels that have already proven to work well, are fast, and are already nicely integrated with frameworks like PyTorch and TensorFlow.

There was a good discussion of this topic in the forums a few weeks ago[a] in response to a recent paper published by some folks at Google in which they make a compelling case that this issue is holding back AI research.[b] As an example, they use capsule networks[c], the implementations of which have proven difficult to optimize (e.g., they copy a lot more data around than is strictly necessary, in order to slice and reshape data in a manner that is compatible with preexisting hardware-accelerator kernels).

What are your thoughts on this? Will Zygote provide any improvements or advantages on this front?


[a] https://news.ycombinator.com/item?id=20301619

[b] https://dl.acm.org/citation.cfm?id=3321441

[c] https://arxiv.org/abs/1710.09829

Zygote is an orthogonal piece of technology on this front and relies on a good optimizing compiler behind it to target actual hardware. Its focus is primarily on expressivity. We've been talking about automatic kernel generation for a while (and when I say kernel generation, what I mean is basically search over access patterns), but note that it's not quite as bad a problem in Julia, because you can use higher-order functions to get a lot of composability (e.g. if there's a hand-optimized parameterized matmul, fusing in inputs and outputs is just a matter of passing in extra functions). There's some academic work at the JuliaLab that's promising. We're also in discussions with commercial partners who care about this kind of thing to see if somebody is willing to fund it, but so far existing solutions have been good enough while we work on higher-priority items. I do agree that being limited to a few hand-tuned kernels is a significant drag on research, so I hope we can do something here.

In terms of trying to break free of dependence on hand-optimized kernels: a few people, myself included, have been working on some theoretical approaches to generating cache-efficient rearrangements for neural-net-like problems. We've worked it out for convolution-like problems [1] and have some upcoming results generalizing these techniques to other problems. Please feel free to email if you'd like to talk.

[1] https://arxiv.org/abs/1802.06905
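To make the higher-order-function fusion point concrete, here is a toy sketch in plain Python (illustrative only; `fused_matmul` and its parameters are made up for this example, not Julia or any real kernel API). A matmul parameterized by element-wise input/output functions lets you "fuse" pre- and post-processing by passing closures, with no temporary arrays:

```python
# Sketch: a matmul parameterized by element-wise input/output functions,
# so fusion is just passing closures instead of materializing intermediates.

def fused_matmul(A, B, fin=lambda x: x, fout=lambda x: x):
    """C[i][j] = fout(sum_k fin(A[i][k]) * fin(B[k][j]))."""
    n, m, p = len(A), len(B[0]), len(B)
    return [[fout(sum(fin(A[i][k]) * fin(B[k][j]) for k in range(p)))
             for j in range(m)] for i in range(n)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0, 0.0], [0.0, 1.0]]

# Fusing a doubling of the inputs and a ReLU on the output needs no
# temporary "2*A" or "2*B" arrays -- the closures run inside the kernel loop.
C = fused_matmul(A, B, fin=lambda x: 2 * x, fout=lambda x: max(x, 0.0))
print(C)  # [[4.0, 8.0], [12.0, 16.0]]
```

In Julia the same trick applies to an actual hand-tuned kernel, since the compiler can inline the passed-in functions into the inner loop.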

Thank you for your response. Makes sense, and I'm happy to hear you and others are aware of the issue.

PS. I now feel that I asked my question without first thinking about it a bit more; sorry about that. I temporarily "forgot" that Zygote is a source-to-source AD package because, as someone who is developing and iterating over deep learning models for potential deployment to production, I naturally tend to think in terms of monolithic software stacks -- e.g., "the TensorFlow stack," "the PyTorch stack," "the nascent Julia stack," and so on.

There are people in the Julia Lab working on high level tensor operation languages and compilers. It's a hard problem but one that many are interested in solving with Julia.

Chris: As a user of these tools, I cannot tell you how thankful I am for the work you, Keno, Mike and others do. (For those who don't know, Chris works closely with Mike, Keno, and others in the Julia team.)

I recognize that this is a hard problem.[a]

FWIW, I read or heard (can't remember which) that there are people working with Chris Lattner seeking to use predictive AI (instead of search and heuristics) to address this issue in MLIR. Let me add that my understanding of how that would work is very limited, though.

[a] Only superficially. As you can imagine, I'm dealing at a very different level of abstraction with my own set of problems and frustrations.

I'm not familiar with that part of the MLIR work, though I wouldn't be surprised if they are working on it. Google Brain is doing some of the finest work on ML for systems programming, so this'd be right up their alley. If it works well and they come up with some useful models, we'll be more than happy to incorporate them (or just use MLIR directly - there's a talk at JuliaCon next week from a Google engineer on using Julia as an MLIR frontend: https://pretalx.com/juliacon2019/talk/3YBZLC/).

Congratulations on some great work. I like the diversity shown in the examples. In particular, it's nice that you threw those working with stochastic processes a bone.

Does the AD algorithm support functions with variable input sizes? For example, when the input to a function is an array?

This is huge and I have high hopes for Zygote.jl.

Observation/request: For higher-order derivatives (Hessian, Laplacian, etc), AD libraries typically provide API shortcuts. I have found it difficult to control or predict the memory footprint and the time complexity of these API shortcuts.

Laplacian is case in point: it is sometimes computed by first computing the Hessian by forward-over-reverse and then taking the trace. In most libraries you would have to dig deep in the documentation or the code itself to understand how intermediary values are accumulated in memory for instance.

I would love to have a summary table of the complexity of all these API shortcuts, similar to https://wiki.python.org/moin/TimeComplexity for data-structures, but for these AD shortcuts (laplacian, divergence, hessian, etc). The table would show memory and time complexity in terms of the number of the input/intermediary/output dimensions and the complexity of evaluating the original function.
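To make the Laplacian example concrete, here is a toy Python sketch (finite differences standing in for real forward-over-reverse AD; `laplacian_via_hessian` is a name made up for this illustration). "Hessian then trace" pays for n full gradient evaluations and, in a real AD system, O(n^2) Hessian storage, even though the trace only needs the n diagonal entries:

```python
# Toy illustration of the hidden cost of "Hessian then trace".

def grad(f, x, h=1e-5):
    """Central-difference gradient (stand-in for one reverse-mode sweep)."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

def laplacian_via_hessian(f, x, h=1e-4):
    """n extra gradient sweeps; each computes a full Hessian row, of which
    only the diagonal entry survives the trace."""
    lap = 0.0
    for i in range(len(x)):  # one perturbation direction per input dimension
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        row_p, row_m = grad(f, xp), grad(f, xm)   # full rows computed...
        lap += (row_p[i] - row_m[i]) / (2 * h)    # ...but only entry [i] is used
    return lap

f = lambda x: sum(v**3 for v in x)  # exact Laplacian: 6*(x1 + x2) = 18 at (1, 2)
print(round(laplacian_via_hessian(f, [1.0, 2.0]), 3))  # -> 18.0
```

The kind of summary table I'm asking for would make this cost (n + 1 function sweeps, O(n^2) intermediate storage) visible without reading the implementation.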

Unfortunately, this isn't super easy, since the time complexity heavily depends on what optimizations the compiler will apply and how the higher-order AD is exactly implemented (there are many ways to do so that all give you the same answer, and you probably want the system to pick the best for you). What we could do fairly easily, however, is have a compiler introspection tool that runs through the computation but doesn't do any of the work, just records how much work it would have done. That would easily give you an idea and let you plot a parameter sweep over any dimension you care about, with actual results corresponding to the capabilities of the system.

Thanks for the reply. Such an introspection tool would be useful indeed--it seems already possible to do some inspection by hand with https://fluxml.ai/Zygote.jl/latest/adjoints/#Gradient-Reflec...

Anyway, great work and I look forward to what happens next!

I’ve just given this a cursory read but wanted to say that this looks really exciting. Julia in general seems like such an exciting language.

I’ve recently switched some Fortran simulations at work for a pure Julia implementation and it is great fun to write. DifferentialEquations is a phenomenal piece of work!

Is the performance equivalent in your scenario?

We have taken a slight performance hit, but the code is cleaner and much easier to extend.

If there's any benchmarks on the DiffEq side, feel free to reach out to us with them and we can take a crack at optimizing it. If it's something that's allowed to be shared, we'd love to add it to our DiffEqBenchmarks.jl repository to track the performance over time and understand the performance characteristics of your domain.

Is there a general way to do this with many of the libraries in Julia?

Any chance y'all could eventually use a test model and script from me in a benchmark test suite?

Ex: I have a large synthetic test model from a university and plan on writing some code that uses a bunch of the sparse matrix and linear solve functionality of the language. This would be on a sparse ~70k row/column square matrix.

At least with DiffEq we keep track of a lot of different benchmarks here: https://github.com/JuliaDiffEq/DiffEqBenchmarks.jl and it's always growing. Some of the best benchmarks have come from users who want us to track performance on their problem. Some other libraries have repos around, but not as formalized. We are getting things automated so that we can start helping others.

Your model sounds nice because it can use our automatic sparsity detection and matrix coloring. I'd love to give it a try.


Awesome! I wish I could give you a real industry one (even bigger), but a synthetic model should adequately reflect the problem domain and they have been the primary models used by researchers in my field for the past twenty years or so.

Edit: which Julia group should I get into contact with over this?

An alternative approach: https://www.microsoft.com/en-us/research/video/the-simple-es... where you define the right functional structures that allow a very compact and efficient way of expressing automatic differentiation, and then let the existing compiler do the heavy lifting.

I've read the corresponding paper, and while I think it's a fun read, I'm not sure it does particularly much to clear up the underlying confusion here. Reverse mode AD on straight-line primitives is describable in about three lines of exposition. The tricky part is how you handle control flow (equivalently, recursion). The description thereof then depends on what your underlying IR data structure is, but describing this generally isn't horrible either. The real problem in the literature is that there is lots of confusion over what is the AD algorithm and what are the workarounds applied to make it work in the particular system described.

For example, all the ML frameworks generally combine a high level tracer with a straight line AD transform, which then sometimes gets presented as a fundamental aspect of AD (which it isn't, you're just using the tracer to get an IR you can handle). As a result, this field has a lot of confused terminology, but all the results have been known since the 70s.

The name of the game for all the latest generation tools is to cleanly separate out the semantic AD transform from the rest of the system and then just use an unmodified compiler thereafter. We do it by transforming Julia IR, the Swift folks do it on SIL, and the Scala folks have an implementation using shift/reset and LMS. If you do this, a bunch of traditional AD techniques become just special cases of general purpose compiler transforms (DCE, CSE, etc). See this talk I gave recently for some more details: https://juliacomputing.com/blog/2019/02/19/growing-a-compile...

As an aside, a pet peeve of mine is people calling these algorithms "tape free". High storage requirements, due to the need to remember (or alternatively, recompute) intermediate values are a fundamental property of reverse mode AD and you can't get rid of it. The best you can do is hide it (e.g. in compiler managed stack or closure environments), with all the same fundamental challenges. This is probably a terminology clash, with people wanting to use "tape" as a term for a much more narrow kind of data structure generated from a tracing AD system, but the requirement to have some data structure like it, no matter how hidden is fundamental to the algorithm.
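For readers following along, the "straight-line reverse mode plus a recording of intermediate values" being described can be sketched in a few dozen lines of Python (a toy autograd-style tracer, not any particular library's API):

```python
# Overloaded operators record each straight-line operation (and the parent
# values needed for its local derivatives) into a graph-shaped "tape"; the
# backward pass walks it in reverse applying the chain rule. The retained
# intermediate values are exactly the storage cost discussed above.

class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # list of (parent_var, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def backward(out):
    """Topologically order the recorded graph, then apply the chain rule in reverse."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(out)
    out.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad

x, y = Var(3.0), Var(4.0)
z = x * y + x          # z = x*y + x
backward(z)
print(x.grad, y.grad)  # dz/dx = y + 1 = 5.0, dz/dy = x = 3.0
```

Note the tape is implicit in the `parents` links: the recorded operations and the remembered values live in the same structure, which is exactly the ambiguity behind the "tape-free" terminology debate below.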

I'm currently using TF 1.x for most of my AD needs, but I'm encountering its limits, so I'm always looking for neater solutions.

Conceptually, I enjoy the functional approach where the standard compiler does the work. But I'm still using a graph-based approach, where the order of operations is recorded on a "tape" and then replayed in reverse order.

The main reason is to be able to manage the memory and computation placement. To distribute huge computations among machines, while limiting memory transfers between machines/devices to a minimum.

Also, some kind of automatic gradient accumulation to reduce the memory need would be nice. Currently, using the graph-based approach, partitioning a big tensor into smaller ones and doing a map-reduce to accumulate the gradient works surprisingly well (except for the initial very long graph-creation time).

Every time I do one of those manual optimizations I start dreaming: "that should be done by the compiler". You tell the compiler your cluster definition and devices, you write a few for loops as if the code ran on a single CPU, and the compiler does the loop splitting, reordering, and parallelization across devices in an optimal way given your available memory constraints.

Sounds like you’d love julia!

> As an aside, a pet peeve of mine is people calling these algorithms "tape free".

The common autograd-style tape combines a recording of both the operations of a program and its intermediate values, making the term ambiguous; so when people say "tape-free" they mean avoiding recorded operations.

In the Julia world we've tried to disambiguate with "trace" and "tape" for operations and values, but that's not standard terminology, unfortunately.

Yes, I understand that, but I still think it's misleading. Even in Julia, we're still recording operations, just a) at a much coarser granularity (if there's no control flow in the middle, we can record it as one operation) and b) in the type of the tape data structure rather than explicitly.

For example, one could imagine applying common subexpression elimination to a tracer tape, and you would get essentially the same thing.

Another place to store this information is the stack of a closure chain, but it's still fundamentally the same information.

In a more traditional SCT AD like Tapenade, there isn't really anything that corresponds to recording of program operations at runtime. Zygote arguably does something semantically equivalent to recording a trace (at the interprocedural level), but of course the idea is that once you've optimised that away we approximate Tapenade – so the line is a bit blurred.

Don't get me wrong, I agree with your core point: "tape-free" conflates multiple things, and none of them really capture the AD design space in a useful way. Hopefully as the field settles down we'll figure out more useful axes for comparing these tools.

Is this somehow similar to the new Swift Automatic Differentiation feature? https://github.com/tensorflow/swift/blob/master/docs/Automat...

Yes, it's the same idea. But Julia's differentiable programming capabilities are far more advanced and mature than Swift's. As far as I'm aware, Swift still doesn't support differentiating code with control flow (branches or loops), which, needless to say, eliminates pretty much all non-trivial programs.

Compare that to the situation in Julia: ∂P works today on arbitrary programs—like the ray tracer and other examples in this paper—programs which are highly non-trivial and use iteration, recursion, mutation and global state. All of which Julia's ∂P can take derivatives through.

When you additionally consider Swift's essentially non-existent computational and data science ecosystem, it's a bit hard (for me at least) to rationalize the Swift ∂P effort. (Are we going to differentiate iPhone apps?) They're attempting to bootstrap their computational/data ecosystem by allowing calling Python code, but as soon as you call into Python, you lose the ability to take derivatives, which works only for pure Swift code. So any program that relies on Python for some of the computation you want to differentiate won't be differentiable, which kind of defeats the point of having ∂P in the first place. We'll see how it pans out, but the Swift effort has considerable technical and social challenges to overcome.

How are if conditions handled? They are not differentiable, right? Same for loops?

There's a whole paper explaining how it all works :)

Does Swift have an equivalent to Python's NumPy?

It seems to lack a nice way of doing vector and matrix operations.

They built in Python interop so that you can use numpy in TFSwift

Bridges always have trolls under them.

If you don't see the troll, that doesn't mean it isn't there, it's just waiting for you to cross.

Doesn’t that introduce serious overhead compared to first class native arrays as in Julia or FORTRAN?

My impression is that this is to bootstrap adoption, there is a Swift-native TensorFlow package being developed also.

Bridging into NumPy isn’t going to be useful unless it covers some AD system too (XLA?).

XLA is a compiler for array code. It doesn't come with AD -- you need a wrapper like TF or JAX (or Swift, I guess) for that.

I was imagining the FFI emitting XLA code needed to implement the TF/JAX routines behind the scenes, but it's wild speculation.

Correct: you can call Python from Swift OR you can use AD but not both at the same time.

Thanks for linking this. I have been curious how auto diff gets implemented.

Can't comment on the technical details/merit of the paper.

However, this is huge. From a conceptual perspective, everything in the physical world can be modeled as relative rates of change, which are basically differential equations. Having the power to more easily build, run and understand these models will help tremendously in advancing the computer/physical interface as well as all its potential applications.

Thank you to the authors for such a huge contribution. Looking forward to all the cool things that will be made possible through this.

I keep orbiting Julia, because of developments like this one. I am an old Lisper/Scheme guy. How does this compare to say R6RS-AD[1] or cl-autodiff[2] or Deriva for Clojure/Java[3]?

I realize Julia has femtolisp inside, yet another Lisp bait for me!

[1] https://github.com/qobi/R6RS-AD

[2] https://github.com/masonium/cl-autodiff

[3] https://github.com/lambder/Deriva

Those are based on overloading functions to add the new behaviors (computing the forward diff, or building the graph within a special variable for reverse mode), which in Julia is implemented in, for example, [1], [2], [3], and [4], while the one in the OP is based on source analysis and transformation.

And the similarities with Lisp are more than just the parser being written in it: Julia's programming paradigm is based on CLOS, everything is an expression, it has hygienic macros and reader macros, and code is data. It mostly just lacks sexprs.

[1] https://github.com/JuliaDiff/ForwardDiff.jl

[2] https://github.com/dfdx/Yota.jl

[3] https://github.com/denizyuret/AutoGrad.jl (from the Knet framework)

[4] https://fluxml.ai/Flux.jl/stable/internals/tracker/ (from the Flux framework)
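The overloading approach behind forward-mode packages like ForwardDiff.jl can be sketched with dual numbers (toy Python for illustration, not the actual ForwardDiff API): a value carries its derivative along, and overloaded arithmetic propagates both together.

```python
# A dual number holds (value, derivative); overloaded ops apply the
# sum and product rules, so derivatives fall out of ordinary evaluation.

class Dual:
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def derivative(f, x):
    """Seed the input with derivative 1 and read the result off the output."""
    return f(Dual(x, 1.0)).deriv

# The function is ordinary code, written with no knowledge of the AD system.
f = lambda x: 3 * x * x + 2 * x + 1
print(derivative(f, 5.0))  # f'(x) = 6x + 2, so 32.0
```

The source-transformation approach in the OP instead rewrites the code itself, so no special number type has to flow through the program at runtime.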

Wow, thank you for such an informative reply!

I am going to try this out. One of my pet peeves with Julia is that my main machine is a Windows 10 box, and things like CUDANative.jl say they only install on Mac and Linux. I have an old 2013 Linux box (Lenovo T430u laptop).

BTW, do you recommend Julia Pro install or vanilla Julia and build up for more general technical programming, not just ML and DL.

I prefer vanilla Julia since it has all the packages at the latest released versions and it's the most used, so it's easier to get help. CUDANative.jl's documentation doesn't mention being Mac/Linux only (the only mention was in the two-year-old preview release), and if it is, maybe you could try another library like ArrayFire.jl.

And more general technical programming is Julia's specialty (it was built for that as a high-performance interactive language); the DL hype started after the language was first released.

This looks great! The language features of Julia seem to be exploited to good effect. However, I'm curious: are there any aspects of Julia's design that, in retrospect, make differentiable programming harder or more inconvenient?

There are two competing currents here. On the one hand, Julia is extremely powerful and dynamic, so it has very high expressivity for any possible differentiable programming you could think of. It also has a fairly simple core, so as long as you know how to properly transform the core, you can get a mathematically correct differential. However, the reason Julia works so well is that the compiler is able to understand and eliminate all the dynamism and complexity for you at run time and generate very tight code as a result. When you add AD into the mix, the generated code becomes much, much more complicated, so the compiler needs to work a lot harder. That can sometimes be a cause for frustration for us, because we get something really cool working with absolutely zero effort, but then need to go back and make the compiler better to reclaim some performance (static languages have the opposite problem, where you need to improve the type system in order to allow the thing to happen, but once you have described it in a specific enough type system, you usually also have good performance - if your language is any good, that is).

Overall I think Julia is a pretty great language to implement AD (as evidenced I'd say by the 15 or so different AD packages that people have written for individual use cases before the latest push for a unified one), but it still is a very powerful language, so if you want your AD to handle the whole language (as we do), then you're gonna have to do a bit of work.

Thanks! So it sounds like what you're saying is that, as with other languages that have complex optimizers, there can be performance cliffs when you get too far off the beaten path?

For someone not working on the Julia compiler, how tricky is it to figure out what to do to improve performance?

This looks neat. Is this Python project doing the same thing? https://github.com/HIPS/autograd

Autograd (and most current approaches) works by having a special object for the data, with overloaded methods that, instead of immediately executing an operation, store it in a graph of transformations. Then, when you need the gradient, it applies the chain rule over this graph. Support for loops/control flow is possible since at each call you destroy and recreate the graph, which is not optimal for performance but makes it very dynamic (TensorFlow eager/PyTorch vs. the TensorFlow graph interface).

That's also an approach at which Julia excels because of multiple dispatch, which you can see explained in [1].

In that case you effectively have two separate languages: the language used to generate the graph, and the graph itself. This approach instead applies the transformation directly to the Julia IR to generate the gradient, as if you had written it by hand in Julia, side by side with code that is completely unaware of the transformation (which is what makes it possible to differentiate libraries that were built before this approach even existed). So the end product is something similar to the TensorFlow graph (it has all control flow already embedded and can be pre-optimized by a compiler), but even easier to write than TensorFlow eager (which is also the intent of Swift for TensorFlow).

[1] https://github.com/MikeInnes/diff-zoo
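A drastically simplified sketch of that source-transformation idea, in toy Python operating on a made-up straight-line IR (real Zygote works on Julia's SSA IR and handles control flow): the differentiation step runs ahead of time and emits new code, rather than tracing operations at runtime.

```python
# Program IR: t1 = x * y; t2 = t1 + x   (i.e. z = x*y + x)
forward = [("mul", "t1", "x", "y"),
           ("add", "t2", "t1", "x")]

def differentiate(ir, output):
    """Ahead-of-time transform: emit adjoint instructions for a
    straight-line IR (no control flow in this toy)."""
    adjoint = [("seed", "d_" + output)]
    for op, dest, a, b in reversed(ir):
        if op == "add":    # d(a) += d(dest); d(b) += d(dest)
            adjoint += [("acc", "d_" + a, "d_" + dest),
                        ("acc", "d_" + b, "d_" + dest)]
        elif op == "mul":  # d(a) += b * d(dest); d(b) += a * d(dest)
            adjoint += [("acc_mul", "d_" + a, b, "d_" + dest),
                        ("acc_mul", "d_" + b, a, "d_" + dest)]
    return adjoint

def run(ir, adjoint, env):
    """Interpret the forward program, then the generated adjoint program."""
    for op, dest, a, b in ir:
        env[dest] = env[a] * env[b] if op == "mul" else env[a] + env[b]
    grads = {}
    for ins in adjoint:
        if ins[0] == "seed":
            grads[ins[1]] = 1.0
        elif ins[0] == "acc":
            grads[ins[1]] = grads.get(ins[1], 0.0) + grads[ins[2]]
        else:  # acc_mul
            grads[ins[1]] = grads.get(ins[1], 0.0) + env[ins[2]] * grads[ins[3]]
    return grads

g = run(forward, differentiate(forward, "t2"), {"x": 3.0, "y": 4.0})
print(g["d_x"], g["d_y"])  # dz/dx = y + 1 = 5.0, dz/dy = x = 3.0
```

The key contrast with the tracing approach: `differentiate` never sees runtime values, so the adjoint program can be generated once and optimized by an ordinary compiler, instead of being rebuilt on every call.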

Would also be curious to understand how this compares. They seem to account for recursion and if-branches as well.

They're not really all that close. For instance, that Autograd library requires that you use its own version of NumPy instead of the regular version, because it can't differentiate C. Because most non-trivial Python code is actually written in C, there are almost no performant programs you can just differentiate out of the box unless Autograd itself has a fork of that package.

My understanding is that the python Autograd library is also quite slow (though it's been a while since I looked at it).

Zygote will give the same runtime performance as handwritten derivatives in many cases, and works across almost the entire Julia ecosystem, since nearly all Julia packages are truly written in Julia.

JAX (github.com/google/jax), which is being developed by many of the authors of Autograd, is probably a better comparison. At the cost of requiring you to rewrite control flow in a functional way, it eliminates Python's overhead by compiling NumPy code into XLA.

Thank you.

Does this support user defined data structures? This is a thing holding me back in Stan and other frameworks with AD (they only want scalars or arrays)

Julia doesn't even really have a distinction between user-defined and base data structures. Every Julia data structure that's not needed for bootstrap is implemented in Julia, and all are on an equal footing with a data structure I define at the REPL. Even numbers.

This is why Julia is so powerful for so many different use cases: when I make my own special custom array, number, string, or whatever type, it's a first-class citizen and, with enough optimization, will be just as fast, extensible, and generic as whatever is provided by Base.



"Without compromising on performance, Zygote supports the full flexibility and dynamism of the Julia language, including control flow, recursion, closures, structs, dictionaries, and more."

Ah OK, the linked page here didn't mention structs.

Cool and amazing. This thing should be used to enrich base Julia, not simply be an extension.

Julia is designed in such a way that third-party "extensions" (packages) are first-class citizens. Very little actually needs to be in Base or the standard library. When you type `using Zygote`, you're not using a programming language with a siloed AD extension; you've radically extended Julia into a differentiable programming language.

The beauty of Zygote.jl being a package is that we don't force one AD approach on the entire language ecosystem. Julia has several very promising AD systems in development, each of which is preferable for certain situations but are close enough in API that they're basically hot swappable.

When one of them needs a change made to base julia to better support the sort of compiler transformations they want to do, they make a PR and the new functionality gets implemented very quickly and all the other AD / compiler transformation packages benefit from the change.

Yes, this is exactly correct. We try very hard to extend Julia with "mechanisms" not "features", such that if you don't like the exact way we're doing it, you can just do your own (with blackjack), while still re-using as much of the rest of the system as possible. The same holds true for AD. We think (and we hope) that Zygote (and related packages) will be an extremely good, general purpose AD system that's a good default, but if you're in a specialized domain, you may be able to do a lot better with a specialized AD transform (e.g. your domain might want to do part of the computation in the frequency domain automatically, or have domain-specific optimizations that are not generally valid, etc.). Our user base is too varied to be able to prescribe a one-size-fits-all approach. The only thing we can do is provide the necessary mechanisms and high quality default implementations (as well as good explanations as to how things work) and let it loose on the world. Everything being in the same language helps enormously here, since even users coming from the scientific side are well versed enough in the language to at least be able to read and understand the semantics of code in the compiler (even if they think it employs dark magic).

That's awesome, good to know! One more reason to learn Julia!

Interesting project. I question the implications of using ML data shaping to define models. A sufficiently optimized model may be missing fundamental inputs & constraints, yet still perform better than a less optimized model with the correct inputs & constraints. This would effectively create a local maximum of understanding. It seems that, while the predictive capabilities of complex systems may increase in some cases, the practice of fundamental understanding & articulation of the system will decrease (i.e. I don't know how it works, the magic black box has all the answers).

Perhaps there are ways to mitigate these implications in the scientific process.

In this case it might be the exact opposite. Regardless of the structure (as long as it has at least one hidden layer and enough parameters), neural networks are universal approximators, which is why they are effectively black boxes that can learn any model. So using more custom structure does not increase the theoretical capability of the network to find more complex models; instead, you're effectively adding a priori information (by embedding knowledge of the problem within the network's structure), which can heavily restrict the search space, thereby making it somewhat less of a magic black box.

An example might be attention models, which, while more complex than previous models, give you more information about what the model does (by allowing you to visualize which inputs the following layer is using).

How is this different from JAX?

JAX is a sophisticated and well-implemented high-level tracer, combined with a backend compiler that makes use of XLA. High-level tracing has some fundamental drawbacks in key features we want (efficient scalar AD, for example), so the approach being advocated here is doing AD as a compiler transform on the original source language. Mike has some details in an earlier paper: https://arxiv.org/abs/1810.07951. As a shameless plug, you can plug (heh) Julia into XLA as well, e.g. for targeting TPUs (https://arxiv.org/abs/1810.09868). In fact, Julia and JAX are probably better at that than TensorFlow, because the TF execution model is a bit of a mismatch for what XLA expects.

The really big problem with tracing, `@jit` annotations, and similar approaches, is that you're no longer running Python, but something almost-but-not-quite Python – modifying the language's semantics is necessary to get performance.

In deep learning, tweaking some syntax and adding some annotations isn't a big deal, but for the use cases we're interested in we really don't want to rewrite every library to be AD compatible. Zygote is pretty unique (outside of the scientific computing world) in being able to take libraries that were written years before AD existed in Julia, and differentiate them correctly and efficiently.

There are other, more subtle and technical, issues with those approaches, but that's really the Big Deal.

How can one get involved in this project? Is there a discord or something where I can see what's happening?

The Julia discourse (https://discourse.julialang.org) site is a good place to engage. PRs, issues, etc. should go to the relevant package repos on Github - Zygote.jl, Cassette.jl, ForwardDiff.jl, model zoo in FluxML org and so on.

Me reading the title: “doesn’t Julia already have that?”

Me reading the paper: “oh”
