Second, let me bring up what I think is a significant issue. My perception is that most deep learning researchers and practitioners -- that includes me -- tend to iterate very rapidly over new ideas. We want to test ideas in code as quickly as possible, and in practice we will not invest the time and effort necessary to figure out how to write cache-optimized kernels (e.g., for GPUs) every time we might need one. In fact, I'd say the default attitude is that it doesn't even make sense for us to use (what we perceive as slow) automated kernel-writing tools to search for and compile optimized kernels. In practice, it always seems faster and easier (from a developer-time standpoint) to write code that leverages existing, inflexible, prebuilt, handtuned kernels that have already proven to work well, are fast, and are already nicely integrated with frameworks like PyTorch and TensorFlow.
There was a good discussion of this topic in the forums a few weeks ago[a] in response to a recent paper published by some folks at Google in which they make a compelling case that this issue is holding back AI research.[b] As an example, they use capsule networks[c], the implementations of which have proven difficult to optimize (e.g., they copy a lot more data around than is strictly necessary, in order to slice and reshape data in a manner that is compatible with preexisting hardware-accelerator kernels).
What are your thoughts on this? Will Zygote provide any improvements or advantages on this front?
PS. I now feel that I asked my question without first thinking about it a bit more; sorry about that. I temporarily "forgot" that Zygote is a source-to-source AD package because, as someone who is developing and iterating over deep learning models for potential deployment to production, I naturally tend to think in terms of monolithic software stacks -- e.g., "the TensorFlow stack," "the PyTorch stack," "the nascent Julia stack," and so on.
I recognize that this is a hard problem.[a]
FWIW, I read or heard (can't remember which) that there are people working with Chris Lattner seeking to use predictive AI (instead of search and heuristics) to address this issue in MLIR. Let me add that my understanding of how that would work is very limited, though.
[a] Only superficially. As you can imagine, I'm dealing at a very different level of abstraction with my own set of problems and frustrations.
Observation/request: For higher-order derivatives (Hessian, Laplacian, etc.), AD libraries typically provide API shortcuts. I have found it difficult to control or predict the memory footprint and the time complexity of these shortcuts.
The Laplacian is a case in point: it is sometimes computed by first computing the Hessian via forward-over-reverse and then taking the trace. In most libraries you would have to dig deep into the documentation or the code itself to understand, for instance, how intermediate values are accumulated in memory.
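To make that concrete, here is a minimal sketch of the forward-over-reverse pattern (my own toy code, not any particular library's API, and f is just a placeholder function): the Hessian is built by forward-differentiating a reverse-mode gradient and then traced, so the full n×n Hessian gets materialized even though only its diagonal is needed.

    using Zygote, ForwardDiff, LinearAlgebra

    f(x) = sum(sin, x) + prod(x)        # some scalar-valued test function

    # Forward (ForwardDiff) over reverse (Zygote): Jacobian of the gradient.
    hess(x) = ForwardDiff.jacobian(y -> Zygote.gradient(f, y)[1], x)

    # Laplacian as the trace of the Hessian -- the whole matrix is
    # materialized along the way, which is the hidden memory cost.
    laplacian(x) = tr(hess(x))

    laplacian(rand(3))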
I would love to have a summary table of the complexity of all these API shortcuts (laplacian, divergence, hessian, etc.), similar to https://wiki.python.org/moin/TimeComplexity for data structures. The table would show memory and time complexity in terms of the input/intermediate/output dimensions and the cost of evaluating the original function.
Anyway, great work and I look forward to what happens next!
I’ve recently switched some Fortran simulations at work to a pure Julia implementation, and it is great fun to write. DifferentialEquations is a phenomenal piece of work!
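Not my actual work simulation, obviously, but for anyone who hasn't seen it, a minimal DifferentialEquations.jl problem looks roughly like this (a toy exponential-decay ODE):

    using DifferentialEquations

    decay!(du, u, p, t) = (du[1] = -p[1] * u[1])     # du/dt = -p*u
    prob = ODEProblem(decay!, [1.0], (0.0, 5.0), [0.3])
    sol = solve(prob, Tsit5())
    sol(2.5)                 # dense-output interpolation at t = 2.5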
Any chance y'all could eventually use a test model and script from me in a benchmark test suite?
Ex: I have a large synthetic test model from a university and plan on writing some code that uses a bunch of the sparse matrix and linear solve functionality of the language. This would be on a sparse square matrix with ~70k rows/columns.
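For scale, the kind of workload I have in mind is roughly this, with a random sparse matrix standing in for the real model (the density and the added identity are just there to make the toy solvable):

    using SparseArrays, LinearAlgebra

    n = 70_000
    A = sprand(n, n, 1e-4) + I      # random sparse square matrix, shifted to keep it nonsingular
    b = rand(n)
    x = A \ b                       # sparse direct solve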
Your model sounds nice because it can use our automatic sparsity detection and matrix coloring. I'd love to give it a try.
Edit: which Julia group should I get in contact with about this?
For example, all the ML frameworks generally combine a high level tracer with a straight line AD transform, which then sometimes gets presented as a fundamental aspect of AD (which it isn't; you're just using the tracer to get an IR you can handle). As a result, this field has a lot of confused terminology, but all the results have been known since the 70s.
The name of the game for all the latest generation tools is to cleanly separate out the semantic AD transform from the rest of the system and then just use an unmodified compiler thereafter. We do it by transforming Julia IR, the Swift folks do it on SIL, and the Scala folks have an implementation using shift/reset and LMS. If you do this, a bunch of traditional AD techniques become just special cases of general purpose compiler transforms (DCE, CSE, etc). See this talk I gave recently for some more details: https://juliacomputing.com/blog/2019/02/19/growing-a-compile...
As an aside, a pet peeve of mine is people calling these algorithms "tape free". High storage requirements, due to the need to remember (or alternatively, recompute) intermediate values, are a fundamental property of reverse-mode AD and you can't get rid of them. The best you can do is hide them (e.g. in compiler-managed stack or closure environments), with all the same fundamental challenges. This is probably a terminology clash, with people wanting to use "tape" as a term for a much narrower kind of data structure generated by a tracing AD system, but the requirement to have some data structure like it, no matter how hidden, is fundamental to the algorithm.
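A tiny example of where that hidden storage lives in Zygote: the pullback closure returned by Zygote.pullback has captured the intermediate values (here, e.g. exp(1.0) and sin(1.0)) that the reverse pass needs, so the "tape" is still there, just tucked into a closure environment rather than an explicit data structure.

    using Zygote

    y, back = Zygote.pullback(x -> exp(x) * sin(x), 1.0)
    grad, = back(1.0)       # run the reverse pass with seed 1.0, reusing the captured values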
Conceptually, I enjoy the functional approach where the standard compiler does the work. But I'm still using a graph-based approach, where the order of operations is recorded on a "tape" and then replayed in reverse order.
The main reason is to be able to manage memory and computation placement: distributing huge computations across machines while keeping memory transfers between machines/devices to a minimum.
Also, some kind of automatic gradient accumulation to reduce the memory footprint would be nice. Currently, using the graph-based approach, partitioning a big tensor into smaller ones and doing a map-reduce to accumulate the gradient works surprisingly well (except for the initial, very long graph-creation time).
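A rough sketch of what I mean (the names are just illustrative, and it assumes the loss decomposes as a sum over rows of the data, which is what makes the slice-and-sum decomposition valid):

    using Zygote

    loss(w, X) = sum(abs2, X * w)        # toy loss, additive over rows of X

    # Split the big tensor into row slices, take each slice's gradient w.r.t.
    # the parameters, and reduce by summation; peak memory is set by one slice.
    function accumulated_grad(w, X; nslices = 4)
        slices = Iterators.partition(axes(X, 1), cld(size(X, 1), nslices))
        mapreduce(+, slices) do rows
            Zygote.gradient(p -> loss(p, X[rows, :]), w)[1]
        end
    end

    X = rand(1_000, 8); w = rand(8)
    accumulated_grad(w, X) ≈ Zygote.gradient(p -> loss(p, X), w)[1]   # same gradient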
Every time I do one of those manual optimizations I start dreaming: "that should be done by the compiler". You tell the compiler your cluster definition and devices, and you write a few for loops as if the code ran on a single CPU. The compiler then does the loop splitting, reordering, and parallelization across devices in an optimal way given your available memory constraints.
The common autograd-style tape combines a recording of both the operations of a program and its intermediate values, making the term ambiguous; so when people say "tape-free" they mean avoiding recorded operations.
In the Julia world we've tried to disambiguate with "trace" and "tape" for operations and values, but that's not standard terminology, unfortunately.
For example, one could imagine applying common subexpression elimination to a tracer tape, and you would get essentially the same thing.
Other places to store this information are the stack or a chain of closures, but it's still fundamentally the same information.
Don't get me wrong, I agree with your core point: "tape-free" conflates multiple things, and none of them really capture the AD design space in a useful way. Hopefully as the field settles down we'll figure out more useful axes for comparing these tools.
Compare that to the situation in Julia: ∂P works today on arbitrary programs—like the ray tracer and other examples in this paper—programs which are highly non-trivial and use iteration, recursion, mutation and global state. All of which Julia's ∂P can take derivatives through.
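A tiny made-up example of what that means in practice: an ordinary Julia function with a loop and a data-dependent branch, differentiated directly, with no tracing or graph building.

    using Zygote

    function iterate_map(x, n)
        for _ in 1:n
            x = x > 1 ? x / 2 : x * (1 - x)   # data-dependent control flow
        end
        return x
    end

    Zygote.gradient(x -> iterate_map(x, 10), 0.3)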
When you additionally consider Swift's essentially non-existent computational and data science ecosystem, it's a bit hard (for me at least) to rationalize the Swift ∂P effort. (Are we going to differentiate iPhone apps?) They're attempting to bootstrap their computational/data ecosystem by allowing Swift to call Python code, but as soon as you call into Python, you lose the ability to take derivatives, since that only works for pure Swift code. So any program which relies on Python for part of the computation you want to differentiate won't be differentiable, which kind of defeats the point of having ∂P in the first place. We'll see how it pans out, but the Swift effort has considerable technical and social challenges to overcome.
It seems to lack a nice way of doing vector and matrix operations.
If you don't see the troll, that doesn't mean it isn't there, it's just waiting for you to cross.
However, this is huge. From a conceptual perspective, everything in the physical world can be modeled as relative rates of change, which are basically differential equations. Having the power to more easily build, run and understand these models will help tremendously in advancing the computer/physical interface as well as all its potential applications.
Thank you to the authors for such a huge contribution. Looking forward to all the cool things that will be made possible through this.
I realize Julia has femtolisp inside, yet another Lisp bait for me!
And the similarities with Lisp go beyond the parser being written in it: Julia's programming paradigm is based on CLOS, everything is an expression, it has hygienic macros and reader macros, and code is data. It mostly just lacks s-expressions.
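For instance (a toy example of the code-is-data point), a macro receives the expression tree of its argument and can rewrite it before compilation:

    # The macro gets the unevaluated expression and returns a new one.
    macro twice(ex)
        return quote
            $(esc(ex))
            $(esc(ex))
        end
    end

    @twice println("hello")    # expands to two println calls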
 https://github.com/denizyuret/AutoGrad.jl (from the Knet framework)
 https://fluxml.ai/Flux.jl/stable/internals/tracker/ (from the Flux framework)
I am going to try this out. One of my pet peeves with Julia is that my main machine is a Windows 10 box, and things like CUDAnative.jl say they only install on Mac and Linux. I have an old 2013 Linux box (a Lenovo T430u laptop).
BTW, do you recommend the JuliaPro install, or vanilla Julia and building up from there, for more general technical programming, not just ML and DL?
And more general technical programming is Julia's specialty (it was built for that, as a high-performance interactive language); the DL hype started after the language was first released.
Overall I think Julia is a pretty great language to implement AD (as evidenced I'd say by the 15 or so different AD packages that people have written for individual use cases before the latest push for a unified one), but it still is a very powerful language, so if you want your AD to handle the whole language (as we do), then you're gonna have to do a bit of work.
For someone not working on the Julia compiler, how tricky is it to figure out what to do to improve performance?
That's also an approach at which Julia excels because of multiple dispatch, which you can see explained in .
In that case you effectively have two separate languages: the language used to generate the graph, and the graph itself. This approach instead applies the transformation directly to the Julia IR to generate the gradient code, as if you had written it directly in Julia, side by side with code that is completely unaware of that transformation (which is what makes it possible to differentiate libraries that were built before this approach even existed). So the end product is something similar to the TensorFlow graph (it has all control flow already embedded and can be pre-optimized by a compiler), but even easier to write than TensorFlow eager (which is also the intent of Swift for TensorFlow).
My understanding is that the Python Autograd library is also quite slow (though it's been a while since I looked at it).
Zygote will give the same runtime performance as handwritten derivatives in many cases, and it works across almost the entire Julia ecosystem, since nearly all Julia packages are truly written in Julia.
This is why Julia is so powerful for so many different use cases: when I make my own special custom array, number, string, or whatever type, it's a first-class citizen, and with enough optimization it will be just as fast, extensible, and generic as whatever is provided by Base.
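A toy illustration of what I mean (a hand-rolled dual number, which generic code written with no knowledge of it accepts unchanged):

    struct MyDual <: Real
        val::Float64
        der::Float64
    end

    Base.:+(a::MyDual, b::MyDual) = MyDual(a.val + b.val, a.der + b.der)
    Base.:*(a::MyDual, b::MyDual) = MyDual(a.val * b.val, a.der * b.val + a.val * b.der)

    # This generic function knows nothing about MyDual, yet it just works:
    poly(x) = x * x + x
    poly(MyDual(3.0, 1.0))     # MyDual(12.0, 7.0): value and derivative in one pass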
"Without compromising on performance, Zygote supports the full flexibility and dynamism of the Julia language, including control flow, recursion, closures, structs, dictionaries, and more."
The beauty of Zygote.jl being a package is that we don't force one AD approach on the entire language ecosystem. Julia has several very promising AD systems in development, each of which is preferable in certain situations, but they are close enough in API that they're basically hot swappable.
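A rough sketch of what "close enough in API" looks like in practice (the grad wrapper here is my own glue code, not something any of these packages actually export):

    using Zygote, ForwardDiff

    f(x) = sum(abs2, x) / 2

    # One-line switch between a reverse-mode and a forward-mode backend.
    grad(::Val{:zygote},  f, x) = Zygote.gradient(f, x)[1]
    grad(::Val{:forward}, f, x) = ForwardDiff.gradient(f, x)

    x = rand(5)
    grad(Val(:zygote), f, x) ≈ grad(Val(:forward), f, x)    # same answer either way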
When one of them needs a change made to base Julia to better support the sort of compiler transformations they want to do, they make a PR, the new functionality gets implemented very quickly, and all the other AD / compiler-transformation packages benefit from the change.
Perhaps there are ways to mitigate these implications in the scientific process.
An example might be attention models, which, while more complex than previous models, give you more information about what the model does (by allowing you to visualize which inputs the following layer is using).
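For instance (a bare-bones sketch, not any particular model), the attention weights themselves are a per-input quantity you can inspect or plot:

    using LinearAlgebra

    softmax(v) = exp.(v .- maximum(v)) ./ sum(exp.(v .- maximum(v)))

    Q = rand(8)                    # query vector
    K = rand(8, 5)                 # 5 candidate inputs, 8-dim embeddings
    weights = softmax(K' * Q)      # one weight per input -- these are what you visualize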
In deep learning, tweaking some syntax and adding some annotations isn't a big deal, but for the use cases we're interested in we really don't want to rewrite every library to be AD compatible. Zygote is pretty unique (outside of the scientific computing world) in being able to take libraries that were written years before AD existed in Julia, and differentiate them correctly and efficiently.
There are other, more subtle and technical, issues with those approaches, but that's really the Big Deal.
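A small, hedged example of that point, using a standard-library function as a stand-in for a "legacy" library (Statistics predates Zygote by years and was written with no knowledge of AD):

    using Zygote, Statistics

    # var is ordinary Julia code, yet it can be differentiated as-is because the
    # transform operates on Julia IR rather than a restricted ML DSL.
    Zygote.gradient(x -> var(x), rand(10))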
Me reading the paper: “oh”