Hacker News new | past | comments | ask | show | jobs | submit login
The Simple Essence of Automatic Differentiation [pdf] (conal.net)
167 points by bmc7505 5 months ago | hide | past | web | favorite | 16 comments

There's more information here:


That said, has anyone had a look at the paper? What I can't figure out is the assertion that, "In contrast to commonly used RAD implementations, the algorithms defined here involve no graphs, tapes, variables, partial derivatives, or mutation. They are inherently parallel-friendly, correct by construction, and usable directly from an existing programming language with no need for new data types or programming style, thanks to use of an AD-agnostic compiler plugin." That would be fantastic, but that's not immediately obvious to me. Generally, reverse mode is extremely non-parallel friendly because we're essentially traversing a computation graph. That traversal can be done in parallel, but figuring out how to partition that graph is probably more expensive that traversing it in serial. Now, the paper makes a big deal about not building that graph and passing along the derivative at the same time as the function. Fine. But does that method really produce a more parallel friendly code and is that as efficient as a more traditional graph or tape based method? Unless I'm missing something, I don't see any benchmarks of the method and what would be really helpful would be a ratio of the run time of a routine versus the run time of the routine with the AD method instrumented and the gradient calculated. The big question is how does this ratio hold as the method is scaled up in terms of variables. If the parallel claim holds true, then we should be able to also calculate the resulting parallel efficiencies.

Anyway, I'm genuinely curious if someone has additional information about the computational viability.

I don't think Conal makes any performance claims. The code is more parallel friendly because it doesn't involve global state, but for reverse mode AD you are not calculating derivatives as you go, but instead building up a function that will calculate the derivatives at the end. That's the effectively the same operation as normal reverse mode AD.

Anyway, this discussion still misses the point. The actual program transformation that powers this paper is "compiling to categories", which is essentially making the computation graph explicit in a modular way. This is why you can go through the program in parallel afterwards - sharing and dependencies are already made explicit and there is no need to discover them as you go.

The main innovation of the paper is the factorization of different AD modes in terms of simple transformations on categories (e.g. the Yoneda embedding applied to forward AD gives you reverse AD). In this format it is easy to see that the methods are correct and easy to extend the code since there just isn't very much of it.

This is a great paper (and talks!). It's taking a messy algorithm and factorizing it into simple and reusable components.

I can't tell because I don't speak category theory, but I've been told the essence is similar to how I've implemented the neural networks for spaCy: https://github.com/explosion/thinc

If so, then it definitely works fine. On the other hand I don't think parallelism is so easy to achieve, so maybe it's not the same thing after all.

My understanding is also that it is. Your implementation drove the point home for me that it can be practical.

The dictionary is rather simple

- a lambda \x . f(x) is known as an "internal hom object"

- you get a closure by using the partial application morphism (see https://en.wikipedia.org/wiki/Cartesian_closed_category)

- copying a variable is known as Δ : x ↦ (x,x)

- there is a contravariant functor T^\star: X → T^\star X between a manifold X and its cotangent space, which maps f: X → Y to its adjoint f^\star: T^\star Y → T^\star X

- you can then consider the product of the category of smooth manifolds Man × Man^{op} and define a functor Man → Man × Man^{op}, which maps a manifold X to the product of manifolds (X,T^\star X) and a morphism f : X → Y to the morphism (f,f^\star).

- All you have to do to get from your implementation to the notions in the paper is to convert the prescription you give into a point free style and apply some jargon (You need things like the partial application map and evaluation map, which are defined in any closed cartesian category, but easily translate to programming languages as well)

I don't think there are any great advantages for computational viability. He is introducing a new category-theoretic language, but in the end the same computations have to be done.

In his language, there is no "tape" because instead there is "continuation passing" which plays exactly the same role. I guess he hopes to construct the continuation-passing form at compile time, which limits the application to cases where the tape would be small anyway.

He hints that his approach should allow more parallelization but I don't see it. In practice the people trying to build AD engines are paying attention to the actual computation being done, and understand what can be parallelized. A more abstract understanding of the problem is little help there I think.

This paper won a best paper award at ICFP 2018. Conal Elliott recently gave a longer talk about it at Microsoft Research: https://www.youtube.com/watch?v=ne99laPUxN4

Someone introduced me to Conal's work a few days ago and recommended I watch his "Compiling to Categories" talk which I hadn't seen, and it just so happened that I watched it earlier today before this talk/paper posted. After viewing both for the first time today and reviewing the corresponding papers, one thing jumps out that's certain, and you can see it crystal clear. Conal is brilliant. And I don't say that lightly. The clarity of his work and his perspective on the problems is pristine. It's rare when someone is able to cut through the noise and peer into to the essence because there's no map and it's so hard to achieve the level of clarity that allows you to see the path that breaks through, but somehow he found the insight and nailed it. Damn, well done!

To state it more clearly than some of the other comments, part of the code insight this paper provides is the following:

Subject to transformation of your program into the first order / categorical representation, reverse mode auto differentiation is simply about

1) transforming the program into the continuation passing style category (this “flips” the shape of the compose operation in the category of our program)

2) do the same calculation / transform as you’d do for forward mode. Aka the really simple derivative calc we learn in high school.

There’s an extra trick to be done to make things even better, but that’s the meat of it. That reverse mode becomes (almost) as simple as forward mode to implement. Subject to your language being in this sort of categorical first order representation :)

Reverse mode being simple by any name to explain is a huge innovation alone.

Does this mean that any research into optimizing the performance of languages supporting continuations could improve RAD performance in other systems like PyTorch?

(Sorry if naive question, but your comment was very helpful to see the connection)

Compiling with continuations is a very well understood topic. (“Compiling with continuations” is also a book on the very same topic, that’s 25 years old now? )

It’s a compiler technique. So it can be used anywhere compilers are.

The title seems pretty cool!

Could anyone care to tell me if there is a motivation to learn this topic for ordinary ML enginneer such as myself. It seems these ideas presented in this paper are really helpful for those who develop DL frameworks like pytorch, but what about for those who only use frameworks?

However, regardless of how useful these ideas for me, I really respect researchers who publish excellent papers.

AD is important to understand for ML practitioners in the same way as compilers are important to understand for programmers. You can get away without knowing all the details, but it helps to understand where your gradients come from. However this paper is probably not be a good place to start if you're new to AD. If you want a better introduction, here are a few good resources:

Autodidact is a pedagogical implementation of AD: https://github.com/mattjj/autodidact

A nice literature review from JMLR: http://www.jmlr.org/papers/volume18/17-468/17-468.pdf

This paper reinterprets AD through the lens of category theory, an abstraction for modeling a wide class of problems in math and CS. It provides a language to describe these problems in a simple and powerful way, and is the foundation for a lot of work in functional programming (if you're interested in that kind of stuff). There was a thread on HN recently that discusses why category theory is useful: https://news.ycombinator.com/item?id=18267536

"Category Theory for the Working Hacker" by Philip Wadler is a great talk if you're interested in learning more: https://www.youtube.com/watch?v=gui_SE8rJUM

Also recommend checking out Bartosz Milewski's "Category Theory for Programmers": https://github.com/hmemcpy/milewski-ctfp-pdf

You actually want to know the gist of how these autodiff libraries work to know a) which approaches are fast and which approaches lead to giant complex gradient graphs. b) which approaches are stable and which lead to numerically unstable gradients.

You would be surprised how many code is out there (even for influential papers) whose graphs are obviously bad or whose graph could be fixed easily for more stability. Because people don't think about what the gradient looks like, while they should.

To see automatic differentiation in use in an entirely different field, check out page 81 (PDF page 95) of Stephan Schlamminger's dissertation.


This radically simplified an extremely difficult calculation necessary for a measurement of Newton's gravitational constant. It is the only time I've ever seen automatic differentiation in use in physics.

>> Machine learning and other gradient-based optimization problems

Machihe learning is not a "gradient-based optimisation problem". Gradient optimisation is a machine learning technique. There are many others that involve no differentiation whatsoever.

What is going on here? Every few weeks I hear someone talking about machine learning, or AI, etc, who clearly has no idea what they're talking about, even if they do undestand maths and computer science, like the paper above. Where do people get this confidence from, that they know all they need to know about a field that is probably older than themselves?

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact