
The Simple Essence of Automatic Differentiation [pdf] - bmc7505
http://conal.net/papers/essence-of-ad/essence-of-ad-icfp.pdf
======
kxyvr
There's more information here:

[https://github.com/conal/talk-2018-essence-of-ad](https://github.com/conal/talk-2018-essence-of-ad)

That said, has anyone had a look at the paper? What I can't figure out is the
assertion that, "In contrast to commonly used RAD implementations, the
algorithms defined here involve no graphs, tapes, variables, partial
derivatives, or mutation. They are inherently parallel-friendly, correct by
construction, and usable directly from an existing programming language with
no need for new data types or programming style, thanks to use of an AD-
agnostic compiler plugin." That would be fantastic, but it's not immediately
obvious to me.

Generally, reverse mode is extremely unfriendly to parallelism because we're
essentially traversing a computation graph. That traversal can be done in
parallel, but figuring out how to partition the graph is probably more
expensive than traversing it serially. Now, the paper makes a big deal about
not building that graph and instead passing the derivative along with the
function. Fine. But does that method really produce more parallel-friendly
code, and is it as efficient as a more traditional graph- or tape-based
method?

Unless I'm missing something, I don't see any benchmarks of the method. What
would be really helpful is the ratio of the run time of a routine to the run
time of the same routine instrumented with the AD method and the gradient
calculated. The big question is how this ratio holds up as the method is
scaled in the number of variables. If the parallelism claim holds true, then
we should also be able to calculate the resulting parallel efficiencies.
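
For concreteness, "passing the derivative along with the function" looks
roughly like the following Haskell sketch; the names are mine, and the
derivative is represented as a plain function rather than the paper's
generalised linear maps:

    -- A differentiable map carries, at each point, the value together
    -- with a linear approximation (represented here as a plain function).
    newtype D a b = D (a -> (b, a -> b))

    -- Composition applies the chain rule: the derivatives compose too.
    compose :: D b c -> D a b -> D a c
    compose (D g) (D f) = D $ \a ->
      let (b, f') = f a
          (c, g') = g b
      in  (c, g' . f')

    -- Example: square, together with its derivative at the point.
    square :: D Double Double
    square = D $ \x -> (x * x, \dx -> 2 * x * dx)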

Anyway, I'm genuinely curious if someone has additional information about the
computational viability.

~~~
syllogism
I can't tell because I don't speak category theory, but I've been told the
essence is similar to how I've implemented the neural networks for spaCy:
[https://github.com/explosion/thinc](https://github.com/explosion/thinc)

If so, then it definitely works fine. On the other hand, I don't think
parallelism is so easy to achieve, so maybe it's not the same thing after all.

~~~
orbifold
My understanding is also that it's similar. Your implementation drove the
point home for me that the approach can be practical.

The dictionary is rather simple:

- a lambda \x . f(x) is known as an "internal hom object"

- you get a closure by using the partial application morphism (see
[https://en.wikipedia.org/wiki/Cartesian_closed_category](https://en.wikipedia.org/wiki/Cartesian_closed_category))

- copying a variable is known as Δ : x ↦ (x,x)

- there is a contravariant functor T^\star: X → T^\star X between a manifold
X and its cotangent bundle, which maps f: X → Y to its adjoint f^\star: T^\star
Y → T^\star X

- you can then consider the product of the category of smooth manifolds Man ×
Man^{op} and define a functor Man → Man × Man^{op}, which maps a manifold X to
the product of manifolds (X, T^\star X) and a morphism f : X → Y to the
morphism (f, f^\star) (see the sketch after this list)

- All you have to do to get from your implementation to the notions in the
paper is to convert the prescription you give into a point-free style and
apply some jargon (you need things like the partial application map and
evaluation map, which are defined in any cartesian closed category, but
translate easily to programming languages as well)
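
To make the (X, T^\star X) functor concrete, here is a rough Haskell sketch;
the type name and the representation of cotangent vectors (as plain values of
the same type) are my own simplifications, not the paper's exact definitions:

    import Prelude hiding (id, (.))
    import Control.Category

    -- A morphism carries the forward map together with, at each point,
    -- its adjoint (pullback) acting on cotangent vectors.
    newtype Pullback a b = Pullback (a -> (b, b -> a))

    instance Category Pullback where
      id = Pullback (\x -> (x, id))
      Pullback g . Pullback f = Pullback $ \a ->
        let (b, f') = f a      -- forward through f
            (c, g') = g b      -- forward through g
        in  (c, f' . g')       -- adjoints compose in the opposite order

    -- Example: f(x) = x * x, with its pullback at each point.
    squareP :: Pullback Double Double
    squareP = Pullback (\x -> (x * x, \dy -> 2 * x * dy))

The adjoints composing in the opposite order to the forward maps is exactly
the "reverse" in reverse mode.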

------
bmc7505
This paper won a best paper award at ICFP 2018. Conal Elliott recently gave a
longer talk about it at Microsoft Research:
[https://www.youtube.com/watch?v=ne99laPUxN4](https://www.youtube.com/watch?v=ne99laPUxN4)

~~~
espeed
Someone introduced me to Conal's work a few days ago and recommended I watch
his "Compiling to Categories" talk, which I hadn't seen, and it just so
happened that I watched it earlier today, before this talk/paper was posted.
After viewing both for the first time today and reviewing the corresponding
papers, one thing jumps out as certain, and you can see it crystal clear:
Conal is brilliant. And I don't say that lightly. The clarity of his work and
his perspective on the problems is pristine. It's rare that someone is able to
cut through the noise and peer into the essence, because there's no map and
it's so hard to achieve the level of clarity that lets you see the path that
breaks through, but somehow he found the insight and nailed it. Damn, well
done!

------
carterschonwald
To state it more clearly than some of the other comments, part of the core
insight this paper provides is the following:

Subject to transforming your program into the first-order / categorical
representation, reverse-mode automatic differentiation is simply about:

1) transforming the program into the continuation passing style category (this
“flips” the shape of the compose operation in the category of our program)

2) doing the same calculation / transform as you'd do for forward mode, i.e.
the really simple derivative calculation we learn in high school.

There's an extra trick to be done to make things even better, but that's the
meat of it: reverse mode becomes (almost) as simple as forward mode to
implement, subject to your language being in this sort of categorical
first-order representation :)

Reverse mode being simple to explain, by any name, is a huge innovation on its
own.
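
For what it's worth, here is a tiny Haskell sketch of step (1); the names are
mine, not the paper's exact types, and everything is specialised to plain
functions rather than general linear maps:

    -- Represent a map a -> b by its action on continuations,
    -- (b -> r) -> (a -> r).
    newtype Cont r a b = Cont ((b -> r) -> (a -> r))

    -- Embed an ordinary function into the continuation representation.
    toCont :: (a -> b) -> Cont r a b
    toCont f = Cont (\k -> k . f)

    -- Composition "flips": the underlying maps end up composed in
    -- reverse order, so running the plain forward-mode construction
    -- in this representation yields reverse mode.
    composeC :: Cont r b c -> Cont r a b -> Cont r a c
    composeC (Cont g) (Cont f) = Cont (f . g)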

~~~
marmaduke
Does this mean that any research into optimizing the performance of languages
supporting continuations could improve RAD performance in other systems like
PyTorch?

(Sorry if naive question, but your comment was very helpful to see the
connection)

~~~
carterschonwald
Compiling with continuations is a very well understood topic. ("Compiling with
Continuations" is also a book on the very same topic; that must be 25 years
old by now?)

It’s a compiler technique. So it can be used anywhere compilers are.

------
warabe
The title seems pretty cool!

Would anyone care to tell me whether there is any motivation for an ordinary
ML engineer such as myself to learn this topic? It seems the ideas presented
in this paper are really helpful for those who develop DL frameworks like
PyTorch, but what about those who only use the frameworks?

However, regardless of how useful these ideas are for me, I really respect
researchers who publish excellent papers.

~~~
bmc7505
AD is important to understand for ML practitioners in the same way as
compilers are important to understand for programmers. You can get away
without knowing all the details, but it helps to understand where your
gradients come from. However, this paper is probably not a good place to
start if you're new to AD. If you want a better introduction, here are a few
good resources:

Autodidact is a pedagogical implementation of AD:
[https://github.com/mattjj/autodidact](https://github.com/mattjj/autodidact)

A nice literature review from JMLR:
[http://www.jmlr.org/papers/volume18/17-468/17-468.pdf](http://www.jmlr.org/papers/volume18/17-468/17-468.pdf)

This paper reinterprets AD through the lens of category theory, an abstraction
for modeling a wide class of problems in math and CS. It provides a language
to describe these problems in a simple and powerful way, and is the foundation
for a lot of work in functional programming (if you're interested in that kind
of stuff). There was a thread on HN recently that discusses why category
theory is useful:
[https://news.ycombinator.com/item?id=18267536](https://news.ycombinator.com/item?id=18267536)

"Category Theory for the Working Hacker" by Philip Wadler is a great talk if
you're interested in learning more:
[https://www.youtube.com/watch?v=gui_SE8rJUM](https://www.youtube.com/watch?v=gui_SE8rJUM)

Also recommend checking out Bartosz Milewski's "Category Theory for
Programmers": [https://github.com/hmemcpy/milewski-ctfp-pdf](https://github.com/hmemcpy/milewski-ctfp-pdf)

------
carapace
Cf. "Compiling to categories" [http://conal.net/papers/compiling-to-
categories/](http://conal.net/papers/compiling-to-categories/)

Also [http://www.autodiff.org/](http://www.autodiff.org/)

------
ISL
To see automatic differentiation in use in an entirely different field, check
out page 81 (PDF page 95) of Stephan Schlamminger's dissertation.

[http://www.schlammi.com/pdf/diss.pdf](http://www.schlammi.com/pdf/diss.pdf)

This radically simplified an extremely difficult calculation necessary for a
measurement of Newton's gravitational constant. It is the only time I've ever
seen automatic differentiation in use in physics.

------
YeGoblynQueenne
>> Machine learning and other gradient-based optimization problems

Machine learning is not a "gradient-based optimisation problem". Gradient-based
optimisation is a machine learning technique. There are many others that
involve no differentiation whatsoever.

What is going on here? Every few weeks I hear someone talking about machine
learning, or AI, etc., who clearly has no idea what they're talking about, even
if they do understand maths and computer science, like the paper above. Where
do people get this confidence from, that they know all they need to know about
a field that is probably older than themselves?

