
Demystifying Differentiable Programming (2018) - noelwelsh
https://arxiv.org/abs/1803.10228
======
noelwelsh
From the programming language point of view, what's interesting to me is that
this means, I think, that there is a differentiation monad. This arises
because monads are equivalent to continuations and hence anything you can
express with continuations can also be expressed as a monad.

I haven't seen this discussed before, except in the context of category theory
([https://ncatlab.org/nlab/show/differentiation](https://ncatlab.org/nlab/show/differentiation))
I'm not sure how this ties into the programming language point of view.

~~~
edflsafoiewq
Can you unpack this a little? What exactly is the monad?

~~~
noelwelsh
If you speak Scala, here's code for reverse mode differentiation as a monad.
The flatMap / bind operation becomes the chain rule.

If this "explanation" (scare quotes because it's not really an explanation,
just a code dump) doesn't work for you let me know and I can try a different
approach.

    
    
      final case class AD[A](v: A, k: Double => Double) {
        // Chain rule
        def flatMap[B](f: A => AD[B]): AD[B] = {
          val self = this
          val next = f(v)
          AD(next.v, (d: Double) => next.k(d) * self.k(d))
        }
    
        def +(that: AD[Double])(implicit ev: A =:= Double): AD[Double] =
          AD(this.v + that.v, (d: Double) => this.k(d) + that.k(d))
    
        def *(that: AD[Double])(implicit ev: A =:= Double): AD[Double] =
          AD(this.v * that.v, (d: Double) => (this.k(d) * that.v) + (that.k(d) * this.v))
    
        def sin(implicit ev: A =:= Double): AD[Double] =
          this.flatMap(x => AD(Math.sin(x), (d) => Math.cos(x) * d))
    
        def gradient: Double =
          this.k(1.0)
      }
      object AD {
        def pure(x: Double): AD[Double] =
          AD(x, d => 1.0)
      }
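
For example, differentiating sin(x²) at x = 2 with this class looks like the
following (the definitions are repeated so the snippet runs standalone; the
derivative names are just for illustration):

    
        // Differentiate y = sin(x * x) at x = 2 using the AD monad above.
        final case class AD[A](v: A, k: Double => Double) {
          // Chain rule
          def flatMap[B](f: A => AD[B]): AD[B] = {
            val self = this
            val next = f(v)
            AD(next.v, (d: Double) => next.k(d) * self.k(d))
          }
    
          def *(that: AD[Double])(implicit ev: A =:= Double): AD[Double] =
            AD(this.v * that.v, (d: Double) => (this.k(d) * that.v) + (that.k(d) * this.v))
    
          def sin(implicit ev: A =:= Double): AD[Double] =
            this.flatMap(x => AD(Math.sin(x), (d: Double) => Math.cos(x) * d))
    
          def gradient: Double = this.k(1.0)
        }
        object AD {
          def pure(x: Double): AD[Double] = AD(x, d => 1.0)
        }
    
        val x = AD.pure(2.0)
        val y = (x * x).sin                    // y = sin(x^2)
        val expected = 2 * 2.0 * Math.cos(4.0) // d/dx sin(x^2) = 2x cos(x^2)
        assert(math.abs(y.gradient - expected) < 1e-9)
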

~~~
edflsafoiewq
Oh, that's the monad — × M where M is a monoid, right? M being, in this case,
the monoid of functions R -> R under pointwise multiplication.

Shouldn't M actually be a monoid under composition? i.e. you should compose
like next.k(self.k(d)), not multiply.

~~~
noelwelsh
Sorry, I don't understand your notation. To me a monad is defined by two
functions, pure and bind (aka flatMap). I don't understand how — × M
corresponds to this.

~~~
tome
— × M is the functor. In Haskell it would generally be used as "Writer M", but
(M,) can also be used.

~~~
noelwelsh
Sorry, that doesn't help. I literally do not understand the notation. I do not
know what — means in this context, I don't understand what role × plays here,
etc.

~~~
tome
"—" means "the parameter of the type constructor goes here". I believe in
Scala you use underscore for this, at the value level at least. "×" is what
mathematicians call the type constructor for the usual product type. In
Haskell this is called (,). I don't know what it's called in Scala.

So, — × M means something like "type T a = (a, M)", in Haskell notation.
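
In Scala, that functor can be sketched as a Writer-style monad over an
arbitrary monoid (names here are hypothetical; M is specialised to Double
under multiplication to echo the pointwise-product reading above):

    
        // "— × M" as a Scala case class: a value paired with a monoid element.
        // Here M = Double under multiplication, with identity 1.0.
        final case class Writer[A](value: A, log: Double) {
          def flatMap[B](f: A => Writer[B]): Writer[B] = {
            val next = f(value)
            Writer(next.value, log * next.log) // combine logs with the monoid operation
          }
          def map[B](f: A => B): Writer[B] = Writer(f(value), log)
        }
        object Writer {
          def pure[A](a: A): Writer[A] = Writer(a, 1.0) // pair with the monoid identity
        }
    
        val w = Writer.pure(3).flatMap(n => Writer(n + 1, 2.0)).flatMap(n => Writer(n * 2, 5.0))
        // w == Writer(8, 10.0): values threaded left to right, logs multiplied
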

------
dunefox
Also this:
[https://2019.ecoop.org/details/ecoop-2019-papers/11/Automati...](https://2019.ecoop.org/details/ecoop-2019-papers/11/Automatic-Differentiation-for-Dummies) by Simon Peyton Jones

------
tiarkrompf
Co-author here - nice to see this discussed and happy to answer questions

~~~
nestorD
I have not read the paper yet but, reading the abstract, the idea of using
continuations to store the differentiation information is reminiscent of the
technique used in Zygote[0].

Is there some kinship between the ideas?

[0]: [https://arxiv.org/abs/1810.07951v4](https://arxiv.org/abs/1810.07951v4)

~~~
tome
I believe Zygote's approach is based on the backpropagator of Pearlmutter and
Siskind. The backpropagator approach seems much simpler to me, although the
authors of this paper suggest that it requires non-local program
transformations:

> The implementation proposed by Pearlmutter and Siskind returns a pair of a
> value and a backpropagator ... Doing this correctly requires a non-local
> program transformation ... Further tweaks are required if a lambda uses
> variables from an outer scope ... In contrast to Pearlmutter and Siskind
> [2008], using delimited continuations enables reverse-mode AD with only
> local transformations. Any underlying non-local transformations are
> implicitly resolved by shift and reset.

I'll have to look more carefully to understand how the CPS version avoids non-
local program transformations.
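
For comparison, the backpropagator pairing can be sketched as follows (a
hypothetical single-input simplification, not Zygote's actual API):

    
        // Pearlmutter & Siskind style: each value carries a backpropagator that
        // maps the sensitivity of the output back to the sensitivity of the input.
        final case class D(value: Double, backprop: Double => Double)
    
        def lift(x: Double): D = D(x, identity) // the input is its own sensitivity
    
        def squareD(x: D): D =
          D(x.value * x.value, dOut => x.backprop(2 * x.value * dOut))
    
        def sinD(x: D): D =
          D(Math.sin(x.value), dOut => x.backprop(Math.cos(x.value) * dOut))
    
        val y = sinD(squareD(lift(2.0))) // y = sin(x^2) at x = 2
        val grad = y.backprop(1.0)       // chain rule by *composition*: 2x cos(x^2)
        // grad == 4 * Math.cos(4.0)
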

------
tome
I don't find this very convincing. Continuations are well known to be
completely mystifying. I wrote up what I believe to be a much simpler reverse
mode AD transformation that uses only basic concepts:

[http://h2.jaguarpaw.co.uk/posts/automatic-differentiation-wo...](http://h2.jaguarpaw.co.uk/posts/automatic-differentiation-worked-examples/)

~~~
samth
Your notes, while simple, don't cover many of the things the paper does, such
as binding, functions, or sharing.

~~~
tome
You're right they don't cover functions. I don't understand what about binding
or sharing they don't deal with though. Could you elaborate?

~~~
tiarkrompf
For starters, what if you want to differentiate through code that traverses a
collection, for example:

    
    
        val x = ...
        ys.map(y => y + x)
    

Each loop iteration needs to contribute a gradient update to x, which is
defined in an outer scope. And what if y => y + x is not given as an inline
lambda, but defined elsewhere? It doesn't seem like your blog post discusses
any such cases.
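
To make the point concrete, here's a hedged sketch (names hypothetical, using
mutable accumulation rather than the paper's technique) of how each iteration
must add to x's gradient:

    
        // x is bound in an outer scope; every map iteration contributes to its
        // gradient, here via a mutable field.
        final class Num(val v: Double) { var grad = 0.0 }
    
        // Addition returns the result plus its backward rule.
        def add(a: Num, b: Num): (Num, Double => Unit) = {
          val out = new Num(a.v + b.v)
          (out, d => { a.grad += d; b.grad += d })
        }
    
        val x  = new Num(3.0)
        val ys = List(1.0, 2.0, 5.0).map(new Num(_))
        val zs = ys.map(y => add(y, x)) // x captured from the outer scope
    
        // Backward pass with upstream sensitivity 1.0 for each result:
        zs.foreach { case (_, back) => back(1.0) }
        // x.grad == 3.0: one contribution per loop iteration
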

~~~
tome
Correct, it doesn't discuss that case; indeed, it doesn't treat user-defined
functions at all. I just don't understand in what sense it doesn't treat
binding or sharing.

------
sdenton4
I dream of a day where I can write a program with some stupid
thresholds/flags, run it for a while, and automatically learn better
thresholds...

~~~
joe_the_user
That sounds appealing. How would the program know what "better" was?

~~~
saagarjha
You’d profile for the metric you cared about, like execution time or memory
usage.

~~~
taeric
This has an odd implication that the metric you care about will be static.

Consider your thermostat. What do you optimize for daily with it? Is it the
same in winter as summer? Why not?

To that end, why do we think there is an "optimum" setting?

~~~
noelwelsh
In the case of a thermostat you could define a goal temperature for a given
time range (e.g. 18C during the day, 10C at night) and then your loss is, say,
the squared difference between the goal and the current temperature.

But I agree that in many cases it is difficult to define a goal.

~~~
taeric
My point was supposed to be that your goal will vary across time.

Week of vacation at home? Probably optimizing for comfort. Standard work week
where you aren't home all day? Cost will be big.

Of course, getting costs in there could be a little tricky. Not impossible,
just likely a ton of heuristics. And past some point, maintaining a
temperature might be cheaper than quickly getting back to it, depending on how
far the temperature will drift while the heating is off. Which is just a long
way of saying it is easier to keep the house warm on warm days. :)

If you drive a lot, this would be akin to trying to find the optimum spot to
hold the gas pedal. Sounds like something you could look for, but odds are
high that it has to vary based on circumstances.

So, in the end, we aren't looking for a static parameter, but a system of
dynamic parameters to keep in tune.

~~~
noelwelsh
The usual way a thermostat works is the user specifies goal temperatures in
time ranges and the thermostat tries to achieve them. E.g. our thermostat
allows, IIRC, 4 different segments per day so we have morning (warmish),
daytime (minimal heating), evening (warmish), night (minimal). The
optimization problem then becomes one of achieving these goals.

To account for the dynamics / circumstances you can augment the input the
thermostat receives from just the current temperature to also include
information about the rate of change of temperature (velocity and
acceleration). This is the idea behind PID controllers and it allows things
like easing off the heating if the house is warming quickly so the thermostat
doesn't overshoot the goal.
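
As a rough sketch (gains illustrative, not tuned), a PID step combines those
three signals:

    
        // Minimal PID step: output = kp*error + ki*integral + kd*derivative.
        final case class PID(kp: Double, ki: Double, kd: Double,
                             integral: Double = 0.0, prevError: Double = 0.0) {
          def step(goal: Double, current: Double, dt: Double): (Double, PID) = {
            val error       = goal - current
            val newIntegral = integral + error * dt    // accumulated error
            val derivative  = (error - prevError) / dt // rate of change
            val output      = kp * error + ki * newIntegral + kd * derivative
            (output, copy(integral = newIntegral, prevError = error))
          }
        }
    
        // Purely proportional control: 18C now, 20C goal, gain 1 => output 2.0
        val (u, _) = PID(kp = 1.0, ki = 0.0, kd = 0.0).step(goal = 20.0, current = 18.0, dt = 1.0)
        // u == 2.0
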

------
DarmokJalad1701
I was able to follow until the end of Section 2.1, but was completely lost
afterwards.

