
What Is Differentiable Programming? - one-more-minute
https://fluxml.ai/2019/02/07/what-is-differentiable-programming.html
======
throwawaymath
I have to be honest, I don't think this is a good explanation. I don't know
what differentiable programming is, but I'm fairly sure I have the
mathematical background to understand it. But I didn't come away from this
article with any confidence that I'm following along.

On a superficial level it seems like it:

1. Generalizes deep learning to an optimization function on decomposable
input, and

2. Reduces the number of parameters required to learn the input by exploiting
the structure of the input, thereby making learning more efficient.

Is that correct? Is it completely off? What am I missing? Is there any more
meat to the article than this?

Could someone who has upvoted this (and ideally understands the topic well)
provide a different explanation of the concept? It would be great if I could
see a real world example (even a relatively trivial one) represented in both
the traditional matrix computation form and the sexy new differentiable form.

~~~
taktoa
The basic idea is to have an expressive programming language where all
constructs are differentiable. Since the composition of diffeomorphisms is a
diffeomorphism, large programs (like ray-tracers) will be differentiable as a
result.
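The composition idea can be sketched in a few lines of Python using dual numbers (forward-mode AD): if every primitive carries its derivative along with its value, any program composed of those primitives is differentiable for free. The class and function names here are illustrative, not from any particular library.

```python
# Forward-mode AD via dual numbers: each value carries its derivative,
# so compositions of primitives stay differentiable automatically.

class Dual:
    """A value paired with its derivative; primitives propagate both."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__

def derivative(f, x):
    """Evaluate f at x with a unit derivative seed and read off df/dx."""
    return f(Dual(x, 1.0)).der

# Any composition of these primitives is differentiable "as a result":
f = lambda x: 3 * x * x + 2 * x + 1   # f'(x) = 6x + 2
print(derivative(f, 2.0))             # 14.0
```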

~~~
enriquto
> The basic idea is to have an expressive programming language where all
> constructs are differentiable. Since the composition of diffeomorphisms is a
> diffeomorphism, (...)

No. It has nothing to do with diffeomorphisms (which are necessarily between
spaces of the same dimension), but with piecewise differentiable functions.
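The piecewise-differentiable point can be made concrete with abs(x), the classic example: a program with a branch that is smooth on each piece, with a kink at 0 where AD systems conventionally pick one of the subgradients. A small sketch (illustrative names, not a real AD system):

```python
# A branching program is only piecewise differentiable: smooth on each
# piece, with a kink where the branches meet.

def my_abs(x):
    return x if x >= 0 else -x

def my_abs_grad(x):
    # derivative on each piece; the convention at the kink x == 0 varies
    # between AD systems (here we arbitrarily return +1.0)
    return 1.0 if x >= 0 else -1.0

# Finite differences agree with the piecewise derivative away from the kink:
eps = 1e-6
fd = (my_abs(2.0 + eps) - my_abs(2.0 - eps)) / (2 * eps)
print(fd, my_abs_grad(2.0))  # both approximately 1.0
```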

~~~
throwawaymath
Thanks, I was also a little confused about that in the parent comment.

------
damip
Differentiable does not mean easy to optimize. One could imagine implementing
SHA-256 using differentiable operators, and yet the system as a whole would
not be optimizable at all. It would be interesting to have compilers that
optimize the "optimizability" of differentiable programs, though...
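This point can be made concrete in miniature (illustrative toy code, nothing like real SHA-256): build a hash-like function out of smooth primitives only. Every piece is differentiable, yet the gradient is numerically zero almost everywhere, so gradient-based optimization gets no signal.

```python
import math

def steep_step(x, k=1e6):
    z = max(-700.0, min(700.0, k * x))  # clamp so exp doesn't overflow
    return 1.0 / (1.0 + math.exp(-z))   # smooth, but nearly a 0/1 step

def toy_hash(x):
    h = x
    for i in range(8):                  # scramble through steep steps
        h = steep_step(math.sin(37.0 * h + i))
    return h

def grad_fd(f, x, eps=1e-6):
    """Central finite difference, standing in for AD."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

print(grad_fd(toy_hash, 0.3))  # ~0.0: differentiable, yet unoptimizable
```

Every operation here has a well-defined derivative, but the loss landscape is flat plateaus separated by near-vertical cliffs, which is exactly the "differentiable but not optimizable" regime.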

Also, here are two interesting examples of differentiation through physical
systems for classification:

[https://arxiv.org/pdf/1808.08412.pdf](https://arxiv.org/pdf/1808.08412.pdf)

[https://innovate.ee.ucla.edu/wp-content/uploads/2018/07/2018-optical-ml-neural-network.pdf](https://innovate.ee.ucla.edu/wp-content/uploads/2018/07/2018-optical-ml-neural-network.pdf)

------
ricksharp
Could someone list some practical examples where Differentiable Programming
would be useful?

I am familiar with areas where Neural Networks and Convolutional Networks
have done well, especially around image processing etc.

But I can’t imagine where having differentiable code would help unless it is
just tying multiple neural networks together in a continuous chain of
differentiation.

For most programming tasks, I can’t imagine how differentiation would be
possible or beneficial.

Is there a possibility that one could start with a series of unit tests and
partial results and through gradient descent actually arrive at additional
passing test cases? Most of the time in my experience, passing additional test
cases like this requires significantly more complex structures that would not
be found via differentiation.

~~~
improbable22
There are examples quite far from neural networks, the ones I can think of are
broadly optimisation problems:

Many physics problems involve trying to find a function which minimises
something -- energy, entropy, action. Or the state which makes the difference
between two things zero. Sometimes adjusting many parameters slowly down the
gradient is a good way to find these.

In Bayesian statistics, the basic problem is to sample from a distribution,
which you know only indirectly, by some kind of Monte Carlo method. But the
space to sample can be enormous. If I understand right, advanced ways of
doing this exploit the gradients (of functions defining the distribution) to
try to choose samples efficiently.

People have hacked TensorFlow to do all sorts of things which its creators
didn't intend. Or written tools specialised for another particular domain
(like Stan). I guess the excitement is that instead of re-inventing the wheel
in each domain, maybe this can be pushed down to become a language feature
which everyone above uses.
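The first example above (sliding down a gradient to find an energy minimum) can be sketched in a few lines. This is a toy quadratic well with a hand-written gradient; a differentiable-programming system would derive the gradient for you.

```python
# Find the state minimising an "energy" by gradient descent.
# Toy example: a particle in a quadratic well, E(x) = (x - 2)^2.

def energy(x):
    return (x - 2.0) ** 2

def grad(x):
    # dE/dx, written by hand here; AD would supply this automatically
    return 2.0 * (x - 2.0)

x, lr = 0.0, 0.1
for _ in range(200):
    x -= lr * grad(x)      # step down the gradient

print(round(x, 6))  # → 2.0, the energy minimum
```

The same loop, with many parameters and a physically meaningful energy (or action, or entropy), is the pattern the comment describes.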

------
ktpsns
Something I don't understand about Automatic Differentiation is: Why not use a
Computer Algebra System instead for generating derivatives of given functions?

~~~
one-more-minute
That's essentially what a source-to-source AD does, just with support for the
extra features that show up in programming languages. For example, handling
variable bindings gets you the typical Wengert list, and handling function
calls gets you the Pearlmutter and Siskind style backpropagator (I wrote a bit
about the relationships at [0]).

The short answer is that CAS systems work with a "programming language" that
doesn't have these features and is therefore a bit too limited for the kinds
of models we're interested in.

[0] [https://github.com/MikeInnes/diff-zoo](https://github.com/MikeInnes/diff-zoo)
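The Wengert list mentioned above is easy to sketch: record each primitive operation on a tape as the program runs, then sweep the tape backwards to accumulate gradients (reverse-mode AD). Names here are illustrative, not from diff-zoo.

```python
import math

tape = []  # entries: (output_var, [(input_var, local_derivative), ...])

class Var:
    def __init__(self, val):
        self.val, self.grad = val, 0.0

def mul(a, b):
    out = Var(a.val * b.val)
    tape.append((out, [(a, b.val), (b, a.val)]))   # d(ab)/da = b, d(ab)/db = a
    return out

def sin(a):
    out = Var(math.sin(a.val))
    tape.append((out, [(a, math.cos(a.val))]))     # d(sin a)/da = cos a
    return out

def backward(out):
    """Sweep the Wengert list in reverse, applying the chain rule."""
    out.grad = 1.0
    for node, parents in reversed(tape):
        for parent, local in parents:
            parent.grad += node.grad * local

# y = sin(x1 * x2)
x1, x2 = Var(2.0), Var(3.0)
y = sin(mul(x1, x2))
backward(y)
print(x1.grad)  # dy/dx1 = x2 * cos(x1 * x2)
```

Handling variable bindings gives exactly this structure; a source-to-source AD additionally handles function calls, control flow, and the rest of a real language.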

------
ricksharp
I found this paper that helps answer the question:
[https://arxiv.org/pdf/1803.10228.pdf](https://arxiv.org/pdf/1803.10228.pdf)

~~~
JadeNB
When possible, one should link to the abstract page, if for no other reason
than that it makes version tracking easier: Wang, Wu, Essertel, Decker, and
Rompf - Demystifying differentiable programming: shift/reset the penultimate
backpropagator
([https://arxiv.org/abs/1803.10228](https://arxiv.org/abs/1803.10228)), a
title which is entirely too cute for its own good.

~~~
ricksharp
Thanks Jade, didn’t realize I grabbed the wrong link.

This paragraph from the paper was helpful for me:

Differentiable programming is of joint interest to the machine learning and
programming language communities. As deep learning models become more and
more sophisticated, researchers have noticed that building blocks into a large
neural network model is similar to using functions, and that some powerful
neural network patterns are analogous to higher-order functions in functional
programming [Fong et al. 2017; Olah 2015]. This is also thanks to the
development of modern deep learning frameworks which make defining neural
networks “very much like a regular program” [Abadi et al. 2017; LeCun 2018].

------
xitrium
Be cautious using this article to try to learn anything. Differentiable
programming is not actually specific to deep learning; it's another word for
automatic differentiation, a technique that is very important in deep learning
implementations but also valuable for a variety of other tasks where gradients
of arbitrary functions are needed.

The article is correct that "Differentiable Programming" seems to be a
rebranding effort that I believe just helped automatic differentiation work
from the machine learning world get published in Programming Languages
journals. I wouldn't read too much into it.

------
tanilama
Is Differentiable Programming just a rebranding of tape-based auto
differentiation?

~~~
psykotic
The only serious difference between classical AD and newer efforts like this
is that they have an intermediate language ("what goes on the tape") with
recursion and control flow operators, which ideally lets them implement
front-ends for normal programming languages.

You can achieve the same effect with classical AD by doing what PyTorch does
in its default mode where the dynamic recursion and control flow is normal
Python code and large tensor ops are asynchronously scheduled on the target
device and an AD tape for that particular dynamic run is constructed each
time. But the viability of PyTorch's approach is domain dependent: there are
relatively few but large tensor ops, and while host-device synchronization is
possible (for making the dynamic control flow dependent on intermediate
device-computed results) it's also very costly. Although the first of these
would be slightly less critical if the host language executed faster.

But, if you're not trying to achieve the kind of decoupling between host and
device which PyTorch and similar frameworks want, then I don't see any
particular reason you can't just use normal forward-mode or reverse-mode AD as
an alternate scalar execution mode for existing code, which just needs a
different interpreter/library, not a different intermediate language. For the
purposes of differentiation, the branching and control flow operators don't
exist, anyway. E.g. if you represent branching explicitly in your graph/tape
as TensorFlow does, the subgradient of if(a, b, c) is just the subgradient of
whichever of b or c is selected. The conditional variable a only serves as a
selector; if you compute the subgradient with respect to the selector, the
answer is always 0 (except for the knife-edge case where it's technically not
defined but is treated as 0).

------
hnuser355
What the hell is that supposed to mean, and how is it different from
automatic differentiation?

~~~
ddragon
I think it's just the realization that the execution graph of a machine
learning model at this point is not really different from any programming
language AST, which means there is potential in exploring the intersection
between writing programs and writing machine learning models with the AD
tools.

~~~
hnuser355
Sorry for curse words I’m very confused

~~~
ddragon
It's ok, it seems at this point the focus is on creating the tools to better
allow that exploration, including making the entirety of a programming
language valid syntax for building any model (supporting AD). The efforts I
know of are Julia's Zygote and Swift for TensorFlow.

I think the differentiable Forth example in the article is interesting in
this context, since it has a differentiable program with gaps, and it uses
the universal approximation property of a neural network to fill them. When
your code is differentiable, it's possible to embed ML models, perhaps to
learn a part of an equation when you already know most of it, and which
otherwise would have too large a search space. You might even have, as
another commenter said, a compiler smart enough to rewrite the AST to reduce
convergence problems (which seem to be the main problem with such models). Or
you could download libraries with pretrained models/architectures in the same
way you have any regular program library to embed in deeper systems.

Though I honestly can't tell if any of those are actually valid pursuits or
I'm misunderstanding the possibilities.
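The "learn a part of an equation you mostly know" idea above can be shown in miniature: suppose we know the form y = k·x² but not the coefficient k. Treat k as a learnable parameter and fit it by gradient descent on squared error. The gradient is written by hand here; in a differentiable program AD would supply it, and k could just as well be a neural network filling a gap.

```python
# Fit the unknown coefficient k in a known law y = k * x**2.
# Toy data generated with k = 3 (illustrative values).
data = [(1.0, 3.0), (2.0, 12.0), (3.0, 27.0)]

k, lr = 0.0, 0.01
for _ in range(500):
    # d/dk of the mean squared error sum((k*x^2 - y)^2) / n
    grad = sum(2 * (k * x**2 - y) * x**2 for x, y in data) / len(data)
    k -= lr * grad

print(round(k, 4))  # → 3.0, recovering the true coefficient
```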

