
Differentiable Programming Mega-Proposal - xmmrm
https://forums.swift.org/t/differentiable-programming-mega-proposal/28547
======
taliesinb
Capitalizing on the presence of people who might be new to Automatic
Differentiation and want a deeper understanding of how it works, here is an
interactive Colab notebook I wrote about this topic entitled “Build your Own
TensorFlow” for the Deep Learning Indaba that just happened in Kenya:
[https://colab.research.google.com/drive/14GeXkFd5pQKKNIJ7BMs...](https://colab.research.google.com/drive/14GeXkFd5pQKKNIJ7BMswP0ihlYg5nIbS#forceEdit=true&offline=true&sandboxMode=true)

~~~
Donald
That's an excellent tutorial - more are available on
[http://www.deeplearningindaba.com/practicals-2019.html](http://www.deeplearningindaba.com/practicals-2019.html)
for those interested.

------
chriscaruso
I don't see why a well-written library could not serve the same purpose. It
seems like a lot of cruft. I doubt, for example, that Python would ever consider
adding this, and it's the de facto language that would benefit most from
something like this, given its existing tools and communities.

It just seems so narrow, and not at the level of abstraction that languages
typically sit at. I could see the language supporting higher-level
functionality so that a library could do this without a bunch of extra work
(for example, via reflection).

~~~
AbrahamParangi
I would counter that differentiable programming should perhaps rise to the
level of baseline functionality that most languages should offer.

I think the applications for automatic differentiation and gradient
optimization well exceed what we think of as ML and data science today.

------
abeppu
I'm not a Swift programmer, so perhaps my confusion is just a symptom of
broader ignorance, but I find two things unclear here:

- What does 'first-class' mean, really?
- Which of these benefits are unique to integrating notions of derivatives
  into the language, and which could be enjoyed by well-written libraries?

The mega-proposal links out to a separate doc on embedded DSLs, with broad
statements about what's "typical" or "often" true of existing DSLs -- but
which of those issues are insurmountable? That section mentions Swift's
limited metaprogramming facilities. Why choose to carve out this added support
for a single family of algorithms rather than add in some more general
metaprogramming abilities that enable better EDSLs?

~~~
thedataangel
My understanding of automatic differentiation (AD) is that it's only really
possible at the compiler level, since you need the ability to interpret and
manipulate function definitions themselves. Certainly, no library would be
able to offer the same level of guarantees about whether you've done it wrong,
nor the same opportunities for optimisation.

~~~
abeppu
It's certainly not the case that autodiff is only possible at the compiler
level. I've implemented forward-mode (via dual numbers) and reverse-mode (via
tapes / Wengert lists) autodiff in libraries before.
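
For concreteness, here is a minimal sketch of forward-mode AD via dual numbers
in plain Python (the `Dual` class and `derivative` helper are names I'm making
up for illustration, not any particular library's API):

    class Dual:
        """A value paired with its derivative; overloaded ops propagate both."""
        def __init__(self, value, deriv=0.0):
            self.value, self.deriv = value, deriv

        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value + other.value, self.deriv + other.deriv)

        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value * other.value,
                        self.deriv * other.value + self.value * other.deriv)

    def derivative(f, x):
        # Seed the input's tangent with 1.0 and read the derivative off the output.
        return f(Dual(x, 1.0)).deriv

    print(derivative(lambda x: x * x + x, 3.0))  # 7.0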

~~~
mlevental
Notice the qualifier "really". Obviously you can implement autodiff of a sort
outside the compiler, since PyTorch and TensorFlow exist. But those
implementations constrain you to a select few compositions (please, no comments
on Turing completeness with just loops and conditionals). So, for example, if
statements in PyTorch are not differentiable (they might have piecewise
continuous derivatives) because PyTorch doesn't actually trace the AST. I'm not
a languages expert, but outside of implementing it in the compiler I imagine
you'd need a homoiconic language to implement it as a library.

~~~
chillee
If statements aren't _really_ meaningfully differentiable, regardless of how
you do it.

Take

    
    
        def f(x):
            if x == 59:
                return 1000
            elif x > 59:
                return -x
            else:
                return x
    

How do you choose x to maximize this, regardless of what language you're in?

It's true that you can get a derivative, but the derivative is essentially
meaningless: it's 1 below 59 and -1 above, so following it will never lead you
to the isolated spike at x == 59.

~~~
mlevental
I don't understand? It's a piecewise differentiable function, and you maximize
it the way you maximize any such function: do gradient ascent where it's
differentiable and compare against the values at the boundary points (i.e. the
start and end of the interval, and the points where there's a removable
discontinuity), roughly as sketched below.
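
A rough sketch of that recipe in plain Python, assuming we already know the
interval and the breakpoint at 59 (hand-written for this one example, not
something an AD framework emits):

    def f(x):
        if x == 59:
            return 1000
        elif x > 59:
            return -x
        else:
            return x

    def grad_f(x):
        # Derivative on the smooth pieces; undefined exactly at x == 59.
        return -1.0 if x > 59 else 1.0

    def maximize(x0, lo=0.0, hi=100.0, lr=1.0, steps=200):
        x = x0
        for _ in range(steps):  # gradient ascent on the smooth pieces
            x = min(max(x + lr * grad_f(x), lo), hi)
        candidates = [x, lo, hi, 59.0]  # compare against endpoints and breakpoints
        return max(candidates, key=f)

    print(maximize(0.1))  # 59.0 -- found only because we enumerated the breakpoint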

~~~
chillee
I don't disagree that a gradient exists. As the other commenter noted, the
gradient/subgradient will usually exist.

What I'm arguing is that this gradient will not allow you to optimize anything
of interest for the vast majority of programs.

------
taliesinb
If you are interested in following along, the Swift for TensorFlow team has a
design meeting every Friday. The meetings are live-streamed and anyone can
join. I recommend them if you are curious and want to hear more about the
challenges and opportunities! A recent talk had Jeremy Howard (of fast.ai
fame) present the full-featured deep learning API he has designed on top of
Swift for TensorFlow.

You can find out more on the mailing list
[https://groups.google.com/a/tensorflow.org/forum/m/#!forum/s...](https://groups.google.com/a/tensorflow.org/forum/m/#!forum/swift)

------
cs702
How does this relate to all the work that Christ Lattner et al have been doing
at Google with Swift, MLIR, etc.?[a]

Is this... a separate, parallel, more encompassing proposal?

Is there any coordination between these two groups?

--

[a]
[https://www.youtube.com/watch?v=yCd3CzGSte8](https://www.youtube.com/watch?v=yCd3CzGSte8)

~~~
samtheprogram
I don't know if "Christ" Lattner was intentional (humorous) or an accident,
but I chuckled.

~~~
cs702
Accident.

------
hsaliak
How does this compare to Jax?
[https://github.com/google/jax](https://github.com/google/jax) Why do this in
the language instead of a library?

~~~
taliesinb
Probably the main justification is that the analysis and transformation steps
needed to compute the VJP (pullback) and JVP (pushforward) of a function, which
correspond to reverse- and forward-mode automatic differentiation, require
enough of the other machinery of a compiler that they are best done WITHIN a
compiler. Then
other things become quite natural, too, like producing the tangent vector
versions of data structures like tuples and maps!
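
To make those terms concrete, here is roughly what the JVP and VJP of a tiny
function look like when written out by hand in plain Python (the `jvp_f` /
`vjp_f` names are just for illustration, not anyone's API):

    def f(x, y):
        return x * y

    def jvp_f(x, y, dx, dy):
        # Forward mode: push a tangent (dx, dy) at (x, y) through f.
        return y * dx + x * dy

    def vjp_f(x, y, v):
        # Reverse mode: pull a cotangent v on the output back to (x, y).
        return (v * y, v * x)

    print(jvp_f(3.0, 4.0, 1.0, 0.0))  # 4.0, i.e. df/dx at (3, 4)
    print(vjp_f(3.0, 4.0, 1.0))       # (4.0, 3.0), the full gradient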

Moreover, a statically typed language like Swift is a much better starting
point for this kind of effort than Python. Array shapes and dimensions already
amount to an informal type system - you might as well go the whole distance and
get all the other safety, readability, and efficiency benefits!

PS shout out for named array axes as the future of array-based (and hence
differentiable) programming... see
[http://nlp.seas.harvard.edu/NamedTensor](http://nlp.seas.harvard.edu/NamedTensor)
for a good rationale

~~~
hsaliak
JAX does have a just-in-time compiler that lets you compile your Python
functions to XLA-optimized kernels (through llvmlite under the hood?).

The fact that you can jit compile and gain the benefits of "doing this within
the compiler" is one of its main selling points.

------
mark_l_watson
This will certainly help people who work on new deep learning theories and
model architectures, but not so much the large crowd of deep learning
practitioners.

In his excellent interview with Lex Fridman, Yann LeCun was critical of any
approach to AI that was not differentiable, even constraint satisfaction and
other solid optimization techniques. In the context of scaling to very large
problems or models with many billions of parameters, he is probably correct.

I have had problems with the Swift for TensorFlow code drops. Sometimes they
work for me and sometimes they don't. So, very good technology, but perhaps
wait for it to mature. I read that some students in the fast.ai course using
Swift have also had some setup difficulties.

EDIT: you might also want to look at Julia for differentiable programming.
Julia with deep learning libraries like Flux is also a ‘turtles all the way
down’ system: unlike TensorFlow, where the guts are implemented in C++, with
Swift and Julia the entire stack can be implemented in a single language.

------
thedataangel
This is actually huge. I saw a proof of concept of something like this in
Haskell a few years back, but it's amazing to see it (probably) making it into
the core of a mainstream language.

This may let them capture a large chunk of the ML market from Python - and
hopefully greatly improve ML APIs while they're at it.

~~~
krapht
Huh? Nobody is writing numerically intensive libraries in Python. Clearly this
language proposal is taking aim at C++ and Fortran. Even if this caused
TensorFlow & others to rewrite everything in Swift, people would write Python
bindings to it and keep using Python.

I'll get excited if Apple actually merges this into Swift. It's a niche
feature that their compiler team will need to maintain forever. I actually
have been working on algorithmic differentiation in C++, so it's not even that
I wouldn't want to try Swift out if it actually made it in. However, because
this sort of thing is of such narrow interest I believe the future will stay
with embedded DSLs / libraries / ugly macro/template hackery.

~~~
mlevental
lol, this is literally by the group that's rewriting TensorFlow in Swift
([https://www.tensorflow.org/swift](https://www.tensorflow.org/swift)), so
you're off on the intention here: it is exactly taking aim at Python as the
main data-ecosystem language.

------
wenc
I'm curious (for folks in the know): how does differentiable programming
handle non-differentiable points? Can it detect non-differentiable/non-smooth
functions?

Non-smooth functions like abs(), max(), min() have points where derivatives do
not exist. ReLU functions are non-differentiable at their hinge points.
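
For those simple kinks, my naive guess is that implementations just pick one
subgradient at the kink, something like this plain-Python sketch (not any
specific framework's actual behavior):

    def relu(x):
        return max(x, 0.0)

    def relu_grad(x):
        # Not differentiable at x == 0; pick the subgradient 0.0 there,
        # which seems to be the common convention.
        return 1.0 if x > 0 else 0.0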

Disjoint IF-THEN-ELSE conditions are discontinuities in the function space,
and are traditionally handled in optimization with mixed-integer formulations
(i.e. split up the space and do something clever like branch-and-bound to find
the optimum).

~~~
chillee
Other people have answered what they do, but this is the big gap between
people talking about 'differentiable programming' in theory, and having it
actually work in practice.

It's true that once you have control flow, the gradient quickly becomes
meaningless. I posted an example here:
[https://news.ycombinator.com/item?id=20892287](https://news.ycombinator.com/item?id=20892287)

That's also the biggest reason I tend to find much of this "differentiable
programming" stuff to be overhyped. It's _hard_ to reformulate programs in a
way s.t. the derivative can mean something meaningful. And I'm not convinced
traditional languages will benefit.

That's not to say that there aren't cases where your program can be formulated
to have a meaningful derivative.

See this differentiable ray tracer:
[https://people.csail.mit.edu/tzumao/diffrt/](https://people.csail.mit.edu/tzumao/diffrt/)

~~~
taliesinb
> It's hard to reformulate programs in a way s.t. the derivative can mean
> something meaningful.

Really? The gradients computed by AD are the exact answer to the following
question: if I were to change this input or parameter an infinitesimal amount,
how much would it change the output of my function? That is _always_
meaningful (when it is defined), and means what I just said. You can easily
make functions where it is not defined, of course, just like you can make a
sphere into two spheres with the Banach-Tarski theorem!

But there are vast, vast forests of numerical computation employed in
industry, science, finance, engineering, where it is _almost always_ defined.

And even for more “chunky” computations where the non-differentiability is
more severe, there are algorithms like REINFORCE that you can use to estimate
gradients through these parts.
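
For example, a toy score-function (REINFORCE-style) estimator for a Bernoulli
parameter, sketched in plain Python (a made-up setup just to illustrate the
idea, nothing from the proposal itself):

    import random

    def f(x):
        # Non-differentiable "reward": depends only on the discrete sample x.
        return float(x)

    def estimate_grad(theta, n=100_000):
        # Estimates d/dtheta E[f(x)] for x ~ Bernoulli(theta) via
        # E[f(x) * d/dtheta log p(x; theta)]; no derivative of f is needed.
        total = 0.0
        for _ in range(n):
            x = 1 if random.random() < theta else 0
            score = x / theta - (1 - x) / (1 - theta)  # d log p / d theta
            total += f(x) * score
        return total / n

    print(estimate_grad(0.3))  # approximately 1.0, since E[f(x)] = theta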

~~~
chillee
It's not meaningful in the sense that it won't correspond with your intuition
of "will increasing the input increase my output". Presumably the point of
differentiable programming isn't just getting the derivative for fun, it's for
optimizing some quantity.

For example, take this code.

    
    
         def f(x, y):
             for i in range(x):
                 y += 1
             return y
    

It's technically true that a gradient of 0 with respect to x is correct (modulo
the jumps at integer boundaries). But if someone was trying to optimize this
function, that's not very helpful.

I believe REINFORCE is not of much help either - it's not magic. I'm not aware
of any stochastic gradient estimators that are helpful in this case (although
if there is a method I'd like to hear about it).

------
__erik
I'm really excited for this. I'm unaware of any mainstream language with
first-class support for differentiation, and I think it's going to be really
interesting to see what people use it for outside of ML.

~~~
adamnemecek
Julia is in the same space.

~~~
__erik
Julia is great, but it doesn't play in the domain of apps and servers.

~~~
byt143
Yes it does. See Genie.jl for a full MVC framework. There are also Mux.jl and
HTTP.jl.

------
akhilcacharya
I really like the idea of building AD directly into the language and compiler
infrastructure itself, but the challenge of upstreaming something built for
very specific tasks makes me concerned. Is Lattner just going to end up making
a GSwift? Will it just be forked?

------
layoutIfNeeded
Ew, that’s really not something I would do at the language level...

------
oyashius
This is super exciting

------
adamnemecek
Now do automatic integration.

------
sethryclaus
Well done Swift. Insightful language design.

------
antpls
Can "Differentiable Programming" be related to "Differentiable Privacy", or
have we now one word (and acronym!) to describe two different things?

~~~
omaranto
I think you meant "_differential_ privacy", and no, it's not closely related
to differentiable programming.

[1]
[https://en.wikipedia.org/wiki/Differential_privacy](https://en.wikipedia.org/wiki/Differential_privacy)

