
“Deep Learning has outlived its usefulness as a buzz-phrase” - acostin
https://www.facebook.com/yann.lecun/posts/10155003011462143
======
rememberlenny
[Text from post]

OK, Deep Learning has outlived its usefulness as a buzz-phrase. Deep Learning
est mort. Vive Differentiable Programming!

Yeah, Differentiable Programming is little more than a rebranding of the
modern collection of Deep Learning techniques, the same way Deep Learning was
a rebranding of the modern incarnations of neural nets with more than two
layers.

But the important point is that people are now building a new kind of software
by assembling networks of parameterized functional blocks and by training them
from examples using some form of gradient-based optimization.

An increasingly large number of people are defining their networks
procedurally in a data-dependent way (with loops and conditionals), allowing
them to change dynamically as a function of the input data fed to them. It's
really very much like a regular program, except it's parameterized,
automatically differentiated, and trainable/optimizable. Dynamic networks have
become increasingly popular (particularly for NLP), thanks to deep learning
frameworks that can handle them, such as PyTorch and Chainer (note: our old
deep learning framework Lush could handle a particular kind of dynamic net
called Graph Transformer Networks, back in 1994. It was needed for text
recognition).
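
A minimal sketch of the kind of dynamic, data-dependent network described
above, assuming PyTorch (the DynamicNet class and its input-dependent depth
rule are invented for illustration):

    import torch

    class DynamicNet(torch.nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.layer = torch.nn.Linear(dim, dim)

        def forward(self, x):
            # ordinary Python control flow: depth depends on the input itself
            steps = int(x.abs().sum().item()) % 4 + 1
            for _ in range(steps):
                x = torch.relu(self.layer(x))
            return x

    net = DynamicNet(8)
    y = net(torch.randn(8))  # graph is built on the fly, yet stays trainable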

People are now actively working on compilers for imperative differentiable
programming languages. This is a very exciting avenue for the development of
learning-based AI.

Important note: this won't be sufficient to take us to "true" AI. Other
concepts will be needed for that, such as what I used to call predictive
learning and now decided to call Imputative Learning. More on this later....

~~~
bra-ket
It's really a pity that after 75 years of AI research the best thing we've got
is still based on gradient descent, a brute-force trial-and-error approach.

~~~
electricslpnsld
If you can’t reasonably get at or use second-order information, how else are
you going to optimize arbitrary objectives?

Well, come to think of it, why don’t DL approaches use BFGS instead of
gradient descent?

~~~
fwilliams
There is literature on Quasi-Newton and Krylov Subspace methods for training
Neural Networks. For example,
[https://dl.acm.org/citation.cfm?id=3104516](https://dl.acm.org/citation.cfm?id=3104516).

I think the primary reason that such methods are not used much in practice is
memory and computational cost: each function evaluation is expensive and you
need to solve a very large system at every iteration.

Also to reply to a sibling comment, you can add momentum and step length
adjustments to second-order methods in much the same way as in steepest-
descent to help escape saddles. The only difference is how the descent
direction is chosen for the optimization.
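
To make the cost concrete, here is a minimal sketch assuming PyTorch's
built-in torch.optim.LBFGS (the toy model and data are invented for
illustration): every step re-evaluates the loss through a closure, and the
optimizer stores a history of past gradients on top of the parameters
themselves.

    import torch

    model = torch.nn.Linear(1000, 10)  # toy high-dimensional model
    x, y = torch.randn(64, 1000), torch.randn(64, 10)
    loss_fn = torch.nn.MSELoss()

    # history_size extra gradient-sized buffers are kept around between steps
    opt = torch.optim.LBFGS(model.parameters(), history_size=10, max_iter=20)

    def closure():
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        return loss

    opt.step(closure)  # one quasi-Newton step; far costlier than one SGD step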

~~~
steev
This is correct - second order methods are great in theory, but they are
generally computationally prohibitive for high dimensional problems.

------
BucketSort
You may have seen DeepMind's results last year where it trained 3D models to
move through space in different ways, in a paper entitled "Emergence of
Locomotion Behaviours in Rich Environments" (
[https://arxiv.org/pdf/1707.02286.pdf](https://arxiv.org/pdf/1707.02286.pdf) ,
[https://www.youtube.com/watch?v=hx_bgoTF7bs&feature=youtu.be](https://www.youtube.com/watch?v=hx_bgoTF7bs&feature=youtu.be)).
If you have a look at the paper, the method they use, "Proximal Policy
Optimization", is a great example of differentiable programming that does not
include a neural network. I actually realized this last month when I was
preparing a talk on deep learning: I thought it used deep neural nets in its
application, but found that it didn't.

~~~
guillefix
Scanning through the paper, I see this "We structure our policy into two
subnetworks, one of which receives only proprioceptive information, and the
other which receives only exteroceptive information. As explained in the
previous paragraph with proprioceptive information we refer to information
that is independent of any task and local to the body while exteroceptive
information includes a representation of the terrain ahead. We compared this
architecture to a simple fully connected neural network and found that it
greatly increased learning speed."

It seems to me they do use neural nets. Proximal Policy Optimization is just a
more novel way of optimizing them.
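
For reference, the heart of PPO is itself just a differentiable objective.
Here is a minimal sketch of its clipped surrogate loss, assuming PyTorch (the
function name and arguments are my own; the log-probabilities and advantage
estimates would come from a rollout):

    import torch

    def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
        # probability ratio between the updated policy and the rollout policy
        ratio = torch.exp(log_probs_new - log_probs_old)
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
        # maximize the clipped surrogate, i.e. minimize its negation
        return -torch.min(ratio * advantages, clipped * advantages).mean()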

------
cs702
I wish we could come up with a catchier name, but I LOVE the idea of calling
this _programming_ , because that is precisely what we do when we compose deep
neural nets.

For example, here's how you compose a neural net consisting of two "dense"
layers (linear transformations), using Keras's functional API, and then apply
these two layers to some tensor x to obtain a tensor y:

    from keras.layers import Dense, Input

    x = Input(shape=(n,))  # some input tensor with n features
    f = Dense(n)
    g = Dense(n)

    y = f(g(x))

This looks, smells, and tastes like programming (in this case with a strong
functional flavor), doesn't it?

Imagine how interesting things will get once we have nice facilities for
composing large, complex applications made up of lots of components and
subcomponents that are differentiable, both independently and end-to-end.
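
We already get a taste of this with today's APIs. Continuing the snippet
above, here's a hedged sketch (assuming the same Keras API; `inputs` and
`targets` stand in for your training arrays) of wiring the composition into a
model that trains end-to-end:

    from keras.models import Model

    model = Model(inputs=x, outputs=y)  # the composition above, as one unit
    model.compile(optimizer='sgd', loss='mse')
    model.fit(inputs, targets, epochs=10)  # gradients flow end-to-end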

Andrej Karpathy has a great post about this:
[https://medium.com/@karpathy/software-2-0-a64152b37c35](https://medium.com/@karpathy/software-2-0-a64152b37c35)

~~~
bmc7505

      I hate the name, but LOVE the idea of calling this programming...
    

What would you call it instead?

~~~
letlambda
∇programming

------
BucketSort
I believe this paper by Marcus (
[https://arxiv.org/ftp/arxiv/papers/1801/1801.00631.pdf](https://arxiv.org/ftp/arxiv/papers/1801/1801.00631.pdf)
), from earlier this week, inspired this.

Edit: I don't mean Marcus inspired the term differentiable programming; he
inspired LeCun to emphasize the wider scope of deep learning after Marcus
attacked it. In fact, LeCun liked a post on twitter rebutting Marcus' paper
that also talks about differentiable programming:
[https://twitter.com/tdietterich/status/948811917593780225](https://twitter.com/tdietterich/status/948811917593780225)

~~~
sytelus
I don't think so. LeCun seems to oppose Marcus's views...

Related:
[https://twitter.com/ylecun/status/921409820825178114?lang=en](https://twitter.com/ylecun/status/921409820825178114?lang=en)

I think LeCun doesn't want a repeat of the AI winter caused by exponentially
rising hype and expectations around Deep Learning. There have been a few
examples, like Selena, where he seems to think people are trying to ride the
deep learning wave to generate false buzz (and cash!) for themselves.

~~~
BucketSort
He does oppose Marcus' views, but he also knows neural nets are only one
approach to differentiable programming. The term is confusing though. It
should read like "linear programming" does, but people are not interpreting it
that way.

------
saycheese
Past HN coverage of Differentiable Programming:

[https://news.ycombinator.com/item?id=10828386](https://news.ycombinator.com/item?id=10828386)

------
ehsankia
1\. Differentiable Programming is horrible branding. It's hard to say, not
catchy, and not as easily decipherable.

2\. Isn't the evolution of Deep Networks more advanced setups such as GANs,
RNNs, and so on?

~~~
kinkrtyavimoodh
> Differentiable Programming is horrible branding. It's hard to say, not
> catchy, and not as easily decipherable

Tell that to the people who deliberately popularized the term Dynamic
Programming for something that was neither dynamic nor programming.

____

(From Wiki)

Bellman explains the reasoning behind the term dynamic programming in his
autobiography, Eye of the Hurricane: An Autobiography (1984, page 159). He
explains:

"I spent the Fall quarter (of 1950) at RAND. My first task was to find a name
for multistage decision processes. An interesting question is, Where did the
name, dynamic programming, come from? The 1950s were not good years for
mathematical research. We had a very interesting gentleman in Washington named
Wilson. He was Secretary of Defense, and he actually had a pathological fear
and hatred of the word research. I’m not using the term lightly; I’m using it
precisely. His face would suffuse, he would turn red, and he would get violent
if people used the term research in his presence. You can imagine how he felt,
then, about the term mathematical. The RAND Corporation was employed by the
Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I
felt I had to do something to shield Wilson and the Air Force from the fact
that I was really doing mathematics inside the RAND Corporation. What title,
what name, could I choose? In the first place I was interested in planning, in
decision making, in thinking. But planning, is not a good word for various
reasons. I decided therefore to use the word “programming”. I wanted to get
across the idea that this was dynamic, this was multistage, this was time-
varying. I thought, let's kill two birds with one stone. Let's take a word
that has an absolutely precise meaning, namely dynamic, in the classical
physical sense. It also has a very interesting property as an adjective, and
that is it's impossible to use the word dynamic in a pejorative sense. Try
thinking of some combination that will possibly give it a pejorative meaning.
It's impossible. Thus, I thought dynamic programming was a good name. It was
something not even a Congressman could object to. So I used it as an umbrella
for my activities."

~~~
YeGoblynQueenne
>> Let's take a word that has an absolutely precise meaning, namely dynamic,
in the classical physical sense. It also has a very interesting property as an
adjective, and that it's impossible to use the word dynamic in a pejorative
sense. Try thinking of some combination that will possibly give it a
pejorative meaning. It's impossible.

Now I have to try:

    
    
      "Dynamic, multimodal failure" (fail).
    
      "Dynamic instigation of pain for information retrieval" (torture).
    
      "Dynamic evisceration of underage humans" (slaughtering of children).
    
      "Dynamic destruction of useful resources" (environment destruction).
    
      "An algorithm for calculating dynamic stool-rotor collision physics" (shit hits the fan).
    

Not terribly good I guess but I think not that bad either.

~~~
kinkrtyavimoodh
Even in your examples I don't think the 'dynamic' part is negative.

------
elchief
Remember back in 2017 when Deep Learning wasn't legacy? Those were good times

------
fjsolwmv
"Google uses Bayes nets like Microsoft uses 'if' statements" -+ Joel Spolsky,
15 years ago

------
everdev
If you're going to do a "rebrand", at least use a better name. A six-syllable
word doesn't exactly roll off the tongue.

~~~
adamnemecek
differentiable.js.io

------
oh-kumudo
Just a rebranding, though a necessary one. Deep Learning is not really all
about 'deep' anymore; many successful models don't really need a lot of
layers.

~~~
username223
So it's still just neural nets? Cool -- we've seen that before.

~~~
rspeer
Is a single SGD layer a neural net? Is an image filter or an audio filter a
neural net? Is matrix multiplication a neural net? This would strain the
intended definition even farther than it's already been strained.

But all of those are differentiable programming, and rightly so because
they're all pieces that you use and compose together to make interesting
learning mechanisms, including the ones that we vaguely refer to as "deep
learning" now.

I like the terminology. It's not about what the original long-abandoned
motivation for the design was ("neural"). It's not about how gratuitously
complex you can make it ("deep"). "Differentiable" is about how it works and
how we design it.
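
To make that concrete, here's a minimal sketch assuming PyTorch (my own
illustration): a bare matrix multiplication, no "neurons" anywhere, trained by
gradient descent.

    import torch

    W = torch.randn(8, 4, requires_grad=True)  # just a matrix, not a "layer"
    x = torch.randn(16, 8)
    target = torch.randn(16, 4)

    for _ in range(100):
        loss = ((x @ W - target) ** 2).mean()  # a differentiable program
        loss.backward()
        with torch.no_grad():
            W -= 0.1 * W.grad  # plain gradient descent
            W.grad.zero_()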

~~~
nightski
I'm not really buying your argument here. Neural networks are just a
collection of artificial neurons. There is no requirement for multiple layers
or depth of any kind.

~~~
contrarian_
> artificial neurons

Differentiable functions

~~~
clarry
Basically you're implying that binary neurons, neuroevolution (which works on
non-differentiable functions), etc. aren't a thing. Or at least that they're
not (working with) neural nets.

It's almost as if SGD made decades of AI research into neural networks
vanish.

~~~
rspeer
I don't see that implication at all.

Nobody is claiming that the definition of "differentiable programming" should
be identical to the definition of "neural net". The claim is, if you want to
assign a name to the thing that TensorFlow, PyTorch, and similar frameworks
do, it's "differentiable programming".

If you want to make a non-differentiable neural net, knock yourself out. The
research still exists and nobody is stopping you.

But while we're talking about terminology, I'd encourage you to stop referring
to the units as "neurons". The false analogy to biology just confuses people.

------
currymj
I remember people jokingly referring to stuff like word2vec (one layer,
millions of dimensions) as "wide learning". This is definitely better branding
than that.

------
bitL
Darn! Just when I invested a lot of money and mastered Deep Learning!

------
outlace
But differentiable programming would exclude deep neural nets trained by
evolutionary methods/genetic algorithms, since those are gradient-free. With
the term deep learning, I think the focus is correctly on the “deep”
(compositional) nature of these models and not necessarily on the training
algorithm, of which there are many.
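
For contrast, here's a minimal sketch of the gradient-free alternative,
assuming NumPy (the toy fitness function is invented for illustration): a
simple evolution strategy in the style of OpenAI's ES, which updates
parameters from fitness scores alone.

    import numpy as np

    def fitness(theta):  # black-box score; never differentiated
        return -np.sum((theta - 3.0) ** 2)

    theta, sigma, lr, n = np.zeros(10), 0.1, 0.1, 50
    for _ in range(300):
        noise = np.random.randn(n, len(theta))  # n perturbed candidates
        scores = np.array([fitness(theta + sigma * e) for e in noise])
        scores = (scores - scores.mean()) / (scores.std() + 1e-8)
        theta += lr / (n * sigma) * (noise.T @ scores)  # weighted recombination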

------
contextfree
Does this mean programming language nerds get to play too, maybe after boning
up on our calculus and topology?

~~~
seanmcdirmid
They already are; e.g. Jeff Dean, among many others. The question is whether
the PL academic community will play as well.

Conal Elliott did a lot of work in this area about 10 years ago. His work is
beautiful, but maybe it was before its time.

~~~
flor1s
Interesting comment regarding Conal Elliott. I've always thought there is some
similarity between probabilistic programming (specify a probabilistic model as
a graph), functional reactive programming (specify some reactivity as a
graph), and deep learning (specify some linear algebra / calculus /
optimization operations as a graph). Too bad the term "graphical programming"
would be interpreted as "visual programming" (or programming using plots and
charts!) and not programming using an explicit graph structure.

------
seanmcdirmid
I'm seriously pondering what this means for PL research. There has been some
work on probabilistic programming languages, and a significant part of the
community would like to avoid imperative features. However, this seems like a
chance for some real invigoration of PL research agendas.

------
billconan
"working on compilers for imperative differentiable programming languages"

what would be an example of such language?

~~~
sedachv
Common Lisp without any changes:
[https://people.eecs.berkeley.edu/~fateman/papers/ADIL.pdf](https://people.eecs.berkeley.edu/~fateman/papers/ADIL.pdf)

Fortran with some changes required:
[http://www.ens.utulsa.edu/~diaz/cs8243/adifor.html](http://www.ens.utulsa.edu/~diaz/cs8243/adifor.html)

C with some changes required:
[http://www.ens.utulsa.edu/~diaz/cs8243/adiff.html](http://www.ens.utulsa.edu/~diaz/cs8243/adiff.html)
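
To illustrate what these systems automate, here's a toy sketch in plain
Python (my own construction, not from the linked papers): forward-mode
automatic differentiation with dual numbers, which differentiates ordinary
imperative code, loops and all.

    class Dual:
        """A number that carries its derivative along with its value."""
        def __init__(self, val, dot=0.0):
            self.val, self.dot = val, dot

        def _lift(self, o):
            return o if isinstance(o, Dual) else Dual(o)

        def __add__(self, o):
            o = self._lift(o)
            return Dual(self.val + o.val, self.dot + o.dot)
        __radd__ = __add__

        def __mul__(self, o):
            o = self._lift(o)
            return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
        __rmul__ = __mul__

    def f(x):  # ordinary imperative code with a loop
        y = x
        for _ in range(3):
            y = y * y + x
        return y

    d = f(Dual(2.0, 1.0))  # seed dx/dx = 1
    print(d.val, d.dot)    # f(2) and f'(2) in one pass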

------
danbmil99
"See more of Yann LeCun on Facebook" popup, no access to the page. No, I don't
want to create a Facebook account to read a blog post.

Perhaps links to walled-garden pages where you need an account and need to be
logged in should be prohibited or at least discouraged.

~~~
make3
This will get downvoted to hell, but I really think most socially apt people
should have Facebook accounts these days.

~~~
qwerty456127
No. E.g. I'm fine with Telegram & Signal, plus WhatsApp & Skype for the
elderly. Having no intention of becoming a public person, do I really need a
public profile on a web site whose whole purpose is to spy on me everywhere,
analyse my behavior and contacts, sell the data to others, and show me ads?

~~~
bufferoverflow
I was recently testing the video quality of video chats on a relatively shitty
connection, and somehow Skype came out way ahead of WhatsApp and Hangouts.

~~~
qwerty456127
I absolutely believe you. When I actually need to voice-call somebody (e.g.
mom, granny, or a client company's CEO; all the other people I meet prefer
texting) I use Skype - it does this much better than any of the competitors.
So it sounds fairly probable that it does video better than others too.

------
tomdre
Off topic. Yann LeCun really looks like Michael Moore who looks like Peter
Griffin.

~~~
dang
Please don't post unsubstantive comments here.

