
Neural Ordinary Differential Equations - asparagui
https://arxiv.org/abs/1806.07366
======
duvenaud
Senior author here, I'm happy to answer any questions.

We just released source code:
[https://github.com/rtqichen/torchdiffeq](https://github.com/rtqichen/torchdiffeq)
. This includes PyTorch implementations of adaptive ODE solvers that can be
differentiated through automatically. So you can mix and match these ODE
solvers with any other differentiable model component.
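
A rough usage sketch (module and function names as in the repo; exact
signatures may differ slightly):

    import torch
    import torch.nn as nn
    from torchdiffeq import odeint_adjoint as odeint  # adjoint method: O(1)-memory backward

    class Dynamics(nn.Module):
        """Defines dy/dt = f(t, y); any differentiable module works."""
        def __init__(self, dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

        def forward(self, t, y):
            return self.net(y)

    func = Dynamics(dim=2)
    y0 = torch.randn(10, 2)            # batch of initial states
    t = torch.linspace(0., 1., 5)      # times at which to return the solution
    ys = odeint(func, y0, t)           # shape (5, 10, 2)
    ys[-1].pow(2).mean().backward()    # gradients flow through the ODE solve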

There's already been a bit of follow-up work, turning Continuous Normalizing
Flows into a practical generative density model:
[https://arxiv.org/abs/1810.01367](https://arxiv.org/abs/1810.01367)

And now we're mainly working on 1) regularizing ODE nets to be faster to
solve, and 2) scaling up the time-series model and extending it to stochastic
differential equations.

~~~
ssivark
Could you place the work in context, and provide a simplified explanation for
someone who understands math and ML, but is not familiar with the literature
on normalizing flows and autoencoders? Thanks!

I tried reading, but the abstract and introductory section were a little too
terse for me :-)

~~~
duvenaud
Sure thing. A few years ago, everyone switched their deep nets to "residual
nets". Instead of building deep models like this:

    
    
      h1 = f1(x)
      h2 = f2(h1)
      h3 = f3(h2)
      h4 = f4(h3)
      y  = f5(h4)
    

They now build them like this:

    
    
      h1 = f1(x)  + x
      h2 = f2(h1) + h1
      h3 = f3(h2) + h2
      h4 = f4(h3) + h3
      y  = f5(h4) + h4
    

Where f1, f2, etc are neural net layers. The idea is that it's easier to model
a small change to an almost-correct answer than to output the whole improved
answer at once.

In the last couple of years a few different groups noticed that this looks
like a primitive ODE solver (Euler's method) that solves the trajectory of a
system by just taking small steps in the direction of the system dynamics and
adding them up. They used this connection to propose things like better
training methods.
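
Concretely, the resnet update h <- h + f(h) is one explicit Euler step of size
1 on the ODE dh/dt = f(h). A toy sketch of Euler's method (illustrative, not
from the paper):

    import numpy as np

    def f(h, t):
        # toy dynamics; in a resnet this would be a learned layer
        return -0.5 * h + np.sin(t)

    def euler_solve(h0, t0, t1, num_steps):
        # take small steps in the direction of the dynamics and add them up;
        # with step size 1, each step is exactly the residual update h <- h + f(h)
        h, t = h0, t0
        dt = (t1 - t0) / num_steps
        for _ in range(num_steps):
            h = h + dt * f(h, t)
            t = t + dt
        return h

    print(euler_solve(np.array([1.0, -2.0]), 0.0, 5.0, num_steps=100))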

We just took this idea to its logical extreme: What if we _define_ a deep net
as a continuously evolving system? So instead of updating the hidden units
layer by layer, we define their derivative with respect to depth. We call this
an ODE net.

Now, we can use off-the-shelf adaptive ODE solvers to compute the final state
of these dynamics, and call that the output of the neural network. This has
drawbacks (it's slower to train) but lots of advantages too: We can loosen the
numerical tolerance of the solver to make our nets faster at test time. We can
also handle continuous-time models a lot more naturally. It turns out that
there is also a simpler version of the change of variables formula (for
density modeling) when you move to continuous time.
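
As a sketch (hypothetical module names, not the paper's exact code), a
drop-in "ODE block" replacing a stack of residual layers looks roughly like:

    import torch
    import torch.nn as nn
    from torchdiffeq import odeint_adjoint as odeint

    class ODEFunc(nn.Module):
        # parameterizes dh/d(depth) = f(depth, h) with an ordinary neural net
        def __init__(self, dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

        def forward(self, t, h):
            return self.net(h)

    class ODEBlock(nn.Module):
        # output = state of the dynamics integrated from depth 0 to depth 1
        def __init__(self, dim, rtol=1e-3, atol=1e-3):
            super().__init__()
            self.func = ODEFunc(dim)
            self.rtol, self.atol = rtol, atol  # loosen these at test time for a faster net

        def forward(self, x):
            t = torch.tensor([0., 1.], device=x.device)
            return odeint(self.func, x, t, rtol=self.rtol, atol=self.atol)[-1]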

~~~
im3w1l
So one question about that. In

    
    
      h1 = f1(x)  + x
      h2 = f2(h1) + h1
      h3 = f3(h2) + h2
      h4 = f4(h3) + h3
      y  = f5(h4) + h4
    

the functions are all different. But to see it as "a primitive ODE solver",
shouldn't the functions all be the same?

So if I understand correctly, you have a different take on RNNs, but not on
deep residual nets in general?

~~~
sampo
> shouldn't the functions all be the same?

If we conceptually think that advancing from one neural net layer to the next
one is the same as taking a time step with an ODE solver, then a bit more
precise notation would be

    
    
        h1 = f(t=1,x)  + x
        h2 = f(t=2,h1) + h1
        h3 = f(t=3,h2) + h2
        h4 = f(t=4,h3) + h3
        y  = f(t=5,h4) + h4
    

Now you can say that the function f is always the same, but it still can give
very different values for Δh when evaluated at different time points.
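
In code, "the same f at every layer" just means depth/time is an extra input
(a minimal sketch, not the paper's implementation):

    import torch
    import torch.nn as nn

    class TimeDependentDynamics(nn.Module):
        # one shared function f(t, h); feeding in t lets the same weights
        # produce different updates at different depths
        def __init__(self, dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.Tanh(), nn.Linear(64, dim))

        def forward(self, t, h):
            t_col = torch.full((h.shape[0], 1), float(t))  # one column holding the current time
            return self.net(torch.cat([h, t_col], dim=1))

    f = TimeDependentDynamics(dim=3)
    h = torch.randn(4, 3)
    for t in [1., 2., 3., 4., 5.]:
        h = h + f(t, h)   # same f, different Δh at each "layer"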

~~~
soVeryTired
I do think it's misleading to compare the method to a general feed-forward
network though, for two reasons.

First, to preserve the analogy between Eq. 1 and Eq. 2, the thetas in Eq. 2
should have their own dynamics, which should be learned.

Second, even if Equation 1 doesn't allow it, in a general feed-forward network
it's possible for the state to change dimension between layers. I don't see
how that could happen with the continuous model.

Neat paper, but it'd be nice if they had tied the analogy more explicitly to
RNNs in the introduction.

~~~
duvenaud
The comparison we make is to residual networks, which I think is valid. First,
we do parameterize a theta that changes with time, using a hypernet. But this
is equivalent to the way sampo wrote the equations above - you can just feed
time as another input to the dynamics network to get dynamics that change with
time.

Second, I agree that general feedforward nets allow dimension changes, but
resnets don't. This model is a drop-in replacement for resnets, but not for
any feedforward net. If we gave the wrong impression somewhere, please let us
know.

We didn't make the analogy with RNNs, because I don't think it fits - standard
input-output RNNs have to take in part of the data with every time step, while
here the data only appears at the input (depth 0) layer of the network.

~~~
joe_the_user
Hasn't there been similar work to this in the past?

I don't see a "related work" section in your paper.

~~~
duvenaud
Section 7 is titled "Related Work".

~~~
joe_the_user
Thanks,

Apologies for not looking harder

------
bitL
Wow, I need to master ODEs/PDEs to keep up with Deep Learning now! Seems like
one has to be a master of statistics, operations research, calculus and
algorithms to push it forward!

The comparison to RNNs was impressive! Are there any well-known real-world
models for comparison to the state of the art?

~~~
duvenaud
The closest existing thing to a continuous-time RNN is the Neural Hawkes
Process [[https://arxiv.org/abs/1612.09328](https://arxiv.org/abs/1612.09328)].
They use a different model, in which observing the system necessarily changes
its state; that's natural in some settings but doesn't fit others. On the
other hand, their model scales today and ours doesn't :)

------
heinrichf
Some reporting in MIT's newsletter on AI:
[https://mailchi.mp/technologyreview/a-new-type-of-deep-neural-network-that-has-no-layers](https://mailchi.mp/technologyreview/a-new-type-of-deep-neural-network-that-has-no-layers)

------
syntaxing
I remember something similar for CFD applications, but haven't seen much since.
It would be awesome if we could build a cheap and fast Navier-Stokes solver
with neural networks.

------
shsjxzh
Hi, I am very interested in your models. When you do backpropagation, it seems
that it still requires heavy computation. Although the O(1) memory cost is an
important contribution, do you think recording some of the intermediate values
would significantly speed up training?
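
For reference, the released torchdiffeq code exposes both a plain solver
(which stores intermediate values for the backward pass) and the adjoint-based
one (which recomputes them), so the trade-off in question can be compared
directly. A minimal sketch with a toy dynamics function (names illustrative):

    import torch
    import torch.nn as nn
    from torchdiffeq import odeint, odeint_adjoint

    class F(nn.Module):
        def __init__(self):
            super().__init__()
            self.lin = nn.Linear(2, 2)

        def forward(self, t, y):
            return torch.tanh(self.lin(y))

    func, y0, t = F(), torch.randn(8, 2), torch.linspace(0., 1., 2)

    # stores solver activations: memory grows with the number of solver steps,
    # but the backward pass does not have to re-solve the ODE
    ys_stored = odeint(func, y0, t)

    # adjoint method: O(1) memory, with the trajectory recomputed backwards,
    # trading extra computation for the memory savings
    ys_adjoint = odeint_adjoint(func, y0, t)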

