We just released source code:
This includes PyTorch implementations of adaptive ODE solvers that can be differentiated through automatically. So you can mix and match these ODE solvers with any other differentiable model component.
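For example, here's a minimal sketch of calling one of these solvers and backpropagating through it (the toy dynamics net and shapes are made up for illustration; odeint is the entry point in our repo):

import torch
from torchdiffeq import odeint  # adaptive solvers, differentiable via autograd

# hypothetical toy dynamics net defining dh/dt = f(t, h)
net = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.Tanh(), torch.nn.Linear(16, 2))
def f(t, h):
    return net(h)

h0 = torch.randn(8, 2)              # batch of initial states
t = torch.tensor([0.0, 1.0])
hT = odeint(f, h0, t)[-1]           # state at t=1 is the network's output
hT.sum().backward()                 # gradients flow back through the solver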
There's already been a bit of follow-up work, turning Continuous Normalizing Flows into a practical generative density model:
And now we're mainly working on 1) regularizing ODE nets to be faster to solve and 2) scaling up the time-series model and extending it to stochastic differential equations.
I tried reading, but the abstract and introductory section were a little too terse for me :-)
h1 = f1(x)
h2 = f2(h1)
h3 = f3(h2)
h4 = f4(h3)
y = f5(h4)
h1 = f1(x) + x
h2 = f2(h1) + h1
h3 = f3(h2) + h2
h4 = f4(h3) + h3
y = f5(h4) + h4
In the last couple of years, a few different groups noticed that this looks like a primitive ODE solver (Euler's method), which solves for the trajectory of a system by just taking small steps in the direction of the system dynamics and adding them up. They used this connection to propose things like better training methods.
We just took this idea to its logical extreme: What if we _define_ a deep net as a continuously evolving system? So instead of updating the hidden units layer by layer, we define their derivative with respect to depth instead. We call this an ODE net.
Now, we can use off-the-shelf adaptive ODE solvers to compute the final state of these dynamics, and call that the output of the neural network. This has drawbacks (it's slower to train) but lots of advantages too: We can loosen the numerical tolerance of the solver to make our nets faster at test time. We can also handle continuous-time models a lot more naturally. It turns out that there is also a simpler version of the change of variables formula (for density modeling) when you move to continuous time.
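For example, with torchdiffeq-style solvers, the tolerance is just an argument you can loosen at test time (a sketch with toy dynamics, not the paper's actual models):

import torch
from torchdiffeq import odeint

def f(t, h):                        # toy linear dynamics, for illustration only
    return -h

h0, t = torch.randn(8, 2), torch.tensor([0.0, 1.0])
h_train = odeint(f, h0, t, rtol=1e-6, atol=1e-6)[-1]   # tight tolerance while training
h_test = odeint(f, h0, t, rtol=1e-3, atol=1e-3)[-1]    # looser tolerance: fewer steps, faster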
So if I understand correctly, you have a different take on RNNs but not on deep residual nets in general?
If we conceptually think that advancing from one neural net layer to the next one is the same as taking a time step with an ODE solver, then a bit more precise notation would be
h1 = f(t=1,x) + x
h2 = f(t=2,h1) + h1
h3 = f(t=3,h2) + h2
h4 = f(t=4,h3) + h3
y = f(t=5,h4) + h4
First, to preserve the analogy between eq. 1 and 2, the thetas in equation two should have their own dynamics, which should be learned.
Second, even if Equation 1 doesn't allow it, in a general feed-forward network it's possible for the state to change dimension between layers. I don't see how that could happen with the continuous model.
Neat paper, but it'd be nice if they had tied the analogy more explicitly to RNNs in the introduction.
First, we do parameterize a theta that changes with time, using a hypernet.

Second, I agree that general feedforward nets allow dimension changes, but resnets don't. This model is a drop-in replacement for resnets, but not for any feedforward net. If we gave the wrong impression somewhere, please let us know.
We didn't make the analogy with RNNs, because I don't think it fits - standard input-output RNNs have to take in part of the data with every time step, while here the data only appears at the input (depth 0) layer of the network.
You might be able to make a better connection to RNNs by having the input data as a 'forcing' function in your ODE. But you probably need some regularity conditions on the input data to make sure the result is nicely behaved.
Well, they can, but I don't see why they have to. And couldn't your network also take input and give output for all times?
> First, we do parameterize a theta that changes with time, using a hypernet.
Ah, I see. Did you end up using it in the final model? I don't see it in the MNIST example, but I could be missing it, as I only skimmed the code.
I don't see a "related work" section in your paper.
Apologies for not looking harder
Or something like that.
Take a simple NN with 3 layers: 5 neurons in the input layer, 3 in the hidden and 1 output.
Force inputs to neuron 4 and 5 in the hidden layer to be zero, and force inputs to neurons 2-5 to be zero in the output layer (and ignore their output). I'm assuming the transfer function obeys f(0) = 0, if not, fix output to zero as well.
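In code, the masking I have in mind would be something like this (made-up weights, just to illustrate):

import torch

W1, W2 = torch.randn(5, 5), torch.randn(5, 5)     # keep a fixed 5-unit state everywhere
hidden_mask = torch.tensor([1., 1., 1., 0., 0.])  # only 3 hidden units "exist"
output_mask = torch.tensor([1., 0., 0., 0., 0.])  # only 1 output unit "exists"

x = torch.randn(5)
h = torch.tanh((W1 @ x) * hidden_mask)   # inputs to units 4 and 5 forced to zero; tanh(0) = 0
y = torch.tanh((W2 @ h) * output_mask)   # inputs to units 2-5 forced to zero, outputs ignored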
My thought was this would be similar to how you enforce boundary conditions when solving partial differential equations by directly setting the value of certain matrix elements before running the solver.
Again, may be completely silly.
What about symmetries of the underlying continuous system?
I'm under the impression that having deep nets as ODEs should make it possible to enforce a certain geometry on the information flow (like incompressible fluid, Hamiltonian, etc..) which would correspond to some invariant of the whole network.
Does this idea make sense?
A while back, I read about work porting SNNs (spiking neural networks) to PyTorch with an ODE solver.
Looking at the links you provided, they don't appear to train the network dynamics, only the readout weights. In principle you could differentiate through the ODE solvers used in the software package linked to train the dynamics, but as far as I know we were the first people to release an open-source set of differentiable solvers using reverse-mode adjoint sensitivities (the most scalable variant of autodiff for training scalar losses). [https://github.com/rtqichen/torchdiffeq]
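Concretely, the adjoint-based training path in that repo looks like this (a sketch; the dynamics module is made up, but odeint_adjoint is the repo's actual entry point):

import torch
from torchdiffeq import odeint_adjoint  # reverse-mode adjoint sensitivities

class Dynamics(torch.nn.Module):        # the adjoint variant expects an nn.Module
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(2, 2)
    def forward(self, t, h):
        return self.net(h)

f = Dynamics()
hT = odeint_adjoint(f, torch.randn(8, 2), torch.tensor([0.0, 1.0]))[-1]
hT.sum().backward()                     # gradients come from a backward ODE solve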
In the paper we talk a bit about how you can now more easily train Poisson process likelihoods, which might make training SNNs easier. I'm not sure.
Would the DifferentialEquations.jl work in the Julia community qualify? As I understand it, for the pure Julia solvers you can run them through autodiff to differentiate with respect to the parameters. Chris Rackauckas has talked a lot about how cool it is that he can take his solvers that were not written to be AD-aware and use somebody else's AD package that's not specialized for diff eqs, and combined you get differentiable diff eqs.
We found that a tracing-based reverse-mode adjoint plus solving without mutation (the method implemented in this repository) is not competitive with sensitivity analysis methods that are fully compiled and utilize mutation. Since the methods in SciPy don't utilize mutation (and add quite a bit of their own overhead, and don't necessarily work well with JIT compilation because of context switching, as previously demonstrated in http://juliadiffeq.org/2018/04/30/Jupyter.html ), I wouldn't be surprised that they saw a speedup anyways over the odeint+adjoint method described in the paper, but that's mostly because the method from the paper is highly inefficient.
In terms of efficient implementations, the first ones which are equivalent to a reverse-mode adjoint were actually those of FATODE ( https://dl.acm.org/citation.cfm?id=2048596&dl=ACM&coll=DL ), which implemented quite a few Runge-Kutta and Rosenbrock methods with forward-mode and reverse-mode transforms of the solvers manually written in (described as "discrete sensitivity analysis") for hooking into Fortran AD implementations. JuliaDiffEq's methods build on this existing heritage and testing.
But sensitivity analysis goes back a lot further. In their paper they utilize adjoint sensitivity analysis, and the most common software one would point to for this is SUNDIALS CVODES (https://computation.llnl.gov/sites/default/files/public/cvs_... ). If you compare page 26 of the SUNDIALS manual to the Neural ODE paper, you'll notice that it's eerily similar, because it's almost the same method. And actually, section 2.7.1 describes an improvement to the method in the Neural ODE paper via "checkpointing".

In the Neural ODE paper, to do a reverse solve of the adjoint ODE, it solves the forward ODE from the beginning time point up to that point. Clearly, this is really slow because it requires a lot of forward solves over long intervals. You can instead save a few time points (checkpoints) along the forward solve and use those as starting points, and this is the method implemented in CVODES and IDAS. In JuliaDiffEq, since we were focused on less memory-intensive applications (CVODES did it for PDEs), we did it in a way where you get a continuous solution in one forward solve, and then just use the interpolation for an essentially free calculation of the Jacobian at a given time point. This of course goes from many ODE solves to a single ODE solve, at the cost of using more memory, and we proposed implementing checkpointing in a GSoC. But anyways, the paper's derivation of the adjoint method wasn't new, and it was actually already improved on by 2005 in the TOMS paper on CVODES ( https://computation.llnl.gov/projects/sundials/toms_cvodes_w... ), with other methods for improving runtime as well.
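To make the checkpointing idea concrete, here's a hand-rolled sketch using SciPy's solve_ivp (not CVODES itself; the helper names are made up):

import numpy as np
from scipy.integrate import solve_ivp

def forward_with_checkpoints(f, z0, t0, t1, n_ckpt=10):
    # one forward solve, saving the state at a few checkpoint times
    ts = np.linspace(t0, t1, n_ckpt + 1)
    sol = solve_ivp(f, (t0, t1), z0, t_eval=ts, rtol=1e-8, atol=1e-8)
    return {float(tc): y for tc, y in zip(sol.t, sol.y.T)}

def state_at(f, ckpts, t):
    # during the backward pass, recover z(t) by a short re-solve
    t0 = max(s for s in ckpts if s <= t)   # nearest earlier checkpoint
    if t0 == t:
        return ckpts[t0]
    return solve_ivp(f, (t0, t), ckpts[t0], rtol=1e-8, atol=1e-8).y[:, -1]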
In a previous Twitter thread, the senior author also claimed that using autodifferentiation to do the vector-Jacobian products was a first. For reference, what I mean is that there is a vjp in the adjoint ODE, and it can be computed without explicitly building the Jacobian, by seeding the backsolve of the derivative function f appropriately. I will admit that in JuliaDiffEq we weren't doing this, but we have since improved our implementation to make use of this trick (and the advantage of this method is shown in the Arxiv paper linked above). That said, the Neural ODE paper isn't the first to do this either, with earlier work traced back to CasADi (https://link.springer.com/chapter/10.1007/978-3-642-30023-3_...), and of course the author of CasADi was confused as well in the same Twitter thread about the novelty claim here.
And this is just the tip of the iceberg for schemes utilizing adjoint methods. There's entire pharmacometric software (NONMEM) and engineering software (Modelica) built back in the '80s that builds the adjoint equations to get these sensitivities for parameter estimation.
So I honestly cannot find a single part of the adjoint method either as mentioned or implemented where this PyTorch implementation is the first. I think a lot of this comes from a lack of technical expertise in the area of numerical differential equations and just an honest mistake though. There are other glaring examples of this in the paper as well. For example, in the paper they mention:
>To solve ODE initial value problems numerically, we use the implicit Adams method implemented in LSODE and VODE and interfaced through the scipy.integrate package. Being an implicit method, it has better guarantees than explicit methods such as Runge-Kutta but requires solving a nonlinear optimization problem at every step.
However, it's well known that not all implicit methods have better stability guarantees than explicit methods. Specifically, the implicit Adams methods they mention are Adams-Bashforth-Moulton methods, which are not unconditionally stable and have stability similar to explicit methods (you can find this in a lot of numerical analysis course lecture notes, like this one: http://www.math.vt.edu/people/embree/math5466/lecture37_38.p... ; the best resource on this is probably Hairer's Solving Ordinary Differential Equations I: Non-stiff Problems). Because of this lack of stability, LSODE with Adams coefficients is only recommended for non-stiff equations, and common stiff test examples like ROBER will cause it to fail. Instead, the backward differentiation formulae (BDF) are the analogous methods used for stiff equations (there are lots of sources on this; for example, in MATLAB, ode113 is the Adams method and ode15s is the BDF method, and MATLAB's documentation explains how they make the recommendations. Another example is CVODE, which implements both the Adams and BDF methods and only recommends the Adams method for non-stiff equations). This also means that the current set of PyTorch differential equation solvers is only applicable to (some) non-stiff ODEs, BTW.
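To make the stiffness point concrete in Python (SciPy's solve_ivp has BDF but no implicit Adams, so RK45 stands in for a non-stiff method here):

from scipy.integrate import solve_ivp

def rober(t, y):
    # Robertson's chemical kinetics, a classic stiff test problem
    y1, y2, y3 = y
    return [-0.04 * y1 + 1e4 * y2 * y3,
            0.04 * y1 - 1e4 * y2 * y3 - 3e7 * y2 ** 2,
            3e7 * y2 ** 2]

sol = solve_ivp(rober, (0.0, 1e5), [1.0, 0.0, 0.0], method='BDF', rtol=1e-6, atol=1e-8)
# swapping in method='RK45' (explicit, for non-stiff problems) forces absurdly small steps here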
I do want to end this critique though by saying that the paper itself describes a novel application of differential equation solvers as part of a neural net in a way that seems very beneficial. So even though what was demonstrated is not technically sophisticated compared to current software and techniques, their application is a very interesting idea and a new research direction which is deserving of a best paper. I don't think we should lose sight of the fact that the idea of an RNN as an Euler discretization, and thus building the continuous one to get better training times, is a very neat solution to a machine learning problem, and I very much respect the authors for this advance.
I think that what we can learn from this case is that the numerical differential equation and machine learning communities need to come together to do this in a way that pushes the research and the software forward. We are starting to do this in Julia, and that's why in our Arxiv paper you can see the JuliaDiffEq crew has teamed up with the machine learning and AD people (Jarrett Revels and Mike Innes) to make sure we can get the full adaptive + events + stiffness handling etc. solves working with AD, and get these utilized in new applications for machine learning. With the help of one of the paper's authors, we hope to put out a package in January for implementing said neural ODEs in Julia. From there we can all probe the method in our own ways. Since my background is in numerical differential equation solvers, I plan to look at this to see what kinds of solvers are optimal in this application (I think some EPIRK-type methods might be the right fit). The ML folks will likely want to actually use it as a model to train and predict. The AD folks see this as a good test case for source-to-source transformations, since building a tape a la Flux or PyTorch can be pretty inefficient for large nonlinear operations like an ODE solver (so we plan on seeing what happens with Zygote.jl on this; right now it errors, but we'll see!).
I see a very nice future merging differential equations and machine learning in various ways, and this paper is a fantastic start in that direction. Let's just make sure we cite the relevant existing work from both communities.
As for SUNDIALS and FATODE, my understanding was that they used finite differences, forward mode, or differentiating the solver operations for at least some aspect of their sensitivity analysis.
On another topic, I think you might be misunderstanding how we're running our adjoint sensitivity analysis. You say
> In the Neural ODE paper, to do a reverse solve of the adjoint ODE, it solves the forward ODE from the beginning time point up to that point. Clearly, this is really slow because it requires a lot of forward solves over long intervals.
This isn't true - when we do the reverse solve, we get all gradients using a _single_ solve going backwards in time. I'm now realizing that the misunderstanding might be caused by our Fig. 2, which shows that multiple backward solves are necessary when the loss depends on the state at multiple time points.
I agree that our statement about the stability of implicit over explicit was overly broad, thanks for pointing that out. Can you suggest a more accurate statement about the advantages of implicit over explicit methods?
I also agree there is a lot of numerical work to be done in this area, and I'm glad that people more knowledgeable than us (such as yourself) are looking at it too!
No, their sensitivity analysis doesn't use finite differences. They perform the sensitivity analysis as described in their documentation. If methods for the Jacobian calculation are not provided, then they utilize finite differences on the Jacobian calculation of course, using the same routine as for the stiff solver. But with a given Jacobian function there's no finite differences or forward mode.
>This isn't true - when we do the reverse solve, we get all gradients using a _single_ solve going backwards in time. I'm now realizing that the misunderstanding might be caused by our Fig. 2, which shows that multiple backward solves are necessary when the loss depends on the state at multiple time points.
No. Of course it's a single solve going backwards in time. I see what my misread was, but I'm surprised you'd attempt to solve the equation backwards like that because it's known to not be stable. Without a reversible integrator (implicit Adams is not reversible), it's a well-known result that the method drifts from the true solution when doing a backwards integration, so the values of z(t) need to be computed with forward passes in order to be correct. A good test equation for this is probably the Lorenz equation with standard parameters over a time span like [0,300]. The backwards pass will diverge and not necessarily be on the same butterfly wing as the forwards pass. The Julia code using CVODE is shown here ( https://gist.github.com/ChrisRackauckas/fef4ae7778320530d44b... ) and you see that, starting from [1.0,1.0,1.0], the result of going backwards is then off by [-17.5445, -14.7706, 39.7985]. So there you go: CVODE's Adams method ends up on a different "wing" of the butterfly when integrated backwards, ending up not even close to the actual initial point (CVODE is the successor to LSODE, both by Alan Hindmarsh, but utilizes constant leading coefficient forms to reduce computations, so not exactly the same as the paper, but very close). Thus, to ensure correctness, existing sensitivity analysis packages only get Jacobians of f using data from forward passes. The Neural ODEs may have had a small enough Lyapunov coefficient or a short enough integration that this wasn't an issue, but it is in general something to note. Of course, if you are only doing this on Hamiltonian systems...
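For those following along in Python, the same experiment is easy to replicate with SciPy's solve_ivp (a sketch, not CVODE, so the exact drift will differ):

import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, u):
    x, y, z = u
    return [10.0 * (y - x), x * (28.0 - z) - y, x * y - (8.0 / 3.0) * z]

u0 = [1.0, 1.0, 1.0]
fwd = solve_ivp(lorenz, (0.0, 40.0), u0, rtol=1e-8, atol=1e-8)
bwd = solve_ivp(lorenz, (40.0, 0.0), fwd.y[:, -1], rtol=1e-8, atol=1e-8)
print(bwd.y[:, -1] - np.array(u0))   # the backwards solve drifts far from the true u0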
>I agree that our statement about the stability of implicit over explicit was overly broad, thanks for pointing that out. Can you suggest a more accurate statement about the advantages of implicit over explicit methods?
Any statement on it is too broad to be useful. Runge-Kutta Chebyshev methods are explicit methods for stiff systems. Implicit Adams is an implicit method for non-stiff systems. And there's many more examples. The stiffness handling also depends on implementation details. Using functional iteration on a BDF method reduces the region of stability, which is why BDF needs to use Newton's method for solving the implicit equation in order to be applicable to stiff ODEs. It's best to just talk about the stability of individual methods and their implementation.
Right, but instantiating an entire Jacobian is always going to scale at least quadratically with the size of the system. The point I was trying to make is that the existing non-adjoint approaches were never going to scale to large systems with millions of parameters. This is the main attraction of reverse-mode, and it appeared to me that this was a major obstacle for fitting large models using existing packages (excepting CasADi).
> I'm surprised you'd attempt to solve the equation backwards like that because it's known to not be stable.
> it's a well-known result that the method drifts from the true solution when doing a backwards integration, so the values of z(t) need to be computed with forward passes in order to be correct.
I agree that a purely reverse-mode gradient solve will diverge from the forward trajectory to some degree. But to say the gradients are 'correct' or not seems a bit strange to me. Every numerical solve introduces some degree of error, and I think the most useful discussion to have is the tradeoff between computation cost and numerical error. Re-solving the system forwards is one strategy to reduce error at the cost of computation. Another strategy would be reducing the error tolerance of the reverse solve. There are situations where our strategy might give worse precision wrt the parameter gradients for a given computational budget, but I wouldn't dismiss it out of hand. Especially since it's about as computationally cheap as one could hope for - O(1) memory and similar time cost as the forward solve. Also, it worked for our applications.
> Any statement on it is too broad to be useful.
I appreciate the detailed reply. But is there anything you can say about when to try implicit methods over explicit? What was the motivation for developing implicit methods in the first place?
I should have said something like: We were the first to implement adjoint sensitivities in a modern, tracing-based autodiff framework suitable for machine learning [HIPS autograd]. But I messed up and should have given credit to Joel Andersson for being first, my apologies.
Is the Zero-G environment of deep space the ideal environment for continuous differential models?
Also, any outline of how one would go about training one of your systems? What kind of dataset does it do best with?
You train these models in the same way you train a neural network, with stochastic gradient descent. For supervised learning, its main benefit is extra flexibility in the speed/precision tradeoff. For time-series problems, we expect this will ultimately allow us to handle data that's collected at irregular intervals - but we haven't yet moved past the prototype stage in that setting.
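As a sketch of what that looks like (the data loader and loss here are hypothetical; the solver call follows torchdiffeq):

import torch
from torchdiffeq import odeint_adjoint

class ODEFunc(torch.nn.Module):
    def __init__(self, dim=2):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, 64), torch.nn.Tanh(), torch.nn.Linear(64, dim))
    def forward(self, t, h):
        return self.net(h)

func = ODEFunc()
opt = torch.optim.SGD(func.parameters(), lr=1e-2)
t = torch.tensor([0.0, 1.0])
for x, y in loader:                        # 'loader' is any (input, target) batch iterator
    pred = odeint_adjoint(func, x, t)[-1]  # the ODE block is the whole model here
    loss = torch.nn.functional.mse_loss(pred, y)
    opt.zero_grad(); loss.backward(); opt.step()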
In the GP case, those things are random and independent of each other, so the central limit theorem applies and we get a simple Gaussian.
In the ODE case, the infinitesimals are deterministic and depend on the previous ones in the sequence, so the final answer is deterministic and impossible to compute exactly in general.
You could also use a GP to model the dynamics of an ODE, and this was done recently: [https://arxiv.org/abs/1803.04303] although the drawback was that they couldn't train the model by backpropagation.
I wonder if you've given any thought to generalizing to fractional differential equations? My intuition tells me that the dynamics that you're learning are "local" in the sense that the ODE solvers depend only on the current state (and maybe some recent history), whereas learning the dynamics of a fractional system could give the system a larger "history" in the case of your time-series models.
I have two small questions regarding the paper:
1. When comparing to Normalizing Flows (planar flows) in Section 4.1, how were these fitted in the Maximum Likelihood Training section? If I understand it correctly, NFs don't have a closed-form inverse, such that ML training should not be possible.
2. Do you encounter any issues regarding stability during training? Other flow-based approaches such as Glow use certain tricks to ensure that the flow initially reduces to an identity transform, to increase stability.
2. We had to set the error tolerance relatively small during training to keep the gradients stable. I don't think we used any fancy initialization tricks, but to be honest I have to ask Ricky Chen and Will Grathwohl, who ran all the FFJORD experiments.
z0 ~ Normal(0, I)
z1 = f1(z0)
z2 = f2(z1)
z3 = f3(z2)
x = f4(z3)
log p(x) = log p(z0) - log|det df1/dz0| - log|det df2/dz1| - log|det df3/dz2| - log|det df4/dz3|
dlogp(z(t))/dt = -trace(df/dz)
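To spell that out, here's a crude fixed-step sketch of integrating the state and its log-density together (the paper uses adaptive solvers, and FFJORD replaces the exact trace with a stochastic estimator; the little 2-D net here is made up):

import math
import torch

net = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))

def f(t, z):
    return net(z)

def trace_df_dz(z, dz):
    # exact trace of the Jacobian df/dz: one autograd call per state dimension
    return sum(torch.autograd.grad(dz[:, i].sum(), z, create_graph=True)[0][:, i]
               for i in range(z.shape[1]))

def cnf_transform(z0, logp0, n_steps=100, T=1.0):
    # z0 is assumed to be a plain sample with no gradient history
    dt = T / n_steps
    z, logp = z0.clone().requires_grad_(True), logp0
    for k in range(n_steps):
        dz = f(k * dt, z)
        logp = logp - dt * trace_df_dz(z, dz)   # dlogp/dt = -trace(df/dz)
        z = z + dt * dz                         # Euler step for the state itself
    return z, logp

z0 = torch.randn(64, 2)   # z0 ~ Normal(0, I)
logp0 = -0.5 * ((z0 ** 2).sum(1) + z0.shape[1] * math.log(2 * math.pi))
x, logpx = cnf_transform(z0, logp0)   # push base samples and densities forward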
Neural autoregressive flows are powerful density models, but are computationally costly to sample from. Continuous normalizing flows cost about the same to evaluate densities and to sample. We compared the two in a follow-up paper: [https://arxiv.org/abs/1810.01367]
One could imagine starting with a convolutional neural network that is also a residual network (I may be butchering the proper terminology here) and taking the limit of an infinitely fine discretization in space as well as time to arrive at a PDE instead of an ODE.
Would the adjoint method approach that you used for backpropagation work in this case?
1. What is a black box ODE solver?
(It sounds like a proprietary tool but that doesn't make sense and googling didn't bring up anything useful)
2. Have you found any areas where there is a significant difference in the quality of the results using this method rather than the normal discrete method?
I am sorry if these questions are a bit below you, or if you already answered them in the paper.
2. This paper only had proofs of concept and toy demos. But in our follow-up paper, FFJORD [https://arxiv.org/abs/1810.01367], we used continuous normalizing flows to get SOTA, efficiently sample-able density models.
The comparison to RNNs was impressive! Any well-known real-world models for comparison to the state of the art?