240 points by asparagui on Dec 13, 2018 | 60 comments

 Senior author here, I'm happy to answer any questions. We just released the source code: https://github.com/rtqichen/torchdiffeq . This includes PyTorch implementations of adaptive ODE solvers that can be differentiated through automatically, so you can mix and match these ODE solvers with any other differentiable model component.

There's already been a bit of follow-up work, turning Continuous Normalizing Flows into a practical generative density model: https://arxiv.org/abs/1810.01367

And now we're mainly working on 1) regularizing ODE nets to be faster to solve, and 2) getting the time-series model to scale up and extending it to stochastic differential equations.
 Could you place the work in context, and provide a simplified explanation for someone who understands math and ML, but is not familiar with the literature on normalizing flows and autoencoders? Thanks! I tried reading, but the abstract and introductory section were a little too terse for me :-)
 Sure thing. A few years ago, everyone switched their deep nets to "residual nets". Instead of building deep models like this:

    h1 = f1(x)
    h2 = f2(h1)
    h3 = f3(h2)
    h4 = f4(h3)
    y = f5(h4)

they now build them like this:

    h1 = f1(x) + x
    h2 = f2(h1) + h1
    h3 = f3(h2) + h2
    h4 = f4(h3) + h3
    y = f5(h4) + h4

where f1, f2, etc. are neural net layers. The idea is that it's easier to model a small change to an almost-correct answer than to output the whole improved answer at once.

In the last couple of years, a few different groups noticed that this looks like a primitive ODE solver (Euler's method) that solves for the trajectory of a system by just taking small steps in the direction of the system dynamics and adding them up. They used this connection to propose things like better training methods.

We just took this idea to its logical extreme: what if we _define_ a deep net as a continuously evolving system? So instead of updating the hidden units layer by layer, we define their derivative with respect to depth instead. We call this an ODE net.

Now we can use off-the-shelf adaptive ODE solvers to compute the final state of these dynamics, and call that the output of the neural network. This has drawbacks (it's slower to train) but lots of advantages too: we can loosen the numerical tolerance of the solver to make our nets faster at test time, and we can also handle continuous-time models a lot more naturally. It turns out that there is also a simpler version of the change of variables formula (for density modeling) when you move to continuous time.
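To make the Euler connection concrete, here's a toy sketch (my own, not the paper's code): with step size 1, each Euler step h ← h + f(t, h) has exactly the shape of one residual block.

```python
import math

# Fixed-step Euler integration of dh/dt = f(t, h).
# With dt = 1, each step h <- h + f(t, h) looks like one residual block.
def euler_solve(f, h0, t0=0.0, t1=5.0, steps=5):
    h, t = h0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * f(t, h)  # one "residual block" per step
        t = t + dt
    return h

# Toy dynamics dh/dt = -0.5 * h, whose exact solution is h0 * exp(-0.5 * t).
approx = euler_solve(lambda t, h: -0.5 * h, 1.0)
exact = math.exp(-2.5)
print(approx, exact)  # 5 big Euler steps are crude; more/smaller steps converge
```

An adaptive solver would shrink the step size wherever the dynamics move fast, which is exactly the speed/precision knob the ODE-net view exposes at test time.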
 So one question about that. In

    h1 = f1(x) + x
    h2 = f2(h1) + h1
    h3 = f3(h2) + h2
    h4 = f4(h3) + h3
    y = f5(h4) + h4

the functions are all different. But to see it as "a primitive ODE solver", shouldn't the functions all be the same? So if I understand correctly, you have a different take on RNNs, but not on deep residual nets in general?
 > then the functions should be the same?

If we conceptually think of advancing from one neural net layer to the next as taking a time step with an ODE solver, then slightly more precise notation would be

    h1 = f(t=1, x) + x
    h2 = f(t=2, h1) + h1
    h3 = f(t=3, h2) + h2
    h4 = f(t=4, h3) + h3
    y = f(t=5, h4) + h4

Now you can say that the function f is always the same, but it can still give very different values for Δh when evaluated at different time points.
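In code, this shared-f view looks like the following toy sketch (mine, not from the thread): one set of weights, with the depth t fed in as an extra input so that Δh can still differ at each step.

```python
import numpy as np

# One shared function f(t, h) replaces the per-layer functions f1..f5;
# the time/depth input lets the update differ at each step anyway.
rng = np.random.default_rng(1)
W = 0.1 * rng.normal(size=(4, 4))
b = 0.1 * rng.normal(size=4)

def f(t, h):
    return np.tanh(W @ h + t * b)  # t shifts the pre-activations

h = rng.normal(size=4)
for t in range(1, 6):       # plays the roles of h1..h4 and y above
    h = h + f(float(t), h)  # same f every step, residual update
print(h.shape)  # (4,)
```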
 I do think it's misleading to compare the method to a general feed-forward network, though, for two reasons.

First, to preserve the analogy between Eq. 1 and Eq. 2, the thetas in Equation 2 should have their own dynamics, which should be learned.

Second, even if Equation 1 doesn't allow it, in a general feed-forward network it's possible for the state to change dimension between layers. I don't see how that could happen with the continuous model.

Neat paper, but it'd be nice if they had tied the analogy more explicitly to RNNs in the introduction.
 The comparison we make is to residual networks, which I think is valid. First, we do parameterize a theta that changes with time, using a hypernet. But this is equivalent to the way sampo wrote the equations above: you can just feed time as another input to the dynamics network to get dynamics that change with time.

Second, I agree that general feedforward nets allow dimension changes, but resnets don't. This model is a drop-in replacement for resnets, but not for any feedforward net. If we gave the wrong impression somewhere, please let us know.

We didn't make the analogy with RNNs because I don't think it fits: standard input-output RNNs have to take in part of the data at every time step, while here the data only appears at the input (depth 0) layer of the network.
 You're absolutely right - sorry, somehow I managed to miss the explicit time parameter in your equation two, and didn't read carefully enough to see that you were restricting the discussion to resnets and normalising flows.You might be able to make a better connection to RNNs by having the input data as a 'forcing' function in your ODE. But you probably need some regularity conditions on the input data to make sure the result is nicely behaved.
 im3w1l on Dec 15, 2018

> standard input-output RNNs have to take in part of the data with every time step

Well, they can, but I don't see why they have to. And couldn't your network also take input and give output at all times?

> First, we do parameterize a theta that changes with time, using a hypernet.

Ah, I see. Did you end up using it in the final model? I don't see it in the MNIST example, but I could be missing it, as I only skimmed the code.
 joe_the_user on Dec 14, 2018

Hasn't there been similar work to this in the past? I don't see a "related work" section in your paper.
 Section 7 is titled "Related Work".
 Thanks! Apologies for not looking harder.
 magicalhippo on Dec 14, 2018 Now it's been ages since I dabbled with neural nets so this might be completely silly, but can't a change in dimension be thought of as forcing the weights to/from certain nodes to be zero?
 Ah, that would stop certain dimensions from changing, but the output would still be the same size.
 While it's technically still the same size, I think he's proposing that it's, in a sense, isomorphic to a dimension change if the fix to zero propagates throughout the remainder of the layers (until the next 'change', that is). Or something like that.
 I hadn't entirely fleshed out the idea, but yeah. Take a simple NN with 3 layers of 5 neurons each, and suppose we want it to act like a 5-3-1 network.

Force the inputs to neurons 4 and 5 in the hidden layer to be zero, and force the inputs to neurons 2-5 in the output layer to be zero (and ignore their outputs). I'm assuming the transfer function obeys f(0) = 0; if not, fix the outputs to zero as well.

My thought was that this would be similar to how you enforce boundary conditions when solving partial differential equations, by directly setting the value of certain matrix elements before running the solver.

Again, this may be completely silly.
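A tiny numpy sketch of this masking idea (hypothetical, and smaller than the example above): zeroing the weights into and out of one hidden unit makes a width-3 hidden layer act like a width-2 one, without changing any array shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 5))  # 5 inputs -> 3 hidden units
W2 = rng.normal(size=(1, 3))  # 3 hidden units -> 1 output

mask = np.array([1.0, 1.0, 0.0])  # "remove" hidden unit 3
W1m = W1 * mask[:, None]          # its pre-activation is forced to 0
W2m = W2 * mask[None, :]          # and its output is ignored downstream

x = rng.normal(size=5)
h = np.tanh(W1m @ x)  # tanh(0) = 0, so the masked unit stays silent
y = W2m @ h
print(h[2])  # 0.0
```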
 gundeep59 on Dec 14, 2018

They model the above as dh(t)/dt, generalizing the discrete case (the equations you wrote) to a continuous one. Check Eq. 2 in the paper. The statement following the equation makes it clear: "Starting from the input layer h(0), we can define the output layer h(T) to be the solution to this ODE initial value problem at some time T". Here, as per my understanding, h(0) can be the input itself. The function f in Eq. 2 plays the role of the RNN cell.
 spaced-out on Dec 15, 2018

This paper explains the reasoning behind that: https://arxiv.org/abs/1512.03385
 chombier on Dec 15, 2018

> We just took this idea to its logical extreme: What if we _define_ a deep net as a continuously evolving system?

What about symmetries of the underlying continuous system? I'm under the impression that having deep nets as ODEs should make it possible to enforce a certain geometry on the information flow (like incompressible fluid, Hamiltonian, etc.), which would correspond to some invariant of the whole network. Does this idea make sense?
 My dissertation was about energy- and symplecticity-preserving methods for Hamiltonian ODEs. Try to find the book by Blanes, or the one by Leimkuhler.
 snrji on Dec 14, 2018

Off-topic: could you tell us what degree you studied for? What is your academic background?
 I did a CS undergrad at the University of Manitoba. Then I took some time off to do a startup, and was in the army reserves. Then I went to UBC to do an MSc in CS + Stats. My PhD was officially in the made-up-sounding subject of "Information Engineering" at Cambridge, but really I just worked on Bayesian nonparametrics the whole time. I didn't start working on deep learning until my postdoc.
 Thanks!
 Donald on Dec 14, 2018 His background is available on his CV: http://www.cs.toronto.edu/~duvenaud/
 > His background is available on his CV: http://www.cs.toronto.edu/~duvenaud/

Thank you.
 From a software standpoint, will any of these ideas be ported to TensorFlow or is this very different?
 How is this different from spiking neural networks (SNNs)? It seems like an easier, simplified (e.g. no synapses), densely connected version of an SNN, with less control over inter-neuron connections.

Recently I read about work porting SNNs (with an ODE solver) to PyTorch: https://github.com/Hananel-Hazan/bindsnet https://www.frontiersin.org/articles/10.3389/fninf.2018.0008...
 My understanding is that the advantage of something like an SNN is mainly in custom-hardware, ultra-low-power applications. But it adds a lot of complexity and makes training a lot more difficult, compared to something like ODE nets or standard discrete neural nets.

Looking at the links you provided, they don't appear to train the network dynamics, only the feedout weights. In principle you could differentiate through the ODE solvers used in that software package to train the dynamics, but as far as I know we were the first people to release an open-source set of differentiable solvers using reverse-mode adjoint sensitivities (the most scalable variant of autodiff for training scalar losses): https://github.com/rtqichen/torchdiffeq

In the paper we talk a bit about how you can now more easily train Poisson process likelihoods, which might make training SNNs easier. I'm not sure.
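For intuition, here's a scalar toy version of adjoint sensitivity analysis using scipy (my own sketch, not the torchdiffeq implementation): solve forward, then solve a single augmented ODE backwards to recover the parameter gradient without storing the forward trajectory.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy problem: dz/dt = theta * z, loss L = z(T).
# Analytically, z(T) = z0 * exp(theta*T) and dL/dtheta = z0 * T * exp(theta*T).
theta, z0, T = 0.7, 1.5, 2.0

# Forward solve for z(T).
fwd = solve_ivp(lambda t, z: theta * z, (0.0, T), [z0], rtol=1e-10, atol=1e-12)
zT = fwd.y[0, -1]

# Single backward solve of the augmented state [z, a, dL/dtheta], where
# a(t) = dL/dz(t) is the adjoint, with a(T) = 1:
#   dz/dt       = theta * z       (state re-solved backwards: O(1) memory)
#   da/dt       = -a * df/dz      = -a * theta
#   d(dLdth)/dt = -a * df/dtheta  = -a * z
def aug(t, s):
    z, a, _ = s
    return [theta * z, -a * theta, -a * z]

bwd = solve_ivp(aug, (T, 0.0), [zT, 1.0, 0.0], rtol=1e-10, atol=1e-12)
grad = bwd.y[2, -1]
print(grad, z0 * T * np.exp(theta * T))  # the two agree closely
```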
 > as far as I know we were the first people to release an open-source set of differentiable solvers using reverse-mode adjoint sensitivities

Would the DifferentialEquations.jl work in the Julia community qualify? As I understand it, for the pure Julia solvers you can run them through autodiff to differentiate with respect to the parameters. Chris Rackauckas has talked a lot about how cool it is that he can take his solvers, which were not written to be AD-aware, and use somebody else's AD package that's not specialized for diff eqs, and combined you get differentiable diff eqs.
 Thanks for the detailed remarks, and for pointing out that my statement above about the novelty of our implementation was wrong: CasADi, released in 2013, uses the same algorithm. My apologies. Joel Andersson pointed this out to us and we added a cite to his thesis several months ago.

As for Sundials and FATODE, my understanding was that they used finite differences, forward mode, or differentiating through the solver operations for at least some aspect of their sensitivity analysis.

On another topic, I think you might be misunderstanding how we're running our adjoint sensitivity analysis. You say:

> In the Neural ODE paper, to do a reverse solve of the adjoint ODE it solve the forward ODE from the beginning time point until the point. Clearly, this is really slow because it requires a lot of forward solves over long intervals.

This isn't true: when we do the reverse solve, we get all gradients using a _single_ solve going backwards in time. I'm now realizing that the misunderstanding might be caused by our Fig. 2, which shows that multiple backward solves are necessary when the loss depends on the state at multiple time points.

I agree that our statement about the stability of implicit over explicit methods was overly broad, thanks for pointing that out. Can you suggest a more accurate statement about the advantages of implicit over explicit methods?

I also agree there is a lot of numerical work to be done in this area, and I'm glad that people more knowledgeable than us (such as yourself) are looking at it too!
 > As for Sundials and FatODE, my understanding was that they used finite differences, forward mode, or differentiating the solver operations for at least some aspect of their sensitivity analysis.

No, their sensitivity analysis doesn't use finite differences. They perform the sensitivity analysis as described in their documentation. If methods for the Jacobian calculation are not provided, then they utilize finite differences on the Jacobian calculation of course, using the same routine as for the stiff solver. But with a given Jacobian function, there's no finite differences or forward mode.

> This isn't true - when we do the reverse solve, we get all gradients using a _single_ solve going backwards in time. I'm now realizing that the misunderstanding might be caused by our Fig. 2, which shows that multiple backward solves are necessary when the loss depends on the state at multiple time points.

No, of course it's a single solve going backwards in time. I see what my misread was, but I'm surprised you'd attempt to solve the equation backwards like that, because it's known not to be stable. Without a reversible integrator (implicit Adams is not reversible), it's a well-known result that the method drifts from the true solution doing a backwards integration, so the values of z(t) need to be computed with forward passes in order to be correct. A good test equation for this is probably the Lorenz equation with standard parameters over a time span like [0, 300]. The backwards pass will diverge and will not necessarily be on the same butterfly wing as the forwards pass. The Julia code using CVODE is shown here ( https://gist.github.com/ChrisRackauckas/fef4ae7778320530d44b... ), and you can see that, starting from [1.0, 1.0, 1.0], the result of going backwards is off by [-17.5445, -14.7706, 39.7985].

So there you go: CVODE's Adams method ends up on a different "wing" of the butterfly when integrated backwards, ending up not even close to the actual initial point. (CVODE is the successor to LSODE, both by Alan Hindmarsh, but utilizes constant leading coefficient forms to reduce computations, so it's not exactly the same as the paper, but very close.) Thus, to ensure correctness, existing sensitivity analysis packages only get Jacobians of f using data from forward passes. The Neural ODEs may have had a small enough Lyapunov exponent or a short enough integration that this wasn't an issue, but it is in general something to note. Of course, if you are only doing this on Hamiltonian systems...

> I agree that our statement about the stability of implicit over explicit was overly broad, thanks for pointing that out. Can you suggest a more accurate statement about the advantages of implicit over explicit methods?

Any statement on it is too broad to be useful. Runge-Kutta-Chebyshev methods are explicit methods for stiff systems. Implicit Adams is an implicit method for non-stiff systems. And there are many more examples. The stiffness handling also depends on implementation details: using functional iteration on a BDF method reduces the region of stability, which is why BDF needs to use Newton's method for solving the implicit equation in order to be applicable to stiff ODEs. It's best to just talk about the stability of individual methods and their implementations.
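The same experiment is easy to reproduce with scipy (a sketch of my own, standing in for the linked Julia/CVODE gist): integrate Lorenz forward, then integrate the result backwards, and check how far from the starting point you land.

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, u, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = u
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

u0 = [1.0, 1.0, 1.0]
T = 100.0
fwd = solve_ivp(lorenz, (0.0, T), u0, rtol=1e-8, atol=1e-8)
# Integrate backwards from the forward endpoint; chaos amplifies solver
# error exponentially, so we do not return to u0.
bwd = solve_ivp(lorenz, (T, 0.0), list(fwd.y[:, -1]), rtol=1e-8, atol=1e-8)
drift = np.abs(bwd.y[:, -1] - np.array(u0))
print(drift)  # far from zero: the round trip lands elsewhere on the attractor
```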
 > But with a given Jacobian function there's no finite differences or forward mode.

Right, but instantiating an entire Jacobian is always going to scale at least quadratically in time. The point I was trying to make is that the existing non-adjoint approaches were never going to scale to large systems with millions of parameters. This is the main attraction of reverse mode, and it appeared to me that this was a major obstacle to fitting large models using existing packages (excepting CasADi).

> I'm surprised you'd attempt to solve the equation backwards like that because it's known to not be stable.

> it's a well-known result that the method drifts from the true solution doing a backwards integration, so the values of z(t) need to be computed with forward passes in order to be correct.

I agree that a purely reverse-mode gradient solve will diverge from the forward trajectory to some degree. But to say the gradients are 'correct' or not seems a bit strange to me. Every numerical solve introduces some degree of error, and I think the most useful discussion to have is about the tradeoff between computational cost and numerical error. Re-solving the system forwards is one strategy to reduce error at the cost of computation. Another strategy would be reducing the error tolerance of the reverse solve. There are situations where our strategy might give worse precision wrt the parameter gradients for a given computational budget, but I wouldn't dismiss it out of hand, especially since it's about as computationally cheap as one could hope for: O(1) memory and a similar time cost as the forward solve. Also, it worked for our applications.

> Any statement on it is too broad to be useful.

I appreciate the detailed reply. But is there anything you can say about when to try implicit methods over explicit ones? What was the motivation for developing implicit methods in the first place?
 duvenaud on Dec 17, 2018 Well, you can differentiate through the operations of ODE solvers written in any framework that has autodiff. But the whole idea of adjoint sensitivity analysis is to not differentiate through the operations of the solver, to save memory and to control numerical error more directly.
 duvenaud on Dec 17, 2018 Whoops, I just realized that I failed to qualify that statement properly. CasADi [https://web.casadi.org/], released in 2013, uses adjoint sensitivities to compute gradients through ODE solutions using symbolic (but still reverse-mode) differentiation.I should have said something like: We were the first to implement adjoint sensitivities in a modern, tracing-based autodiff framework suitable for machine learning [HIPS autograd]. But I messed up and should have given credit to Joel Andersson for being first, my apologies.
 Congrats on winning Best Paper ;)Is the Zero-G environment of deep space the ideal environment for continuous differential models?
 Thanks! If I understand your question, you're asking whether the ODE time-series model we proposed works best in idealized, frictionless or gravity-free settings? The answer is that we expect it to work (in principle) even for messy, non-physics-experiment-style situations. This is because we don't model the system dynamics directly, but in a latent space that we learn. So even in a messy situation like health monitoring of humans, as long as the data are being driven mostly by hidden factors that change smoothly through time (like overall health, infections, hormones, or whatever), the neural ODE can abstract away from the messy data. This is all still just in principle: there are a few technical challenges to scaling these models up, but nothing that looks insurmountable right now.
 What results do you have on state of the art metrics?Also, any outline of how one would go about training one of your systems? What kind of dataset does it do best with?
 None in this paper, which is just showing proofs of concept. But our follow-up paper does have SOTA density modeling among the class of efficiently-sample-able generative models. This is because the design of continuous flow models is less constrained than for discrete flows. But other than that, it's still early days for this whole model class.You train these models in the same way you train a neural network, with stochastic gradient descent. For supervised learning, its main benefit is extra flexibility in the speed/precision tradeoff. For time-series problems, we expect this will ultimately allow us to handle data that's collected at irregular intervals - but we haven't yet moved past the prototype stage in that setting.
 How does this relate to previous work that's described NNets in the limit of infinite width as GPs?
 Ooo, great question. In both cases, we define a function using a sum over an infinite number of infinitesimal things.In the GP case, those things are random and independent of each other, so the central limit theorem applies and we get a simple Gaussian.In the ODE case, the infinitesimals are deterministic and depend on the previous ones in the sequence, so the final answer is deterministic and impossible to compute exactly in general.You could also use a GP to model the dynamics of an ODE, and this was done recently: [https://arxiv.org/abs/1803.04303] although the drawback was that they couldn't train the model by backpropagation.
 I'm a little late to this thread, but wanted to thank you for the paper! I found it interesting enough to blog about it (https://rkevingibson.github.io/blog/neural-networks-as-ordin...).I wonder if you've given any thought to generalizing to fractional differential equations? My intuition tells me that the dynamics that you're learning are "local" in the sense that the ODE solvers depend only on the current state (and maybe some recent history), whereas learning the dynamics of a fractional system could give the system a larger "history" in the case of your time-series models.
 Thanks for this awesome piece of research! I'm really looking forward to further developments in the field :) I have two small questions regarding the paper:

1. When comparing to normalizing flows (planar flows) in Section 4.1, how were these fitted in the Maximum Likelihood Training section? If I understand correctly, NFs don't have a closed-form inverse, such that ML training should not be possible.

2. Did you encounter any issues regarding stability during training? Other flow-based approaches such as Glow use certain tricks to ensure that the flow initially reduces to an identity transform, to increase stability and ensure reliable convergence.
 1. Great question! You're correct that standard NF isn't efficiently invertible. CNF is, and we wanted a fair comparison. So for this experiment, we reversed the direction in which NF transforms the data, so that it goes from the data to the latent space. Training this way means that you can't use the resulting model as a generator, but it at least let us compare likelihoods with CNF for this paper.

2. We had to set the error tolerance relatively small during training to keep the gradients stable. I don't think we used any fancy initialization tricks, but to be honest I'd have to ask Ricky Chen and Will Grathwohl, who ran all the FFJORD experiments.
 Hello! Can you comment on how this relates to: https://arxiv.org/pdf/1804.00779.pdf
 Good question. The model developed in that paper (neural autoregressive flows) is a discrete-layered architecture. It's a member of the normalizing flows family of models. Normalizing flows define a parametric density by transforming a sample from a Gaussian through a series of transformations:

    z0 ~ Normal(0, I)
    z1 = f1(z0)
    z2 = f2(z1)
    z3 = f3(z2)
    x = f4(z3)

and they use the change of variables formula to compute p(x):

    log p(x) = log p(z0) - log |det df/dz0|

(where f is the composition of all the layers). In our paper, we propose a continuous-time version of normalizing flows, called Continuous Normalizing Flows. We derived a continuous-time version of the change of variables formula:

    dlog p(z(t))/dt = -trace(df/dz)

Anyways, the confusion is probably that both models are called flows. We wish that normalizing flows had used a different name, so that we could save the word "flow" for continuous-time transformations, but it's too late for that :)

Neural autoregressive flows are powerful density models, but they are computationally costly to sample from. Continuous normalizing flows cost about the same to evaluate densities and to sample. We compared the two in a follow-up paper: https://arxiv.org/abs/1810.01367
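As a sanity check of the continuous formula (a toy sketch of mine, not the paper's code): for linear dynamics f(z) = Az, the trace of the Jacobian is constant, so integrating dlog p/dt = -trace(df/dz) should change log p by exactly -trace(A)*T.

```python
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[0.3, -0.2],
              [0.1,  0.5]])

# Augmented state [z1, z2, logp]: integrate the dynamics together with the
# instantaneous change of variables d(logp)/dt = -trace(df/dz) = -trace(A).
def aug(t, s):
    z = s[:2]
    return np.concatenate([A @ z, [-np.trace(A)]])

z0 = np.array([1.0, -1.0])
T = 2.0
sol = solve_ivp(aug, (0.0, T), np.concatenate([z0, [0.0]]),
                rtol=1e-10, atol=1e-12)
delta_logp = sol.y[2, -1]
print(delta_logp, -np.trace(A) * T)  # both approximately -1.6
```

For a neural-net f, the trace of the Jacobian varies along the trajectory, so it is integrated alongside z(t) in exactly this augmented fashion.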
 Would it be possible to apply the kind of reasoning you used to a discretization of space also?One could imagine starting with a convolutional neural network that is also a residual network (I may be butchering the proper terminology here) and taking the limit of an infinitely fine discretization in space as well as time to arrive at a PDE instead of an ODE.Would the adjoint method approach that you used for backpropagation work in this case?
 Hi,

1. What is a black-box ODE solver? (It sounds like a proprietary tool, but that doesn't make sense, and googling didn't bring up anything useful.)

2. Have you found any areas where there is a significant difference in the quality of the results using this method rather than the normal discrete method?

I am sorry if these questions are a bit below you, or if you already answered them in the paper.
 "Black box" means you have no access to how a system actually does some task; all you know is that, given an input, the system will provide an output. An ODE solver is an ordinary differential equation solver.
 duvenaud on Dec 14, 2018

1. As the other reply said, we just mean an ODE solver whose internal operations are a detail we don't need to worry about.

2. This paper only had proofs of concept and toy demos. But in our follow-up paper, FFJORD [https://arxiv.org/abs/1810.01367], we used continuous normalizing flows to get SOTA efficiently sample-able density models.
 Several recent papers take steps to isolate weights from updates during training to prevent catastrophic forgetting. In the ODE formulation is there a way to do something similar?
 Interesting question. There might be, but that's more a question of fitting parameters. This paper was about a different way to set up a parametric model, that can be trained in the usual way. I don't think the fact that internally it uses ODEs changes its susceptibility to catastrophic forgetting in the online learning setting.
 Was the backwards integration stable? It would be surprising from a mathematical perspective of ODEs.
 Wow, I need to master ODEs/PDEs to keep up with deep learning now! It seems one has to be a master of statistics, operations research, calculus, and algorithms to push it forward!

The comparison to RNNs was impressive! Are there any well-known real-world models for comparison to the state of the art?
 The closest existing thing to a continuous-time RNN is the Neural Hawkes Process [https://arxiv.org/abs/1612.09328]. They use a different model, in which observing the system necessarily changes its state, which is natural in some settings but doesn't fit others. On the other hand, their model scales today and ours doesn't :)
 To use a numerical solver, e.g. Runge-Kutta, you don't need to master statistics or operations research.
 Some reporting in MIT's newsletter on AI: https://mailchi.mp/technologyreview/a-new-type-of-deep-neura...
 I remember something similar for CFD applications, but haven't seen much after that. It would be awesome if we could build a cheap and fast Navier-Stokes solver with neural networks.
 Hi, I am very interested in your models. When you do the backpropagation, it seems that it still needs complex calculations. Although the O(1) memory cost is an important contribution, do you think recording some of the intermediate values would significantly speed up training?
