
Neural Networks as Ordinary Differential Equations - agronaut
https://rkevingibson.github.io/blog/neural-networks-as-ordinary-differential-equations/
======
rmbeard
Early work on this approach is contained in Jean-Pierre Aubin's book on neural
networks, written back in '96, where he used differential equations/inclusions
and control-theoretic approaches to study neural networks. See here:
[https://www.amazon.com/Neural-Networks-Qualitative-Physics-V...](https://www.amazon.com/Neural-Networks-Qualitative-Physics-Viability/dp/0521445329/)

~~~
pilooch
Wow, Jean-Pierre Aubin mentioned in the first comment on HN, that makes me so
happy. I had him as a teacher 20 years ago, an incredible teacher, always
grounding the maths in their history and social production, with descriptions
of the lives of Banach and others, along with their motivations.

Thanks to Aubin, I did my PhD in AI a long time ago and am still working in
the field now!

------
itin
Is it just me or is the math here incorrect?

[https://imgur.com/a/QXUZerV](https://imgur.com/a/QXUZerV)

~~~
itin
specifically, I don't see where the last step G(h_t) = etc. follows from

~~~
rkevingibson
Yup, that's a mistake. Shouldn't have the +h term at the end.

------
joe_the_user
So the time term in these "neural" differential equations is the depth of the
network. I see the connection between resnet and the differential equation
solver. Basically, you're saying your final result is the end-point of a curve
governed by a differential equation whose initial conditions are the input
set. You train the equation's parameters in a fashion akin to how you train a
standard neural net.
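
For concreteness, a minimal sketch of that correspondence (the residual
function, its parameters, and the dimensions here are made up purely for
illustration):

    import numpy as np

    W = 0.1 * np.random.randn(4, 4)    # made-up parameters of one residual block

    def f(h):                          # residual branch of a ResNet-style block
        return np.tanh(W @ h)

    h = np.random.randn(4)             # network input = initial condition

    # Discrete ResNet view: h_{n+1} = h_n + f(h_n), one update per layer.
    for _ in range(10):
        h = h + f(h)

    # ODE view: each update above is one explicit Euler step (step size 1) of
    # dh/dt = f(h); the output is the endpoint of that curve, and training
    # adjusts the parameters inside f.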

The thing is, with ordinary neural nets you have two arguments for how the
thing works.

One view is that you are taking inputs, projecting them into a high-dimensional
feature space, and then drawing separating hyperplanes between points in two
or more categories. If the hyperplanes are properly shaped, the results will
tend to generalize.

Another view is you're doing something like logistic regression. You have a
function and you're trying to reduce the distance between it and the "true
classification function" that takes points to their proper category. You use
gradient descent to minimize the distance between your function and this "real
function" that you get through data.

So anyway, I am not sure how these intuitions apply to the differential
equation version. I'm not sure what other intuitions there are for why these
equations should work.

~~~
tnecniv
> Another view is you're doing something like logistic regression. You have a
> function and you're trying to reduce the distance between it and the "true
> classification function" that takes points to their proper category. You use
> gradient descent to minimize the distance between your function and this
> "real function" that you get through data.

This viewpoint doesn't change when you switch to ODEs. The evolution of the
system T seconds into the future is a (piecewise) continuous function of the
initial state.

~~~
joe_the_user
Me: Another view is you're doing something like logistic regression....

tnecniv: _This viewpoint doesn't change when you switch to ODEs. The
evolution of the system T seconds into the future is a (piecewise) continuous
function of the initial state._

Sure, it doesn't. But it moves to ground where it seems hard to tell what's
happening. Approximating one function by a sum of set functions with variable
parameters is a standard arrow in the mathematical "quiver" - Taylor series,
linear regression, etc. A neural net is naturally a complex ensemble of such
approximations. You gain tremendous flexibility of representation but lose any
idea of the shape of the curve that's involved.

With the ODE theory, you have a continuous function that is used instead of
the parameters. It is hard for me to conceptualize exactly what's happening
beyond that.

I mean, neural nets seem to be able to approximate just about any complex
classification function already.

So what I believe we need in recasting/representing/etc neural nets is:

1) Ways to speed training. Ways to update a trained system for small-ish
changes in the system logic without needing a full retraining.

2) Ways to explain how a neural net reaches its decisions. Ways to tell a
neural net to not use a given reasoning branch if it's using "garbage
reasoning".

3) Ways to merge several neural nets that operate on related but not identical
problems.

Etc.

I'm not sure if this model brings us closer to this.

------
Muximize
Isn't this also what the Julia team is trying to do?
[https://julialang.org/blog/2018/12/ml-language-compiler](https://julialang.org/blog/2018/12/ml-language-compiler)

~~~
ChrisRackauckas
We are building tooling which will allow you to put together differential
equations and neural networks in any way you please. This is just one way of
putting the two together. The applications we were looking at are different,
but they utilize similar tooling. We have forward-mode, reverse-mode, forward
sensitivity analysis, and adjoint sensitivity analysis (the method mentioned
here, which is actually from at least the 90's) implemented and tested with
the DifferentialEquations.jl integrators. We had a recent paper exploring the
timing differences between the different sensitivity analysis (gradient
calculation) methods:

[https://arxiv.org/abs/1812.01892](https://arxiv.org/abs/1812.01892)

The neural ODE falls into the case where the number of parameters is large,
and so the traditional adjoint methods seem to do best there when optimized.
So in some sense, we aren't "trying" to do it, but have had library
implementations for these gradient calculations for a few years now. The
documentation is here:
[http://docs.juliadiffeq.org/latest/analysis/sensitivity.html](http://docs.juliadiffeq.org/latest/analysis/sensitivity.html)

But we are heading in new directions with this as well. The traditional
adjoints seem to do best here because of inefficiencies with tracing-based
reverse-mode autodiffs like the kind found in Flux.jl or Autograd. Basically,
those kinds of tape-based autodiffs rely on each operation being expensive,
which isn't true for the internals of an ODE solver (but tends to be true for
applications like neural nets where large matrix multiplications dominate).
This means that the overhead of building the tape is really noticeable, and thus
it doesn't do well in this application. Additionally, the tape is dependent on
the input because the trace of the operations is dependent on what branches
are taken. This means that, even if the ODE solver is fully optimized and
every piece is JIT compiled, the backprop tape itself cannot be compiled and
saved. Again, this isn't an issue for large matmuls, but this is another
factor that comes into play when the operations are not sufficiently costly.

However, the blog post you link to discusses source-to-source AD as an
alternative. Source-to-source builds the backprop code directly from the full
source, doing both branches at the same time. This means you can AD the source
and then compile it. This methodology solves the issues we found with tracing-
based autodifferentiation for this application, so I'll be happy when we get
to testing it. Zygote.jl isn't quite ready for source-to-source AD on the full
DifferentialEquations.jl code, but it's getting close.
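
For readers who haven't seen it, here is a minimal sketch of the continuous
adjoint sensitivity method mentioned above, applied to a scalar linear ODE so
the answer can be checked by hand (Python/SciPy, purely illustrative; this is
not how DifferentialEquations.jl implements it):

    import numpy as np
    from scipy.integrate import solve_ivp

    # Toy problem: dz/dt = theta * z on [0, T], loss L = z(T).
    # Analytically z(T) = z0*exp(theta*T) and dL/dtheta = T*z0*exp(theta*T).
    theta, z0, T = 0.5, 1.0, 2.0

    # Forward solve, keeping a dense interpolant of the trajectory z(t).
    fwd = solve_ivp(lambda t, z: theta * z, (0.0, T), [z0],
                    dense_output=True, rtol=1e-8, atol=1e-10)

    # Backward solve of the augmented system from t = T down to t = 0:
    #   a(t) = dL/dz(t):     da/dt = -a * df/dz     = -a * theta
    #   g(t) -> dL/dtheta:   dg/dt = -a * df/dtheta = -a * z(t)
    def augmented(t, s):
        a, _ = s
        z = fwd.sol(t)[0]          # reuse the stored forward trajectory
        return [-a * theta, -a * z]

    bwd = solve_ivp(augmented, (T, 0.0), [1.0, 0.0], rtol=1e-8, atol=1e-10)
    dL_dtheta = bwd.y[1, -1]       # ~ 5.44, matching T*z0*exp(theta*T)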

------
leggomylibro
Very interesting - does this mean that you could also rephrase arbitrary
differential equations as neural networks, then apply parallel computing to
solve them?

Sorry if that question doesn't make sense; I'm not really clear on how
ordinary differential equations are usually solved these days.

~~~
Gladdyu
You could express an arbitrary (well-behaved) ODE as a neural network by
discretizing it in timesteps, but you would not extract any additional
parallelism.

In normal ODEs you can already compute each dX_i(t)/dt in parallel, but you
will still have to evaluate layer/time t completely before evaluating
layer/time t+Δt, as it feeds forward in both the NN and ODE case.
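
A small sketch of that point, assuming forward Euler on a made-up system
dX/dt = f(X): all components of the derivative are computed in one vectorized
call (so they could be evaluated in parallel), but the loop over time steps is
inherently sequential, just like depth in a feed-forward net.

    import numpy as np

    A = 0.01 * np.random.randn(1000, 1000)

    def f(X):                       # every dX_i/dt at once ("in parallel")
        return np.tanh(A @ X)

    X, dt = np.random.randn(1000), 0.01
    for step in range(100):         # t -> t + dt must be evaluated in order
        X = X + dt * f(X)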

~~~
tanderson92
Yes and no: there is parallel time integration for ODEs which parallelizes
using multigrid techniques.

~~~
ChrisRackauckas
Yes, but tests tend to show you need like >64 cores for things like parareal
methods to do better than standard serial methods. So they exist, but aren't
quite practical yet. Maybe a GPU-based one in specific cases can be
interesting, but no one has been able to demonstrate an efficient enough code
yet. It's definitely an interesting topic.

~~~
tanderson92
Thanks for the perspective on the actual performance of these algorithms. So
yeah, as an algorithmic fact it isn't necessary to proceed sequentially,
though perhaps not performant to proceed in parallel either.

Anyway, I heard AMD has 64 cores on a single chip so it may not be too long..

~~~
TheGrassyKnoll
Neuromorphic Supercomputer With 1 Million Cores Mimics the Human Brain:

https://www.tomshardware.com/news/human-brain-neuromorphic-supercomputer-manchester,38027.html

------
gsyhbhsybs
Is it too much to hope that eventually it will be established that all the
other neural nets are just extremely hard non-linear partial differential
equations?

~~~
gnulinux
The language of PDEs is Turing complete, and so is the language of neural
networks (with a non-linearity and arbitrarily many layers), so this is
trivially true in general (they have equal computational expressiveness). But
"compiling" neural networks to PDEs might be a hard problem that we can never
solve in general.

~~~
turingspiritfly
[https://link.springer.com/chapter/10.1007%2F11494645_21](https://link.springer.com/chapter/10.1007%2F11494645_21)
seemingly states that ODEs can simulate Turing machines. Is the simulation
enough proof of their Turing completeness, and in what sense can PDEs be
called a language?

~~~
gnulinux
Being able to simulate a Turing machine is the definition of being Turing
complete. So that proof is sufficient.

In mathematics (and computer science) a language is just a set of strings, and
a machine is just a language attached to a semantics. I.e., whatever symbols
you need to represent PDEs, inductively generated by the axioms and inference
rules of mathematics and theoretical physics (as for the lambda calculus:
lambda, x0, x1, ..., '(', ')', plus reduction and construction rules).

~~~
SubiculumCode
Simulation has a number of meanings, depending on the field. I tend to loosely
think of a simulation as a model with an acceptable level of error, but I am
not a mathematician.

~~~
gnulinux
This is true, thanks. As the other answer said, in this field we say, for
machines M and M2, that "M2 simulates M" if M2 computes the same function as
M, i.e. given any input i, M(i) halts iff M2(i) halts and then M(i) == M2(i);
otherwise they both loop. This can also be called "compiling" informally:
given any Turing machine M, you can "compile" it to C (i.e. construct a C
program that simulates M), which means C is Turing-complete.

Note that this is a very handwavy definition. The "types" of the input, the
output, and what it means to "compute" will depend on your language, semantics
and encoding. (E.g. for the lambda calculus, "compute" means applying
reduction rules until the machine halts, if ever.)

~~~
SubiculumCode
I appreciate the explanation. Clearly in this context simulation does not
equal approximation!!

------
Gladdyu
How would this handle systems in which the dimensionality of the hidden layers
is not equal to the input dimension, as is often the case? For solving the ODE
you'd need to get at least 1 sample point, which might restrict the amount of
information you can capture in such a network.

~~~
mlthoughts2018
This paper is explicit that it only applies to sequences of ResNet layers,
which happen to have the property that the input and output dimensionalities
are equal.

It’s a property of the functional form of stacked ResNet layers that allows
the ODE layer to be used instead.

You are right that many networks will require at least some (and usually many)
layers that change dimensionality. So ODE layers are not going to wholesale
replace anything, apart from possibly submodules of a bigger network where the
submodule is made up just of sequences of ResNet layers.

ODE layers will also have new applications, such as for irregularly spaced
sequential inputs or outputs.
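
A rough sketch of that submodule idea, with dimension-changing layers on the
outside and a fixed-width ODE block in the middle (the shapes, weights, and
dynamics below are invented for illustration, and the integration is done with
SciPy rather than any particular neural-ODE library):

    import numpy as np
    from scipy.integrate import solve_ivp

    W_in  = 0.01 * np.random.randn(64, 784)   # 784 -> 64: changes dimension
    A     = 0.01 * np.random.randn(64, 64)    # dynamics of the ODE block
    W_out = 0.01 * np.random.randn(10, 64)    # 64 -> 10: changes dimension

    def forward(x):
        h0 = np.tanh(W_in @ x)                                # encoder
        # ODE block: the state keeps the same 64-dim shape from t=0 to t=1,
        # which is exactly the input-dim == output-dim constraint.
        sol = solve_ivp(lambda t, h: np.tanh(A @ h), (0.0, 1.0), h0)
        return W_out @ sol.y[:, -1]                           # decoder / logits

    logits = forward(np.random.randn(784))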

------
foxes
So this has become a problem in optimal control, and we have two boundary
conditions (input/output data)?

Specifically, find a control function (so the weights of the activation
function) which maps the system to your desired output?

Then there are also much better ways to solve two-point boundary problems than
just shooting forward. [0] is a good place to start (although it's old).

[0] Numerical Methods for Two-Point Boundary-Value Problems. Herbert Keller.
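
For reference, a toy sketch of the simple shooting approach mentioned above,
on a made-up two-point BVP y'' = -y, y(0) = 0, y(1) = 1: guess the unknown
initial slope, integrate forward, and root-find on the mismatch at the far
boundary.

    import numpy as np
    from scipy.integrate import solve_ivp
    from scipy.optimize import brentq

    def rhs(t, u):
        y, yp = u
        return [yp, -y]             # y'' = -y as a first-order system

    def mismatch(s):                # s is the guessed slope y'(0)
        sol = solve_ivp(rhs, (0.0, 1.0), [0.0, s])
        return sol.y[0, -1] - 1.0   # want y(1) = 1

    s_star = brentq(mismatch, 0.0, 5.0)   # bracket chosen by hand
    # Exact solution is y(t) = sin(t)/sin(1), so y'(0) = 1/sin(1) ~ 1.188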

------
duckworthd
A couple of questions:

1) In the motivation for the adjoint method, is u given along with A and C? I
don't see how to compute u^T B without u.

2) How do a(t), da(t)/dt, and dG(..)/dh relate to A, B, C, u, and v from the
motivation?

~~~
aqsalose
I was wondering the same thing. Also, the linked adjoint method tutorial
discusses a more general class of problems, and it is not immediately clear to
me how to map them to this linear-equation version: here B is the unknown we
want to evaluate and the constraint is set on B; in the PDF tutorial one wants
to compute grad_p f, which sounds similar to the formulation in the blog post,
because we usually want the gradient (B?) in order to evaluate d_p f(x,p) at
some point p (u?), but the constraints are given by g(x,p) = 0.

------
tabtab
There may be different ways to more or less model the same thing. The trade-
offs may be machine efficiency, human grokkability, provability (mathematical
analyzability), and accuracy (sometimes approximations are good enough and
gain on the other features).

Perhaps related: Factor tables for AI:
[https://github.com/RowColz/AI](https://github.com/RowColz/AI)

------
8bitsrule
Can someone tell me what I need to add to Firefox so that this:

\( \mathbf{h}_{n+1} = F(\mathbf{h}_{n}) \)

is displayed as the author obviously intended?

~~~
jimrandomh
You may have Javascript disabled or be blocking the script that does the
rendering, KaTeX. It works for me in both Firefox and Chrome.

~~~
8bitsrule
Yes, it was a blocked 'cloud' script. Thanks to you both.

------
andrewtbham
repeat?

[https://news.ycombinator.com/item?id=18676986](https://news.ycombinator.com/item?id=18676986)

~~~
traek
The first sentence links to that paper. This is presented as a high-level
introduction to that and related papers/ideas.

~~~
glitchc
Feels like aggressive advertising. Not the best way to promote your work. Let
it stand for itself; if others see value in it, they will cite it.

~~~
3JPLW
Are you aware the author of the blog post is doing exactly what you say? Kevin
is not an author of the original paper. He's seeing value in it, citing it,
and synthesizing it.

