
Differentiable Control Problems - dunefox
https://fluxml.ai/2019/03/05/dp-vs-rl.html
======
alg_fun
> In contrast, running the neural network takes 5μs (twenty thousand times
> faster) with only a small loss in accuracy. This “approximate function
> inversion via gradients” trick is a very general one that can not only be
> used with dynamical systems, but also lies behind the fast style transfer
> algorithm.

Very interesting! Follow-up question -- how would you choose the network
architecture?

~~~
Libbum
Since the network only acts on a small portion of the entire system, we can
constrain it such that remarkably simple NNs work just fine.

`FastChain(FastDense(3,32,tanh), FastDense(32,32,tanh), FastDense(32,2))`
(from [0]) would take your three inputs, run them through two 32-unit hidden
layers, and give you two trained output parameters (a short training sketch
follows the links below).

The example in [1] is one of the more complex solutions I've seen so far. To
move to that complexity from a simpler chain, we first make sure our solution
is not stuck in a local minimum [2], then increase the parameter count if the
NN fails to converge.

[0]
[https://diffeqflux.sciml.ai/dev/FastChain/](https://diffeqflux.sciml.ai/dev/FastChain/)
[1]
[https://github.com/ChrisRackauckas/universal_differential_eq...](https://github.com/ChrisRackauckas/universal_differential_equations/blob/a68e7566105f0e08abdad80aee1fe8c4a4b5d51c/SEIR_exposure/seir_exposure.jl#L52)
[2]
[https://diffeqflux.sciml.ai/dev/examples/local_minima/](https://diffeqflux.sciml.ai/dev/examples/local_minima/)
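
A minimal training sketch under those constraints. The inputs and targets
here are my own toy stand-ins, not from the article; a real problem would
compute the loss by simulating the system with the NN's two outputs:

```julia
using DiffEqFlux, Flux

# Train the chain above to map a 3-vector input to a 2-vector target.
nn = FastChain(FastDense(3, 32, tanh), FastDense(32, 32, tanh), FastDense(32, 2))
p0 = initial_params(nn)                       # flat vector of trainable weights

input  = [0.5, 1.0, -0.3]                     # hypothetical basis inputs
target = [0.2, 0.7]                           # hypothetical desired outputs
loss(p) = sum(abs2, nn(input, p) .- target)

res = DiffEqFlux.sciml_train(loss, p0, ADAM(0.05), maxiters = 200)
# res.minimizer holds the trained weights
```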

------
atmosfir
Isn't this just optimization of control parameters?

How does this compare with existing control engineering methods like PID
tuning, MPC, and optimal control?

~~~
pidtuner
I think you are right: in the case of the trebuchet, it just computes a
black-box approximation of the inverse of the system. The difference is that
an analytic inversion would solve the problem for all wind and target
conditions, while the NN solution will only work for the ranges of values
covered by the data used to train it.

In the case of the inverted pendulum, the disadvantage of using an NN is again
the black-box nature of the control algorithm. As a control engineer this
gives me the chills, because a black-box algorithm tells you nothing about
the robustness of the closed-loop system. With model-based control, at least
we have strong mathematical tools to guarantee that the closed loop will be
robust enough to handle variations outside the data we used for training (our
model). With black-box algorithms like NNs you have no such guarantees. I
would certainly not get into a plane controlled by an NN; look what happened
with the 737 MAX when software engineers thought they could solve
dynamical-systems problems.

~~~
Libbum
To respond to both the parent question and this comment: indeed, this is
black-box optimal control in essence.

However, this method is just one small aspect of the SciML [0] ecosystem now.
The article is a little outdated in that sense.

Once you've obtained your trained NN control term, you can apply Sparse
Identification of Nonlinear Dynamics (SINDy) to it to recover equations of
motion governing it [1] (a sketch follows the links below).

The real promise of these methods is to use the universal-approximation power
of NNs to get around the 'curse of dimensionality' and uncover presently
unknown representations of motion within any system. Take a look at [2] for a
more detailed description.

[0]
[https://sciml.ai/](https://sciml.ai/)
[1]
[https://datadriven.sciml.ai/dev/sparse_identification/sindy/](https://datadriven.sciml.ai/dev/sparse_identification/sindy/)
[2]
[https://arxiv.org/abs/2001.04385](https://arxiv.org/abs/2001.04385)
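
For a rough idea of what that SINDy step looks like, here's a hedged sketch
against the DataDrivenDiffEq API of the time (see [1]). The snapshot matrices
are faked with a known linear system purely so the example runs standalone;
in practice X would hold state snapshots and DX the trained NN term's outputs
at those states:

```julia
using DataDrivenDiffEq, ModelingToolkit

A  = [-0.1 2.0; -2.0 -0.1]                     # hypothetical "true" dynamics
X  = rand(2, 200) .* 4 .- 2                    # state snapshots in [-2, 2]²
DX = A * X                                     # stand-in for the NN's outputs

@variables u[1:2]
basis = Basis(polynomial_basis(u, 3), u)       # candidate monomials up to degree 3
Ψ = SINDy(X, DX, basis, STRRidge(0.1), maxiter = 1000, normalize = true)
Ψ                                              # recovered sparse symbolic equations
```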

~~~
pidtuner
"The real promise of these methods is to use the universal approximator power
of NNs...", still if one is to use a grey-box non-linear model dx/dt = F(x, u,
t), why use NNs to characterize F? I would be more comfortable using a
polynomial to characterize non-linearity than a "deep" black-box.

Polynomials are much easier to "train" because it is just one linear
regression with no iteration. It has also been hinted that NN are in essence
polynomial regressions [0]. Furthermore, most activation functions are base on
e^x where the actual implementation of e^x in a computer is again a
polynomial!

[0] [https://arxiv.org/abs/1806.06850](https://arxiv.org/abs/1806.06850)
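
A minimal sketch of that single-solve fit, with made-up data:

```julia
# Fit a cubic polynomial to noisy samples with one least-squares solve,
# no iterative training loop.
x = range(-1, 1, length = 100)
y = sin.(3 .* x) .+ 0.05 .* randn(100)        # hypothetical noisy samples

A = [x .^ 0 x .^ 1 x .^ 2 x .^ 3]             # Vandermonde design matrix
c = A \ y                                     # closed-form least-squares coefficients
yhat = A * c                                  # fitted values
```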

~~~
unishark
Gradient descent is already about as easy a training method as can be. With
just a little freshman calculus, programmers can do the "state of the art"
optimization of modern times (a minimal sketch below). It's also scalable: if
your polynomial regression grows too large with model complexity (for
comparison, typical deep networks can have millions of parameters), you can
no longer invert your matrix and will probably end up using a similar
iterative method anyway.
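
A toy version of that freshman-calculus loop (my own illustration):

```julia
# Gradient descent on a quadratic loss: just the update rule w ← w - η∇L(w).
w_star = [1.0, -2.0]                 # known minimizer of the toy loss
L(w)  = sum(abs2, w .- w_star)       # loss L(w) = ||w - w*||²
∇L(w) = 2 .* (w .- w_star)           # gradient, computed by hand

w = zeros(2)
for _ in 1:200
    w .-= 0.1 .* ∇L(w)               # learning rate η = 0.1
end
w                                    # ≈ [1.0, -2.0]
```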

I would have thought a computer uses tables to compute e^x. There are also
piecewise-linear activation functions whose gradients are trivially easy to
compute.

The whole "universal approximation" perspective is pretty vague to begin with.
I'd say generally people don't understand why NN's work as well as they do.
Previously theorists expected they would need a lot more training data to
work, given their complexity. So it's driven to a large degree by empirical
success. I am certainly really interested to see people accomplishing the same
things with less sophisticated methods, since there is no doubt it has been
overused/hyped in some areas just to make the papers and proposals sexier.

~~~
srean
> The whole "universal approximation" perspective is pretty vague to begin
> with

This, many times over. The claim gets trotted out frequently to showcase the
superiority of NNs.

At best it is a red herring; at worst it is dishonest. The problem is that
NNs aren't the only universal approximators. There is a whole slew of them:
nearest-neighbor approximators, polynomials, rational splines, kernel methods
… Furthermore, the universal approximation property only holds under certain
conditions.

Finally, the ability to represent a function arbitrarily well (the
approximation property) does not mean that one will be able to find that
representation from data easily (the learning property). Empirical evidence
suggests that, among the class of universal approximators we know, NNs seem
easy to train effectively. Why this is so is not well understood.

~~~
pidtuner
Wavelets, sums of exponentials, Fourier bases... I just mentioned polynomials
because they are the easiest. But people jump on the NN bandwagon to get
attention. The truth is that it's just another tool, and a good engineer has
to choose the best tool from the toolbox, not just reach for the hammer every
time.

~~~
ChrisRackauckas
For reference, the DiffEqFlux library has a bunch of classical basis layers
[1] and ways to take tensor products of them [2] for exactly this reason. The
real answer as to when to use a neural network is quite complicated [3], but
in summary, the results all point to one fact: to approximate an R^n -> R^m
function well, a neural network needs only polynomially many parameters (as
proven in a few cases, such as that linked paper's "any case where Monte
Carlo algorithms are not exponential in dimension").

Tensor products of classical basis functions, by contrast, have to cover
every combination of terms (sin(ix)*sin(jy), ...), so with p parameters in
each dimension they naturally grow like p^n. This exponential parameter
growth is the curse of dimensionality, and the polynomial growth is the
formal way of describing how neural networks overcome it.

So what is useful can depend on a number of factors (another relevant
property is the isotropy of the function you're trying to approximate), but
this asymptotic property is what makes neural networks a good tool in the
high-dimensional settings where they are commonly used. That also makes them
quite good for things like feedback controllers of larger ODE systems. But
yes, in lower-dimensional cases a Fourier basis and the like are good choices
(a back-of-the-envelope comparison follows the links below).

[1]
[https://diffeqflux.sciml.ai/dev/layers/BasisLayers/](https://diffeqflux.sciml.ai/dev/layers/BasisLayers/)
[2]
[https://diffeqflux.sciml.ai/dev/layers/TensorLayer/](https://diffeqflux.sciml.ai/dev/layers/TensorLayer/)
[3]
[https://arxiv.org/abs/1908.10828](https://arxiv.org/abs/1908.10828)
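
To make the parameter counts concrete, a back-of-the-envelope sketch (the
32-unit MLP is my arbitrary choice, not a figure from [3]):

```julia
# Full tensor-product basis for an R^n -> R function with p terms per
# dimension: every cross term sin(i*x₁)⋯sin(j*xₙ) needs its own coefficient.
tensor_params(p, n) = p^n
tensor_params(10, 6)                          # 1_000_000 coefficients

# A modest MLP for the same R⁶ -> R map: weights + biases per layer,
# two 32-unit hidden layers.
mlp_params(n, h) = (n * h + h) + (h * h + h) + (h * 1 + 1)
mlp_params(6, 32)                             # 1_313 parameters
```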

~~~
pidtuner
Fitting to sum of N exponentials is also a linear problem with no iterations
[https://math.stackexchange.com/questions/1428566/fit-sum-
of-...](https://math.stackexchange.com/questions/1428566/fit-sum-of-
exponentials/3808325#3808325)
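
A rough sketch of the linked idea, as a Prony-style fit: uniformly sampled
y_k = Σ cᵢ rᵢ^k obeys a linear recurrence, so the recurrence coefficients and
the amplitudes each come from a plain least-squares solve (the data, spacing,
and N here are my own toy choices):

```julia
using LinearAlgebra

Δt = 0.1
t  = 0:Δt:4
y  = 3 .* exp.(-0.5 .* t) .+ exp.(-2.0 .* t)   # hypothetical clean samples
N, m = 2, length(y)                            # number of exponentials assumed known

# 1) least squares for the recurrence y[k+N] = Σⱼ aⱼ y[k+N-j]
H = [y[k + N - j] for k in 1:m-N, j in 1:N]
a = H \ y[N+1:m]

# 2) recurrence roots = eigenvalues of the companion matrix
C = [transpose(a); I(N - 1) zeros(N - 1)]
r = real.(eigvals(C))                          # real for this clean data

# 3) least squares for the amplitudes cᵢ
V = [r[j]^(k - 1) for k in 1:m, j in 1:N]
c = V \ y

λ = log.(r) ./ Δt                              # recovered rates ≈ -0.5, -2.0
```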

