
Deep Learning in the Real World: Dealing with Non-Differentiable Loss Functions - fruty
https://fruty.io/2019/11/04/deep-learning-in-the-real-world-how-to-deal-with-non-differentiable-loss-functions/
======
scarecrow112
The post lists Auto Differentiation as one of the techniques that can overcome
non-differentiable loss functions. Can someone explain how this is even
possible? After all, automatic differentiation[1] is a way to compute
gradients/derivatives that would otherwise be costly to obtain (e.g. via
symbolic differentiation[2]). The function, or the operations defined in
the function (in the case of source-to-source differentiation[3]), still
needs to be differentiable.
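
For reference, here's a minimal forward-mode AD sketch with dual numbers (my
own toy illustration, not from the article; the Dual class is just for
exposition). It shows that AD only needs a local derivative rule for each
primitive operation it actually executes -- which is exactly why those
primitives must be differentiable:

    class Dual:
        # carries a value and its derivative together
        def __init__(self, val, dot):
            self.val, self.dot = val, dot

        def __add__(self, other):
            return Dual(self.val + other.val, self.dot + other.dot)

        def __mul__(self, other):
            # product rule: (uv)' = u'v + uv'
            return Dual(self.val * other.val,
                        self.dot * other.val + self.val * other.dot)

    def f(x):
        return x * x + x          # f(x) = x^2 + x, so f'(x) = 2x + 1

    print(f(Dual(3.0, 1.0)).dot)  # 7.0 = f'(3)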

[1] -
[https://www.youtube.com/watch?v=z8GyNneq5D4](https://www.youtube.com/watch?v=z8GyNneq5D4)

[2] -
[https://stackoverflow.com/a/45134000](https://stackoverflow.com/a/45134000)

[3] - [https://github.com/google/tangent](https://github.com/google/tangent)

EDIT: 1. Added reference. 2. Formatting

~~~
AbrahamParangi
Non-differentiable here doesn't mean _actually_ non-differentiable in the
mathematical sense; it just means that the function does not expose a
derivative that is accessible to you.

~~~
maffydub
I'm not sure. According to Wikipedia
([https://en.wikipedia.org/wiki/Differentiable_function](https://en.wikipedia.org/wiki/Differentiable_function)),

> a differentiable function of one real variable is a function whose
> derivative exists at each point in its domain

Most programs aren't differentiable functions, because conditionals aren't in
general differentiable. For example, "if x > 0 { 1 } else { -1 }" doesn't have
a derivative at 0 (and so, by the definition above, isn't differentiable at 0).
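
In practice, an autodiff tool traced through such a conditional just
differentiates whichever branch executes (a rough sketch of my own in
PyTorch, using a piecewise function so the two branches have different
derivatives):

    import torch

    def f(x):
        # a Python `if` means autograd only ever records
        # the branch that actually executes
        if x > 0:
            return x * x   # derivative 2x on this branch
        else:
            return -x      # derivative -1 on this branch

    x = torch.tensor(3.0, requires_grad=True)
    f(x).backward()
    print(x.grad)          # tensor(6.) -- the derivative of the taken branch

So you get a valid derivative everywhere except at the switch point itself.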

...or have I missed something?

~~~
MrMoenty
In deep learning, you generally don't require differentiability on the entire
domain, only at most of the points you're likely to encounter. So a finite
number of non-differentiable points is fine: you just trust that you're never
going to hit them by chance (the probability that you do is 0), and if by
some miracle you do, you just use a subgradient.

Case in point: currently the most widely used activation function in neural
nets, the rectified linear unit

ReLU(x) = max(x, 0),

is clearly not differentiable everywhere either.
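
For instance (a minimal sketch of my own; the particular subgradient chosen
is framework-specific), PyTorch simply picks 0 out of ReLU's subdifferential
[0, 1] at the kink:

    import torch

    x = torch.tensor(0.0, requires_grad=True)
    torch.relu(x).backward()   # ReLU is non-differentiable exactly at x = 0
    print(x.grad)              # tensor(0.) -- one valid subgradient at the kink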

~~~
wenc
> by some miracle you do, you just use a subgradient

This is the most succinct comment I have encountered on how people think about
non-differentiability in deep learning.

This helped me reconcile my experiences with the deep learning paradigm. Thank
you.

You see, in the numerical optimization of general mathematical models (e.g.
where the model is a general nonlinear -- often nonconvex -- system of
equations and constraints), you often _do_ hit non-differentiable points by
chance. This is why in mathematical modeling one is taught various techniques
to promote model convergence. For instance, a formulation like x/y = k is
reformulated as x = k * y to avoid division by zero when y passes through
zero during iteration (even if the final value of y is nonzero), and
nonsmooth functions (max(), min(), abs(), for instance) are replaced with
"smooth" approximations. In a general nonlinear/nonconvex model, when you
encounter non-differentiability, you are liable to lose your descent
direction and end up losing your way (sometimes ending up with an infeasible
solution).
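
(To make the smoothing idea concrete, here's a rough sketch of my own; the
eps smoothing parameter is illustrative:)

    import numpy as np

    def smooth_abs(x, eps=1e-6):
        # sqrt(x^2 + eps) is differentiable everywhere and -> |x| as eps -> 0
        return np.sqrt(x * x + eps)

    def smooth_max(x, y, eps=1e-6):
        # uses the identity max(x, y) = (x + y + |x - y|) / 2,
        # with |.| replaced by its smooth surrogate
        return 0.5 * (x + y + smooth_abs(x - y, eps))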

However, it seems to me that the deep learning problem is an unconstrained
optimization problem with chained basis functions (ReLU), so the chances of
this happening are slimmer, and subgradients provide a recovery method so the
algorithm can gracefully continue.

This is often not the experience for general nonlinear models, but I guess
deep learning problems have a special form that lets you get away with it.
This is very interesting.

~~~
fspeech
I don't know why you think the subgradient is that important. It's just a
shorthand for anything reasonable. DNNs are overwhelmingly underdetermined and
have many, many minimizers. It's not so important to find the best one (an
impossible task for SGD) as to find one that is good enough.

~~~
wenc
> I don't know why you think the subgradient is that important.

I underquoted. It's more the approach to handling nondifferentiability in
deep learning problems that interests me, whether it involves subgradients or
some other recovery approach.

These approaches typically do not work well in general nonlinear systems, but
they seem to be OK in deep learning problems. I hadn't read any attempts to
explain this until I read the parent comment.

> It's just a shorthand for anything reasonable. DNNs are overwhelmingly
> underdetermined and have many many minimizers.

This is not true for general nonlinear systems, hence my interest.

------
fyp
The course notes for "Learning Discrete Latent Structure" should be helpful if
you're interested in this topic:
[https://duvenaud.github.io/learn-discrete/](https://duvenaud.github.io/learn-discrete/)

------
BayezLyfe
Related: "In general a machine learning system is built and trained to
optimize a specified target objective: classification accuracy in a spam
filter or tumor diagnostic, efficiency in route planning or Amazon box
packing. Unlike these precise performance metrics, the criteria of safety,
trust, and nondiscrimination often cannot be completely quantified. How then
can models be trained towards these auxiliary objectives?"

From "Interpreting AI Is More Than Black And White":
[https://www.forbes.com/sites/alexanderlavin/2019/06/17/beyond-black-box-ai/](https://www.forbes.com/sites/alexanderlavin/2019/06/17/beyond-black-box-ai/)

------
electricslpnsld
I thought this was going to be about differential inclusions and deep
learning, which sounds pretty rad! Is anyone out there working in this
direction?

