
Empiricism and the limits of gradient descent - togelius
http://togelius.blogspot.com/2018/05/empiricism-and-limits-of-gradient.html
======
simonster
There are a couple of factual errors here. First, the difference between
backprop and evolution is smaller than the author indicates. The error signal
used in modern backprop training is stochastic because it is computed on a
minibatch (which is why it's called stochastic gradient descent). This
stochasticity seems important to achieving good results. And the most popular
evolutionary algorithm in the deep learning world is Evolution Strategies,
which effectively approximates a gradient. Ordinary genetic algorithms are not
gradient-based and have recently shown promise in limited domains, but can't
compete with gradient-based algorithms for supervised learning.
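Concretely, ES estimates a gradient from fitness evaluations alone. A minimal numpy sketch of that estimator (the function name, constants, and test function are illustrative, not from any library):

```python
import numpy as np

def es_gradient(f, theta, sigma=0.1, n=4000, rng=None):
    """Estimate grad f(theta) from the fitness of random perturbations."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal((n, theta.size))
    # Antithetic (mirrored) pairs reduce the variance of the estimate
    fitness = np.array([f(theta + sigma * e) - f(theta - sigma * e) for e in eps])
    return (eps.T @ fitness) / (2 * n * sigma)

# Sanity check on a smooth fitness: the gradient of f(x) = -||x||^2 is -2x
f = lambda x: -np.sum(x ** 2)
theta = np.array([1.0, -2.0])
g = es_gradient(f, theta)
print(g)  # close to [-2., 4.]
```

No backprop through f is ever needed, which is why ES works on non-differentiable fitness functions while still "effectively approximating a gradient".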

The key claim in the article, that gradient descent could not discover the
equations of physics from data, seems like a statement about neural networks,
not about gradient descent. Given sufficient training data, a neural network can
probably learn to model physics. I sympathize with the concern that it's very
difficult to translate a neural network's knowledge into human concepts, but I
see no reason to believe that optimizing the same system with an evolutionary
algorithm would make this problem any easier. You could e.g. try to do program
induction (which was supposed to be the future of AI many decades ago) instead
of modeling the data directly, but choosing to perform program induction does
not preclude the use of a neural network. Neural networks trained by gradient
descent can generate ASTs (e.g. [http://nlp.cs.berkeley.edu/pubs/Rabinovich-
Stern-Klein_2017_...](http://nlp.cs.berkeley.edu/pubs/Rabinovich-Stern-
Klein_2017_AbstractSyntaxNetworks_paper.pdf)).

[Edited to remove reference to universal approximation; as comments point out,
even if a neural network can approximate a function, it isn't guaranteed to be
able to learn it. But I am reasonably confident that a neural network can
learn Newton's second law.]

~~~
roywiggins
> Given sufficient data, according to the Universal Approximation Theorem, a
> neural network can learn to model physics.

It just says there are weights to approximate any function, not that you can
actually learn the weights. Neural networks trivially can't learn how to
approximate noncomputable functions to any accuracy, and there might be a lot
of other functions that neural networks are terrible at actually learning.

~~~
simonster
It's a fair point that the Universal Approximation Theorem does not guarantee
that the weights can be learned. OTOH, the physical laws that the article
states a neural network cannot discover are computable functions.

~~~
AstralStorm
You need a stronger bound than this. The functions have to be approximable
given a specific network size, architecture, and set of activation functions.
Calculating that (or good statistics that say so approximately) is a hard
problem... It is solvable for a bunch of activations in a layered perceptron,
but try extending that to something more complex.

------
tlb
I'm optimistic about the potential for evolutionary algorithms. I've used both
EAs and gradient descent in developing robot controllers.

But the argument here about why gradient descent won't be able to learn
certain things is weak. Thought experiments are not a reliable guide to what
GD can or can't do.

It's fair enough to say that F=ma and E=mc² aren't in the data. Indeed, it
took thousands of years of human thought to arrive at them. So the argument
"it's not clear how an algorithm could extract F=ma from the data" isn't a
strong criticism, because humans also can't do it by induction.

The long process culminating in F=ma involved a lot of abstract symbolic
thought. Whether human-level abstract symbolic thought can be learned through
GD (probably in combination with some sort of Monte Carlo tree search) is an
open question. It can only be answered by trying to build things and seeing if
they work.

If you want to make an argument about the limits of GD and induction, it'd be
better to compare to a problem humans can solve reliably, rather than an
insight that one genius had after decades of thought while standing on the
shoulders of other geniuses.

~~~
perl4ever
I don't understand how E=mc^2 can _not_ be in the data. If it's a universal
law, isn't it in more (all) data than any pattern that _isn't_ universal?

~~~
maxander
The article is making a simpler point than that. If I show you the table:

    
    
      A	B	C
      1	4	3
      20	35	15
      8	15	7
    

And so on for some arbitrary number of rows, you can look at the table all you
want but you will not _perceive_ "A+C=B". It's just not written there. To get
A+C=B you have to generate something else in addition to the table, namely a
hypothesis, but this is a _creative_ act, not an empirical one.

~~~
tlb
If you connect A and B as the input to a linear neural net, and train against
C, it'll very quickly arrive at weights of [-1, +1] and be able to correctly
predict C given A and B. Whether or not it represents it in notation humans
are familiar with, it has learned it for the practical purpose of being able
to compute the function.
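As a concrete sketch (plain numpy, with hypothetical made-up data following the table's rule): train a two-weight linear model on noiseless rows where B = A + C, and the weights converge to [-1, +1], i.e. C = B - A:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(0, 50, 200)
C = rng.uniform(0, 50, 200)
B = A + C                            # the hidden rule from the table
X = np.stack([A, B], axis=1)         # inputs: A and B
w = np.zeros(2)                      # linear model: C_hat = w[0]*A + w[1]*B

lr = 1e-4
for _ in range(5000):
    err = X @ w - C                  # prediction error on the whole table
    w -= lr * (X.T @ err) / len(C)   # gradient step on mean squared error

print(w)  # approaches [-1., 1.], i.e. C = B - A
```

The model never writes down "A+C=B" symbolically, but it has captured the relationship well enough to predict with it.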

~~~
goatlover
But how would a neural network know to connect data for mass with data for the
measured speed of light? Why would a neural network be looking for an equation
for energy conversion in the first place? If you just provide tons of raw data
from instruments, what does that mean? What do you do with it?

Sure, a human can clean the data and put it into a format that gives
meaningful results. But if we're just talking about an AI learning from raw
data with no supervision, where does it even start?

~~~
mannykannot
As someone who finds this interesting but does not know enough to take a
position, I think a bigger question would be how does it come up with the
abstract concept of energy?

I am aware that AlphaGo Zero came up with various strategic abstractions of
the game that are recognized by competent players, and some novel ones, but I
do not know where this program and its self-play training stands in the
dichotomy of this debate.

~~~
goatlover
Go has a clear objective that you can train for. What would be the objective
in coming up with a physics law from raw data? What would it even be training
for? That sounds like asking whether DL could create a new board game from a
bunch of data on human behavior.

> think a bigger question would be how does it come up with the abstract
> concept of energy?

I don't see how it could, and that's kind of how Kant argued against
empiricism. You can't derive a conceptual understanding of the world from raw
data. There's nothing in raw data to structure or make sense of it without
some way to interpret the data. Even calling it data is an interpretive act
(as opposed to noise).

------
sepranu
The points the author makes about gradient descent are accurate, in a sense.
However, they oversimplify the technique (as it is applied today) and the
context in which it is used. It seems as if the author, like many others,
understands the subject's basic mechanisms, but not the context in which
experts apply them.

The example the author cites regarding evo algorithms learning physical laws
is laughable - "It's just not in the data - it has to be invented" applies
equally to both the backprop and the evolutionary learning algorithms.

"In this case, the representation (mathematical expressions represented as
trees) is distinctly non-differentiable, so could not even in principle be
learned through gradient descent."

This is incorrect, almost like saying NLP data is not differentiable. For
instance, set this representation up as the output of a network (or, if you
wanted to be fancier, the central component of an autoencoder), and see how
well it predicts/correlates with the experimental data. This is the error,
which is back-propagated through the network's nodes.
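As a toy illustration of that point (not the autoencoder setup itself, and all names are made up): relax the discrete choice among candidate expression terms into softmax weights over a fixed basis. The selection then IS differentiable, and the prediction error against the data back-propagates through it:

```python
import numpy as np

# Candidate basis terms for the "expression"; the discrete choice of term
# is relaxed into softmax weights, which are differentiable.
basis = [lambda x: np.ones_like(x), lambda x: x,
         lambda x: x ** 2, lambda x: np.sin(x)]

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 256)
y = x ** 2                          # "experimental data" from one basis term

Phi = np.stack([b(x) for b in basis], axis=1)   # (256, 4) design matrix
logits = np.zeros(4)

lr = 0.5
for _ in range(2000):
    e = np.exp(logits - logits.max())
    w = e / e.sum()                              # softmax over terms
    r = Phi @ w - y                              # residual vs. the data
    grad_w = 2 * (Phi.T @ r) / len(x)            # d(MSE)/d(weights)
    # Chain rule through the softmax: dw_i/dlogit_j = w_i*(delta_ij - w_j)
    logits -= lr * (w * (grad_w - w @ grad_w))

print(np.argmax(logits))  # the x**2 term wins
```

Picking which symbolic term explains the data is exactly the kind of "non-differentiable" choice the quoted passage worries about, yet gradient descent handles the relaxed version fine.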

FWIW, many theoreticians believe that the unreasonable effectiveness of neural
networks and especially transfer learning _is a result_ of their well-
suitedness to encode laws of physics and Euclidean geometry. The author's
final points about a nine-year-old survey may be out of date w.r.t.
contemporary neural networks, which often have spookily good local minima and
do not behave in the way intuition about gradient descent might suggest.

------
hmartiros
In my experience if you have even a little smoothness in your problem's cost
manifold, taking advantage of gradients is invaluable to sample efficiency.
Many losses which don't seem differentiable can be reformulated as such - you
can look around and see a wide array of algorithms being put into end-to-end
learned frameworks. If the dimensionality is small, second-order methods (or
approximations thereof) can do dramatically better yet. However, I'm also a
fan of evolutionary algorithms. I see no reason why evolutionary rules can't
be defined with awareness of gradient signals.

~~~
vinn124
> Many losses which don't seem differentiable can be reformulated as such...

agreed, especially with policy gradients.

> If the dimensionality is small, second-order methods (or approximations
> thereof) can do dramatically better yet.

i have not seen second order derivatives in practice, presumably due to memory
limitations. can you point me to examples?

~~~
hmartiros
They aren't common in deep learning, but if you look to estimation problems
like odometry, optimal control, and calibration, the typical approach is to
build a least squares estimator that optimizes with a gauss-newton
approximation to the Hessian, or other quasi-newton methods. Gradient descent
comparatively exhibits very slow convergence in these cases, especially when
there is a large condition number. In the case of an actual quadratic loss
function, it can (by definition) be solved in one iteration if you have the
Hessian. However, getting it efficiently within most learning frameworks is
difficult, as they primarily only compute VJPs or HVPs.
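A minimal numpy illustration of the one-iteration claim, on a deliberately ill-conditioned quadratic (the constants are chosen for the example, not taken from any real estimation problem):

```python
import numpy as np

# Ill-conditioned quadratic loss f(x) = 0.5 x'Hx - b'x, condition number 1000
H = np.diag([1000.0, 1.0])
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(H, b)                     # the true minimizer

# Newton (= Gauss-Newton here): exact in a single step for a quadratic
x_newton = np.zeros(2)
x_newton -= np.linalg.solve(H, H @ x_newton - b)   # step = H^{-1} * gradient

# Gradient descent: step size capped by the largest curvature, so progress
# along the flat direction is painfully slow
x_gd = np.zeros(2)
lr = 1.0 / 1000                                    # ~1/lambda_max for stability
for _ in range(1000):
    x_gd -= lr * (H @ x_gd - b)

print(np.linalg.norm(x_newton - x_star))           # ~0: solved in one step
print(np.linalg.norm(x_gd - x_star))               # still ~0.37 after 1000 steps
```

The gap grows with the condition number, which is exactly why least-squares estimators reach for Gauss-Newton or quasi-Newton methods.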

------
pmalynin
Once again, there is a general conflation of evolutionary algorithms and
learning without a differentiable error function. I've presented this argument
before in a discussion with antirez here on HN [1], but the crux of it is that
Reinforcement Learning as it stands is optimization over (maybe-)non-
differentiable errors. For AlphaGo there is no gradient, per se, that says you
will optimize your wins if you go this way (instead, it is optimized by
training towards the win-rate "score", which could be an error score) -- look
at REINFORCE for other variations. Evolutionary Learning and Reinforcement
Learning are two sides of the same coin.

[1]
[https://news.ycombinator.com/item?id=16652138](https://news.ycombinator.com/item?id=16652138)
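A minimal sketch of that idea: REINFORCE needs only the gradient of the log-probability of the sampled action, never a gradient through the reward itself. (The Bernoulli policy and reward function here are illustrative, not from any real task.)

```python
import numpy as np

# REINFORCE: optimize E[R(a)] where a is a discrete sample and R is a black
# box known only by evaluation -- no gradient through R is needed.
rng = np.random.default_rng(0)

def reward(a):
    return 1.0 if a == 1 else 0.0      # non-differentiable in a

theta = 0.0                            # logit of a Bernoulli policy
lr = 0.1
for _ in range(2000):
    p = 1 / (1 + np.exp(-theta))
    a = int(rng.random() < p)          # sample an action from the policy
    r = reward(a)
    grad_logp = a - p                  # d/dtheta log pi(a|theta), Bernoulli
    theta += lr * r * grad_logp        # score-function (REINFORCE) update

print(1 / (1 + np.exp(-theta)))        # probability of the rewarded action
```

The policy mass migrates to the rewarded action even though `reward` has no useful derivative, which is the sense in which RL optimizes non-differentiable errors.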

------
creo
(This is of course my opinion, not scientific fact) The biggest argument for
EAs is that a proper implementation does not get stuck in a local minimum. GD
and SGD have that tendency, if not in one spot then probably looping between
many. The problem with EAs, on the other hand, is that their mathematical
model is based on probability, which is quite tricky to work with even for an
above-average programmer.

------
ssivark
I don't see why stochastic gradient descent has to be worse than evolutionary
algorithms. In
high-dimensional spaces, there is a large space of possible "mutations". SGD
just biases certain mutations based on the gradient from a minibatch. That
sounds a lot like evaluating the fitness of a mutated population on a
minibatch and culling members with low fitness. In fact, there are many
demonstrations that the stochastic nature of SGD (coming from the use of
minibatches) is crucial for effective learning.

------
ozy
Notice how black swans are a labeling/categorization problem, whereas F=ma is
a causal relationship. The first has no induction problem: there is no swan-
ness except in our minds. The second has no induction problem either, because
you model causal relationships and predict from them; better predictions mean
less wrong. The problem of induction is only one of "ultimate reality".

------
andbberger
Meh, weird article. None of the nice modern results (imagenet family, etc)
were achieved just through gradient descent - this article, like most, seems
to be missing the forest for the trees with deep learning.

It's not about the network architecture, or gradient descent on their own -
it's the interaction, the dynamical system over weight space that training is.

Behind every great modern deep learning result? An enormous hyperparameter
search and lots of elbow grease to carefully tune that dynamical system
juuuuust right so the weight particle ends up in just the right place when
training finishes. Smells like evolution to me. DeepMind even formalized the
evolutionary process a deep learning researcher runs manually when fine-tuning
a model into population based training [https://deepmind.com/blog/population-
based-training-neural-n...](https://deepmind.com/blog/population-based-
training-neural-networks/)

~~~
PeterisP
"An enormous hyperparameter search and lots of elbow grease to carefully tune
that dynamical system" i.e. the classic GDGS (gradient descent by grad
student) approach where you have a grad student train a system, decide in
which direction the parameters should be updated (i.e. look at the gradient),
tweak the system and repeat until convergence.

