
Learning to learn by gradient descent by gradient descent - tonybeltramelli
http://arxiv.org/abs/1606.04474
======
LoSboccacc
story time: I tried something like this, using one nnet's output as the transfer
weights of another neural net. I fed the second net's input into the first and
trained it on the second net's error, but I couldn't train the first network
because I didn't know how to derive the transfer function for the back
propagation algorithm.

so I opted for training the first net with a randomized genetic algorithm and
function descent on it, which in hindsight is dangerously close to how biology
kind of works, but it was exceptionally slow.

so I split up the training batches, went to the uni computer room, and left the
job running on every computer overnight to collect results by morning. in the
morning I'd collect the best genes from each machine, mix them all for another
few rounds of training, select the best in the population, and reseed them on
all the machines at night.

after a week of painstaking organizing, seeding, and collecting of results, the
network never managed to converge on the problem, but boy was it fun trying!
The problem was driving a car around a lap of a track using five "distance
from kerb" sensors as input, angled at 30deg from each other starting from
center.
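The overnight collect-best/mix/reseed loop described above can be sketched
roughly like this (a toy sketch under assumptions: the fitness function,
population size, and mutation scheme are all illustrative stand-ins, not the
original setup):

```python
import random

def fitness(weights):
    # stand-in for "how well the car drives"; the real fitness would come
    # from simulating a lap with the five kerb-distance sensors
    return -sum((w - 0.5) ** 2 for w in weights)

def mutate(weights, sigma=0.1):
    # randomized variation: jitter each gene with Gaussian noise
    return [w + random.gauss(0, sigma) for w in weights]

def evolve(population, generations=50, keep=5):
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:keep]  # "collect the best genes"
        # "reseed": refill the population with mutated copies of the parents
        population = parents + [mutate(random.choice(parents))
                                for _ in range(len(population) - keep)]
    return max(population, key=fitness)

random.seed(0)
pop = [[random.random() for _ in range(5)] for _ in range(20)]
best = evolve(pop)
```

Each "machine" in the story would run `evolve` on its own sub-population, with
the morning mixing step merging the per-machine parents.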

I remember I was inspired by an image recognition company that was using a
network to train networks for motion detection on security cameras, so this
approach wasn't exactly novel even back then (2001ish).

anyway, this got me noticed by a lab assistant, and I got a thesis on how to
optimize neural networks to run in 4.4 fixed-point math for use in extra-low-
power devices. that one worked! too bad nothing ever came of it.
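For what it's worth, a "4.4" fixed-point format (assuming this means 4 integer
and 4 fractional bits in a signed byte; the helper names below are
illustrative, not from the thesis) can be sketched like this:

```python
SCALE = 16  # 2**4 fractional bits, so the step size is 1/16

def to_q44(w):
    """Quantize a float weight to Q4.4, saturating to the signed-byte range."""
    q = round(w * SCALE)
    return max(-128, min(127, q))

def from_q44(q):
    """Recover the approximate float value of a Q4.4 weight."""
    return q / SCALE

def q44_mul(a, b):
    """Multiply two Q4.4 values: the raw product is Q8.8, shift back by 4."""
    return max(-128, min(127, (a * b) >> 4))
```

On a low-power device without an FPU, all inference arithmetic then reduces to
8-bit integer multiplies, shifts, and adds.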

edit: some fixin

~~~
wrong_variable
The fact that nothing came out of it is still a result of your research! Keep
doing the good work :)

~~~
LogicFailsMe
See dp4a on GTX1080. You're a hot commodity if you can build on that. See also
Google's TPU where they managed 8-bit inference but probably not 8-bit
training (though that may be more of a memory or bandwidth limitation).

------
tansey
Just read (skimmed) this paper yesterday actually.

Looks interesting-- but there are no timing graphs! It's kind of a strawman
argument to say "We can't use Newton's method because it's too slow to
calculate the Hessian," and then go and present all your performance graphs in
terms of number of iterations.

~~~
HelloNurse
Usually not providing time measurements means that each iteration is
extravagantly expensive and the authors didn't find test cases with good
actual performance, but in this case there seems to be the major twist of
completely hiding away the optimizer training cost.

To be fair, it should be noted that there are no claims of actual good
performance, only claims that the technology works: "Our experiments have
confirmed that learned neural optimizers compare favorably against state-of-
the-art optimization methods used in deep learning."

------
drmeister
Neither the paper nor the comments have mentioned this:
[https://en.wikipedia.org/wiki/Truncated_Newton_method](https://en.wikipedia.org/wiki/Truncated_Newton_method)

The Truncated Newton method uses an inner solver that runs for only a few
iterations to approximately solve Newton's equation, using Hessian-vector
products rather than forming the full Hessian. I've implemented it and it
works very well. Once it gets close to the solution, convergence is very fast.

I mention it because it sounds similar to what the paper discusses, except
that you use conjugate gradients in the inner solver and Newton's equation in
the outer solver.
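A rough sketch of that inner/outer structure (assuming a quadratic test
problem; all function and parameter names are illustrative): the inner
conjugate-gradient loop approximately solves H p = -g using only
Hessian-vector products, and the outer loop takes the resulting Newton step.

```python
import numpy as np

def truncated_newton(grad, hvp, x0, outer_iters=20, inner_iters=10, tol=1e-8):
    """Truncated Newton: a few CG steps approximately solve H p = -g.

    hvp(x, v) returns the Hessian-vector product H(x) @ v, so the Hessian
    is never formed or inverted explicitly.
    """
    x = x0.astype(float)
    for _ in range(outer_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Inner solver: conjugate gradient on H p = -g, truncated early.
        p = np.zeros_like(x)
        r = -g.copy()          # residual of H p = -g at p = 0
        d = r.copy()
        for _ in range(inner_iters):
            Hd = hvp(x, d)
            dHd = d @ Hd
            if dHd <= 0:       # negative curvature: stop the inner loop
                break
            alpha = (r @ r) / dHd
            p += alpha * d
            r_new = r - alpha * Hd
            if np.linalg.norm(r_new) < tol:
                break
            d = r_new + ((r_new @ r_new) / (r @ r)) * d
            r = r_new
        if not p.any():        # fall back to steepest descent
            p = -g
        x = x + p              # full step (no line search, for brevity)
    return x

# Example: minimize the quadratic 0.5 x^T A x - b^T x, whose minimizer
# satisfies A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
hvp = lambda x, v: A @ v   # the Hessian of a quadratic is constant

x_star = truncated_newton(grad, hvp, np.zeros(2))
```

A real implementation would add a line search or trust region, but the shape
of the method, cheap inner CG, exact-Newton-like outer steps, is all here.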

------
swehner
The title could also have been "Learn to learn by gradient descent by gradient
descent" or "Learning learning by gradient descent by gradient descent".

~~~
hammock
Or "Learn learning by gradient descent by gradient descent." All you're doing
is changing the verb tenses... All you've done is change the verb tenses...
All you did is change the verb tenses

~~~
thenewwazoo
This comment is perfect!

edit: guys it's a pun

~~~
dvanduzer
This comment will have already been perfect.

------
BucketSort
Can all algorithms then be cast as learning problems, with their optimal
versions produced this way? This seems like amazing work, but I don't know
enough to confirm.

~~~
dangerlibrary
Gradient descent has local maxima problems, so it's not always going to
produce an "optimal" result.

~~~
merraksh
Gradient ascent reaches a local maximum eventually, but gradient descent is
guaranteed to find local _minima_ only.

~~~
no_flags
They're the same thing, give or take a minus sign... right?

~~~
merraksh
Indeed

    max {f(x): x in X} = - min {-f(x): x in X}

However, gradient _ascent_ on a convex minimization problem will get stuck in
a local maximum (a convex minimization problem has f(x) convex, hence local
minima = global minima), and vice versa for gradient descent algorithms on
concave maximization problems.
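Concretely, the identity means ascent on f is just descent on -f (a toy
one-dimensional example; all names illustrative):

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Plain gradient descent: step against the gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Maximize f(x) = -(x - 3)^2 by descending on -f(x) = (x - 3)^2.
f_grad = lambda x: -2 * (x - 3)      # gradient of f
neg_f_grad = lambda x: -f_grad(x)    # gradient of -f

x_max = gradient_descent(neg_f_grad, x0=0.0)  # converges to x = 3
```

Flipping the sign of the gradient turns the descent update `x - lr * grad`
into the ascent update `x + lr * grad`, which is the whole "give or take a
minus sign" point.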

------
Loic
Very interesting. I am eager to read the future research on problems beyond
"simple convex problems". That is where it could provide benefits: in industry
we have a lot of them, and a lot of domain knowledge for _getting around_
local minima etc., so a robust ML-based approach could really help there,
instead of our being obliged to accumulate years of trial and error in our
algorithms.

------
josephdviviano
Yo dawg, I heard you like gradient descent, so I put optimizers on your
optimizer so you can learn while you learn.

------
oiuytrewq
No timing results, no comparisons with Nesterov-type methods. To all the
commenters who have said "this looks promising": this doesn't look promising
at all. Why do you think that in all the years of people optimizing things
with gradient descent, no one has tried this? Answer: they have, and it
doesn't work.

~~~
thanatropism
Timing results are a poor substitute for O(N) complexity estimates.

I mean, does a method take longer because it's doing lots of virtual memory
stuff or because it uses a lot of computron?

------
merraksh
_In spite of this, optimization algorithms are still designed by hand._

Well, they are tuned automatically. There are derivative-free optimization
algorithms that were designed specifically to tune other optimization
algorithms on a set of instances.

------
nathan_f77
I was just thinking about this the other day! If machine learning can be
applied to almost any problem, then surely it could be applied recursively to
optimize itself. I'm glad to see that someone worked on this.

------
latenightcoding
The first time I read about gradient descent and optimization algorithms, this
was the first thing that came to my mind. This looks promising.

------
ifdefdebug
I actually do hope that all those learning techniques have limitations and
that intelligence cannot in principle be achieved by machine learning. OK, I
see the problems with that statement (first of all: define intelligence), but
for instance, I hope the so-called "singularity" cannot be reached in
principle, and if somebody could prove that once and for all, please do.

~~~
chriswarbo
There's an interesting question about the return on investment for self-
improvement. The singularity idea is that after a self-modifying algorithm
makes an improvement to itself, it becomes better at improving itself, and
hence the next improvement is found more quickly.

There's also another aspect though: the first improvement will (by definition)
be the easiest to find, and subsequent improvements might get harder and
harder to find. This acts to slow down self-improvement.

I think it's interesting to consider which of these will dominate in a
particular domain, and my own research is related.

------
penetrarthur
When I had just finished Andrew Ng's ML course and had to solve a real-life
problem, this was actually the first thing that came to mind. Too bad I
couldn't formulate the problem then (still can't), and that was basically the
end of my ML career.

------
LukaAl
Next article: "Learning to learn to learn by gradient descent by gradient
descent by gradient descent".

Then "Learning to learn to learn to learn by gradient descent by gradient
descent by gradient descent by gradient descent" and keep going. Turtles all
the way down!

P.S.: I understand the beauty of this article, but I was surprised nobody got
the irony :-)

~~~
deong
I think everyone gets it -- the article is intentionally given the slightly
cheeky title.

------
xianshou
Yo dawg, I heard you liked gradients, so I put some learning in your learning
so you can descend while you descend.

[http://knowyourmeme.com/memes/xzibit-yo-
dawg](http://knowyourmeme.com/memes/xzibit-yo-dawg)

~~~
mempko
You beat me to it! I made one.

[http://imgur.com/HI8d0B1](http://imgur.com/HI8d0B1)

