
Decoupled Neural Interfaces Using Synthetic Gradients - yigitdemirag
https://deepmind.com/blog#decoupled-neural-interfaces-using-synthetic-gradients
======
nicklo
Super cool stuff in this paper.

At its heart, this is a new training architecture that allows a network's
weights to be updated faster in a distributed setting.

The speed-up happens like so: instead of waiting for the full error gradient
to propagate through the entire model, nodes can calculate the local gradient
immediately and estimate the rest of it.

The full gradient does eventually get propagated, and it is used to fine-tune
the estimator, which is a mini-neural net in itself.
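
Roughly, in code (a toy numpy sketch - the shapes, learning rates and the
stand-in "true" gradient are all made up for illustration, and the estimator
is linear since apparently even a linear model can do the job):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(32, 64))   # weights of one layer: h = relu(x @ W)
    A = np.zeros((64, 64))                     # the synthetic-gradient (SG) module,
    b = np.zeros(64)                           # here just a linear model of h
    lr, sg_lr = 1e-2, 1e-3

    def sg_predict(h):
        return h @ A + b                       # estimated dLoss/dh, from activations alone

    # 1) Update the layer *immediately*, using the estimated gradient...
    x = rng.normal(size=(16, 32))
    pre = x @ W
    h = np.maximum(pre, 0.0)
    g_hat = sg_predict(h)
    W -= lr * x.T @ (g_hat * (pre > 0))        # backprop the estimate through the relu

    # 2) ...and later, when the true gradient dLoss/dh finally arrives from the
    #    rest of the network, use it as a regression target for the SG module.
    g_true = rng.normal(size=h.shape)          # stand-in for the real back-propagated signal
    err = g_hat - g_true
    A -= sg_lr * h.T @ err / len(h)
    b -= sg_lr * err.mean(axis=0)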

It's amazing that this works, and the implication that full back-prop may not
always be needed shakes up a lot of assumptions about training deep nets. This
paper also continues this year's trend of using neural nets as
estimators/tools to improve the training of other neural nets. (I'm looking at
you, GANs.)

Overall, excited to see where this goes as other researchers explore the
possibilities when you throw the back-prop assumption out.

------
Houshalter
This is a big deal. One of the weaknesses of NNs is their time/memory
complexity: to learn, you have to store every previous state and iterate
backwards through them to the beginning of time. And if you want to learn
"online", you need to do that at every single step. It's unlikely the human
brain works that way, and that has been one of the big arguments for why the
brain can't use something similar to backprop.
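
For concreteness, here is roughly what that storage problem looks like for
plain backprop-through-time (toy numpy; sizes, the tanh RNN and the random
stand-in inputs are just for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    T, D = 1000, 128                          # sequence length, hidden size
    Wh = rng.normal(scale=0.05, size=(D, D))

    # Forward: every hidden state has to be kept around...
    h, states = np.zeros(D), []
    for t in range(T):
        h = np.tanh(h @ Wh + rng.normal(scale=0.5, size=D))  # random stand-in inputs
        states.append(h)                      # O(T * D) memory, growing with the sequence

    # ...because the backward pass walks back through all of them.
    dWh, dh = np.zeros_like(Wh), np.ones(D)   # np.ones stands in for dLoss/dh at the last step
    for t in reversed(range(1, T)):
        dpre = dh * (1.0 - states[t] ** 2)    # tanh'
        dWh += np.outer(states[t - 1], dpre)
        dh = Wh @ dpre

Learning online would mean redoing (or truncating) that whole backward sweep
after every new timestep, which is exactly the cost a synthetic gradient
avoids.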

~~~
visarga
It has been proven that sparse backward connections, even with tied weights,
can substitute for backprop. In biological systems the feedback routes can't
be the same as, or share connections with, the feed-forward routes.

~~~
taliesinb
Are you referring to
[http://arxiv.org/pdf/1411.0247v1.pdf](http://arxiv.org/pdf/1411.0247v1.pdf) ?
That doesn't mention sparse backward connections, but it does show that
feedback weights that aren't computing the actual derivative dloss/din can
still support learning. The network 'learns to learn', so to speak.
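
In case it's useful: the core trick in that paper (feedback alignment) is just
to replace the transposed forward weights in the backward pass with a fixed
random matrix. A rough numpy sketch of one hidden layer (the shapes, target
and hyperparameters are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    W1 = rng.normal(scale=0.1, size=(10, 50))
    W2 = rng.normal(scale=0.1, size=(50, 1))
    B = rng.normal(scale=0.1, size=(1, 50))   # fixed random feedback weights, never trained
    lr = 1e-2

    for _ in range(1000):
        x = rng.normal(size=(32, 10))
        y = x[:, :1]                          # toy regression target
        h = np.maximum(x @ W1, 0.0)
        e = h @ W2 - y                        # output error

        # Ordinary backprop would send the error back through W2.T here;
        # feedback alignment sends it through the fixed random matrix B instead.
        dh = (e @ B) * (h > 0)

        W2 -= lr * h.T @ e / len(x)
        W1 -= lr * x.T @ dh / len(x)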

~~~
hacker42
I also dimly remember that sparsity in the random fixed backward connections
still works. There is actually a figure about that in the paper you've linked
to.

Interestingly, feedback alignment is also patented, but it is unclear whether
it is actually helpful for anything except explaining neuroscience. To my
knowledge there has been no application of it in two years.

------
lqdc13
Does anyone know how they generated the illustration images?

Seems much better than relying on the client to render CPU-intensive JS.

~~~
maxjaderberg
Keynote for the static ones, After Effects for the gifs

------
imh
Do most of the images not load for you guys too? (nevermind, works now!)

~~~
Houshalter
None of the images load for me, even after many refreshes.

------
m1ck
Is this a big deal?

~~~
edmack
This area is a big deal - ML networks need to be much deeper and denser to
provide human-level understanding, and training networks is currently a
considerable bottleneck.

~~~
visarga
Does this method make it easier to spread a neural network over multiple
GPUs/machines? I mean, does it reduce the amount of data being communicated
between compute nodes, or does it just decouple the updates from the need to
wait for the rest of the net to finish?

~~~
nl
_Does this method make it easier to spread a neural network over multiple GPUs
/machines?_

Yes, but this isn't the primary focus of this work.

This is about a method of approximating the error signal (the gradient) that
is propagated back up the neural network.

This is important because using approximate error signals means that earlier
layers can be trained without waiting for error back-propagation from the
later layers.

This asynchronous feature helps on a (computer) network too - there is no need
to wait for back-propagation across the network.

As they point out, the true error does eventually get back-propagated. The
analogy with an eventually consistent database system (and the effect that has
on scalability) is pretty clear.

------
tlarkworthy
wait... the model can be linear. So then it's just the second order terms of
the error gradient? Or the Jacobian or something?

~~~
mdda
Hmmm - I'm also thinking that this is one of those things that probably has a
much better explanation - and the science/maths will (hopefully) backfill why
it works so well.

I half-remember from somewhere that as long as the gradient descent direction
has the correct sign 'in expectation', then the SGD will ~work. So there's a
whole lot of flexibility in there for having a good idea that at least doesn't
fail horribly.
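
A quick toy experiment along those lines (entirely made up, just to show how
forgiving it is): minimise ||w||^2 with a gradient that is mostly noise but
points the right way on average.

    import numpy as np

    rng = np.random.default_rng(3)
    w = rng.normal(size=50)
    print(np.linalg.norm(w))                  # starting distance from the optimum

    for step in range(20000):
        lr = 0.5 / (1.0 + 0.05 * step)        # decaying step size
        true_grad = 2 * w                     # gradient of ||w||^2
        noisy_grad = true_grad + rng.normal(scale=5.0, size=w.shape)
        w -= lr * noisy_grad                  # each individual step is dominated by noise...

    print(np.linalg.norm(w))                  # ...yet w still ends up far closer to 0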

For instance, in other DeepMind work, they do lots of asynchronous weight
updates - and the accuracy decrease from ignoring any kind of 'locking' is
dwarfed by the speed increase of being able to run more stuff in parallel.

Another image I can't shrug off is that of Q-learning in a game, where the
updates implicitly pass back from 'the future' (which also works ~better than
it should). In this case, the linear model would just be an estimator of where
the update values are going to land...
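
(For anyone who hasn't seen it, the Q-learning update behind that image looks
roughly like this - Q here is just a generic dict-of-dicts table, nothing from
the paper:)

    # The target bootstraps off the *estimated* value of the next state,
    # i.e. a prediction about the future is used as if it were the truth,
    # much like updating on a synthetic gradient before the real one arrives.
    def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        target = r + gamma * max(Q[s_next].values())
        Q[s][a] += alpha * (target - Q[s][a])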

~~~
ebalit
HOGWILD [1] lock-free weight updates are a great example of how forgiving SGD
can be.

1 : [https://arxiv.org/abs/1106.5730](https://arxiv.org/abs/1106.5730)
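
For the curious, a minimal sketch of the HOGWILD! idea (Python multiprocessing
with a shared, deliberately unlocked parameter vector; the toy regression
problem and all constants are made up, just to show the shape of the thing):

    import numpy as np
    from multiprocessing import Process
    from multiprocessing.sharedctypes import RawArray

    D = 100

    def worker(shared, seed):
        rng = np.random.default_rng(seed)
        w = np.frombuffer(shared)              # writable view onto the shared weights
        for _ in range(10000):
            x = rng.normal(size=D)
            err = w @ x - x.sum()              # toy target: true weights are all ones
            w -= 0.001 * err * x               # racy in-place SGD update, no lock taken

    if __name__ == "__main__":
        shared = RawArray('d', D)              # shared memory, initialised to zeros
        ps = [Process(target=worker, args=(shared, s)) for s in range(4)]
        for p in ps: p.start()
        for p in ps: p.join()
        print(np.frombuffer(shared)[:5])       # ends up close to 1.0 despite the races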

