
Deep Networks with Stochastic Depth - nicklo
http://arxiv.org/abs/1603.09382v1
======
nl
I suspect this was posted because of Delip Rao's write-up[1] (which I suggest
might be a better link).

It's a nice - if somewhat controversial - summary.

40% speedup on DNN training with state-of-the-art results.

[1] [http://deliprao.com/archives/134](http://deliprao.com/archives/134)

------
sdenton4
It's kind of bonkers that this works. It suggests that the whole belief that
layers are learning different representations is completely wrong: if layer
three is expecting a certain kind of intermediate representation from layer
two, and is then given the raw input, one would expect layer three to choke.

Instead, the depth seems to be giving something like a progressive unwinding
of the feature space.

It would be interesting to compare the trained networks to networks trained in
the usual way, to see if they're coming up with similar coefficients in spite
of the different training methods, or if this is producing something
completely different.

~~~
albertzeyer
Note that this was done for 100-1000 layer depth, so each individual layer
only slightly increases the "high-levelness" of the features. In the same
sense, that is why Deep Residual Networks work - initially, all layers are
close to identity.
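
To make the "close to identity" point concrete, here is a rough numpy sketch
(illustrative only, not from the paper; the block name and the small-weight
initialization are just assumptions) of a residual block whose output stays
near its input when the residual branch is small:

    import numpy as np

    def residual_block(x, W, b):
        # residual branch: f(x) = ReLU(x W + b)
        f = np.maximum(0.0, x @ W + b)
        # output = input + residual; with small W and zero b this is near-identity
        return x + f

    x = np.random.randn(4, 16)
    W = 0.01 * np.random.randn(16, 16)  # small initial weights (assumption)
    b = np.zeros(16)
    y = residual_block(x, W, b)
    print(np.abs(y - x).max())  # small value -> the block barely changes x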

------
radarsat1
Having not read the paper, something I find unclear: is it only the feedback
path that is skipped, or is the feedforward path also skipped? The abstract
mentions replacing the layer with an identity function. I'm not sure how this
would work: wouldn't it change the result (i.e. the encoding used by the
following layer would be corrupted) if you just multiply the inputs by 1 and
add them?

Otherwise, how precisely do you "skip" a layer without corrupting the training
of lower layers?

Edit: the answer is in the definition of "skip layers", introduced in a
previous paper:
[http://arxiv.org/abs/1512.03385](http://arxiv.org/abs/1512.03385) which
introduces identity functions into the layer equation. I guess I have more
reading to do on this topic.

~~~
albertzeyer
The full layer is skipped, i.e. replaced by the identity. Something like
g(x) = switch(prob, x, f(x)).

Deep Residual Networks are similar but different. There, you add the
identity. Something like g(x) = x + f(x).
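
Roughly, the two formulas above could be sketched like this in numpy (a toy
illustration of the comment's equations, not the paper's exact implementation,
which applies the coin flip to residual blocks):

    import numpy as np

    def f(x, W):
        # the layer's own transformation
        return np.maximum(0.0, x @ W)

    def stochastic_skip_layer(x, W, p_skip=0.5):
        # g(x) = switch(prob, x, f(x)): with some probability the layer is
        # replaced by the identity for this pass, otherwise applied as usual
        if np.random.rand() < p_skip:
            return x
        return f(x, W)

    def residual_layer(x, W):
        # g(x) = x + f(x): the identity is always added to the layer's output
        return x + f(x, W)

    x = np.random.randn(4, 16)
    W = 0.1 * np.random.randn(16, 16)
    print(stochastic_skip_layer(x, W).shape, residual_layer(x, W).shape)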

~~~
radarsat1
Yes, but my question was more: when the layer is "skipped", what happens to
the input for the next layer? Clearly, it is designed such that the identity
function still provides somehow useful information to the next layer (i.e. it
doesn't significantly transform its domain and range); I was wondering how
this could work. Intuitively, I would think that the next layer is being
trained on a specific transformation performed by the skipped layer, so I
still don't fully understand how replacing a whole layer with the identity
function doesn't completely mess up the training of all subsequent layers.
But maybe the secret is that each skip only lasts for a small number of
iterations, and perhaps this short-lived deviation actually helps inject some
minima-escaping trajectory. (I have read that injecting random noise can have
similar effects. Is this just a different kind of random noise?)

------
romaniv
I'm reading Delip's follow-up post[1] and it reminds me how much of ANN stuff
is still pretty much alchemy.

[1] [http://deliprao.com/archives/137](http://deliprao.com/archives/137)

------
karterk
This is literally one of the most exciting papers I have read recently, and it
will have quite some impact on deep learning models. The major drawback of
deep architectures today is training time, and any improvement to that will
have a drastic effect on my productivity.

Right now I basically run N architectures on N GPUs at the same time to speed
things up. And that's a luxury.

