
Stacked Approximated Regression Machine: A Simple Deep Learning Approach - kartikkumar
https://arxiv.org/abs/1608.04062
======
imurray
I'm going to check back later to see if anyone manages to reproduce it.
Perhaps by the time it's presented at NIPS.

A twitter conversation reflecting some scepticism, but agreeing it would be
interesting if it all checks out:
[https://twitter.com/fchollet/status/771862837819867136](https://twitter.com/fchollet/status/771862837819867136)

~~~
cs702
I'm less skeptical than fchollet (creator of Keras, for those here who don't
know), but agree that we need to wait until the usual suspects at Google,
Facebook, Toronto, Montreal, Stanford, etc. have replicated this. In all
likelihood the team will release code soon, either before or after NIPS, so we
will all be able to check things out for ourselves.

~~~
cs702
One of the authors, Zhangyang Wang, just wrote this on his personal page: "We
have discussed and decided to work on a software package release, perhaps
accompanying it with a more detailed technical report in the future. Once the
software package is ready, we will update everybody."

[http://www.atlaswang.com/](http://www.atlaswang.com/)

~~~
joe_the_user
Paper withdrawn

[https://arxiv.org/abs/1608.04062](https://arxiv.org/abs/1608.04062)

It's kind of an odd thing. I (random non-academic amateur) actually spent a
bunch of time trying to parse the paper, which was kind of a combination of
interesting ideas and incomprehensible ambiguities.

One real academic researcher also put some time into it. The good part of the
paper is explained here. My guess is that the problem is in going from ARM to SARM.

[http://gabgoh.github.io/SARG/](http://gabgoh.github.io/SARG/)

While I'm sure most people involved think of the experience as a wash, I feel
like I learned a bunch about deep learning in the process.

PS, also sad that the author did this.

~~~
gabrielgoh
Hi, I'm the author of the blog post. I added a blurb to the beginning of the
blog post explaining all the drama, and precisely which claim was withdrawn.

The problem is not in the ARM->SARM approximation, but in the bit on unsupervised
pretraining. The paper could stand on its own without that section, but
without that result it would have been a significantly more mediocre NIPS
submission. Hope this clarifies things.

~~~
joe_the_user
First, thanks for the excellent blog post; it gave me a better idea of what was
happening.

As far as the ARM transformation goes, maybe that's just something I don't
fully get, but I can see how one goes from sparse coding to repeated ARM-type
transformations, and how this repeated application approximates the solution
of a sparse coding problem. And it is suggestive that these applications look
like the layers of a neural net. (Rough sketch of what I mean below.)
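
For concreteness, here is a minimal sketch, assuming the ARM layer is
essentially a few unrolled ISTA iterations of a sparse coding solver (the
names, the random dictionary, and the step size below are my own stand-ins,
not the paper's):

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal step for the L1 penalty."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def arm_layer(y, D, lam=0.1, n_steps=3):
    """A few unrolled ISTA iterations for min_x 0.5*||y - D x||^2 + lam*||x||_1.

    Each iteration is a fixed linear map followed by an elementwise
    nonlinearity, which is why the unrolled solver looks like a small
    feed-forward network. (Hypothetical sketch, not the paper's exact layer.)
    """
    L = np.linalg.norm(D, ord=2) ** 2        # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_steps):
        x = soft_threshold(x - D.T @ (D @ x - y) / L, lam / L)
    return x

# Tiny usage example with a random stand-in dictionary
rng = np.random.default_rng(0)
D = rng.normal(size=(16, 32))                # 16-dim inputs, 32 dictionary atoms
code = arm_layer(rng.normal(size=16), D)
```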

But when you switch to stacking, what are you doing? Solving one sparse
coding problem and then another? What analogy is there to say this works ... or
that it would work better than a single round of sparse coding? At that point,
is it just "try it and see"?

One of the impressions I got from scanning the literature is that deep nets
are generally treacherous beasts - just getting a locally optimal 1st layer
may not be desirable off the bat. People have settled on backpropagation for
very subtle reasons. See "Overfitting in Neural Nets: Backpropagation,
Conjugate Gradient, and Early Stopping", Caruana, Lawrence, et al., where
backpropagation finds better solutions than the "more powerful" conjugate
gradient method.

~~~
gabrielgoh
You are right: you are using the output of the previous sparse solution as
input to the next one, i.e. stacking sparse coders. Schematically, it looks
something like the sketch below.
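
A minimal sketch of that stacking, assuming each layer is a generic ISTA-style
sparse coder (the dictionaries here are random stand-ins, not trained, and
none of the names come from the paper):

```python
import numpy as np

def sparse_code(y, D, lam=0.1, n_steps=50):
    """Generic ISTA solver for min_x 0.5*||y - D x||^2 + lam*||x||_1."""
    L = np.linalg.norm(D, ord=2) ** 2
    x = np.zeros(D.shape[1])
    for _ in range(n_steps):
        v = x - D.T @ (D @ x - y) / L
        x = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)
    return x

rng = np.random.default_rng(0)
D1 = rng.normal(size=(16, 32))      # first-layer dictionary (stand-in)
D2 = rng.normal(size=(32, 64))      # second-layer dictionary (stand-in)

y = rng.normal(size=16)
code1 = sparse_code(y, D1)          # first sparse coder
code2 = sparse_code(code1, D2)      # its output is the next coder's input
```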

Your second question, of why this is a good idea, is the million-dollar
question. It's pretty much "let's try it and see", with some heuristic
reasoning thrown into the mix (it mirrors the brain, it abstracts information,
etc.).

btw, I don't think people use early stopping anymore. It's been replaced by
more powerful forms of regularization, such as dropout. The deep learning
world is getting more tame, and that makes me happy.

------
nl
Update on this: it has been withdrawn:
[https://arxiv.org/abs/1608.04062](https://arxiv.org/abs/1608.04062)

------
cs702
As far as I understand this, these guys claim they can train convolutional and
many other types of deep neural nets faster by pretraining each layer with a
new unsupervised technique via which the layer sort of learns to compress its
inputs (a local optimization problem), and then they fine-tune the whole
network end-to-end with supervised SGD and backpropagation as usual. They have
not released code, so no one else has replicated this yet -- as far as I know.
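
To make the recipe concrete, here's a toy sketch of the general "pretrain each
layer locally, then fine-tune end-to-end" scheme. It is emphatically not the
paper's algorithm (their code was never released); it just substitutes PCA as
the per-layer unsupervised step and trains only a logistic readout in the
fine-tuning phase:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))                 # toy inputs
y = (X[:, 0] > 0).astype(float)                # toy binary labels

def pretrain_layer(H, n_out):
    """Local unsupervised step: PCA on the layer's own inputs (a stand-in for
    the paper's per-layer 'learn to compress your inputs' problem)."""
    _, _, Vt = np.linalg.svd(H - H.mean(axis=0), full_matrices=False)
    return Vt[:n_out].T                        # weights = top principal directions

# 1) Greedy layer-wise pretraining: each layer sees only the previous layer's output
sizes, weights, H = [32, 16], [], X
for n_out in sizes:
    W = pretrain_layer(H, n_out)
    weights.append(W)
    H = np.maximum(H @ W, 0.0)                 # ReLU activations feed the next layer

# 2) Supervised fine-tuning (full backprop through all layers would go here;
#    this toy version only trains a logistic readout on the top activations)
w_out = np.zeros(sizes[-1])
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(H @ w_out)))     # predicted probabilities
    w_out -= 0.1 * H.T @ (p - y) / len(y)      # gradient step on the cross-entropy
```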

If the claim holds, the implication is that layers can _quickly_ learn much of
what they need to learn _locally_ , that is, without requiring backpropagation
of gradients from potentially very distant layers. I can't help but wonder if
this opens the door for more efficient asynchronous/parallel/distributed
training of layers, potentially leading to models that update themselves
continuously (i.e., "online" instead of in a batch process).

I wouldn't be surprised if the claim holds. There is mounting evidence that
standard end-to-end backpropagation is a rather inefficient learning
mechanism. For example, we now know that deep neural nets can be trained with
_approximate gradients_ obtained by shifting bits to get the sign and order of
magnitude of the gradient roughly right.[1] In some cases it's even possible
to restrict learning to use binary weights.[2] More recently, we have learned
that it's possible to use "helper" linear models during training _to predict
what the gradients will be_ for each layer, in-between true-gradient updates,
allowing layers to update their parameters locally during backpropagation.[3]
Finally, don't forget that in the late 2000s, AI researchers were doing a lot
of interesting work with unsupervised layer-wise training (e.g., DBNs composed
of RBMs, stacked autoencoders).[4]
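
To make the "helper model predicts the gradient" idea from [3] a bit more
concrete, here is a toy stand-in of my own (nothing here is DeepMind's actual
implementation): a linear model M guesses dLoss/dh for a layer's output h, the
layer updates immediately from that guess, and M itself is nudged toward the
true gradient whenever one eventually arrives.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 8, 4
W = rng.normal(scale=0.1, size=(d_hid, d_in))  # the layer being trained
M = np.zeros((d_hid, d_hid))                   # linear gradient predictor ("helper")

x = rng.normal(size=d_in)
pre = W @ x
h = np.maximum(pre, 0.0)                       # ReLU layer output

# 1) Immediate local update using the *predicted* gradient of the loss w.r.t. h
g_hat = M @ h
W -= 0.01 * np.outer(g_hat * (pre > 0), x)     # chain rule through the ReLU

# 2) Later, when the true gradient arrives from the layers above, fit the
#    predictor toward it (one squared-error gradient step on M)
g_true = rng.normal(size=d_hid)                # placeholder for a real backprop signal
M -= 0.01 * np.outer(M @ h - g_true, h)
```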

This is a fascinating area of research with potentially huge payoffs. For
example, it would be really neat if we find there's a "general" algorithm via
which layers can learn locally from inputs continuously ("online"), allowing
us to combine layers into deep neural nets for specific tasks as needed.

[1] [https://arxiv.org/abs/1510.03009](https://arxiv.org/abs/1510.03009)

[2] [https://arxiv.org/abs/1602.02830](https://arxiv.org/abs/1602.02830)

[3] [https://deepmind.com/blog#decoupled-neural-interfaces-using-synthetic-gradients](https://deepmind.com/blog#decoupled-neural-interfaces-using-synthetic-gradients)

[4]
[https://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf](https://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf)

EDITS: Expanded the original comment so it conveys better what I actually
meant to write, while keeping language as casual and informal as possible.
Also, I softened the tone of my more speculative observations.

~~~
joe_the_user
The paper itself is fairly sparse, but it references a number of ~2013
approaches to more quickly learning neural-net-related systems (PCANet,
ScatNet, etc.).

The paper presents these and other approaches as instances of a classical,
general form, _regularized regression_, but with the "stacked" property: each
layer iterates however many times, then the next layer changes the parameters
(or features) and iterates further.

From my barely-informed viewpoint, this sounds like a fascinating way to unify
the earlier efforts and one which could yield a variety of other approaches -
even if the particular variation they use doesn't work out. But I assume lots
of more-informed people are going to be looking at this.

~~~
joe_the_user
Note: even though the paper talks about Approximated Regression Machine
layers and uses equations that look sort of like those of regularized
regression, the layers aren't doing regularized regression but sparse
dictionary coding, a quite different approach.
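
To make that distinction concrete (my own notation, not the paper's):
regularized regression fits one global weight vector against a fixed design
matrix, while sparse dictionary coding infers a fresh sparse code for every
input against a learned dictionary:

$$\min_w \; \|y - Xw\|_2^2 + \lambda\,\Omega(w) \qquad \text{(regularized regression: fixed design } X\text{, global weights } w\text{)}$$

$$\min_x \; \tfrac{1}{2}\|y - Dx\|_2^2 + \lambda\,\|x\|_1 \qquad \text{(sparse coding: learned dictionary } D\text{, one code } x\text{ per input } y\text{)}$$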

------
billconan
Is there a book you can recommend on the fundamentals (like sparse coding)
needed to understand papers like this?

~~~
jlg23
[http://www.scholarpedia.org/article/Sparse_coding](http://www.scholarpedia.org/article/Sparse_coding)

~~~
nojvek
This was a great read. Thanks.

