
Learning Depth-Three Neural Networks in Polynomial Time - abecedarius
https://arxiv.org/abs/1709.06010
======
activatedgeek
I will try to provide an intuition for the uninitiated to get started with
understanding the results in this paper.

In Supervised Machine Learning, our goal is to predict well on an underlying
"unknown" distribution, given a set of data instances drawn from it. Let us
say I gave you a set of images of cats and dogs to be classified. As a machine
learning engineer, you would probably build me a Binary Classifier on top of
an SVM and call it done. But how would you justify to me how well your
algorithm generalizes? Or, in other words, how would you give me a guarantee
that your model will work "well" on images that you haven't seen yet? What
would a model working "well" even mean? If you recall SVM basics, one tries to
build a max-margin hyperplane in some high-dimensional space. To be technical,
we will call the set of all such hyperplanes our "concept class", and this is
what you are trying to learn (hold that thought). When you finally give me
your SVM model with some parameters, we will call it your hypothesis, i.e.
your best guess at the unknown target concept. Now, coming back to the point
of "well"-ness of the SVM: what guarantees can you provide me when I give you
images outside your training set but from the same underlying distribution (to
be fair to you)? We call the error you make on your training set the
"empirical risk" and the error you make outside your training set, on unseen
data, the "risk". Our aim is to minimize the "risk", or in other words to
"generalize" well.

In Learning Theory, we are concerned with defining the "well"-ness of a
learning algorithm. More generally, we are interested in answering the
question "what is learnable and how well?". The fundamental answer to that
question is a theory known as the PAC (Probably Approximately Correct) Theory
[1].

Paraphrasing in plain English (I'd recommend taking a look at the formal
definition right after this), it states that a concept class is (efficiently)
learnable if we can provide a polynomial-time algorithm such that, for samples
drawn from any possible unknown underlying distribution, we can guarantee that
the generalized error/risk stays below a chosen threshold with a chosen level
of confidence. This also has to happen while the number of samples the
algorithm uses is bounded by a polynomial (in the error and confidence
parameters and the problem size).
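
In symbols, the standard PAC definition (following Valiant [1]) reads: a
concept class $C$ is efficiently PAC-learnable if there is an algorithm that,
for every target concept $c \in C$, every distribution $D$, and every
$\epsilon, \delta \in (0, 1)$, when given $m = \mathrm{poly}(1/\epsilon,
1/\delta, n)$ i.i.d. examples labeled by $c$, runs in polynomial time and
outputs a hypothesis $h$ such that

    $\Pr[\mathrm{err}_D(h) \le \epsilon] \ge 1 - \delta$, where
    $\mathrm{err}_D(h) = \Pr_{x \sim D}[h(x) \ne c(x)]$.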

Skipping a few other details, PAC theory is powerful because it provides a
statistical framework to quantify "well"-ness in a probabilistic setting. On
top of this, it makes absolutely no assumption about the underlying
distribution and hence is "distribution-free". This means that whatever
learning algorithm you provide should work on any possible distribution (with
the added technicality that the number of training samples needed is
polynomial in terms of the error and confidence parameters).

To get a more intuitive understanding, consider predicting the next term of
the series "1, 2, 4, 8, 16, _". Your first guess will most likely be "32"
because it looks like a geometric progression with ratio 2. But I say that the
next number is actually "12345". Am I wrong? No. Were you wrong? No. The
takeaway is that true learning is intractable when the underlying distribution
is truly "unknown" (like in this case). Instead, I will quantify the
correctness and tell you that I am 99% confident that the generalization error
of my predictions on unseen points will be at most 10%. Note that those
numbers are arbitrary, just to get the point across.

Now that we have PAC theory in place: this tool is often hard to use directly
because it is very strict. Instead, we rely on complexity measures that can be
easier to work with in some scenarios, such as the Rademacher complexity
(which is data/distribution-dependent), the Growth Function, or the VC
Dimension.
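
As one concrete example of what such a bound looks like (a standard
Rademacher-complexity bound for a loss taking values in $[0, 1]$; the exact
constants vary by source), with probability at least $1 - \delta$ over a
sample of size $m$, every hypothesis $h$ in the class $H$ satisfies

    $R(h) \le \hat{R}(h) + 2\,\mathfrak{R}_m(H) + \sqrt{\log(1/\delta)/(2m)}$

so the gap between risk and empirical risk is controlled by a complexity term
plus a confidence term that shrinks as the sample grows.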

Coming back to the paper: Neural Networks have been a hard nut to crack in
terms of such theoretical bounds, because the bounds one can prove often turn
out to be trivial and useless (e.g., "probability <= 10", which says nothing
since probabilities never exceed 1). The paper first introduces a number of
probabilistic tools, on the basis of which it claims an "efficient" (because
polynomial-time) PAC algorithm for learning intersections of polynomially many
halfspaces.
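
For reference, an intersection of $k$ halfspaces over $\mathbb{R}^n$ is a
function of the form

    $f(x) = \bigwedge_{i=1}^{k} \mathbf{1}[\langle w_i, x \rangle \ge \theta_i]$

i.e. $f(x) = 1$ exactly when $x$ satisfies all $k$ linear threshold
conditions; "polynomially many" means $k = \mathrm{poly}(n)$.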

[1] A Theory of the Learnable
([http://web.mit.edu/6.435/www/Valiant84.pdf](http://web.mit.edu/6.435/www/Valiant84.pdf))

------
dmichulke
I applied a few things in that field but my knowledge may be a bit rusty.

Still, learning a 3-layer NN in polynomial time is a big step forward, because
it entails that we don't get stuck at some error (or at least that we now know
we got stuck), which in turn entails a probably non-random weight
initialization. This is a big advantage for using and testing simple NNs.

It might also finally lend a concrete interpretation to NNs, removing their
black-box nature and increasing their adoption.

That said, I haven't read the article.

------
_0ffh
Question about "We give a polynomial-time algorithm for learning neural
networks with one hidden layer of sigmoids feeding into any smooth, monotone
activation function (e.g., sigmoid or ReLU)":

ReLU is piecewise linear, I always thought that was non-smooth. Did I get that
wrong?

~~~
isoprophlex
I think the discontinuity in the first derivative is smoothed or interpolated
somehow, and that this also happens in existing nn approaches where a gradient
must be computed...

I could be wrong though
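
[If one did want a smooth stand-in for ReLU, softplus is a common choice; the
small NumPy sketch below only illustrates the "smoothed kink" idea in this
comment, not something the paper is claimed to do.]

    import numpy as np

    def relu(x):
        # ReLU: piecewise linear, derivative jumps from 0 to 1 at x = 0
        return np.maximum(0.0, x)

    def softplus(x, beta=1.0):
        # Softplus: a smooth approximation of ReLU; approaches ReLU as beta grows
        return np.log1p(np.exp(beta * x)) / beta

    xs = np.linspace(-2.0, 2.0, 9)
    print(relu(xs))
    print(softplus(xs, beta=5.0))  # close to ReLU, but with a smooth corner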

------
laretluval
In conjunction with the universal approximation theorem, does this mean that
this algorithm can learn all p-concepts in polynomial time?

~~~
yorwba
Polynomial in the number of hidden units you need to express the network,
which may very well be exponential in the input dimension.

------
deepnotderp
This paper seems somewhat suspect.

For one, "depth-three" implies three layers, and in the standard terminology
of the field what they really mean is "depth one".

And another major red flag:

 _We give a polynomial-time algorithm for learning neural networks with one
hidden layer of sigmoids feeding into any smooth, monotone activation function
(e.g., sigmoid or ReLU)_

How is ReLU smooth?

~~~
yorwba
They cite [https://arxiv.org/abs/1610.09887](https://arxiv.org/abs/1610.09887)
for their definition of network depth, which defines it in such a way that
e.g. a ReLU network of depth 2 is of the form linear2(ReLU(linear1(input))).
That is, depth is the number of linear layers.
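
A minimal PyTorch sketch of that counting convention (the layer widths are
made up, purely for illustration; this is not the paper's architecture):

    import torch
    import torch.nn as nn

    # "Depth two" under the cited convention: two linear maps, one ReLU in between.
    depth_two = nn.Sequential(
        nn.Linear(10, 32),  # linear1
        nn.ReLU(),
        nn.Linear(32, 1),   # linear2
    )

    # "Depth three": three linear maps, two non-linearities.
    depth_three = nn.Sequential(
        nn.Linear(10, 32),
        nn.Sigmoid(),
        nn.Linear(32, 1),
        nn.Sigmoid(),
        nn.Linear(1, 1),  # if this last map is fixed, only two non-linearities act
    )

    x = torch.randn(4, 10)
    print(depth_two(x).shape, depth_three(x).shape)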

The "depth-three" model in this paper is a bit strange in that their second
layer has only one output, so the third linear layer doesn't have any effect.
I would have called this "depth two"; but it _is_ internally consistent with
their definition of depth.

> How is ReLU smooth?

It is 1-Lipschitz, which is smooth enough for them.
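
(Concretely, 1-Lipschitz means $|\mathrm{ReLU}(x) - \mathrm{ReLU}(y)| \le
|x - y|$ for all $x, y$; presumably the analysis only needs this bound on how
fast the activation can change, not a continuous derivative.)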

~~~
jkabrg
> The "depth-three" model in this paper is a bit strange in that their second
> layer has only one output, so the third linear layer doesn't have any
> effect. I would have called this "depth two"; but it is internally
> consistent with their definition of depth.

No, it does have an effect: It takes a linear combination of the outputs of
the previous layer, and then applies a non-linearity $\sigma'$. If $\sigma'$
is the logistic function, then the output of the last layer is a probability.

~~~
yorwba
No, the last layer is just a linear layer, there's no non-linearity. That's
what's strange about their definition. The depth-three network only applies a
non-linearity twice, which would conventionally be labeled as depth two.

~~~
jkabrg
You're wrong. See section 5.1. I've drawn a graph of it that shows it to be a
classical NN with a single output node.

~~~
yorwba
Yes, it is a classical NN with a single output node. I'm not disputing that, I
just think their calculation of depth is strange. The network only applies the
sigmoid function twice, and would ordinarily be regarded as having a depth of
two. The third linear layer is fixed to multiplying by 1, which is what I
meant by "has no effect". (Did you miss that I was talking about the _third_
layer?)

~~~
jkabrg
Here's a diagram [1] of a one-hidden-layer NN. It has 2 activation functions.
Their NN is of the same type.

[1] -
[https://raw.githubusercontent.com/qingkaikong/blog/master/40...](https://raw.githubusercontent.com/qingkaikong/blog/master/40_ANN_part3_step_by_step_MLP/figures/figure4_hidden_perceptron.jpg)

~~~
yorwba
The question is, how deep is that network?

------
m3kw9
That’s a loaded sentence with 2 terms I have no idea about: depth-3 and
polynomial time.

~~~
averagewall
The abstract says depth-3 means a NN with 1 hidden layer. So I guess the 3
refers to one input layer, one hidden layer, and one output layer.

Polynomial time roughly means "fast enough that it could be practical at large
scale if you have enough computing power". It's in contrast to exponential
time which means "no amount of money can ever buy enough computers to do it at
scale".

My question is what's the existing learning time for such a network? Maybe
it's already just as good but these guys have a proof that it works in general
instead of just hoping and finding that it usually seems to be fast enough?

~~~
taeric
My question is more about how many shallow neural networks are still in use.
The "deep" in deep learning typically means a depth much greater than three.

~~~
stochastic_monk
You're absolutely right. But this is still a _huge_ step forward. There has
been work on the VC dimension of neural networks for a long time (and it's
been shown to be finite), which is a necessary but not sufficient condition
for _efficient_ PAC learnability.

If it can be done for 3 layers, then maybe it can be done for more. And I
happen to really like it when my problems have polynomial time guarantees.

[VC dimension is a quantitative measure related to the complexity that a
model's parameters allow it to express. I think of it as analogous to entropy
for physical systems.]

------
master_yoda_1
That is a classic problem with hacker rank. Nobody gets it, but it's still
trending high :)

------
bluetwo
Holy cow. I don't get it.

~~~
Choco31415
See m3kw9’s comment. Responses include explanations for some of the
terminology.

~~~
bluetwo
Thanks.

