
Gated Linear Networks - asparagui
https://arxiv.org/abs/1910.01526
======
fxtentacle
That is an amazing paper, a great result, and new neural architectures are
long overdue.

But I don't believe that this has any significance in practice.

GPU memory is the limiting factor for most current AI approaches. And that's
where the typical convolutional architectures shine, because they effectively
compress the input data, work on the compressed representation, and then
decompress the results. With gated linear networks, I'm required to always
work on the full input data, because it's a one-step prediction. As a result,
I'll run out of GPU memory before I reach a learning capacity comparable to
conv nets.

~~~
friendly_aixi
Convolution is a linear operation; in the case of images, you can view it as a
multiplication with a doubly block circulant matrix. I can't see any barriers
to hybrid approaches here, though it seems difficult to avoid using
backpropagation for credit assignment within the convolutional layers.
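
To make the circulant-matrix view concrete, here is a minimal NumPy sketch
(my own illustration, not from the paper): it builds the doubly block
circulant matrix for circular 2-D convolution on a small image and checks it
against an FFT-based circular convolution.

    import numpy as np

    n = 4                                   # n x n image
    rng = np.random.default_rng(0)
    image = rng.standard_normal((n, n))

    # Zero-pad a 3x3 kernel to n x n so the circular shifts line up.
    k = np.zeros((n, n))
    k[:3, :3] = rng.standard_normal((3, 3))

    # Doubly block circulant matrix: entry ((i, j), (a, b)) holds
    # k[(i - a) mod n, (j - b) mod n], so C @ vec(image) is the
    # circular convolution of the image with k.
    C = np.empty((n * n, n * n))
    for i in range(n):
        for j in range(n):
            for a in range(n):
                for b in range(n):
                    C[i * n + j, a * n + b] = k[(i - a) % n, (j - b) % n]

    via_matrix = (C @ image.ravel()).reshape(n, n)
    via_fft = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(k)))
    print(np.allclose(via_matrix, via_fft))  # True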

Re: significance, how about their application in regression:
[https://arxiv.org/abs/2006.05964](https://arxiv.org/abs/2006.05964) ? Or in
contextual bandits:
[https://arxiv.org/abs/2002.11611](https://arxiv.org/abs/2002.11611) ?

Disclaimer: I am one of the authors (Joel).

~~~
fxtentacle
Wow, that G-GLN paper is really new. Thanks for sharing it :)

My first impression is that this will be very challenging to use for many of
the people currently using AI in practice, because of the requirement that the
result be convex and Gaussian-distributed.

For example, I currently work on optical flow, where the loss functions are
usually very jumpy and typically only convex within a few pixels around the
correct result. I have seen plenty of modeling errors in optical flow SOTA
papers, for example casting a boolean occlusion term so that it can be added
to the loss (won't work: no gradient). I have also seen how much authors
struggle with the irregularities of their loss functions and with convergence
issues, for example by fixing all random initialization to a seed and then
parameter-scanning on that value (very expensive).
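
To see why the boolean cast fails, here is a toy PyTorch snippet (my own
illustration, not from any of the papers above): a thresholded occlusion mask
is piecewise constant, so adding it to the loss contributes exactly zero
gradient.

    import torch

    flow_error = torch.tensor([0.3, 1.2, 0.7], requires_grad=True)
    occluded = (flow_error > 1.0).float()      # hard boolean cast
    loss = (flow_error ** 2 + occluded).sum()  # occlusion added to the loss
    loss.backward()
    # The gradient is exactly 2 * flow_error; the piecewise-constant
    # occlusion term contributes nothing the optimizer can use.
    print(flow_error.grad)                     # tensor([0.6000, 2.4000, 1.4000])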

Notable examples of these difficulties would be
[https://arxiv.org/abs/1904.04998](https://arxiv.org/abs/1904.04998), which I
have never seen converge from random initialization, and
[https://arxiv.org/abs/1711.07837](https://arxiv.org/abs/1711.07837), which
diverges without supervised pre-training.

Also, while many people use the Euclidean norm of the difference between
prediction and ground truth (which corresponds to a Gaussian error model) as
their main loss term, there has been a lot of discussion recently about
whether that is a good idea, because it forces the neural network to represent
uncertainty with blur. But optical flow tends to have sharp edges where
objects end.
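
A tiny illustration of that point (my own, not from the thread): under an L2
loss, the optimal prediction for an ambiguous target is the mean of the
possibilities, which is exactly the blur problem at object edges.

    import numpy as np

    # Flow at an occlusion edge: the pixel plausibly moves either way.
    targets = np.array([-1.0, 1.0])
    preds = np.linspace(-1.5, 1.5, 301)
    l2 = [np.mean((targets - p) ** 2) for p in preds]
    print(preds[int(np.argmin(l2))])   # ~0.0: the blurry average,
                                       # not either of the sharp answers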

Combined, that means the problems I work on usually do not have a
Gaussian-distributed loss, are usually irregular, and are never convex, so I'm
not sure whether I could use G-GLN.

But the application to contextual bandits looks VERY interesting to me :)

I see great potential in using conv layers as pre-compression and then
applying decision trees on the resulting intermediate representation.

What I previously did for object segmentation was to sample a random but fixed
set of feature pair differences and use the signs as bit flags. I then trained
the decision trees on those bits to predict the boundary shape around that
pixel. I got the general idea from this paper:
[https://arxiv.org/abs/1406.5549](https://arxiv.org/abs/1406.5549)
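
Roughly, the scheme looks like this (a sketch under my own assumptions about
the details; the per-pixel features and labels here are just placeholders):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    n_samples, n_features, n_bits = 1000, 64, 32
    features = rng.standard_normal((n_samples, n_features))  # per-pixel descriptors
    labels = rng.integers(0, 2, n_samples)                   # boundary / no boundary

    # Random but fixed feature pairs; each bit is the sign of a difference.
    pairs = rng.integers(0, n_features, size=(n_bits, 2))
    bits = (features[:, pairs[:, 0]] - features[:, pairs[:, 1]]) > 0

    tree = DecisionTreeClassifier(max_depth=8).fit(bits, labels)
    print(tree.predict(bits[:5]))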

But it sounds like, by moving from difference bits to halfspaces and from
linear regression to GLNs, your paper "Online Learning in Contextual Bandits
using Gated Linear Networks" could greatly improve on that. I'd be curious to
see how those bandits do on BSDS500.

BTW, are you aware of any discussion groups for AI image processing that are
open to members of the public?

[https://www.reddit.com/r/deeplearning/](https://www.reddit.com/r/deeplearning/)
seems to be mostly people who just took a Udemy / Coursera course, so it's
usually about re-using existing models, with almost no talk about research.

~~~
helges
> BTW, are you aware of any discussion groups for AI image processing that are
> open to members of the public?

r/machinelearning is the most suitable subreddit for this, since it is
actively frequented by researchers and the quality of discussion is much
higher than in r/deeplearning.

~~~
fxtentacle
Thanks :)

------
Immortal333
"We show that this architecture gives rise to universal learning capabilities
in the limit, with effective model capacity increasing as a function of
network size in a manner comparable with deep ReLU networks."

What exactly does this statement mean?

~~~
fxtentacle
They mean that if you add parameters, the learning capability of their
approach grows by a similar amount as if you added the same number of
parameters to a conv+ReLU network (the standard approach).

That "universal" is a weird claim in my opinion, but they mean that with
enough parameters, this architecture can learn anything.

~~~
Immortal333
I was able to get the second part of the statement, but I hadn't seen "in the
limit" used in a statement like this before.

Yes, universal approximation is a strong claim. Neural networks have been
proven to be universal approximators theoretically.

~~~
friendly_aixi
The result here is stronger, in the sense that typical NN universality results
are statements with respect to capacity alone (and not how you optimise the
network). Here, the result holds with respect to both capacity and a choice of
suitable no-regret online convex optimisation algorithm (e.g. online gradient
descent). Of course, this is just one desirable property of a general-purpose
learning algorithm.
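
For intuition, here is a minimal sketch of a single gated geometric-mixing
neuron updated by online gradient descent (my reading of the paper; the
halfspace sampling, learning rate, and clipping bound are illustrative
assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    d, side_dim, n_bits = 8, 16, 4        # inputs, side info dim, gating bits

    # Halfspace gating: each bit is sign(v . z); together the bits pick
    # one of 2^n_bits weight vectors.
    V = rng.standard_normal((n_bits, side_dim))
    W = np.zeros((2 ** n_bits, d))

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def context(z):
        bits = (V @ z) > 0
        return int(bits.astype(int) @ (1 << np.arange(n_bits)))

    def step(p, z, y, lr=0.1, w_max=5.0):
        """Predict from input probabilities p given side info z, then
        take one online gradient descent step on the log loss."""
        c = context(z)
        p = np.clip(p, 1e-4, 1 - 1e-4)
        x = np.log(p / (1.0 - p))         # logits of the input probabilities
        pred = sigmoid(W[c] @ x)
        # For geometric mixing, the log-loss gradient w.r.t. the weights
        # is simply (pred - y) * x; clipping keeps weights in a convex set.
        W[c] = np.clip(W[c] - lr * (pred - y) * x, -w_max, w_max)
        return pred

    # One online example: d input probabilities, side info z, target y.
    print(step(rng.uniform(0.1, 0.9, size=d), rng.standard_normal(side_dim), y=1.0))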

~~~
currymj
Do you know of any other kinds of universal function approximators that also
have a good regret bound like this, or is yours the first one?

global convergence to any arbitrarily weird function at rate O(sqrt(T)) seems
amazing, almost too good to be true, and I’m wondering what the catch is.
Maybe it’s just a moderately nice property but not extraordinary? Maybe there
are some horrible constants hiding in there?

~~~
friendly_aixi
1) It's definitely not the first. Other methods have universal guarantees of
some form or other with well quantified rates of convergence, e.g. k-NN would
be the best known example.

2) There are some restrictions on the class of density functions it can model,
so arbitrarily weird is a bit strong, but the model class is very general.

3) The weights needed to model any function in this class, although finite,
can be arbitrarily large. The regret of a single neuron depends on the
diameter of the convex set your weights reside in, so there is a nasty
constant of sorts in there, and it will also carry over when you analyse the
regret of a complete network (see the bound sketched after this list). With
such a general statement, that is sadly unavoidable.

4) The universality result on its own is just a nice property. See it as the
first stepping stone to a more meaningful analysis. What you really want is
for the model class to grow as you add more neurons, using weights within a
realistic range, and for the method to perform well in practice on problems
people care about. We provide empirical evidence that the capacity grows on
par with deep ReLU networks in our capacity experiments, and we show a bunch
of results where the method works, but we don't have a theoretical
characterisation of the class of density functions it can model well (i.e. if
the function has some nice structural property, then a network of reasonable
size is guaranteed to learn a good approximation quickly). Such a result would
be extraordinary in my eyes. Because the network is composed of simple and
well-understood building blocks, I am optimistic that such an analysis will be
possible in the future.
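
To make the diameter dependence in point 3 concrete, the textbook online
gradient descent bound (standard online convex optimisation material, not
specific to GLNs) reads:

    % OGD over T rounds: convex losses \ell_t, gradients bounded by G,
    % weights in a convex set of diameter D, step size \eta = D/(G\sqrt{T}):
    R_T = \sum_{t=1}^{T} \ell_t(w_t) - \min_{w^\ast} \sum_{t=1}^{T} \ell_t(w^\ast)
        \le \frac{D^2}{2\eta} + \frac{\eta G^2 T}{2} = D G \sqrt{T}

The O(sqrt(T)) rate is there, but an arbitrarily large diameter D multiplies
it.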

~~~
currymj
Thank you very much for the detailed response!

Much to chew on here; it really does seem like a very interesting class of
models. From the papers it sounds like clipping the weights to a small set
works okay in practice, so the constant factors shouldn't be too bad.

I may have to sit down and try to implement these...

------
nl
I didn't realise Hutter was on leave from ANU at DeepMind.

------
caretak3r
As a relative neophyte in this realm, this is fascinating to read. Comparing
this to the models/methods used to derive said properties is a good education
for me.

