
Wide Neural Networks of Any Architecture Are Gaussian Processes - gyang
https://arxiv.org/abs/1910.12478
======
throwlaplace
Without reading the paper, I bet it comes down to the central limit theorem,
which itself comes down to the fact that 3rd-order and higher terms don't
matter asymptotically. The question is always when you start appreciably
approaching the asymptote (answer: we often have no idea).

Edit: re higher-order terms, I'm talking about the proof of the classical CLT:
[https://en.wikipedia.org/wiki/Central_limit_theorem#Proof_of...](https://en.wikipedia.org/wiki/Central_limit_theorem#Proof_of_classical_CLT)

There's a Taylor series expansion of the characteristic function of the
centered RV that's truncated to second order.
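
Concretely, the truncation step in that classical proof (textbook material,
nothing specific to the paper): for a centered RV with unit variance and
characteristic function \varphi,

    \varphi(t) = 1 - t^2/2 + o(t^2), so
    [\varphi(t/\sqrt{n})]^n = (1 - t^2/(2n) + o(1/n))^n  ->  e^{-t^2/2},

which is the characteristic function of N(0, 1); the 3rd- and higher-order
terms are exactly what gets absorbed into the o(1/n).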

~~~
SiempreViernes
You're right about the central limit theorem appearing, but series expansions
didn't appear; instead it is the fact that the weights are initialized to
random values that seems to carry the day.

I couldn't find any mention of a _trained_ NN; this is strictly about the
initial state. Yang does reference a few papers that supposedly leverage the
GP correspondence to gain some insight into how to better initialize a NN,
for example this one:
[https://arxiv.org/abs/1803.01719](https://arxiv.org/abs/1803.01719)

~~~
gyang
Yes. I will have things to say about training, but that requires building up
some theoretical foundations. This paper is the first step in laying it out.
Stay tuned! :)

~~~
SiempreViernes
So do you already have any intuition of what training does to the initial GP?
Obviously training adjusts the various weights in complicated ways, which to
me feels like it should correspond to some sort of marginalisation on the GP,
but I'm not really sure whether that's a thing people do (undoubtedly someone
has tried it though).

~~~
gyang
@fgabriel mentioned this below: if the network is parametrized in a certain
way, then the GP evolves according to a linear equation (if trained with
square loss). In this linear equation, a different kernel shows up, known as
the Neural Tangent Kernel. An intuitive way to think about this is to Taylor
expand the parameters-to-function map around the initial set of parameters: f
= f_0 + J d\theta, where J is the Jacobian of the neural network function
with respect to the parameters. Following this logic, the change in parameters
affects the neural network function roughly _linearly_, as long as the
parameters don't venture too far away from their original values. The Neural
Tangent Kernel is then given by JJ^T.
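
To make the JJ^T formula concrete, here is a toy numpy sketch (my own
illustration with made-up sizes, not code from any of these papers) computing
the empirical NTK of a one-hidden-layer tanh network in the NTK
parametrization f(x) = v^T tanh(W x / sqrt(d)) / sqrt(n):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 3, 512                         # made-up input dim and hidden width
    W = rng.normal(size=(n, d))           # N(0,1) entries; scaling lives in f
    v = rng.normal(size=n)

    def jacobian(x):
        h = np.tanh(W @ x / np.sqrt(d))                      # hidden activations
        dW = np.outer(v * (1 - h ** 2), x) / np.sqrt(n * d)  # df/dW
        dv = h / np.sqrt(n)                                   # df/dv
        return np.concatenate([dW.ravel(), dv])

    x1, x2 = rng.normal(size=d), rng.normal(size=d)
    J = np.stack([jacobian(x1), jacobian(x2)])   # 2 x (number of parameters)
    print(J @ J.T)                               # 2x2 empirical NTK Theta(x_i, x_j)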

In addition to the paper mentioned by @fgabriel, this paper [1] explains it in
more detail as well, and the equations you are looking for are 14, 15, and 16.

[1] [https://arxiv.org/abs/1902.06720](https://arxiv.org/abs/1902.06720)

------
radford-neal
This looks interesting (have only glanced at it so far). However, the abstract
and introduction are a bit misleading regarding what I did in my 1994 thesis
([http://www.cs.utoronto.ca/~radford/thesis.abstract.html](http://www.cs.utoronto.ca/~radford/thesis.abstract.html)).
My results about convergence of neural networks to Gaussian processes were not
confined to shallow networks. Also, I discussed how to get network priors to
converge to non-Gaussian stable processes, by using priors for weights that
have infinite variance. Such non-Gaussian priors may well be more interesting
than Gaussian processes, for problems where a "really sophisticated smoother"
is not going to be adequate.

~~~
gyang
Hi Radford, I'm very happy that this paper got your attention, but I regret
that I did not represent your research accurately!

Indeed, in your thesis, section 2.3 is on "Priors for networks with more than
one hidden layer." I will fix this in the next version of the paper, and as a
prerequisite, I'd like to make sure I understand your contributions correctly.

Is the following summary accurate?

In section 2.3, you explored the GP limit for more than one hidden layer
numerically, and offered some thoughts on the decay behavior of the GP kernel
associated with an MLP with a step-function nonlinearity. You also considered
mixing Gaussian and non-Gaussian priors in different hidden layers, and
finally mused about the infinite-depth limit of the infinitely wide MLP.

However, I could not find a rigorous treatment of the multi-layer GP limit (in
the vein of Lee et al. and Matthews et al. (2019)). Does it exist elsewhere in
the thesis?

~~~
radford-neal
Looking now at my thesis, I agree that I don't explicitly argue theoretically
for why multilayer networks will (under suitable conditions) converge to
Gaussian processes. However, it follows (at some level of rigour) pretty
directly from the fact (which I do note) that if a single-hidden-layer network
has multiple outputs, the functions computed by these outputs will be
independent (in the prior) as the number of hidden units goes to infinity. So
if you add another hidden layer, the functions computed by the units in this
layer will be independent (they're like outputs of the previous layer), and
the argument for why the outputs from this layer form a GP goes through as
before. I'm not sure why I didn't explicitly note this. It's implicitly
assumed in my discussion of how the covariance function for networks with
step-function hidden units changes as you add more layers.
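
For intuition, a rough numpy sketch of that decoupling (toy sizes; just an
illustration, not anything from the thesis): two outputs of a random
one-hidden-layer ReLU net are uncorrelated by construction, but they stay
dependent through the shared hidden layer, and that residual dependence,
measured here by the correlation of the squared outputs, dies off as the
width grows.

    import numpy as np

    rng = np.random.default_rng(0)
    d, trials = 5, 10000
    x = rng.normal(size=d)                 # one fixed input

    for n in [10, 100, 1000]:
        f = np.empty((trials, 2))
        for t in range(trials):
            W = rng.normal(size=(n, d)) / np.sqrt(d)   # shared hidden layer
            V = rng.normal(size=(2, n)) / np.sqrt(n)   # independent output weights
            f[t] = V @ np.maximum(W @ x, 0.0)
        print(n, round(np.corrcoef(f[:, 0] ** 2, f[:, 1] ** 2)[0, 1], 3))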

~~~
gyang
Right, so your argument would work if you allow the layer widths to tend to
infinity sequentially (so this corresponds to finite networks where each
previous layer is much bigger than the next layer). This is the argument
presented by Lee et al. But note it's nontrivial to argue that this limit
holds when the widths of all layers tend to infinity at the same time
(arguably the more natural limit), which is one of the main contributions of
Matthews et al. In my paper here, I also consider this limit where the widths
tend to infinity at the same time.

In any case, I'll update the paper to reflect our discussion here. Thanks,
Radford!

------
joe_the_user
This sounds important and interesting but isn't _wide_ the key word here?

They talk about shallow NNs and deep fully connected NNs but that would seem
to leave out a lot.

I mean, the article puts forward a distinct language/model to express neural
nets in, which is cool, but are they talking about all or most of the NNs you
see today? If so, huge, but still.

Fo-get-a-bout-it, see SiempreViernes' comment: "I couldn't find any mention of
a _trained_ NN; this is strictly about the initial state." (emphasis added)

[https://news.ycombinator.com/item?id=21653516](https://news.ycombinator.com/item?id=21653516)

~~~
gyang
Hi, the author here. Thanks for your interest! Let me try answering some of
your questions.

> This sounds important and interesting but isn't _wide_ the key word here?

Yes, width is very important for this result. Given the size of modern deep
neural networks, I (and most people in the deep learning theory community, by
now) believe the large-width regime is the appropriate regime in which to
study neural networks.

> I mean, the article puts forward a distinct language/model to express
> neural nets in, which is cool but are they talking about all or most of the
> NNs you see today?

Throw me an architecture and watch if I can't throw you back a GP :)

> Fo-get-a-bout-it, see SiempreViernes' comment: "I couldn't find any mention
> of a trained NN; this is strictly about the initial state." (emphasis
> added)

Yes. I will have things to say about training, but that requires building up
some theory. This paper is the first step in laying it out. Stay tuned! :)

~~~
credit_guy
> Throw me an architecture and watch if I can't throw you back a GP :)

On the pragmatic side, would that GP train faster than the NN? In my limited
experimentation with GPs, I found them awfully slow. However, maybe what I
tried (it was a black box to me) used some brute-force approach, and there are
other, more fine-tuned algorithms. Since you are an expert in the area, what's
your take?

~~~
gyang
I think in general, "training" a GP, i.e. doing GP inference (or kernel
regression), is not done for speed reasons, but rather because GPs are _sample
efficient_. More concretely, the practical folk wisdom regarding GPs is that
when there is not much data, GP inference with a well-chosen kernel can give
you much more bang for the buck than a neural network. However, when there is
a lot of data (especially in perceptual domains like vision and language),
neural networks typically train faster and generalize better.
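
For a sense of what GP inference looks like operationally, here is a
bare-bones kernel-regression sketch (generic textbook GP regression, not tied
to the paper; the RBF kernel and noise level are arbitrary choices):

    import numpy as np

    def rbf(A, B, length=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * length ** 2))

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(20, 1))              # small training set
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
    Xs = np.linspace(-3, 3, 100)[:, None]             # test inputs

    K = rbf(X, X) + 1e-2 * np.eye(len(X))             # kernel matrix + noise
    Ks = rbf(Xs, X)
    mean = Ks @ np.linalg.solve(K, y)                 # posterior mean
    cov = rbf(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)
    std = np.sqrt(np.clip(np.diag(cov), 0, None))     # posterior uncertainty

The linear solve against the full kernel matrix is also why vanilla GP
inference gets painful as the training set grows, which is part of why GPs
tend to shine in the small-data regime.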

I wouldn't say I'm an expert at _using_ GPs, so actual GP practitioners,
please feel free to correct me if I'm wrong :)

------
zwaps
Good results often seem "obvious" ex post. So take this as a compliment.

Intuitively (I have never read a paper in this field), since you are talking
about wide networks, I also expected that a CLT would be used. For "dense"
layers it is pretty obvious that one should be able to characterize each
layer's aggregation via a CLT, and so forth. Some sort of mild independence
assumption on the sampling, mixing or martingales, should be sufficient. I
therefore think the goal of a paper covering any architecture would be to
figure out a way to generalize this across different layers.

However, one thing I notice is that you seem to assume that weights (or
whatever is initialized) are initialized with a Gaussian distribution?

That seems a bit restricted. The appeal of this approach with wide networks,
I think, is that any independent initialization of the weights would lead to
transformations of a GP.

Perhaps I am also misunderstanding the implications. Could you generally trace
a dependence on the distribution of the last-layer weights, even if they are
not normal? Or do you need the GP for your conditioning?

For one thing, Gaussian initialization does not, I think, really encompass
"all architectures" as practically used; more to the point, it seems that the
end goal of this research program would be to generalize beyond it (much like
in regression, one uses CLTs exactly to get away from parametric assumptions).
Or is that where you plan to go?

~~~
zwaps
Reading a bit more, it's really interesting that your result relies on Lemma
G.4, which is a CLT based on independence or mixing (as you wish), whereas all
theorems assume that anything put in is Gaussian.

It "smells" like that is, or should not be necessary.

The elegance of this approach, and of GPs in general, is that you use scale
and independence and arrive at a specific distribution. Therefore, assuming
that the inputs are Gaussian seems restrictive. In some appropriate sense, it
should not be required.

But again, I am probably misreading something.

What I would be looking for as a referee is: "Based on ANY random
initialization (mild independence condition), it holds that wide networks
become Gaussian"

~~~
gyang
Hi @zwaps, thanks so much for your interest! My answer to @throwlaplace seems
relevant to your question, so let me copy it here and comment on your specific
questions afterward.

> The CLT would be a good guess at approaching this problem, and indeed it is
> the approach of prior works [1][2]. But in this paper, the key answer is
> actually the law of large numbers, though the CLT would feature more
> prominently if we allow weights to be sampled from a non-Gaussian
> distribution.

> The TLDR proof goes like this: via some recursive application of the law of
> large numbers, we show that the kernel (i.e. Gram matrix) of the final-layer
> embeddings of a set of inputs will converge to a deterministic kernel. Then,
> because the last-layer weights are Gaussian, the convergence of the kernel
> implies convergence of the output distribution to a Gaussian.

> 99% of the proof is on how to recursively apply the law of large numbers.
> This uses a technique called Gaussian conditioning, which, as its name
> suggests, is only applicable because the distribution of weights is
> Gaussian.
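
To illustrate the quoted kernel-convergence step numerically, here is a toy
sketch of my own (a plain tanh MLP, not code from the paper): the Gram matrix
of the last-layer embeddings of two fixed inputs fluctuates less and less
across random initializations as the width grows.

    import numpy as np

    rng = np.random.default_rng(0)
    d, depth = 3, 3
    x = np.stack([rng.normal(size=d), rng.normal(size=d)])   # two fixed inputs

    def last_layer_gram(n):
        h = x
        for _ in range(depth):
            W = rng.normal(size=(h.shape[1], n)) / np.sqrt(h.shape[1])
            h = np.tanh(h @ W)
        return h @ h.T / n             # 2x2 Gram matrix of the embeddings

    for n in [10, 100, 1000]:
        grams = np.stack([last_layer_gram(n) for _ in range(20)])
        print(n, grams.std(axis=0).max())   # spread across 20 random nets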

So, you are right that Prop G.4 can be easily generalized to non-Gaussian
cases. However, this prop is only 1% of the entire proof, as explained above;
the 99% is on inductively handling weight matrices that are possibly reused
over and over again (as in an RNN), and a priori it's not clear we can say
nice things about their behavior (this is also where previous arguments
relying on the CLT break down).

As mentioned above, the meat of the argument is the Gaussian conditioning
technique, which roughly says the following: A Gaussian random matrix A,
conditioned on a set of equations of the form y = A x or y = A^T x with
deterministic x's and y's, is distributed as E + Pi_1 A' Pi_2, where E is some
deterministic matrix, Pi_1 and Pi_2 are some orthogonal projection matrices,
and A' is an iid copy of A. See Lemma G.7. This lemma allows us to inductively
reason about a weight-tied neural network by conditioning on all the
computation done before a particular step in the induction. However, this
technique is not available if the weights are not sampled from a Gaussian.
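
To see the shape of that statement in the simplest possible case (a single
constraint y = A x; my own toy illustration, not the full Lemma G.7): for A
with iid N(0,1) entries, conditioning on A x = y gives E + A' Pi with
E = y x^T / ||x||^2 and Pi = I - x x^T / ||x||^2, and sampling from that
formula hits the constraint exactly.

    import numpy as np

    rng = np.random.default_rng(0)
    m, k = 4, 6
    x = rng.normal(size=k)
    y = rng.normal(size=m)                   # values we condition A x to equal

    P = np.outer(x, x) / (x @ x)             # orthogonal projection onto span{x}
    E = np.outer(y, x) / (x @ x)             # deterministic part of the conditional law
    A_prime = rng.normal(size=(m, k))        # fresh iid copy of A
    A_cond = E + A_prime @ (np.eye(k) - P)   # a sample of A given A x = y

    print(np.allclose(A_cond @ x, y))        # True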

Now, you are right that this result should apply to any reasonable
non-Gaussian initialization as well, as seen in experiments. There are
standard techniques for swapping out Gaussian variables with other
"reasonable" variables (see the section on "Invariance Principle" in [3]), so
it becomes roughly an exercise in probability theory. I think most folks who
have seen such "universality" results would guess that Gaussians can be
swapped with uniform variables, etc. From a machine learning perspective,
perhaps this is not as interesting as showing universality in _architecture_,
especially as new architectures keep flooding in and old theoretical results
become irrelevant quite quickly. More importantly, the tensor program
framework gives an automatic way of converting an architecture to a GP, and I
believe this is a tool many folks in the ML community will find useful.

[1] Deep Neural Networks as Gaussian Processes.
[https://openreview.net/forum?id=B1EA-M-0Z](https://openreview.net/forum?id=B1EA-M-0Z)

[2] Gaussian Process Behaviour in Wide Deep Neural Networks.
[https://openreview.net/forum?id=H1-nGgWC-](https://openreview.net/forum?id=H1-nGgWC-)

[3] O'Donnell, Ryan. Analysis of boolean functions.

------
p1esk
OK, let's say NNs are GPs. What can we do with this information?

~~~
joshvm
You can use it to estimate model uncertainty; Yarin Gal has some nice
write-ups on this:
[https://www.cs.ox.ac.uk/people/yarin.gal/website/blog_3d801a...](https://www.cs.ox.ac.uk/people/yarin.gal/website/blog_3d801aa532c1ce.html)
(in this case using dropout networks as GP approximations).
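
Schematically, the Monte-Carlo-dropout recipe from those write-ups looks
something like this (a toy numpy sketch of my own, with untrained stand-in
weights, just to show the mechanics):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, p_drop = 3, 128, 0.1
    W1 = rng.normal(size=(n, d)) / np.sqrt(d)   # stand-ins for trained weights
    w2 = rng.normal(size=n) / np.sqrt(n)

    def forward(x):
        h = np.maximum(W1 @ x, 0.0)
        mask = rng.random(n) > p_drop           # dropout stays on at test time
        return w2 @ (h * mask) / (1 - p_drop)

    x = rng.normal(size=d)
    samples = np.array([forward(x) for _ in range(200)])
    print(samples.mean(), samples.std())        # predictive mean and spread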

~~~
PeterisP
How would we use a property of networks with _random weights_ to estimate the
uncertainty of trained models, in which the weights are (as much as we can
manage) trained to be _not_ random?

------
mikewarot
So, if I understand this correctly...

Gaussian Processes are a way of trying every possible function to fit a set of
data points, being constrained a bit more with each new data point.

All neural networks in the list, given sufficient size, are essentially
Gaussian in their behavior, and thus share the same features and limitations.

Right?

------
mikorym
I don't quite follow what the Gaussian process is here. What do you use: node
weights, node inputs/outputs, or...?

------
gutuysjhfyt
This is beautiful.

