Edit: re higher-order terms, I'm talking about the proof of the classic CLT https://en.wikipedia.org/wiki/Central_limit_theorem#Proof_of...
There's a Taylor series expansion of the characteristic function of the centered random variable that's truncated to second order.
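For concreteness, the step I mean (the standard textbook sketch, not anything from the paper under discussion): for iid centered X_i with unit variance, the characteristic function of the normalized sum S_n = (X_1 + ... + X_n)/sqrt(n) is expanded and truncated at second order,

    \varphi_{S_n}(t) = \left[\varphi_X\!\left(\frac{t}{\sqrt{n}}\right)\right]^n
                     = \left[1 - \frac{t^2}{2n} + o\!\left(\frac{1}{n}\right)\right]^n
                     \to e^{-t^2/2},

and e^{-t^2/2} is exactly the characteristic function of N(0,1); everything beyond the second-order term is what gets thrown away.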
I couldn't find any mention of a trained NN; this is strictly about the initial state. Yang does reference a few papers that supposedly leverage the GP correspondence to gain some insight into how to better initialize a NN, for example this: https://arxiv.org/abs/1803.01719
In addition to the paper mentioned by @fgabriel, this paper explains it in more detail as well; the equations you are looking for are 14, 15, and 16.
I have only barely glanced at the paper, mind you, so I can't speak to the details, but still.
The CLT would be a good guess at approaching this problem, and indeed it is the approach of prior works. But in this paper, the key answer is actually the law of large numbers, though the CLT would feature more prominently if we allow weights to be sampled from a non-Gaussian distribution.
The TLDR proof goes like this: via a recursive application of the law of large numbers, we show that the kernel (i.e. Gram matrix) of the final-layer embeddings of a set of inputs converges to a deterministic kernel. Then, because the last-layer weights are Gaussian, convergence of the kernel implies convergence of the output distribution to a Gaussian.
99% of the proof is on how to recursively apply the law of large numbers. This uses a technique called Gaussian conditioning, which, as its name suggests, is only applicable because the distribution of weights is Gaussian.
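If a numerical sanity check helps, here is a quick sketch (mine, not the paper's code): sample many independent Gaussian initializations of a small ReLU MLP, record the scalar output at one fixed input, and watch the excess kurtosis drift toward 0 (the Gaussian value) as the width grows.

    import numpy as np

    def mlp_output(x, width, depth=3, rng=None):
        """One forward pass of a ReLU MLP with iid Gaussian weights (He scaling)."""
        rng = rng or np.random.default_rng()
        h = x
        for _ in range(depth):
            W = rng.normal(0.0, np.sqrt(2.0 / h.shape[0]), size=(width, h.shape[0]))
            h = np.maximum(W @ h, 0.0)                 # ReLU hidden layer
        w = rng.normal(0.0, np.sqrt(1.0 / h.shape[0]), size=h.shape[0])
        return w @ h                                   # scalar readout

    rng = np.random.default_rng(0)
    x = rng.normal(size=16)                            # one fixed input
    for width in (4, 32, 256):
        outs = np.array([mlp_output(x, width, rng=rng) for _ in range(1000)])
        z = (outs - outs.mean()) / outs.std()
        print(width, round(float(np.mean(z**4) - 3.0), 3))   # excess kurtosis, 0 for a Gaussian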
> The question is always when do you start appreciably approaching the asymptote (answer: we often have no idea)
Check out the GitHub repo attached to this paper and look at plot (E) in the README. It shows the empirical rate of convergence for different architectures; in general, the Frobenius norm of the deviation from the limit decays like 1/sqrt(width).
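If you'd rather not clone the repo, a rough home-grown version of that plot looks like this (my sketch; I use a very wide network as a stand-in for the analytic limit, which is an approximation):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(4, 16))                       # 4 fixed inputs of dimension 16

    def last_layer_gram(X, width, depth=3):
        """Empirical Gram matrix of the final hidden-layer embeddings."""
        H = X
        for _ in range(depth):
            W = rng.normal(0.0, np.sqrt(2.0 / H.shape[1]), size=(H.shape[1], width))
            H = np.maximum(H @ W, 0.0)
        return H @ H.T / width

    K_ref = last_layer_gram(X, width=4096)             # stand-in for the deterministic limit
    for width in (16, 64, 256, 1024):
        devs = [np.linalg.norm(last_layer_gram(X, width) - K_ref) for _ in range(20)]
        print(width, round(float(np.mean(devs)), 4))   # shrinks roughly like 1/sqrt(width)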
 Deep Neural Networks as Gaussian Processes. https://openreview.net/forum?id=B1EA-M-0Z
 Gaussian Process Behaviour in Wide Deep Neural Networks. https://openreview.net/forum?id=H1-nGgWC-
> At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit, thus connecting them to kernel methods.
I'm not familiar with ignoring higher-order terms, besides approximations and bounds like Chebyshev's inequality and Chernoff's.
I had a little workshop paper earlier this year showing that you can apply the Edgeworth expansion to wide but finite neural networks: https://arxiv.org/abs/1908.10030
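For readers who haven't seen it: the Edgeworth expansion refines the CLT by keeping (some of) the higher-order terms the CLT throws away. In the classical iid setting, the first correction to the CDF of the standardized sum reads

    F_n(x) \approx \Phi(x) - \varphi(x)\,\frac{\gamma_3}{6\sqrt{n}}\,(x^2 - 1),

where Phi and phi are the standard normal CDF and density and gamma_3 is the skewness of a single summand; the idea is to play an analogous game with the width in place of n.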
Indeed, in your thesis, section 2.3 is on "Priors for networks with more than one hidden layer." I will fix this in the next version of the paper, and as a prerequisite, I'd like to make sure I understand your contributions correctly.
Is the following summary accurate?
In section 2.3, you explored the GP limit for more than one hidden layer numerically, and offered some thoughts on the decay behavior of the GP kernel associated with an MLP with a step-function nonlinearity. You also considered mixing Gaussian and non-Gaussian priors in different hidden layers, and finally mused about the infinite-depth limit of the infinitely wide MLP.
However, I could not find a rigorous treatment of the multi-layer GP limit (in the vein of Lee et al. and Matthews et al. (2019)). Does it exist elsewhere in the paper?
In any case, I'll update the paper to reflect our discussion here. Thanks, Radford!
They talk about shallow NNs and deep fully connected NNs but that would seem to leave out a lot.
I mean, the article puts forward a distinct language/model to express neural nets in, which is cool, but are they talking about all or most of the NNs you see today? If so, huge, but still.
Fo-get-a-bout-it, see SiempreViernes' comment:
"I couldn't find any mention about a trained NN, this is strictly about the initial state. "(emphasis added)
> This sounds important and interesting but isn't wide the key word here?
Yes, width is very important for this result. Given the size of modern deep neural networks, I (and most people in the deep learning theory community, by now) believe the large width regime is the appropriate regime to study neural networks.
> I mean, the article puts forward a distinct language/model to express neural nets in, which is cool, but are they talking about all or most of the NNs you see today?
Try throwing me an architecture and watch if I can't throw you back a GP :)
> Fo-get-a-bout-it, see SiempreViernes' comment: "I couldn't find any mention of a trained NN; this is strictly about the initial state." (emphasis added)
Yes. I will have things to say about training, but that requires building up some theory. This paper is the first step in laying it out. Stay tuned! :)
On the pragmatic side, would that GP train faster than the NN? In my limited experimentation with GPs, I found them awfully slow. However, maybe what I tried (it was a black box to me) used some brute-force approach, and there are other more fine-tuned algorithms. Since you are an expert in the area, what's your take?
I wouldn't say I'm an expert at using GPs, so actual GP practitioners feel free to correct me if I'm wrong :)
Intuitively (I have never read a paper in this field), since you are talking about wide networks, I also expected that a CLT would be used. For "dense" layers it is pretty obvious that one should be able to characterize each layer's aggregation via a CLT, and so on. Some sort of mild independence assumption on the sampling, such as mixing or martingale conditions, should be sufficient.
I think the goal of a paper covering any architecture would therefore be to figure out a way to generalize this across different layer types.
However, one thing I notice is that you seem to assume that weights (or whatever is initialized) are initialized with a Gaussian distribution?
That seems a bit restrictive. The appeal of this approach with wide networks, I think, is that any independent initialization of the weights would lead to transformations of a GP.
Perhaps I am also misunderstanding the implications. Could you generally trace a dependence on the distribution of the last-layer weights, even if they are not normal? Or do you need the GP for your conditioning?
On the one hand, Gaussian initialization does not, I think, really encompass "all architectures" as practically used; more to the point, it seems that the end goal of this research program would be to generalize beyond it (much like in regression, where one uses CLTs exactly to get away from parametric assumptions). Or is that where you plan to go?
It "smells" like that is not, or at least should not be, necessary.
The elegance of this approach, and of GPs in general, is that you use scale and independence and get to a specific distribution.
Therefore, assuming that the initialization is Gaussian seems restrictive. In some appropriate sense, it should not be required.
But again, I am probably misreading something.
What I would be looking for as a referee is:
"Based on ANY random initialization (mild independence condition), it holds that wide networks become Gaussian"
> The CLT would be a good guess at approaching this problem, and indeed it is the approach of prior works. But in this paper, the key answer is actually the law of large numbers, though the CLT would feature more prominently if we allow weights to be sampled from a non-Gaussian distribution.
> The TLDR proof goes like this: via a recursive application of the law of large numbers, we show that the kernel (i.e. Gram matrix) of the final-layer embeddings of a set of inputs converges to a deterministic kernel. Then, because the last-layer weights are Gaussian, convergence of the kernel implies convergence of the output distribution to a Gaussian.
> 99% of the proof is on how to recursively apply the law of large numbers. This uses a technique called Gaussian conditioning, which, as its name suggests, is only applicable because the distribution of weights is Gaussian.
So, you are right that Prop G.4 can easily be generalized to non-Gaussian cases. However, this prop is only 1% of the entire proof, as explained above; the other 99% is about inductively handling weight matrices that are possibly reused over and over again (as in an RNN), and a priori it's not clear we can say nice things about their behavior (this is also where previous arguments relying on the CLT break down).
As mentioned above, the meat of the argument is the Gaussian conditioning technique, which roughly says the following: A Gaussian random matrix A, conditioned on a set of equations of the form y = A x or y = A^T x with deterministic x's and y's, is distributed as E + Pi_1 A' Pi_2, where E is some deterministic matrix, Pi_1 and Pi_2 are some orthogonal projection matrices, and A' is an iid copy of A. See Lemma G.7. This lemma allows us to inductively reason about a weight-tied neural network by conditioning on all the computation done before a particular step in the induction. However, this technique is not available if the weights are not sampled from a Gaussian.
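To give a feel for the lemma in the simplest possible case (my specialization here, not Lemma G.7 verbatim): if A has iid N(0,1) entries and we condition on the single constraint Ax = y, with x fixed and nonzero, then

    A \mid \{Ax = y\} \;\overset{d}{=}\; \frac{y x^\top}{\|x\|^2} + \tilde{A}\left(I - \frac{x x^\top}{\|x\|^2}\right),

where \tilde{A} is an iid copy of A; here E = y x^T/||x||^2, Pi_1 = I, and Pi_2 is the projection orthogonal to x. The general lemma handles many constraints, of both the y = Ax and y = A^T x kinds, simultaneously, which is what the induction over a weight-tied computation needs.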
Now, you are right this result should apply to any reasonable non-Gaussian initialization as well, as seen from experiments. There are standard techniques for swapping out Gaussian variables with other "reasonable" variables (see the section on "Invariance Principle" in O'Donnell's book, cited below), so it becomes roughly an exercise in probability theory. I think most folks who have seen such "universality" results would guess that Gaussians can be swapped with uniform variables, etc. From a machine learning perspective, perhaps this is not as interesting as showing universality in architecture, especially as new architectures are invented like a flood, and old theoretical results become irrelevant quite quickly. More importantly, the tensor program framework gives an automatic way of converting an architecture to a GP, and I believe this is a tool many folks in the ML community will find useful.
O'Donnell, Ryan. Analysis of Boolean Functions.
It would be nice to get the same result during training for all these architectures. I believe that will be the next paper from G. Yang, and I am eager to read it.
Depending on how complete the map is, it may let us come up with "physical" laws of information. I am rooting for something I call Boltzmann convergence.
This is interesting because Markovian processes are much easier to build intuition about.
Gaussian processes are, loosely, a way of considering every possible function that could fit a set of data points, with the candidates constrained a bit more by each new data point.
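A tiny numerical illustration of that picture (my own sketch, with an RBF kernel chosen arbitrarily): draw functions from a GP prior, then condition on a few observed points and watch the posterior collapse onto them.

    import numpy as np

    def rbf_kernel(a, b, length=0.5):
        """Squared-exponential covariance between two sets of 1-D points."""
        return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

    rng = np.random.default_rng(3)
    xs = np.linspace(0.0, 1.0, 101)                # grid on which we represent functions
    x_obs = np.array([0.2, 0.5, 0.9])              # observed inputs
    y_obs = np.array([1.0, -0.5, 0.3])             # observed values

    # Prior: f(xs) ~ N(0, K); samples wiggle freely
    K = rbf_kernel(xs, xs) + 1e-8 * np.eye(len(xs))
    prior_samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)

    # Posterior after conditioning on (x_obs, y_obs): standard GP regression formulas
    K_oo = rbf_kernel(x_obs, x_obs) + 1e-8 * np.eye(len(x_obs))
    K_so = rbf_kernel(xs, x_obs)
    post_mean = K_so @ np.linalg.solve(K_oo, y_obs)
    post_cov = K - K_so @ np.linalg.solve(K_oo, K_so.T)
    post_samples = rng.multivariate_normal(post_mean, post_cov + 1e-8 * np.eye(len(xs)), size=3)

    # The posterior mean now passes through the observed points (xs[20], xs[50], xs[90]).
    print(np.round(post_mean[[20, 50, 90]], 2))    # ~ [ 1.  -0.5  0.3]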
All neural networks in the list, given sufficient width, are essentially Gaussian in their behavior at initialization, and thus share the same features and limitations.