Intuitively (I have never read a paper in this field), since you are talking about wide networks, I also expected that a CLT would be used. For "dense" layers it seems fairly clear that one should be able to characterize each layer's aggregation via a CLT, and so forth. Some mild condition on the sampling (independence, mixing, or a martingale structure) should be sufficient.
I therefore think the goal of a paper covering any architecture would be to find a way to generalize this to different layer types.
However, one thing I notice is that you seem to assume that weights (or whatever is initialized) are initialized with a Gaussian distribution?
That seems a bit restrictive. The appeal of this approach with wide networks, I think, is that any independent initialization of weights would lead to transformations of a GP.
Perhaps I am also misunderstanding the implications. Could you generally trace a dependence on the distribution of the last layer weights, even if they are not normal? Or do you need the GP for your conditioning?
For one thing, I think Gaussian initialization does not really cover "all architectures" as used in practice. More to the point, it seems that the end goal of this research program would be to generalize beyond it (much like in regression, where one uses CLTs precisely to get away from parametric assumptions). Or is that where you plan to go?
It "smells" like the Gaussian assumption is not, or should not be, necessary.
The elegance of this approach, and GP in general, is that you use scale and independence and get to a specific distribution.
Therefore, assuming that inputs are Gaussian seems restrictive. In some appropriate sense, it should not be required.
But again, I am probably misreading something.
What I would be looking for as a referee is:
"Based on ANY random initialization (mild independence condition), it holds that wide networks become Gaussian"
> The CLT would be a good guess at how to approach this problem, and indeed it is the approach of prior works (see the first two references below). But in this paper, the key ingredient is actually the law of large numbers, though the CLT would feature more prominently if we allowed weights to be sampled from a non-Gaussian distribution.
> The TLDR proof goes like this: via recursive application of the law of large numbers, we show that the kernel (i.e. Gram matrix) of the final layer embeddings of a set of inputs converges to a deterministic kernel. Then, because the last layer weights are Gaussian, the convergence of the kernel implies convergence of the output distribution to a Gaussian.
> 99% of the proof is on how to recursively apply the law of large numbers. This uses a technique called Gaussian conditioning, which, as its name suggests, is only applicable because the distribution of weights is Gaussian.
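To make the kernel-convergence step in the quoted summary concrete, here is a minimal NumPy sketch. It assumes a plain feedforward tanh network with untied, iid standard Gaussian weights scaled by 1/sqrt(fan_in); the function names (`final_layer_gram`, `gram_spread`) and all sizes are hypothetical choices for illustration, and the weight-tied case that the proof is really about is not covered. The sketch measures how much the final-layer Gram matrix fluctuates across random initializations, which should shrink as the width grows.

```python
import numpy as np

def final_layer_gram(X, width, depth, rng):
    """Width-normalised Gram matrix of the last hidden layer for inputs X (one row per input).

    Assumption: untied dense layers, tanh nonlinearity, N(0, 1/fan_in) weights.
    """
    H = X
    for _ in range(depth):
        fan_in = H.shape[1]
        W = rng.standard_normal((fan_in, width)) / np.sqrt(fan_in)
        H = np.tanh(H @ W)
    return H @ H.T / width

def gram_spread(X, width, depth, n_trials, rng):
    """Largest entrywise standard deviation of the Gram matrix over random initialisations."""
    grams = np.stack([final_layer_gram(X, width, depth, rng) for _ in range(n_trials)])
    return grams.std(axis=0).max()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((3, 5))  # three inputs of dimension five (arbitrary)
    for width in (16, 128, 1024):
        print(width, gram_spread(X, width, depth=2, n_trials=20, rng=rng))
```

Conditioned on the hidden layers, an output of the form `H @ v / sqrt(width)` with Gaussian `v` is exactly Gaussian with this Gram matrix as covariance, so a Gram matrix that stops fluctuating pins down the limiting output distribution.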
So, you are right that Prop G.4 can be easily generalized to non-Gaussian cases. However, this prop is only 1% of the entire proof, as explained above; the other 99% is about inductively handling weight matrices that may be reused over and over again (as in an RNN), where a priori it is not clear that we can say nice things about their behavior (and this is also where previous arguments relying on the CLT break down).
As mentioned above, the meat of the argument is the Gaussian conditioning technique, which roughly says the following: A Gaussian random matrix A, conditioned on a set of equations of the form y = A x or y = A^T x with deterministic x's and y's, is distributed as E + Pi_1 A' Pi_2, where E is some deterministic matrix, Pi_1 and Pi_2 are some orthogonal projection matrices, and A' is an iid copy of A. See Lemma G.7. This lemma allows us to inductively reason about a weight-tied neural network by conditioning on all the computation done before a particular step in the induction. However, this technique is not available if the weights are not sampled from a Gaussian.
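As a rough illustration of the conditioning statement, here is a NumPy sketch of the simplest special case only: a single constraint y = A x on a matrix A with iid standard Gaussian entries. In that case Pi_1 is the identity, Pi_2 projects onto the orthogonal complement of x, and E = y x^T / ||x||^2. The function name `sample_conditioned` is hypothetical, and the general form with several constraints of both kinds (Lemma G.7) is not reproduced here.

```python
import numpy as np

def sample_conditioned(y, x, rng):
    """Draw A ~ iid N(0, 1) conditioned on A @ x == y (single-constraint special case).

    Uses the decomposition A = E + A_fresh @ Pi_2, where
      E    = y x^T / ||x||^2   (deterministic part, satisfies E @ x == y)
      Pi_2 = I - x x^T / ||x||^2  (projection onto the orthogonal complement of x)
    and A_fresh is an independent copy of the unconditioned matrix.
    """
    n, m = y.shape[0], x.shape[0]
    sq_norm = x @ x
    E = np.outer(y, x) / sq_norm
    Pi_2 = np.eye(m) - np.outer(x, x) / sq_norm
    A_fresh = rng.standard_normal((n, m))
    return E + A_fresh @ Pi_2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(7)
    y = rng.standard_normal(4)
    A = sample_conditioned(y, x, rng)
    print(np.allclose(A @ x, y))  # the constraint holds for every draw
```

The decomposition works because, for Gaussian A, the part of A acting along x and the part acting on the orthogonal complement of x are independent; that independence is what fails for general non-Gaussian entries.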
Now, you are right that this result should apply to any reasonable non-Gaussian initialization as well, as seen from experiments. There are standard techniques for swapping out Gaussian variables with other "reasonable" variables (see the section on the "Invariance Principle" in O'Donnell's book, listed below), so it becomes roughly an exercise in probability theory. I think most folks who have seen such "universality" results would guess that Gaussians can be swapped with uniform variables, etc. From a machine learning perspective, perhaps this is not as interesting as showing universality in architecture, especially since new architectures arrive in a flood and old theoretical results become irrelevant quite quickly. More importantly, the tensor program framework gives an automatic way of converting an architecture to a GP, and I believe this is a tool many folks in the ML community will find useful.
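A quick empirical check of the universality claim can be sketched as follows. This is an illustrative experiment, not one taken from the paper: it assumes a one-hidden-layer tanh network, unit-variance weights drawn from a Gaussian, Rademacher, or uniform distribution, and uses excess kurtosis (zero for a Gaussian) as a crude Gaussianity measure. All function names and sizes are hypothetical.

```python
import numpy as np

def sample_weights(shape, dist, rng):
    """Zero-mean, unit-variance weights from one of three distributions."""
    if dist == "gaussian":
        return rng.standard_normal(shape)
    if dist == "rademacher":
        return rng.choice([-1.0, 1.0], size=shape)
    if dist == "uniform":
        half_width = np.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has variance 1
        return rng.uniform(-half_width, half_width, size=shape)
    raise ValueError(f"unknown distribution: {dist}")

def network_output(x, width, dist, rng):
    """Scalar output of a freshly initialised one-hidden-layer tanh network."""
    W1 = sample_weights((x.shape[0], width), dist, rng) / np.sqrt(x.shape[0])
    h = np.tanh(x @ W1)
    v = sample_weights((width,), dist, rng) / np.sqrt(width)
    return h @ v

def excess_kurtosis(samples):
    """Sample excess kurtosis; close to 0 for Gaussian data."""
    z = (samples - samples.mean()) / samples.std()
    return np.mean(z ** 4) - 3.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(5)
    for dist in ("gaussian", "rademacher", "uniform"):
        outs = np.array([network_output(x, 512, dist, rng) for _ in range(4000)])
        print(dist, outs.var(), excess_kurtosis(outs))
```

With a small input dimension the output variances need not agree across the three distributions, since the first-layer pre-activations are themselves sums of only a few terms; the sketch only probes whether the output over random initializations looks Gaussian.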
[1] Deep Neural Networks as Gaussian Processes. https://openreview.net/forum?id=B1EA-M-0Z
[2] Gaussian Process Behaviour in Wide Deep Neural Networks. https://openreview.net/forum?id=H1-nGgWC-
[3] O'Donnell, Ryan. Analysis of Boolean Functions.