Wide Neural Networks of Any Architecture Are Gaussian Processes 144 points by gyang 13 days ago | hide | past | web | favorite | 34 comments

 Without reading the paper I bet it comes down to the central limit theorem, which itself comes down to the fact that 3rd order and higher terms don't matter asymptotically. The question is always when do you start appreciably approaching the asymptote (answer: we often have no idea).edit: re higher order terms i'm talking about the proof of the classic clt https://en.wikipedia.org/wiki/Central_limit_theorem#Proof_of...theres a taylor series expansion of the characteristic of the centered rv that's truncated to second order.
 You're right about the central limit theorem appearing, but series expansions didn't appear; instead it is the fact that the weights are initialized to random values that seems to carry the day.I couldn't find any mention about a trained NN, this is strictly about the initial state. Yang does reference a few papers that supposedly leverage the GP correspondence to gain some insight about how to better initialize a NN, for example this: https://arxiv.org/abs/1803.01719
 Yes. I will have things to say about training, but that requires building up some theoretical foundations. This paper is the first step in laying it out. Stay tuned! :)
 So do you already have any intuition of what training does to the initial GP? Obviously the training adjust the various weights in complicated ways, which to me feels like it should correspond to some sort of marginalisation on the GP, but I'm not really aware if that's a thing people do (undoubtedly someone has tried it though).
 @fgabriel mentioned this below: if the network is parametrized in a certain way, then the GP evolves according to a linear equation (if trained with square loss). In this linear equation, a different kernel shows up, known as the Neural Tangent Kernel. An intuitive way to think about this is to Taylor expand the parameters-to-function map around the initial set of parameters: f = f_0 + J d\theta, where J is the Jacobian of the neural network function against the parameters. Following this logic, the change in parameters affects the neural network function roughly linearly, as long as the parameters don't venture too far away from their original values. The Neural Tangent Kernel is then given by JJ^T.In addition to the paper mentioned by @fgabriel, this paper [1] explains it in more detail as well, and the equations you are looking for are 14, 15, and 16.
 Well, if there's a 1-1 map between GP and NNs, shouldn't we be able to determine the effect of a gradient descent step on a GP by combining the two maps?I have only barely glanced at the paper mind-you, so I couldn't say the details but still.
 For trained wide neural networks, you can have a look at "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (https://arxiv.org/abs/1806.07572) where we explain the training of very wide ANN. This sparked a numerous amount of research and in the few last results about training wide ANN you can have a look at: https://arxiv.org/abs/1909.08156 and https://arxiv.org/abs/1904.11955
 Hi, author here. Thanks for your interest!The CLT would be a good guess at approaching this problem, and indeed it is the approach of prior works [1][2]. But in this paper, the key answer is actually law of large numbers, though CLT would feature more prominently if we allow weights to be sampled from a non-Gaussian distribution.The TLDR proof goes like this: via some recursive application of law of large numbers, we show that the kernel (i.e. Gram matrix) of the final layer embeddings of a set of inputs will converge to a deterministic kernel. Then because the last layer weights are Gaussian, the convergence of the kernel implies convergence of the output distribution to a Gaussian.99% of the proof is on how to recursively apply the law of large numbers. This uses a technique called Gaussian conditioning, which, as its name suggests, is only applicable because the distribution of weights is Gaussian.> The question is always when do you start appreciably approaching the asymptote (answer: we often have no idea)Check out the github repo [3] attached to this paper and look at plot (E) in the README. It shows the empirical rate of convergence for different architectures, but in general the Frobenius norm of the deviation from limit decays like 1/sqrt(width).[1] Deep Neural Networks as Gaussian Processes. https://openreview.net/forum?id=B1EA-M-0Z[2] Gaussian Process Behaviour in Wide Deep Neural Networks. https://openreview.net/forum?id=H1-nGgWC-
 A good paper mentioned in the sources is "Neural Tangent Kernel: Convergence and Generalization in Neural Networks".> At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit, thus connecting them to kernel methods.I'm not familiar with ignoring higher order terms, besides approximations and bounds like Chebyshev's inequality and Chernoffs.
 If you take the sum of a large, but finite number of random variates from a probability distribution you can show that the resulting distribution is approximately a Gaussian perturbed by Hermite polynomials. The size of these perturbations decays inversely with the order of the polynomial divided by two minus one. This is the so-called Edgeworth expansion. I found this chapter to have a good explanation of the Edgeworth expansion if you're interested in more detail: http://web.math.ku.dk/~erhansen/bootstrap_05/doku/noter/Edge...I had a little workshop paper earlier this year showing that you can apply the Edgeworth expansion to wide, but finite neural networks: https://arxiv.org/abs/1908.10030
 This looks interesting (have only glanced at it so far). However, the abstract and introduction are a bit misleading regarding what I did in my 1994 thesis (http://www.cs.utoronto.ca/~radford/thesis.abstract.html). My results about convergence of neural networks to Gaussian processes were not confined to shallow networks. Also, I discussed how to get network priors to converge to non-Gaussian stable processes, by using priors for weights that have infinite variance. Such non-Gaussian priors may well be more interesting than Gaussian processes, for problems where a "really sophisticated smoother" is not going to be adequate.
 Hi Radford, I'm very happy that this paper got your attention, but I'm regretful that I did not represent your research accurately!Indeed, in your thesis, section 2.3 is on "Priors for networks with more than one hidden layer." I will fix this in the next version of the paper, and as a prerequisite, I'd like to make sure I understand your contributions correctly.Is the following summary accurate?In section 2.3, you explored the GP limit for more than 1 hidden layer numerically, as well as some thoughts on the decay behavior of the GP kernel associated to an MLP with step function nonlinearity. You also considered mixing Gaussian and non-Gaussian priors in different hidden layers and finally mused about the infinite depth limit of the infinitely-wide MLP.However, I could not find a rigorous treatment of the multi-layer GP limit (in the vein of Lee et al. and Matthews et al. (2019)). Does it exist elsewhere in the paper?
 Looking now at my thesis, I agree that I don't explicitly argue theoretically for why multilayer networks will (under suitable conditions) converge to Gaussian processes. However, it follows (at some level of rigour) pretty directly from the fact (which I do note) that if a single-hidden layer has multiple outputs, the functions computed by these outputs will be independent (in the prior) as the number of hidden units goes to infinity. So if you add another hidden layer, the functions computed by the units in this layer will be independent (they're like outputs of the previous layer), and the argument for why the outputs from this layer form a GP goes through as before. I'm not sure why I didn't explicitly note this. It's implicitly assumed in my discussion of how the covariance function for networks with step function hidden units changes as you add more layers.
 Right, so your argument would work if you allow the layer widths to tend to infinity sequentially (so this corresponds to finite networks where each previous layer is much bigger than the next layer). This is the argument presented by Lee et al. But note it's nontrivial to argue that this limit holds when the widths of all layers tend to infinity at the same time (arguably the more natural limit), which is one of the main contributions of Matthews et al. In my paper here, I also consider this limit where the widths tend to infinity at the same time.In any case, I'll update the paper to reflect our discussion here. Thanks, Radford!
 This sounds important and interesting but isn't wide the key word here?They talk about shallow NNs and deep fully connected NNs but that would seem to leave out a lot.I mean, the article puts forward a distinct language/model to expression neural nets in, which is cool but are they talking about all or most of the NNs you see today? If so, huge but still.Fo-get-a-bout-it, see SiempreViernes' comment: "I couldn't find any mention about a trained NN, this is strictly about the initial state. "(emphasis added)
 Hi, the author here. Thanks for your interest! Let me try answering some of your questions.> This sounds important and interesting but isn't wide the key word here?Yes, width is very important for this result. Given the size of modern deep neural networks, I (and most people in the deep learning theory community, by now) believe the large width regime is the appropriate regime to study neural networks.> I mean, the article puts forward a distinct language/model to expression neural nets in, which is cool but are they talking about all or most of the NNs you see today?Try throw me an architecture and watch if I can't throw you back a GP :)> Fo-get-a-bout-it, see SiempreViernes' comment: "I couldn't find any mention about a trained NN, this is strictly about the initial state. "(emphasis added)Yes. I will have things to say about training, but that requires building up some theory. This paper is the first step in laying it out. Stay tuned! :)
 > Try throw me an architecture and watch if I can't throw you back a GP :)On the pragmatic side, would that GP train faster than the NN? In my little experimentation with GPs, I found them awfully slow. However, maybe what I tried (it was black box for me) used some brute force approach, and there are other more fine-tuned algorithms. Since you are an expert in the area, what's your take?
 I think in general, "training" a GP, i.e. doing GP inference (or kernel regression) is not done for speed reasons, but rather because they are sample efficient. More concretely, the practical folk wisdom regarding GPs is that when there are not much data, then GP inference with a well-chosen kernel can give you much more bang for the buck than a neural network. However, when there are a lot of data (especially in the perceptual domains like vision and language), neural networks typically train faster and generalize better.I wouldn't say I'm an expert at using GPs, so actual GP practitioners feel free to correct me if I'm wrong :)
 Looking forward to your further results
 Good results often seem "obvious" in ex post. So take this as a compliment.Intuitively (I have never read a paper in this field), since you are talking about wide networks, I also expected that a CLT would be used. For "dense" layers it is pretty obvious that one should be able to characterize each layer aggregation based on a CLT, and so forth. Some sort of mild independence assumption, mixing or martingales, on the sampling should be sufficient. I think therefore the goal of a paper for any architecture would be to figure out a way to generalize this for different layers.However, one thing I notice is that you seem to assume that weights (or whatever is initialized) are initialized with a Gaussian distribution?That seems a bit restricted. The appeal of this approach with wide networks, I think, is that any independent initialization of weights would lead to transformations of GP.Perhaps I am misunderstanding also the implications. Could you generally trace a dependence onto the distribution of the last layer weights, even if they are not normal? Or do you need the GP for your conditioning?On the one hand, Gaussian initialization is I think not really encompassing "all architectures" as practically used, but more specifically, it seems that the end goal of this research program would be to generalize beyond this (much like in regression, one uses CLTs exactly to get away from parametric assumptions). Or is that where you plan to go?
 Reading a bit more, it's really interesting that your result relies on Lemma G.4, which is a CLT based on independence or mixing (as you wish), whereas all theorems assume that anything put in is Gaussian.It "smells" like that is, or should not be necessary.The elegance of this approach, and GP in general, is that you use scale and independence and get to a specific distribution. Therefore, assuming that inputs are Gaussian seems restrictive. In some appropriate sense, it should not be required.But again, I am probably misreading something.What I would be looking for as a referee is: "Based on ANY random initialization (mild independence condition), it holds that wide networks become Gaussian"
 Hi @zwaps, thanks so much for your interest! My answer to @throwlaplace seems relevant to your question, so let me copy it here and comment on your specific questions afterward.> The CLT would be a good guess at approaching this problem, and indeed it is the approach of prior works [1][2]. But in this paper, the key answer is actually law of large numbers, though CLT would feature more prominently if we allow weights to be sampled from a non-Gaussian distribution.> The TLDR proof goes like this: via some recursive application of law of large numbers, we show that the kernel (i.e. Gram matrix) of the final layer embeddings of a set of inputs will converge to a deterministic kernel. Then because the last layer weights are Gaussian, the convergence of the kernel implies convergence of the output distribution to a Gaussian.> 99% of the proof is on how to recursively apply the law of large numbers. This uses a technique called Gaussian conditioning, which, as its name suggests, is only applicable because the distribution of weights is Gaussian.So, you are right that Prop G.4 can be easily generalized to non-Gaussian cases. However, this prop is only 1% of the entire proof as explained above; the 99% is on inductively handling weight matrices that are possibly reused over and over again (like in an RNN), and a priori it's not clear we can say nice things about their behavior (and this is also where previous arguments relying CLT break down).As mentioned above, the meat of the argument is the Gaussian conditioning technique, which roughly says the following: A Gaussian random matrix A, conditioned on a set of equations of the form y = A x or y = A^T x with deterministic x's and y's, is distributed as E + Pi_1 A' Pi_2, where E is some deterministic matrix, Pi_1 and Pi_2 are some orthogonal projection matrices, and A' is an iid copy of A. See Lemma G.7. This lemma allows us to inductively reason about a weight-tied neural network by conditioning on all the computation done before a particular step in the induction. However, this technique is not available if the weights are not sampled from a Gaussian.Now, you are right this result should apply to any reasonable non-Gaussian initialization as well, as seen from experiments. There are standard techniques for swapping out Gaussian variables with other "reasonable" variables (see the section on "Invariance Principle" in [3]), so it becomes roughly an exercise in probability theory. I think most folks who have seen such "universality" results would guess that Gaussians can be swapped with uniform variables, etc. From a machine learning perspective, perhaps this is not as interesting as showing the universality in architecture, especially as new architectures are invented like a flood, and old theoretical results become irrelevant quite quickly. More importantly, the tensor program framework gives an automatic way of converting an architecture to a GP, and I believe this is a tool many folks in the ML community will find useful.[1] Deep Neural Networks as Gaussian Processes. https://openreview.net/forum?id=B1EA-M-0Z[2] Gaussian Process Behaviour in Wide Deep Neural Networks. https://openreview.net/forum?id=H1-nGgWC-[3] O'Donnell, Ryan. Analysis of boolean functions.
 OK, let's say NNs are GPs. What can we do with this information?
 The paper says more than just that they are GPs, I believe it shows that the NTK (Neural Tangent Kernel https://arxiv.org/abs/1806.07572) of a large class of ANN converges at initialization. This NTK allows one to gain insight on the training of NN : speed of convergence, artifacts which can appear (checkerboard patterns can be explained with NTK), generalisation of the NN...Would be nice to get the same result during training for all these architectures and I believe it will be the next paper of G. Yang and I eager to read it.
 You can use it to estimate model uncertainty, Yarin Gal has some nice writeups on this: https://www.cs.ox.ac.uk/people/yarin.gal/website/blog_3d801a... (in this case using dropout networks as GP approximations).
 How would we use a property of networks with random weights to estimate uncertainty of trained models in which which the weights are (as much as we can) trained to be not random?
 I'd guess it lets us map information processes to physical processes. Information processes follow a graph which follow a power law. Physical processes of course follow GP.Depending on how complete the map is it may let you know us come up with 'physical' laws of information. I am rooting for something which I call Boltzmann convergence.
 This means there is a way to convert any or perhaps just Gaussian markovian model to an ANN and vice versa.This is interesting because markovian processes are much easier to intuit about.
 How?
 Probably just use a normal distribution instead of a NN.
 Gaussian Process != simple normal distribution
 So, if I understand this correctly...Gaussian Processes are a way of trying every possible function to fit a set of data points, being constrained a bit more with each new data point.All neural networks in the list, given sufficient size, are essentially Gaussian in their behavior, and thus share the same features and limitations.Right?
 I don't quite follow what is a Gaussian process. What do you use, node weights, node inputs/outputs, or...?
 This is beautiful.

Search: