Hacker News

Without reading the paper I bet it comes down to the central limit theorem, which itself comes down to the fact that 3rd order and higher terms don't matter asymptotically. The question is always when do you start appreciably approaching the asymptote (answer: we often have no idea).

edit: re higher-order terms, I'm talking about the proof of the classic CLT https://en.wikipedia.org/wiki/Central_limit_theorem#Proof_of...

There's a Taylor series expansion of the characteristic function of the centered RV that's truncated to second order.
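A quick numerical sketch of that truncation (my own illustration, not from the paper): for a zero-mean, unit-variance uniform variable, the characteristic function of the standardized sum phi(t/sqrt(n))^n converges to the Gaussian limit e^(-t^2/2), and the error shrinks as n grows because the higher-order terms in the expansion are suppressed.

```python
import math

def phi(t):
    # characteristic function of Uniform(-sqrt(3), sqrt(3)),
    # which has mean 0 and variance 1: phi(t) = sin(sqrt(3) t) / (sqrt(3) t)
    u = math.sqrt(3) * t
    return math.sin(u) / u

def phi_sum(t, n):
    # characteristic function of the standardized sum of n i.i.d. copies
    return phi(t / math.sqrt(n)) ** n

t = 1.0
gauss = math.exp(-t * t / 2)          # the CLT limit e^{-t^2/2}
err_100 = abs(phi_sum(t, 100) - gauss)
err_400 = abs(phi_sum(t, 400) - gauss)
```

With n = 100 the error at t = 1 is already well below 0.01, and quadrupling n shrinks it further, consistent with the third- and higher-order terms dying off.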

You're right about the central limit theorem appearing, but series expansions don't appear; instead, it's the fact that the weights are initialized to random values that seems to carry the day.

I couldn't find any mention of a trained NN; this is strictly about the initial state. Yang does reference a few papers that supposedly leverage the GP correspondence to gain some insight into how to better initialize a NN, for example this: https://arxiv.org/abs/1803.01719

Yes. I will have things to say about training, but that requires building up some theoretical foundations. This paper is the first step in laying it out. Stay tuned! :)

So do you already have any intuition about what training does to the initial GP? Obviously training adjusts the various weights in complicated ways, which to me feels like it should correspond to some sort of marginalisation on the GP, but I'm not really aware if that's a thing people do (undoubtedly someone has tried it, though).

@fgabriel mentioned this below: if the network is parametrized in a certain way, then the GP evolves according to a linear equation (if trained with square loss). In this linear equation, a different kernel shows up, known as the Neural Tangent Kernel. An intuitive way to think about this is to Taylor expand the parameters-to-function map around the initial set of parameters: f = f_0 + J d\theta, where J is the Jacobian of the neural network function against the parameters. Following this logic, the change in parameters affects the neural network function roughly linearly, as long as the parameters don't venture too far away from their original values. The Neural Tangent Kernel is then given by JJ^T.
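The linearization above can be checked numerically. Here's a minimal sketch (my own toy setup, not code from the paper): a one-hidden-layer ReLU network with scalar output, where the Jacobian row J(x) is estimated by finite differences and the NTK entries are the inner products <J(x), J(x')>, i.e. entries of JJ^T.

```python
import numpy as np

D, WIDTH = 3, 8  # toy sizes, chosen arbitrarily for illustration

def net(params, x):
    # tiny one-hidden-layer ReLU network with scalar output
    W = params[:WIDTH * D].reshape(WIDTH, D)
    v = params[WIDTH * D:]
    return v @ np.maximum(W @ x, 0.0) / np.sqrt(WIDTH)

def grad(params, x, eps=1e-6):
    # finite-difference gradient of the output w.r.t. all parameters
    # (one row of the Jacobian J, since the output is scalar)
    g = np.zeros_like(params)
    for i in range(params.size):
        p = params.copy(); p[i] += eps
        m = params.copy(); m[i] -= eps
        g[i] = (net(p, x) - net(m, x)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
params = rng.standard_normal(WIDTH * D + WIDTH)
x1, x2 = rng.standard_normal(D), rng.standard_normal(D)

# empirical NTK entries: Theta(x, x') = <grad f(x), grad f(x')> = (J J^T)_{x x'}
j1, j2 = grad(params, x1), grad(params, x2)
ntk_11, ntk_12, ntk_22 = j1 @ j1, j1 @ j2, j2 @ j2
```

Since each entry is a literal inner product of gradient vectors, the resulting kernel matrix is automatically symmetric and positive semidefinite; at large width it concentrates around the deterministic NTK.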

In addition to the paper mentioned by @fgabriel, this paper [1] explains it in more detail as well, and the equations you are looking for are 14, 15, and 16.

[1] https://arxiv.org/abs/1902.06720

Well, if there's a 1-1 map between GP and NNs, shouldn't we be able to determine the effect of a gradient descent step on a GP by combining the two maps?

I have only barely glanced at the paper, mind you, so I couldn't say the details, but still.

For trained wide neural networks, you can have a look at "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (https://arxiv.org/abs/1806.07572), where we explain the training of very wide ANNs. This sparked a large amount of research; for the latest results on training wide ANNs, you can have a look at: https://arxiv.org/abs/1909.08156 and https://arxiv.org/abs/1904.11955

Hi, author here. Thanks for your interest!

The CLT would be a good guess at approaching this problem, and indeed it is the approach of prior works [1][2]. But in this paper, the key answer is actually the law of large numbers, though the CLT would feature more prominently if we allowed the weights to be sampled from a non-Gaussian distribution.

The TLDR proof goes like this: via some recursive application of law of large numbers, we show that the kernel (i.e. Gram matrix) of the final layer embeddings of a set of inputs will converge to a deterministic kernel. Then because the last layer weights are Gaussian, the convergence of the kernel implies convergence of the output distribution to a Gaussian.
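A minimal numerical sketch of the kernel-convergence step (my own illustration using the known single-ReLU-layer limit, not the paper's Gaussian-conditioning argument): for one ReLU layer with i.i.d. standard Gaussian weights, the empirical Gram entries concentrate around the deterministic arc-cosine kernel of Cho & Saul as the width grows.

```python
import numpy as np

rng = np.random.default_rng(0)
width = 50_000
x = np.array([3.0, 0.0, 0.0])   # ||x||^2 = 9
y = np.array([0.0, 3.0, 0.0])   # orthogonal to x

# one layer of ReLU embeddings with i.i.d. standard Gaussian weights
W = rng.standard_normal((width, 3))
fx = np.maximum(W @ x, 0.0)
fy = np.maximum(W @ y, 0.0)

# empirical Gram entries: (1/n) * sum_i relu(w_i . x) relu(w_i . y)
k_xx = fx @ fx / width
k_xy = fx @ fy / width

# law-of-large-numbers limits (arc-cosine kernel):
#   K(x, x) = ||x||^2 / 2 = 4.5
#   K(x, y) = ||x|| ||y|| / (2 pi) for orthogonal x, y
```

At this width both entries land within a few percent of their deterministic limits; the paper's contribution is showing this kind of convergence recursively through arbitrary modern architectures, not just one fully connected layer.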

99% of the proof is on how to recursively apply the law of large numbers. This uses a technique called Gaussian conditioning, which, as its name suggests, is only applicable because the distribution of weights is Gaussian.

> The question is always when do you start appreciably approaching the asymptote (answer: we often have no idea)

Check out the github repo [3] attached to this paper and look at plot (E) in the README. It shows the empirical rate of convergence for different architectures; in general, the Frobenius norm of the deviation from the limit decays like 1/sqrt(width).
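That rate is easy to see in a toy setting. A sketch (my own, using a single ReLU layer rather than the repo's architectures): measure the average deviation of the empirical kernel from its limit at two widths; multiplying the width by 100 should shrink the deviation by roughly a factor of 10.

```python
import numpy as np

def kernel_dev(width, trials, rng):
    # average |empirical kernel entry - limit| over several random draws
    x = np.array([3.0, 0.0, 0.0])   # ||x||^2 = 9, so the limit is 4.5
    devs = []
    for _ in range(trials):
        W = rng.standard_normal((width, 3))
        fx = np.maximum(W @ x, 0.0)
        devs.append(abs(fx @ fx / width - 4.5))
    return float(np.mean(devs))

rng = np.random.default_rng(1)
d_small = kernel_dev(100, 20, rng)
d_large = kernel_dev(10_000, 20, rng)   # 100x wider -> ~10x smaller deviation
```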

[1] Deep Neural Networks as Gaussian Processes. https://openreview.net/forum?id=B1EA-M-0Z

[2] Gaussian Process Behaviour in Wide Deep Neural Networks. https://openreview.net/forum?id=H1-nGgWC-

[3] https://github.com/thegregyang/GP4A

A good paper mentioned in the sources is "Neural Tangent Kernel: Convergence and Generalization in Neural Networks".

> At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit, thus connecting them to kernel methods.

I'm not familiar with ignoring higher-order terms, besides approximations and bounds like Chebyshev's inequality and Chernoff bounds.

If you take the sum of a large but finite number of random variates from a probability distribution, you can show that the resulting distribution is approximately a Gaussian perturbed by Hermite polynomials. The size of the perturbation attached to the order-k Hermite polynomial decays like n^-(k/2 - 1). This is the so-called Edgeworth expansion. I found this chapter to have a good explanation of the Edgeworth expansion if you're interested in more detail: http://web.math.ku.dk/~erhansen/bootstrap_05/doku/noter/Edge...
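A concrete check of the first correction (my own example, not from the linked chapter): for a sum of n Exp(1) variables, the standardized sum has an exact Gamma density, and adding the leading Edgeworth term, phi(s) * (gamma1/6) * He3(s) with skewness gamma1 = 2/sqrt(n), noticeably beats the plain Gaussian approximation.

```python
import math

def exact_density(s, n):
    # exact density of S = (X_n - n)/sqrt(n), where X_n ~ Gamma(n, 1)
    # is the sum of n i.i.d. Exp(1) variables
    x = n + s * math.sqrt(n)
    return math.sqrt(n) * x ** (n - 1) * math.exp(-x) / math.factorial(n - 1)

def gauss(s):
    # standard normal density
    return math.exp(-s * s / 2) / math.sqrt(2 * math.pi)

def edgeworth(s, n):
    # first Edgeworth correction: phi(s) * (1 + (gamma1 / 6) * He3(s)),
    # with He3(s) = s^3 - 3s and skewness gamma1 = 2 / sqrt(n)
    he3 = s ** 3 - 3 * s
    return gauss(s) * (1 + (2 / math.sqrt(n)) / 6 * he3)

n, s = 10, 1.0
err_gauss = abs(gauss(s) - exact_density(s, n))
err_edge = abs(edgeworth(s, n) - exact_density(s, n))
```

At n = 10 and s = 1 the corrected approximation cuts the error by roughly a factor of five relative to the plain Gaussian.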

I had a little workshop paper earlier this year showing that you can apply the Edgeworth expansion to wide, but finite neural networks: https://arxiv.org/abs/1908.10030
