Hacker News new | past | comments | ask | show | jobs | submit login

Hi, author here. Thanks for your interest!

The CLT would be a good guess at approaching this problem, and indeed it is the approach of prior works [1][2]. But in this paper, the key answer is actually law of large numbers, though CLT would feature more prominently if we allow weights to be sampled from a non-Gaussian distribution.

The TLDR proof goes like this: via some recursive application of law of large numbers, we show that the kernel (i.e. Gram matrix) of the final layer embeddings of a set of inputs will converge to a deterministic kernel. Then because the last layer weights are Gaussian, the convergence of the kernel implies convergence of the output distribution to a Gaussian.

99% of the proof is on how to recursively apply the law of large numbers. This uses a technique called Gaussian conditioning, which, as its name suggests, is only applicable because the distribution of weights is Gaussian.

> The question is always when do you start appreciably approaching the asymptote (answer: we often have no idea)

Check out the github repo [3] attached to this paper and look at plot (E) in the README. It shows the empirical rate of convergence for different architectures, but in general the Frobenius norm of the deviation from limit decays like 1/sqrt(width).

[1] Deep Neural Networks as Gaussian Processes. https://openreview.net/forum?id=B1EA-M-0Z

[2] Gaussian Process Behaviour in Wide Deep Neural Networks. https://openreview.net/forum?id=H1-nGgWC-

[3] https://github.com/thegregyang/GP4A

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact