edit: re higher order terms, I'm talking about the proof of the classic CLT https://en.wikipedia.org/wiki/Central_limit_theorem#Proof_of...
There's a Taylor series expansion of the characteristic function of the centered RV that's truncated at second order.
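A quick numerical sketch of that truncation step (my own toy example, using Uniform(-1, 1), whose characteristic function is sin(t)/t and whose variance is 1/3): truncating the Taylor series at second order gives phi(t) ~ 1 - sigma^2 t^2 / 2, and raising the rescaled characteristic function to the n-th power converges to the Gaussian one.

```python
import numpy as np

# Characteristic function of Uniform(-1, 1): phi(t) = sin(t)/t (variance sigma^2 = 1/3).
def phi(t):
    return np.sinc(t / np.pi)  # np.sinc(x) = sin(pi*x)/(pi*x)

sigma = np.sqrt(1.0 / 3.0)
t = 2.0
gaussian = np.exp(-t**2 / 2)  # limiting characteristic function exp(-t^2/2)

# Second-order truncation phi(t) ~ 1 - sigma^2 t^2 / 2 drives the CLT:
# phi(t / (sigma sqrt(n)))^n -> exp(-t^2/2) as n grows.
for n in [10, 100, 10_000]:
    val = phi(t / (sigma * np.sqrt(n))) ** n
    print(n, val, gaussian)
```

The printed values approach exp(-t^2/2), and the gap shrinks as n grows, which is exactly the step the Wikipedia proof makes rigorous.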
I couldn't find any mention of a trained NN; this is strictly about the initial state. Yang does reference a few papers that supposedly leverage the GP correspondence to gain insight into how to better initialize a NN, for example this: https://arxiv.org/abs/1803.01719
In addition to the paper mentioned by @fgabriel, this paper explains it in more detail; the equations you are looking for are 14, 15, and 16.
Mind you, I have only barely glanced at the paper, so I can't speak to the details.
The CLT would be a good guess at approaching this problem, and indeed it is the approach of prior works. But in this paper the key answer is actually the law of large numbers, though the CLT would feature more prominently if we allowed weights to be sampled from a non-Gaussian distribution.
The TL;DR of the proof goes like this: via a recursive application of the law of large numbers, we show that the kernel (i.e. Gram matrix) of the final-layer embeddings of a set of inputs converges to a deterministic kernel. Then, because the last-layer weights are Gaussian, convergence of the kernel implies convergence of the output distribution to a Gaussian.
99% of the proof is about how to recursively apply the law of large numbers. This uses a technique called Gaussian conditioning, which, as the name suggests, is only applicable because the distribution of the weights is Gaussian.
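To make the LLN step concrete, here's a minimal sketch for a single ReLU layer (my own toy setup, not from the paper): with N(0,1) weights and two unit-norm inputs, the empirical kernel entry (1/width) * sum_i relu(w_i.x) relu(w_i.y) converges by the LLN to a deterministic limit, which in this case has a known closed form (Cho & Saul's arc-cosine kernel).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two unit-norm inputs (made up for illustration).
x = np.array([1.0, 0.0])
y = np.array([0.6, 0.8])

# Deterministic limit for ReLU features with N(0,1) weights (arc-cosine kernel):
# E[relu(w.x) relu(w.y)] = (sin(theta) + (pi - theta) cos(theta)) / (2 pi)
theta = np.arccos(np.clip(x @ y, -1.0, 1.0))
K_limit = (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

# LLN: the empirical kernel entry converges to K_limit as width grows.
for width in [100, 10_000, 1_000_000]:
    W = rng.standard_normal((width, 2))
    hx, hy = np.maximum(W @ x, 0.0), np.maximum(W @ y, 0.0)
    K_emp = hx @ hy / width
    print(width, K_emp, abs(K_emp - K_limit))
```

The paper's contribution is showing how to iterate this kind of argument through depth, where the layers are no longer independent; that's where the Gaussian conditioning comes in.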
> The question is always when do you start appreciably approaching the asymptote (answer: we often have no idea)
Check out the GitHub repo attached to this paper and look at plot (E) in the README. It shows the empirical rate of convergence for different architectures; in general, the Frobenius norm of the deviation from the limit decays like 1/sqrt(width).
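You can see that scaling in a toy experiment (my own illustration, not the repo's code): for a random ReLU layer on two unit-norm inputs, average the Frobenius deviation of the empirical kernel from its arc-cosine limit over many draws, and a 16x wider net comes out roughly 4x closer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two unit-norm inputs (made up for illustration); the limit is the ReLU
# arc-cosine kernel applied entrywise to the input Gram matrix.
X = np.array([[1.0, 0.0], [0.6, 0.8]])
theta = np.arccos(np.clip(X @ X.T, -1.0, 1.0))
K_limit = (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

def mean_dev(width, trials=200):
    """Average Frobenius-norm deviation of the empirical kernel from the limit."""
    devs = []
    for _ in range(trials):
        W = rng.standard_normal((width, 2))
        H = np.maximum(X @ W.T, 0.0)  # ReLU features, shape (2, width)
        devs.append(np.linalg.norm(H @ H.T / width - K_limit))
    return float(np.mean(devs))

# If the deviation decays like 1/sqrt(width), 16x the width gives ~4x the accuracy.
d_narrow, d_wide = mean_dev(256), mean_dev(4096)
print(d_narrow / d_wide)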
 Deep Neural Networks as Gaussian Processes. https://openreview.net/forum?id=B1EA-M-0Z
 Gaussian Process Behaviour in Wide Deep Neural Networks. https://openreview.net/forum?id=H1-nGgWC-
> At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit, thus connecting them to kernel methods.
I'm not familiar with ignoring higher-order terms, besides approximations and bounds like Chebyshev's inequality and Chernoff's.
I had a little workshop paper earlier this year showing that you can apply the Edgeworth expansion to wide but finite neural networks: https://arxiv.org/abs/1908.10030
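For a feel of what an Edgeworth correction buys over the plain CLT, here's a quick standalone sketch (my own illustrative setup, not from the paper): for standardized sums of n Exp(1) variables (skewness 2), the first-order Edgeworth term skew/(6 sqrt(n)) * (x^2 - 1) * phi(x) corrects most of the normal approximation's error at finite n.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 1_000_000

# Standardized sums of n iid Exp(1) variables (mean 1, variance 1, skewness 2).
s = (rng.exponential(size=(trials, n)).sum(axis=1) - n) / math.sqrt(n)

x = 0.0
emp = float(np.mean(s <= x))                        # Monte Carlo CDF at x
pdf = math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)  # standard normal pdf
cdf = 0.5 * (1 + math.erf(x / math.sqrt(2)))        # standard normal cdf (plain CLT)

# First-order Edgeworth expansion:
# F_n(x) ~ Phi(x) - skew / (6 sqrt(n)) * (x^2 - 1) * phi(x)
skew = 2.0
edgeworth = cdf - skew / (6 * math.sqrt(n)) * (x**2 - 1) * pdf

print(abs(cdf - emp), abs(edgeworth - emp))
```

The analogy in the paper is that width plays the role of n: the GP limit is the Gaussian term, and finite-width effects show up as Edgeworth-style corrections.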