
You're right that the central limit theorem appears, but series expansions don't; instead, it's the fact that the weights are initialized to random values that seems to carry the day.

I couldn't find any mention of a trained NN; this is strictly about the initial state. Yang does reference a few papers that supposedly leverage the GP correspondence to gain some insight into how to better initialize a NN, for example this one: https://arxiv.org/abs/1803.01719


Yes. I will have things to say about training, but that requires building up some theoretical foundations. This paper is the first step in laying them out. Stay tuned! :)


So do you already have any intuition for what training does to the initial GP? Obviously training adjusts the various weights in complicated ways, which to me feels like it should correspond to some sort of marginalisation of the GP, but I'm not really aware if that's a thing people do (undoubtedly someone has tried it, though).


@fgabriel mentioned this below: if the network is parametrized in a certain way, then the GP evolves according to a linear equation (if trained with squared loss). In this linear equation, a different kernel shows up, known as the Neural Tangent Kernel. An intuitive way to think about this is to Taylor expand the parameters-to-function map around the initial set of parameters: f \approx f_0 + J d\theta, where J is the Jacobian of the neural network function with respect to the parameters. Following this logic, the change in parameters affects the neural network function roughly linearly, as long as the parameters don't venture too far from their original values. The Neural Tangent Kernel is then given by J J^T.
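
To make the J J^T formula concrete, here is a minimal sketch of the empirical NTK at initialization (in JAX; the toy one-hidden-layer network, the layer widths, and the 1/sqrt(fan-in) scaling are illustrative choices of mine, not taken from the paper):

    # Empirical NTK: Theta(x, x') = J(x) J(x')^T, with J the Jacobian of the
    # network output with respect to the (flattened) parameter vector.
    import jax
    import jax.numpy as jnp
    from jax.flatten_util import ravel_pytree

    def init_params(key, d_in=3, d_hidden=64):
        k1, k2 = jax.random.split(key)
        return {"W1": jax.random.normal(k1, (d_in, d_hidden)) / jnp.sqrt(d_in),
                "W2": jax.random.normal(k2, (d_hidden, 1)) / jnp.sqrt(d_hidden)}

    def f(params, x):
        # one hidden layer, scalar output per input
        return (jnp.tanh(x @ params["W1"]) @ params["W2"]).squeeze(-1)

    def empirical_ntk(params, x1, x2):
        flat, unravel = ravel_pytree(params)         # flatten params into one vector
        f_flat = lambda theta, x: f(unravel(theta), x)
        J1 = jax.jacobian(f_flat)(flat, x1)          # shape (n1, n_params)
        J2 = jax.jacobian(f_flat)(flat, x2)          # shape (n2, n_params)
        return J1 @ J2.T                             # shape (n1, n2) kernel matrix

    key = jax.random.PRNGKey(0)
    params = init_params(key)
    x = jax.random.normal(key, (5, 3))
    print(empirical_ntk(params, x, x))               # 5x5 empirical NTK at initialization

As the hidden width grows, this random kernel matrix concentrates around a deterministic limit, which is what makes the linearized description useful.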

In addition to the paper mentioned by @fgabriel, this paper [1] explains it in more detail as well, and the equations you are looking for are 14, 15, and 16.

[1] https://arxiv.org/abs/1902.06720
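
For concreteness, here is a rough sketch of what the linearized dynamics look like under squared loss and full-batch gradient flow (my paraphrase, not a verbatim quote of those equations; \Theta is the NTK evaluated on the training inputs X, Y are the targets, \eta is the learning rate):

    \frac{d f_t(X)}{dt} = -\eta \, \Theta \, (f_t(X) - Y)
    f_t(X) = Y + e^{-\eta \Theta t} \, (f_0(X) - Y)

so on the training set the predictions relax exponentially toward the targets, at rates set by the eigenvalues of \Theta.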


Well, if there's a 1-to-1 map between GPs and NNs, shouldn't we be able to determine the effect of a gradient descent step on a GP by combining the two maps?

I have only barely glanced at the paper, mind you, so I can't speak to the details, but still.


For trained wide neural networks, you can have a look at "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (https://arxiv.org/abs/1806.07572), where we explain the training of very wide ANNs. This sparked a large amount of research; for some of the latest results on training wide ANNs, you can have a look at https://arxiv.org/abs/1909.08156 and https://arxiv.org/abs/1904.11955

