It's a high-dimensional correlation machine. In other words, it learns "how to recognize patterns" by learning to represent each pattern as orthogonally as possible to every other one. This happens at each layer, and how many layers you need depends on how "mixed up" the transformed space still is with respect to the labels after each linear transformation. Once the classes are suitably linearly separable at the end of the feedforward pass, you only need one more layer to map the representation to the output space.
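To make that concrete, here's a minimal sketch (my own illustration, not anything from the comment above) using scikit-learn's two-moons data: a linear classifier struggles on the raw inputs, but after a single tanh hidden layer the classes are nearly linearly separable, which is exactly what the network's final linear layer exploits. The dataset, layer width, and linear-probe comparison are all arbitrary choices.

```python
# Sketch: a hidden layer transforms the data until it is linearly
# separable; the final (linear) layer then just reads the class off
# that representation.  Assumes scikit-learn is installed.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)

# A linear classifier on the raw inputs can't separate the two moons well.
linear_on_raw = LogisticRegression().fit(X, y)
print("linear probe on raw inputs:   ", linear_on_raw.score(X, y))

# One hidden layer of tanh units, trained end to end.
mlp = MLPClassifier(hidden_layer_sizes=(16,), activation="tanh",
                    max_iter=2000, random_state=0).fit(X, y)

# Recompute the hidden representation by hand from the learned weights.
H = np.tanh(X @ mlp.coefs_[0] + mlp.intercepts_[0])

# The same linear probe on the hidden representation is nearly perfect:
# the transformation has made the classes linearly separable.
linear_on_hidden = LogisticRegression().fit(H, y)
print("linear probe on hidden layer: ", linear_on_hidden.score(H, y))
```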
Another way to think of it is that each layer learns a maximally efficient compression scheme for translating the data at the input layer into the data at the output layer. Each layer learns a high-dimensional representation that uses as few bits as possible while keeping as much of the information needed to reconstruct the output as possible. There was a great talk given recently by Naftali Tishby where he explains this in great detail.[1]
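As a rough illustration of the compression framing (this is just the measurement idea, not Tishby's actual experiments): discretize a hidden layer's activations into coarse bins and estimate how many bits the layer keeps about the input, H(T), versus how many it keeps about the label, I(T;Y). The bin count, layer size, and dataset below are my own assumptions.

```python
# Sketch of the "layers as compression" view: bin a layer's activations
# into discrete states and estimate the bits kept about the input, H(T),
# versus the bits kept about the label, I(T;Y) = H(Y) - H(Y|T).
# Bin count, layer width, and dataset are arbitrary illustration choices.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

def entropy(labels):
    # Shannon entropy (in bits) of a discrete sample.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

X, y = make_moons(n_samples=2000, noise=0.1, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    max_iter=2000, random_state=0).fit(X, y)

# Hidden representation, discretized into a few bins per unit.
H = np.tanh(X @ mlp.coefs_[0] + mlp.intercepts_[0])
bins = np.digitize(H, np.linspace(-1, 1, 4))
# Each sample's binned activation pattern becomes one discrete state T.
T = np.array([hash(row.tobytes()) for row in bins])

# T is a deterministic function of X, so I(X;T) = H(T): the bits the
# layer keeps about the input.  A "well compressed" layer has small H(T)
# but keeps I(T;Y) close to H(Y).
H_T = entropy(T)
H_Y = entropy(y)
H_Y_given_T = sum((T == t).mean() * entropy(y[T == t]) for t in np.unique(T))
print(f"H(T) = {H_T:.2f} bits   I(T;Y) = {H_Y - H_Y_given_T:.2f} bits")
```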
Knowing the math is great for understanding how it works at a granular level. I've found that also explaining it in such holistic terms serves a great purpose, fitting "linear algebra + calculus" into an understanding of NNs that is greater than the sum of their parts.
That's a cool video, I'm subscribing. It's surprising that the embedded sphere has an unbounded radius. He didn't mention in the video that this result is related to, and is a sort of dual or inverse of, the fact that the volume of an N-dimensional sphere goes to zero as N goes to infinity. That hurt my head a little the first time I learned about it!
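For anyone who wants to poke at the numbers, here's a tiny sketch of both facts side by side. The puzzle setup (unit spheres centred at (±1, …, ±1) in a box of side 4, giving an inner-sphere radius of √N − 1) is my assumption about which construction the video uses.

```python
# Inner-sphere radius grows without bound (sqrt(n) - 1), while the volume
# of the unit n-ball, pi^(n/2) / Gamma(n/2 + 1), collapses to zero.
# Computed in log space to avoid overflow at large n.
from math import sqrt, pi, log
from scipy.special import gammaln  # log of the Gamma function

for n in (2, 3, 10, 100, 1000):
    inner_radius = sqrt(n) - 1
    # log10 of V_n = pi^(n/2) / Gamma(n/2 + 1)
    log10_volume = ((n / 2) * log(pi) - gammaln(n / 2 + 1)) / log(10)
    print(f"n={n:5d}  inner-sphere radius = {inner_radius:7.2f}   "
          f"unit-ball volume = 10^{log10_volume:.1f}")
```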
He's the one who made me understand linear algebra. College? No fucking clue what a matrix was. Now? A clear, intuitive understanding. I'm still struggling with concepts like the trace, but he gave me a base from which I can climb on my own.
[1] https://www.youtube.com/watch?v=bLqJHjXihK8