Very interesting read and a rather "obvious" one. I can't believe I didn't see this before. Obviously... A perceptron layer is a bunch of dot products followed by a comparison. Every graphics programmer knows this is a check of which side of a plane you're on.
Of course, the ReLU unit also only passes information through when the result lands on one side of the plane, which is what makes the whole thing a spline.
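To make that concrete, here is a minimal sketch (my own illustration, not from the article; all names are made up): each neuron in a layer performs a half-space test, and ReLU only lets the value through on one side of its plane.

```python
import numpy as np

# Toy sketch: one layer = a batch of "which side of the plane am I on?" tests.
# Rows of W are plane normals, b holds the offsets (hypothetical example values).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))    # 4 neurons looking at a 3-dimensional input
b = rng.normal(size=4)
x = rng.normal(size=3)

pre = W @ x + b                # signed (scaled) distance to each neuron's plane
side = pre > 0                 # the graphics-style half-space check
out = np.maximum(pre, 0)       # ReLU: pass the value along only on one side

print(side)
print(out)
```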
As others have said... Can we learn the separating planes without backward gradient propagation? I don't know, but seeing it in this new way may help.
I would love to see a cross section of these two ideas....
I have been surprised that in the past few weeks I have seen several posts on HN that, while separate and unrelated, share related characteristics - and if you look at them for a sec, you can see how having an AI like GPT go over both studies/papers immediately reveals connections worth looking at further.
If only for the sake of a more informed tapestry of knowledge in a particular area...
> if we scale the activations in a particular layer in a non-linear network, some neurons in later layers may ‘activate’ or ‘deactivate’.
Normalization removes this problem. Magnitude information can still be encoded separately in log form, so differentiation can still happen when scale matters, but by default scaling doesn't have much impact (with small initial weights on the magnitude element).
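A quick toy demonstration of the quoted point (my own sketch, not from the article): because the next layer has a bias, scaling this layer's activations can flip which downstream ReLU units activate, whereas normalizing the activations makes the pattern scale-invariant.

```python
import numpy as np

# Hypothetical next-layer weights and bias for illustration only.
W2 = np.array([[1.0, -1.0],
               [0.5,  0.5]])
b2 = np.array([-3.0, 1.0])

def pattern(h):
    return (W2 @ h + b2) > 0   # which later-layer units would 'activate'

h = np.array([2.0, 1.0])       # some ReLU outputs from the previous layer

print(pattern(h))        # [False  True]  -> unit 0 off, unit 1 on
print(pattern(10 * h))   # [ True  True]  -> scaling switched unit 0 on

def normalize(h):
    return h / (np.linalg.norm(h) + 1e-8)   # unit-norm the layer output

print(pattern(normalize(h)))       # same pattern either way,
print(pattern(normalize(10 * h)))  # since the scale is divided out
```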
There are many interesting efforts, going back quite a few years, toward this goal, many of which are in the PAC setting (which automatically rules out the MLP, as far as theoretical guarantees go). E.g. [0] and its related references come to mind as an interesting place to look into it!