Does this essentially mean that any multi-layer RNN can be reasonably approximated by a 1-layer network (something like a perceptron) for the "playback" purposes, that is, for recognition / transformation, not learning?
This may have colossal practical implications, as long as the approximation stays good enough.
I would be skeptical about thinking of networks this way without empirically verifying it yourself.
The only useful trick I’ve found like that is that a stack of linear layers with no activation function is equivalent to a single linear layer, whose weight matrix is the product of the stacked ones. Sometimes it enables some clever optimizations on TPUs, since you want one of the dimensions to be a multiple of 128. (I haven’t actually used that trick, but it’s in my back pocket.)
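To make the linear-stack point concrete, here is a minimal sketch in plain Python. It assumes two bias-carrying linear layers (4 -> 8 -> 3) with nothing nonlinear between them. The layer sizes, the helper names (`matmul`, `matvec`, `stacked`, `collapsed`) and the random weights are all made up for illustration; nothing here comes from a real model or library.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(w, x):
    """Apply matrix w (list of rows) to vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [p + q for p, q in zip(u, v)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)

# Hypothetical two-layer linear stack with no activation: 4 -> 8 -> 3.
w1 = random_matrix(8, 4, rng)
b1 = [rng.uniform(-1, 1) for _ in range(8)]
w2 = random_matrix(3, 8, rng)
b2 = [rng.uniform(-1, 1) for _ in range(3)]

def stacked(x):
    # y = W2 (W1 x + b1) + b2
    return add(matvec(w2, add(matvec(w1, x), b1)), b2)

# Collapse the stack once, ahead of time: W = W2 W1, b = W2 b1 + b2.
w = matmul(w2, w1)            # shape 3 x 4
b = add(matvec(w2, b1), b2)   # length 3

def collapsed(x):
    # y = W x + b, a single linear layer
    return add(matvec(w, x), b)

x = [rng.uniform(-1, 1) for _ in range(4)]
max_diff = max(abs(p - q) for p, q in zip(stacked(x), collapsed(x)))
print(max_diff)  # differs from zero only by floating-point rounding
```

The collapse only works because everything between the two weight matrices is linear. Put a ReLU or tanh between the layers and there is, in general, no single W and b that reproduces the stack.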
But thinking of an entire model as a single layer seems strange. A single layer has to have some kind of meaning. To me, it means “a linear mapping followed by a nonlinear activation function.” So is the claim that there exists a sufficiently complicated activation function that approximates any given model? Because that sounds an awful lot like the activation function itself might be “the model”. Except that makes no sense, because activation functions don’t use model weights; the linear multiply before the activation does that.
So it quickly takes me in circles. I don’t have a good intuition for models yet though.