Does this essentially mean that any multi-layer RNN can be reasonably approximat...

talolard · on Dec 7, 2020

Hmmm, I think that's not precise and my use of "architecture" was misleading.

If we're thinking in terms of "universal approximators", an RNN is a way to make a sequence of approximate functions for a sequence of inputs.

But it's still a sequence of functions, not a single function.

For a one-layer network to have the same ability as an RNN (taking an unbounded amount of context), it would need infinite width, which is a no-go.
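To make the "sequence of functions" point concrete, here is a minimal sketch of a scalar vanilla RNN. The function name `rnn_outputs` and the weights `w_x` and `w_h` are hypothetical values picked for illustration, not taken from any real model. The same fixed weights are reused at every step, and the output at step t depends on every input up to t, however long the sequence gets.

```python
import math

def rnn_outputs(xs, w_x=0.5, w_h=0.9, h0=0.0):
    """Run a scalar vanilla RNN over xs and return one output per time step.

    w_x, w_h and h0 are illustrative assumptions, not trained values.
    """
    h = h0
    outputs = []
    for x in xs:
        # The same two weights are applied at every step; h carries the context.
        h = math.tanh(w_x * x + w_h * h)
        outputs.append(h)
    return outputs

# Two sequences that differ only in their first element.
a = rnn_outputs([1.0, 0.0, 0.0, 0.0])
b = rnn_outputs([0.0, 0.0, 0.0, 0.0])

# The outputs still differ at the last step, because the hidden state
# carries information about the first input forward.
print(a[-1], b[-1])
```

A fixed-width feedforward layer would have to take the whole sequence as one input vector, so its width has to grow with the amount of context it is meant to see.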

sillysaurusx · on Dec 7, 2020

I would be skeptical about thinking of networks this way without empirically verifying it yourself.

The only useful trick I’ve found like that is that a stack of linear layers with no activation function is equivalent to a single larger layer. Sometimes it enables some clever optimizations on TPUs, since you want one of the dimensions to be a multiple of 128. (I haven’t actually used that trick, but it’s in my back pocket.)
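The linear-stack equivalence is easy to check numerically. Below is a rough sketch in plain Python, assuming two affine layers with made-up shapes (3 → 4 → 2) and random weights. All the helper names (`matvec`, `matmul`, `linear`, `rand_matrix`, `rand_vector`) are hypothetical. It relies on the identity W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2).

```python
import random

def matvec(W, x):
    # Matrix-vector product; W is a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    # Matrix-matrix product for lists of rows.
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def linear(W, b, x):
    # One affine layer with no activation: W x + b.
    return [y + bi for y, bi in zip(matvec(W, x), b)]

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def rand_vector(n):
    return [random.uniform(-1, 1) for _ in range(n)]

random.seed(0)
W1, b1 = rand_matrix(4, 3), rand_vector(4)  # layer 1: 3 -> 4
W2, b2 = rand_matrix(2, 4), rand_vector(2)  # layer 2: 4 -> 2
x = rand_vector(3)

# Two stacked layers with no nonlinearity in between.
stacked = linear(W2, b2, linear(W1, b1, x))

# Collapse them into a single layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = linear(W2, b2, b1)
collapsed = linear(W, b, x)

max_diff = max(abs(s - c) for s, c in zip(stacked, collapsed))
print(max_diff)  # should differ only by floating-point rounding
```

The equivalence breaks as soon as a nonlinearity sits between the two layers, which is why it doesn't extend to whole models.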

But thinking of an entire model as a single layer seems strange. A single layer has to have some kind of meaning. To me, it means “a linear mapping followed by a nonlinear activation function.” So is the claim that there exists a sufficiently complicated activation function that approximates any given model? Because that sounds an awful lot like the activation function itself might be “the model”. Except that makes no sense, because activation functions don’t use model weights; the linear multiply before the activation does that.

So it quickly takes me in circles. I don’t have a good intuition for models yet though.

dkural · on Dec 7, 2020

Wouldn't this one-layer network be a lot less "compressive" than the multi-layer net, and in some sense "duplicate" subnetworks in earlier layers?