I am trying to identify model architecture candidates that could, like transformers, have "emergent" properties when they are scaled (see https://arxiv.org/abs/2206.07682).
Some contenders I already know about are:
* Monarch Mixer (https://arxiv.org/pdf/2111.00396.pdf)
* Hyena (https://hazyresearch.stanford.edu/blog/2023-03-07-hyena)
Thanks for your help.
Let's start with a fully connected n-layer deep neural network. Very inefficient, but it's the simplest, most generic architecture.
Now look at conv nets. Conv layers are essentially fully connected layers with a whole bunch of zeros in the weight matrix (and the same kernel weights repeated across positions).
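To make that concrete, here's a rough NumPy sketch (sizes and kernel values are made up) of a 1-D "valid" convolution written as an ordinary dense matrix-vector product, where the dense matrix is mostly zeros with the kernel repeated along each row:

```python
import numpy as np

x = np.random.randn(8)          # input signal, length 8
k = np.array([0.2, 0.5, 0.3])   # convolution kernel, length 3

# Equivalent dense matrix: row i holds the same kernel, shifted by i (weight sharing).
n_out = len(x) - len(k) + 1
W = np.zeros((n_out, len(x)))
for i in range(n_out):
    W[i, i:i + len(k)] = k

dense_out = W @ x                              # the "fully connected with zeros" view
conv_out = np.convolve(x, k[::-1], "valid")    # direct (cross-correlation) convolution
print(np.allclose(dense_out, conv_out))        # True: same computation
```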
Now look at transformers. Transformers do really well on a wide array of things, but not because of their particular architecture; it's because instead of computing weights*values, they compute weights(values)*values, i.e. the weights are a function of the values. However, if you look at the math, you can represent the entire transformer with just a fully connected multi-layer net: Q, K, V are all linear transformations, the Q*K matrix multiply can be represented as one layer, softmax is another layer, etc.
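Here's a rough single-head sketch (NumPy, no masking, made-up sizes) of that "weights are a function of the values" point: the mixing matrix A is computed from the input itself, unlike the fixed matrix of a fully connected layer.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d = 5, 16                         # sequence length, model width
X = rng.standard_normal((T, d))      # the input (the "values" in the loose sense above)

Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))  # plain linear maps

Q, K, V = X @ Wq, X @ Wk, X @ Wv
A = softmax(Q @ K.T / np.sqrt(d))    # attention weights: themselves a function of X
out = A @ V                          # weights(X) applied to a projection of X
print(A.shape, out.shape)            # (5, 5) (5, 16)
```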
In the end, since any architecture has to be differentiable for backprop to work, you can always find a fully connected n-layer representation for it.
So the question to really ask is: what is a generic way to represent architectures, and can a network adjust its own architecture?
For example, you can represent XOR with a very small network: one hidden layer of two sigmoid units feeding a single output unit (a single linear layer can't do it, since XOR isn't linearly separable). Or you could represent it with a much more complex neural net. Is there a way to gradient descent not just the weights, but the architecture itself, to move from the complex network down to the minimal one? (See the sketch below.)
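For reference, a minimal hand-weighted version of that small XOR net (one hidden layer of two sigmoid units plus a sigmoid output; the specific weights are just one choice that works):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xor_net(x1, x2):
    h1 = sigmoid(20 * (x1 + x2) - 10)        # roughly: x1 OR x2
    h2 = sigmoid(20 * (x1 + x2) - 30)        # roughly: x1 AND x2
    return sigmoid(20 * h1 - 20 * h2 - 10)   # roughly: OR AND (NOT AND)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(xor_net(a, b)))        # prints 0, 1, 1, 0
```

Any bigger network that learns XOR is, in effect, computing something this one already computes; the question is whether gradient descent can be made to collapse the bigger one into this one.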
To do this you have to generalize the architecture in such a way that one set of parameters ends up as a transformer, another set ends up as a conv net, and so on. Then, once you have this, and it's differentiable, you will start to see a whole shitload of emergent behaviour.
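One existing stab at this is a DARTS-style continuous relaxation: every "edge" in the network is a softmax-weighted mixture of candidate ops, and the mixture coefficients (architecture parameters) receive gradients just like the weights do. A rough PyTorch sketch, with made-up ops and sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One 'edge' whose op is a differentiable mixture of candidates."""
    def __init__(self, dim):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),                       # skip-connection candidate
            nn.Linear(dim, dim),                 # fully connected candidate
            nn.Sequential(nn.Linear(dim, dim),   # small MLP candidate
                          nn.ReLU(),
                          nn.Linear(dim, dim)),
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture params

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)         # continuous relaxation of "which op?"
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

model = MixedOp(dim=8)
x = torch.randn(4, 8)
loss = model(x).pow(2).mean()                    # dummy objective
loss.backward()
print(model.alpha.grad)                          # gradient w.r.t. the architecture itself
```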