Ask HN: Transformer alternatives that could have emergent properties when scaled
6 points by s_r_n 6 days ago | 3 comments
I am trying to identify model architecture candidates that could, like transformers, have "emergent" properties when they are scaled (see https://arxiv.org/abs/2206.07682).

Some contenders I already know about are:

* Monarch Mixer (https://arxiv.org/pdf/2111.00396.pdf)

* Hyena (https://hazyresearch.stanford.edu/blog/2023-03-07-hyena)

Thanks for your help.

Slightly off topic but researching architectures is really a moot task IMO.

Let's start with a fully connected, n-layer deep neural network. Very inefficient, but the simplest, most generic architecture.

Now look at conv nets. Conv layers are essentially fully connected layers with a whole bunch of zeros in the weight matrix (and the same few nonzero weights repeated on every row).
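To make the conv-net point concrete, here is a minimal sketch in plain Python. It assumes a 1D "valid" convolution in the cross-correlation form most deep learning libraries use; the function names are made up for illustration. It builds the dense matrix that a fully connected layer would need in order to compute the same thing.

```python
def conv1d_valid(x, kernel):
    """Sliding-window dot product (cross-correlation, no padding)."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

def conv_as_dense_matrix(kernel, input_len):
    """Dense weight matrix of shape (input_len - k + 1, input_len)
    that reproduces conv1d_valid as a plain matrix-vector product."""
    k = len(kernel)
    rows = []
    for i in range(input_len - k + 1):
        row = [0.0] * input_len      # mostly zeros
        row[i:i + k] = kernel        # same kernel on every row, shifted by one
        rows.append(row)
    return rows

def matvec(matrix, x):
    """Fully connected layer without bias or activation."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]
W = conv_as_dense_matrix(kernel, len(x))

print(W)                        # banded matrix: the kernel plus zeros
print(matvec(W, x))             # dense layer output
print(conv1d_valid(x, kernel))  # convolution output, same values
```

The dense version needs 3 x 5 = 15 weights to express what the convolution does with 3, which is the inefficiency the comment is pointing at.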

Now look at transformers. Transformers do really well in a wide array of things, not because of their particular architecture, but because instead of weights * values, they do weights(values) * values, i.e. the weights are a function of the values. However, if you look at the math, you can represent the entire transformer with just a fully connected multi-layer net. The Q, K, V projections are all linear transformations, the Q*K matrix multiply can be represented as one layer, softmax is another layer, etc.
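A rough sketch of the "weights(values) * values" idea, again in plain Python. This is single-head scaled dot-product self-attention with no masking; the helper names and the identity projection matrices are assumptions chosen to keep the example small, not anything taken from a specific library.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def attention_weights(x, wq, wk):
    """Mixing weights computed from the input itself."""
    q, k = matmul(x, wq), matmul(x, wk)
    d = len(q[0])
    scores = matmul(q, transpose(k))
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]

def self_attention(x, wq, wk, wv):
    """weights(values) * values: the mixing matrix depends on x."""
    return matmul(attention_weights(x, wq, wk), matmul(x, wv))

identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projections
x_a = [[1.0, 0.0], [0.0, 1.0]]
x_b = [[2.0, 0.0], [0.0, 2.0]]

# Same parameters, different inputs, different mixing weights.
weights_a = attention_weights(x_a, identity, identity)
weights_b = attention_weights(x_b, identity, identity)
print(weights_a)
print(weights_b)
print(self_attention(x_b, identity, identity, identity))
```

In a plain fully connected layer the matrix applied to the input is the same for every input; here the learned parameters stay fixed but the matrix that mixes the tokens changes with the input.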

In the end, since any architecture has to be differentiable for backprop to work, you can always find a fully connected n layer representation for it.

So the question to really ask is: what is a generic way to represent architectures, and can a network self-adjust its own architecture?

For example, you can represent an XOR gate with the simplest neural network: one hidden layer of size 2 with sigmoid activation. Or you could represent it with a much more complex neural net. Is there a way to run gradient descent not just on the weights, but on the architecture itself, to move from the complex network to the 2 neuron one?
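For reference, a minimal sketch of the small XOR network, reading "size 2" as two sigmoid hidden units feeding one sigmoid output unit. The weights below are hand-picked for illustration (one hidden unit acts roughly like OR, the other like AND); they are an assumption, not the result of training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_net(x1, x2):
    # Hidden layer: two sigmoid units with hand-picked weights.
    h_or = sigmoid(20.0 * x1 + 20.0 * x2 - 10.0)   # ~1 if either input is 1
    h_and = sigmoid(20.0 * x1 + 20.0 * x2 - 30.0)  # ~1 only if both are 1
    # Output unit: "OR but not AND".
    return sigmoid(20.0 * h_or - 20.0 * h_and - 10.0)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [round(xor_net(a, b)) for a, b in inputs]
print(outputs)
```

Any larger network that computes XOR is, in this sense, an over-parameterized version of these few weights.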

To do this you have to generalize the architecture in such a way that the right set of parameters ends up as a transformer, while another set ends up as a conv net, and so on. Then, once you have this, and it's differentiable, you will start to see a whole shitload of emergent behaviour.
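One very small way to picture "gradient descent on the architecture" is a softmax-weighted mix of candidate operations, where the mixing logits are trained like any other parameter. This is a toy continuous relaxation in the spirit of differentiable architecture search, not the general scheme the comment asks for; the candidate ops, data, learning rate, and names are all assumptions made for illustration.

```python
import math

# Candidate operations the "architecture" can choose between (illustrative).
CANDIDATE_OPS = [
    lambda x: x,       # identity
    lambda x: x * x,   # square
    lambda x: 0.0,     # zero, i.e. "no connection"
]

def softmax(alpha):
    m = max(alpha)
    exps = [math.exp(a - m) for a in alpha]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, alpha):
    probs = softmax(alpha)
    return sum(p * op(x) for p, op in zip(probs, CANDIDATE_OPS))

def loss(alpha, data):
    return sum((mixed_op(x, alpha) - y) ** 2 for x, y in data) / len(data)

def grad_alpha(alpha, data):
    probs = softmax(alpha)
    grads = [0.0] * len(alpha)
    for x, y in data:
        outs = [op(x) for op in CANDIDATE_OPS]
        mix = sum(p * o for p, o in zip(probs, outs))
        for j in range(len(alpha)):
            # For a softmax mixture: d(mix)/d(alpha_j) = p_j * (op_j(x) - mix)
            grads[j] += 2.0 * (mix - y) * probs[j] * (outs[j] - mix) / len(data)
    return grads

def search(data, steps=200, lr=0.1):
    alpha = [0.0, 0.0, 0.0]  # start with no preference between ops
    for _ in range(steps):
        g = grad_alpha(alpha, data)
        alpha = [a - lr * gi for a, gi in zip(alpha, g)]
    return alpha

# Target function is x^2, so the search should come to prefer the square op.
data = [(x, x * x) for x in (-2.0, -1.0, 1.0, 2.0)]
alpha = search(data)
best = max(range(len(alpha)), key=lambda i: alpha[i])
print(softmax(alpha), best)
```

Here the "architecture" is just a choice among three ops, so this only hints at the idea; a parameterization that can land on a transformer or a conv net would need a far richer search space.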

Most probably, in time we will find that most models capable of "free speak" and deep reasoning have properties that in biological entities we strongly associate with "conscious thinking".

There's a chance that even relatively weak systems with strong dynamic adaptability (capable of hundreds to thousands of decisions, which would be a rather pedestrian reasoning capability, but a reasoning capability anyway) are capable of some emergence related to "human consciousness". Advanced HFT systems could, then, have been capable of "conscious thinking" back in 2010. And we might not have noticed (or maybe some people noticed it and continued that line of research in strong stealth mode, advancing the capability of emergence in the models we saw just popping up out of the blue...), but I'm losing the point...

We, as humans, tend to correlate "consciousness" with a pseudo-continuous state of mind or mood, which most probably isn't real. More probably, most humans have an exceedingly capable "auto-pilot mode" in which they can operate most of the time when they're dealing with repetitive tasks, and consciousness is just a sporadic event, emerging - properly speaking, and correlating it with similar states in AI - just when it's required by some context.

If we think of models as systems capable of running the "human software" (our programming as biological entities: human language and its associated cognitive maps), they could obviously replicate the human traits intrinsically embedded in that software.

The pseudo-continuous state of "consciousness" is one of them, and some flaws "detected" as "non-self-conscious thinking" and/or hallucinations are probably just what you can see daily in every human on the planet: you can randomly ask anyone out of the blue "what are you doing?" and marvel that most people take a second or three to actually engage "consciousness" and explain to you why they're there and what they're doing.

That latency to switch from autonomous behavior to conscious thinking is what would be the next frontier for AI: how many complex tasks, even talking, could somehow be relayed to "non-conscious" but intelligent cognitive processing in models.

Are those really “transformer alternatives”, or just different ways to implement transformers by replacing parts of transformers with alternate parts?
