> removing it was probably more of a hack to make things parallelizable
But that's the entire point of it. Transformer-based LLMs are “more intelligent” only because this parallelization lets you make them bigger and train them on bigger datasets.
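To spell out what “parallelizable” buys you: during training a transformer scores every position of the sequence in one batched pass, while a recurrent model has to walk the sequence step by step. A toy comparison (PyTorch, arbitrary shapes, just to illustrate the structure of the computation, not any real model):

```python
import torch
import torch.nn as nn

B, T, D = 8, 1024, 512
x = torch.randn(B, T, D)

# Transformer-style: all T positions are processed in one matmul-heavy pass.
attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # mask out future tokens
y_parallel, _ = attn(x, x, x, attn_mask=causal)   # one shot over the whole sequence

# RNN-style: an inherently sequential loop over T steps during training.
rnn_cell = nn.GRUCell(D, D)
h = torch.zeros(B, D)
outs = []
for t in range(T):            # T dependent steps; can't be batched over time
    h = rnn_cell(x[:, t, :], h)
    outs.append(h)
y_sequential = torch.stack(outs, dim=1)
```

The second loop has T data-dependent steps; the first is a single pass over the whole sequence, which is what lets you throw huge batches and huge datasets at the hardware.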
It's not just about size. Self-Attention is every bit as important as large size, because if we had the current large size but without Self-Attention, we wouldn't have the emergent intelligence. Also, "size" isn't even a new innovation; Self-Attention was.
This doesn't match the common knowledge on the topic, which is that model size matters more than architecture. And training-set size matters even more, which is why today's single-digit-billion-parameter models are stronger than hundreds-of-billions-parameter ones from a few years ago, back when “Chinchilla-optimal training” was in fashion.
SSMs are literally the proof that all that really matters is training scalability.
The universal approximation theorem doesn't care about the architecture, after all.
If you parse my words a bit more carefully, you'll see there's a simple thought experiment (or real experiment) you can do to test my claim, which is this:
Take our "current large size" (my words from the last post) LLMs, exactly as they are today, and simply remove the Self-Attention wiring, then see whether that destroys the emergent intelligence or not. I claim it would. But at the same time, this doesn't mean you can just stick Self-Attention onto a small model and expect intelligence to emerge again.
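To make the experiment concrete, here's a minimal sketch of what I mean by "removing the Self-Attention wiring" (toy PyTorch, a standard pre-norm block with made-up sizes, not any real model's code):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-norm transformer block; use_attention=False ablates the Self-Attention sublayer."""
    def __init__(self, d_model=512, n_heads=8, use_attention=True):
        super().__init__()
        self.use_attention = use_attention
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):  # x: (batch, seq_len, d_model)
        if self.use_attention:
            h = self.ln1(x)
            a, _ = self.attn(h, h, h)      # the only place where tokens exchange information
            x = x + a
        x = x + self.mlp(self.ln2(x))      # position-wise: each token is processed independently
        return x
```

With use_attention=False the stack degenerates into a per-token MLP: no information ever flows between positions, and that's exactly the ablation I'm talking about.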
You are wildly overestimating the “emergent capabilities” of current models, and underestimating the performance of alternative architectures (namely SSMs) at the same size.
Also, modern “small” models show that your last sentence isn't really true either.
> wildly overestimating the “emergent capabilities”
How could I be "overestimating" the emergent capabilities when I never even quantified those capabilities other than to call them "emergent" and impressive?
> “small” models show that your last sentence isn't true either.
I never said that even a perfect architecture would make small models "intelligent". However, to the extent that even smaller LLMs can exhibit surprising capabilities, that's more evidence IN FAVOR OF everything I've said, not against.
EDIT: But by "small" in that last sentence (of my prior reply) I meant genuinely small, i.e. non-LLM scale, and you seem to have interpreted it as "a smaller LLM".
Even 1B-parameter models show “impressive capabilities” to anyone not accustomed to the current state of the art. And there are plenty of relatively small models that perform as well as ChatGPT 3.5 did when it was first released and felt like magic.
“All” that was needed to get there was “just” feeding them more data. The fact that we were actually able to train billion-parameter models on multiple trillions of tokens is the key property of transformers; there's no magic beyond that (it's already cool enough, though). It's not so much that they are more intelligent, it's simply that with them we can brute-force in a scalable fashion.
Yes, even the original Transformer model had only millions of parameters and nonetheless showed "impressive capabilities", because it also had Self-Attention.
If you know of any models that have had success (even at the GPT-2 level) without Self-Attention, I'd be interested to know what they are, because I don't know of any.
There aren't many multi-billion-parameter non-transformer models because of path dependence, but that doesn't mean that only transformers can achieve this kind of result.
My statements (which you disagreed with, without exception) haven't been about Transformers vs. non-Transformers. Everything above has been about the importance of the Self-Attention part. We could remove Self-Attention from Transformers and still have a functional (but dumb) NN, and that was my point.
Your position was that Self-Attention is the less important part (because UAT, yadda yadda), and my position was that it's the key ingredient. Every statement above that I made, that you called wrong, was correct. lol.
You are moving the goalposts. The discussion has always been about transformers vs. non-transformers.
You claimed that self-attention was needed to achieve the level of intelligence that we've seen with GPT 3.5:
> without those attention heads even the scaling up to current parameter sizes we have to day would not have ended up with the level of emergent intelligence that shocked the world with GPT 3.5. (Verbatim quote from you https://news.ycombinator.com/item?id=41986010)
This is the claim I've been disputing, by responding that the key to the intelligence of transformer models is their scalability. And now that we have alternatives that scale equally well (SSMs and RWKV), unsurprisingly we see them achieve the same level of reasoning ability.
> Every statement above that I made, that you called wrong, was correct. lol.
In the quote you're calling wrong (41986010), you're interpreting "scaling up" as "scaling up, including changing architecture". Scaling up transformers just means scaling up transformers and keeping everything else the same. In other words, you're interpreting "parameter size" as "parameter size, independent of architecture", whereas I meant the parameter size of a Transformer (in the context of with vs. without Self-Attention).
There's no straw man, and you are now at the point of trying to reinvent the definition of words in order to somehow “win the argument” without even respecting your own previous position. This behavior is legit pathetic; that's not an insult, it's a fact. Respect yourself.
I stand by every word: 1) Self-Attention is more important than scale, and 2) to test that claim, simply remove SA from a transformer and see whether it destroys the "intelligence" or not. There's nothing confusing about that, but thanks for your concern and your polite words.
No, that wasn't your argument, and this new one is of course a much weaker one that you fell back onto in order to be “technically right”.
That attention heads are mandatory for transformers is a tautology (without them a transformer is just an MLP…), so of course this statement is going to be correct, by definition.
But when you move the goalposts to land on a tautology, you've surrendered your ability to argue anything and you are just ridiculing yourself. Take this question of yours, for instance:
> If you know of any models that have had success (even at the GPT-2 level) without Self-Attention, I'd be interested to know what they are, because I don't know of any.
Which is a legit, non-ridiculous one.
If you replace it with your later much weaker argument:
> > If you know of any MLPs that have had success (even at the GPT-2 level), I'd be interested to know what they are, because I don't know of any.
Then it becomes a dumb question, given that MLPs have no way of encoding context and can't process sequences of words in the first place.
So when you argue that this was your argument all along, it's particularly embarrassing, because you're effectively claiming that your previous arguments were equally dumb even when they weren't.
That's why I said you're disrespecting your earlier argumentation by retreating to your later tautology.
Your ad hominem ratcheted up again. lol. It's ok. No prob. Learn what a tautology is tho bro. It's perfectly legit to discuss how a Transformer would perform if only the Self-Attention part was removed (and everything else kept constant), as an experiment, to refute someone's bizarre claim that the SA part isn't doing the real magic in them. As for the actual other networks you've mentioned, they fail to beat Transformers, and will continue to fail, until something analogous to SA is built into them, because language comprehension simply cannot be done without sensitivity to word context, especially over "long ranges" in the input sequences.
Tautology: a statement that is true by virtue of its logical form alone (Merriam-Webster). That fits “a neural network that can't process text is less good at processing text than one that can” perfectly.
> It's perfectly legit to discuss how a Transformer would perform if only the Self-Attention part was removed
It only shows that you don't understand the topic at all (but hey, you talked about closed-form solutions and quantum computing elsewhere in this discussion with others, so why am I even surprised…)
> As for the actual other networks you've mentioned, they fail to beat Transformers
They don't “fail to beat transformers”: they beat transformers that aren't the state of the art, and are behind the ones that are. And that's not really a surprise, given that they are more recent and have much less manpower working on them. I don't expect them to replace transformers until they make some hypothetical breakthrough that'd make them significantly better than transformers. That's what path dependence is. But they are still a good illustration of the point that you don't need attention heads to exhibit the capabilities of LLMs. (Remember, you set the bar at GPT-2 level, and they are far beyond that.)
> because language comprehension simply cannot be done without sensitivity to word context
And these models actually do have a way to represent context, so this criticism completely misses the mark. It's really hilarious that you make this kind of claim in an HN thread about SSMs. How come you have no idea at all what a state-space model is, yet feel confident enough to come and argue in the comment section…
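To be explicit about how they represent context: a state-space layer carries a hidden state through a recurrence, roughly like this (a deliberately simplified linear SSM sketch in NumPy with a made-up helper name; real models like Mamba make the parameters input-dependent and evaluate this with a parallel scan):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal linear SSM: h_t = A @ h_{t-1} + B @ x_t,  y_t = C @ h_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:             # the state h accumulates context from every earlier token
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)

# toy shapes: 6 tokens, 4-dim embeddings, 8-dim state
x = np.random.randn(6, 4)
A, B, C = np.eye(8) * 0.9, np.random.randn(8, 4), np.random.randn(4, 8)
y = ssm_scan(x, A, B, C)      # (6, 4): one output per token, no attention anywhere
```

That per-step state is what gives these models sensitivity to word context, including long-range context, and the same recurrence can be computed as a convolution or parallel scan at training time, which is exactly the scalability point.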
No, that's what you've been missing from the beginning: the breakthrough of transformers was scalability. Now we have other models that are equally scalable and, as such, roughly equally performant (and that's not a surprise).
But the ship has sailed, and nobody is going to switch to anything other than transformers unless it's significantly better, so the other approaches are going to stay behind, because every marginal improvement comes to transformers first (since that's what practically everyone is working on) and alternative models are playing catch-up.
This is a remarkable example of path dependence.
Interpreting this as “transformers are fundamentally superior” is the mistake I'm trying to help you correct.
The breakthrough of transformers was scalability. The next breakthrough of equivalent importance will be something entirely different, or it won't happen at all.