3Blue1Brown is really a godsend. I had tried and failed to get an intuition for the self-attention mechanism, and also for why you want to stack your transformer layers. I went through a lot of material (Key-Query-Value interpretations, SVDs of these, etc.), but his comment in the chapter 5 video [1] made it click for me: "But you should think of the primary goal of this network that it (the word embedding) flows through as being to enable each one of those vectors to soak up a meaning that's much more rich and specific than what mere individual words could represent".
Here is my intuition:
Transformers are merely a convnet for token sequences, in the sense that information is local at first and each datum means something with respect to its immediate neighbors. You extract information locally, then you dezoom and do exactly the same thing at a higher level. Convnets have been shown to learn edge-detecting kernels first (a vertical edge), then assemble them into basic shapes (vertical + oblique: a corner), then bigger shapes (something round), then link shapes together (round + round separated by a horizontal shape with rectangles on top), and then it's a car! [2]
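To make the analogy concrete, here is a rough PyTorch sketch (my own toy example, not any particular model): a few conv stages with downsampling in between, so each later layer "sees" a wider patch of the image.

    import torch
    import torch.nn as nn

    cnn = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # edges: each unit sees a 3x3 patch
        nn.MaxPool2d(2),                                          # dezoom by 2
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # corners / simple shapes
        nn.MaxPool2d(2),                                          # dezoom again
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),  # larger parts (round things, rectangles)
    )

    x = torch.randn(1, 3, 64, 64)  # a fake 64x64 RGB image
    features = cnn(x)
    print(features.shape)          # torch.Size([1, 64, 16, 16]): fewer positions, richer meaning each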
Transformers do the same. They assemble meaning among tokens ("ju" + "mps": a conjugated verb), then assemble those at a higher level ("cat jumps" -> noun + verb), then it's a sentence ("the cat jumps over the wall"), then a paragraph, then a chapter, then a book. It's all differentiable, so any deviation from something coherent can tweak the knobs higher up the chain, at any level of detail.
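Same idea on the token side, again as a toy sketch (plain PyTorch defaults, nothing like a real LLM): the same sequence of vectors flows through stacked self-attention layers, and each pass lets every position soak up more context from the rest.

    import torch
    import torch.nn as nn

    d_model = 64
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=6)  # 6 stacked "dezoom" stages

    tokens = torch.randn(1, 10, d_model)  # 10 token embeddings, e.g. "the cat ju mps over the wall"
    enriched = encoder(tokens)            # same shape, but each vector now carries context from the others
    print(enriched.shape)                 # torch.Size([1, 10, 64])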
Convnets just relate to 2D data topology, transformers to long 1D sequences. Which raises the question of 3D data, btw: will NeRFs/Gaussian splats have their hierarchical-dezoom moment? (Or have they already?)
So why didn't LLMs appear earlier? RNNs and LSTMs modelled sequences accurately before, but they are not "BigData": they are small by design and must forget the long tail of residual meaning. Transformers are a hardware-friendly way of keeping all the joint interactions across tokens in large windows. And sometimes text needs that: you mention a cat at the beginning, and 3 chapters later you can refer to it without saying "cat". Same for code: "import java.sql.Date" will have a pretty important meaning 5000 tokens down the line.
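Here is roughly what I mean by hardware-friendly, as a toy sketch (shapes and numbers made up): attention over the whole window is one big matrix multiply that a GPU chews through in parallel, while an RNN has to walk the sequence step by step and squeeze all history into one fixed-size state.

    import torch

    T, d = 5000, 64                # e.g. 5000 tokens of code after "import java.sql.Date"
    Q, K, V = (torch.randn(T, d) for _ in range(3))

    # Attention: every token interacts with every other token at once, as a T x T matrix.
    scores = Q @ K.T / d ** 0.5    # (T, T)
    attn = torch.softmax(scores, dim=-1)
    out = attn @ V                 # (T, d): token 5000 can still "see" token 1 directly

    # RNN-style alternative: one step at a time, all history crammed into h.
    h = torch.zeros(d)
    W = torch.randn(d, d) * 0.01
    for t in range(T):             # inherently sequential, no parallelism over T
        h = torch.tanh(W @ h + V[t])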
LLMs are the BigData story all over again. Dumb algorithms that scale well beat complex ones, out of the sheer volume of ingested examples. That's why (to my knowledge) you still find logistic regression at the core of ad placement: you don't model 1 person clicking on one thing, you just let 100 of them teach you how they do it. You just crush the problem with data. It's also been called "The Bitter Lesson" [3]. And LLMs, as Moore's law progressed, were the first architecture able to ingest a sizeable portion of the internet.
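For the ads point, the whole "dumb but scalable" recipe fits in a few lines (scikit-learn assumed, made-up features and clicks, just to show the shape of it):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.random((100_000, 20))  # 100k impressions, 20 crude features each
    y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(100_000) > 0.9).astype(int)  # fake clicks

    model = LogisticRegression(max_iter=1000).fit(X, y)
    print(model.predict_proba(X[:3])[:, 1])  # predicted click probabilities

No per-person modelling, just a simple model fed with enough examples.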
[1] https://youtu.be/wjZofJX0v4M?si=Zdz4sQAvch5B-QA1&t=1177
[2] https://www.cs.cmu.edu/~epxing/Class/10708-19/notes/lecture-...
[3] http://www.incompleteideas.net/IncIdeas/BitterLesson.html