If you'd prefer something readable and explicit, instead of empty handwaving and UML-like diagrams, read "The Transformer model in equations" [0] by John Thickstun [1].
The original paper is quite deceptive and hard to understand, IMHO. It relies on jumping between several different figures and mapping between shapes, in addition to guessing at what the unlabeled inputs are.
Just a few more labels, making the implicit explicit, would make it far more intelligible. Plus, last time I went through it I'm pretty sure that there's either a swap in the order of the three inputs between different figures, or that it's incorrectly diagrammed.
Done 1. It is a jaw-dropper! Especially if you have done the rest of the series and seen the results of older architectures. And I was like "where is the rest of it, you ain't finished!" … and then … ah, I see why they named the paper Attention Is All You Need.
But even the crappy (small, 500k-param IIRC) Transformer model trained on a free Colab in a couple of minutes was relatively impressive. Looking at only 8 chars back and trained on an HN thread, it got the structure/layout of the page pretty much right, interspersed with drunken-looking HN comments.
Maybe this is more of a general ML question but I faced it when transformers became popular. Do you know of a project-based tutorial that talks more about neural net architecture, hyperparameter selection and debugging? Something that walks through getting poor results and makes explicit the reasoning behind the tweaking?
When I try to use transformers or any AI thing on a toy problem I come up with, it never works. And there's this blackbox of training that's hard to debug into. Yes, for the available resources, if you pick the exact problem, the exact NN architecture and exact hyperparameters, it all works out. But surely they didn't get that on the first try. So what's the tweaking process?
but the general idea of "get something that can overfit first" is probably pretty good.
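To make that concrete, here is a minimal sketch of the "overfit a single batch" sanity check in PyTorch (the model, batch, and targets are placeholders, and it assumes a classification-style loss): if the loss won't drop toward zero on one tiny batch, the problem is usually the data pipeline, the loss, or the wiring, not the hyperparameters.

    import torch

    # Hypothetical sanity check: before tuning anything, verify the model can
    # drive the training loss toward zero on a single small batch.
    def overfit_one_batch(model, batch, targets, steps=500, lr=1e-3):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        for step in range(steps):
            opt.zero_grad()
            loss = loss_fn(model(batch), targets)
            loss.backward()
            opt.step()
            if step % 100 == 0:
                print(step, loss.item())
        return loss.item()  # should end up close to zero on this one batch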
In my experience getting the data right is probably the most underappreciated thing. Karpathy has data as step one, but in my experience the data representation and sampling strategy also work wonders.
In Part II of our book we do an end-to-end project including e.g. a moment where nothing works until we crop around "regions of interest" to balance the per-pixel classes in the training data for the UNet. This has been something I have pasted into the PyTorch forums every now and then, too.
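Not the book's actual code, but a hypothetical sketch of what "crop around regions of interest" means in practice, assuming a 2D image with a sparse segmentation mask: sample training crops centered on positive pixels so the per-pixel classes are far less lopsided than in the full image.

    import numpy as np

    # Illustrative only: pick a random positive pixel from the mask and cut a
    # fixed-size crop around it, falling back to a random crop if the mask is empty.
    def crop_around_roi(image, mask, crop=64, rng=None):
        rng = rng or np.random.default_rng()
        half = crop // 2
        ys, xs = np.nonzero(mask)              # coordinates of positive pixels
        if len(ys) == 0:                       # no ROI in this image
            cy = rng.integers(half, image.shape[0] - half)
            cx = rng.integers(half, image.shape[1] - half)
        else:
            i = rng.integers(len(ys))          # random positive pixel
            cy = int(np.clip(ys[i], half, image.shape[0] - half))
            cx = int(np.clip(xs[i], half, image.shape[1] - half))
        window = (slice(cy - half, cy + half), slice(cx - half, cx + half))
        return image[window], mask[window]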
Thanks for linking me to that post! It's much better at expressing what I'm trying to say. I'll have a careful read of it now.
I think I'm still at a step before the overfit. It doesn't converge to a solution on its training data (fit or overfit). And all my data is artificially generated so no cleaning is needed (though choosing a representation still matters). I don't know if that's what you mean by getting the data right or something else. Example problems that "don't work": fizzbuzz, reverse all characters in a sentence.
It really is amazing. To be fair, if you're actually following along and writing the code yourself, you have to stop and play back quite frequently, and the part around turning the attention layer into a "block" is a little hard to grok because he starts to speed up around 3/4 of the way through. But yeah, this is amazing. I went through it the week before starting as lead prompt engineer at an AI startup, and it was super useful and honestly a ton of fun. Reserve 5 hours of your life and go through it if you like this stuff! It's an incredibly good crash course for any interested dev.
Masochists! In a good way! I recommend you do the full course rather than jumping straight into that video. I did the full course, paused to do some of a university course around lecture 2 to really understand some stuff, then came back and finished it off.
By the end of it you will have done things like working out back-propagation by hand through sums, broadcasting, batchnorm, etc. Fairly intense for a regular programmer!
From looking at the video probably someone who has good working knowledge of PyTorch, familiarity with NLP fundamentals and transformers, and somewhat of a working understanding of how GPT works.
I feel like I have a schematic understanding of how transformer models process text. But I struggle to understand how the same concept could be used with images or other less linear data types. I understand that such data can be vectorized so it’s represented as a point or series of points in a higher dimensional space, but how does such a process capture the essential high level perceptual aspects well enough to be able to replicate large scale features? Or equivalently, how are the dimensions chosen? I have to imagine that simply looking at an image as an RGB pixel sequence would largely miss the point of what it “is” and represents.
Just to add, the first user who replied to you is quite wrong. You can use CNNs to get features first...but it doesn't happen anymore. It's unnecessary and adds nothing.
Pixels don't get fed into transformers, but that's more about expense than anything. Transformers need to understand how each piece relates to every other piece, and that gets very costly very fast when the "pieces" are individual pixels. Images are split into patches instead, with positional embeddings added.
As for how it learns the representations anyway: well, it's not like there's any specific intuition to it. After all, the original authors didn't anticipate the use case in vision.
And the fact that you didn't need CNNs to extract features first didn't really come to light until this paper - https://arxiv.org/abs/2010.11929
It basically just comes down to lots of layers and training
The original commenter is trying to build intuition so my statement was simply to help the commenter understand that Transformers can operate on patches.
As for
> after all, the original authors didn't anticipate the use case in Vision.
To quote the original "Attention is all you need" paper:
"We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video"
To say that the original authors did not anticipate the Transformers use in Vision is false.
I'm talking about using transformers on their own, beating SOTA CNNs. People thought you needed CNNs even with transformers... until that was shown to be wrong. The point is that there isn't any special intuition that makes this fit vision.
Why would you feed pixels into transformers? Surely you could translate the overall image into frequencies using Fourier transforms and then into embeddings which feed the LLM?
If I understand what you're asking, the Transformer isn't initially treating the image as a sequence of pixels like p1, p2, ..., pN. Instead, you can use a convolutional neural network to respect the structure of the image to extract features. Then you use the attention mechanism to pay attention to parts of the image that aren't necessarily close together but that when viewed together, contribute to the classification of an object within the image.
Vision Transformers don't use CNNs to extract anything first. (https://arxiv.org/abs/2010.11929). You could but it's not necessary and it doesn't add anything so it doesn't happen anymore.
Vision transformers won't treat the image as a sequence of pixels but that's mostly because doing that gets very expensive very fast. The image is split into patches and the patches have positional embeddings.
They split the image into small patches (I think 16x16 is standard), and then treat each patch as a token. The image becomes a sequence of tokens, which gets analyzed the same as if it was text.
Doing it this way is obviously throwing away a lot of information, but I assume the advantages of using a transformer outweigh the disadvantages.
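For reference, here is roughly how the patch-to-token step looks in code, following the ViT recipe (a sketch; the 224/16/768 sizes are just the common defaults): each 16x16 patch is linearly projected into an embedding vector and a learned positional embedding is added. A Conv2d with kernel = stride = patch size is the standard trick for doing the slicing and the projection in one go.

    import torch
    import torch.nn as nn

    # Split an image into non-overlapping patches and embed each one as a token.
    class PatchEmbed(nn.Module):
        def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
            super().__init__()
            self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
            num_patches = (img_size // patch_size) ** 2
            self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))  # learned positions

        def forward(self, x):                 # x: (B, 3, 224, 224)
            x = self.proj(x)                  # (B, dim, 14, 14)
            x = x.flatten(2).transpose(1, 2)  # (B, 196, dim): a sequence of patch tokens
            return x + self.pos               # add positional embeddings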
This is a very useful answer, thanks. So is every possible 16x16 grid of pixels a legal token value and they are used more or less verbatim or are they encoded in some other way?
Prompted by some of the other replies I’ve read up a bit on positional embeddings. That they are used even with text transformers, which otherwise lack the sequential ordering that RNNs get for free, helps tremendously to clarify things.
If you say "Transformer" and "Megatron" in the same sentence, you have my full-attention for at least enough time to start making a serious observation about ML, or frankly, anything.
Great piece. At the end though there is a concerning bit.
> researchers are studying ways to eliminate bias or toxicity if models amplify wrong or harmful language. For example, Stanford created the Center for Research on Foundation Models to explore these issues
We've seen numerous times now that censoring/railroading degrades quality. This is why SD 2 was so much worse at the human form than 1.4.
There's one thing I haven't quite grasped about transformers. Are the query, key, and value vectors trained per token? Does each token have its own specific Q, K, and V vectors in each head, or does each attention head have one set of projections that is trained across a lot of tokens?
The q, k, v projections are trained, usually one set per head, but every token goes through the same projection (per head). In some architectures (see the Falcon model) the k and v projections are shared across the heads.
During the forward pass a "query" is "created" for each token using the query projection (again, one projection per head; all the tokens run through the same projection). The keys and values are created the same way, using the key and value projections. The attention weights then come from the dot product of each query with all the keys (run through a softmax), and each token's output is the attention-weighted sum of the values.
But again, different models do different things. Some models bring in the positional encoding into the attention calculation rather than adding it in earlier. Practically every combination of things has been tried since that paper was published.
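For intuition, a minimal single-head sketch (plain PyTorch, no masking, batching, or output projection): every token runs through the same learned projections, the query-key dot products become softmaxed attention weights, and the output is the weighted sum of the values. Multi-head attention runs several such heads with their own Wq/Wk/Wv and concatenates the results; multi-query variants like Falcon's share one Wk/Wv across heads while keeping a Wq per head.

    import torch
    import torch.nn.functional as F

    def attention(x, Wq, Wk, Wv):
        # x: (T, d_model); Wq, Wk, Wv: (d_model, d_head), one set for this head
        q = x @ Wq                              # (T, d_head) queries, one per token
        k = x @ Wk                              # (T, d_head) keys
        v = x @ Wv                              # (T, d_head) values
        scores = q @ k.T / k.shape[-1] ** 0.5   # (T, T) scaled dot products
        weights = F.softmax(scores, dim=-1)     # each row sums to 1
        return weights @ v                      # attention-weighted sum of the values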
Models have the initial N embeddings passed through and processed by the attention+linear blocks, each of which adds something to the input, sort of like a memory bus. After the last block we still have an array with the same dimensions as the N initial embeddings, but they mean something else now. How do we select the next token based on the last element in that array?
Another question: that bus technically doesn't have to have the same width, right? It should be possible, for the same model size, to trade bus width for the number of heads, or even have a sort of U-Net.
And the last one: on each loop the resulting embedding (1) is converted into a token, which is added to the input and then converted back to an embedding. Since it's converted back anyway, it should be possible to just reuse embedding (1) and use the token only for user output, right?
PS: I'm not sure every combination has been tried; we've only just started. Some of the problems still don't have satisfying solutions, like hallucinations, online training, or responsibility.
In go X tokens. Out comes a probability distribution across every possible next token.
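Concretely, the last step looks something like this (a sketch: hidden is the output of the final block, W_vocab is a learned projection to vocabulary logits, often tied to the input embedding matrix). Only the last position's vector is used to choose the next token.

    import torch
    import torch.nn.functional as F

    def next_token(hidden, W_vocab, temperature=1.0):
        # hidden: (T, d_model) outputs of the last block; W_vocab: (d_model, vocab_size)
        logits = hidden[-1] @ W_vocab                    # scores for every vocab entry
        probs = F.softmax(logits / temperature, dim=-1)  # probability distribution
        return torch.multinomial(probs, num_samples=1)   # sample one token id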
In your analogy, the bus could be any width. In practice people tend to trade "bus width" for heads, but it need not be that way.
I'm not sure I understand the last part.
It's quite easy to try stuff with LLMs/transformers. The fact that a paper hasn't been written on every combination doesn't mean they haven't been tried in some way. It's not as though the architecture is the only thing.
While they are largely obsolete for practical purposes, learning about them is still valuable, as they illustrate the natural evolution in the thought process behind the development of transformers.
Quite possible! But given how ChatGPT hallucinates, and my general lack of knowledge about LLMs in general and ChatGPT in particular, I would be hesitant to take what it says at face value. I'm especially hesitant to trust anything it says about itself in particular, since much of its specifics are not publicly documented and are essentially unverifiable.
I wish there were some way for it to communicate that certain responses about itself were more or less hardcoded.
[0] https://johnthickstun.com/docs/transformers.pdf
[1] https://johnthickstun.com/docs/