Wait, is this implementation just wrong? The "num_channel" is the embed_d, so K and Q should be applied to every embedded vector. Here the embedded vector has length 1, so K, Q, and V should each be a single scalar, applied like torch.matmul(x_i, self.K): [sequence_dim, 1] x [1, 1]. This is also intuitive because the sequence can have different sizes and self-attention needs to work anyway.
If instead you want to treat "num_inputs" as the embed_d, then the attention matrix W would be 1x1, because there is only a single 768-dimensional vector; but instead the matrix W here is [num_inputs, num_inputs].
Now that you have updated the code, I think you have just flipped the error. It's still there, just doing the opposite (still wrong) thing. For example, if "num_steps" is 768 and "num_inputs" is 1, then K, Q, and V are 1x1; fine, but then the attention matrix cannot also be 1x1, it needs to be 768x768. You can also do the opposite: K, Q, V being 768x768 and the attention matrix 1x1. Before, the error was that you made both K, Q, V and the attention matrix 768x768 by applying the matrices on the wrong dimension; now the error is that you have made both K, Q, V and the attention matrix 1x1.
For a simple fix, just make the attention matrix W [num_steps, num_steps] instead of [num_inputs, num_inputs]; then "y_i = torch.matmul(x_v_i, w_i)" becomes "y_i = torch.matmul(w_i, x_v_i)" and should be correct.
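To make the shapes concrete, here is a minimal sketch of what the fixed version would look like for a single sample. The variable names (num_steps, embed_d) follow the discussion above; I'm assuming the degenerate case from this thread where embed_d is 1, but the same code works for any embed size:

```python
import torch

# Assumed shapes from the thread: one sample is a sequence of num_steps
# tokens, each with embed_d features (here embed_d = 1).
num_steps, embed_d = 768, 1

x = torch.randn(num_steps, embed_d)   # one sample: [num_steps, embed_d]

# K, Q, V act on the feature (embedding) dimension, so they are
# [embed_d, embed_d] -- 1x1 in this degenerate case.
Q = torch.randn(embed_d, embed_d)
K = torch.randn(embed_d, embed_d)
V = torch.randn(embed_d, embed_d)

x_q = torch.matmul(x, Q)              # [num_steps, embed_d]
x_k = torch.matmul(x, K)
x_v = torch.matmul(x, V)

# The attention matrix compares every step with every other step,
# so it is [num_steps, num_steps] (768x768), not [num_inputs, num_inputs].
w = torch.softmax(torch.matmul(x_q, x_k.T), dim=-1)

# Attention weights multiply the values from the left: matmul(w, x_v).
y = torch.matmul(w, x_v)              # [num_steps, embed_d]
```

Note that nothing here depends on num_steps, which is the point: the same K, Q, V work for sequences of any length.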
I just checked and the attention matrix is 768x768. Basically, I reshape each sample to be a single step with an embed size of 768. I hope this makes sense. I may rewrite the README later tonight to make this point clear.
If you have a single embedded vector, it should be clear that the self-attention matrix cannot be an arbitrary NxN but only 1x1; follow the fix I added to my previous comment.
Also, if K, Q, V and the self-attention matrix have the same dimensions even when the embed_d and the sequence dim have different sizes, then something is wrong.
I made the same mistake recently! It doesn't help that there are so many poorly documented repos that make similar mistakes (and worse yet, half-baked Medium articles that blatantly give false information). Directly translating papers to PyTorch is hard too; I really wish there were a better API for expressing these models, one that aligns better with how they're actually documented in papers.
No one here actually checks anything before upvoting it. Having said that, this repo seems worthwhile and I appreciate the effort of OP. It's open source which empowers people to fix such problems themselves and make a PR, or at least an issue.
Once I got this feedback from GaggiX, I was actually looking on HN for a way to remove my post, but then discovered I could not! So I have tried to fix it as soon as possible.
For the best, IMO. That was a great interaction, and anybody running into issues related to this problem in the future actually stands a chance of finding this thread now.
This MIT lecture on recurrent neural networks provides a lot of context on why attention was so groundbreaking, and how things were solved before: https://www.youtube.com/watch?v=ySEx_Bqxvvo
And for the Attention Is All You Need paper itself, Andrej Karpathy's "GPT From Scratch" talk is a really great walkthrough of how to make this work in practice, all condensed into two hours of live coding: https://www.youtube.com/watch?v=kCc8FmEb1nY
OP, just to balance out these comments, you hooked me with the readme. DNNs have been moving way faster than pedagogy can keep up and these sorts of learning experiments are super helpful.
I wrote an article that attempts to explain attention in a more intuitive manner, without any PyTorch or other deep learning specific context: https://jaykmody.com/blog/attention-intuition/
TBH I don't find this exposition enlightening. If you don't already know the pytorch API, the code will be pretty obscure. And if you do understand the API, it's still not very clarifying.
Instead, it's a lot easier IMO to understand self attention in just plain English. If you understand the words, and also understand the torch API (or tensorflow or numpy, they're all more or less fungible), then it should also be straightforward to implement it.
Here's how I explain it to my grad students:
You have a collection of N things, with some fixed number of features each (doesn't matter how many, but call it d_f). Call that collection {x_i}. You have two learnable matrices Q and K, which learn to project those items into some collection of "questions" and "answers", respectively. Doesn't really matter what those questions or answers are, the NN will figure out what they should be during training.
These projection matrices map x_i -> k_i, and x_i -> q_i. k and q are each d_k-dimensional vectors (d_k is a number you choose) of features, so Q and K are d_k-by-d_f matrices. To measure the "compatibility" between questions q and answers k, we take the dot product of them. Pairs which are similar (large, positive dot products) mean the NN likes the answer k for question q.
So, here's what you do. For each token i, you compare its "question" against every other token j's answer. I.e. you compute dot(q_i, k_j), which itself can be considered an NxN matrix of scalar numbers, qk_{ij}. Each element of this matrix contains the answer to the question "how well did item j answer the question of item i?"
Applying softmax to this matrix just converts the dot product values into a score from 0-1 over all the items. I.e., you convert the question to "on a scale from 0-1, how well did item j answer the question of item i?". The only subtlety here is that all the scores over all items j have to sum to 1 for each question i.
Finally, we have a third projection matrix, V, which maps x_j -> v_j. The v_j are d_v-dimensional vectors (doesn't have to equal d_k, and again something you arbitrarily choose). These represent the values that the NN would like to pass on from item j to any other item which decides it "likes" the answer of item j.
So now, for each item i we have a score for "how well does item j answer my question i?". And we also have the list of values v_j that each item j would like to pass to item i. So, to compute the new information to add to item i, we take the weighted sum of the values v_j that item i liked, using these compatibility scores as a weight. In math, this reads:
y_i = sum_over_j [ softmax(qk_ij) v_j ]
* Technically it's also empirically found that normalizing the argument of softmax by a factor 1/sqrt(d_k) helps. But it's not really germane to understanding and also not strictly necessary since a NN is of course free to simply learn to include an overall factor proportional to 1/sqrt(d_k) in the parameters of one of the projection matrices. Actually the use of softmax at all is also arbitrary, but encourages the attention to be sparse, which again is empirically known to be helpful.
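The whole recipe above can be sketched in a few lines of NumPy. The sizes here are arbitrary toy values I picked for illustration; everything else follows the description directly:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax: rows sum to 1.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

N, d_f, d_k, d_v = 5, 8, 4, 6            # toy sizes, chosen arbitrarily
rng = np.random.default_rng(0)
x = rng.normal(size=(N, d_f))            # N items with d_f features each

Q = rng.normal(size=(d_k, d_f))          # projects x_i -> q_i ("questions")
K = rng.normal(size=(d_k, d_f))          # projects x_i -> k_i ("answers")
V = rng.normal(size=(d_v, d_f))          # projects x_j -> v_j ("values")

q = x @ Q.T                              # [N, d_k]
k = x @ K.T                              # [N, d_k]
v = x @ V.T                              # [N, d_v]

qk = q @ k.T / np.sqrt(d_k)              # [N, N]: dot(q_i, k_j), scaled
scores = softmax(qk, axis=-1)            # each row sums to 1 over j
y = scores @ v                           # [N, d_v]: y_i = sum_j scores_ij * v_j
```

Note that N appears nowhere in the learnable matrices Q, K, V: the same parameters handle collections of any size, which is exactly the property discussed upthread.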
> You have a collection of N things, with some fixed number of features each (doesn't matter how many, but call it d_f). Call that collection {x_i}. You have two learnable matrices Q and K, which learn to project those items into some collection of "questions" and "answers", respectively. Doesn't really matter what those questions or answers are, the NN will figure out what they should be during training.
These projection matrices map x_i -> k_i, and x_i -> q_i. k and q are each d_k-dimensional vectors (d_k is a number you choose) of features, so Q and K are d_k-by-d_f matrices.
Wait: this is the plain English version? I’ve just been presented an alphabet soup of symbols, most of which haven’t been defined:
N, d, f, x, i, Q, K, and q and k (lowercase). The only one I’m sure I understand is N.
The different letters seem to be drawn from certain regions of the alphabet, so I guess that’s significant in some way?
(I do understand vector, matrix, softmax, dimensionality, etc.)
I’m embarrassed to admit this here, but I run into this issue almost any time concepts are explained in math, including in the textbooks that teach math. What am I missing?
Apparently, you're missing the part where they're all defined :)
It does take some getting used to, communicating math in English (and vice versa).
For example: "two learnable matrices Q and K ... are d_k by d_f". What more do you need to know about them? Also, "a fixed number of features ... call it d_f", I don't know how to more clearly define what d_f is?
I could have called the matrices Joe and Curly, and the feature dimension could have been called Sarah, I suppose. But it's not clear that that's any more helpful... and to anyone with any passing familiarity with math (such as my grad students), giving instances of mathematical objects concise and symbolic names is generally easier to parse, especially when it comes to translating the words into actual math or code.
I wasn’t trying to be critical - sorry if it came across that way!
> Also, "a fixed number of features ... call it d_f", I don't know how to more clearly define what d_f is?
Perfect example. I would expect a fixed number to be represented by a single letter, like x. But here, the fixed number seems to need two letters to represent it: d and f. When I see this, I wonder "what's d?" "What's f?" So I scan ahead: I see there is a d_k, so I think there must be some sibling relationship between f and k, but I can't see what it is. As far as I can tell, Q/q and K/k are the ones that belong in a pair.
It gets very confusing trying to decipher the meaning and relationships.
No offense taken, I have twice un-deaded your comments because you seem to be posting in good faith :)
Again, you get used to it. Note, the underscore would represent a subscript (in the standard math-writing tool of LaTeX, but also in many variants of Markdown).
Here, d stands for dimension. If you talk about dimensionality a lot, this is one of a few common letters to use to describe it (other popular contenders, especially when speaking about pairs of dimensions, would be n and m, p and q, r and s. But N is already used in a different context [weirdly n/N seems to be an exception to the following paragraph]). And since there are a few different dimensions floating around, we label them all with d, and give them subscripts to indicate which one we're talking about. d_f symbolically means "dimension of the feature space", d_k means "dimension of the key space" (also the same as the query space. in my description I called them question/answer instead of query/key). d_v means "dimension of the value space."
In many contexts, the use of capital letters implicitly denotes matrices. So if you have a matrix that turns x (your input variable), into q (a query vector), naming it Q makes a lot of sense. Especially when we have three matrices floating around, it would have been much more confusing to keep track of if we named the matrices M, N, O. Which one projects to which vector? Much easier to understand that Q makes the q vectors, K makes the k vectors, V makes the v vectors.
Thanks for stepping me through that. I was able to reread and understand your original explanation.
To be honest, though, I was really just trying to understand what my roadblock is in general. As I mentioned, I experience this kind of friction with most math-related writing. And, sadly, I do have at least a "passing familiarity" with math. I don't really struggle with the concepts. For me it's the communication layer: the variety of conventions that exist never seem to jell into a coherent system in the way that natural languages do - even with all of their idiomatic and arbitrary features.
> Finally, we have a third projection matrix, V, which maps x_j -> v_j. The v_j are d_v-dimensional vectors (doesn't have to equal d_k, and again something you arbitrarily choose). These represent the values that the NN would like to pass on from item j to any other item which decides it "likes" the answer of item j.
> So, to compute the new information to add to item i, we take the weighted sum of the values v_j that item i liked
I don't get this part. If the v_j are d_v-dimensional vectors, and if the input items i are all d_f-dimensional vectors, then how are you "adding" these values back to the inputs when d_f != d_v?
The linked script doesn't seem to do any adding like this - instead, they take the output values and pass them through a linear layer, presumably throwing away the inputs. But your comment hints at the existence of a more residual sort of approach.
The dot product of two d_k-dimensional vectors is a scalar. So the entire NxN matrix qk_ij is just a bunch of scalar numbers. The rows are used to create a weighted sum (i.e. a linear combination) of the d_v-dimensional vectors.
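A tiny numeric example may help make that concrete. The numbers here are made up purely for illustration, with d_k = 2 and d_v = 3 for three items j:

```python
import numpy as np

q_i = np.array([1.0, 0.0])            # one item's "question", d_k = 2

k = np.array([[1.0, 0.0],             # "answers" k_j from three items
              [0.0, 1.0],
              [0.5, 0.5]])

v = np.array([[1.0, 2.0, 3.0],        # values v_j, d_v = 3
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

qk = k @ q_i                          # three scalars: dot(q_i, k_j)
w = np.exp(qk) / np.exp(qk).sum()     # softmax: scores in [0, 1], summing to 1
y_i = w @ v                           # weighted sum of the v_j, shape (d_v,)
```

Each row of the NxN score matrix works exactly like w here: a set of scalar weights over j, producing one d_v-dimensional output y_i, regardless of whether d_v matches d_f.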
Correct me if I'm wrong. Here Q, K, and V are trainable. d_k and d_v are selectable metaparameters(?) d_f - input vectors' size, N is their number, or window's size. Next time we usually process another non-overlapping set of input vectors.
Correct. In the context of LLM's, the "items" I refer to would be tokens. As a particle physicist, in the transformers I work with the "items" often instead represent either particles, or detector measurements.
Thanks, I'm trying to understand all these mechanics. Matrix multiplication is equivalent to passing through a single fully connected layer, which means we could probably beef it up into a small network. It's also easy to make it work not on fixed windows but on a sliding window, if the output is processed sequentially; even if not, it may make sense too, and it can be implemented efficiently. The idea is that with fixed, non-overlapping windows the first and last elements are influenced only from one side, which is not a good thing in text processing; only the middle elements are influenced from both sides. With a sliding window, all elements currently being processed are in the middle, except at the ends of the dataset.
Here’s my one-liner explanation: attention is the sum of a dict of functions evaluated on input data. But the data is matrices, the dict keys are matrices, and the functions that are found in the dict are also matrices.
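As a toy sketch of that analogy (the names and numbers here are mine, not from any library): a hard dict returns exactly one value per lookup, while attention returns a softmax-weighted blend of every value, with keys, queries, and values all being vectors.

```python
import numpy as np

# A hard dict: one key matches exactly, one value comes back.
hard = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}

# Attention as a "soft dict": every key matches partially, and the
# result is a softmax-weighted sum of all the values.
keys   = np.array([[1.0, 0.0],        # stacked dict keys (as vectors)
                   [0.0, 1.0]])
values = np.array([[1.0, 0.0],        # stacked dict values (as vectors)
                   [0.0, 1.0]])
query  = np.array([0.9, 0.1])         # the "lookup" vector

scores  = keys @ query                # how well each key matches the query
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over keys
result  = weights @ values            # a blend of all the values
```

The query is closest to the first key, so the result leans toward the first value but still mixes in a bit of the second, which is the "soft" part.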