I like the way he uses a low-rank decomposition of the Value matrix instead of V...

imjonse · on April 15, 2024

It is the first time I hear about the Value matrix being low rank, so for me this was the confusing part. Codebases I have seen also have value + output matrixes so it is clearer that Q,K,V are similar sizes and there's a separate projection matrix that adapts to the dimensions of the next network layer. UPDATE: He mentions this in the last sections of the video.