
There is a particularly nice geometric interpretation of attention that I recently realised in a flash of enlightenment. It's best explained with an interactive Desmos plot (the black dot is draggable):

https://www.desmos.com/calculator/3rtqsyapxo

The above assumes the columns of K are normalised, but bear with me. K and V together form a vector database. V holds the payloads, each row containing a vector of data. K describes the positions of these points in space, on the surface of a hypersphere. The query vector describes the query into the database: its direction describes the point in space being queried, and its magnitude describes the radius of the query. The result is a weighted average of the vectors in V, weighted by their keys' distance from the query direction, scaled by the query radius (with a smooth Gaussian falloff).

I also recommend a recent paper from Nvidia that derives a significant speedup by normalising vectors to a hypersphere: https://arxiv.org/abs/2410.01131v1
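Here's a minimal NumPy sketch of that reading (the shapes, names, and random data are mine, purely for illustration, not from the plot or the paper). It shows that with unit-norm keys, the usual softmax weights are exactly a Gaussian falloff in the distance between each key and the query direction, with the falloff width controlled by the query magnitude:

    import numpy as np

    rng = np.random.default_rng(0)
    n_tokens, d = 8, 4

    # K: key positions on the unit hypersphere (rows normalised), V: payloads.
    K = rng.normal(size=(n_tokens, d))
    K /= np.linalg.norm(K, axis=1, keepdims=True)
    V = rng.normal(size=(n_tokens, d))

    # A single query: direction = where on the sphere we look,
    # magnitude = how wide the query is.
    q = 3.0 * K[2] + 0.1 * rng.normal(size=d)

    # Standard attention weights for this one query.
    w_softmax = np.exp(q @ K.T)
    w_softmax /= w_softmax.sum()

    # The same weights rewritten as a Gaussian falloff in the distance
    # between each key and the query *direction*; |q| sets how tight the
    # falloff is (the "query radius" above). The identity used is
    # q . k_i = |q| * (1 - ||q_hat - k_i||^2 / 2) when ||k_i|| = 1.
    q_hat, q_norm = q / np.linalg.norm(q), np.linalg.norm(q)
    dist2 = np.sum((K - q_hat) ** 2, axis=1)
    w_gauss = np.exp(-q_norm * dist2 / 2)
    w_gauss /= w_gauss.sum()

    assert np.allclose(w_softmax, w_gauss)   # identical weightings
    output = w_softmax @ V                   # weighted average of payloads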




It looks fascinating, but I don't understand it. I haven't gone deeply into the theory of attention networks yet.

Can you explain the desmos plot in simple terms?


Attention is a product of three matrices, s(QK)V, where s is softmax. Each matrix has as many rows (Q and V) or columns (K) as there are tokens in your context. The plot looks at the processing of a single row of Q (predicting a single token from the previous ones), called q. Here q is a 2-element vector, visualised as the draggable dot (imagine a line from the origin to the dot). The K matrix is shown as green dots; each previous token in the context window is represented as a separate dot. The distance of a blue dot from its corresponding green dot represents how much information from that token gets mixed into the output of the query. The green dots lie on a hypersphere, a 1D manifold in 2D space. In a real network it would be more like a 127D manifold in 128D space, but the analogy works there as well. You can see how the query gathers information stored on the surface of the manifold: q's direction selects the region of space being queried, and q's magnitude controls how much of that region is gathered.
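If it helps, here's a toy version of that full s(QK)V product, mirroring the 2D setup of the plot (the shapes and values are made up for illustration; K is laid out with one column per token to match the convention above):

    import numpy as np

    def attention(Q, K, V):
        """softmax(QK) V, with softmax taken row-wise over the context tokens."""
        scores = Q @ K                               # (n_tokens, n_tokens)
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        return weights @ V, weights

    # 2D toy mirroring the plot: the keys are the green dots on the unit circle.
    n_tokens = 6
    angles = np.linspace(0, 2 * np.pi, n_tokens, endpoint=False)
    K = np.stack([np.cos(angles), np.sin(angles)])       # (2, n_tokens)
    V = np.arange(n_tokens, dtype=float).reshape(-1, 1)  # one payload per token
    Q = np.array([[2.0, 0.0]])                           # a single draggable q

    out, weights = attention(Q, K, V)
    print(weights.round(3))  # how much each token contributes to the output
    print(out)               # the weighted average of the payloads

Dragging q further from the origin (larger magnitude) makes the weights concentrate on the nearest green dot; dragging it toward the origin spreads them out.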


oh wow that makes more sense now!

and what are the orange dots? sorry if I missed that


That's just the same distribution laid out along a line instead of a circle.


Yeah, I believe this intuition was first introduced by the Neural Turing Machine line of work and later simplified in the AIAYN paper (the NTM maintains an "external memory", a.k.a. weight_keys and weight_values here).

Disclaimer: this is from memory, which could be entirely wrong.



