Hopf algebra subsumes essentially all currently fashionable architectures (transformers, convnets, SSMs, diffusion models) and then some. Hopf convolution is a surprisingly general concept, and Gröbner bases are what you use to calculate the antipode.
Hopf algebra is a tensor bialgebra, i.e. a tensor algebra with built-in feedback (which subsumes autodiff).
I have a question about the claim in 6.2 that attention matrices are SPD, if you don't mind my asking.
It seems to me that accepting the empirical result that the eigenvalues are positive isn't enough to get a Fourier transform interpretation. Specifically, I don't understand the assumption that all attention matrices are symmetric. (I'm sure you know that positive eigenvalues are not enough by themselves, but for other folks reading, [[1 1/2] [1/3 1]] is a simple concrete example of a non-symmetric matrix whose eigenvalues are all positive.)
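A quick numerical check of that 2x2 example (a NumPy sketch; nothing here comes from the paper):

```python
import numpy as np

# The example above: strictly positive eigenvalues, yet not symmetric,
# so it is not SPD and its eigenvectors are not orthogonal.
A = np.array([[1.0, 1/2],
              [1/3, 1.0]])

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                                        # ~[1.408, 0.592], both positive
print(np.allclose(A, A.T))                            # False: not symmetric
print(np.allclose(eigvecs.T @ eigvecs, np.eye(2)))    # False: eigenbasis not orthonormal
```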
Consider Fig. 17 here: https://lilianweng.github.io/posts/2018-06-24-attention/ (this is Fig. 1 in "Attention Is All You Need"). I understand that you get symmetric attention matrices for the self-attention block in the input stream, as well as the masked attention matrix in the output stream (the first block). But I don't understand how you claim symmetry for the final attention mechanism, the one that combines input and output.
And if you don't get symmetry, you don't get the Fourier Transform interpretation and all the nice algebra that follows.
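For what it's worth, here's a minimal sketch of why that encoder-decoder block is the odd one out (NumPy, with made-up dimensions and no learned projections, so this is illustrative and not anything from the paper): its score matrix pairs decoder queries with encoder keys, so it need not even be square, let alone symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_enc, n_dec = 8, 6, 4                # hypothetical sizes

def scores(Q, K):
    """Scaled dot-product attention scores (pre-softmax)."""
    return Q @ K.T / np.sqrt(Q.shape[-1])

enc = rng.standard_normal((n_enc, d_model))    # encoder states (source of keys)
dec = rng.standard_normal((n_dec, d_model))    # decoder states (source of queries)

S_cross = scores(dec, enc)                     # encoder-decoder attention
print(S_cross.shape)                           # (4, 6): not even square

S_self = scores(enc, enc)                      # self-attention with Q = K
print(np.allclose(S_self, S_self.T))           # True here, but only because there are
                                               # no separate W_Q, W_K projections
```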
Sorry. You do mention a linear systems response, and that's what I meant.
In that setting, the eigenvectors work as a generalized forward and inverse Fourier transform, and the eigenvalues form the transfer function you allude to in the bold sentence: "The attention mechanism's role is the same as that of a transfer function in a linear time-invariant system, namely it calculates the frequency response of the transformer model."
Specifically, it seems to me that this requires a _symmetric_ attention matrix, which you get from the self-attention mechanisms (two of the three places where they're used in transformers), but not from all of them, notably not the one that combines the output of the first two attention mechanisms (one input, one output).
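To spell out why symmetry matters (a sketch of the linear-systems analogy as I read it, not code from the paper): for a symmetric matrix A = V diag(w) V^T, applying V^T is the forward transform, the eigenvalues w act as the transfer function on the components, and V is the inverse transform; without symmetry you lose the orthogonal eigenbasis that makes this decomposition work.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                      # a symmetric "attention-like" matrix

w, V = np.linalg.eigh(A)               # orthonormal eigenbasis: V.T @ V = I

x = rng.standard_normal(5)             # input "signal"
x_hat = V.T @ x                        # forward transform (analysis)
y_hat = w * x_hat                      # transfer function: scale each component
y = V @ y_hat                          # inverse transform (synthesis)

print(np.allclose(y, A @ x))           # True: A acts as transform -> scale -> inverse transform
```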
I think the magic comes from the antipode, which makes things symmetric. The Ising model is somewhat similar.
Rereading the paper, quite a bit has changed in my understanding of all this. My conclusions still stand, but some of the reasoning needs to be explained better.
I have written a paper on this recently https://arxiv.org/abs/2302.01834
This paper is also good https://arxiv.org/abs/1206.3620
I have a Discord channel: https://discord.cofunctional.ai