Hopf algebra subsumes essentially all currently fashionable architectures (transformers, convnets, SSMs, diffusion models) and then some. Hopf convolution is a surprisingly general concept, and Gröbner bases are what you use to calculate the antipode.
Hopf algebra is a tensor bialgebra, i.e. a tensor algebra with built-in feedback (which subsumes autodiff).
I have a question about the claim in 6.2 that attention matrices are SPD, if you don't mind my asking.
It seems to me that accepting the empirical result that the eigenvalues are positive isn't enough to get a Fourier transform interpretation. Specifically, I don't understand the assumption that all attention matrices are symmetric. (I'm sure you know that positive eigenvalues are not enough by themselves, but for other folks reading, [[1 1/2] [1/3 1]] is a simple concrete example of a non-symmetric matrix whose eigenvalues are all positive.)
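A quick numerical check of that 2x2 example (a NumPy sketch; nothing here comes from the paper):

```python
import numpy as np

# The example above: strictly positive eigenvalues, yet not symmetric,
# so it is not SPD and its eigenvectors are not orthogonal.
A = np.array([[1.0, 1/2],
              [1/3, 1.0]])

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                                        # ~[1.408, 0.592], both positive
print(np.allclose(A, A.T))                            # False: not symmetric
print(np.allclose(eigvecs.T @ eigvecs, np.eye(2)))    # False: eigenbasis not orthonormal
```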
Consider Fig. 17 here: https://lilianweng.github.io/posts/2018-06-24-attention/ (this is Fig. 1 in "Attention Is All You Need"). I understand that you get symmetric attention matrices for the self-attention block in the input stream, as well as the masked attention matrix in the output stream (the first block). But I don't understand how you claim symmetry for the final attention mechanism, the one that combines input and output.
And if you don't get symmetry, you don't get the Fourier Transform interpretation and all the nice algebra that follows.
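For what it's worth, here's a minimal sketch of why that encoder-decoder block is the odd one out (NumPy, with made-up dimensions and no learned projections, so this is illustrative and not anything from the paper): its score matrix pairs decoder queries with encoder keys, so it need not even be square, let alone symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_enc, n_dec = 8, 6, 4                # hypothetical sizes

def scores(Q, K):
    """Scaled dot-product attention scores (pre-softmax)."""
    return Q @ K.T / np.sqrt(Q.shape[-1])

enc = rng.standard_normal((n_enc, d_model))    # encoder states (source of keys)
dec = rng.standard_normal((n_dec, d_model))    # decoder states (source of queries)

S_cross = scores(dec, enc)                     # encoder-decoder attention
print(S_cross.shape)                           # (4, 6): not even square

S_self = scores(enc, enc)                      # self-attention with Q = K
print(np.allclose(S_self, S_self.T))           # True here, but only because there are
                                               # no separate W_Q, W_K projections
```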
Sorry. You do mention a linear systems response, and that's what I meant.
In that setting, the eigenvectors work as a generalized forward and inverse Fourier transform, and the eigenvalues form the transfer function you allude to in the bold sentence: "The attention mechanism's role is the same as that of a transfer function in a linear time-invariant system, namely it calculates the frequency response of the transformer model."
Specifically, it seems to me that this requires a _symmetric_ attention matrix, which you get from the self-attention mechanisms (two of the three places where they're used in transformers), but not from all of them, notably not the one that combines the output of the first two attention mechanisms (one input, one output).
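To spell out why symmetry matters (a sketch of the linear-systems analogy as I read it, not code from the paper): for a symmetric matrix A = V diag(w) V^T, applying V^T is the forward transform, the eigenvalues w act as the transfer function on the components, and V is the inverse transform; without symmetry you lose the orthogonal eigenbasis that makes this decomposition work.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                      # a symmetric "attention-like" matrix

w, V = np.linalg.eigh(A)               # orthonormal eigenbasis: V.T @ V = I

x = rng.standard_normal(5)             # input "signal"
x_hat = V.T @ x                        # forward transform (analysis)
y_hat = w * x_hat                      # transfer function: scale each component
y = V @ y_hat                          # inverse transform (synthesis)

print(np.allclose(y, A @ x))           # True: A acts as transform -> scale -> inverse transform
```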
I think the magic comes from the antipode, which makes things symmetric. The Ising model is somewhat similar.
Rereading the paper, quite a bit has changed in my understanding of all this. My conclusions still stand, but some of the reasoning needs to be explained better.
I have written a paper on this recently https://arxiv.org/abs/2302.01834
This paper is also good https://arxiv.org/abs/1206.3620
I have a Discord channel: https://discord.cofunctional.ai