Transformer learning explained: Coinductive guide to inductive transformer heads (arxiv.org)
91 points by adamnemecek on Feb 28, 2023 | 28 comments



I don't know anything about algebraic geometry or category theory or haskell, but I know some stuff about the normal boring kind of linear algebra, and in some places the author relies on statements like "since the eigenvalues are real and positive we can conclude that it is a symmetric positive definite matrix" which I know are just wrong (it wouldn't have to be symmetric). This is particularly bad because the rest of the argument (for example in 6.2.2) explains why it's so cool that we've shown it's symmetric (we haven't).


For example, [[1, 1], [0, 1]] is a non-symmetric matrix with positive real eigenvalues.
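
A quick numpy check of that counterexample (my own snippet, not from the paper):

    import numpy as np

    # Non-symmetric matrix whose eigenvalues are real and positive
    A = np.array([[1.0, 1.0],
                  [0.0, 1.0]])

    print(np.linalg.eigvals(A))   # [1. 1.] -- real and positive
    print(np.allclose(A, A.T))    # False  -- not symmetric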


You are right, I will rectify that. The statement still holds but I need to provide a justification.


I think that sentence is confusing, but the mathematical conclusion still stands. For quadratic forms, we only care about symmetric matrices, or the symmetric part of a real matrix, because we can always replace a square matrix A in the quadratic form x^T A x by its symmetric part (A + A^T)/2 and the quadratic form keeps its original value. Then, for symmetric matrices, positive eigenvalues imply positive definiteness. Real symmetric matrices have real eigenvalues; one can prove this by taking the complex conjugate of the quadratic form.
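
A small numpy sanity check of the symmetric-part point (my illustration, not from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 3))   # arbitrary real square matrix
    x = rng.standard_normal(3)

    S = (A + A.T) / 2                 # symmetric part of A
    # the antisymmetric part contributes nothing to the quadratic form
    print(np.isclose(x @ A @ x, x @ S @ x))   # True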


Just FYI, when you look up visualizations of the attention matrices, they tend to be somewhat symmetric.


I'm the author. I have recently realized that Hopf algebras are a surprisingly good abstraction for machine learning, and this paper is an invitation for others to look into them, as they can provide a unified foundation for convnets, transformers and diffusion models.

Hopf algebras are tensorial bialgebras, meaning they are both a tensor and a cotensor at once. What's the co- prefix? You can think of it as marking "a thing with recurrence relationships". This matters because recurrence relationships can be thought of as a generalization of autodiff. The two ML building blocks, tensors and autodiff, are then unified under one umbrella.
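
For concreteness, these are the standard structure maps of a Hopf algebra H over a field k (textbook definitions, nothing specific to the paper):

    m        : H \otimes H \to H     % product
    u        : k \to H               % unit
    \Delta   : H \to H \otimes H     % coproduct
    \epsilon : H \to k               % counit
    S        : H \to H               % antipode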

Looking at transformers through the lens of Hopf algebra allowed me to identify their learning mechanism. Namely, the residual stream corresponds to what my paper calls the unit impulse path, and the attention corresponds to the convolution path of a Hopf algebra. The transformer learns by enforcing an invariant called Hopf coherence between the two paths.

Hopf convolution is a generalization of the standard convolution and transformer attention emerges as a part of this convolution. Transformer models can then be understood as linear time-invariant systems.
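
For reference, the Hopf convolution of two linear maps f, g : H \to H is the textbook convolution product, and the antipode is the convolution inverse of the identity (standard material rather than anything new here):

    f * g = m \circ (f \otimes g) \circ \Delta
    S * \mathrm{id} = \mathrm{id} * S = u \circ \epsilon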

Furthermore, I believe that Hopf coherence provides a better way of doing backprop, one that occurs within the single layers as opposed to across the whole graph, but more research is needed.

This approach also opens the door for verified machine learning, which I will discuss in future work.

I'm currently in the process of implementing a simple machine learning framework based on Hopf algebra but it might take some time.

I have a discord channel if you want to discuss this further or keep track of our progress: https://discord.cofunctional.ai


Is there any chance you can answer in simple terms whether the "Hopf coherence" method would be a faster way to do backpropagation than current methods (gradient descent)? My math skills are a bit rusty now, so I can't tell from looking at your paper alone.


Yes, that's the reason why I'm talking about it. It should be faster since it works locally (within the single layers) as opposed to across the whole graph.


Could you quantify it (twice, three times, etc.)? Another question: does it apply only to the attention heads' learning or to the whole shebang?


It's going to be more than that, significantly more. Hopf coherence theoretically does away with backprop altogether.

I'm in the process of implementing it but it might take some time.

I'm aware of Hinton's forward-forward, but I need more time to provide a comparison.


I don't understand the claims well enough to evaluate them. But the way this document is written, and where it chooses to put its attention (no pun intended), strike me as odd. The sections on the transformer and its components seem like they should be backed with a lot more specifics tying back to the (Combinatorial) Hopf algebra described in the preceding sections. The Hopf algebra section defines a bunch of notation and structure, and introduces the label "Hopf Coherence" for some invariant, which is expressed in terms of that pile of notation. The Combinatorial Hopf Algebra section introduces a "shuffle product".

But we never actually get a clear explanation of how this stuff lines up with parts of the transformer / attention mechanism. There's an _assertion_ of Hopf coherence between two paths, and in 6.2 there's a description of the QK and OV circuits as having a coproduct/product relationship, which isn't especially clear but is at least a direction. What is the antipode in the transformer context? If Hopf coherence is m(id \otimes S)\Delta = u \circ \epsilon, and OV is (or is like?) the product m, and QK is the coproduct \Delta ... what are the unit u, counit \epsilon, and antipode S?

It looks like the author never spells it out, and certainly never proves the assertion.

Instead, there are these hard-to-comprehend claims in prose, interspersed with math defining the most trivial things. In 6.2.1, it reminds us of what it means for a row to sum to 1. Then it quotes a definition of co-injection. In 6.2.2 they remind us what a symmetric matrix is.

The shuffle product defined in 5.1 doesn't get used, except to say that it will be part of future work.

I'm not saying there isn't any insight here, but since the author shows in sections 3-5 that they expect many readers won't have this background, I think they should do a clearer job of drawing the relevant connections explicitly and backing up the claims.


You are right, the paper can and will be improved. I think I assume too much familiarity with the Pang2014 paper.

Hopf coherence is not new, it's a name for an old concept that did not have a name.

Unit and counit together make the residual stream.

The antipode is a part of the OV circuit (see section 6.2.2).

You are right re: the basic definitions. I wanted them in the paper and briefly considered a separate overview section (as is customary), but there were not enough of them to warrant that. They do seem somewhat out of place at times.


When I read in the abstract that "we argue that all building blocks of transformer models can be expressed with a single concept: combinatorial Hopf algebra", it reminded me of "we introduce a modern Hopfield network with continuous states and a corresponding update rule. [...] the new update rule is equivalent to the attention mechanism used in transformers" in another paper called "Hopfield Networks is All You Need."

So Hopf algebras and Hopfield networks are both equivalent to transformers. Cool. I didn't know what either of them actually were, so I checked if they were the same as each other, like if Hopf was short for Hopfield. No, they are completely different things, except that apparently both are equivalent to transformers.

Something about that coincidence made me less confident about both of them. I can't really explain it. It has the same vibe as if Stephen Wolfram says that transformers are really cellular automata and the LessWrong people say that transformers are really Bayesian updates. It gives the feel of going into a building where four people all say they are Napoleon Bonaparte, and it makes me wonder what kind of building it is.


I like the Hopfield network paper.

How do you think a transformer model learns?

There is an overlap between Hopfield and Hopf; IIRC the Ising model is related to the Hopf antipode.

If you want to discuss this further, join my discord https://discord.cofunctional.ai.


> How do you think a transformer model learns?

Sorry I don't know much about transformers, it's why I was reading! My small understanding is that transformers allow a kind of 'differentiable indirection' that has become practical to train now that we have these modern GPUs and automatic differentiation frameworks coupled with training algorithms.


You are right but here's the kicker: they learn even without autodiff! So my paper tries to explain that. I argue that Hopf coherence is a better autodiff.

You should peep the Pang2014 paper; I think you will then understand better what's happening.


Is it possible to explain what you are doing like I'm a normie who doesn't know and isn't willing to learn algebraic geometry, category theory, or control theory? In other words, explain only in terms of plain finite dimensional linear algebra like eigenvalues, left- and right-eigenvectors, row- and column-stochastic matrices, conjugate transpose, trace, matrix inverse, matrix exponential, etc.?


Hopf algebras are analogous to electrical circuits with feedback and incrementally "learn" eigenvalues by processing incoming data without needing additional processing.
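
As a rough analogy only (not the paper's actual mechanism, just a familiar plain-linear-algebra example of learning spectral information from streaming data), Oja's rule estimates the top eigenvector of the data covariance one sample at a time:

    import numpy as np

    rng = np.random.default_rng(0)
    cov = np.diag([3.0, 1.0, 0.5])         # top eigenvector is e_1
    w = rng.standard_normal(3)
    w /= np.linalg.norm(w)

    eta = 0.01
    for _ in range(5000):
        x = rng.multivariate_normal(np.zeros(3), cov)
        y = w @ x                          # "neuron" output
        w += eta * y * (x - y * w)         # Oja's rule: online Hebbian update

    print(w)   # approximately +-[1, 0, 0], the leading eigenvector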


Hey Adam, if you're interested in smart people looking at your ideas and providing criticism, you should submit your papers to an ML conference for peer review.


I will; this is a first approximation. I'm working on an implementation that will provide further guidance for the paper.


The Geometric Deep Learning theory developed by Bronstein et al. is also a very nice unifying theory for deep learning, using Graph Neural Networks. Did you study how it interacts with your approach?


I did, yes, and there is a good amount of overlap. In fact, I will be speaking with one of the authors later this week.

However, I find that my approach relies on fewer concepts. Our approaches share some concepts, but what I like about mine is that all of the concepts are just parts of the Hopf algebra, whereas in GDL they are somewhat ad hoc.

My approach also ties in nicely with other things like linear logic and PL theory, and therefore provides concrete steps toward building the next-gen machine learning framework, which might very well provide verifiable machine learning.

My approach fundamentally says that transformer models, diffusion models and convnets can be expressed with Hopf convolution, which is a generalization of the standard convolution.

My approach also provides a new learning procedure that does away with the backward pass and replaces it with something resembling the feedback found in electronic circuits.


The book you mention is quite easy to read and didactically well structured; the paper was hard to follow.

Just to encourage people to read the book, even if they did not grok the paper.


You comment elsewhere that the Hopf coherence-based optimization procedure works locally. How does it compare to local predictive coding techniques?

E.g.: https://openreview.net/pdf?id=PdauS7wZBfC


I have to read the paper in more detail but I imagine there's a good amount of overlap.


As an EE, trying to understand what this was about fried my brain.


EE is actually a good place to be. Hopf algebras are like LTIs. There's a Laplace transform/z-transform relationship between the unit impulse path and the convolution path.
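
As a generic LTI refresher (not the transformer construction itself): an LTI system is fully characterized by its impulse response, and its output is the convolution of the input with that response, e.g.:

    import numpy as np

    h = np.array([1.0, 0.5, 0.25])        # impulse response of an LTI system
    x = np.array([1.0, 0.0, 0.0, 2.0])    # input signal

    y = np.convolve(x, h)                 # LTI output = input convolved with impulse response
    print(y)                              # [1.   0.5  0.25 2.   1.   0.5 ]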


I think the person you are replying to is talking about the naming in the title. Transformers, electrically, are used to change voltages and work by electromagnetic induction. I was also super confused by it, since this is about machine learning and not electrical applications.



