I did my master's thesis on the Gaussian elimination part of Gröbner basis computation, making multiple optimizations to the Faugère-Lachartre method (working with Faugère himself), and had the first public C (C++-ish) implementation at the time. The thesis can be found here: https://hpac.imag.fr/gbla/data/bib/Mart12.pdf.
I was at a conference in Germany where Faugère presented his work. The room was stunned, but didn't understand. Hendrik Lenstra asked me to spend several days with Faugère, then give an expository talk. My marching orders: Assume nothing: groups, rings, fields.
I ended up lecturing on Gröbner bases and Faugère's algorithm to an audience that included Buchberger and Faugère. Faugère appreciated my efforts. Buchberger was like a shaken-up soda bottle; he made a twenty-minute speech during the question period.
I coauthored the computer algebra system Macaulay, which first introduced Gröbner bases to algebraic geometry, opening up the field to applications. At the time there was a separate community in France studying "standard bases", from which I learned a lot. Their focus was on power series rather than polynomials, but the translation is easy.
Algebraic geometry is a great subject, and computerizing it was a noble cause, but it thrust me into a backwater where the difference between polynomials and power series is considered significant. I got out.
Author here. Gröbner bases are powerful tools for understanding polynomials. I started on this article when I encountered them during research and decided to write down some notes. After a few months, those notes turned into this blog post. It's quite a long read, so let me know if you have any questions.
A tip for your LaTeX: I would use \langle and \rangle to surround the generators of an ideal, which looks better than using < and >. And I think they are called angle brackets, not square brackets (which would be [,]).
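For example (the generator names f_1, ..., f_k here are just placeholders, not from the post):

  % relational signs: too tight and semantically the wrong symbol
  $I = < f_1, \ldots, f_k >$
  % angle brackets: proper delimiters for a generated ideal
  $I = \langle f_1, \ldots, f_k \rangle$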
Inequalities can be transformed into equations quite easily, by adding variables:
Consider a strict inequality with real variables x_i:
f(x_1,...,x_n) > 0
By adding the variable y we obtain an equivalent equation:
f(x_1,...,x_n) = y^-2
This is correct because y^-2 is always strictly positive, and conversely any strictly positive value of f can be written as y^-2 by taking y = 1/sqrt(f).
If f is a polynomial, the above can be written as a polynomial equation like so:
f(x_1,...,x_n) * y^2 = 1
To transform a non-strict inequality into an equation, on the other hand, the procedure is the same, just with exponent 2 instead of -2. That is, the inequality f(x_1,...,x_n) >= 0 becomes the equation f(x_1,...,x_n) = y^2.
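A quick sanity check of the trick (a sketch using SymPy; the particular f and the sample points are made up for illustration):

  from sympy import symbols, solve

  x, y = symbols('x y')
  f = x - 1                               # example polynomial; f > 0 means x > 1

  # Strict inequality f > 0 encoded as the polynomial equation f*y**2 = 1.
  eq_strict = f*y**2 - 1
  # At a point where f > 0 (say x = 5) there is a real y solving it...
  print(solve(eq_strict.subs(x, 5), y))   # [-1/2, 1/2]
  # ...and at a point where f <= 0 (say x = 0) there is no real y.
  print(solve(eq_strict.subs(x, 0), y))   # [-I, I], i.e. no real solutions

  # Non-strict inequality f >= 0 is encoded as f = y**2 instead:
  eq_nonstrict = f - y**2
  print(solve(eq_nonstrict.subs(x, 1), y))  # [0]: x = 1 satisfies f >= 0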
Trouble starts when your field doesn't always have a solution for f(x_1,...,x_n) = y^2, like rational numbers (but I'm sure you can find more examples).
But maybe that can be mitigated as well?
I'm not an expert in this area, but, precisely for the reason you mention, I'd expect it's easier to solve a polynomial inequality over ℚ by solving it over ℝ and intersecting down to ℚ, rather than by working directly over ℚ.
For systems of polynomial inequalities, the appropriate tool is Cylindrical Algebraic Decomposition (this tool generalizes to systems of exponential inequalities as well).
I tried using Gröbner bases in grad school to solve implicit polynomial equations (part of an optimization problem) but quickly found that it didn't scale. I ran into trouble with a system of about 300 polynomials, and this was with Maple, which implemented state-of-the-art algorithms at the time.
The worst-case computational complexity of Gröbner basis computation is doubly exponential in the number of variables.
I have been looking into them lately in the context of ML and I have a hunch that in the future, all ML will be running a Grobner basis algorithm on a large scale.
Hopf algebras subsume essentially all currently fashionable architectures (transformers, convnets, SSMs, diffusion models) and then some. Hopf convolution is a surprisingly general concept, and Gröbner bases are the tool used to calculate the antipode.
A Hopf algebra is a tensor bialgebra, i.e. a tensor with a built-in feedback (which subsumes autodiff).
I have a question about the claim in 6.2 that attention matrices are SPD, if you don't mind my asking.
It seems to me that accepting the empirical result that the eigenvalues are positive isn't enough to get a Fourier Transform interpretation. Specifically, I don't understand the assumption that all attention matrices are symmetric. (I'm sure you know that positive eigenvalues are not enough by themselves, but for other folks reading, [[1 1/2] [1/3 1]] is a simple concrete example.)
Consider Fig. 17 here: https://lilianweng.github.io/posts/2018-06-24-attention/ (this is Fig.1 in Attention is all you need). I understand that you get symmetric attention matrices for the self-attention matrix in the input stream, as well as the masked attention matrix in the output stream (the first block). But I don't understand how you claim symmetry for the final attention mechanism that combines input and output.
And if you don't get symmetry, you don't get the Fourier Transform interpretation and all the nice algebra that follows.
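To make the point concrete (a small check with NumPy; the matrix is the 2x2 example above):

  import numpy as np

  # Positive eigenvalues but not symmetric, so the eigenvectors are not
  # orthogonal and there is no orthogonal "Fourier-like" change of basis.
  A = np.array([[1.0, 1/2],
                [1/3, 1.0]])

  eigvals, eigvecs = np.linalg.eig(A)
  print(eigvals)                          # approx 1 +/- 0.41, both positive
  print(np.allclose(A, A.T))              # False: not symmetric
  print(eigvecs[:, 0] @ eigvecs[:, 1])    # nonzero: eigenvectors not orthogonal

  # A symmetric matrix, by contrast, is diagonalized by an orthogonal matrix,
  # which is what gives the forward/inverse-transform interpretation.
  S = (A + A.T) / 2
  w, Q = np.linalg.eigh(S)
  print(np.allclose(Q @ np.diag(w) @ Q.T, S))   # True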
Sorry. You do mention a linear systems response, and that's what I meant.
In that setting, the eigenvectors work as a generalized forward and inverse Fourier transform, and the eigenvalues form the transfer function you allude to in the bold sentence: "The attention mechanism's role is the same as that of a transfer function in a linear time-invariant system, namely it calculates the frequency response of the transformer model."
Specifically, it seems to me that this requires a _symmetric_ attention matrix, which you get from the self-attention mechanisms (two of the three places where they're used in transformers), but not from all of them, notably not the one that combines the output of the first two attention mechanisms (one input, and one output).
I think that the magic comes from the antipode which makes things symmetric. The Ising model is somewhat similar.
Rereading the paper, quite a bit has changed in my understanding of all this. My conclusions still stand, but some of the reasoning needs to be explained better.
Gröbner basis methods do not scale well as the number of variables increases. The computational complexity becomes too high. Numerical Algebraic Geometry methods give the correct answer in many cases where there is a practical reason to solve the equations, but not a provably correct answer.
One random thing I wonder: in the lexicographic ordering, is there any reason why the convention is that x_2 < x_1, while x^2 > x^1? (Here _ denotes subscripts and ^ denotes exponents.) In other words, x > y uses the opposite of alphabetic ordering, while the exponents are compared in the usual numeric order. I don't think the naming of the variables matters for the mathematical result (you can use any subscript or letter for any variable as long as the corresponding ones remain the same between the equations), so the convention x_1 < x_2 would work just as well, and be less confusing I think.
This is a good point. I think it's just convention: people tend to put terms involving x first, and the textbook by Cox et al. that I referenced uses this convention as well.
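A small illustration (a sketch using SymPy; the polynomial is made up): with the generators listed as (x, y), lex treats x > y, i.e. earlier-listed variables are "bigger", while exponents compare numerically as usual.

  from sympy import symbols, Poly

  x, y = symbols('x y')
  p = Poly(y**5 + x*y + x**2, x, y)

  # Terms sorted from largest to smallest monomial under lex with x > y:
  # any term containing x beats any pure power of y, and x**2 beats x**1.
  print(p.terms(order='lex'))
  # expect the exponent vectors in the order (2, 0), (1, 1), (0, 5),
  # i.e. x**2 > x*y > y**5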
In doctoral school I spent some time applying the state-of-the-art methods to trying to break lightweight symmetric ciphers. The idea was that the system of polynomials generated from a number of plaintext/ciphertext pairs might be solvable via Gröbner bases methods if the number of rounds of the cipher was low enough.
It quickly ran out of steam after a couple of rounds and ~200 polynomials or thereabouts (doubly exponential complexity).
Naive question, but gotta ask: can Gröbner basis get over the "in general, a polynomial of degree >=5 can't be solved algebraically" hump?
Given that the above is a theorem (from Galois IIRC) I'd assume the answer is no.
And given that any system of polynomial equations in multiple variables, even of low degree, eventually reduces to a univariate polynomial of very high degree, then... what are Gröbner bases actually useful for?
You know how someone walking down the street might give you a collection of vectors over a field, and ask questions about the subspace the vectors generate? You might make an orthonormal basis set of vectors which generate the same subspace, as a means to more easily answer questions about the original vectors.
Groebner bases play a role similar to a minimal orthonormal basis for a set of arbitrary vectors in a vector space, except the set of things to be linearly combined, and about which the linear combinations will be asked questions, are not tuples of floats (vectors) but instead are polynomials.
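To make the analogy concrete (a sketch using SymPy; the polynomials are made up): just as an orthonormal basis lets you check by projection whether a vector lies in a span, a Groebner basis lets you check by reduction whether a polynomial lies in the ideal generated by the original set.

  from sympy import symbols, groebner

  x, y = symbols('x y')
  F = [x**2 + y, x*y - 1]                  # the "collection handed to you"

  G = groebner(F, x, y, order='grevlex')   # the better-behaved generating set

  # Membership test: p is built as x*F[0] + y*F[1], so it lies in the ideal,
  # and reduction modulo G detects that (p reduces to zero).
  p = x*(x**2 + y) + y*(x*y - 1)
  print(G.contains(p))                     # True
  print(G.contains(x + y))                 # False: x + y is not in the ideal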
>You know how someone walking down the street might give you a collection of vectors over a field, and ask questions about the subspace the vectors generate?
they can be very insistent but legally you don't have to answer them.
> You know how someone walking down the street might give you a collection of vectors over a field, and ask questions about the subspace the vectors generate
What on earth...? I can't make sense of the last 2/3rds, and the first 1/3rd is just weird. What does any of it mean anyway?
Most of these concepts (except Groebner bases) are basic stuff you would learn in a first course in linear algebra. If you don't know what they mean, this post is likely not relevant to you.
Galois actually proved that you cannot express the solutions using a certain class of expressions (compositions of polynomials, division, and taking roots). Using Gröbner bases, it is possible to construct an algorithm that computes the solutions (now, I'm not sure if it's always possible; I'm not an expert. But this distinction is important nonetheless).
If you can't algebraically solve the univariate polynomial, you can still get some solutions to the system as long as you can obtain some solutions to the univariate polynomial numerically.
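Concretely (a sketch using SymPy; the system is an arbitrary example, not from the thread): a lex Gröbner basis puts the system in triangular form, its last element is univariate, and numeric root-finding plus back-substitution recovers the solutions even when no solution in radicals exists.

  from sympy import symbols, groebner, Poly

  x, y = symbols('x y')
  F = [x**2 + y**2 - 4, x*y - 1]           # example system

  # With lex order and x > y, the basis is triangular:
  # the last element involves only y.
  G = groebner(F, x, y, order='lex')
  univ = Poly(G.exprs[-1], y)

  solutions = []
  for y0 in univ.nroots():                 # numeric roots, no radicals needed
      # Back-substitute y0 and solve the remaining equations for x numerically.
      for g in G.exprs[:-1]:
          gx = Poly(g.subs(y, y0), x)
          if gx.degree() >= 1:
              solutions += [(x0, y0) for x0 in gx.nroots()]

  print(solutions)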
Many ideas from the thesis ended up in the GBLA library used for fast Gaussian Elimination over these large matrices: https://dl.acm.org/doi/10.1145/2930889.2930914
This is a nice presentation from Christian Eder explaining how the algorithm works and the different optimization techniques implemented in GBLA https://www-user.rhrk.uni-kl.de/~ederc/download/gbla-aca-201...