
Transformers Are Graph Neural Networks - Anon84
https://graphdeeplearning.github.io/post/transformers-are-gnns/
======
treaheraher
Sorry, but I hate it. Graph->graph transforms can describe almost any
operation; they're a superset of linear transforms. When all you know about
your model is that 'it's a graph', you barely know anything at all. And
transformers are just glorified AST parsers.
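
For what it's worth, the "superset of linear transforms" point is easy to
make concrete. A minimal sketch, assuming PyTorch (the function is
illustrative, not the commenter's): a message-passing layer over a graph with
only self-loops collapses to an ordinary linear layer.

```python
import torch

def graph_transform(H, A, W):
    # One graph-to-graph layer: aggregate neighbour features through the
    # adjacency matrix A, then apply a shared linear map W.
    return A @ H @ W

H = torch.randn(5, 8)   # 5 nodes, 8 features each
W = torch.randn(8, 8)

A_self = torch.eye(5)   # a graph with only self-loops
assert torch.allclose(graph_transform(H, A_self, W), H @ W)
# With A = I the "graph transform" is exactly a linear transform;
# any other choice of A strictly generalises it.
```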

~~~
cochne
> And transformers are just glorified AST parsers.

Can you explain this?

~~~
chaitjo
I am guessing he means that Transformers 'discover' parse trees in sentences,
as they operate on fully-connected graphs but end up focusing on the most
important connections through attention.
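
A minimal sketch of that reading, assuming PyTorch (the names are
illustrative, not from the post): self-attention computes a dense,
data-dependent "adjacency" over all token pairs, and the softmax concentrates
weight on the few edges that matter.

```python
import torch
import torch.nn.functional as F

def self_attention(X, Wq, Wk, Wv):
    # X: (n_tokens, d). Every token attends to every other token,
    # i.e. the sentence is treated as a fully-connected graph.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / K.shape[-1] ** 0.5
    edges = F.softmax(scores, dim=-1)  # soft adjacency matrix, (n, n)
    return edges @ V, edges            # message passing over learned edges

d = 16
X = torch.randn(7, d)  # 7 tokens
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
out, edges = self_attention(X, Wq, Wk, Wv)
# Each row of `edges` sums to 1; in a trained model the large entries
# are the "discovered" connections, which often resemble parse trees.
```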

------
syntaxing
The top level site is pretty neat [1]! I've been trying to find a site like
this to learn more about the applications of graph NN. I particularly like the
one about combinatorial optimization/discrete optimization.

[1]
[https://graphdeeplearning.github.io/](https://graphdeeplearning.github.io/)

~~~
ivan_ah
You might also be interested in the project-based course on GNNs by
octavian.ai which is here: [https://www.octavian.ai/machine-learning-on-
graphs-course](https://www.octavian.ai/machine-learning-on-graphs-course) You
don't need to sign up -- the notebooks get posted to this page. Signup is only
if you want to receive notifications.

------
StandardFuture
> Reading new Transformer papers makes me feel that training these models
> requires something akin to black magic when determining the best learning
> rate schedule, warmup strategy and decay settings.
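
For concreteness, the schedule usually meant here is the one from the
original Transformer paper; a minimal sketch (the default constants are
exactly the knobs being called black magic):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # "Noam" schedule from "Attention Is All You Need": the learning rate
    # rises linearly for warmup_steps, then decays as 1/sqrt(step).
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```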

This is a symptom of slow decline in the machine learning field: most
researchers are too busy procuring positions at various institutions to ask
and pursue more creative, more difficult lines of questioning.

I am glad the author chooses to call out disconcerting behavior.

> these papers showed that Transformer heads can be ‘pruned’ or removed after
> training without significant performance impact.

The larger the model, the less important, or outright redundant, its
individual modules become.
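
As a rough illustration of what head pruning amounts to (my sketch, assuming
PyTorch, not the papers' exact procedure): zero out one head's output in a
trained layer and check how little the result moves.

```python
import torch

h, n, d_head, d_model = 8, 10, 32, 256
head_outputs = torch.randn(h, n, d_head)  # per-head outputs, trained layer
Wo = torch.randn(h * d_head, d_model)     # output projection

def combine_heads(head_outputs, Wo, keep_mask):
    # keep_mask: (h,) of 0./1. values; a 0 "prunes" that head entirely.
    masked = head_outputs * keep_mask.view(-1, 1, 1)
    concat = masked.permute(1, 0, 2).reshape(n, -1)
    return concat @ Wo

full = combine_heads(head_outputs, Wo, torch.ones(h))
pruned = combine_heads(head_outputs, Wo, torch.tensor([1.] * 7 + [0.]))
# The pruning papers report that for many heads this kind of ablation
# barely changes downstream metrics.
```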

~~~
IAmEveryone
The ML community has been spectacularly productive in the last few years,
comparable only to the inventors of _dark mode_. The results frequently
capture the public imagination. For every brogrammer griping about how “it’s
all basic statistics” there are hundreds of people thinking magic must be
involved.

Correlation may be != causation. But absent other evidence, I would still be
careful with accusations of any systemic issues with their work culture, or
predictions about impending doom.

I seem to remember some of these seemingly secondary parameters being, at one
point, the sole reason a model worked at all. Wasn’t it a new initialization
scheme that kicked off the current boom?

In any case, recent history is a good example of how pursuits as “uncreative”
as simply increasing model depth can produce results of dramatically
different quality.

It also feels strange to take issue with people being motivated by
publication. As far as inducing altruistic behavior goes, publications are
second only to cheap medals handed posthumously to the children of dead
soldiers. And in terms of publication criteria being aligned with some
abstract sense of “good research”, I have little doubt that creative ideas
with _some_ interesting results will find an interested audience. There may
well remain the universal problem that unsuccessful “out-there” efforts leave
you with little when they fail, but risk is as inherent to such efforts as
the chance to make it big. It’s almost tautologically impossible to
adequately reward failed efforts, because there are no measures to assess
them; indeed, where there are measures, the efforts are no longer deemed to
have failed.

To make it easier to advance along new lines of thinking, we would want new
yardsticks for judging results: new standard problem sets on which current
approaches fail dramatically. Thinking back over the last few years, I’m not
entirely sure that isn’t exactly what we’ve been doing.

~~~
YeGoblynQueenne
>> Correlation may be != causation. But absent other evidence, I would still
be careful with accusations of any systemic issues with their work culture, or
predictions about impending doom.

No lesser man than Geoff Hinton himself thinks there are systemic issues with
machine learning publications, although he doesn't foresee impending doom:

_WIRED: The recent boom of interest and investment in AI and machine learning
means there’s more funding for research than ever. Does the rapid growth of
the field also bring new challenges?_

_GH: One big challenge the community faces is that if you want to get a paper
published in machine learning now it’s got to have a table in it, with all
these different data sets across the top, and all these different methods
along the side, and your method has to look like the best one. If it doesn’t
look like that, it’s hard to get published. I don’t think that’s encouraging
people to think about radically new ideas._

_Now if you send in a paper that has a radically new idea, there’s no chance
in hell it will get accepted, because it’s going to get some junior reviewer
who doesn’t understand it. Or it’s going to get a senior reviewer who’s trying
to review too many papers and doesn’t understand it first time round and
assumes it must be nonsense. Anything that makes the brain hurt is not going
to get accepted. And I think that’s really bad._

_What we should be going for, particularly in the basic science conferences,
is radically new ideas. Because we know a radically new idea in the long run
is going to be much more influential than a tiny improvement. That’s I think
the main downside of the fact that we’ve got this inversion now, where you’ve
got a few senior guys and a gazillion young guys._

https://www.wired.com/story/googles-ai-guru-computers-think-more-like-brains/

------
lexpar
Yeah, I guess this is fine if by "Transformers are Graph Neural Networks" we
mean Transformers < GNN, rather than Transformers == GNN.

"Sentences are a fully connected graph". Ok fine, but that's a graph with
basically no information embedded in its structure. GNNs are supposed to be
useful for graphs that have interesting structure, right?

~~~
chaitjo
Not necessarily. In fact, applying the Transformer/GNN to a full graph is
seen by some as 'discovering' or identifying a useful underlying graph
structure. We had an interesting discussion about this on Twitter:
https://twitter.com/PetarV_93/status/1233820829852618754
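
One way to picture that "discovery", continuing the fully-connected-attention
view (a hypothetical sketch, assuming PyTorch): threshold a trained attention
matrix to read off a sparse graph.

```python
import torch

def discovered_graph(attention, threshold=0.2):
    # attention: (n, n) row-stochastic matrix from a trained model.
    # Keep only the strong connections to expose the implied sparse graph.
    return (attention > threshold).nonzero(as_tuple=False)

attention = torch.softmax(torch.randn(6, 6), dim=-1)  # stand-in weights
edges = discovered_graph(attention)  # (num_edges, 2) tensor of (i, j) pairs
# In a trained model, these surviving edges are the "useful underlying
# graph structure" the Twitter thread is about.
```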

------
darrenoc
It says a lot about my knowledge of ML that I went into this post fully
expecting it to be about Autobots and Decepticons.

------
ialyos
Does the use of positional embeddings mess with the GNN formulation? I'm not
familiar with the requirements for something to be a GNN, but positional
embeddings mean the graph has to capture order of occurrence, and the graph
in the post doesn't seem to do that.

~~~
chaitjo
In graph terms, positional encodings are useful for adding
sequential/temporal properties to each node in the graph. Indeed, there is
existing work on position-aware GNNs.
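
A minimal sketch of that idea, assuming PyTorch and the sinusoidal encodings
from the original Transformer paper (the helper name is mine): the
fully-connected graph itself is order-free, so position is injected
additively into each node's features.

```python
import torch

def sinusoidal_pe(n_positions, d):
    # Standard sinusoidal positional encodings (Vaswani et al., 2017);
    # assumes d is even.
    pos = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d, 2, dtype=torch.float32)
    angles = pos / (10000 ** (i / d))
    pe = torch.zeros(n_positions, d)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

node_features = torch.randn(7, 16)  # 7 words, 16-dim embeddings
node_features = node_features + sinusoidal_pe(7, 16)
# Each node now carries its sequence position, even though the graph
# structure (fully-connected) says nothing about order.
```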

------
justlexi93
Graph attention networks were inspired by Transformers. I am pretty sure they
say so explicitly in the paper.

~~~
chaitjo
Indeed, the two papers came out within months of each other iirc. The GAT
paper discusses Transformers in the context of stabilizing the learning of
attention mechanisms.

Of course, this connection may be trivial to most people, but I hadn't seen a
post on this before. So I decided to write one for myself as I studied these
architectures.
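
For comparison with the dot-product attention sketched earlier in the thread,
here is a single GAT-style head (a simplified sketch of the mechanism in the
GAT paper, assuming PyTorch; no multi-head concatenation):

```python
import torch
import torch.nn.functional as F

def gat_head(H, A, W, a):
    # H: (n, d_in) node features; A: (n, n) adjacency with self-loops.
    # GAT scores each edge with a shared vector `a` over the concatenated
    # transformed endpoint features, rather than a Transformer dot product.
    Z = H @ W                                             # (n, d_out)
    n = Z.shape[0]
    pairs = torch.cat([Z.unsqueeze(1).expand(n, n, -1),
                       Z.unsqueeze(0).expand(n, n, -1)], dim=-1)
    scores = F.leaky_relu(pairs @ a, negative_slope=0.2)  # (n, n)
    scores = scores.masked_fill(A == 0, float('-inf'))    # real edges only
    alpha = F.softmax(scores, dim=-1)
    return alpha @ Z

n, d_in, d_out = 5, 8, 8
H = torch.randn(n, d_in)
A = ((torch.rand(n, n) > 0.5).float() + torch.eye(n) > 0).float()
out = gat_head(H, A, torch.randn(d_in, d_out), torch.randn(2 * d_out))
# Unlike the Transformer, attention here is masked to the given graph
# instead of being computed over all pairs.
```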

