
Attention Is All You Need (Neural Networks) - idibidiartists
https://arxiv.org/abs/1706.03762
======
rerx
This is super interesting. I believe the general expectation was that
convolutional neural networks would soon surpass recurrent neural networks in
machine translation tasks, but this is an entirely novel approach.

~~~
visarga
This, and graph-based neural nets, are very different from CNNs and LSTMs. They
learn to split a scene into objects and then learn how those objects interact.
In this way a lot of the variation in the input is factorized out and only the
relations between compatible types of objects are learned. This leads to
stronger generalization.

If you think about it: when we eventually want to do full reasoning, how should
the data be represented? Embeddings and flat lists/matrices are not appropriate
for the way objects interrelate; it has to be some kind of graph. Here they use
multiple attention heads instead, which work in roughly the same way as graphs,
each attention head acting like a link between objects.
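
To make the "attention heads as links" picture concrete, here is a minimal
numpy sketch of a single scaled dot-product attention head (my own toy code,
not from the paper; the names and shapes are made up). The softmaxed QK^T
matrix can be read as a soft adjacency matrix over the input objects:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention_head(X, Wq, Wk, Wv):
        # X: (n_objects, d_model); Wq/Wk/Wv: per-head projection matrices
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        # weights[i, j] ~ strength of the soft "link" from object i to object j
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        return weights @ V, weights

    # toy example: 4 "objects" with 8-dim features, one 4-dim head
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 8))
    Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
    out, links = attention_head(X, Wq, Wk, Wv)
    print(links.round(2))  # each row sums to 1: a soft graph over the objects

With several such heads in parallel you get several different link patterns
per layer, which is the "multi-head" part.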

Once we have data represented as graphs we can also do simulation - we apply
the rules of each object iteratively on the graph. The graph can be seen as an
automaton, where each object updates its state by integrating information from
its neighbors. Such automata can be Turing-complete - they can represent and
simulate any computation. With simulation we can then search for optimal
solutions. It opens a lot of doors for AI.

My money is on simulation and graphs for the next level of AI.

~~~
gmitscha
I do not think graphs are where we're heading. I think flat vectors are fine,
and I would argue multi-head attention is not THAT different from gated RNNs
like LSTMs. The multiplication by weights that come out of a softmaxed
dot-product is similar to the input gate of an LSTM.
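
For what it's worth, here is that analogy spelled out in a small numpy sketch
(my own illustration, with made-up shapes): in attention a softmaxed
dot-product weight scales each value vector, while in an LSTM a sigmoid input
gate scales the candidate cell update.

    import numpy as np

    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    softmax = lambda x: np.exp(x - x.max()) / np.exp(x - x.max()).sum()

    rng = np.random.default_rng(0)
    d = 4
    q = rng.normal(size=d)
    keys, values = rng.normal(size=(3, d)), rng.normal(size=(3, d))

    # attention: content-based weights decide how much of each value gets through
    attn_weights = softmax(keys @ q / np.sqrt(d))
    attn_out = attn_weights @ values

    # LSTM input gate: a learned sigmoid gate scales the candidate update
    x, h = rng.normal(size=d), rng.normal(size=d)
    Wi, Ui, Wc, Uc = (rng.normal(size=(d, d)) for _ in range(4))
    i_gate = sigmoid(Wi @ x + Ui @ h)       # elementwise gate in (0, 1)
    candidate = np.tanh(Wc @ x + Uc @ h)    # candidate cell update
    gated = i_gate * candidate              # "how much of this input to let in"

One difference worth noting: the attention weights are normalized across
positions (they compete through the softmax), whereas LSTM gates are
independent per dimension.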

