
Attention Is All You Need - espeed
https://papers.nips.cc/paper/7181-attention-is-all-you-need
======
eggie5
This paper has a lot of prerequisites to understand. A good paper to read
first is the precursor to this one, released a year earlier:
[https://arxiv.org/abs/1606.01933](https://arxiv.org/abs/1606.01933)

------
bthornbury
I expect we'll be seeing many shakeups of what have (perhaps prematurely)
become the established norms for NN architectures (CNNs and RNNs) over the
next few years.

It's a great time to be alive!

------
pilooch
See the Google blog post from last summer,
[https://research.googleblog.com/2017/08/transformer-novel-neural-network.html](https://research.googleblog.com/2017/08/transformer-novel-neural-network.html)
A novel, simplified architecture for sequences and translation.

------
ferros
Can somebody assist in breaking this down?

~~~
RangerScience
I'm seconding this. I could not find a good resource to understand what
"attention" actually _is_.

(The next step for me would be to follow the citation trail to the original
paper, but that might not be the best place to come to an understanding of the
thing.)

~~~
visarga
Attention is just a weighted sum over a set of vectors, where the weights sum
to one. Attention weights are usually created by neural nets. The word
"attention" might seem more grandiose than what it actually does.
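To make that concrete, here's a minimal sketch of dot-product attention in NumPy. All the names (`query`, `keys`, `values`) are illustrative, not taken from any particular implementation:

```python
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability; result is non-negative and sums to one.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(query, keys, values):
    """Weights = softmax of query-key similarities; output = weighted sum of values."""
    scores = keys @ query      # one similarity score per key
    weights = softmax(scores)  # the "attention weights": non-negative, sum to one
    return weights @ values    # weighted sum over the value vectors

rng = np.random.default_rng(0)
query = rng.standard_normal(4)
keys = rng.standard_normal((3, 4))    # 3 vectors of dimension 4
values = rng.standard_normal((3, 4))
out = attention(query, keys, values)  # same shape as a single value vector
```

The output is just a convex combination of the `values`, which is the entire mechanism.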

~~~
speedplane
That may be true on a mathematical level, but that's also the answer to just
about any neural net question... it's all just a weighted sum. My
understanding of "attention" on a higher level is the ability to concentrate
more neurons on "important" areas of an image than on less important ones.

An imperfect analogy is how the human visual system has better resolution at
your eye-line's center than at its edges. In this analogy, your brain should
not waste effort processing image details in your peripheral vision.

~~~
visarga
The key element is that we use neural nets to compute the attention weights,
so attention itself is learnable.
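For instance, in a sketch like the one above, the similarity scores could pass through a learned projection; `W_q` below is an illustrative trainable parameter, so gradient descent can adjust *where* the weights concentrate:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d = 4
W_q = rng.standard_normal((d, d)) * 0.1  # trainable: shapes the attention weights

x = rng.standard_normal(d)               # current token's representation
keys = rng.standard_normal((3, d))       # representations to attend over

scores = keys @ (W_q @ x)  # similarity now depends on the learned W_q
weights = softmax(scores)  # updating W_q during training changes the weighting
```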

------
IncRnd
Reading the headline I thought the article would be about mindfulness, which
would have been nice. Reading the article I was pleasantly surprised to find a
different subject that I also enjoy. :)

------
chriswarbo
Would this have implications for using ANNs on recursive structures (trees and
graphs)? Their "position encoding" seems a little contrived, but may be
amenable to a more complex positioning scheme (e.g. paths from a root node).

Whilst there are "standard" approaches in computer vision ("CNNs applied to
<foo>") and sequence processing ("LSTM RNNs applied to <foo>"), there doesn't
seem to be any "standard" for variable-size, recursively-structured data. Sure
there's recursive ANNs, backpropagation-through-structure, etc. but they all
seem like one-off inventions, rather than accepted problem-solving tools.
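For reference, the "position encoding" in question is the paper's fixed sinusoids of different frequencies, added to the token embeddings; a minimal sketch (function name and shapes are illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)
    pos = np.arange(max_len)[:, None]      # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]  # even dimension indices
    angle = pos / (10000.0 ** (i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)            # even dims get sines
    pe[:, 1::2] = np.cos(angle)            # odd dims get cosines
    return pe

pe = positional_encoding(max_len=50, d_model=16)
# pe[pos] is a distinct fingerprint for each sequence position
```

One can see why a path-from-root scheme for trees feels like a natural generalization: anything that maps a structural position to a fixed vector could plug into the same slot.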

~~~
sdenton4
Seq2Seq is kind of a standard, but it also strikes me as pretty hacky. The
network has an encoder and a decoder mode: it reads until it finds an end-of-
input signal, then switches to decode mode. This is how absolutely nothing
works in nature.

------
phkahler
Is this really significant? I'm not an NN kind of guy but I find it an
interesting thing to follow from a distance. From the abstract, this sounds
like an important paper. Is it?

------
jorgemf
This paper was uploaded to arXiv 6 months ago (June). With the fast pace of
progress in translation in recent years, it might be outdated already.

------
m3kw9
I wonder how capsule nets could evolve using an attention model like this.

~~~
eref
Capsules basically do a kind of self-attention. But there the parent features
compete for a coupling, not the child features.

------
imurray
I suggest changing the link from the .pdf to the web page:
[https://papers.nips.cc/paper/7181-attention-is-all-you-need](https://papers.nips.cc/paper/7181-attention-is-all-you-need)

It's one click to get the pdf from there. But you also get a plain webpage
with abstract, citation details, and so on, which you can't get back to from
the PDF.

In general it's good to knock the ".pdf" off the end of all papers.nips.cc
links. Similarly turn /pdf/ links on arXiv into /abs/ links, and replace
"pdf?" in openreview.net links with "forum?".
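Those three cleanups can be sketched as a small helper (illustrative only; the URL patterns are the ones described above):

```python
def prefer_abstract_page(url):
    """Rewrite a direct-PDF link to its abstract/landing page where possible."""
    if "papers.nips.cc" in url and url.endswith(".pdf"):
        return url[: -len(".pdf")]           # knock ".pdf" off the end
    if "arxiv.org/pdf/" in url:
        return url.replace("/pdf/", "/abs/")  # arXiv abstract page
    if "openreview.net" in url:
        return url.replace("pdf?", "forum?")  # OpenReview forum page
    return url

prefer_abstract_page("https://arxiv.org/pdf/1706.03762")
# -> "https://arxiv.org/abs/1706.03762"
```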

~~~
popcorncolonel
Agreed. Plus, it's really annoying to open it on mobile: my phone starts
downloading the PDF file immediately, and I have to delete it manually
later.

------
hjjiehebebe
Abstract:

The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and
convolutions entirely. Experiments on two machine translation tasks show these
models to be superior in quality while being more parallelizable and requiring
significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014
English-to-German translation task, improving over the existing best results,
including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French
translation task, our model establishes a new single-model state-of-the-art
BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction
of the training costs of the best models from the literature.

