
Transformer: A Neural Network Architecture for Language Understanding (2017) - sonabinu
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
======
yorwba
(2017)

EDIT: discussion at the time:
https://news.ycombinator.com/item?id=15144573

------
albertzeyer
The Transformer has gained a lot of traction since it was introduced and has
become something of a standard model for translation, outperforming the
earlier approach in many cases. It differs from the previous recurrent neural
network (RNN) models (usually an LSTM, combined with attention) in that it
does not use RNNs/LSTMs at all, only attention; that is why the paper was
called "Attention Is All You Need" (https://arxiv.org/abs/1706.03762).
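
As a rough illustration of what "only attention" means here, the sketch below
computes scaled dot-product self-attention, the core operation of the
Transformer, in plain NumPy. All names and shapes are illustrative, not taken
from any particular implementation:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, Wq, Wk, Wv):
        # X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) learned projections.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        # Every position attends to every other position in one matrix
        # product, so there is no sequential recurrence to unroll,
        # unlike an RNN/LSTM.
        scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len)
        return softmax(scores) @ V               # (seq_len, d_k)

    # Example: a 5-token sequence with d_model=8, d_k=4
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))
    Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 4)

The full model stacks multi-head variants of this operation with feed-forward
layers and positional encodings, but the absence of recurrence is the point
being made here.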

However, note that there was a recent update by Google, "The Best of Both
Worlds: Combining Recent Advances in Neural Machine Translation"
(https://arxiv.org/abs/1804.09849). That paper shows that you can get similar
or better performance with RNNs, while training is more stable and can be
parallelized more easily (though it requires more computing hardware if you
want to train faster).

~~~
MrEldritch
Most recently, there's "Transformer-XL: Language Modeling with Longer-Term
Dependency" (https://openreview.net/forum?id=HJePno0cYm), which takes a
rather different approach: it adds a recurrence mechanism back into the
Transformer architecture and thereby attains very long context lengths.
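
The core idea, very roughly, is segment-level recurrence: hidden states
computed for the previous segment are cached and prepended to the keys/values
of the current segment, so attention can reach beyond the segment boundary.
The sketch below illustrates only that caching idea in NumPy; the names are
hypothetical, and it omits the relative positional encodings and
gradient-stopping details the paper also relies on:

    import numpy as np

    def attend_with_memory(X_new, memory, Wq, Wk, Wv):
        # X_new:  (seg_len, d_model) current segment;
        # memory: (mem_len, d_model) cached states from the previous segment.
        context = np.concatenate([memory, X_new], axis=0)  # old + new states
        Q = X_new @ Wq                     # queries only for the current segment
        K, V = context @ Wk, context @ Wv  # keys/values also cover the cache
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        # Carry the current segment forward as the next step's memory (the
        # real model caches per-layer hidden states, detached from autograd).
        return w @ V, X_new.copy()

    # Usage: process a 12-token stream in 3 segments of 4 tokens each
    rng = np.random.default_rng(1)
    Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
    memory = np.zeros((0, 8))  # empty cache before the first segment
    for segment in np.split(rng.normal(size=(12, 8)), 3):
        out, memory = attend_with_memory(segment, memory, Wq, Wk, Wv)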

