
The Illustrated Transformer - ghosthamlet
https://jalammar.github.io/illustrated-transformer/
======
angel_j
that's a great arxiv translation, a model for ML elucidation

re: Transformers, see also:

https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0

which suggests that Transformers have been supplanted by simple conv2d
networks that span both the input and the output; also mentioned are
"hierarchical neural attention encoders", but with no links; q.v.
https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf

------
activatedgeek
This was a really great post!

Quick question: What does the decoder attend to right at the start? I still
can't figure this part out. Perhaps I am missing something very simple.

~~~
angel_j
From the article:

> _In the decoder, the self-attention layer is only allowed to attend to
> earlier positions in the output sequence. This is done by masking future
> positions (setting them to -inf) before the softmax step in the self-
> attention calculation._

In other words, the decoder's outputs (i.e. the translated words so far) are
fed back in as inputs, with future positions masked out at each time-step.
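The masking step the article describes can be sketched in a few lines of NumPy (a minimal illustration, not the paper's actual code): future positions get -inf before the softmax, so they end up with exactly zero attention weight.

```python
# Minimal sketch of causal (masked) self-attention weights.
# Future positions (j > i) are set to -inf before the softmax,
# so each row can only attend to itself and earlier positions.
import numpy as np

def masked_softmax(scores):
    """Apply a causal mask to a (seq_len, seq_len) score matrix,
    then softmax each row."""
    seq_len = scores.shape[0]
    # Upper-triangular entries above the diagonal are "future" positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax; exp(-inf) becomes exactly 0.
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

weights = masked_softmax(np.ones((3, 3)))  # uniform raw scores
print(weights)
# Row 0 attends only to position 0; row 1 splits 0.5/0.5 over
# positions 0-1; row 2 splits 1/3 over positions 0-2.
```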

I'm not quite sure how it all flows, b/c with several rows representing words
all going through at once (a matrix), it seems like you would need to run the
whole thing forward several times per sentence, each time moving the decoded
focal point to the next output word...

