The Illustrated Transformer (jalammar.github.io)
65 points by ghosthamlet on Nov 1, 2018 | 4 comments



That's a great arXiv translation, a model for ML elucidation.

re: Transformers, see also:

https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c7...

which suggests that Transformers have been supplanted by simple conv2d networks that span both the input and the output; also mentioned are "hierarchical neural attention encoders", but no links; q.v. https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-atten...


This was a really great post!

Quick question: What does the decoder attend to right at the start? I still can't figure this part out. Perhaps I am missing something very simple.


From the article:

> In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.

In other words, the decoder's previous outputs (the predicted words) are fed back in as inputs, with future positions masked at each time step.
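The masking step the article describes can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: scaled dot-product scores are computed, positions above the diagonal (future tokens) are set to -inf, and the softmax then gives those positions exactly zero weight.

```python
import numpy as np

def masked_self_attention_weights(q, k):
    """Causal self-attention weights (minimal sketch).

    q, k: (seq_len, d_k) arrays of queries and keys.
    Future positions are set to -inf before the softmax, so each
    position can only attend to itself and earlier positions.
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                        # (seq_len, seq_len)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above diagonal
    scores = np.where(mask, -np.inf, scores)
    # softmax over the last axis; the -inf entries become zero weight
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

Row i of the result is a distribution over positions 0..i only, which is what lets the decoder train on a whole target sentence in one pass without peeking ahead.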

I'm not quite sure how it all flows: with several rows (one per word) going through at once as a matrix, it seems like you would need to run the whole thing forward several times per sentence, each time moving the decoded focal point to the next output word...
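For inference that intuition is right: generation is autoregressive, one forward pass per output word (the single-pass matrix form with masking is what makes *training* parallel). A hedged sketch of the loop, where `decoder_step` is a hypothetical stand-in for a full decoder forward pass that returns the next token id:

```python
def greedy_decode(decoder_step, start_token, end_token, max_len=20):
    """Autoregressive decoding sketch: one forward pass per token.

    `decoder_step` is a hypothetical function (an assumption, not an
    API from the article) that takes the tokens generated so far and
    returns the most likely next token id.
    """
    out = [start_token]
    for _ in range(max_len):
        next_tok = decoder_step(out)   # re-run the decoder on the prefix
        out.append(next_tok)
        if next_tok == end_token:
            break
    return out
```

At the very first step the prefix is just the start token, which is also the answer to the question above: the decoder's self-attention initially has only that one position to attend to (plus the encoder outputs via encoder-decoder attention).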





