> In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
In other words, the decoder's output logits (i.e. the translated words) are fed back in as its input, with the future words masked at each time step.
I'm not quite sure how it all flows, because with several rows representing words all going through at once (as a matrix), it seems like you would need to run the whole thing forward several times per sentence, each time moving the decoding focus to the next output word...
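For what it's worth, here's a minimal NumPy sketch of how I understand the masking inside a single attention head (the function and variable names are mine, and multiple heads, residuals, and layer norm are left out). The point is that during training the whole target matrix goes through in one forward pass and the mask alone hides future words from each row; it's only at inference time that the decoder runs once per generated token, feeding the growing output back in:

```python
import numpy as np

def masked_self_attention(x, W_q, W_k, W_v):
    """Toy single-head self-attention with a causal (look-ahead) mask.

    x: (seq_len, d_model) decoder input embeddings.
    W_q, W_k, W_v: (d_model, d_k) projection matrices.
    """
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)              # (seq_len, seq_len)

    # Causal mask: position i may only attend to positions <= i.
    # Future positions get -inf so softmax assigns them weight 0.
    seq_len = scores.shape[0]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)

    # Row-wise softmax over the masked scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# All 4 "rows" (words) are processed in a single forward pass;
# the mask guarantees row i never sees rows i+1, i+2, ...
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                      # 4 tokens, d_model = 8
W = [rng.normal(size=(8, 8)) for _ in range(3)]
out = masked_self_attention(x, *W)
print(out.shape)                                 # (4, 8)
```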