
Making Transformer networks simpler and more efficient
https://ai.facebook.com/blog/making-transformer-networks-simpler-and-more-efficient/
======
strin
> In our experiments with Transformers, we observed that not all the attention
> heads utilize their attention span to the fullest. In fact, in a task of
> character-level language modeling, most of the heads were using only a small
> portion of their attention span. If we can take advantage of this property
> during training, we can reduce the computation time and memory footprint
> significantly.
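
The mechanism behind that claim is the adaptive attention span of Sukhbaatar et al. (2019), which the post builds on: each head's attention weights are multiplied by a learned soft mask that falls to zero beyond a per-head span, and a penalty on the spans pushes them as short as the task allows. Below is a minimal PyTorch sketch of that masking idea; the class and parameter names (`AdaptiveSpanMask`, `current_span`, `ramp`) are illustrative assumptions, not the released code.

```python
import torch
import torch.nn as nn


class AdaptiveSpanMask(nn.Module):
    """Soft mask letting each attention head learn how far back it attends."""

    def __init__(self, n_heads: int, max_span: int, ramp: int = 32):
        super().__init__()
        self.max_span = max_span
        self.ramp = ramp  # width of the linear ramp from 1 down to 0
        # One learnable span fraction per head (assumption: initialized at 0,
        # so heads must "earn" a longer span during training).
        self.current_span = nn.Parameter(torch.zeros(n_heads, 1, 1))

    def forward(self, attn_weights: torch.Tensor) -> torch.Tensor:
        # attn_weights: (batch, n_heads, query_len, span), last dim ordered
        # from the oldest position (distance == span) to the newest (== 1).
        span = attn_weights.size(-1)
        distance = torch.arange(
            span, 0, -1, device=attn_weights.device, dtype=attn_weights.dtype
        )
        z = self.current_span.clamp(0, 1) * self.max_span  # span in positions
        # m(d) = clamp((ramp + z - d) / ramp, 0, 1): equal to 1 inside the
        # learned span, decaying linearly to 0 just beyond it.
        mask = ((self.ramp + z - distance) / self.ramp).clamp(0, 1)
        masked = attn_weights * mask  # broadcasts over batch and query dims
        # Renormalize so each query's weights still sum to 1.
        return masked / masked.sum(dim=-1, keepdim=True).clamp(min=1e-8)


# Usage: mask softmaxed attention weights for 8 heads over a 1024-token span.
mask = AdaptiveSpanMask(n_heads=8, max_span=1024)
attn = torch.softmax(torch.randn(2, 8, 16, 1024), dim=-1)
out = mask(attn)  # same shape; weight beyond each head's span is zeroed
```

Adding an L1 penalty on `current_span` to the training loss encourages short spans; since most heads settle on small spans, keys and values beyond the largest active span can simply be dropped, which is where the reduction in computation time and memory comes from.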

