
Attention Mechanism in Deep Learning - ReDeiPirati
https://blog.floydhub.com/attention-mechanism/
======
kuu
In case you find Attention (and especially Transformers) interesting, I have
some saved links introducing the topic:

[http://www.peterbloem.nl/blog/transformers](http://www.peterbloem.nl/blog/transformers)

[https://nostalgebraist.tumblr.com/post/185326092369/the-tran...](https://nostalgebraist.tumblr.com/post/185326092369/the-transformer-explained)

[https://papers.nips.cc/paper/7181-attention-is-all-you-need....](https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)

[http://jalammar.github.io/illustrated-transformer/](http://jalammar.github.io/illustrated-transformer/)

[https://arxiv.org/pdf/1807.03819.pdf](https://arxiv.org/pdf/1807.03819.pdf)

~~~
BiasRegularizer
While the Transformer's self-attention (SA) is great, there are many
applications where SA doesn't apply. For a more comprehensive overview of
attention mechanisms, I often find myself coming back to Lilian Weng's post:

[https://lilianweng.github.io/lil-log/2018/06/24/attention-at...](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)

------
abhgh
In the context of attention, there is a very interesting recent paper that
warns against conflating attention and token importance - "Is Attention
Interpretable?" [1]. It was accepted at ACL 2019:

[1]
[https://www.aclweb.org/anthology/P19-1282](https://www.aclweb.org/anthology/P19-1282)

~~~
stochastic_monk
See also "Attention is not Explanation" [0].

[0]
[https://www.aclweb.org/anthology/N19-1357](https://www.aclweb.org/anthology/N19-1357)

~~~
physicsyogi
There's a rebuttal to this as well: Attention is not not Explanation.
[https://arxiv.org/abs/1908.04626](https://arxiv.org/abs/1908.04626)

------
Yuval_Halevi
>When we think about the English word “Attention”, we know that it means
directing your focus at something and taking greater notice. The Attention
mechanism in Deep Learning is based off this concept of directing your focus,
and it pays greater attention to certain factors when processing the data.

I actually think they should rename it to 'Focus Mechanism'

~~~
elcomet
Why do you think Focus Mechanism is more appropriate than Attention?

~~~
nerdponx
Less anthropomorphism in machine learning is good IMO.

~~~
bitL
We are literally talking about intelligence, for which the best model in
nature is humans. It's difficult not to be anthropomorphic in general.

~~~
nerdponx
The article is about deep learning, not AI.

~~~
bitL
There is a very limited vocabulary for the concepts you see in deep learning;
even the anthropomorphic terms are usually badly used, but you aren't going to
have many fans if you start talking about "key-value weighting of intermediate
layers" instead of "attention".
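For what it's worth, that "key-value weighting" is just scaled dot-product attention from "Attention Is All You Need": scores between queries and keys are softmax-normalized into weights, and the output is a weighted sum of values. A minimal NumPy sketch (toy shapes and names are mine, not from the article):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # scores[i, j]: how much query i "attends" to key j
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of values

# toy self-attention: 3 tokens of dimension 4, with Q = K = V = X
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
out, weights = scaled_dot_product_attention(X, X, X)
```

So "attention" here is nothing more than a data-dependent convex combination of the value vectors.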

~~~
JoeSamoa
Thank you for defending this point so rigorously. I agree.

------
phkahler
Would it be more accurate to use the word "importance" than "attention"? I
feel like the latter encroaches on "intention" and consciousness more than
these techniques warrant.

~~~
BiasRegularizer
"Importance" is a fairly overloaded word in DL, e.g. importance sampling and
importance-weighted gradients.

"Attention" works by creating an inductive bias for the upstream network,
which is analogous to human attention, and the word itself is much more
intuitive.

Keep in mind machine learning is largely a descriptive science (modeling the
behavior), whereas neuroscience is more prescriptive. So from the behavioral
perspective, "attention" is better suited than "importance".

~~~
msamwald
On the other hand, the word "attention" in deep learning is often used
non-intuitively; e.g. "one token attends to another token" in self-attention
is hardly analogous to human cognitive processes.

------
abakus
For a very easy-to-understand explanation of the Transformer and attention
mechanism, see:

[https://blue-season.github.io/transformer-in-5-minutes/](https://blue-season.github.io/transformer-in-5-minutes/)

------
eanzenberg
Is there any work on attention for CNNs or other computer vision algorithms?

~~~
lucidrains
[https://arxiv.org/abs/1904.09925](https://arxiv.org/abs/1904.09925)

------
ilaksh
Did a FAANG company patent it already? If so, can we safely assume that since
such a patent is ridiculous, it should be ignored in relation to any
commercial service that might use these techniques?

