
Simplified Intro Version:

Imagine you have a smart assistant that can understand and process the words you say to it. Usually, this assistant pays equal attention to every word you say, no matter how important or unimportant each word is to the overall meaning of your message.

Now, imagine that we found a way to teach the assistant to be smarter about how it uses its "brain power." Instead of giving equal attention to every word, the assistant learns to focus more on the words that are most important for understanding what you mean. It can even adjust this focus on the fly, paying more attention to different words depending on the context of your message.

To make sure the assistant doesn't get overwhelmed, we also set a limit on how much total "brain power" it can use at any given time. It's like giving the assistant a budget and saying, "You can only spend your brain power on a certain number of words at a time." The assistant then has to decide which words are most important to focus on.

Even with this limit, the assistant is still flexible in how it uses its brain power. It might spend more on certain words and less on others, depending on what you're saying. This means that while we always know the total amount of brain power the assistant is using, it can adapt to different situations and prioritize what's most important.

When we teach the assistant using this method, it not only learns to focus its attention intelligently but also does so very efficiently. It can understand you just as well as an assistant that pays equal attention to every word, but it uses less brain power overall. This makes the assistant much faster at responding to you and processing new information.
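To make the "budget" idea concrete, here's a toy NumPy sketch. I'm assuming the budget is enforced as a top-k selection over attention scores; the function name, shapes, and k are all illustrative, not taken from the actual paper:

    import numpy as np

    def budgeted_attention(q, K, V, k=4):
        # q: (d,) query vector; K, V: (n, d) key/value matrices.
        scores = K @ q / np.sqrt(q.shape[0])   # one score per token
        # Spend the budget: keep the k highest-scoring tokens, mask the rest.
        # (Ties at the cutoff may keep a few extra; fine for a toy.)
        cutoff = np.sort(scores)[-k]
        scores = np.where(scores >= cutoff, scores, -np.inf)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()               # softmax over the survivors
        return weights @ V                     # weighted sum of values

    q = np.random.randn(8)
    K = np.random.randn(16, 8)
    V = np.random.randn(16, 8)
    out = budgeted_attention(q, K, V, k=4)

The point is that k (the budget) is fixed, so the total compute is predictable, but which tokens actually receive the budget changes with every query.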




I understand this is ELI5, but doesn’t attention already do this, the way you described? It already focuses on the most contextually relevant words in the prior sequence.


Not from a computational perspective. To calculate the attention scores you have to compare every token against every other token, which is quadratic. Every article ("a", "the", etc.) has to be scored against every other word, even though articles are only relevant within a short distance of the word they're attached to.
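To see why it's quadratic, here's a minimal NumPy sketch of the score computation; the sizes are arbitrary:

    import numpy as np

    n, d = 1024, 64   # sequence length, head dimension
    Q = np.random.randn(n, d)
    K = np.random.randn(n, d)

    # Every token is scored against every other token, so the score
    # matrix is n x n: compute and memory both grow as n^2.
    scores = Q @ K.T / np.sqrt(d)
    print(scores.shape)   # (1024, 1024), ~1M entries for 1k tokens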


Isn't that factorial, and much more costly than quadratic?


No, it's N choose 2 = N! / (2!(N-2)!) = N(N-1)/2 pairwise comparisons, which is O(N^2), so quadratic, not factorial.
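Quick sanity check that the pairwise count grows quadratically, nowhere near factorially:

    from math import comb, factorial

    for n in (4, 8, 16):
        # n choose 2 pairwise comparisons vs n! orderings
        print(n, comb(n, 2), n * (n - 1) // 2, factorial(n))
    # 4    6    6    24
    # 8    28   28   40320
    # 16   120  120  20922789888000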


The way I understood it is that for each token, the attention mechanism itself consumes a fixed amount of processor time.

The innovation here is to prioritize tokens, so that some tokens get more processor time and others get less.



