
Simplified Intro Version:

Imagine you have a smart assistant that can understand and process the words you say to it. Usually, this assistant pays equal attention to every word you say, no matter how important or unimportant each word is to the overall meaning of your message.

Now, imagine that we found a way to teach the assistant to be smarter about how it uses its "brain power." Instead of giving equal attention to every word, the assistant learns to focus more on the words that are most important for understanding what you mean. It can even adjust this focus on the fly, paying more attention to different words depending on the context of your message.

To make sure the assistant doesn't get overwhelmed, we also set a limit on how much total "brain power" it can use at any given time. It's like giving the assistant a budget and saying, "You can only spend your brain power on a certain number of words at a time." The assistant then has to decide which words are most important to focus on.

Even with this limit, the assistant is still flexible in how it uses its brain power. It might spend more on certain words and less on others, depending on what you're saying. This means that while we always know the total amount of brain power the assistant is using, it can adapt to different situations and prioritize what's most important.

When we teach the assistant using this method, it not only learns to focus its attention intelligently but also does so very efficiently. It can understand you just as well as an assistant that pays equal attention to every word, but it uses less brain power overall. This makes the assistant much faster at responding to you and processing new information.
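To make the "budget" idea concrete, here's a toy NumPy sketch. I'm assuming the budget is enforced as a top-k selection over attention scores; the function name, shapes, and k are all illustrative, not taken from the actual paper:

    import numpy as np

    def budgeted_attention(q, K, V, k=4):
        # q: (d,) query vector; K, V: (n, d) key/value matrices.
        scores = K @ q / np.sqrt(q.shape[0])   # one score per token
        # Spend the budget: keep the k highest-scoring tokens, mask the rest.
        # (Ties at the cutoff may keep a few extra; fine for a toy.)
        cutoff = np.sort(scores)[-k]
        scores = np.where(scores >= cutoff, scores, -np.inf)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()               # softmax over the survivors
        return weights @ V                     # weighted sum of values

    q = np.random.randn(8)
    K = np.random.randn(16, 8)
    V = np.random.randn(16, 8)
    out = budgeted_attention(q, K, V, k=4)

The point is that k (the budget) is fixed, so the total compute is predictable, but which tokens actually receive the budget changes with every query.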




I understand this is ELI5, but doesn’t attention already do this, the way you described? It already focuses on the most contextually relevant words in the prior sequence.


Not from a computational perspective. To calculate the attention scores you have to compare every token against every other token, which is quadratic. Every article ("a", "the", etc.) has to be scored against every other word, even though articles are only relevant within a short distance of the word they're attached to.
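To see why it's quadratic, here's a minimal NumPy sketch of the score computation; the sizes are arbitrary:

    import numpy as np

    n, d = 1024, 64   # sequence length, head dimension
    Q = np.random.randn(n, d)
    K = np.random.randn(n, d)

    # Every token is scored against every other token, so the score
    # matrix is n x n: compute and memory both grow as n^2.
    scores = Q @ K.T / np.sqrt(d)
    print(scores.shape)   # (1024, 1024), ~1M entries for 1k tokens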


Isn't that factorial, and much more costly than quadratic?


No, it's N choose 2 = N! / (2!(N-2)!) = N(N-1)/2 pairwise comparisons, which is O(N^2), so quadratic, not factorial.
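Quick sanity check that the pairwise count grows quadratically, nowhere near factorially:

    from math import comb, factorial

    for n in (4, 8, 16):
        # n choose 2 pairwise comparisons vs n! orderings
        print(n, comb(n, 2), n * (n - 1) // 2, factorial(n))
    # 4    6    6    24
    # 8    28   28   40320
    # 16   120  120  20922789888000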


The way I understood it is that for each token, the attention mechanism itself consumes a fixed amount of processor time.

The innovation here is to prioritize tokens, so that some tokens get more processor time and others get less.



