
The "Attention is All You Need" paper introduced a new way for AI to read and understand language, much like how we might read a comic book.

As you read each panel of a comic book, you don't just look at the words in the speech bubbles, but you also pay attention to who's talking, what they're doing, and what happened in the previous panels. You might pay more attention to some parts than others. This is sort of like what the Transformer model does with text.

When the Transformer reads a sentence, it doesn't just look at one word at a time. It looks at all the words at once, and figures out which ones are most important to understand each other. This is called "attention." For example, in the sentence "The cat, which is black, sat on the mat," the Transformer model would understand that "cat" is connected to "black" and "sat on the mat."

The "attention" part is very helpful because, like in a comic book, understanding one part of a sentence often depends on understanding other parts. This makes the Transformer model really good at understanding and generating language.

Also, because the Transformer pays attention to all parts of the sentence at the same time, it can be faster than other models that read one word at a time. This is like being able to read a whole page of your comic book at once, instead of having to read each panel one by one.




Please don't post generated text into HN comments. HN threads are for human discussion and we ban accounts that violate this.


Apologies. Seems like a good policy.


Explaining it for a slightly older audience, a transformer is a type of artificial neural network designed for processing sequences, like sentences in a text. It's especially known for its use in natural language processing (NLP), which is the field of AI that deals with understanding and generating human language.

The Transformer is unique because it uses a mechanism called "attention" to understand the relationships between words in a sentence, which works like this:

(1) Encoding: First, the Transformer turns each word in a sentence into a list of numbers, called a vector. These vectors capture information about the word's meaning.

(2) Self-Attention: Next, for each word, the Transformer calculates a score for every other word in the sentence. These scores determine how much each word should contribute to the understanding of the current word. This is the "attention" part. For example, in the sentence "The cat, which is black, sat on the mat," the words "cat" and "black" would get high scores when trying to understand the word "black" because they are closely related.

(3) Aggregation: The Transformer then combines the vectors of all the words, weighted by their attention scores, to create a new vector for each word. This new vector captures both the meaning of the word itself and the context provided by the other words in the sentence.

(4) Decoding: Finally, in a task like translation, the Transformer uses the vectors from the encoding phase to generate a sentence in the target language. It again uses attention to decide which words in the original sentence are most relevant for each word it's trying to generate in the new sentence.

One key advantage of the Transformer is that it can calculate the attention scores for all pairs of words at the same time, rather than one at a time like previous models. This allows it to process sentences more quickly, which is important for large tasks like translating a whole book.
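
If it helps, here is a rough NumPy sketch of steps (2) and (3). The word vectors are just random toy numbers and the learned query/key/value projections from the actual paper are left out, so this is only the shape of the idea, not the real scaled dot-product attention:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    # (1) Toy "encoding": one small random vector per word (a real model learns these).
    words = ["the", "cat", "which", "is", "black", "sat", "on", "the", "mat"]
    X = np.random.default_rng(0).normal(size=(len(words), 4))   # (n_words, d_model)

    # (2) Self-attention scores: every word against every other word.
    scores = X @ X.T / np.sqrt(X.shape[1])                      # (n_words, n_words)
    weights = softmax(scores)                                   # each row sums to 1

    # (3) Aggregation: mix all word vectors, weighted by attention.
    contextual = weights @ X                                    # context-aware vector per word

    print(weights[words.index("black")].round(2))  # how much "black" attends to each word
    print(contextual.shape)                        # (9, 4): one new vector per word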


The importance of the "Attention is All You Need" paper by Vaswani et al. (2017) is that it introduced the Transformer model architecture.

The model is so named because it "transforms" one sequence into another. For example, in a machine translation task, it can transform a sentence in one language into a sentence in another language.

The key innovation of the Transformer model is the use of self-attention mechanisms. This means that instead of processing the input sequence word by word, the model considers all the words in the sequence at the same time and learns to pay "attention" to the most important ones for the given task.

In essence, the Transformer model is a design for building network architectures that can process data in parallel and focus on different parts of the data depending on the task at hand. The Transformer model has proven to be highly effective and flexible, and has been adopted in many variants and applications, including BERT, GPT, T5, and many others.


Just a quick clarification: attention of the same sort Transformers use was already being employed in RNNs for a while. Hence the name "Attention Is All You Need": it turned out you can just remove the recurrent part, which is what makes the network hard to train.


Innovation is "attention", not just "self-attention" (cross-attention for ie. translation <<encoder>>, self-attention for generation <<decoder>>).

It's a general computation model; it doesn't have to work on text only.

It's also general in the sense that you can mask it, e.g. with a lower-triangular matrix so the future doesn't influence the past (decoder, generation); leave it unmasked (e.g. in an encoder: in text translation you want attention to have access to the full input text); or anything else really.
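
A minimal sketch of that lower-triangular masking, again with toy numbers (the mask is just added to the raw scores before the softmax, so masked positions end up with zero weight):

    import numpy as np

    n = 5                                           # sequence length
    scores = np.random.default_rng(1).normal(size=(n, n))

    # Lower-triangular (causal) mask: position i may only attend to positions <= i.
    # Future positions get -inf, which becomes weight 0 after the softmax.
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    masked = scores + mask

    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    print(weights.round(2))                         # strictly upper triangle is all zeros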


great explanation. thank you all for contributing to our learning!

> in a machine translation task, it transforms a sentence in one language into a sentence in another language.

Here, English is being translated to which language? I'm assuming vectors? Might be a silly question; I'm assuming that's where the origin of the word "Transformer" lies.


> in a machine translation task, it can transform a sentence in one language into a sentence in another language.

This means translating between two human languages; that is what "a machine translation task" means.


Someone who read the paper pointed out to me recently that there's an aspect to transformers/attention that uses the sin or cos function to determine which words to pay attention to or the spacing between them (I'm probably not expressing this correctly, so please correct me if I'm wrong). It seems really unintuitive that sin and/or cos would be a factor in human language - can you explain this?


That sounds like a reference to the concept of cosine similarity.

Imagine that words are spread out in the space. Cosine similarity is a measure of similarity between two vectors (each word is encoded as a vector).

By measuring the cosine of the angle between the two vectors we can get:

1) whether 2 vectors have the same angle (2 words have the same meaning or close enough) when the cosine is close to 1

2) whether 2 vectors are perpendicular (2 words don't have anything to do with each other) when the cosine is close to zero

3) whether 2 vectors are opposite in direction (2 words have opposite meanings in some aspect) when the cosine is close to -1

Cosine similarity is like comparing two people's interests. If two people have similar interests, the angle between them is small, and the cosine similarity value will be high. If two people have completely different interests, the angle between them is large, and the cosine similarity value will be low. So, cosine similarity is a way to measure how similar two things are by looking at the angle between them.
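
In code it's just this (the three-number "embeddings" are completely made up, purely to illustrate the formula):

    import numpy as np

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Made-up embeddings, only for illustration.
    cat    = np.array([0.9, 0.8, 0.1])
    kitten = np.array([0.85, 0.75, 0.2])
    stock  = np.array([-0.7, 0.1, 0.9])

    print(round(cosine_similarity(cat, kitten), 2))  # close to 1: similar meanings
    print(round(cosine_similarity(cat, stock), 2))   # near 0 or negative: unrelated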


Not as much of an expert as others commenting here, but I believe the sine/cosine stuff comes in just because it’s a standard and very efficient way of comparing vectors.

(“Vector” is just an alternate way of talking about a coordinate point - you can say “a point is at (x,y)”, or equivalently you can also say “turn x degrees and then travel y units forward”, either method gives enough information to find the point exactly.)

I don’t think sine and cosine are actually factors in human language - rather, the process of turning words into vectors captures whatever the factors in human language are, translates them into vectors, and in that translation the factors get turned into something that sine/cosine measurements of vectors are good at picking up.

A toy example would be that arithmetic doesn’t seem to be a factor in political orientation, but if you assess everyone’s political orientation with some survey questions and then put them on a line from 0 to 10, then you could do some subtraction and multiplication operations to find numbers that are close together - ie doing arithmetic to find similar politics. The reason that works is not because arithmetic has anything to do with political orientation, it’s because your survey questions captured information about political orientation and transformed it into something that arithmetic works on.

I guess this explanation doesn’t do much except push the unintuitiveness into the embedding process (that’s the process of turning words into vectors in a way that captures their relationship to other words).


It is explained in the paper.

> In this work, we use sine and cosine functions of different frequencies:

> PE(pos,2i) = sin(pos/10000^{2i/d_model})

> PE(pos,2i+1) = cos(pos/10000^{2i/d_model})

> where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}.
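
Transcribed directly into NumPy, in case seeing it as code helps (assumes an even d_model):

    import numpy as np

    def positional_encoding(n_positions, d_model):
        # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
        # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
        pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
        i = np.arange(d_model // 2)[None, :]           # (1, d_model / 2)
        angle = pos / 10000 ** (2 * i / d_model)
        pe = np.zeros((n_positions, d_model))
        pe[:, 0::2] = np.sin(angle)                    # even dimensions
        pe[:, 1::2] = np.cos(angle)                    # odd dimensions
        return pe

    pe = positional_encoding(50, 16)
    print(pe.shape)   # (50, 16); each row gets added to that position's word vector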


Someone else can explain this better; this is based on one of the videos suggested in the replies here. Sin and cos don't have any inherent properties specific to language. They were chosen because they make positions at a fixed offset a simple linear function of each other, which the model can learn easily. Other functions could fit the bill as well.


Sine and cosine were chosen explicitly because they are non-linear functions.


complete newbie here: what is the intuition behind the conclusion that "cat" is highly related to "black" as opposed to, say, "mat"?


Attention and the Transformer make it possible to recognize that the probability of "black" applying to the cat is much, much higher than of it applying to the mat, because of the phrase "which is" between "cat" and "black".


Thank you. So this is based on the training data, I assume.


It is a lot harder to take the black out of the cat than it is to take the mat out from under it.


Humans know that; how does the transformer know that? Based on training data?


Sort of. Part of the training for a model includes telling it which parts of a sentence are important... a human points and clicks.


This is extremely important to know: that the relationships between words in the sentence are actually trained by human evaluation.


They are not.


No, that's incorrect. The connections are automatically deduced from the training data (which is just vast amounts of raw text).


> For example, in the sentence "The cat, which is black, sat on the mat," the words "cat" and "black" would get high scores when trying to understand the word "black" because they are closely related.

So what does that actually mean in terms of looking at new text? How does it know the relationships? Does it have to be bootstrapped on labeled data for a specific language up front?

Is that something done in the training process - providing example sentences and illustrating the connections between words - or is that earlier?


Is there a way to have recursively constructed attentional architectures? It would seem like the same process that you describe could be even more useful if it could be applied at the level of sentences, paragraphs, etc.


> each word in a sentence ...

Does each sentence stand alone, or is the meaning of the sentence, and the words in the sentence, influenced by the sentences that come before and after it?


Great explanation


It's ChatGPT-generated. It even leaked part of the prompt in the intro.

I especially disagree with:

> natural language processing (NLP), which is the field of AI that deals with understanding and generating human language.


If GPT4 wrote this it did a great job, and highlights how incredible and useful it can be.

Although I'm sure "ELI5 what is a transformer" was one of the RLHF prompts which got handcrafted responses from OpenAI engineers whose bread and butter is transformers, so... still a great response.


Do you disagree with anything else? That sounds like a simplification and not too bad given the target audience.


how is "attention" different from using tokens > vector database > cosine similarity?


In the context of natural language processing, the attention mechanism used in Transformer models and the process of converting tokens to vectors and calculating cosine similarity have similarities but serve different purposes.

When you convert words (tokens) into vectors and calculate cosine similarity, you're typically doing what's called "word embedding". This process captures the semantic meaning of words in a high-dimensional space. Words that have similar meanings have vectors that are close to each other in this space. Cosine similarity is a measure of how similar two vectors are, which in this context equates to how similar the meanings of two words are.

On the other hand, the attention mechanism in Transformer models is a way to understand the relationships between words within a specific context. It determines how much each word in a sentence contributes to the understanding of every other word in the sentence. It's not just about the semantic similarity of words, but also about their grammatical and contextual relationships in the given sentence.

Here's an analogy: imagine you're trying to understand a conversation between a group of friends. Just knowing the meaning of their words (like word embeddings do) can help you understand some of what they're saying. But to fully understand the conversation, you also need to know who's speaking to whom, who's agreeing or disagreeing with whom, who's changing the topic, and so on. This is similar to what the attention mechanism does: it tells the model who's "talking" to whom within a sentence.

So while word embeddings and cosine similarity capture static word meanings, the attention mechanism captures dynamic word relationships within a specific context. Both are important for understanding and generating human language.
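
A toy way to see the static-vs-contextual point in code (made-up random embeddings, plain dot-product attention with no learned projections; it only shows that the attention weight between two words depends on what else is in the sentence, while their embedding similarity never changes. Real Transformers add learned query/key/value projections and many layers on top of this):

    import numpy as np

    def attention_weights(X):
        scores = X @ X.T / np.sqrt(X.shape[1])
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(2)
    emb = {w: rng.normal(size=4) for w in ["the", "river", "bank", "money", "deposit"]}

    # Static similarity between "bank" and "river" is fixed once the embeddings exist...
    print("dot(bank, river):", round(float(emb["bank"] @ emb["river"]), 2))

    # ...but how much attention "bank" pays to "river" depends on which other words
    # are in the sentence, because the weights are normalized over the whole sentence.
    for sent in (["the", "river", "bank"],
                 ["the", "river", "bank", "money", "deposit"]):
        X = np.stack([emb[w] for w in sent])
        w = attention_weights(X)
        print(sent, "->", round(float(w[sent.index("bank"), sent.index("river")]), 2))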


Just a guess: is this answer GPT output?


From another one of his responses that he has since deleted from this thread: “Despite these challenges, researchers have found that hierarchical attention can improve performance on tasks like document classification, summarization, and question answering, especially for longer texts. As of my last training cut-off in September 2021, this is an area of ongoing research and development.”


Probably. It is a good explanation, though.


If you're specifically focused on semantic similarity, I would say that attention adds to the dimensionality of the vector space. Distances between tokens can vary depending on context.


It's orthogonal, right? How do you go from tokens to vectors? Fully connected NN? LSTM? Or transformer?


I wrote a blog post giving an overview of it, with code and explanations (more like ELI15): https://medium.com/analytics-vidhya/googles-t5-transformer-t...



