That's a great summary, but it's important to understand that much more goes into training these models. The architecture is not any kind of secret sauce, or special in any way. It's just a typical Transformer. I call this "architecture porn" - people love looking at neural net architectures and think that's the key to success. If only you knew the algorithm! It's so simple!
But reality is usually much messier. The real training code will be littered with hundreds of ugly little tricks to make it work. A large part of it will be input preprocessing and data engineering, tricks to deal with exploding/vanishing gradients, monitoring, learning rate schedules and optimizer cycling, complexity for distributed training, regularization tricks, changing parts of the architecture for performance reasons (like attention), and so on.
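To give a flavor of a couple of those tricks, here is a minimal, hypothetical PyTorch-style sketch of gradient clipping plus learning-rate warmup. The numbers and the model are purely illustrative stand-ins, not anyone's actual training code:

    import torch
    from torch import nn

    model = nn.Linear(512, 512)                  # stand-in for a real transformer
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    # Linear warmup over the first 1000 steps, then hold (schedule shape is illustrative).
    sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda step: min(1.0, (step + 1) / 1000))

    for step in range(10):                       # skeleton of a training loop
        x = torch.randn(8, 512)
        loss = model(x).pow(2).mean()            # dummy loss
        opt.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # guard against exploding gradients
        opt.step()
        sched.step()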
I'm far from an expert in this field, but based on my conversations with people who are, I think this is getting less true. Normally these models are trained with straightforward optimizers (basically naive SGD), since advances like batch normalization and residual connections make the fancier stuff unnecessary. I think the learning rate schedules used for these big networks tend to be simple as well, just two or three steps.
I work in this field (PhD candidate), and what you say is true for smaller models, but not GPT-3-scale models. Training large-scale models involves a lot more, as the OP said. It's not just learning rate schedulers; it's a whole bunch of stuff.
Seems like the majority of problems in this log are devops problems, which seem to be a combination of ML people doing devops work without having devops experience, and a really bad cloud vendor. I've been running multiple bare-metal nodes with 8 GPUs each, running 24/7 for months at almost 100% utilization, and had 100x fewer problems than they had.
You put the finger on exactly what I find incredible about the recent progress in ML - the reason I wrote this post was to see how much I could de-mystify these state-of-the-art models for myself, and the conclusion is that (after the model is trained) it all really boils down to a couple of matrix multiplications! All the impressive results we see, they're not coming from an extremely complicated system ('complicated' like a fighter jet is, with many different subsystems, which you'd need to read many books to memorize).
Of course, there's all the secret sauce to actually getting the models to learn anything, and all the empirical progress we make to make the training more efficient (ReLUs, etc). But how many of those are fundamental, vs. simply efficiency shortcuts? And: if you'd asked me 10 years ago what I thought it would take to get the kind of output these large models are getting these days, I would not have guessed anything nearly as simple as what those models actually are.
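For what it's worth, here is a toy numpy sketch of the "couple of matrix multiplications" point: the forward pass of a single Transformer block, with random weights, a single head, and no layer norm, residuals, or masking. It's a sketch of the shape of the computation, not a faithful implementation of any particular model:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    T, d = 4, 8                                   # toy sizes: 4 tokens, width 8
    x = np.random.randn(T, d)                     # token embeddings + positional encodings
    Wq, Wk, Wv, Wo = (np.random.randn(d, d) for _ in range(4))
    W1, W2 = np.random.randn(d, 4 * d), np.random.randn(4 * d, d)

    # Self-attention: matmuls for Q, K, V, the score matrix, the mix of V, and the output projection.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(Q @ K.T / np.sqrt(d))
    attn_out = (scores @ V) @ Wo

    # Feed-forward block: two more matmuls with a ReLU in between.
    out = np.maximum(0, attn_out @ W1) @ W2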
Don't know. Karpathy has a very compact implementation of GPT [0] using standard technology (it could be even more compact, but it reimplements, for example, the attention layer for teaching purposes), and while he presumably has no access to how the real model was trained exactly, if there were more to it I think he would be the kind of person to point it out.
I've recently come to the conclusion that the magic of fully connected neural networks is that there are almost no tricks needed to get close to sota. Dense layers + relu + adam = it just works
Sorry, but this is just wrong: using only fully connected layers would result in pretty bad performance on images, text, audio, etc., or at the very least require much more data to perform well. At least use the right type of architecture for each data modality; then I agree that the basic version won't perform much worse than sota in the real world.
I think part of parent is wrong but part is correct.
There are many rules of thumb that took the last 5+ years to discover but are now quite standard. You are nitpicking on fully connected, but if we add dropout, weight initialization, and adaptive learning rate to what they said, then we are fairly close to being able to at least get a deep architecture to overfit a toy dataset, and be off to the races when applying it to a larger dataset.
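As a concrete, hypothetical sketch of that recipe in PyTorch (ReLU, dropout, explicit weight initialization, adaptive learning rate), here it is overfitting a tiny random dataset:

    import torch
    from torch import nn

    model = nn.Sequential(
        nn.Linear(20, 128), nn.ReLU(), nn.Dropout(0.1),
        nn.Linear(128, 128), nn.ReLU(), nn.Dropout(0.1),
        nn.Linear(128, 2),
    )
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")  # weight initialization
            nn.init.zeros_(m.bias)

    opt = torch.optim.Adam(model.parameters(), lr=1e-3)             # adaptive learning rate
    loss_fn = nn.CrossEntropyLoss()

    # Toy dataset: the only goal is to check the network can memorize it.
    X, y = torch.randn(64, 20), torch.randint(0, 2, (64,))
    for _ in range(500):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    print(loss.item())  # should fall sharply if everything is wired up correctly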
The smart money should be on research on current shortcomings that will become deal breakers when AI is fully pervasive in society. For example, addressing catastrophic forgetting seems to me to be a very profitable research aim.
Maybe I wasn't clear enough, but of course I'm not implying that you can reach sota on image classification with fcnns. There are many problems where the input space is not as noisy, redundant and structure-bearing as with images.
I used to work in data engineering for ML and yes, I'd say 90% of our technical expertise on both the science and engineering side went into designing the datasets.
Just getting plain text out of the web without getting flooded with boilerplate, noise, SEO spam, duplication, infinite pages like calendars, etc. is already a hard data engineering problem.
You're only thinking of the training data. But the pre-trained model is like a newborn, thrashing and yelling and not listening. It needs a second level of training made of a mix of about 1800 supervised tasks. Now it has progressed a little: you can get it to listen, but it's still not ok, it's like a 5 year old. You need to label more data with human preferences and fine-tune the model to align it with what we think is good behaviour. Now it behaves like a 10 year old.
In the original dataset you already combine dozens of sources - web scrapes, book collections, paper collections, materials in many languages, etc. In the second stage you have thousands of small supervised datasets. In the third stage you have to label. So I think the dataset building phase is pretty difficult.
Note on the sinusoidal encoding: the reason it's used is, generally speaking, twofold:
1 - To encode position somehow (which author details)
2 - Because sine is easy “noise” for the network to learn.
There are also a bunch of cool tricks here, even down to the PyTorch implementation, to optimize this encoding by exploiting the nature of sine/cosine, which is an added reason for its popularity in Transformer architectures. If you like math I recommend diving into it as it's quick but fun!
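For reference, a minimal PyTorch sketch of the standard sinusoidal encoding from "Attention Is All You Need", using the usual log-space trick for the frequencies (assumes d_model is even):

    import math
    import torch

    def sinusoidal_encoding(seq_len, d_model):
        pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)            # (seq_len, 1)
        freq = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                       # 1 / 10000^(2i/d)
        pe = torch.zeros(seq_len, d_model)
        pe[:, 0::2] = torch.sin(pos * freq)   # even dimensions: sine
        pe[:, 1::2] = torch.cos(pos * freq)   # odd dimensions: cosine
        return pe

    pe = sinusoidal_encoding(2048, 512)       # one fixed vector per position, added to the embeddings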
(Side note: it's also falling out of fashion in favor of other encoding methods, e.g. rotary positional encoding is vastly popular in the RoFormer branch of transformers.)
Not an expert, but took my masters in data science some years ago:
My interpretation is that the sinusoid converts the position values from 0..N to a bounded range (-1..1). This may result in more predictable changes than using large integer numbers, such that the same word in different positions doesn't lose all its meaning.
Using integers and having a layer to compute this operation could also work, so maybe this is an optimization, e.g. it reduces training time or yields better results.
According to the "Attention Is All You Need" paper, there weren't huge differences between using sin/cos and trainable weights. But the paper is a little bit old, so I don't know what the current sota is regarding positional embeddings.
I’m new to this stuff, but as I understand it, the “Attention is all you need” paper stated that training the positional encoding weights didn’t improve results for language models specifically, but other papers found that vision transformers performed better with trainable weights.
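For comparison, the trainable alternative is just an embedding table indexed by position, summed with the token embeddings. A minimal sketch (sizes are GPT-2-ish but purely illustrative):

    import torch
    from torch import nn

    vocab, max_len, d_model = 50257, 2048, 512
    tok_emb = nn.Embedding(vocab, d_model)      # one learned vector per token id
    pos_emb = nn.Embedding(max_len, d_model)    # one learned vector per position

    tokens = torch.randint(0, vocab, (1, 10))   # a batch with 10 token ids
    positions = torch.arange(tokens.size(1)).unsqueeze(0)
    x = tok_emb(tokens) + pos_emb(positions)    # summed, just like the fixed sinusoidal encoding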
If one were to make a Markov chain with the same amount of input data, would the result be the same? Markov chain chatbots have been a thing for years, just on a much more limited set of data.
No. The state size is 50257^2048. The vast majority of states have never been seen and will never be seen in all of humanity.
For example, if your training set consists of the words rain and thunder used interchangeably a lot, but the word "today" is only used once, in the sentence "there is no rain today", then a Markov chain based on the data would never output "there is no thunder today", but a transformer might.
In other words, information compression (e.g. equating rain with thunder) isn't just for practicality, it's a necessary requirement for (the current generation of) good language models.
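A tiny toy sketch of that limitation, using a first-order (bigram) Markov chain over a made-up corpus: generation can only replay transitions that literally occurred in the data, so "thunder today" can never come out.

    import random
    from collections import defaultdict

    # Made-up corpus: "rain" and "thunder" appear in similar contexts,
    # but "today" only ever follows "rain".
    corpus = "the rain is loud . the thunder is loud . there is no rain today".split()

    chain = defaultdict(list)
    for prev, nxt in zip(corpus, corpus[1:]):
        chain[prev].append(nxt)

    word, out = "there", ["there"]
    while word in chain and len(out) < 8:
        word = random.choice(chain[word])       # can only pick a continuation seen in training
        out.append(word)
    print(" ".join(out))                        # never "there is no thunder today"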
I wonder what the reason is that different blocks are summed even though they semantically represent very different things, e.g. positional encoding plus embedding. Intuitively, wouldn't you expect the architecture to work better if you kept those things as separate inputs to the next layer? Won't the network have some blind spots where a given word will effectively be invisible if it appears in the wrong position?
It does not "support" arbitrary precision. (I had to change the inputs a little because hacker news removes asterisks)
Example 1:
Input: 6984654984165498 x 83749872394871982798
Output: The result of the multiplication is 58307472585676078521357388975872646506.
Expected: 584963963646067047073619488641103404
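(For reference, since Python integers have arbitrary precision, the expected value is one line in a REPL:

    >>> 6984654984165498 * 83749872394871982798
    584963963646067047073619488641103404

)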
Example 2:
Input: 45649898465897645132987645146987156456978 x 879846498164168468465465169816989441
Output: I'm sorry, but I'm not able to perform calculations involving such large numbers. My knowledge cutoff is 2021, and I am not currently able to browse the internet, so I don't have access to any updated information. Is there something else I can help you with?
For example, how does the explanation in the article produce a model that can solve this:
“To calculate the hypotenuse of a triangle with one side that is 12 inches long and another side that is 36 centimeters long, a 6th grader might say something like this:
"First, we need to convert the 36 centimeters into inches so that both sides of the triangle are in the same units. We can do this by dividing 36 by 2.54, which is the number of centimeters in one inch. This gives us 14.173228 inches.
Then, we can use the Pythagorean theorem to find the length of the hypotenuse. The Pythagorean theorem says that in a right triangle, the square of the length of the hypotenuse is equal to the sum of the squares of the lengths of the other two sides. So we can use this formula to find the length of the hypotenuse: a^2 + b^2 = c^2.
In our triangle, the length of one side is 12 inches, and the length of the other side is 14.173228 inches. So we can plug those numbers into the formula like this: 12^2 + 14.173228^2 = c^2.
Then we just need to do the math to find the value of c. 12^2 is 144, and 14.173228^2 is 201.837296. So if we add those two numbers together, we get 346.837296. And if we take the square root of that number, we get the length of the hypotenuse, which is 18.816199 inches.”
Well, it's wrong, for one: it correctly gets the division (I assume it has that fact memorized), but 14.173228^2 is 200.88, not 201.83. It then also does the addition wrong, and the square root is also wrong.
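A quick Python check of the three steps (printed values are approximate in the comments):

    import math

    b = 36 / 2.54                     # 14.173228... (the one step ChatGPT got right)
    print(b * b)                      # ~200.88, not the claimed 201.837296
    print(144 + b * b)                # ~344.88, not 346.837296
    print(math.sqrt(144 + b * b))     # ~18.57 inches, not 18.816199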
You gotta be REAL careful with ChatGPT output that sounds convincing and technical. It's very good at convincingly making stuff up, even math-y science-y sounding stuff.
It's still hard for me to comprehend that it gets as close as it did. Was there enough training data on 14-or-so squared to let it get close to 200.88?
Q (Query) is like a search query. K (Key) is like a set of tags or attributes of each word that the query can look for.
Imagine the self-attention scores for the sentence ”The chicken crossed the road because it wanted to get to the other side.”
Let’s look specifically at the word “it”. We can imagine the matrix Q_it as a representation of attributes of the word we want “it” to gain context from. For example, “it” is a pronoun, and therefore gains context from nouns. Therefore, the matrix Q_it might have some representation of “noun” in it, because we want it to take context from nouns.
In other words, one of the goals during the training process of a transformer network is to train a weight W_q to map a word to a matrix representation of attributes of words it gains context from. So W_q should map any word that’s a pronoun to a noun query.
Similarly, K can be thought of as a representation of attributes of each word. So K_chicken should also have a “noun” tag. So one of the goals during training is to train W_k, which maps the latent-space representation of a word (chicken) to a matrix representation (K_chicken) of its important attributes (noun).
When we take the dot product of Q and K, what we’re finding is the similarity between those two matrices. In other words, Q_it * K_chicken should have a high value, because “it” is looking for nouns in its query, and “chicken” is a noun.
Obviously this is a very human-centric explanation and how the weights W_q and W_k are trained in practice may not align perfectly with human interpretable concepts, but hopefully helps with understanding.
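Here is a tiny numpy sketch of that intuition, with hand-made (not learned) vectors where one dimension stands in for "is a noun":

    import numpy as np

    # Purely illustrative vectors: dimension 0 = "is a noun".
    q_it      = np.array([1.0, 0.0])   # "it" queries for nouns
    k_chicken = np.array([1.0, 0.0])   # "chicken" is tagged as a noun
    k_road    = np.array([1.0, 0.0])   # so is "road"
    k_because = np.array([0.0, 1.0])   # "because" is not

    scores = np.array([q_it @ k_chicken, q_it @ k_road, q_it @ k_because])
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax over the scores
    print(weights)                     # "it" puts most of its attention weight on the nouns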
Hi! I'm the author (Daniel). I used OneNote on some old surface tablet I had lying around, but these days I'm not sure I would use it again (for example because it doesn't support exporting parts of a page to .svg)