* All variants were trained on 1T - 1.4T tokens, which is a good amount relative to their sizes by the Chinchilla metric. Code is 4.5% of the training data (similar to others). [Table 2]
* They note the GPU hours as 82,432 (7B model) to 1,022,362 (65B model). [Table 15] GPU-hour rates will vary, but at $1 to $4 per hour the 7B model would have cost ~$82-330k and the 65B something in the range of ~$1-4M (rough arithmetic sketched after this list). They also note the total time spent across all models: "we used 2048 A100-80GB for a period of approximately 5 months" [sec 6, pg 10]
* 65B model's performance is broadly comparable to PaLM-540B. Not a small feat, but it could also indicate the benefits of a good model-size-to-token-count ratio [Tables 3,4,5,6]. Their conjecture for underperforming on MMLU (multitask language understanding) compared to PaLM-540B and Chinchilla-70B is a smaller fraction of books and academic data in the training set.
* Math and code tasks: on math tasks they are substantially worse than Minerva (comparing their 65B to Minerva-62B; they lose hands down against Minerva-540B) [Table 7]. On code tasks they are broadly competitive with PaLM-540B (HumanEval and MBPP evals) [Table 8]
* Surprising that instruction fine-tuning takes up such a small part of the paper (sec 4, pg. 7)
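A quick check of those cost figures in Python. The $1-$4/GPU-hour range is the commenter's assumption, and the ~20-tokens-per-parameter rule of thumb is the usual reading of the Chinchilla paper, not something stated in the LLaMA paper itself:

    # GPU-hour figures from Table 15; the hourly rate range is an assumption.
    gpu_hours = {"7B": 82_432, "65B": 1_022_362}
    for name, hours in gpu_hours.items():
        lo, hi = hours * 1, hours * 4
        print(f"{name}: ${lo:,.0f} - ${hi:,.0f}")
    # 7B:  $82,432 - $329,728
    # 65B: $1,022,362 - $4,089,448

    # Chinchilla-optimal token counts (~20 tokens/param) vs. the 1T-1.4T actually used:
    for params in (7e9, 65e9):
        print(f"{params/1e9:.0f}B params -> ~{20 * params / 1e12:.2f}T tokens")
    # 7B -> ~0.14T, 65B -> ~1.30T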
I hate when people don't include an approximation of the training done before the final hyperparameters are found, as it's usually the most costly part of the whole process.
It's just "yes, we trained it for this long" etc., but they never speak about the tens or even hundreds of runs before they finalized the model parameters and architecture -.-
> we used 2048 A100-80GB for a period of approximately 5 months
Do we know how much total energy a human consumes from birth to 20 years old? Something like 2000 kcal a day integrated over 20 years. How does it compare to the GPUs above?
Wolfram Alpha:
- human - 17 MWh ((2000 kcal per day) over 20 years, in MWh)
- GPUs - ~3000 MWh ((2048 × 400) W over 5 months, in MWh)
We still have the edge.
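The same arithmetic redone in Python; note that the 400 W per-GPU draw and the 2000 kcal/day figure are the commenter's assumptions (A100-80GB TDP is roughly 300-400 W depending on the SKU):

    KCAL_TO_KWH = 1.163e-3                               # 1 kcal ≈ 0.001163 kWh
    human_mwh = 2000 * KCAL_TO_KWH * 365 * 20 / 1000
    gpu_mwh = 2048 * 400 / 1000 * 24 * 30 * 5 / 1000     # kW * hours -> MWh
    print(f"human over 20 years:   ~{human_mwh:.0f} MWh")   # ~17 MWh
    print(f"2048 GPUs over 5 months: ~{gpu_mwh:.0f} MWh")   # ~2,950 MWh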
LOL, I'm being downvoted, I wonder why. Some don't like the question.
You have to include our evolutionary history too. A considerable amount of our sophisticated behavior doesn't require specific training, as it is encoded in our genetic and epigenetic systems. We aren't starting from zero.
Then you would need to include our history in the GPU calculation too. GPUs require evolutionary bootstrapping - they didn't materialize alongside the first few hydrogen atoms post Big Bang.
Depends on what you're doing. A human is much smarter than one of these models, but the model has approximate knowledge of orders of magnitude more things. And the energy costs per word of output are a lot closer.
A thing to keep in mind is that 1 MWh of raw calories takes much more than 1 MWh to produce (fuel for tractors, inefficiency of meat etc). The GPU energy is also easier to make renewable.
I did an extremely rough calculation recently that training GPT-3 is comparable to one transatlantic flight (all passengers combined) in terms of emissions, heavily depending on the energy mix of course.
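A back-of-the-envelope version of that comparison. Every figure here is a rough assumption on my part: ~1,300 MWh for GPT-3 training (a commonly cited estimate), a 0.4 kgCO2/kWh grid mix, ~70 t of jet fuel burned by one transatlantic wide-body, and ~3.16 kgCO2 per kg of jet fuel:

    training_tco2 = 1_300 * 1000 * 0.4 / 1000   # MWh -> kWh -> kgCO2 -> tCO2
    flight_tco2 = 70_000 * 3.16 / 1000          # kg fuel -> kgCO2 -> tCO2
    print(f"GPT-3 training:           ~{training_tco2:.0f} tCO2")  # ~520 t
    print(f"one transatlantic flight: ~{flight_tco2:.0f} tCO2")    # ~220 t

Same order of magnitude, which is about as much as a calculation this rough can say.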
That's the entire problem. There's so much more energy that goes into a modern human beyond just what they eat. Beyond physical items you've listed like clothing there's also education and healthcare. Those two institutions are critical in making a modern human and they both have their own dependency chains of physical resource, energy, and the input of even more humans.
I remember one of their old guidebooks describing a lot of struggle to keep their 64-machine (512-GPU) cluster running; this was probably 4x the machines and presumably 4x the number of cluster dropouts.
At CentML, we profiled GPU utilization on a larger AI/ML research institute's cluster: 10% to 45%, mostly around 10%. We then offered them software optimizers (which do not affect model accuracy) to get the GPUs to 90% utilization.
90% sustained utilization is quite amazing, and 10% is shockingly typical. I am quite skeptical that this holds for training and very large data sets, of the sort where data placement comes into play, but if so, congratulations, and I hope things go well for you.
A lot of it appears to be non-streaming approaches to data distribution, resulting in actual job behavior that looks a lot more like stage-process-clear batch jobs than what you'd want in order to hide the latency of data moves.
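For anyone wanting to sample the kind of sustained utilization being discussed, here is a minimal sketch, assuming nvidia-smi is available on the node; it only captures coarse SM busy-time, not the finer-grained inefficiencies a software optimizer would target:

    import subprocess, time

    def sample_gpu_utilization(seconds=60, interval=1.0):
        """Poll nvidia-smi and return the mean utilization (%) per GPU."""
        totals, count = None, 0
        for _ in range(int(seconds / interval)):
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=utilization.gpu",
                 "--format=csv,noheader,nounits"],
                capture_output=True, text=True, check=True,
            ).stdout.split()
            vals = [float(v) for v in out]
            totals = vals if totals is None else [a + b for a, b in zip(totals, vals)]
            count += 1
            time.sleep(interval)
        return [t / count for t in totals]

    print(sample_gpu_utilization(seconds=10))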
> *65B model's performance is broadly comparable to PaLM-540B. Not a small feat, but it could also indicate the benefits of a good model-size-to-token-count ratio [Tables 3,4,5,6]. Their conjecture for underperforming on MMLU (multitask language understanding) compared to PaLM-540B and Chinchilla-70B is a smaller fraction of books and academic data in the training set.*
What do you mean by this? The OpenAI papers talk roughly about model performance scaling with parameters. Does this show the other way?
Umm... so does OpenAI. In fact this is an OpenAI discovery from [1]:
> Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence (see Figure 3). Maximally compute-efficient training would therefore be far more sample efficient than one might expect based on training small models to convergence, with data requirements growing very slowly as D ∼ C^0.27 with training compute. (Section 6)
> We have also tested our models on a set of additional text data distributions. The test loss on these datasets as a function of model size is shown in Figure 8; in all cases the models were trained only on the WebText2 dataset. We see that the loss on these other data distributions improves smoothly with model size, in direct parallel with the improvement on WebText2. We find that generalization depends almost exclusively on the in-distribution validation loss, and does not depend on the duration of training or proximity to convergence. We also observe no dependence on model depth (see Appendix D.8)
This is the old scaling laws paper. The scaling laws in it turned out to be wrong and superseded by the Chinchilla DeepMind paper: https://arxiv.org/abs/2203.15556
By "parameters" they probably mean float32s, and 65B of those is 0.25 TB of data - more than enough to memorize a 1.5T sequence of "tokens" (3 letter triplets?). This begs the question: are these models better than a fuzzy hash table?
Yes and no. Information theoretically, tokens are pretty well compressed, and you can't get another 6x losslessly.
Moreover, anything even kind of looking like a hash table in the input/output space is ruled out by the observed facts that the models can respond to samples crafted not to be in the training set and that they take into account many long-range dependencies (i.e., the hash table would have to be exponentially larger than it is to match the model's performance).
That said, they are just statistical party tricks. The magic happens because the lookup tables are in a latent space. That's why you can drop in garbage like "uberworldchefinatormichelingodfoodpleasureorgasmmaestro" when asking for recipes and food recommendations and get an experience planets apart from queries excluding the nonsense phrases. The model is just pulling together some token associations, and throwing in the right tokens can take advantage of those in situations where a thinking person would barely be able to parse what you're asking.
Your question feels like it has a motive though. What are you really asking?
LLMs need a baseline to compare with. I suspect that when they get compared with a fuzzy hash table of a similar size (that returns a range of probabilities), their performance will become unimpressive.
You can just directly calculate what would happen. To respond to novel words (which these demonstrably do) it needs to be equivalent to a character-wise hash table, and to be the same size as LLaMA you can do lookups on around 4 characters (and you have to deal with the data sparsity in constructing many of those tuples). If you want worse output but a better hash table on the output that remains, you could hash words or common words and get contexts of up to a few words rather than a few letters.
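A rough version of that sizing arithmetic; the 256-symbol byte alphabet and the one-value-per-cell accounting are my framing, not necessarily the exact calculation the commenter did:

    import math
    # Parameter budget of LLaMA-65B vs. a dense character-level lookup table.
    params = 65e9
    # One stored value per context (just the most likely next char): 256**k <= params
    print(math.log(params, 256))        # ≈ 4.5 characters of context
    # A full next-character distribution per context: 256**(k+1) <= params
    print(math.log(params, 256) - 1)    # ≈ 3.5 characters of context

Either way the affordable context lands at roughly 3.5-4.5 characters, i.e. "around 4".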
LLMs can track mid-range dependencies though. Consider the following input
> Translate the phrase "the lazy brown fox jumped over the thorny brambles" into French, write the translation, and then write the second through fourth words of that translation.
Looking at any one word of the output you need to track many of the input words to get it correct, and the relative positions of those necessary input words is not consistent from one output word to the next. ChatGPT solves the task flawlessly (aside from its habit of explaining what it's doing before doing it). Any hash table solution, at a minimum, would need a complicated heuristic for determining which words/characters to look up.
Doing so brings us back closer to the state of language models before transformers. You had a lot of hand-tuned features, formal grammars, complicated orders of operations, expert lookup tables, and whatnot. Performance was still much, much worse than what we're getting now with deep learning.
None of that is to say that philosophically we're doing anything more than mishmashing probabilities or that something better doesn't exist, but without significant innovation rule-guided fuzzy hash tables aren't it.
The fuzzy hash table would use 8192-token-long sequences as keys, and when asked to fetch a key, it would find the nearest keys and return that distribution. The internal representation of this hash table is a cloud of points in an 8192×sizeof(token)-dimensional space.
The procedure for constructing this table would be to take all 1.5 trillion subsequences, each 8192 tokens long, and insert them: table[seq8192] = token8193 (the next token). Arranging this data efficiently to allow fast lookups is the problem.
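A minimal sketch of that construction, with illustrative stand-ins chosen by me (a context length of 8 instead of 8192, a random-projection embedding, brute-force nearest-neighbour search):

    import numpy as np

    CONTEXT_LEN = 8      # stand-in for the 8192-token contexts described above
    EMBED_DIM = 64
    _PROJ = np.random.default_rng(0).normal(size=(CONTEXT_LEN, EMBED_DIM))

    def embed(context):
        """Map a token-id context to a fixed vector via a random projection."""
        return np.asarray(context, dtype=float) @ _PROJ

    class FuzzyTable:
        """Nearest-neighbour 'fuzzy hash table': context -> next-token distribution."""
        def __init__(self):
            self.keys, self.values = [], []

        def insert(self, context, next_token):
            self.keys.append(embed(context))
            self.values.append(next_token)

        def predict(self, context, k=5):
            """Return the empirical next-token distribution over the k nearest keys."""
            dists = np.linalg.norm(np.stack(self.keys) - embed(context), axis=1)
            nearest = np.argsort(dists)[:k]
            tokens = [self.values[i] for i in nearest]
            return {t: tokens.count(t) / len(tokens) for t in set(tokens)}

    # Build the table by sliding a window over a token stream, as described:
    # table[seq] = the next token.
    stream = list(range(100)) * 5
    table = FuzzyTable()
    for i in range(len(stream) - CONTEXT_LEN):
        table.insert(stream[i:i + CONTEXT_LEN], stream[i + CONTEXT_LEN])

    print(table.predict(stream[:CONTEXT_LEN]))   # -> {8: 1.0} on this toy stream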
Edit: I missed this on the first pass, but I'm totally lost as to where 1.5T comes from. Even if you only have two tokens there are vastly more 8192-length subsequences than that (something like 2^8151.5 times more), and if we're just trying to replicate the same space as something like GPT3.5 or LLaMA then you only get on the order of 0.065T to 0.175T entries to play with, much less when you consider that you have a full probability distribution to store (divide by your unique token count, and again by at least 2 if we store at least IEEE f16 probabilities).
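For reference, the 2^8151.5 figure in that aside checks out:

    import math
    # Count of length-8192 sequences over even a 2-symbol vocabulary vs. the
    # 1.5T entries proposed above (both figures taken from the comments).
    print(8192 - math.log2(1.5e12))   # ≈ 8151.5, i.e. 2**8192 ≈ 2**8151.5 × 1.5e12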
There are lots of interpretations. I actually like KNN for a lot of tasks. My gut says that it still wouldn't perform well here (and for the record, there are efficient data structures for the idea you're describing unless you have some nonstandard modifications, so "arranging the data efficiently to allow fast lookups" is definitely not the core problem), but I admittedly don't have proof of that yet.
For some intuition, imagine the following tasks:
> Repeat the following phrase exactly twice: "sdflhasdflhasdf"
> Repeat the following phrase exactly twice: "sdflhasdflhasdg"
Your fuzzy dictionary or geospatial map can't possibly have enough keys to distinguish the requests (or if it distinguishes those, you can adversarially select different keyboard mashes), and so the result, no matter what it is, would have the same probability distribution for both prompts. Since the desired results are different, at least one of those would have some unavoidable wrongness.
The GPT family, on the other hand, has few issues with random phrase duplication since positional information is something it explicitly considers and is capable of prioritizing over other token information.