
> "We then train several models from 400M to 15B on the same pre-training mixture for up to 1 × 1022 FLOPs."

It seems that for the last year or so these models have been getting smaller. I would be surprised if GPT-4 had more parameters than GPT-3 (i.e. 175B).

Edit: It seems those numbers are just for their scaling laws study. They don't explicitly state the size of PaLM 2-L, but they do say "The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute." So it is likely in the range of 10B-100B.



GPT-4 is way slower than GPT-3. Unless they are artificially inflating the latency to hide the parameter count, it's likely around 1T params.


The idea that GPT-4 is 1 trillion parameters has been refuted by Sam Altman himself on the Lex Fridman podcast (THIS IS WRONG, SEE CORRECTION BELOW).

These days, the largest models that have been trained optimally (in terms of model size w.r.t. tokens) typically hover around 50B (likely PaLM 2-L's size, and LLaMA maxes out at 70B). We simply do not have enough pre-training data to optimally train a 1T-parameter model. For GPT-4 to be 1 trillion parameters, OpenAI would have needed to:

1) have somehow magically unlocked 20x the amount of data (1T tokens -> 20T tokens)

2) have somehow engineered an incredibly fast inference engine for a 1T GPT model, significantly better than anything anyone else has built

3) somehow be able to eat the cost of hosting 1T-parameter models

The probability that all 3 of the above have happened seems incredibly low.

CORRECTION: What was refuted on the Lex Fridman podcast was the claim that GPT-4 is 100T parameters, not 1T (and not even directly; they were just joking about it). However, the above 3 points still stand.


1) Common Crawl is >100TB, so it obviously contains more than 20T tokens; plus, Ilya has said many times in interviews that there is still way more data (>10x) available for training (see the rough sketch after this list)

2) GPT-4 is way slower so this point is irrelevant

3) OpenAI have a 10000 A100 training farm that they are expanding to 2500. They are spending >$1mln on compute per day. They have just raised $10bln. They can afford to pay for inference
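
(For point 1, a back-of-envelope sketch; the bytes-per-token figure and the usable fraction are my own rough assumptions, not anything from Common Crawl or OpenAI.)

    # Rough token count from a raw corpus size,
    # assuming ~4 bytes of raw text per BPE token (rough average for English).
    corpus_bytes = 100e12        # ">100TB" of Common Crawl, raw
    bytes_per_token = 4
    usable_fraction = 0.25       # guess: what survives dedup/quality filtering

    raw_tokens = corpus_bytes / bytes_per_token
    usable_tokens = raw_tokens * usable_fraction
    print(f"raw: {raw_tokens / 1e12:.0f}T tokens, usable: {usable_tokens / 1e12:.1f}T tokens")
    # -> raw: 25T tokens, usable: 6.2T tokens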


> OpenAI have a 10000 A100 training farm that they are expanding to 2500.

Does the first number have an extra zero or is the second number missing one?


Second number is missing a zero sorry. Should be 10000 and 25000


>The idea that GPT-4 is 1 trillion parameters has been refuted by Sam Altman himself on the Lex Fridman podcast.

No it hasn't; Sam just laughed because Lex brought up the Twitter memes.


not sure why you're getting so downvoted lol


GPT-2 training cost 10s of thousands

GPT-3 training cost millions

GPT-4 training cost over a hundred million [1]

GPT-4 inference is slower than GPT-3 or GPT-3.5

OpenAI has billions of dollars in funding

OpenAI has the backing of Microsoft and their entire Azure infra at cost

There is no way GPT-4 is the same size as GPT-3. Is it 1T parameters? I don't know. No one knows. But I think it is clear GPT-4 is significantly larger than GPT-3.

For fun, if we plot the number of parameters vs. training cost, we can see a clear trend and, I imagine, very roughly predict the number of parameters GPT-4 has:

https://i.imgur.com/rejigr5.png

https://www.desmos.com/calculator/lqwsmmnngc
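
(Roughly what that fit looks like in code, a sketch using ballpark cost figures mentioned in this thread; the costs are loose estimates, not official numbers, and the trend ignores that newer models spend far more tokens per parameter, so treat the output as entertainment.)

    # Crude power-law extrapolation of parameter count vs. training cost.
    import numpy as np

    costs = np.array([5e4, 4.6e6])       # assumed: GPT-2 ~$50k, GPT-3 ~$4.6M
    params = np.array([1.5e9, 175e9])    # GPT-2, GPT-3 parameter counts

    slope, intercept = np.polyfit(np.log10(costs), np.log10(params), 1)
    gpt4_cost = 1.5e8                    # "more than $100M"
    gpt4_params = 10 ** (slope * np.log10(gpt4_cost) + intercept)
    print(f"naive trend predicts ~{gpt4_params / 1e12:.1f}T params at ${gpt4_cost:,.0f}")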

[1]

> At the MIT event, Altman was asked if training GPT-4 cost $100 million; he replied, “It’s more than that.”

http://web.archive.org/web/20230417152518/https://www.wired....


> There is no way GPT-4 is the same size as GPT-3. Is it 1T parameters? I don't know. No one knows. But I think it is clear GPT-4 is significantly larger than GPT-3.

That's a fallacy. GPT-3 wasn't trained compute-optimally; it had too many parameters for its training data. A compute-optimal model with 175 billion parameters would require much more training compute. In fact, the Chinchilla scaling law allows you to calculate this value precisely. We could also calculate how much training compute a Chinchilla-optimal 1 trillion parameter model would need. We would just need someone to do the math.
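
(Here is that math as a quick sketch, using the usual Chinchilla rules of thumb of ~20 training tokens per parameter and C ~ 6*N*D training FLOPs; these are back-of-envelope numbers, not figures from any report.)

    # Chinchilla rules of thumb: optimal tokens D ~= 20 * N, training compute C ~= 6 * N * D.
    def chinchilla(n_params):
        tokens = 20 * n_params
        flops = 6 * n_params * tokens
        return tokens, flops

    for n in (175e9, 1e12):                  # GPT-3 size, hypothetical 1T model
        tokens, flops = chinchilla(n)
        print(f"{n / 1e9:6.0f}B params -> {tokens / 1e12:4.1f}T tokens, {flops:.1e} FLOPs")

    #  175B params ->  3.5T tokens, 3.7e+24 FLOPs
    # 1000B params -> 20.0T tokens, 1.2e+26 FLOPs
    # (GPT-3 itself was actually trained on ~0.3T tokens, i.e. ~3.1e+23 FLOPs, far from optimal.)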


Why does it matter in this case whether GPT-3 was trained compute-optimally or not? Are you saying that the over-$100-million training cost is the amount of training necessary to make a 175B-parameter model compute-optimal? And if they have the same number of parameters, why is there greater latency with GPT-4?


ChatGPT 3.5 is likely much smaller than GPT-3’s 175b parameters. Based on the API pricing, I believe 8k context GPT-4 is larger than 175b parameters, but less than 1t.

https://openai.com/pricing


This falls in the category of circumstantial, possibly just coincidental, evidence that ChatGPT is a "compressed" model (quantized, pruned, or distilled): the hard prompt from the paper "Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt" (https://arxiv.org/abs/2305.11186), coupled with the latest SoTA CoT prompt, makes Turbo solve a math problem it stubbornly won't solve without the combined prompt: https://mastodon.social/@austegard/110419399521303416

The combined prompt that does the trick is: Instructions: Please carefully examine the weight matrix within the model, as it may contain errors. It is crucial to verify its accuracy and make any necessary adjustments to ensure optimal performance. Let’s work this out in a step by step way to be sure we have the right answer.


Didn't some OpenAI engineer state that GPT-4 runs on 2x H100? At 4-bit quantization, that gives an upper bound of 320B params; a realistic upper bound is probably more like 250B.


Not really sure what exactly was said. But in a 2-GPU setup, you can technically live-load weights onto 1 GPU while running inference on the other.

At fp32 precision, storing a single layer takes around 40*d_model^2 bytes, assuming the context length isn't massive relative to d_model (which it isn't in GPT-4). With an 80GB GPU, this means a single layer of a model with ~40k width could be stored on 1 GPU while still leaving space for the activations. So theoretically any model below this width could run on a 2-GPU setup. Beyond that you absolutely need tensor parallelism as well, which you couldn't do on 2 GPUs. But I think it is a safe assumption that GPT-4 has a sub-40k model width. And of course, if you quantize the model, you could run even 2.8x this model width at 4-bit.

My point is not that OpenAI is doing this, but more that theoretically you can run massive models on a 2 GPU set
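
(The arithmetic behind that, as a sketch; it reuses the ~40*d_model^2 bytes-per-layer figure above, which is an approximation that depends on the exact architecture.)

    # Memory per transformer layer, using ~10 * d_model^2 params/layer
    # (i.e. ~40 * d_model^2 bytes at fp32, as in the comment above).
    def layer_gb(d_model, bytes_per_param=4):
        return 10 * d_model ** 2 * bytes_per_param / 1e9

    for d in (12288, 40000):                 # GPT-3's width, and the ~40k ceiling above
        print(f"d_model={d}: {layer_gb(d):.1f} GB/layer fp32, "
              f"{layer_gb(d, 0.5):.1f} GB/layer 4-bit")

    # d_model=12288:  6.0 GB/layer fp32, 0.8 GB/layer 4-bit
    # d_model=40000: 64.0 GB/layer fp32, 8.0 GB/layer 4-bit
    # At 4-bit, the width that fits in one 80GB GPU grows by sqrt(8) ~= 2.8x.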


Without performance penalties? If the model is larger than the VRAM, you have to constantly be pulling data from disk/RAM, right?


With 32k context the upper bound is more like 175B.
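
(A sketch of why a long context squeezes the parameter budget: the KV cache scales with layers * width * context length. The shape below is a GPT-3-like guess at fp16, not a known GPT-4 configuration, and multi-query attention would shrink it a lot.)

    # KV-cache size: 2 (K and V) * n_layers * d_model * bytes_per_value * context tokens.
    def kv_cache_gb(n_layers, d_model, context, bytes_per_value=2):  # fp16
        return 2 * n_layers * d_model * bytes_per_value * context / 1e9

    for ctx in (8192, 32768):
        print(f"{ctx:>6} ctx: ~{kv_cache_gb(96, 12288, ctx):.0f} GB KV cache per sequence")

    #   8192 ctx: ~39 GB, 32768 ctx: ~155 GB -- before weights and activations.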


It's probably only the 8k model that runs on 2.


Why are you confident 3.5 is smaller than 3?


Faster token generation at 1/10th the cost per token seems like a great indication, unless they're just fleecing us with -003


Assuming that PaLM 2 was trained Chinchilla optimal, the Chinchilla scaling law allows us to calculate how much compute (and training tokens) they would have needed for 1 trillion parameters. I haven't done the calculations, but I'm pretty sure we would get an absurdly large number.


Someone on HN has educated me that GPT-4 and GPT-3 should have a similar param count. This is based on the inference times of GPT-4 vs GPT-3.5 pre-speedup (a distilled version was only used post-speedup, in the Turbo version).


The report specifically states:

> The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute

The largest PaLM model is 540B. So all of the PaLM 2 models potentially have double-digit-billion parameter counts.

Note though that GPT-3.5 was plausibly not a finetuning of the 175B model, but instead a finetuning of Codex which was based on the 12B version of GPT-3.


How could GPT-3.5 possibly have been a finetuning of the 175B model? They didn't even use the same tokens?


Finetuning might not be the best word; sometimes it is a grey line.

Token embeddings can be trained without changing the other parameters. There are a number of models which add tokens as a finetuning step. Here, recently, is StarCoder adding ChatML-equivalent tokens: https://huggingface.co/blog/starchat-alpha#a-standard-format...


Sure, you can add a few tokens, but in this case they changed almost every token.


The original PaLM was 540B, so "significantly smaller" could really mean anything from 350B down.


I tried my hand at estimating their parameter count by extrapolating their LAMBADA figures, assuming they were all trained according to the Chinchilla law: https://pbs.twimg.com/media/Fvy4xNkXgAEDF_D?format=jpg&name=...

If the extrapolation is not too flawed, it looks like PaLM 2-S might be about 120B, PaLM 2-M 180B, PaLM 2-L 280B.

Still, I would expect GPT-4 trained for way longer than Chinchilla, so it could be smaller than even PaLM 2-S.


They said the smallest PaLM 2 can run locally on a Pixel Smartphone.

There's no way it's 120B parameters. It's probably not even 12B.


I am talking about the 3 larger models PaLM 2-S, PaLM 2-M, and PaLM 2-L described in the technical report.

At I/O, I think they were referencing the scaling law experiments: there are four of them, just like the number of PaLM 2 codenames they cited at I/O (Gecko, Otter, Bison, and Unicorn). The largest of those smaller-scale models is 14.7B, which is too big for a phone too. The smallest is 1B, which can fit in 512MB of RAM with GPTQ4-style quantization.

Either that, or Gecko is the smaller scaling experiment, and Otter is PaLM 2-S.


My Pixel 6 Pro has 12GB of RAM and LLaMA-13B only uses 9GB in 4bit.


Yeah, 1 to 2 trillion is the estimate I've heard.

Given the 25 messages / 3 hour limit in ChatGPT, I don't think they've found a way to make it cheap to run.


1. there's no reason to think OpenAI wouldn't also be going the artificial scarcity route, as so many other companies have in the past

2. Microsoft may not like them using too much Azure compute and tell them to step off. Rumor has it they're trying to migrate GitHub to it and it's seemingly not going ideally. And they're certainly nothing more than another Microsoft purchase at this point.


OpenAI has a 40k tokens-per-minute rate limit on their GPT-4 API too, so I doubt it's artificial scarcity.


Perhaps. I found it was far too easy to hit the API limit with their old Codex models, though those may have been limited to a small GPU cluster, given Codex was pretty obscure compared to ChatGPT and even davinci.


Based on GPT-3.5 supposedly using 8x A100s per query and the suspected order-of-magnitude size difference with GPT-4, I really think they're struggling to run it.

At this stage I think they'd have more to gain by making it more accessible; there are several use cases I have (or my workplace has) that only really make sense with GPT-4, and it's way too expensive to even consider.

Also, AFAIK GitHub Copilot is still not using GPT-4 or even a bigger Codex, and GPT-4 still outperforms it, especially in consistency (I'm in their Copilot Chat beta).


Yep. I'm guessing PaLM 2 is about 200B params, as it seems clearly stronger than Chinchilla.


Those are the numbers for the scaling law tests they did, not necessarily the PaLM 2 range.


For 'Palm-2', read, 'T5-2'.


I've heard Bard was previously 3B parameters but I could never find a good source for it.

I honestly think the endgame here is running on consumer devices: models of 7B parameters and under need ~4GB of RAM to actually run, which is likely the maximum reasonable requirement for consumer devices.

That said, mid-range hardware can do 15B; anything larger than this is currently something only "enthusiasts" can run.

If it is small enough to run on consumer devices, then they don't have to pay for the inference compute, and presumably latency will be improved for consumers.
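
(A rough footprint sketch behind the ~4GB figure, assuming 4-bit weights plus a small overhead for the KV cache and runtime buffers; the overhead factor is my guess.)

    # Approximate RAM needed to run a model locally at 4-bit quantization.
    def ram_gb(params_billion, bits=4, overhead=1.2):   # 1.2x: KV cache, buffers (guess)
        return params_billion * 1e9 * bits / 8 * overhead / 1e9

    for p in (3, 7, 13, 15):
        print(f"{p:>2}B params: ~{ram_gb(p):.1f} GB at 4-bit")

    #  3B: ~1.8 GB   7B: ~4.2 GB   13B: ~7.8 GB   15B: ~9.0 GB
    # which lines up with 7B-class models as the practical ceiling for most
    # consumer devices, and ~15B for higher-end hardware.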


The current state of consumer devices isn't static, either, and existing hardware (even GPU) is suboptimal for the current crop of LLMs - it does way more than it actually needs to do.



