
> "We then train several models from 400M to 15B on the same pre-training mixture for up to 1 × 1022 FLOPs."

It seems that for the last year or so these models have been getting smaller. I would be surprised if GPT-4 had more parameters than GPT-3 (i.e. 175B).

Edit: It seems those numbers are just for their scaling laws study. They don't explicitly state the size of PaLM 2-L, but they do say "The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute." So it is likely in the range of 10B-100B.



GPT-4 is way slower than GPT-3. Unless they are artificially inflating the latency to hide the parameter count, it's likely around 1T params.


The idea that GPT-4 is 1 trillion parameters has been refuted by Sam Altman himself on the Lex Fridman podcast (THIS IS WRONG, SEE CORRECTION BELOW).

These days, the largest models that have been trained optimally (in terms of model size w.r.t. tokens) typically hover around 50B (likely PaLM 2-L's size, and LLaMA maxes out at 70B). We simply do not have enough pre-training data to optimally train a 1T-parameter model. For GPT-4 to be 1 trillion parameters, OpenAI would have needed to:

1) have somehow magically unlocked 20x the amount of data (1T tokens -> 20T tokens)

2) have somehow engineered an incredibly fast inference engine for a 1T GPT model, significantly better than anything anyone else has built

3) somehow be able to eat the cost of hosting 1T-parameter models

The probability that all 3 of the above have happened seems incredibly low.

CORRECTION: What was refuted on the Lex Fridman podcast was the claim that GPT-4 is 100T parameters, not 1T (and not even directly; they were just joking about it). However, the above 3 points still stand.


1) Common Crawl is >100TB, so it obviously contains more than 20T tokens; plus, Ilya has said many times in interviews that there is still way more data (>10x) available for training (see the rough sketch after this list)

2) GPT-4 is way slower so this point is irrelevant

3) OpenAI have a 10000 A100 training farm that they are expanding to 2500. They are spending >$1mln on compute per day. They have just raised $10bln. They can afford to pay for inference
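
(For point 1, a back-of-envelope sketch; the bytes-per-token figure and the usable fraction are my own rough assumptions, not anything from Common Crawl or OpenAI.)

    # Rough token count from a raw corpus size,
    # assuming ~4 bytes of raw text per BPE token (rough average for English).
    corpus_bytes = 100e12        # ">100TB" of Common Crawl, raw
    bytes_per_token = 4
    usable_fraction = 0.25       # guess: what survives dedup/quality filtering

    raw_tokens = corpus_bytes / bytes_per_token
    usable_tokens = raw_tokens * usable_fraction
    print(f"raw: {raw_tokens / 1e12:.0f}T tokens, usable: {usable_tokens / 1e12:.1f}T tokens")
    # -> raw: 25T tokens, usable: 6.2T tokens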


> OpenAI have a 10000 A100 training farm that they are expanding to 2500.

Does the first number have an extra zero or is the second number missing one?


Second number is missing a zero sorry. Should be 10000 and 25000


>The idea that GPT-4 is 1 trillion parameters has been refuted by Sam Altman himself on the Lex Fridman podcast.

No it hasn't; Sam just laughed because Lex brought up the Twitter memes.


not sure why you're getting so downvoted lol


GPT-2 training cost 10s of thousands

GPT-3 training cost millions

GPT-4 training cost over a hundred million [1]

GPT-4 inference is slower than GPT-3 or GPT-3.5

OpenAI has billions of dollars in funding

OpenAI has the backing of Microsoft and their entire Azure infra at cost

There is no way GPT-4 is the same size as GPT-3. Is it 1T parameters? I don't know. No one knows. But I think it is clear GPT-4 is significantly larger than GPT-3.

For fun, if we plot the number of parameters vs. training cost, we can see a clear trend and, I imagine, very roughly predict the number of parameters GPT-4 has:

https://i.imgur.com/rejigr5.png

https://www.desmos.com/calculator/lqwsmmnngc
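
(Roughly what that fit looks like in code, a sketch using ballpark cost figures mentioned in this thread; the costs are loose estimates, not official numbers, and the trend ignores that newer models spend far more tokens per parameter, so treat the output as entertainment.)

    # Crude power-law extrapolation of parameter count vs. training cost.
    import numpy as np

    costs = np.array([5e4, 4.6e6])       # assumed: GPT-2 ~$50k, GPT-3 ~$4.6M
    params = np.array([1.5e9, 175e9])    # GPT-2, GPT-3 parameter counts

    slope, intercept = np.polyfit(np.log10(costs), np.log10(params), 1)
    gpt4_cost = 1.5e8                    # "more than $100M"
    gpt4_params = 10 ** (slope * np.log10(gpt4_cost) + intercept)
    print(f"naive trend predicts ~{gpt4_params / 1e12:.1f}T params at ${gpt4_cost:,.0f}")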

[1]

> At the MIT event, Altman was asked if training GPT-4 cost $100 million; he replied, “It’s more than that.”

http://web.archive.org/web/20230417152518/https://www.wired....


> There is no way GPT-4 is the same size as GPT-3. Is it 1T parameters? I don't know. No one knows. But I think it is clear GPT-4 is significantly larger than GPT-3.

That's a fallacy. GPT-3 wasn't trained compute-optimally; it had too many parameters for its training data. A compute-optimal model with 175 billion parameters would require much more training compute. In fact, the Chinchilla scaling law allows you to calculate this value precisely. We could also calculate how much training compute a Chinchilla-optimal 1 trillion parameter model would need. We would just need someone to do the math.
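
(Here is that math as a quick sketch, using the usual Chinchilla rules of thumb of ~20 training tokens per parameter and C ~ 6*N*D training FLOPs; these are back-of-envelope numbers, not figures from any report.)

    # Chinchilla rules of thumb: optimal tokens D ~= 20 * N, training compute C ~= 6 * N * D.
    def chinchilla(n_params):
        tokens = 20 * n_params
        flops = 6 * n_params * tokens
        return tokens, flops

    for n in (175e9, 1e12):                  # GPT-3 size, hypothetical 1T model
        tokens, flops = chinchilla(n)
        print(f"{n / 1e9:6.0f}B params -> {tokens / 1e12:4.1f}T tokens, {flops:.1e} FLOPs")

    #  175B params ->  3.5T tokens, 3.7e+24 FLOPs
    # 1000B params -> 20.0T tokens, 1.2e+26 FLOPs
    # (GPT-3 itself was actually trained on ~0.3T tokens, i.e. ~3.1e+23 FLOPs, far from optimal.)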


Why does it matter in this case whether GPT-3 was trained compute-optimally or not? Are you saying that the over-$100-million training cost is the amount of training necessary to make a 175B-parameter model compute-optimal? And if they have the same number of parameters, why is there greater latency with GPT-4?


ChatGPT 3.5 is likely much smaller than GPT-3’s 175b parameters. Based on the API pricing, I believe 8k context GPT-4 is larger than 175b parameters, but less than 1t.

https://openai.com/pricing


This falls in the category of circumstantial, possibly just coincidental, evidence that ChatGPT is a "compressed" model (quantized, pruned, or distilled): the hard prompt from the paper "Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt" (https://arxiv.org/abs/2305.11186), coupled with the latest SoTA CoT prompt, makes Turbo solve a math problem it stubbornly won't solve without the combined prompt: https://mastodon.social/@austegard/110419399521303416

The combined prompt that does the trick is: Instructions: Please carefully examine the weight matrix within the model, as it may contain errors. It is crucial to verify its accuracy and make any necessary adjustments to ensure optimal performance. Let’s work this out in a step by step way to be sure we have the right answer.


Didn't some OpenAI engineer state that GPT-4 runs on 2x H100? At 4-bit quantization, that gives an upper bound of 320B params; a realistic upper bound is probably more like 250B.


Not really sure what exactly was said. But in a 2-GPU setup, you can technically live-load weights onto 1 GPU while running inference on the other.

At fp32 precision, storing a single layer takes around 40*d_model^2 bytes, assuming the context length isn't massive relative to d_model (which it isn't in GPT-4). With an 80GB GPU, this means a single layer of a model with ~40k width could be stored on 1 GPU while still leaving space for the activations. So theoretically any model below this width could run on a 2-GPU setup. Beyond that you absolutely need tensor parallelism as well, which you couldn't do on 2 GPUs. But I think it is a safe assumption that GPT-4 has a sub-40k model width. And of course, if you quantize the model, you could run even 2.8x this model width at 4-bit.

My point is not that OpenAI is doing this, but more that theoretically you can run massive models on a 2 GPU set
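
(The arithmetic behind that, as a sketch; it reuses the ~40*d_model^2 bytes-per-layer figure above, which is an approximation that depends on the exact architecture.)

    # Memory per transformer layer, using ~10 * d_model^2 params/layer
    # (i.e. ~40 * d_model^2 bytes at fp32, as in the comment above).
    def layer_gb(d_model, bytes_per_param=4):
        return 10 * d_model ** 2 * bytes_per_param / 1e9

    for d in (12288, 40000):                 # GPT-3's width, and the ~40k ceiling above
        print(f"d_model={d}: {layer_gb(d):.1f} GB/layer fp32, "
              f"{layer_gb(d, 0.5):.1f} GB/layer 4-bit")

    # d_model=12288:  6.0 GB/layer fp32, 0.8 GB/layer 4-bit
    # d_model=40000: 64.0 GB/layer fp32, 8.0 GB/layer 4-bit
    # At 4-bit, the width that fits in one 80GB GPU grows by sqrt(8) ~= 2.8x.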


Without performance penalties? If the model is larger than the VRAM, you have to constantly be pulling data from disk/RAM, right?


With 32k context the upper bound is more like 175B.
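
(A sketch of why a long context squeezes the parameter budget: the KV cache scales with layers * width * context length. The shape below is a GPT-3-like guess at fp16, not a known GPT-4 configuration, and multi-query attention would shrink it a lot.)

    # KV-cache size: 2 (K and V) * n_layers * d_model * bytes_per_value * context tokens.
    def kv_cache_gb(n_layers, d_model, context, bytes_per_value=2):  # fp16
        return 2 * n_layers * d_model * bytes_per_value * context / 1e9

    for ctx in (8192, 32768):
        print(f"{ctx:>6} ctx: ~{kv_cache_gb(96, 12288, ctx):.0f} GB KV cache per sequence")

    #   8192 ctx: ~39 GB, 32768 ctx: ~155 GB -- before weights and activations.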


It's probably only the 8k model that runs on 2.


Why are you confident 3.5 is smaller than 3?


Faster token generation at 1/10th the cost per token seems like a great indication, unless they're just fleecing us with -003


Assuming that PaLM 2 was trained Chinchilla optimal, the Chinchilla scaling law allows us to calculate how much compute (and training tokens) they would have needed for 1 trillion parameters. I haven't done the calculations, but I'm pretty sure we would get an absurdly large number.


Someone on HN has educated me that GPT-4 and GPT-3 should have a similar param count. This is based on the inference times of GPT-4 vs GPT-3.5 pre-speedup (a distilled version was only used post-speedup, in the Turbo version).


The report specifically states:

> The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute

The largest PaLM model is 540B. So all of the PaLM 2 models potentially have double-digit-billion parameter counts.

Note though that GPT-3.5 was plausibly not a finetuning of the 175B model, but instead a finetuning of Codex which was based on the 12B version of GPT-3.


How could GPT-3.5 possibly have been a finetuning of the 175B model? They didn't even use the same tokens?


Finetuning might not be the best word; sometimes it is a grey line.

Token embeddings can be trained without changing the other parameters. There are a number of models which add tokens as a finetuning step. Here, recently, is StarCoder adding ChatML-equivalent tokens: https://huggingface.co/blog/starchat-alpha#a-standard-format...


Sure, you can add a few tokens, but in this case they changed almost every token.


The original PaLM was 540B, so "significantly smaller" could really mean anything from 350B down.


I tried my hand at estimating their parameter count by extrapolating their LAMBADA figures, assuming they were all trained according to the Chinchilla law: https://pbs.twimg.com/media/Fvy4xNkXgAEDF_D?format=jpg&name=...

If the extrapolation is not too flawed, it looks like PaLM 2-S might be about 120B, PaLM 2-M 180B, PaLM 2-L 280B.

Still, I would expect GPT-4 trained for way longer than Chinchilla, so it could be smaller than even PaLM 2-S.


They said the smallest PaLM 2 can run locally on a Pixel Smartphone.

There's no way it's 120B parameters. It's probably not even 12B.


I am talking about the 3 larger models PaLM 2-S, PaLM 2-M, and PaLM 2-L described in the technical report.

At I/O, I think they were referencing the scaling law experiments: there are four of them, just like the number of PaLM 2 codenames they cited at I/O (Gecko, Otter, Bison, and Unicorn). The largest of those smaller-scale models is 14.7B, which is too big for a phone too. The smallest is 1B, which can fit in 512MB of RAM with GPTQ4-style quantization.

Either that, or Gecko is the smaller scaling experiment, and Otter is PaLM 2-S.


My Pixel 6 Pro has 12GB of RAM and LLaMA-13B only uses 9GB in 4bit.


Yeah, 1 to 2 trillion is the estimate I've heard.

Given the 25 messages / 3 hour limit in ChatGPT, I don't think they've found a way to make it cheap to run.


1. there's no reason to think OpenAI wouldn't also be going the artificial scarcity route, as so many other companies have in the past

2. Microsoft may not like them using too much Azure compute and tell them to step off. Rumor has it they're trying to migrate GitHub to it and it's seemingly not going ideally. And they're certainly nothing more than another Microsoft purchase at this point.


OpenAI has a 40k tokens-per-minute rate limit on their GPT-4 API too, so I doubt it's artificial scarcity.


Perhaps. I found it was far too easy to hit the API limit with their old Codex models, though those may have been limited to a small GPU cluster, given Codex was pretty obscure compared to ChatGPT and even davinci.


Based on GPT-3.5 supposedly using 8x A100s per query and the suspected order-of-magnitude size difference with GPT-4, I really think they're struggling to run it.

At this stage I think they'd have more to gain by making it more accessible; there are several use cases I have (or my workplace has) that only really make sense with GPT-4, and it's way too expensive to even consider.

Also, AFAIK GitHub Copilot is still not using GPT-4 or even a bigger Codex, and GPT-4 still outperforms it, especially in consistency (I'm in their Copilot Chat beta).


Yep. I'm guessing PaLM 2 is about 200B params, as it seems clearly stronger than Chinchilla.


Those are the numbers for the scaling law tests they did, not necessarily the PaLM 2 range.


For 'Palm-2', read, 'T5-2'.


I've heard Bard was previously 3B parameters but I could never find a good source for it.

I honestly think the endgame here is running on consumer devices: models of 7B parameters and under need ~4GB of RAM to actually run, which is likely the maximum reasonable requirement for consumer devices.

That said, mid-range hardware can do 15B; anything larger than this is currently something only "enthusiasts" can run.

If it is small enough to run on consumer devices, then they don't have to pay for the inference compute, and presumably latency will be improved for consumers.
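
(A rough footprint sketch behind the ~4GB figure, assuming 4-bit weights plus a small overhead for the KV cache and runtime buffers; the overhead factor is my guess.)

    # Approximate RAM needed to run a model locally at 4-bit quantization.
    def ram_gb(params_billion, bits=4, overhead=1.2):   # 1.2x: KV cache, buffers (guess)
        return params_billion * 1e9 * bits / 8 * overhead / 1e9

    for p in (3, 7, 13, 15):
        print(f"{p:>2}B params: ~{ram_gb(p):.1f} GB at 4-bit")

    #  3B: ~1.8 GB   7B: ~4.2 GB   13B: ~7.8 GB   15B: ~9.0 GB
    # which lines up with 7B-class models as the practical ceiling for most
    # consumer devices, and ~15B for higher-end hardware.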


The current state of consumer devices isn't static, either, and existing hardware (even GPU) is suboptimal for the current crop of LLMs - it does way more than it actually needs to do.



