If you read their blog post, they mention it was pretrained on 12 trillion tokens of text. That is ~5x the amount used in the Llama 2 training runs.
From that, it seems somewhat likely we've hit a wall on improving <X B-parameter LLMs by simply scaling up the training data, which basically forces everyone to keep scaling up model size if they want to keep up with SOTA.
Not recently. GPT-3 from 2020 required even more RAM; the open-source BLOOM from 2022 did too.
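To put the RAM point in rough numbers, here's a back-of-the-envelope sketch. The parameter counts are the published ones for GPT-3 (175B) and BLOOM (176B); the per-parameter byte costs are just standard dtype sizes, and this only counts weights, not activations or KV cache:

```python
# Rough memory needed just to hold the weights in RAM/VRAM.
# Ignores activations, KV cache, optimizer state, and framework overhead.
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    return num_params * bytes_per_param / 1e9

for name, params in [("GPT-3", 175e9), ("BLOOM", 176e9)]:
    for dtype, size in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{name} @ {dtype}: ~{weight_memory_gb(params, size):.0f} GB")
```

Even quantized to 4 bits, a ~175B-parameter model is still in the ~90 GB range for weights alone, which is why these models never ran on consumer hardware.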
In my view, the main value of larger models is distillation (which we see, for instance, in how Claude Haiku matches release-day GPT-4 despite costing less than a tenth as much). Hopefully the distilled models will be easier to run.
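For anyone unfamiliar with what distillation means in practice, here's a minimal sketch of the classic soft-target recipe (Hinton et al., 2015): the small student is trained to match the big teacher's output distribution as well as the hard labels. This is just the textbook technique, not a claim about how Claude Haiku was actually trained; the temperature and blend weight are illustrative defaults.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend soft-target KL against the teacher with ordinary hard-label cross-entropy."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale the KL term by T^2 so its gradients stay comparable to the CE term.
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1 - alpha) * ce
```

The point is that the expensive frontier model only has to exist once; its outputs then become a much richer training signal for a small model than raw labels would be.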
This feels as crazy as Grok. Was there a generation of models recently where we decided to just crank up the parameter count?