If you read their blog post, they mention it was pretrained on 12 trillion tokens of text. That is ~5x the amount used in the Llama 2 training runs.
From that, it seems somewhat likely we've hit a wall on improving <X B-parameter LLMs by simply scaling up the training data, which basically forces everyone to keep scaling up model size if they want to keep up with SOTA.
Not recently. GPT-3 from 2020 required even more RAM; the open-source BLOOM from 2022 did too.
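To put the RAM point in rough numbers, here's a back-of-the-envelope sketch. The parameter counts are the published ones for GPT-3 (175B) and BLOOM (176B); the per-parameter byte costs are just standard dtype sizes, and this only counts weights, not activations or KV cache:

```python
# Rough memory needed just to hold the weights in RAM/VRAM.
# Ignores activations, KV cache, optimizer state, and framework overhead.
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    return num_params * bytes_per_param / 1e9

for name, params in [("GPT-3", 175e9), ("BLOOM", 176e9)]:
    for dtype, size in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{name} @ {dtype}: ~{weight_memory_gb(params, size):.0f} GB")
```

Even quantized to 4 bits, a ~175B-parameter model is still in the ~90 GB range for weights alone, which is why these models never ran on consumer hardware.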
In my view, the main value of larger models is distillation (which we see, for instance, in how Claude Haiku matches release-day GPT-4 despite costing less than a tenth as much). Hopefully the distilled models will be easier to run.
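For anyone unfamiliar with what distillation means in practice, here's a minimal sketch of the classic soft-target recipe (Hinton et al., 2015): the small student is trained to match the big teacher's output distribution as well as the hard labels. This is just the textbook technique, not a claim about how Claude Haiku was actually trained; the temperature and blend weight are illustrative defaults.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend soft-target KL against the teacher with ordinary hard-label cross-entropy."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale the KL term by T^2 so its gradients stay comparable to the CE term.
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1 - alpha) * ce
```

The point is that the expensive frontier model only has to exist once; its outputs then become a much richer training signal for a small model than raw labels would be.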
This feels as crazy as Grok. Was there a generation of models recently where we decided to just crank up the parameter count?