
I don't understand what you mean. To improve performance using the same basic architecture, the model needs to scale both compute and training data. Where are we going to find another 10x web-sized text training corpus?



GPT-4 was allegedly trained on 6T tokens for 2 epochs (i.e. twice). 6T is far from exhausting the text data we have.

Here's a 5-language, 30T-token dataset built from web-scraped data.

https://github.com/togethercomputer/RedPajama-Data
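
The dataset is also mirrored on Hugging Face, so you can stream a sample without pulling the whole 30T tokens. A rough sketch — the "sample" config and the "raw_content" field name are taken from the project's README/dataset card, so check those for the current options:

    # Rough sketch: stream a few documents from the RedPajama-V2 sample config.
    # Config name and field name are based on the project's docs; verify before use.
    from datasets import load_dataset

    ds = load_dataset(
        "togethercomputer/RedPajama-Data-V2",
        name="sample",    # small sample config documented in the README
        streaming=True,   # avoid downloading the full corpus
    )
    for i, doc in enumerate(ds["train"]):
        print(doc["raw_content"][:200])  # field name per the dataset card
        if i == 4:
            break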

And that is just web-scraped data. There are trillions of valuable tokens' worth of text from the likes of PDFs/ebooks, academic journals, and other documents that essentially have no web presence otherwise.

https://annas-archive.org/llm
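
Back-of-the-envelope, the alleged GPT-4 budget doesn't come close to exhausting even the public web scrape, let alone books and journals. (The GPT-4 numbers here are the rumored ones from this thread, not confirmed.)

    # Back-of-the-envelope, using the rumored GPT-4 numbers from this thread.
    gpt4_unique_tokens = 6e12    # alleged unique training tokens
    gpt4_epochs = 2              # alleged number of passes
    tokens_seen = gpt4_unique_tokens * gpt4_epochs  # ~1.2e13 tokens processed

    redpajama_v2 = 30e12         # RedPajama-Data-V2 web text
    print(redpajama_v2 / gpt4_unique_tokens)  # ~5x more unique web text available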

>To improve performance using the same basic architecture, the model needs to scale both compute and training data.

What you are trying to say here is that you need to scale both parameters and data, since scaling data increases compute.

That said, it's not really true. It would be best if you scaled both data and parameters, but you can get increased performance by scaling just one or the other.
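
For the "one or the other" point: in the Chinchilla-style loss fit L(N, D) = E + A/N^alpha + B/D^beta, holding parameters fixed and adding data (or vice versa) still lowers predicted loss; it's just not the compute-optimal way to spend FLOPs. A quick sketch using the fitted constants reported by Hoffmann et al. (2022):

    # Chinchilla-style parametric loss fit, constants from Hoffmann et al. (2022):
    # L(N, D) = E + A / N**alpha + B / D**beta
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(N, D):
        return E + A / N**alpha + B / D**beta

    N = 70e9    # parameters
    D = 1.4e12  # training tokens
    print(loss(N, D))       # baseline
    print(loss(N, 10 * D))  # 10x data, same parameters: loss still drops
    print(loss(10 * N, D))  # 10x parameters, same data: loss also drops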


The 6T-token dataset is surely a high-quality subset / refined extract from much larger public datasets. It's misleading to compare their sizes directly.


We don't know what the dataset is. "High-quality subset from much larger public datasets" is not just inherently speculation, it's flat-out wrong, as no such public datasets existed when GPT-4 was trained.



