
I don't understand what you mean. To improve performance using the same basic architecture, the model needs to scale both compute and training data. Where are we going to find another 10x web-sized text training corpus?



GPT-4 was allegedly trained on 6T tokens for 2 epochs (i.e. twice). 6T is far from exhausting the text data we have.

Here's a 5-language, 30T-token dataset built from web-scraped data.

https://github.com/togethercomputer/RedPajama-Data
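
The dataset is also mirrored on Hugging Face, so you can stream a sample without pulling the whole 30T tokens. A rough sketch — the "sample" config and the "raw_content" field name are taken from the project's README/dataset card, so check those for the current options:

    # Rough sketch: stream a few documents from the RedPajama-V2 sample config.
    # Config name and field name are based on the project's docs; verify before use.
    from datasets import load_dataset

    ds = load_dataset(
        "togethercomputer/RedPajama-Data-V2",
        name="sample",    # small sample config documented in the README
        streaming=True,   # avoid downloading the full corpus
    )
    for i, doc in enumerate(ds["train"]):
        print(doc["raw_content"][:200])  # field name per the dataset card
        if i == 4:
            break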

And that is just web-scraped data. There are trillions of valuable tokens' worth of text from the likes of PDFs/ebooks, academic journals, and other documents that essentially have no web presence otherwise.

https://annas-archive.org/llm
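
Back-of-the-envelope, the alleged GPT-4 budget doesn't come close to exhausting even the public web scrape, let alone books and journals. (The GPT-4 numbers here are the rumored ones from this thread, not confirmed.)

    # Back-of-the-envelope, using the rumored GPT-4 numbers from this thread.
    gpt4_unique_tokens = 6e12    # alleged unique training tokens
    gpt4_epochs = 2              # alleged number of passes
    tokens_seen = gpt4_unique_tokens * gpt4_epochs  # ~1.2e13 tokens processed

    redpajama_v2 = 30e12         # RedPajama-Data-V2 web text
    print(redpajama_v2 / gpt4_unique_tokens)  # ~5x more unique web text available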

>To improve performance using the same basic architecture, the model needs to scale both compute and training data.

What you are trying to say here is that you need to scale both parameters and data, since scaling data increases compute.

That said, it's not really true. It would be best if you scaled both data and parameters, but you can get increased performance by scaling just one or the other.
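
For the "one or the other" point: in the Chinchilla-style loss fit L(N, D) = E + A/N^alpha + B/D^beta, holding parameters fixed and adding data (or vice versa) still lowers predicted loss; it's just not the compute-optimal way to spend FLOPs. A quick sketch using the fitted constants reported by Hoffmann et al. (2022):

    # Chinchilla-style parametric loss fit, constants from Hoffmann et al. (2022):
    # L(N, D) = E + A / N**alpha + B / D**beta
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(N, D):
        return E + A / N**alpha + B / D**beta

    N = 70e9    # parameters
    D = 1.4e12  # training tokens
    print(loss(N, D))       # baseline
    print(loss(N, 10 * D))  # 10x data, same parameters: loss still drops
    print(loss(10 * N, D))  # 10x parameters, same data: loss also drops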


The 6T-token dataset is surely a high-quality subset / refined extract from much larger public datasets. It's misleading to compare their sizes directly.


We don't know what the dataset is. "High-quality subset from much larger public datasets" is not just inherently speculation, it's flat-out wrong, as no such public datasets existed when GPT-4 was trained.



