I have fewer concrete examples, but my understanding is that dataset curation is where many of the improvements come from at any model size. Unless you are building a frontier model, you can use a stronger model to help curate or generate that dataset. TinyStories, for example, was generated with GPT-3.5 and GPT-4.
OP here: one thing that surprised me in this experiment was that the model trained on the more curated FineWeb-Edu dataset was worse than the one trained on FineWeb. That is very counterintuitive to me.