azath92 3 days ago | on: LLM from scratch, part 28 – training a base model ...

For small models this is definitely the way forward. There are some great small datasets out there: check out the TinyStories dataset, which limits vocabulary to what a child of a certain age would know but keeps the core reasoning inherent in even simple language (https://huggingface.co/datasets/roneneldan/TinyStories, https://arxiv.org/abs/2305.07759). I have fewer concrete examples, but my understanding is that dataset curation is how many improvements are gained at any model size. Unless you are building a frontier model, you can use a better model to help curate or generate that dataset. TinyStories was generated with GPT-3.5 and GPT-4, for example.
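The "use a better model to help curate" idea above can be sketched roughly as a filter over candidate documents. This is a minimal illustration, not how TinyStories or FineWeb-Edu were actually built: `curate`, `toy_score`, and the 0.8 threshold are hypothetical names and values, and `toy_score` is a crude heuristic standing in for a call to a stronger model.

```python
from typing import Callable, Iterable, List


def curate(
    docs: Iterable[str],
    score_quality: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep only documents whose quality score meets the threshold."""
    return [doc for doc in docs if score_quality(doc) >= threshold]


def toy_score(doc: str) -> float:
    """Hypothetical stand-in for a stronger model's quality judgment.

    Returns the fraction of whitespace-separated tokens that are purely
    alphabetic once surrounding punctuation is stripped.
    """
    words = doc.split()
    if not words:
        return 0.0
    clean = [w.strip(".,!?;:'\"") for w in words]
    return sum(w.isalpha() for w in clean) / len(words)


docs = [
    "Once upon a time, a small dog found a red ball.",
    "$$$ CLICK >>> http://x 1234 !!!",
    "",
]

# In a real pipeline, score_quality would wrap a model call (an assumption here).
kept = curate(docs, toy_score, threshold=0.8)
print(kept)
```

Passing the scorer in as a callable keeps the filtering logic separate from whichever model does the judging, so the heuristic can be swapped for a model-backed scorer without touching `curate`.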

gpjt 3 days ago

OP here: one thing that surprised me in this experiment was that the model trained on the more curated FineWeb-Edu dataset was worse than the one trained on FineWeb. That is very counterintuitive to me.
