
I have wondered whether the very big models trained on a Big Pile of Everything could be used to curate smaller, higher quality data sets that lead to high-performing models with smaller parameter counts. Not only are smaller models easier to distribute and faster at inference time, but this also offers a licensing escape hatch if future copyright law changes or court rulings make it hard to publicly offer models trained on non-permissively licensed material.

1) Train an initial big model on everything you can get, yielding a capable but tainted-in-some-jurisdictions model. Keep that model private.

2) Use the big tainted model to narrow or distill the source data. One way is to identify the document subset that can be used freely (old public-domain works, user-generated content uploaded to your own service under your own company's ToS, government documents, works under unrestricted Creative Commons licenses...). The other is to use it to build "just the facts" distillations of restrictively licensed material. (See the sketch after this list.)

3) Train an untainted model using just the factual distillations and/or the permissively licensed material.
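
A rough sketch of step 2 in Python, assuming a hypothetical ask_big_model(prompt) callable that wraps the private model; the prompts and field names are made up for illustration, not taken from any real pipeline:

    # Step 2 sketch: use the private "tainted" model to curate training data
    # for the clean model. `ask_big_model` is a hypothetical stand-in for
    # whatever inference API the private model exposes.
    from typing import Callable, Iterable

    def curate(documents: Iterable[dict], ask_big_model: Callable[[str], str]) -> list[dict]:
        """Keep documents that are freely usable; reduce the rest to bare facts."""
        kept = []
        for doc in documents:
            # Path 1: ask the big model whether the document is freely usable
            # (public domain, own-service UGC, government work, CC0, ...).
            verdict = ask_big_model(
                "Answer FREE or RESTRICTED. Can the following text be used "
                "for training without a copyright license?\n\n" + doc["text"]
            )
            if verdict.strip().upper().startswith("FREE"):
                kept.append({"text": doc["text"], "source": doc["id"]})
                continue
            # Path 2: distill restricted material down to plain facts,
            # discarding the original expression.
            facts = ask_big_model(
                "List only the verifiable facts in the following text as terse, "
                "neutral bullet points. Do not reuse the original wording.\n\n"
                + doc["text"]
            )
            kept.append({"text": facts, "source": doc["id"], "distilled": True})
        return kept

The untainted model in step 3 would then be trained only on what curate() returns.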



Sounds vaguely like the paper “Textbooks Are All You Need”? Though they are not explicitly trying to remove the copyright taint.

https://arxiv.org/abs/2306.11644


Not sure about the licensing, but yes, technically you can do that.

Phi-1, and therefore Phi-1.5, were partially trained on synthetic textbooks generated by GPT-3.5.


The premise here is specifically not to train on generated output of the bigger model, but merely to use the bigger model to better curate non-generated (and thereby untainted) inputs for the smaller model's training set.


That's what I proposed in the Alternative Models section of my article. Except I wanted to use public-domain works (e.g., Project Gutenberg) for the base model so it's legally clear. Then, for one with proprietary content, K-12 and college textbooks, encyclopedias, and specialist works licensed for that purpose. Train the base model the way we train kids. Then use it to generate or evaluate the rest.

https://heswithjesus.com/tech/exploringai/index.html


I have also wondered if OpenAI are going to train a private model with all the ChatGPT history, and then use that to train a public model.


Doesn't that lead to model collapse?


The trick is to train a little, then augment with documents using RAG. The idea that a model alone can handle complex use cases is common, but usually wrong.
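
To make that concrete, a minimal sketch of retrieval-augmented generation (RAG), assuming a hypothetical generate(prompt) function for the model and a toy keyword-overlap retriever standing in for a real embedding index:

    # Toy RAG loop: retrieve the k most relevant documents, then answer the
    # question with them as context. Real systems use embeddings and a
    # vector index instead of keyword overlap.
    def retrieve(question: str, documents: list[str], k: int = 3) -> list[str]:
        q_words = set(question.lower().split())
        ranked = sorted(
            documents,
            key=lambda d: len(q_words & set(d.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def answer(question: str, documents: list[str], generate) -> str:
        context = "\n---\n".join(retrieve(question, documents))
        prompt = (
            "Answer using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:"
        )
        return generate(prompt)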


RAG?




