
I have wondered whether the very big models trained on a Big Pile of Everything could be used to curate smaller, higher quality data sets that lead to high-performing models with smaller parameter counts. Not only are smaller models easier to distribute and faster at inference time, but this also offers a licensing escape hatch if future copyright law changes or court rulings make it hard to publicly offer models trained on non-permissively licensed material.

1) Train an initial big model on everything you can get, yielding a capable but tainted-in-some-jurisdictions model. Keep that model private.

2) Use the big tainted model to narrow or distill the source data. One way is to identify the document subset that can be used freely (old public-domain works, user-generated content uploaded to your own service under your own company's ToS, government documents, works under unrestricted Creative Commons licenses...). The other is to use it to build "just the facts" distillations of restrictively licensed material. (See the sketch after this list.)

3) Train an untainted model using just the factual distillations and/or the permissively licensed material.
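
A rough sketch of step 2 in Python, assuming a hypothetical ask_big_model(prompt) callable that wraps the private model; the prompts and field names are made up for illustration, not taken from any real pipeline:

    # Step 2 sketch: use the private "tainted" model to curate training data
    # for the clean model. `ask_big_model` is a hypothetical stand-in for
    # whatever inference API the private model exposes.
    from typing import Callable, Iterable

    def curate(documents: Iterable[dict], ask_big_model: Callable[[str], str]) -> list[dict]:
        """Keep documents that are freely usable; reduce the rest to bare facts."""
        kept = []
        for doc in documents:
            # Path 1: ask the big model whether the document is freely usable
            # (public domain, own-service UGC, government work, CC0, ...).
            verdict = ask_big_model(
                "Answer FREE or RESTRICTED. Can the following text be used "
                "for training without a copyright license?\n\n" + doc["text"]
            )
            if verdict.strip().upper().startswith("FREE"):
                kept.append({"text": doc["text"], "source": doc["id"]})
                continue
            # Path 2: distill restricted material down to plain facts,
            # discarding the original expression.
            facts = ask_big_model(
                "List only the verifiable facts in the following text as terse, "
                "neutral bullet points. Do not reuse the original wording.\n\n"
                + doc["text"]
            )
            kept.append({"text": facts, "source": doc["id"], "distilled": True})
        return kept

The untainted model in step 3 would then be trained only on what curate() returns.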



Sounds vaguely like the paper “Textbooks Are All You Need”? Though they are not explicitly trying to remove the copyright taint.

https://arxiv.org/abs/2306.11644


Not sure about the licensing, but yes, technically you can do that.

Phi-1, and therefore Phi-1.5, were partially trained on synthetic textbooks generated by GPT-3.5.


The premise here is specifically not to train on generated output of the bigger model, but merely to use the bigger model to better curate non-generated (and thereby untainted) inputs for the smaller model's training set.


That's what I proposed in the Alternative Models section of my article. Except I wanted to use public-domain works (e.g., Project Gutenberg) for the base model so it's legally clear. Then, for one with proprietary content, K-12 and college textbooks, encyclopedias, and specialist works licensed for that purpose. Train the base model the way we train kids. Then use it to generate or evaluate the rest.

https://heswithjesus.com/tech/exploringai/index.html


I have also wondered if OpenAI are going to train a private model with all the ChatGPT history, and then use that to train a public model.


Doesn't that lead to model collapse?


The trick is to train a little, then augment with documents using RAG. The idea that a model alone can handle complex use cases is common, but usually wrong.
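
To make that concrete, a minimal sketch of retrieval-augmented generation (RAG), assuming a hypothetical generate(prompt) function for the model and a toy keyword-overlap retriever standing in for a real embedding index:

    # Toy RAG loop: retrieve the k most relevant documents, then answer the
    # question with them as context. Real systems use embeddings and a
    # vector index instead of keyword overlap.
    def retrieve(question: str, documents: list[str], k: int = 3) -> list[str]:
        q_words = set(question.lower().split())
        ranked = sorted(
            documents,
            key=lambda d: len(q_words & set(d.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def answer(question: str, documents: list[str], generate) -> str:
        context = "\n---\n".join(retrieve(question, documents))
        prompt = (
            "Answer using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:"
        )
        return generate(prompt)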


RAG?




