Look into Microsoft's Phi papers. The whole idea here is that if you train models on higher quality data (i.e. textbooks instead of blogspam) you get higher quality results.
The exact training is proprietary but they seem to use a lot of GPT-4 generated training data.
On that note... I've often wondered if broad memorization of trivia is really a sensible use of precious neurons. It seems like a system trained on a narrower range of high quality inputs would be much more useful (to me) than one that memorized billions of things I have no interest in.
At least at the small-model scale, the general-knowledge aspect seems very unreliable anyway -- so why not throw it out entirely?
The trivia encode information about many things: grammar, vocabulary, slang, entity relationships, metaphor, and more. Chiefly, though, they also constitute models of human thought and behaviour. If all you want is a fancy technical encyclopedia then by all means chop away at the training set, but if you want something you can talk to then you'll need to keep the diversity.
You can get diverse low-quality data from the web, but for diverse high-quality data the organic supply is exhausted. The only way forward is to generate it, and you can maintain a good distribution through structured randomness. For example, sample 5 random words from the dictionary and ask the model to compose a piece of text from them. The result will be more diverse than web text.
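To make that concrete, here's a minimal sketch in Python. It assumes a newline-delimited word list at /usr/share/dict/words (present on most Unix systems); the actual call to a model is left out, since any text-generation API would do:

    import random

    # Sketch of the "structured randomness" idea above, assuming a
    # newline-delimited word list at /usr/share/dict/words (common on
    # Unix systems); swap in any vocabulary file you have.
    with open("/usr/share/dict/words") as f:
        vocabulary = [w.strip() for w in f if w.strip().isalpha()]

    # Sample 5 random words to seed each generation request.
    seed_words = random.sample(vocabulary, 5)

    # Build the prompt; send it to whatever model you're generating with.
    prompt = ("Compose a short piece of text that naturally uses all of "
              "these words: " + ", ".join(seed_words) + ".")
    print(prompt)

Because the seed words are drawn uniformly from the whole vocabulary, repeated runs cover the topic space far more evenly than sampling prompts from web text would.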
Not exhausted, just not currently being collected. Generating via existing models is fine for distilling a better training set or refining existing low-quality samples, but it won't break out of distribution without some feedback mechanism. That's why simulation is promising, though it's pretty narrow at the moment. There's still a lot of space to fill in the point cloud, so coming up with novel data collection methods is important. I think this is off topic though; my original contention was that if you take too thin a slice you won't get a very useful model.
You're not just memorizing text though. Each piece of trivia is something that represents coherent parts of reality. Think of it as being highly compressed.
From what I've seen, Phi does well on benchmarks but poorly in real-world scenarios. They also made some odd decisions about the network structure, which means the memory requirements for larger contexts are really high.
> I've often wondered if broad memorization of trivia is really a sensible use of precious neurons.
I agree if we're talking about maximizing raw reasoning and logical inference abilities, but the problem is that the ship has sailed: people expect LLMs to have domain knowledge (even more than expert users are clamoring for LLMs to have better logic).
I bet a model with actual human “intelligence” but no Google-scale encyclopedic knowledge of the world it lives in would be scored less preferentially by the masses than what we have now.