This article is 3 years old, written right before GPT-3's launch, so it's missing all the recent developments. We're not using meta-learning much anymore: the problem of training a model from a few examples has largely been solved by few-shot prompting, zero-shot prompting, or fine-tuning on a small dataset. The trouble with meta-learning is that it tries to back-propagate through the inner loop (the model) into the outer loop (the meta-model), and this is hard to do.
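To make that concrete, here's a minimal MAML-style sketch (a toy linear model in JAX; the data and learning rate are placeholders, not anything from the article):

```python
import jax
import jax.numpy as jnp

def loss(params, x, y):
    # Toy linear model: predictions are x @ params.
    return jnp.mean((x @ params - y) ** 2)

def inner_update(params, x, y, inner_lr=0.1):
    # Inner loop: one gradient step of task-specific adaptation.
    return params - inner_lr * jax.grad(loss)(params, x, y)

def meta_loss(params, support, query):
    # Outer loop: score the adapted parameters on held-out query data.
    adapted = inner_update(params, *support)
    return loss(adapted, *query)

# Toy task data.
key = jax.random.PRNGKey(0)
xs, ys = jax.random.normal(key, (8, 3)), jnp.ones(8)
xq, yq = jax.random.normal(key, (8, 3)), jnp.ones(8)

# The hard part: this gradient flows *through* the inner update, so it
# needs second-order derivatives; with long inner loops and large models
# the memory and compute costs blow up.
meta_grads = jax.grad(meta_loss)(jnp.zeros(3), (xs, ys), (xq, yq))
```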
Meta-learning is also the most important tool for humans: you know you'll always need to keep learning throughout your life, but you don't know in advance precisely what you'll need to learn.
Since LLMs have such huge and diverse training sets, almost everything is in-distribution. An LLM is a conditional probability distribution that can be steered with a prompt far more easily, with no need for a meta-model.
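In effect, the prompt context replaces the inner loop: adaptation happens in the conditioning, not in the weights. A minimal sketch (`complete` is a hypothetical stand-in for any LLM completion API):

```python
def make_few_shot_prompt(examples, query):
    # Each (input, output) pair becomes one demonstration in the context;
    # the model infers the task from these instead of from a weight update.
    demos = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{demos}\nInput: {query}\nOutput:"

prompt = make_few_shot_prompt(
    examples=[("cheese", "fromage"), ("house", "maison")],
    query="cat",
)
# answer = complete(prompt)  # hypothetical LLM call; the conditioning does the "learning"
```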
If the problem is training on many related tasks so that a new one becomes easy to pick up, then we'd better read this very recent paper and its accompanying benchmark, which shed light on the inner workings of task compositionality in LLMs:
- A Theory for Emergence of Complex Skills in Language Models (theoretical part)
https://arxiv.org/abs/2307.15936
- Skill Mix: a Flexible and Expandable Family of Evaluations for AI models (the benchmark)
https://arxiv.org/abs/2310.17567
This benchmark might also resolve the confidence crisis in LLM metrics. They show that Mistral can handle combinations of up to 2 skills, LLaMA up to 3, and GPT-4 can compose 4-5. The skills are drawn uniformly at random from a set of 100 skills, paired with 100 topics. It's so hard to find benchmarks that clearly separate 7B models pretending to have 70B-model skills.
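To illustrate the setup, here's a sketch of how one such evaluation item might be sampled (the skill names, topic names, and prompt template are illustrative placeholders, not the paper's actual lists):

```python
import random

# Illustrative placeholders; the paper uses sets of ~100 skills and ~100 topics.
SKILLS = ["metaphor", "irony", "modus ponens", "anaphora", "red herring"]
TOPICS = ["gardening", "chess", "baking", "astronomy"]

def sample_item(k):
    # Draw k distinct skills and one topic, uniformly at random.
    skills = random.sample(SKILLS, k)
    topic = random.choice(TOPICS)
    return f"Write a short piece about {topic} that exhibits: {', '.join(skills)}."

print(sample_item(k=3))
```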
Their argument is that the number of skill combinations grows combinatorially with the number of skills combined, so a randomly drawn combination has little chance of being covered in the training set, especially on less common topics. So if a model can solve these tasks, it genuinely demonstrates combinatorial generalization.
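A back-of-the-envelope count, using the 100 skills and 100 topics quoted above, shows how fast the task space outruns any training set:

```python
from math import comb

skills, topics = 100, 100
for k in range(2, 6):
    # Number of distinct (skill-subset, topic) tasks when k skills are combined.
    print(k, comb(skills, k) * topics)
# k=2 ->        495,000
# k=3 ->     16,170,000
# k=4 ->    392,122,500
# k=5 -> 7,528,752,000  (~7.5 billion distinct tasks)
```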
This is the main chart, if you don't have time to read the papers:
https://i.imgur.com/6eDTf1u.png