Can you explain how the distilled models are generated? How are they related to DeepSeek R1? And are they significantly smarter than their non-distilled versions (e.g. vanilla Llama vs. Llama distilled with DeepSeek)?
My understanding of distillation is that one model 'teaches' another: in this case, outputs from the main R1 model are used to fine-tune the open-weight Llama model (and several Qwen variants as well). I'm not aware of a head-to-head comparison with vanilla Llama, but they benchmarked their distilled versions against other models in the GitHub README, and the distilled Llama 70B scores higher than Claude 3.5 Sonnet and o1-mini on all but one test.
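
For intuition, here's roughly what "distillation as fine-tuning" looks like in code: a sketch of supervised fine-tuning on teacher-generated traces using the Hugging Face Trainer. The model name, file name, and hyperparameters are all illustrative, not DeepSeek's actual setup.

  # A minimal sketch of distillation-as-fine-tuning, assuming the teacher
  # (R1) has already generated reasoning traces saved as JSON lines like
  # {"text": "<prompt + R1's chain of thought + final answer>"}.
  from datasets import load_dataset
  from transformers import (AutoModelForCausalLM, AutoTokenizer,
                            DataCollatorForLanguageModeling, Trainer,
                            TrainingArguments)

  student = "meta-llama/Llama-3.1-8B"  # hypothetical open-weight student

  tokenizer = AutoTokenizer.from_pretrained(student)
  tokenizer.pad_token = tokenizer.eos_token  # Llama ships with no pad token
  model = AutoModelForCausalLM.from_pretrained(student)

  traces = load_dataset("json", data_files="teacher_traces.jsonl", split="train")
  traces = traces.map(
      lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
      batched=True, remove_columns=["text"])

  # mlm=False gives the plain causal-LM objective: the student learns to
  # reproduce the teacher's token stream. In this setting "distillation"
  # is just supervised fine-tuning on teacher outputs, not logit matching.
  collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

  Trainer(
      model=model,
      args=TrainingArguments(output_dir="llama-distilled-r1",
                             per_device_train_batch_size=1,
                             num_train_epochs=1),
      train_dataset=traces,
      data_collator=collator,
  ).train()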