Can you explain how the distilled models are generated? How are they related to DeepSeek R1? And are they significantly smarter than their non-distilled versions (e.g. vanilla Llama vs. Llama distilled with DeepSeek)?
My understanding of distillation is that one model 'teaches' another: in this case, outputs from the main R1 model are used to fine-tune the open-weight Llama model (and several Qwen variants as well). I'm not aware of a head-to-head comparison with vanilla Llama, but they benchmarked their distilled versions against other models in the GitHub README, and the distilled Llama 70B scores higher than Claude 3.5 Sonnet and o1-mini on all but one test.
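
For intuition, here's roughly what "distillation as fine-tuning" looks like in code: a sketch of supervised fine-tuning on teacher-generated traces using the Hugging Face Trainer. The model name, file name, and hyperparameters are all illustrative, not DeepSeek's actual setup.

  # A minimal sketch of distillation-as-fine-tuning, assuming the teacher
  # (R1) has already generated reasoning traces saved as JSON lines like
  # {"text": "<prompt + R1's chain of thought + final answer>"}.
  from datasets import load_dataset
  from transformers import (AutoModelForCausalLM, AutoTokenizer,
                            DataCollatorForLanguageModeling, Trainer,
                            TrainingArguments)

  student = "meta-llama/Llama-3.1-8B"  # hypothetical open-weight student

  tokenizer = AutoTokenizer.from_pretrained(student)
  tokenizer.pad_token = tokenizer.eos_token  # Llama ships with no pad token
  model = AutoModelForCausalLM.from_pretrained(student)

  traces = load_dataset("json", data_files="teacher_traces.jsonl", split="train")
  traces = traces.map(
      lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
      batched=True, remove_columns=["text"])

  # mlm=False gives the plain causal-LM objective: the student learns to
  # reproduce the teacher's token stream. In this setting "distillation"
  # is just supervised fine-tuning on teacher outputs, not logit matching.
  collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

  Trainer(
      model=model,
      args=TrainingArguments(output_dir="llama-distilled-r1",
                             per_device_train_batch_size=1,
                             num_train_epochs=1),
      train_dataset=traces,
      data_collator=collator,
  ).train()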