- FP8 instead of FP32 precision training = 75% less memory (quick math sketch below)
- multi-token prediction to vastly speed up token output (toy head sketch below)
- Mixture of Experts (MoE) so that inference only activates part of the model, not the entire model (~37B parameters active at a time, not all 671B), which increases efficiency (routing sketch below)
- PTX (basically low-level assembly code for Nvidia GPUs) hacking to squeeze as much performance as possible out of their export-restricted H800 GPUs
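Quick back-of-envelope math on that 75% number, counting weight storage only (real training memory also includes gradients, optimizer state and activations, and their recipe keeps some pieces in higher precision):

```python
# Back-of-envelope memory math for storing model weights (illustrative only;
# real training also needs activations, gradients and optimizer state).
params = 671e9          # total parameters in DeepSeek-V3 / R1
fp32_bytes = params * 4 # 4 bytes per FP32 weight
fp8_bytes = params * 1  # 1 byte per FP8 weight

print(f"FP32 weights: {fp32_bytes / 1e9:.0f} GB")   # ~2684 GB
print(f"FP8  weights: {fp8_bytes / 1e9:.0f} GB")    # ~671 GB
print(f"Saving: {1 - fp8_bytes / fp32_bytes:.0%}")  # 75%
```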
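And a toy version of what a multi-token prediction head can look like, just to make the idea concrete. This is a generic illustration (extra output heads for future positions), not DeepSeek's exact MTP module:

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Toy multi-token prediction: one extra output head per future position.
    Generic illustration, not DeepSeek's exact design."""
    def __init__(self, hidden_dim, vocab_size, n_future=2):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, vocab_size) for _ in range(n_future)
        )

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_dim) from the transformer trunk.
        # Head k produces logits for the token k+1 positions ahead, so one
        # forward pass gives training signal (or draft tokens for speculative
        # decoding) for several future tokens instead of just the next one.
        return [head(hidden_states) for head in self.heads]

trunk_out = torch.randn(1, 16, 512)                 # fake trunk activations
logits_per_offset = MultiTokenHead(512, 32000)(trunk_out)
print([t.shape for t in logits_per_offset])         # two (1, 16, 32000) tensors
```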
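Finally, a toy top-k MoE layer showing why only a slice of the parameters does any work for a given token. The real model's routing and load-balancing are more involved; this just shows the core mechanic:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy top-k Mixture-of-Experts layer: a router picks a few experts per
    token, so only a small fraction of the weights runs for any one token."""
    def __init__(self, dim, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):
        # x: (n_tokens, dim)
        scores = self.router(x)                          # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                    # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(1) * self.experts[e](x[mask])
        return out

x = torch.randn(4, 64)
print(TinyMoE(64)(x).shape)  # (4, 64): each token only touched 2 of the 8 experts
```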
Then, the big innovation of R1 and R1-Zero was finding a way to utilize reinforcement learning within their LLM training.
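As I understand it, their RL recipe (GRPO) samples a group of answers per prompt and scores each answer relative to the rest of its group, which avoids training a separate critic model. A minimal sketch of just that advantage step; the names and the surrounding policy-gradient machinery are simplified:

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """Group-relative advantages in the spirit of GRPO (simplified sketch).
    rewards: (n_prompts, n_samples_per_prompt) scalar rewards, e.g. 1.0 if a
    sampled answer passed a correctness/format check, 0.0 otherwise."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Each answer is scored relative to its own group, so no separate learned
    # value network ("critic") is needed to estimate a baseline.
    return (rewards - mean) / (std + eps)

# One prompt, four sampled answers: two correct, two wrong.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
# Correct answers get a positive advantage (pushed up), wrong ones negative.
```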
They also use a kind of factorized attention, Multi-head Latent Attention (MLA), that compresses the keys and values into a small latent vector per token, which shrinks the KV cache (I still haven't read their papers closely, so take this description with a grain of salt).
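My rough mental model of it is a low-rank bottleneck: instead of caching full per-head keys and values, you cache one small latent per token and expand it back to K/V when attention needs it. A toy sketch of just that bottleneck (dimensions and names are mine, and the real MLA has extra pieces like decoupled RoPE keys):

```python
import torch
import torch.nn as nn

class LowRankKV(nn.Module):
    """Sketch of the low-rank "latent" KV idea: cache a small latent per token
    instead of full keys/values, and expand on demand."""
    def __init__(self, dim=4096, latent_dim=512):
        super().__init__()
        self.down = nn.Linear(dim, latent_dim, bias=False)  # compress: this is what gets cached
        self.up_k = nn.Linear(latent_dim, dim, bias=False)  # expand back to keys
        self.up_v = nn.Linear(latent_dim, dim, bias=False)  # expand back to values

    def forward(self, h):
        latent = self.down(h)            # (batch, seq, latent_dim) -> the KV cache entry
        return self.up_k(latent), self.up_v(latent)

h = torch.randn(1, 10, 4096)
k, v = LowRankKV()(h)
# The cache holds 512 floats per token instead of 2 * 4096, roughly 16x smaller.
print(k.shape, v.shape)
```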