- FP8 instead of FP32 precision training = 75% less memory (quick math sketch below)
- multi-token prediction to vastly speed up token output (toy head sketch below)
- Mixture of Experts (MoE) so that inference only activates part of the model, not the entire model (~37B parameters active at a time, not all 671B), which increases efficiency (routing sketch below)
- PTX (basically low-level assembly code for Nvidia GPUs) hacking to squeeze as much performance as possible out of their export-restricted H800 GPUs
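Quick back-of-envelope math on that 75% number, counting weight storage only (real training memory also includes gradients, optimizer state and activations, and their recipe keeps some pieces in higher precision):

```python
# Back-of-envelope memory math for storing model weights (illustrative only;
# real training also needs activations, gradients and optimizer state).
params = 671e9          # total parameters in DeepSeek-V3 / R1
fp32_bytes = params * 4 # 4 bytes per FP32 weight
fp8_bytes = params * 1  # 1 byte per FP8 weight

print(f"FP32 weights: {fp32_bytes / 1e9:.0f} GB")   # ~2684 GB
print(f"FP8  weights: {fp8_bytes / 1e9:.0f} GB")    # ~671 GB
print(f"Saving: {1 - fp8_bytes / fp32_bytes:.0%}")  # 75%
```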
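And a toy version of what a multi-token prediction head can look like, just to make the idea concrete. This is a generic illustration (extra output heads for future positions), not DeepSeek's exact MTP module:

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Toy multi-token prediction: one extra output head per future position.
    Generic illustration, not DeepSeek's exact design."""
    def __init__(self, hidden_dim, vocab_size, n_future=2):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, vocab_size) for _ in range(n_future)
        )

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_dim) from the transformer trunk.
        # Head k produces logits for the token k+1 positions ahead, so one
        # forward pass gives training signal (or draft tokens for speculative
        # decoding) for several future tokens instead of just the next one.
        return [head(hidden_states) for head in self.heads]

trunk_out = torch.randn(1, 16, 512)                 # fake trunk activations
logits_per_offset = MultiTokenHead(512, 32000)(trunk_out)
print([t.shape for t in logits_per_offset])         # two (1, 16, 32000) tensors
```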
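Finally, a toy top-k MoE layer showing why only a slice of the parameters does any work for a given token. The real model's routing and load-balancing are more involved; this just shows the core mechanic:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy top-k Mixture-of-Experts layer: a router picks a few experts per
    token, so only a small fraction of the weights runs for any one token."""
    def __init__(self, dim, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):
        # x: (n_tokens, dim)
        scores = self.router(x)                          # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                    # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(1) * self.experts[e](x[mask])
        return out

x = torch.randn(4, 64)
print(TinyMoE(64)(x).shape)  # (4, 64): each token only touched 2 of the 8 experts
```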
Then, the big innovation of R1 and R1-Zero was finding a way to utilize reinforcement learning within their LLM training.
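As I understand it, their RL recipe (GRPO) samples a group of answers per prompt and scores each answer relative to the rest of its group, which avoids training a separate critic model. A minimal sketch of just that advantage step; the names and the surrounding policy-gradient machinery are simplified:

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """Group-relative advantages in the spirit of GRPO (simplified sketch).
    rewards: (n_prompts, n_samples_per_prompt) scalar rewards, e.g. 1.0 if a
    sampled answer passed a correctness/format check, 0.0 otherwise."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Each answer is scored relative to its own group, so no separate learned
    # value network ("critic") is needed to estimate a baseline.
    return (rewards - mean) / (std + eps)

# One prompt, four sampled answers: two correct, two wrong.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
# Correct answers get a positive advantage (pushed up), wrong ones negative.
```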
They also use a kind of factorized attention, Multi-head Latent Attention (MLA), that compresses the keys and values into a small latent vector per token, which shrinks the KV cache (I still haven't read their papers closely, so take this description with a grain of salt).
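My rough mental model of it is a low-rank bottleneck: instead of caching full per-head keys and values, you cache one small latent per token and expand it back to K/V when attention needs it. A toy sketch of just that bottleneck (dimensions and names are mine, and the real MLA has extra pieces like decoupled RoPE keys):

```python
import torch
import torch.nn as nn

class LowRankKV(nn.Module):
    """Sketch of the low-rank "latent" KV idea: cache a small latent per token
    instead of full keys/values, and expand on demand."""
    def __init__(self, dim=4096, latent_dim=512):
        super().__init__()
        self.down = nn.Linear(dim, latent_dim, bias=False)  # compress: this is what gets cached
        self.up_k = nn.Linear(latent_dim, dim, bias=False)  # expand back to keys
        self.up_v = nn.Linear(latent_dim, dim, bias=False)  # expand back to values

    def forward(self, h):
        latent = self.down(h)            # (batch, seq, latent_dim) -> the KV cache entry
        return self.up_k(latent), self.up_v(latent)

h = torch.randn(1, 10, 4096)
k, v = LowRankKV()(h)
# The cache holds 512 floats per token instead of 2 * 4096, roughly 16x smaller.
print(k.shape, v.shape)
```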