prrathi's comments

prrathi · 2025-01-10T22:16:03 1736547363

This blog post dives into the system details of the recent DeepSeek-V3 model, comparing it with Llama 3 405B. We cover training economics, the use of FP8, and parallelization strategies on the way to a hypothesis for what's on everyone's mind: why route to 8 out of 256 experts?