prrathi's comments

This blog post dives into the system details of the recent DeepSeek-V3 model, comparing it with Llama 3 405B. We cover training economics, the use of FP8, and parallelization strategies on the way to a hypothesis for the question on everyone's mind: why route to 8 out of 256 experts?
