> If training and inference just got 40x more efficient
Did training and inference just get 40x more efficient, or just training? They trained a model with impressive outputs on a limited number of GPUs, but DeepSeek is still a big model that requires a lot of resources to run. Moreover, which costs more, training a model once or using it for inference across a hundred million people multiple times a day for a year? It was always the second one, and doing the training cheaper makes it even more so.
But this implies that we could use those same resources to train even bigger models, right? Except that you then have the same problem. You have a bigger model, maybe it's better, but if you've made inference cost linearly more because of the size and the size is now 40x bigger, you now need that much more compute for inference.
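To put rough numbers on the inference-dominates argument, here is a back-of-the-envelope sketch; every figure in it is an illustrative assumption, not a measured DeepSeek cost:

```python
# Back-of-the-envelope comparison of a one-off training cost vs a year of
# inference at scale. All numbers are illustrative assumptions, not actuals.

TRAIN_COST_USD = 6e6             # assumed one-time training cost
USERS = 100e6                    # "a hundred million people"
QUERIES_PER_USER_PER_DAY = 3     # "multiple times a day"
TOKENS_PER_QUERY = 1_000         # prompt + completion, assumed
COST_PER_MILLION_TOKENS = 0.50   # assumed serving cost in USD

daily_tokens = USERS * QUERIES_PER_USER_PER_DAY * TOKENS_PER_QUERY
yearly_inference_cost = daily_tokens / 1e6 * COST_PER_MILLION_TOKENS * 365

print(f"training (once):      ${TRAIN_COST_USD:,.0f}")         # ~$6M
print(f"inference (one year): ${yearly_inference_cost:,.0f}")   # ~$55M
# If the model were 40x bigger and serving cost scaled linearly, the
# inference line would grow 40x, while the training line is paid only once.
```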
Actually inference got more efficient as well, thanks to the multi-head latent attention algorithm that compresses the key-value cache to drastically reduce memory usage.
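To make that claim concrete, here is a minimal PyTorch sketch of the KV-cache compression idea behind multi-head latent attention; the dimensions and layer names are illustrative assumptions, not DeepSeek's actual configuration:

```python
import torch
import torch.nn as nn

# Sketch of the MLA idea: instead of caching full per-head keys and values,
# cache one small shared latent per token and up-project it at attention time.
d_model, n_heads, d_head, d_latent = 4096, 32, 128, 512  # illustrative sizes

W_dkv = nn.Linear(d_model, d_latent, bias=False)           # down-projection (its output is cached)
W_uk  = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-project latent to keys
W_uv  = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-project latent to values

x = torch.randn(1, 1, d_model)   # hidden state of one new token
latent = W_dkv(x)                # shape (1, 1, 512) -- the only thing stored in the cache

# Standard cache per token: K + V = 2 * n_heads * d_head = 8192 values.
# Latent cache per token: d_latent = 512 values, ~16x smaller in this sketch.
k = W_uk(latent).view(1, 1, n_heads, d_head)
v = W_uv(latent).view(1, 1, n_heads, d_head)
```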
That's a useful performance improvement but it's incremental progress in line with what new models often improve over their predecessors, not in line with the much more dramatic reduction they've achieved in training cost.
If the H800 is a memory-constrained chip that NVIDIA built to get around the Chinese export ban on the H100 while keeping equivalent fp8 performance,
it makes zero sense to believe Elon Musk, Dario Amodei, and Alexandr Wang's claims that DeepSeek smuggled H100s.
The only reason a team would spend time on memory optimizations and hand-written NVPTX code rather than focusing on post-training is that they were severely struggling with memory during training.
This is a massive trick pulled by Jensen: take the H100 design, whose sales are regulated by the government, make it look 40x weaker on paper, and call it the H800, while conveniently leaving 8-bit computation as fast as on the H100. Then ship it to China, where it faces no export controls, and let companies stockpile it without disclosing production or sales numbers.
Eventually, after 7 months, the US government notices the H800 sales and introduces new export controls, but it's too late. By this point, DeepSeek has started doing research with fp8. They slowly build bigger and bigger models, working on bandwidth and memory consumption, until they make R1, their reasoning model.
Especially since he seems intent on everyone talking about him all the time. I find it questionable when a person wants to be the centre of attention no matter what. Perhaps attention is not all we need.
He's like a broken smart network switch, smart as in managed. Packets carrying the switch's own MAC are all broken, but the erroneously forwarded ones often have valuable data. We, watching from L3, don't know which is which.
So not an actual DeepSeek-R1 model but a distilled Qwen or Llama model.
From DeepSeek-R1 paper:
> As shown in Table 5, simply distilling DeepSeek-R1’s outputs enables the efficient DeepSeek-R1-7B (i.e., DeepSeek-R1-Distill-Qwen-7B, abbreviated similarly below) to outperform non-reasoning models like GPT-4o-0513 across the board.
and
> DeepSeek-R1-14B surpasses QwQ-32B-Preview on all evaluation metrics, while DeepSeek-R1-32B and DeepSeek-R1-70B significantly exceed o1-mini on most benchmarks.
and
> These [Distilled Model Evaluation] results demonstrate the strong potential of distillation. Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here.
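For anyone who wants to see what that distillation amounts to mechanically, here is a rough sketch of SFT on teacher-generated traces using the Hugging Face transformers API; the student model name, the tiny "dataset", and the hyperparameters are placeholders, not the paper's actual recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of SFT-style distillation: fine-tune a small student with ordinary
# next-token cross-entropy on outputs generated by the large teacher.
student_name = "Qwen/Qwen2.5-7B"   # placeholder stand-in for the distillation target
tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)

# In the real setup this would be a large corpus of teacher-generated
# reasoning traces; one toy example stands in for it here.
traces = [
    "Question: What is 2 + 2?\n<think>2 and 2 make 4.</think>\nAnswer: 4",
]

optim = torch.optim.AdamW(student.parameters(), lr=1e-5)
for text in traces:
    batch = tok(text, return_tensors="pt")
    out = student(**batch, labels=batch["input_ids"])  # standard LM loss on the trace
    out.loss.backward()
    optim.step()
    optim.zero_grad()
```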
Yes, but even that can still be run (slowly) on CPU-only systems with as little as about 32 GB of RAM; memory virtualization is a thing. If you get used to treating it like email rather than chat, it’s still super useful even if you are waiting half an hour for a reply. Presumably you have a fast distill on tap for the interactive stuff.
I run my models in an agentic framework with fast models that can ask slower models or APIs when needed. It works perfectly, 60 percent of the time lol.
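That routing pattern is simple enough to sketch; the ask_fast/ask_slow helpers below are hypothetical stand-ins for whatever local runtime or API an agent framework would actually call:

```python
# Toy sketch of routing between a fast distilled model and a slower, stronger
# one. ask_fast() and ask_slow() are hypothetical placeholders.

def ask_fast(prompt: str) -> tuple[str, float]:
    """Small local model: returns (answer, self-reported confidence)."""
    return "quick answer", 0.4   # placeholder values

def ask_slow(prompt: str) -> str:
    """Big reasoning model or remote API: slow but more reliable."""
    return "careful answer"      # placeholder value

def answer(prompt: str, threshold: float = 0.7) -> str:
    draft, confidence = ask_fast(prompt)
    if confidence >= threshold:
        return draft             # interactive path
    return ask_slow(prompt)      # escalate when the fast model is unsure

print(answer("Summarize the DeepSeek-R1 distillation results."))
```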