
> If training and inference just got 40x more efficient

Did training and inference just get 40x more efficient, or just training? They trained a model with impressive outputs on a limited number of GPUs, but DeepSeek is still a big model that requires a lot of resources to run. Moreover, which costs more, training a model once or using it for inference across a hundred million people multiple times a day for a year? It was always the second one, and doing the training cheaper makes it even more so.

But this implies that we could use those same resources to train even bigger models, right? Except that you then have the same problem: you have a bigger model, maybe it's better, but if inference cost scales roughly linearly with model size and the model is now 40x bigger, you need that much more compute for inference.
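To make that concrete, here's a rough back-of-envelope in Python. Every number is an assumption made up for illustration, not an actual DeepSeek figure:

    # Back-of-envelope: one-time training cost vs a year of inference at scale.
    # Every number here is an illustrative assumption, not a real DeepSeek figure.

    train_gpu_hours = 2.8e6            # assumed GPU-hours to train the model once
    gpu_hour_cost = 2.0                # assumed $ per GPU-hour

    users = 100e6                      # "a hundred million people"
    queries_per_user_per_day = 5       # "multiple times a day"
    cost_per_query = 0.002             # assumed $ of inference compute per query

    training_cost = train_gpu_hours * gpu_hour_cost
    inference_cost_per_year = users * queries_per_user_per_day * 365 * cost_per_query

    print(f"training (once):      ${training_cost:,.0f}")           # ~$5,600,000
    print(f"inference (one year): ${inference_cost_per_year:,.0f}")  # ~$365,000,000

Even with cheap per-query compute, the yearly inference bill dwarfs the one-time training cost, and it grows with model size.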

I don’t think that it got more efficient. It’s that smaller models can be trained cheaply via larger ones. Think teacher/student relationship.

https://en.m.wikipedia.org/wiki/Knowledge_distillation
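For the curious, a minimal sketch of that teacher/student setup in PyTorch, classic Hinton-style soft-target distillation. The temperature and weighting here are illustrative, and note that the R1 paper's distilled models are reportedly plain SFT on R1 outputs rather than logit matching:

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        """Soft-target distillation: match the teacher's softened distribution,
        plus the usual hard-label cross-entropy."""
        soft_targets = F.softmax(teacher_logits / T, dim=-1)
        soft_student = F.log_softmax(student_logits / T, dim=-1)
        kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1 - alpha) * ce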


Actually inference got more efficient as well, thanks to the multi-head latent attention algorithm that compresses the key-value cache to drastically reduce memory usage.

https://mlnotes.substack.com/p/the-valleys-going-crazy-how-d...
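Conceptually it's a low-rank trick: cache a small latent per token instead of full keys and values, and re-expand when attention needs them. A simplified sketch, not DeepSeek's exact MLA (which also handles the RoPE part separately); the dimensions are illustrative:

    import torch
    import torch.nn as nn

    d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

    # Down-project each token's hidden state to a small latent; only this is cached.
    W_down = nn.Linear(d_model, d_latent, bias=False)
    # Up-project the cached latent back to per-head keys and values when needed.
    W_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
    W_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

    h = torch.randn(1, 1024, d_model)    # hidden states for 1024 tokens
    latent_cache = W_down(h)             # (1, 1024, 512)  <- what gets stored

    k = W_up_k(latent_cache)             # reconstructed keys,   (1, 1024, 4096)
    v = W_up_v(latent_cache)             # reconstructed values, (1, 1024, 4096)

    # Cache per token: 512 floats instead of 2 * 32 * 128 = 8192,
    # i.e. a 16x smaller KV cache in this toy setup.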


That's a useful performance improvement, but it's incremental progress of the kind new models often deliver over their predecessors, not on the scale of the much more dramatic reduction they've achieved in training cost.

If the H800 is a memory-constrained chip that NVIDIA built to get around the Chinese export ban on the H100 while keeping equivalent fp8 performance, it makes zero sense to believe Elon Musk, Dario Amodei, and Alexandr Wang's claims that DeepSeek smuggled H100s.

The only reason a team would allocate time to memory optimizations and writing NVPTX code, rather than focusing on post-training, is if they severely struggled with memory during training.

I mean, take a look at the numbers:

https://www.fibermall.com/blog/nvidia-ai-chip.htm#A100_vs_A8...

This is a massive trick pulled by Jensen: take the H100 design, whose sales are regulated by the government, make it look 40x weaker on paper, and call it the H800, while conveniently leaving 8-bit computation as fast as on the H100. Then bring it to China and let companies stockpile it, with no disclosed production or sales numbers and no export controls.

Eventually, after 7 months, the US government notices the H800 sales and introduces new export controls, but it's too late. By this point, DeepSeek has started doing research in fp8. They slowly build bigger and bigger models, working on bandwidth and memory consumption, until they make R1, their reasoning model.


Interesting how people keep calling it “the Chinese export ban”. Isn’t it an American export ban?

What's surprising is that anyone would repeat Elon Musk-related claims.

Whether it's tech or politics, he's off the deep end.


Especially since he seems intent on everyone talking about him all the time. I find it questionable when a person wants to be the centre of attention no matter what. Perhaps attention is not all we need.

Yet another casualty of laypersons browsing arXiv. That paper was like flypaper to his narcissism.

The problem is that he's only wrong some of the time, and people arguing about which it is this time generates attention, a valuable commodity.

Maybe “some” applied in the past, but his recent history might best be described as “almost always”.

Drugs. Don't do that many drugs for that long.

He's like a broken smart network switch, smart as in managed. Packets with the switch's own MAC on them are all broken, but erroneously forwarded ones often carry valuable data. From up at layer 3, we can't tell which is which.

I'm wrong some of the time.

He's a lucky mensch, no more, no less.


I think what got cheaper is models with up-to-date information.

You almost never reintegrate new information via training; it's by far the most expensive way to do that.

...and that got cheaper? I'm not sure what your point is.

At some point, the models _have_ to do "continuous integration" to provide the "AGI" that's wanted out of this tech.


> DeepSeek is still a big model that requires a lot of resources to run

I can run the largest model at 4 tokens per second on a 64GB card. Smaller models are _faster_ than Phi-4.

I've just switched to it for my local inference.


Isn't the largest model still like 130GB after heavy quantization[1], and isn't 4 tok/s borderline unusable for interactive sessions with those long outputs?

[1] https://unsloth.ai/blog/deepseekr1-dynamic
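The ~130GB is roughly what the arithmetic gives you: parameter count times average bits per weight, assuming ~671B total parameters and ~1.58 bits/weight for the most aggressive dynamic quant:

    # Rough size estimate for a heavily quantized ~671B-parameter MoE model.
    params = 671e9      # total parameters (all experts stored), assumed
    avg_bits = 1.58     # assumed average bits/weight for the aggressive dynamic quant

    size_gb = params * avg_bits / 8 / 1e9
    print(f"~{size_gb:.0f} GB")   # ~133 GB, in the same ballpark as the ~130GB figure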


I told it to skip all reasoning and explanations and output just the code. It complied, saving a lot of time.

Wouldn't that also result in it skipping the "thinking" and thus in worse results?

OP probably means "the largest distilled model"

So not an actual DeepSeek-R1 model but a distilled Qwen or Llama model.

From DeepSeek-R1 paper:

> As shown in Table 5, simply distilling DeepSeek-R1’s outputs enables the efficient DeepSeek-R1-7B (i.e., DeepSeek-R1-Distill-Qwen-7B, abbreviated similarly below) to outperform non-reasoning models like GPT-4o-0513 across the board.

and

> DeepSeek-R1-14B surpasses QwQ-32B-Preview on all evaluation metrics, while DeepSeek-R1-32B and DeepSeek-R1-70B significantly exceed o1-mini on most benchmarks.

and

> These [Distilled Model Evaluation] results demonstrate the strong potential of distillation. Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here.


Yes, but even that can still be run (slowly) on CPU-only systems with as little as about 32GB of RAM. Memory virtualization is a thing. If you get used to using it like email rather than chat, it's still super useful even if you are waiting half an hour for your reply. Presumably you have a fast distill on tap for interactive stuff.
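If anyone wants to try that, a minimal sketch using the llama-cpp-python binding and leaning on memory-mapped weights so the OS pages them in from disk; the model file name and parameter values are placeholders, not a recommended config:

    # Sketch: rely on mmap'd weights so the whole model doesn't need to fit in RAM.
    from llama_cpp import Llama

    llm = Llama(
        model_path="DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf",  # placeholder file
        n_ctx=4096,
        n_threads=16,
        use_mmap=True,    # memory-map the weights; pages are loaded on demand
        use_mlock=False,  # don't pin pages, so the OS can evict cold layers
    )

    out = llm("Summarize this email thread: ...", max_tokens=256)
    print(out["choices"][0]["text"])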

I run my models in an agentic framework with fast models that can ask slower models or APIs when needed. It works perfectly, 60 percent of the time lol.
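The routing itself is simple; a toy sketch, where the model stand-ins and the confidence heuristic are made up:

    import random

    def ask_fast(prompt: str) -> tuple[str, float]:
        """Stand-in for a small, fast local model; returns (answer, confidence)."""
        return f"fast answer to: {prompt}", random.random()

    def ask_slow(prompt: str) -> str:
        """Stand-in for the full reasoning model or a hosted API."""
        return f"careful answer to: {prompt}"

    def route(prompt: str, threshold: float = 0.7) -> str:
        """Answer with the fast model; escalate only when it signals low confidence."""
        answer, confidence = ask_fast(prompt)
        return answer if confidence >= threshold else ask_slow(prompt)

    print(route("What changed in the latest export controls?"))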


How are you running it? Can you be more specific?

DeepSeek-R1-Distill-Llama-70B on triple 4090 cards.


