> If training and inference just got 40x more efficient
Did training and inference just get 40x more efficient, or just training? They trained a model with impressive outputs on a limited number of GPUs, but DeepSeek is still a big model that requires a lot of resources to run. Moreover, which costs more, training a model once or using it for inference across a hundred million people multiple times a day for a year? It was always the second one, and doing the training cheaper makes it even more so.
But this implies that we could use those same resources to train even bigger models, right? Except that you then have the same problem. You have a bigger model, maybe it's better, but if you've made inference cost linearly more because of the size and the size is now 40x bigger, you now need that much more compute for inference.
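To put rough numbers on the inference-dominates argument, here is a back-of-the-envelope sketch; every figure in it is an illustrative assumption, not a measured DeepSeek cost:

```python
# Back-of-the-envelope comparison of a one-off training cost vs a year of
# inference at scale. All numbers are illustrative assumptions, not actuals.

TRAIN_COST_USD = 6e6             # assumed one-time training cost
USERS = 100e6                    # "a hundred million people"
QUERIES_PER_USER_PER_DAY = 3     # "multiple times a day"
TOKENS_PER_QUERY = 1_000         # prompt + completion, assumed
COST_PER_MILLION_TOKENS = 0.50   # assumed serving cost in USD

daily_tokens = USERS * QUERIES_PER_USER_PER_DAY * TOKENS_PER_QUERY
yearly_inference_cost = daily_tokens / 1e6 * COST_PER_MILLION_TOKENS * 365

print(f"training (once):      ${TRAIN_COST_USD:,.0f}")         # ~$6M
print(f"inference (one year): ${yearly_inference_cost:,.0f}")   # ~$55M
# If the model were 40x bigger and serving cost scaled linearly, the
# inference line would grow 40x, while the training line is paid only once.
```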
Actually inference got more efficient as well, thanks to the multi-head latent attention algorithm that compresses the key-value cache to drastically reduce memory usage.
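To make that claim concrete, here is a minimal PyTorch sketch of the KV-cache compression idea behind multi-head latent attention; the dimensions and layer names are illustrative assumptions, not DeepSeek's actual configuration:

```python
import torch
import torch.nn as nn

# Sketch of the MLA idea: instead of caching full per-head keys and values,
# cache one small shared latent per token and up-project it at attention time.
d_model, n_heads, d_head, d_latent = 4096, 32, 128, 512  # illustrative sizes

W_dkv = nn.Linear(d_model, d_latent, bias=False)           # down-projection (its output is cached)
W_uk  = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-project latent to keys
W_uv  = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-project latent to values

x = torch.randn(1, 1, d_model)   # hidden state of one new token
latent = W_dkv(x)                # shape (1, 1, 512) -- the only thing stored in the cache

# Standard cache per token: K + V = 2 * n_heads * d_head = 8192 values.
# Latent cache per token: d_latent = 512 values, ~16x smaller in this sketch.
k = W_uk(latent).view(1, 1, n_heads, d_head)
v = W_uv(latent).view(1, 1, n_heads, d_head)
```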
That's a useful performance improvement but it's incremental progress in line with what new models often improve over their predecessors, not in line with the much more dramatic reduction they've achieved in training cost.
If the H800 is a memory-constrained chip that NVIDIA built to get around the Chinese export ban on the H100 while keeping equivalent fp8 performance,
it makes zero sense to believe Elon Musk, Dario Amodei, and Alexandr Wang's claims that DeepSeek smuggled H100s.
The only reason a team would spend time on memory optimizations and hand-written NVPTX code rather than focusing on post-training is that they were severely struggling with memory during training.
This is a massive trick pulled by Jensen: take the H100 design, whose sales are regulated by the government, make it look 40x weaker on paper, and call it the H800, while conveniently leaving 8-bit computation as fast as on the H100. Then ship it to China, where it faces no export controls, and let companies stockpile it without disclosing production or sales numbers.
Eventually, after 7 months, the US government notices the H800 sales and introduces new export controls, but it's too late. By this point, DeepSeek has started doing research with fp8. They slowly build bigger and bigger models, working on bandwidth and memory consumption, until they make R1, their reasoning model.
Especially since he seems intent on everyone talking about him all the time. I find it questionable when a person wants to be the centre of attention no matter what. Perhaps attention is not all we need.
He's like a broken smart network switch, smart as in managed. Packets carrying the switch's own MAC are all broken, but the erroneously forwarded ones often have valuable data. We, watching from L3, don't know which is which.
So not an actual DeepSeek-R1 model but a distilled Qwen or Llama model.
From DeepSeek-R1 paper:
> As shown in Table 5, simply distilling DeepSeek-R1’s outputs enables the efficient DeepSeek-R1-7B (i.e., DeepSeek-R1-Distill-Qwen-7B, abbreviated similarly below) to outperform non-reasoning models like GPT-4o-0513 across the board.
and
> DeepSeek-R1-14B surpasses QwQ-32B-Preview on all evaluation metrics, while DeepSeek-R1-32B and DeepSeek-R1-70B significantly exceed o1-mini on most benchmarks.
and
> These [Distilled Model Evaluation] results demonstrate the strong potential of distillation. Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here.
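For anyone who wants to see what that distillation amounts to mechanically, here is a rough sketch of SFT on teacher-generated traces using the Hugging Face transformers API; the student model name, the tiny "dataset", and the hyperparameters are placeholders, not the paper's actual recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of SFT-style distillation: fine-tune a small student with ordinary
# next-token cross-entropy on outputs generated by the large teacher.
student_name = "Qwen/Qwen2.5-7B"   # placeholder stand-in for the distillation target
tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)

# In the real setup this would be a large corpus of teacher-generated
# reasoning traces; one toy example stands in for it here.
traces = [
    "Question: What is 2 + 2?\n<think>2 and 2 make 4.</think>\nAnswer: 4",
]

optim = torch.optim.AdamW(student.parameters(), lr=1e-5)
for text in traces:
    batch = tok(text, return_tensors="pt")
    out = student(**batch, labels=batch["input_ids"])  # standard LM loss on the trace
    out.loss.backward()
    optim.step()
    optim.zero_grad()
```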
Yes, but even that can still be run (slowly) on CPU-only systems with as little as about 32 GB of RAM; memory virtualization is a thing. If you get used to treating it like email rather than chat, it’s still super useful even if you are waiting half an hour for a reply. Presumably you have a fast distill on tap for the interactive stuff.
I run my models in an agentic framework with fast models that can ask slower models or APIs when needed. It works perfectly, 60 percent of the time lol.
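That routing pattern is simple enough to sketch; the ask_fast/ask_slow helpers below are hypothetical stand-ins for whatever local runtime or API an agent framework would actually call:

```python
# Toy sketch of routing between a fast distilled model and a slower, stronger
# one. ask_fast() and ask_slow() are hypothetical placeholders.

def ask_fast(prompt: str) -> tuple[str, float]:
    """Small local model: returns (answer, self-reported confidence)."""
    return "quick answer", 0.4   # placeholder values

def ask_slow(prompt: str) -> str:
    """Big reasoning model or remote API: slow but more reliable."""
    return "careful answer"      # placeholder value

def answer(prompt: str, threshold: float = 0.7) -> str:
    draft, confidence = ask_fast(prompt)
    if confidence >= threshold:
        return draft             # interactive path
    return ask_slow(prompt)      # escalate when the fast model is unsure

print(answer("Summarize the DeepSeek-R1 distillation results."))
```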