
Isn't the largest model still something like 130 GB even after heavy quantization[1], and isn't 4 tok/s borderline unusable for interactive sessions with those long outputs?

[1] https://unsloth.ai/blog/deepseekr1-dynamic
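
For scale, here's roughly what just loading that quant looks like with llama-cpp-python. The shard filename, thread count, and context size are placeholders based on the Unsloth write-up, not anything authoritative:

    from llama_cpp import Llama  # pip install llama-cpp-python

    # First shard of the dynamic 1.58-bit quant (assumed filename; pass shard 1
    # and llama.cpp picks up the remaining shards automatically).
    llm = Llama(
        model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
        n_ctx=8192,      # keep context modest; the KV cache eats RAM fast at this size
        n_gpu_layers=0,  # CPU-only; offload a few layers if you have spare VRAM
        n_threads=16,    # match your physical core count
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Write a quicksort in Python."}],
        max_tokens=2048,
    )
    print(out["choices"][0]["message"]["content"])

At ~4 tok/s, a long reasoning trace means you're waiting minutes per answer, which is the part that makes interactive use painful.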

I told it to skip all reasoning and explanations and output just the code. It complied, saving a lot of time.
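
The instruction is nothing fancy; something like this against a local OpenAI-compatible endpoint (the base_url and model name are placeholders for whatever you're running):

    from openai import OpenAI  # pip install openai

    # Assumes a local OpenAI-compatible server (llama.cpp server, Ollama, etc.).
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    resp = client.chat.completions.create(
        model="deepseek-r1",
        messages=[
            {"role": "system", "content": "Skip all reasoning and explanations. Output only the code."},
            {"role": "user", "content": "Binary search over a sorted list of ints, in Python."},
        ],
    )
    print(resp.choices[0].message.content)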

Wouldn't that also result in it skipping the "thinking" and thus in worse results?

OP probably means "the largest distilled model"

So not an actual DeepSeek-R1 model but a distilled Qwen or Llama model.

From the DeepSeek-R1 paper:

> As shown in Table 5, simply distilling DeepSeek-R1’s outputs enables the efficient DeepSeek-R1-7B (i.e., DeepSeek-R1-Distill-Qwen-7B, abbreviated similarly below) to outperform non-reasoning models like GPT-4o-0513 across the board.

and

> DeepSeek-R1-14B surpasses QwQ-32B-Preview on all evaluation metrics, while DeepSeek-R1-32B and DeepSeek-R1-70B significantly exceed o1-mini on most benchmarks.

and

> These [Distilled Model Evaluation] results demonstrate the strong potential of distillation. Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here.


Yes, but even that can still be run (slowly) on CPU-only systems with as little as about 32 GB of RAM. Memory virtualization is a thing. If you get used to using it like email rather than chat, it’s still super useful even if you are waiting half an hour for your reply. Presumably you have a fast distill on tap for interactive stuff.
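
Concretely, the "email" workflow can be as dumb as a watch-folder loop. Everything here (paths, model file, limits) is an assumed sketch, not a specific tool:

    import pathlib
    import time
    from llama_cpp import Llama

    # use_mmap lets the OS page weights in from disk, which is what makes the
    # "fits in ~32 GB of RAM, slowly" trick work at all.
    llm = Llama(
        model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",  # assumed filename
        n_ctx=4096,
        use_mmap=True,
    )

    inbox = pathlib.Path("inbox")    # drop prompt .txt files here
    outbox = pathlib.Path("outbox")  # replies land here whenever they're done
    inbox.mkdir(exist_ok=True)
    outbox.mkdir(exist_ok=True)

    while True:
        for prompt_file in sorted(inbox.glob("*.txt")):
            reply = llm.create_chat_completion(
                messages=[{"role": "user", "content": prompt_file.read_text()}],
                max_tokens=4096,
            )
            (outbox / prompt_file.name).write_text(reply["choices"][0]["message"]["content"])
            prompt_file.unlink()  # "archive" the request once answered
        time.sleep(60)            # check for new mail every minute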

I run my models in an agentic framework with fast models that can ask slower models or APIs when needed. It works perfectly, 60 percent of the time lol.
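
Not my exact setup, but the routing idea is roughly this. Model names, endpoints, and the escalation rule are all made up for illustration:

    from openai import OpenAI

    fast = OpenAI(base_url="http://localhost:11434/v1", api_key="none")  # small distill, fast
    slow = OpenAI(base_url="http://localhost:8080/v1", api_key="none")   # big quantized R1, slow

    def ask(question: str) -> str:
        draft = fast.chat.completions.create(
            model="deepseek-r1-distill-qwen-7b",
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content

        # Crude escalation heuristic: if the fast model hedges or answers too
        # tersely, hand the question off to the slow model and wait.
        if "not sure" in draft.lower() or len(draft) < 40:
            return slow.chat.completions.create(
                model="deepseek-r1",
                messages=[{"role": "user", "content": question}],
            ).choices[0].message.content
        return draft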
