
Isn't the largest model still something like 130 GB even after heavy quantization[1], and isn't 4 tok/s borderline unusable for interactive sessions with those long outputs?

[1] https://unsloth.ai/blog/deepseekr1-dynamic
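
For scale, here's roughly what just loading that quant looks like with llama-cpp-python. The shard filename, thread count, and context size are placeholders based on the Unsloth write-up, not anything authoritative:

    from llama_cpp import Llama  # pip install llama-cpp-python

    # First shard of the dynamic 1.58-bit quant (assumed filename; pass shard 1
    # and llama.cpp picks up the remaining shards automatically).
    llm = Llama(
        model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
        n_ctx=8192,      # keep context modest; the KV cache eats RAM fast at this size
        n_gpu_layers=0,  # CPU-only; offload a few layers if you have spare VRAM
        n_threads=16,    # match your physical core count
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Write a quicksort in Python."}],
        max_tokens=2048,
    )
    print(out["choices"][0]["message"]["content"])

At ~4 tok/s, a long reasoning trace means you're waiting minutes per answer, which is the part that makes interactive use painful.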

I told it to skip all reasoning and explanations and output just the code. It complied, saving a lot of time.
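
The instruction is nothing fancy; something like this against a local OpenAI-compatible endpoint (the base_url and model name are placeholders for whatever you're running):

    from openai import OpenAI  # pip install openai

    # Assumes a local OpenAI-compatible server (llama.cpp server, Ollama, etc.).
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    resp = client.chat.completions.create(
        model="deepseek-r1",
        messages=[
            {"role": "system", "content": "Skip all reasoning and explanations. Output only the code."},
            {"role": "user", "content": "Binary search over a sorted list of ints, in Python."},
        ],
    )
    print(resp.choices[0].message.content)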

Wouldn't that also result in it skipping the "thinking" and thus in worse results?

OP probably means "the largest distilled model"

So not an actual DeepSeek-R1 model but a distilled Qwen or Llama model.

From the DeepSeek-R1 paper:

> As shown in Table 5, simply distilling DeepSeek-R1’s outputs enables the efficient DeepSeek-R1-7B (i.e., DeepSeek-R1-Distill-Qwen-7B, abbreviated similarly below) to outperform non-reasoning models like GPT-4o-0513 across the board.

and

> DeepSeek-R1-14B surpasses QwQ-32B-Preview on all evaluation metrics, while DeepSeek-R1-32B and DeepSeek-R1-70B significantly exceed o1-mini on most benchmarks.

and

> These [Distilled Model Evaluation] results demonstrate the strong potential of distillation. Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here.


Yes, but even that can still be run (slowly) on CPU-only systems with as little as about 32 GB of RAM. Memory virtualization is a thing. If you get used to using it like email rather than chat, it’s still super useful even if you are waiting half an hour for your reply. Presumably you have a fast distill on tap for interactive stuff.
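
Concretely, the "email" workflow can be as dumb as a watch-folder loop. Everything here (paths, model file, limits) is an assumed sketch, not a specific tool:

    import pathlib
    import time
    from llama_cpp import Llama

    # use_mmap lets the OS page weights in from disk, which is what makes the
    # "fits in ~32 GB of RAM, slowly" trick work at all.
    llm = Llama(
        model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",  # assumed filename
        n_ctx=4096,
        use_mmap=True,
    )

    inbox = pathlib.Path("inbox")    # drop prompt .txt files here
    outbox = pathlib.Path("outbox")  # replies land here whenever they're done
    inbox.mkdir(exist_ok=True)
    outbox.mkdir(exist_ok=True)

    while True:
        for prompt_file in sorted(inbox.glob("*.txt")):
            reply = llm.create_chat_completion(
                messages=[{"role": "user", "content": prompt_file.read_text()}],
                max_tokens=4096,
            )
            (outbox / prompt_file.name).write_text(reply["choices"][0]["message"]["content"])
            prompt_file.unlink()  # "archive" the request once answered
        time.sleep(60)            # check for new mail every minute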

I run my models in an agentic framework with fast models that can ask slower models or APIs when needed. It works perfectly, 60 percent of the time lol.
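
Not my exact setup, but the routing idea is roughly this. Model names, endpoints, and the escalation rule are all made up for illustration:

    from openai import OpenAI

    fast = OpenAI(base_url="http://localhost:11434/v1", api_key="none")  # small distill, fast
    slow = OpenAI(base_url="http://localhost:8080/v1", api_key="none")   # big quantized R1, slow

    def ask(question: str) -> str:
        draft = fast.chat.completions.create(
            model="deepseek-r1-distill-qwen-7b",
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content

        # Crude escalation heuristic: if the fast model hedges or answers too
        # tersely, hand the question off to the slow model and wait.
        if "not sure" in draft.lower() or len(draft) < 40:
            return slow.chat.completions.create(
                model="deepseek-r1",
                messages=[{"role": "user", "content": question}],
            ).choices[0].message.content
        return draft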
