Show HN: KTransformers: 671B DeepSeek-R1 on a Single Machine, 286 tokens/s Prefill (github.com/kvcache-ai)
Hey Hacker News! We are excited to share the new version of KTransformers, a flexible framework designed for cutting-edge LLM inference optimizations! Leveraging state-of-the-art kernels from llamafile and Marlin, KTransformers seamlessly enhances the performance of HuggingFace Transformers, making it possible to run huge 671B MoE models or extremely long 1M-token contexts locally at promising speeds.

KTransformers is a Python-centric framework designed with extensibility at its core. By implementing and injecting an optimized module with a single line of code, users gain access to a Transformers-compatible interface, RESTful APIs compliant with OpenAI and Ollama, and even a simplified ChatGPT-like web UI. This lets you integrate it with familiar frontends, such as the Tabby-backed VS Code Copilot plugin.
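For a sense of what that "single line of code" looks like in practice, here is a minimal sketch adapted from the pattern in the project README. The import path, rule file, and model names below are assumptions for illustration; check the repo for the authoritative version.

```python
# Minimal sketch of the injection flow; module paths, file names, and
# signatures are assumptions based on the README pattern, not gospel.
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from ktransformers.optimize.optimize import optimize_and_load_gguf  # assumed path

model_path = "deepseek-ai/DeepSeek-V2-Lite-Chat"  # hypothetical example model
gguf_path = "./DeepSeek-V2-Lite-Chat-GGUF"        # local quantized GGUF weights
rule_path = "./optimize_rules.yaml"               # hypothetical YAML injection rules

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Build the model skeleton on the meta device, so no full-precision
# weights are ever materialized in RAM...
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# ...then a single call swaps in the optimized modules described by the
# YAML rules and loads the quantized GGUF weights in their place.
optimize_and_load_gguf(model, rule_path, gguf_path, config)
```

The meta-device trick is what keeps the memory footprint small: the original HuggingFace modules are never allocated, only their optimized replacements.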

To demonstrate its capability, we present two showcase demos:

- GPT-4/o1-level Local VS Code Copilot: Runs the huge 671B DeepSeek-Coder-V2 in its Q4_K_M variant using just 24GB VRAM and 382GB DRAM (with two Xeon CPUs) on a local machine, reaching a promising 286 tokens/s for prompt prefill and 14 tokens/s for generation, a 3~28x speedup. The detailed tutorial is [here](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...)

- 1M Context Local Inference: Achieves 15 tokens/s with nearly 100% accuracy on the "Needle in a Haystack" test using the InternLM2.5-7B-Chat-1M model with 24GB VRAM and 150GB DRAM, several times faster than llama.cpp.
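Both demos can be served through the OpenAI-compatible REST API mentioned above, so any standard client can drive them. A hedged sketch, assuming the server is already running locally; the address and model identifier below are placeholders, not documented defaults:

```python
# Query the local OpenAI-compatible endpoint; the host, port, and model
# name are placeholder assumptions; see the docs for the real defaults.
import requests

resp = requests.post(
    "http://localhost:10002/v1/chat/completions",  # assumed local address
    json={
        "model": "DeepSeek-Coder-V2",              # assumed model identifier
        "messages": [
            {"role": "user", "content": "Write a binary search in Python."},
        ],
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI schema, the same request shape works from the official `openai` client or any frontend that speaks it, such as Tabby.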

Check it out on GitHub: https://github.com/kvcache-ai/ktransformers



