After some performance improvements, it is real-time on my DGX Spark with an RTF of 0.416 -- now getting ~19.5 tokens per second. Check it out and see if it's better for you.
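For anyone unfamiliar with the metric, RTF here is the usual speech-model definition: processing time divided by audio duration, so anything under 1.0 is faster than real time. A quick sketch (the timing numbers are made up just to illustrate the arithmetic):

```python
# Real-time factor (RTF) for a speech model:
# RTF = processing_time / audio_duration; RTF < 1.0 means faster than real time.
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    return processing_seconds / audio_seconds

# e.g. generating 10 s of audio in 4.16 s gives the RTF quoted above
print(round(rtf(4.16, 10.0), 3))  # 0.416
```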
This should be fixed now. There were a number of bugs that kept the model from working correctly in different environments. Please let me know if you test again. :)
Hello everyone, thanks for the interest. I merged a number of significant performance improvements that increase speed and accuracy across CUDA, Metal, and WASM as well as improve stability.
Here are the latest benchmarks running on DGX Spark:
I tried the FP8 in vLLM on my Spark, and although it fit in memory, I started swapping as soon as I actually ran any queries, and, yeah, couldn't use a context larger than 8k.
I figured out later that this is because vLLM apparently de-quantizes FP8 to BF16 at runtime, so it's pointless to run the FP8?
I get about 30-35 tok/second using llama.cpp and a 4-bit quant. And a 200+k context, using only 50GB of RAM.
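For context on why a 4-bit quant plus long context fits in that kind of memory, here's a back-of-the-envelope estimate. All the model-shape numbers below (30B params, 48 layers, GQA with 8 KV heads, ~4.5 effective bits/weight for a Q4 GGUF) are hypothetical, just to show the arithmetic:

```python
# Rough memory estimate for quantized weights plus KV cache.
# Every number here is an illustrative assumption, not a measurement.
def model_bytes(params: float, bits_per_weight: float) -> float:
    return params * bits_per_weight / 8

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, bytes_per_elem: int = 2) -> float:
    # 2 tensors (K and V) per layer, per token, stored at 16-bit by default
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem

weights = model_bytes(30e9, 4.5)  # hypothetical 30B model, ~4.5 bits/weight
kv = kv_cache_bytes(layers=48, kv_heads=8, head_dim=128, context=200_000)
print(f"weights ~= {weights/1e9:.0f} GB, KV cache ~= {kv/1e9:.0f} GB")
```

With grouped-query attention (few KV heads) the cache stays manageable even at 200k context, which is why these long contexts are workable at all on unified-memory boxes.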
Yeah, but what did you get for tok/sec there? Memory bandwidth is the limitation with these devices. With 4-bit I didn't get over 35-39 tok/sec, and averaged more like 30 when doing actual tool use with opencode. I can't imagine FP8 being faster.
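The bandwidth ceiling is easy to estimate: during decode, a memory-bound model has to stream its active weights through memory once per generated token, so tok/sec is capped at roughly bandwidth divided by weight bytes. The numbers below are assumptions (273 GB/s is the commonly cited Spark figure; 8 GB of active weights is illustrative):

```python
# Decode-speed ceiling for a memory-bandwidth-bound model:
# each token requires streaming all active weights once, so
# tok/s <= bandwidth / active_weight_bytes (ignoring KV cache reads).
def max_tokens_per_sec(bandwidth_gb_s: float, active_weight_gb: float) -> float:
    return bandwidth_gb_s / active_weight_gb

# ~273 GB/s (reported Spark bandwidth) with ~8 GB of active weights
# caps decode around the mid-30s tok/sec, matching the numbers above
print(round(max_tokens_per_sec(273, 8), 1))  # 34.1
```

This is also why a bigger FP8 footprint can't beat a 4-bit quant on these boxes: more bytes per token means a lower ceiling, regardless of compute.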
I ran a similar experiment last month and ported Qwen 3 Omni to llama.cpp. I was able to get GGUF conversion, quantization, and all input and output modalities working in less than a week. I submitted the work as a PR to the codebase and, understandably, it was rejected.
The refusal on the grounds that AI often writes suboptimal GGML kernels looks very odd to me. It implies that the people who usually write GGML kernels by hand could very easily steer the model into writing excellent kernels, and could even compile a document of instructions for the agents on how to do great work. If they continue this way, a llama.cpp fork will soon emerge that is developed much faster and potentially even better: it is unavoidable.
The refusal is probably because OP said "100% written by AI" and didn't indicate an interest in actually reviewing or maintaining the code. In fact, a later PR comment suggests that the AI's approach was needlessly complicated.
Also because it's a large PR, and because the maintainer has better things to do than spend more time and energy reviewing it than the author spent writing it, only to find that multiple optimizations will be requested which the author may not be able to take on.
The creator of llama.cpp can hardly be suspected of being reluctant about or biased against GenAI.
Absolutely -- it's perfectly understandable. I wanted to be completely upfront about AI usage and while I was willing and did start to break the PR down into parts, it's totally OK for the maintainers to reject that too.
I wanted to see if Claude Code could port the HF / MLX implementation to llama.cpp and it was successful -- in my mind that's wild!
I also learned a ton about GPU programming and how omni models work, and refined my approach to planning large projects with automated end-to-end integration tests.
The PR was mostly to let people know about the code and weights, since there are quite a few comments requesting support:
Using these models daily since GPT-4, I have not once encountered a scenario where the output was both complex enough and a verbatim enough copy to warrant such concerns.
Generally it would seem statistically unlikely to reconstruct a copyrighted work; rather, the output should be a probabilistic average. Snippets are typically too common and too short to be protected by copyright, so copyright challenges are likely to fail the "substantial similarity" test.
I understand plaintiffs would need to show that code is virtually identical, not just similar, and that these parts represent a "substantial" portion of the original work's creative value.