This is very similar to my setup. Pi in a container (I do let it have network ac...

Iolaum · 2026-06-16T07:43:49 1781595829

Haven't used for actual coding but was testing locally - for example running some swebench instances - whether qwen-3.6-35b-a3b@Q8 was better than qwen-3.5-122b-a10b@Q4. With MTP the former runs at around 55t/s and the latter at around 30t/s meaning the latter is also usable. It looked like qwen-3.5-122b-a10b@Q4 performed a bit better.

chakspak · 2026-06-15T19:22:11 1781551331

Hopefully this isn't off-topic, but your setup sounds just like mine, Strix Halo and (I'm assuming) llama.cpp on ROCm, and I'm finding that the Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?

lambda · 2026-06-15T19:35:00 1781552100

I use Vulkan mostly instead of ROCm. Vulkan is actually a bit faster, paradoxically. I do switch out and try them both out, and it's not a huge difference, but I've been mostly saying on Vulkan.

The re-processing context every turn problem is definitely something I've hit. Some of the causes have been solved upstream in llama.cpp; make sure you're up to date.

But another cause of the issue that has a big effect is that older Qwen models didn't support preserving thinking. This means that each time you have a long sequence of tool calls with interleaved thinkging, as soon as you had your next turn in the chat, it would have to re-process all of that as it would drop all of the reasoning.

Qwen 3.6, however, now supports preserving thinking. This can use a bit more context, becasue you're not dropping the thinking every turn, but it re-uses the cache better, not causing you to have to reprocess a whole turn at a time each time.

In my models.ini, I have this for the Qwen3.6 models:

  chat-template-kwargs = {"preserve_thinking": true}

There are still occasional issues I hit where it will have to re-process, but getting up to date and enabling preserve_thinking has helped a ton.

thefroh · 2026-06-16T05:03:03 1781586183

I'm a little surprised that preserve_thinking would matter here for cache purposes. for actual capabilities/intelligence, yes, I'd imagine it helps to have past reasoning traces in multi-turn setups.

but for caching, all you are doing is leaving off a fraction of the most recent assistant message generation, which will have little/no impact on cache hit rate.

stymaar · 2026-06-16T06:26:11 1781591171

> all you are doing is leaving off a fraction of the most recent assistant message generation

True, but not a tiny fraction, qwen is very verbose in its thinking traces. And it basically means that for every (nonthinking) generated token you have to compute the KV twice (once as tg, the second one as pp).

ndom91 · 2026-06-15T19:52:36 1781553156

+1 using llama.cpp Vulkan releases with the Qwen models - runs much better than the ROCm releases.

I'll have to give the preserve_thinking a shot.

jderekw · 2026-06-15T23:12:26 1781565146

Thanks for sharing have been running ROCm primarily with Qwen 3.6 and Qwen Coder, on the runs much better statement is that a stability, performance or other capability your experiencing?

havfo · 2026-06-16T06:59:31 1781593171

I was able to solve this for my setup, 7900XTX and llama.cpp on ROCM in the oh-my-pi fork of pi.dev harness. I documented my setup on github, check under my username/omp-config, but the important thing is making sure the context is strictly append-only, and starting llama.cpp with

  --chat-template-kwargs '{"preserve_thinking":true}'

dnautics · 2026-06-15T21:25:49 1781558749

> Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?

Isn't this the nature of how LLMs work? Or do you mean that it recalculates the entire KV cache instead of saving the old KV cache, in which case the problem is likely in your executor (llama.cpp, vllm, e.g.) configuration or capabilities?

lambda · 2026-06-15T22:31:43 1781562703

So, one of the ways that this problem manifests is that most local models aren't trained on preserving the full reasoning between turns. Every turn, they skip passing the reasoning trace from previous turns to the the LLM. So if on one turn you have a long interleaved chain of reasoning and tool calls, then it responds to you, and then you give a new prompt to fix something, it has to re-process all of those tools calls now with the reasoning stripped out.

Qwen 3.6 has finally been trained both with and without preserving thinking, so you can optionally enable preserving thinking. This will use up a bit more context, but it will avoid having to do this re-processing of long agentic turns, and also the preserved thinking can avoid having to re-do some of the same reasoning over again in later turns.

Besides that, modern LLMs don't only use full attention (apparently, attention is not all you need). Full attention is very expensive to compute and store (0(n^2)). But additionally, full attention is actually bad at certain kinds of reasoning; keeping track of some value that gets replaced over the course of time, for example. So most models these days use various forms of local attention which is fixed length and gets updated as you go; sliding window attention, Mamba-2 state space models, etc.

But one advantage of attention is that you can go back and reprocess by truncating the KV cache and starting over. You can't do that with other forms of local attention; you've lost the state earlier in the sequence.

So to allow you to go back without fully recomputing the cache all over again, your engine will save snapshots of the local attention state at various times, so if you need to go back to recompute the cache, you can start from the last snapshot. However, these snapshots can get large, you can't keep too many of these, so sometimes you need to go back quite far to get to one, or they're all past the point you need to go back to and you need to start over again from the beginning.

There have been particular bugs in llama.cpp that have caused this to be triggered more often than it should; for instance, it wouldn't take snapshots before turns that included images at one point, so if you had an image heavy agentic workflow, that issue plus the lack of preserving thinking would mean you would frequently have to go back and start over from scratch.

Some of these issue have been fixed, some are addressed by preserving thinking. There are still some issues sometimes; for instance, one that's hard to fix is that the tokens generated autoregressively don't always parse the same when doing prefill. For instance, you could generate something as two tokens "pre" and "fill", but it turns out that "prefill" is also a single token so the tokenizer will use that, so when you send that back again on the next turn, it will see a divergence and have to recompute from that point. It might be possible to ignore that and use the not fully greedy tokenization that's in the cache, but I've definitely seen llama.cpp have to do some cache recomputation due to that.

carterschonwald · 2026-06-15T23:37:55 1781566675

thats a harness issue not a model issue. eg i have my own reasoninf harness that forced persisted cot

thefossguy69 · 2026-06-16T06:10:35 1781590235

Would you mind sharing your harness for reasoning?

dnautics · 2026-06-15T23:06:13 1781564773

wait do sota models use mamba-like SSMs? this is the first im hearing this

nl · 2026-06-16T00:22:55 1781569375

Qwen 3.5 and above use Gated DeltaNet which alternate attention and SSM layers:

https://sebastianraschka.com/llms-from-scratch/ch04/08_delta...

LoganDark · 2026-06-15T20:11:52 1781554312

What harness are you using? Some of them (e.g. OpenCode) mutate the system prompt every turn, and therefore can't work with a KV cache.

I've had the best luck with Pi so far, but it comes without some bells and whistles you might be used to (e.g. plan mode, subagents, MCP client support)

verdverm · 2026-06-16T03:43:00 1781581380

There is a bug in llama-cpp for qwen/gemma models, use vLLM instead

pdyc · 2026-06-16T05:06:46 1781586406

what bug and it affects what?

mahadevank · 2026-06-16T04:35:24 1781584524

Thanks a lot for your comment. I was using Qwen3 but asn't aware ofo the A3B Mixture-of-experts model. Works much better, thanks

fjdjshsh · 2026-06-16T00:57:35 1781571455

>I'm still a AI skeptic

What does this mean in June 2026 wrt coding?

To me it sounds like being a "rice cooker skeptic". Some people don't like using rice cookers, some do.

femto113 · 2026-06-16T02:45:09 1781577909

For me the distinction is that your rice only needs to be edible once, while your code may need to last for decades. Using AI to code anything I could comfortably throw away if needed is a lot less fraught than letting it make choices that I and anybody who inherits the code is gonna have to live with, especially if by outsourcing those choices I reduce my understanding of the implications of those choices.

luipugs · 2026-06-16T06:30:06 1781591406

Don't you read through all the output of the agent before committing them?

secult · 2026-06-16T06:53:21 1781592801

That's not the way how human brain works.

HWR_14 · 2026-06-16T02:57:41 1781578661

I assume it means they are not sure it gives them a speed up. Which, since I don't know what they are trying to do, may be reasonable.