No? It's not because it's a cache, it's because they're scared of letting you see the thinking trace. If you got the trace you could just send it back in full when it got evicted from the cache. This is how open weight models work.
The issue is that if they send the full trace back, it will have to be processed from the start if the cache has expired, and doing that will cause a huge one-time hit against your token limit if the session has grown large.
So what Boris talked about is stripping things out of the trace that gets sent back to regenerate the session when the cache expires. That would help avoid burning through the token limit, but it is technically a different conversation, so if CC chooses poorly about which parts of the context to strip, Claude ends up all scatter-brained.
They literally can. They could make the API free to use if they wanted. There is no law that says prices have to equal the cost of processing the request.
I’m not familiar with the Claude API, but OpenAI has an encrypted thinking messages option. You get something that you can send back, but it is encrypted. Not available on Anthropic?
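For reference, the OpenAI round trip looks roughly like this. A sketch, not a tested client call: `store=False` and `include=["reasoning.encrypted_content"]` are the documented Responses API knobs, while the `carry_forward` helper and the model/prompt values are my own placeholders.

```python
# Sketch of OpenAI's encrypted-reasoning round trip (Responses API).
# The server hands back opaque encrypted reasoning items; a stateless
# client just echoes them into the next request's input.

def carry_forward(items):
    """Keep what a stateless client must resend next turn:
    assistant messages plus the opaque encrypted reasoning items."""
    return [it for it in items if it.get("type") in ("message", "reasoning")]

# With the real client it looks roughly like (not run here):
# from openai import OpenAI
# client = OpenAI()
# resp = client.responses.create(
#     model="o4-mini",
#     input=[{"role": "user", "content": "hello"}],
#     store=False,  # stateless: the server keeps nothing
#     include=["reasoning.encrypted_content"],
# )
# next_input = carry_forward([it.to_dict() for it in resp.output])
```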
Looking awful has not prevented local dimming from becoming quite common on laptops. Apple has been doing an okay job of it in the MacBook Pro for several years. Lots of Windows laptops have been very hit-or-miss about it, but at least with those you often have an OLED option. I've seen multiple Windows laptops from more than one OEM where opening a terminal window with light text on a dark background means you can easily spot a single line of text getting much dimmer toward the center of the dark window, and lighter near the perimeter where it's close to other light content. And that's for static content; as you mentioned, motion can bring more problems as the backlight lags behind the LCD.
In my test case (a feature all models got stuck on a few months ago) it just gets stuck in a thinking loop and never gets anywhere. Not a super amazing test, but it happened a few times in a row, so...
What Strix Halo system has unified memory? A quick Google says it's just a static VRAM allocation in RAM, not that the CPU and GPU can actively share memory at runtime.
You can get tablets, laptops, and desktops. I think Windows is more limited and might require static allocation of video memory; not because it's a separate pool, just because Windows isn't as flexible.
With Linux you can just select the lowest number in the BIOS (usually 256 or 512 MB) and then let Linux balance the needs of the CPU/GPU. So you could easily run a model that requires 96 GB or more.
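On Linux the balancing happens through amdgpu's GTT pool, which people typically enlarge via module options. A sketch for a 128 GB box: the parameter names are the amdgpu/ttm kernel module options, but the sizes are my example values and the exact knobs vary by kernel version, so treat this as a starting point.

```
# /etc/modprobe.d/amdgpu-llm.conf  (file name is arbitrary)
options amdgpu gttsize=110000      # GTT the GPU may borrow from system RAM, in MiB
options ttm pages_limit=28835840   # TTM page cap: 110 GiB / 4 KiB pages
```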
Not OP, but I ran 122b successfully with normal RAM offloading. You don't need all that much VRAM, which is super expensive. I used 96 GB RAM + a 16 GB VRAM GPU. But it's not very fast in that setup, maybe 15 tokens per second. Still, you can give it a task and come back later and it's done. (Disclaimer: I built that PC before stuff got expensive.)
128GB on a mac with unified memory. The model itself takes something like 110 of that and then I have ~16 left over to hold a reasonably sized context and 2 for the OS.
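The budget above in rough numbers. A toy helper, nothing official; the figures are copied from the comment:

```python
# Back-of-envelope check that a ~110 GB model fits in 128 GB unified memory
# once the OS and the KV cache/context are accounted for (all sizes approximate).

def fits(total_gb, model_gb, ctx_gb, os_gb=2):
    headroom = total_gb - model_gb - os_gb
    return headroom >= ctx_gb, headroom

ok, headroom = fits(total_gb=128, model_gb=110, ctx_gb=14)
# 128 - 110 - 2 leaves 16 GB of headroom, enough for a modest context
```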
I do have a dedicated machine for it though because I can't run an IDE at the same time as that model.
Qwen3.6 and Gemma4 have the same issue: they never get to the point and just get stuck in never-ending, repeating thought loops. Qwen3.5 is still the best local model that works.
I think the hype around Qwen and even Gemma4, often amplified for views/attention, glosses over the fact that these models have clear gaps compared to what closed models offer.
In short, it has its uses, but it would/should not be the main driver. Will it get better? I'm sure of it. But there is too much hype and exaggeration around open source models; for one, the hardware simply isn't available at a price point where we can run something that seriously competes with today's closed models.
If we got something like GPT-5.4-xhigh that can run on some local hardware under 5k, that would be a major milestone.
I'd say all these "if we got $CURRENT_MODEL running on local hardware" claims are goalpost-moving BS.
What's gonna happen when that happens? They're gonna cry that they need GPT-$CURRENT capabilities locally.
Now we have local models that are way better than GPT-2 (careful, this one is way too dangerous for release!) and GPT-3.5, in some ways better than 4, and they can run on reasonably modest hardware.
I had issues with Qwen thinking endlessly when I didn't realize I wasn't using the temp/top_k/min_p/etc. settings specified in the readme. I've never had an issue with Gemma 4 thinking endlessly, but it could possibly be the same thing.
I use Ollama and kinda just assumed that Ollama would have everything except context length (which I've explicitly overridden) set up properly for me.