
No? It's not because it's a cache, it's because they're scared of letting you see the thinking trace. If you got the trace you could just send it back in full when it got evicted from the cache. This is how open weight models work.

The trace goes back fine, that's not the issue.

The issue is that if they send the full trace back, it will have to be processed from the start if the cache expired, and doing that will cause a huge one-time hit against your token limit if the session has grown large.
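To make that "one-time hit" concrete, here's a toy Python sketch (the session size and per-turn numbers are made-up assumptions, not Anthropic's actual accounting):

  # Input tokens counted against your limit for a single turn.
  # All numbers here are illustrative assumptions.
  SESSION_TOKENS = 150_000      # a grown Claude Code session
  NEW_TOKENS_PER_TURN = 2_000   # the new message plus tool output

  def tokens_charged(cache_hit: bool) -> int:
      if cache_hit:
          return NEW_TOKENS_PER_TURN               # only the new tail is processed
      return SESSION_TOKENS + NEW_TOKENS_PER_TURN  # whole trace reprocessed

  print(tokens_charged(True))   # 2000
  print(tokens_charged(False))  # 152000 -- the huge one-time hit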

So what Boris talked about is stripping things out of the trace that goes back to regenerate the session if the cache expires. Doing this would help avert burning up the token limit, but it is technically a different conversation, so if CC chooses poorly about which parts of the context to strip, it could lead to Claude getting all scatter-brained.
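A minimal sketch of that kind of pruning, assuming a hypothetical message shape (this is not Claude Code's actual logic):

  # Rebuild a session for replay after cache expiry, dropping old
  # thinking traces but keeping the most recent turns intact.
  def prune_for_replay(messages, keep_recent=5):
      pruned = []
      cutoff = len(messages) - keep_recent
      for i, msg in enumerate(messages):
          msg = dict(msg)
          if i < cutoff:
              msg.pop("thinking", None)  # choose poorly here -> scatter-brained
          pruned.append(msg)
      return pruned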


>and doing that will cause a huge one-time hit against your token limit if the session has grown large.

Anthropic already profited from generating those tokens. They can afford to subsidize reloading the context.


No they can't, that's what you don't seem to get.

Reloading those tokens takes around the same effort as processing them in the first place.

It's ok to be ignorant of how the infrastructure for LLMs works, just don't be proud of it.


They literally can. They could make the API free to use if they wanted. There is no law that says the price has to equal the cost of processing the request.

I'm not familiar with the Claude API, but OpenAI has an encrypted thinking messages option. You get something that you can send back, but it is encrypted. Not available on Anthropic?
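For reference, a sketch of the OpenAI round-trip, based on the Responses API as I understand it (check the current docs, parameter names may have changed):

  from openai import OpenAI

  client = OpenAI()
  resp = client.responses.create(
      model="o4-mini",  # example reasoning model
      input="Why is the sky blue?",
      store=False,  # stateless: nothing kept server-side
      include=["reasoning.encrypted_content"],  # opaque blob instead of plaintext
  )
  # resp.output contains reasoning items whose content is encrypted; pass
  # resp.output back as input on the next turn to preserve the trace.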

They are sending it back to the cache; the part you're missing is that they were charging you for it.

The blog post says they now prune them so as not to charge you. That's the change they implemented.

Right. They were charging you for it; now they aren't, because they're just dropping your conversation history.

That's not British, that's just old people


No, I'm claiming your source is outdated. It has become an old people thing now

Not laptops. Local dimming zones look awful when you have a white cursor moving around, so it's mostly still just a TV feature

Looking awful has not prevented local dimming from becoming quite common on laptops. Apple has been doing an okay job of it in the MacBook Pro for several years. Lots of Windows laptops have been very hit-or-miss about it, but at least with those you often have an OLED option. I've seen multiple Windows laptops from more than one OEM where opening a terminal window with light text on a dark background means you can easily spot a single line of text getting much dimmer toward the center of the dark window, and lighter near the perimeter where it's close to other light content. And that's for static content; as you mentioned, motion can bring more problems as the backlight lags behind the LCD.

GPT started off open? They just closed before anyone else even joined the space

Maths is like physics

No, but it has a lot of very intentional limitations

In my test case (a feature all models got stuck on a few months ago) it just gets stuck in a thinking loop and never gets anywhere. Not a super amazing test, but it happened a few times in a row, so...

What Strix Halo system has unified memory? A quick Google says it's just a static VRAM allocation in RAM, not that the CPU and GPU can actively share memory at runtime

All of them. Keep in mind Strix != Strix Halo.

You can get tablets, laptops, and desktops. I think Windows is more limited and might require static allocation of video memory, not because it's a separate pool, just because Windows isn't as flexible.

With Linux you can just select the lowest number in the BIOS (usually 256 or 512MB), then let Linux balance the needs of the CPU and GPU. So you could easily run a model that requires 96GB or more.
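You can sanity-check the split on Linux via amdgpu's sysfs counters (a sketch; the card index and exact paths vary by system):

  # Print the static VRAM carve-out vs. the GTT pool shared with the CPU.
  def read_mib(path: str) -> int:
      with open(path) as f:
          return int(f.read()) // (1024 * 1024)

  base = "/sys/class/drm/card0/device/"
  print("VRAM (static carve-out):", read_mib(base + "mem_info_vram_total"), "MiB")
  print("GTT (CPU/GPU shared):   ", read_mib(base + "mem_info_gtt_total"), "MiB")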


> What Strix Halo system has unified memory?

All of them. The static VRAM allocation is tiny (512MB); most of the memory is unified


How much VRAM do you need for that?

Not OP, but I ran 122B successfully with normal RAM offloading. You don't need all that much VRAM, which is super expensive. I used 96GB RAM + a GPU with 16GB VRAM. But it's not very fast in that setup, maybe 15 tokens per second. Still, you can give it a task and come back later and it's done. (Disclaimer: I built that PC before stuff got expensive.)
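For sizing, the usual back-of-envelope is parameters times bits per weight divided by 8 (a sketch only; real quantized files carry some overhead, and the KV cache comes on top):

  # Rough weight-memory estimate for a quantized model.
  def model_gb(params_b: float, bits_per_weight: float) -> float:
      return params_b * bits_per_weight / 8

  print(f"Q4-ish: {model_gb(122, 4.5):.0f} GB")  # ~69 GB, fits in 96GB + 16GB
  print(f"Q6-ish: {model_gb(122, 6.5):.0f} GB")  # ~99 GB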

128GB on a Mac with unified memory. The model itself takes something like 110 of that, and then I have ~16 left over to hold a reasonably sized context and 2 for the OS.

I do have a dedicated machine for it though because I can't run an IDE at the same time as that model.


I squeeze Qwen3.5-122B-A10B at Q6 into 128GB. It's a great model.

Wow, what kind of hardware do you have? Mac Studio, DGX Spark, Strix Halo? How fast is it?

Strix Halo, I'm seeing performance in line with these results[0].

I'm interested to investigate the claimed gains from the lemonade-sdk port of Apple MLX inference[1].

[0] https://kyuz0.github.io/amd-strix-halo-toolboxes/

[1] https://github.com/lemonade-sdk/lemonade/issues/1642


Qwen3.6 and Gemma4 have the same issue of never getting to the point and just getting stuck in never-ending, repeating thought loops. Qwen3.5 is still the best local model that works.

I think the hype around Qwen and even Gemma4, often floated for views/attention, glosses over the fact that these models have clear gaps behind what closed models offer.

In short, it has its uses, but it would/should not be the main driver. Will it get better? I'm sure of it, but there is too much hype and exaggeration around open source models; for one, the hardware available at a reasonable price point simply isn't enough to run something that can seriously compete with today's closed models.

If we got something like GPT-5.4-xhigh that could run on local hardware under $5k, that would be a major milestone.


I say "if we got $CURRENT_MODEL that can run under local hardware" claims are postproning BS.

What is gonna happen when that happens? They are gonna cry they need GPT-$CURRENT capabilities locally.

Now we have local models that are way better that GPT-2 (careful, this one is way too dangerous for release!) GPT3.5, in some ways better that 4, and can run on reasonably modest hardware.


Give it 6 months

Quantization can introduce these issues, and Gemma 4 also had issues because the prompt tokens that Gemma used were new and not well supported yet.

I had issues with Qwen thinking endlessly when I didn't know I wasn't using the temp/top_k/min_p/etc. settings specified in the readme. I've never had an issue with Gemma 4 thinking endlessly, but it could possibly be the same.

I use Ollama and kinda just assumed that Ollama would have everything except for context length (which I've explicitly overridden) set up properly for me.
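If you want to be explicit rather than trusting defaults, you can pass the sampler settings per request (a sketch against Ollama's HTTP API; the model tag and values are placeholders, use whatever your model's README specifies):

  import requests

  resp = requests.post("http://localhost:11434/api/generate", json={
      "model": "qwen3",        # placeholder tag
      "prompt": "Hello",
      "stream": False,
      "options": {
          "num_ctx": 32768,    # the context-length override mentioned above
          "temperature": 0.6,  # values from the model card, not Ollama defaults
          "top_k": 20,
          "min_p": 0.0,
      },
  })
  print(resp.json()["response"])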
