You mentioned "on local agents". I've noticed this too. How do ChatGPT and the others get around this and provide instant responses in long conversations?

Not getting around it, just benefiting from the parallel compute and huge FLOP throughput of GPUs. Fundamentally, prefill compute is itself highly parallel, and HBM is simply that much faster than LPDDR. In practice, H100s and B100s can chew through prefill on a ~50k-token prompt in under a second, so the TTFT (time to first token) can feel amazingly fast.
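As a rough back-of-envelope sketch in Python (every number below is an assumption for illustration, not a measured figure): prefill cost is roughly 2 * params * prompt_tokens FLOPs for the forward pass, divided by the compute you can actually deliver.

    # Rough prefill TTFT estimate. All numbers are illustrative assumptions.
    def prefill_seconds(params: float, prompt_tokens: int,
                        flops_per_gpu: float, num_gpus: int = 1,
                        mfu: float = 0.4) -> float:
        """Forward-pass FLOPs ~= 2 * params * tokens, over delivered compute."""
        flops_needed = 2 * params * prompt_tokens
        return flops_needed / (flops_per_gpu * num_gpus * mfu)

    # e.g. a 70B dense model, 50k-token prompt, 8 H100s at ~1e15 dense BF16
    # FLOP/s each and 40% utilization:
    print(prefill_seconds(70e9, 50_000, 1e15, num_gpus=8))  # ~2.2 s
    # With fewer active params (e.g. MoE sparsity) or more GPUs, this
    # drops well under a second.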

They cache the intermediate attention state (the KV cache), so the earlier part of the conversation doesn't have to be re-processed on every new message.
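A minimal sketch of that idea with Hugging Face transformers (gpt2 here is just a stand-in model): the past_key_values returned by one forward pass are fed back in, so only the new tokens run through the model the next time.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    with torch.no_grad():
        # First request: full prefill over the whole conversation so far.
        ids = tok("The conversation so far...", return_tensors="pt").input_ids
        out = model(ids, use_cache=True)
        past = out.past_key_values  # the KV cache

        # Follow-up request: only the new tokens are computed; attention
        # reads the cached keys/values for everything earlier.
        new_ids = tok(" And the new user message.", return_tensors="pt").input_ids
        out = model(new_ids, past_key_values=past, use_cache=True)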


