Hi HN,
I’ve been frustrated for a while with how LLM inference works in the cloud today. Every API call starts from scratch: you resend your entire prompt + conversation history, and you’re charged for every input token, even if the model has already “seen” that context before.
This leads to two big problems:
1. Performance & cost – constantly resending input tokens is wasteful.
2. Quality loss – because the state is rebuilt on a new GPU each time, the model loses a lot of internal context beyond just your text.
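To make the cost problem concrete, here's a toy accounting sketch of the stateless pattern (the whitespace "tokenizer" and the message contents are made up for illustration, not any real API):

```python
# Illustrative sketch, not a real API client: with a stateless chat API,
# every turn resends the entire history, so billed input tokens grow
# roughly quadratically with conversation length.

def tokens(text):
    # crude stand-in for a real tokenizer: one token per whitespace word
    return len(text.split())

def stateless_conversation(turns):
    history = []
    total_input_tokens = 0
    for user_msg, assistant_msg in turns:
        history.append(("user", user_msg))
        # the whole history is resent as input on every call
        total_input_tokens += sum(tokens(t) for _, t in history)
        history.append(("assistant", assistant_msg))
    return total_input_tokens

turns = [
    ("what is a KV cache", "it stores attention keys/values"),
    ("why does it help", "it avoids recomputing the prefix"),
    ("any downsides", "memory use grows with context length"),
]
print(stateless_conversation(turns))  # every earlier turn is billed again
```

Three short turns already bill the first message three times; a long session multiplies this further.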
Most “optimizations” offered in the industry are really just prompt-caching. That’s useful for cutting repeated input costs, but we’ve all seen the side-effects: outputs that don’t match subtle variations in the prompt, or the model confidently “jumping” to the wrong cached response because it thought your query was a near-duplicate.
We’re taking a different approach with ark-labs.cloud:
True stateful inference – when you start a session, all requests are processed on the same set of GPUs, and the full internal state of the model (prompt, history, reasoning traces) is preserved between calls.
Zero input token cost – because the model doesn’t need you to resend your input on each request. You pay only for generated output.
Better responses, not just cheaper ones – maintaining the internal state can improve consistency and reasoning quality, not just save money.
From a developer perspective, it’s simple: enable cookies, and the API will keep a session alive (ark_session_id). No SDK magic, no hacks. Sessions do expire after inactivity to free resources, but while they’re active, you’re talking to a model that actually remembers internally, not just through string concatenation of prompts.
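To show what that flow looks like mechanically, here's a self-contained sketch with a fake in-process server: the cookie name `ark_session_id` comes from this post, but the server behavior is my assumption for illustration, not Ark's actual implementation.

```python
# Hedged sketch of cookie-based session affinity as described in the post.
# Only the cookie name (ark_session_id) is taken from the post; the rest
# is a stand-in, not the real Ark API.

class FakeArkServer:
    """Stands in for the service: the first request mints a session id,
    later requests carrying the same cookie reuse the same server-side state."""
    def __init__(self):
        self.sessions = {}
        self.next_id = 0

    def handle(self, body, cookies):
        sid = cookies.get("ark_session_id")
        if sid is None or sid not in self.sessions:
            sid = f"sess-{self.next_id}"   # new or expired session: start fresh
            self.next_id += 1
            self.sessions[sid] = []        # per-session internal state
        self.sessions[sid].append(body)    # state accumulates server-side
        return sid, {"output": f"reply #{len(self.sessions[sid])}"}

class Client:
    """Like an HTTP client with cookies enabled: it stores the session
    cookie from the first response and sends it on every later call."""
    def __init__(self, server):
        self.server = server
        self.cookies = {}

    def post(self, body):
        sid, resp = self.server.handle(body, self.cookies)
        self.cookies["ark_session_id"] = sid  # keep the session cookie
        return resp

server = FakeArkServer()
client = Client(server)
client.post("first message")      # only the new input is sent...
client.post("follow-up")          # ...the server state carries over
```

The point of the sketch: the client never resends "first message", yet the server's per-session list still contains it.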
Docs: https://ark-labs.cloud/documentation/
We’d love your thoughts — especially from those who’ve wrestled with the “why am I paying 10x for tokens I already sent” problem, or who’ve hit caching systems that mismatched prompts to outputs.
Does this approach make sense to you?
Perhaps you misspoke or misquoted some internal copy, but "caching" in KV caching doesn't mean what you imply it means here. The model doesn't "jump" to anything because of KV caching.
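For context, prefix/KV caches in mainstream inference engines are keyed on exact token prefixes, not fuzzy similarity. A rough sketch (heavily simplified; real engines cache at block granularity):

```python
# Simplified model of prefix (KV) cache lookup: keys are exact token
# prefixes. A near-duplicate prompt just misses the cache and gets
# recomputed; nothing is matched approximately.

def longest_cached_prefix(cache, tokens):
    """Return how many leading tokens are already cached (i.e. how much
    prefill could be skipped). `cache` is a set of exact token prefixes."""
    hit = 0
    for i in range(1, len(tokens) + 1):
        if tuple(tokens[:i]) in cache:
            hit = i
        else:
            break
    return hit

cache = set()
prompt = ["you", "are", "a", "helpful", "assistant"]
for i in range(1, len(prompt) + 1):
    cache.add(tuple(prompt[:i]))   # prefill populates the prefix cache

full = longest_cached_prefix(cache, prompt)                       # exact match
partial = longest_cached_prefix(cache, ["you", "are", "an", "expert"])
```

The near-duplicate only reuses the two tokens it shares exactly; the rest is recomputed, so there's no mechanism for it to "jump" to a wrong cached response.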
> From a developer perspective, it’s simple: enable cookies, and the API will keep a session alive
How is this related to LLM inference?! What are cookies doing there? What?
(from your docs) > OpenAI optimizes by processing every single request on randomly selected GPUs - but in the process most of the state is lost because only the final assistant reply is kept. Ark allows users to have a session during which all requests are processed on the same set of GPUs and the full internal state is maintained between requests. Depending on use case, this approach can improve both model's response quality and performance.
Yeah, except no. Every model builder so far has emphasised that this is not how you want to do it. With "thinking" models, you want to NOT include the thinking steps from earlier messages, since that degrades the model's outputs.
----
If you want to convince people about a better way of doing things, when the entire industry is doing another thing, you have to come up with data supporting your stance. Can you show such data? Do you have qualitative studies / benchmarks on your methods? Can you show that whatever state you hold is actually helping? That would go against the current practices of every inference engine out there currently, so it would be quite a thing to show.