Hi HN,
I’ve been frustrated for a while with how LLM inference works in the cloud today. Every API call starts from scratch: you resend your entire prompt + conversation history, and you’re charged for every input token, even if the model has already “seen” that context before.
This leads to two big problems:
1. Performance & cost – constantly resending input tokens is wasteful.
2. Quality loss – because the state is rebuilt on a new GPU each time, the model loses a lot of internal context beyond just your text.
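To make the cost problem concrete, here's a toy accounting sketch of the stateless pattern (the whitespace "tokenizer" and the message contents are made up for illustration, not any real API):

```python
# Illustrative sketch, not a real API client: with a stateless chat API,
# every turn resends the entire history, so billed input tokens grow
# roughly quadratically with conversation length.

def tokens(text):
    # crude stand-in for a real tokenizer: one token per whitespace word
    return len(text.split())

def stateless_conversation(turns):
    history = []
    total_input_tokens = 0
    for user_msg, assistant_msg in turns:
        history.append(("user", user_msg))
        # the whole history is resent as input on every call
        total_input_tokens += sum(tokens(t) for _, t in history)
        history.append(("assistant", assistant_msg))
    return total_input_tokens

turns = [
    ("what is a KV cache", "it stores attention keys/values"),
    ("why does it help", "it avoids recomputing the prefix"),
    ("any downsides", "memory use grows with context length"),
]
print(stateless_conversation(turns))  # every earlier turn is billed again
```

Three short turns already bill the first message three times; a long session multiplies this further.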
Most “optimizations” offered in the industry are really just prompt-caching. That’s useful for cutting repeated input costs, but we’ve all seen the side-effects: outputs that don’t match subtle variations in the prompt, or the model confidently “jumping” to the wrong cached response because it thought your query was a near-duplicate.
We’re taking a different approach with ark-labs.cloud:
True stateful inference – when you start a session, all requests are processed on the same set of GPUs, and the full internal state of the model (prompt, history, reasoning traces) is preserved between calls.
Zero input token cost – because the model doesn’t need you to resend your input on each request. You pay only for generated output.
Better responses, not just cheaper ones – maintaining the internal state can improve consistency and reasoning quality, not just save money.
From a developer perspective, it’s simple: enable cookies, and the API will keep a session alive (ark_session_id). No SDK magic, no hacks. Sessions do expire after inactivity to free resources, but while they’re active, you’re talking to a model that actually remembers internally, not just through string concatenation of prompts.
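To show what that flow looks like mechanically, here's a self-contained sketch with a fake in-process server: the cookie name `ark_session_id` comes from this post, but the server behavior is my assumption for illustration, not Ark's actual implementation.

```python
# Hedged sketch of cookie-based session affinity as described in the post.
# Only the cookie name (ark_session_id) is taken from the post; the rest
# is a stand-in, not the real Ark API.

class FakeArkServer:
    """Stands in for the service: the first request mints a session id,
    later requests carrying the same cookie reuse the same server-side state."""
    def __init__(self):
        self.sessions = {}
        self.next_id = 0

    def handle(self, body, cookies):
        sid = cookies.get("ark_session_id")
        if sid is None or sid not in self.sessions:
            sid = f"sess-{self.next_id}"   # new or expired session: start fresh
            self.next_id += 1
            self.sessions[sid] = []        # per-session internal state
        self.sessions[sid].append(body)    # state accumulates server-side
        return sid, {"output": f"reply #{len(self.sessions[sid])}"}

class Client:
    """Like an HTTP client with cookies enabled: it stores the session
    cookie from the first response and sends it on every later call."""
    def __init__(self, server):
        self.server = server
        self.cookies = {}

    def post(self, body):
        sid, resp = self.server.handle(body, self.cookies)
        self.cookies["ark_session_id"] = sid  # keep the session cookie
        return resp

server = FakeArkServer()
client = Client(server)
client.post("first message")      # only the new input is sent...
client.post("follow-up")          # ...the server state carries over
```

The point of the sketch: the client never resends "first message", yet the server's per-session list still contains it.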
Docs: https://ark-labs.cloud/documentation/
We’d love your thoughts — especially from those who’ve wrestled with the “why am I paying 10x for tokens I already sent” problem, or who’ve hit caching systems that mismatched prompts to outputs.
Does this approach make sense to you?
Perhaps you misspoke or misquoted some internal copy, but "caching" in KV caching doesn't mean what you imply it means here. The model doesn't "jump" to anything because of KV caching.
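For context, prefix/KV caches in mainstream inference engines are keyed on exact token prefixes, not fuzzy similarity. A rough sketch (heavily simplified; real engines cache at block granularity):

```python
# Simplified model of prefix (KV) cache lookup: keys are exact token
# prefixes. A near-duplicate prompt just misses the cache and gets
# recomputed; nothing is matched approximately.

def longest_cached_prefix(cache, tokens):
    """Return how many leading tokens are already cached (i.e. how much
    prefill could be skipped). `cache` is a set of exact token prefixes."""
    hit = 0
    for i in range(1, len(tokens) + 1):
        if tuple(tokens[:i]) in cache:
            hit = i
        else:
            break
    return hit

cache = set()
prompt = ["you", "are", "a", "helpful", "assistant"]
for i in range(1, len(prompt) + 1):
    cache.add(tuple(prompt[:i]))   # prefill populates the prefix cache

full = longest_cached_prefix(cache, prompt)                       # exact match
partial = longest_cached_prefix(cache, ["you", "are", "an", "expert"])
```

The near-duplicate only reuses the two tokens it shares exactly; the rest is recomputed, so there's no mechanism for it to "jump" to a wrong cached response.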
> From a developer perspective, it’s simple: enable cookies, and the API will keep a session alive
How is this related to LLM inference?! What are cookies doing there? What?
(from your docs) > OpenAI optimizes by processing every single request on randomly selected GPUs - but in the process most of the state is lost because only the final assistant reply is kept. Ark allows users to have a session during which all requests are processed on the same set of GPUs and the full internal state is maintained between requests. Depending on use case, this approach can improve both model's response quality and performance.
Yeah, except no. Every model builder so far has emphasised that this is not how you want to do it. With "thinking" models, you want to NOT include the thinking steps from earlier messages, since that degrades the model's outputs.
----
If you want to convince people about a better way of doing things, when the entire industry is doing another thing, you have to come up with data supporting your stance. Can you show such data? Do you have qualitative studies / benchmarks on your methods? Can you show that whatever state you hold is actually helping? That would go against the current practices of every inference engine out there currently, so it would be quite a thing to show.