The funny thing is that even in this area Anthropic is behind other 3 labs (Google, OpenAI, xAI). It's the only one out of those 4 that requires you to manually set cache breakpoints, and the initial cache costs 25% more than usual context. The other 3 have fully free implicit caching. Although Google also offers paid, explicit caching.
I don't understand why we're paying for caching at all (except: model providers can charge for it). It's almost extortion - the provider stores some data for 5min on some disk, and gets to sell their highly limited GPU resources to someone else instead (because you are using the kv cache instead of GPU capacity for a good chunk of your input tokens).
They charge you 10% of their GPU-level prices for effectively _not_ using their GPU at all for the tokens that hit the cache.
If I'm missing something about how inference works that explains why there is still a cost for cached tokens, please let me know!
Deepseek pioneered automatic prefix caching and caches on SSD. SSD reads are so fast compared to LLM inference that I can't think of a reason to waste ram on it.
But that doesn't make sense? Why would they keep the cache persistent in the VRAM of the GPU nodes, which are needed for model weights? Shouldn't they be able to swap in/out the kvcache of your prompt when you actually use it?
Your intuition is correct and the sibling comments are wrong. Modern LLM inference servers support hierarchical caches (where data moves to slower storage tiers), often with pluggable backends. A popular open-source backend for the "slow" tier is Mooncake: https://github.com/kvcache-ai/Mooncake
OK that's pretty fascinating, turns out Mooncake includes a trick that can populate GPU VRAM directly from NVMe SSD without it having to go through the host's regular CPU and RAM first!
> Transfer Engine also leverages the NVMeof protocol to support direct data transfer from files on NVMe to DRAM/VRAM via PCIe, without going through the CPU and achieving zero-copy.
I vastly prefer the manual caching. There are several aspects of automatic caching that are suboptimal, with only moderately less developer burden. I don’t use Anthropic much but I wish the others had manual cache options
Lots of situations, here are 2 I’ve faced recently (cannot give too much detail for privacy reasons, but should be clear enough)
1) low latency desired, long user prompt
2) function runs many parallel requests, but is not fired with common prefix very often. OpenAI was very inconsistent about properly caching the prefix for use across all requests, but with Anthropic it’s very easy to pre-fire
Is it wherever the tokens are, or is it the N first tokens they've seen before? Ie if my prompt is 99% the same, except for the first token, will it be cached?
The prefix has to be stable. If you are 99% the same but the first token is different it won't cache at all. You end up having to design your prompts to accommodate this.
which is important to bear in mind if people are introducing a "drop earliest messages" sliding window for context management in a "chat-like" experience.
once you're at that context limit and start dropping the earliest messages, you're guaranteeing every message afterwards will be a cache miss.
a simple alternative approach is to introduce hysteresis by having both a high and low context limit. if you hit the higher limit, trim to the lower. this batches together the cache misses.
if users are able to edit, remove or re-generate earlier messages, you can further improve on that by keeping track of cache prefixes and their TTLs, so rather than blindly trimming to the lower limit, you instead trim to the longest active cache prefix. only if there are none, do you trim to the lower limit.
I thought OpenAI would still handle case? Their cache would work up to the end of the file and you would then pay for uncached tokens for the user's question. Have I misunderstood how their caching works?
https://docs.claude.com/en/docs/build-with-claude/prompt-cac...
https://ai.google.dev/gemini-api/docs/caching
https://platform.openai.com/docs/guides/prompt-caching
https://docs.x.ai/docs/models#cached-prompt-tokens