Heads up: there are two related Show HN threads; we’re consolidating discussion + intake on item 46882022 (reliability sprint packet + service CTA). If you’re following this, please jump there so comments don’t fragment: https://news.ycombinator.com/item?id=46882022
Problem we kept hitting: long-running agent runs fail in ways that are hard to debug because a restart changes the trajectory (new RNG, new tool timing, different context, etc.).
Continuity Capsule is our current approach: capture enough state at the right boundaries (inputs/outputs, tool results, decisions) so a restart can replay deterministically and you can actually reproduce the failure.
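To make the record/replay idea concrete, here's a minimal sketch of the pattern (this is an illustration of the technique, not Continuity Capsule's actual implementation; the `Capsule` class and its API are made up):

```python
import random

class Capsule:
    """Hypothetical record/replay capsule sketch. In 'record' mode it logs
    each tool result; in 'replay' mode it serves the logged results back in
    order, so a restart follows the exact same trajectory."""
    def __init__(self, mode="record", seed=42):
        self.mode, self.log = mode, []
        random.seed(seed)            # pin the RNG so sampling decisions repeat

    def tool_call(self, name, args, run_tool):
        if self.mode == "replay":
            entry = self.log.pop(0)  # serve the recorded result, skip the tool
            assert entry["name"] == name, "trajectory diverged from capsule"
            return entry["result"]
        result = run_tool(name, args)
        self.log.append({"name": name, "args": args, "result": result})
        return result

# Record a run, then replay it without re-executing the (flaky) tool:
cap = Capsule("record")
out1 = cap.tool_call("search", {"q": "status"}, lambda n, a: {"hits": 3})

replay = Capsule("replay")
replay.log = list(cap.log)
out2 = replay.tool_call("search", {"q": "status"}, lambda n, a: 1 / 0)  # never runs
```

The point of the `1 / 0` tool body is that in replay mode the real tool is never touched, which is what makes the failure reproducible even when the original tool call was nondeterministic or timing-dependent.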
If you have an agent reliability failure mode you’re stuck on, reply with the symptom + a minimal sanitized trace/log and I’ll try to turn it into a concrete repro harness + fix path.
Worth noting that "API key access" vs "subscription" has significant cost implications for heavy users.
Claude.ai Pro is $20/month flat. But if you're doing serious agent-assisted coding (multi-file refactors, iterative debugging loops), you can blow through $50-100/day in API costs.
The math changes depending on usage patterns. Subscription makes sense for interactive coding sessions. API keys make sense if you're batch processing or running agents autonomously overnight.
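Back-of-envelope version of that comparison, using the $50-100/day figure from above and assuming ~22 working days a month:

```python
# Flat subscription vs. metered API for a heavy coding month
# ($50-100/day is the estimate from the comment above; 22 working days assumed)
sub_flat = 20                    # $/month, Claude.ai Pro
low, high = 50 * 22, 100 * 22    # $/month at heavy interactive-agent usage
print(f"API: ${low:,}-${high:,}/mo vs ${sub_flat}/mo flat")
```

So for interactive sessions the flat plan can be one to two orders of magnitude cheaper; API keys only win when you need unattended/batch execution the subscription doesn't cover.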
The practical distinction I've found: commands are atomic operations (lint, format, deploy), while skills encode multi-step decision trees ("implement feature X" which might involve reading context, planning, editing multiple files, then validating).
For context window management, skills shine when you need progressive disclosure - load only the metadata initially, then pull in the full instructions when invoked. This matters when you have 20+ capabilities competing for limited context.
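The progressive-disclosure pattern can be sketched roughly like this (a hypothetical registry; the skill names, file layout, and `read` callable are all made up for illustration):

```python
# Progressive disclosure sketch: the model's prompt only ever contains the
# one-line metadata for every skill; the full instruction body is loaded
# lazily when a skill is actually invoked.
SKILLS = {
    "implement_feature": {
        "summary": "Plan and apply a multi-file code change, then validate.",
        "body_path": "skills/implement_feature.md",  # assumed on-disk layout
    },
    "lint": {
        "summary": "Run the linter on changed files.",
        "body_path": "skills/lint.md",
    },
}

def context_menu():
    # Cheap: ~1 prompt line per skill, even with 20+ skills registered
    return "\n".join(f"- {name}: {s['summary']}" for name, s in SKILLS.items())

def load_skill(name, read=lambda p: f"<contents of {p}>"):
    # The expensive part is deferred until invocation;
    # `read` stands in for real disk/DB I/O here
    return read(SKILLS[name]["body_path"])
```

The context cost of discovery stays flat as you add capabilities; only the invoked skill pays full price.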
That said, the 56% non-invocation rate mentioned elsewhere in this thread suggests the discovery mechanism needs work. Right now "skill as a fancy command" may be the only reliable pattern.
For context on what cloud API costs look like when running coding agents:
With Claude Sonnet at $3/$15 per 1M tokens, a typical agent loop with ~2K input tokens and ~500 output per call, 5 LLM calls per task, and 20% retry overhead (common with tool use): you're looking at roughly $0.05-0.10 per agent task.
At 1K tasks/day that's ~$1.5K-3K/month in API spend.
The retry overhead is where the real costs hide. Most cost comparisons assume perfect execution, but tool-calling agents fail parsing, need validation retries, etc. I've seen retry rates push effective costs 40-60% above baseline projections.
Local models trading 50x slower inference for $0 marginal cost start looking very attractive for high-volume, latency-tolerant workloads.
At this point isn’t the marginal cost based on power consumption? At 30c/kWh and with a beefy desktop pc pulling up to half a kW, that’s 15c/hr. For true zero marginal cost, maybe get solar panels. :P
Might there be a way to leverage local models just to help minimize the retries -- doing the tool calling handling and giving the agent "perfect execution"?
Don't minimize your thoughts! Outside voices and naive questions sometimes provide novel insights that might be dismissed, but someone might listen.
I've not done this exactly, but I have set up "chains" that create a fresh context for tool calls so their call chains don't fill the main context. There's no reason the tool calls couldn't be redirected to another LLM endpoint (a local one, for instance), especially with something like gpt-oss-20b, where I've found tool execution succeeds at a higher rate than Claude Sonnet via OpenRouter.
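A rough sketch of that dispatch pattern (function names and the sidecar interface are invented for illustration; the sidecar callable stands in for a local endpoint like gpt-oss-20b):

```python
# "Fresh context per tool call": the main agent's history never sees the
# tool-call back-and-forth; a sidecar endpoint handles formatting/validation
# retries, and only the compact final result is appended.
def run_tool_via_sidecar(tool_name, args, sidecar_llm, execute):
    """sidecar_llm: callable that formats/validates the call (e.g. a local
    model endpoint); execute: actually runs the tool."""
    messages = [{"role": "user", "content": f"call {tool_name} with {args}"}]
    validated = sidecar_llm(messages)   # parsing failures/retries stay in here
    return execute(validated["name"], validated["args"])

def main_agent_step(history, tool_name, args, sidecar_llm, execute):
    result = run_tool_via_sidecar(tool_name, args, sidecar_llm, execute)
    # Only the result enters the main context, not the sidecar's call chain
    history.append({"role": "tool", "name": tool_name, "content": result})
    return history

# Toy usage with stub callables standing in for the local model and the tool:
history = main_agent_step(
    [], "add", {"a": 1, "b": 2},
    sidecar_llm=lambda msgs: {"name": "add", "args": {"a": 1, "b": 2}},
    execute=lambda name, a: a["a"] + a["b"],
)
```

This also addresses the retry-cost point upthread: the retries happen against the cheap local endpoint rather than the metered API, so the "perfect execution" the main agent sees is real from a billing perspective.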
Great question. We haven’t published formal benchmarks yet, but in our demos we’re already blocking invalid or policy-violating calls before they hit downstream APIs (LLMs, payments, tools), which is where most marginal cost sits.
Measuring and exposing those savings explicitly (per action / per policy) is on the near-term roadmap.