
The failure modes that bit hardest in my production deployments were #4 and #5 -- context limit surprises and cascade failures.

Context overflow is insidious because agents don't error out. They just quietly make worse decisions as the window fills. We only caught it by noticing sudden quality drops around turn 40 in long sessions. No error logs. Just degraded output.
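A cheap way to surface that silent degradation is to log context usage every turn and flag when it crosses a threshold. A minimal sketch, assuming a simple message list; `count_tokens`, `CONTEXT_LIMIT`, and the 4-chars-per-token heuristic are illustrative assumptions, not any particular SDK:

```python
# Illustrative per-turn context monitor; thresholds and token
# estimation are assumptions, not a specific provider's API.
CONTEXT_LIMIT = 200_000
WARN_RATIO = 0.8

def count_tokens(messages):
    # Rough heuristic: ~4 characters per token. Swap in a real
    # tokenizer for production use.
    return sum(len(m["content"]) for m in messages) // 4

def check_context(messages, turn):
    used = count_tokens(messages)
    if used > CONTEXT_LIMIT * WARN_RATIO:
        # Make the silent failure mode visible: no error is thrown,
        # the model just quietly degrades from here on.
        print(f"turn {turn}: context at {used}/{CONTEXT_LIMIT} tokens -- "
              f"expect degraded output; consider summarizing history")
        return False
    return True
```

Even this crude version would have caught our turn-40 quality drops weeks earlier.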

Cascade failures we now handle with explicit checkpoint gates: after each tool call, the orchestrator checks for a failure signal before proceeding. One bad tool call used to silently corrupt 3-4 downstream steps. Adding gates cost ~20 lines and caught 6 production bugs in the first two weeks.
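The gate itself really is about that small. A sketch of the pattern, with a hypothetical error-dict convention for the failure signal (adapt to whatever your tools actually return):

```python
# Checkpoint gate between tool calls: check for a failure signal
# before the result flows downstream. The dict-with-"error" shape
# is a hypothetical convention, not a specific framework's.
class ToolFailure(Exception):
    pass

def gated_call(tool, *args, **kwargs):
    result = tool(*args, **kwargs)
    if result is None or (isinstance(result, dict) and result.get("error")):
        # Halt the chain instead of letting a bad result corrupt
        # the next 3-4 steps silently.
        raise ToolFailure(f"{tool.__name__} failed: {result!r}")
    return result
```

One raised exception stops the whole downstream sequence, which is exactly the behavior you want when the alternative is silent corruption.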

A failure mode I don't see discussed enough: cross-session memory drift. Not prompt injection, not context overflow -- just gradual entropy as file-based memory accumulates noise over weeks. After 3-4 weeks of operation, briefs degrade because agents are drawing on stale context from past sessions.

Fix: weekly memory audits. Review what agents actually wrote down. Prune aggressively. Intentional compression beats automated recall every time.
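The pruning pass can be mostly mechanical, with the human review reserved for judgment calls. A sketch, assuming a JSON-lines memory file with `last_seen` timestamps and a `superseded` flag (both assumptions about how memory might be stored):

```python
# Weekly memory audit sketch. The JSON-lines layout and the
# "last_seen"/"superseded" fields are assumed conventions.
import json
import time

MAX_AGE_DAYS = 28

def audit_memory(path):
    now = time.time()
    kept = []
    with open(path) as f:
        for line in f:
            fact = json.loads(line)
            age_days = (now - fact["last_seen"]) / 86400
            if fact.get("superseded") or age_days > MAX_AGE_DAYS:
                continue  # prune aggressively: stale or replaced facts go
            kept.append(fact)
    with open(path, "w") as f:
        for fact in kept:
            f.write(json.dumps(fact) + "\n")
    return len(kept)
```

Run it on a schedule, then eyeball what survived; the automated cut handles volume, the manual review handles quality.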

I wrote up the full framework (including brief formats that prevent your #1 failure mode) here if useful: https://bleavens-hue.github.io/ai-agent-playbook/


Context overflow is the hardest one to test because you can't reproduce it in unit tests. It shows up in production after the Nth turn.

One mitigation that actually helps: keep your system prompt compact and explicitly structured. Plain prose instructions balloon over iterations as you add edge cases inline. When you decompose the system prompt into typed semantic blocks (role, objective, constraints, output_format), each addition goes into the right section and you can watch each section's size grow. It also makes it much easier to identify which sections are eating the most tokens.
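A toy version of the typed-block idea (names modeled loosely on the description above, not flompt's actual API):

```python
# Toy typed-block system prompt. Block names and the XML-style
# compile target are illustrative, not a specific tool's format.
from dataclasses import dataclass, field

@dataclass
class SystemPrompt:
    role: str = ""
    objective: str = ""
    constraints: list = field(default_factory=list)
    output_format: str = ""

    def compile(self):
        # Compile to tagged sections so every block stays visible.
        items = "".join(f"<item>{c}</item>" for c in self.constraints)
        return "\n".join([
            f"<role>{self.role}</role>",
            f"<objective>{self.objective}</objective>",
            f"<constraints>{items}</constraints>",
            f"<output_format>{self.output_format}</output_format>",
        ])

    def sizes(self):
        # Per-section character counts: shows which block is ballooning.
        return {
            "role": len(self.role),
            "objective": len(self.objective),
            "constraints": sum(len(c) for c in self.constraints),
            "output_format": len(self.output_format),
        }
```

Every new edge case becomes an appended constraint rather than an inline sentence, so growth is localized and measurable.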

I've been building flompt (github.com/Nyrok/flompt) for this, a visual prompt builder that decomposes prompts into 12 typed blocks and compiles to Claude-optimized XML. Keeping instructions structured is partly about quality, but the token budget discipline is a real side benefit.


Cross-session memory drift is a great addition -- we've seen exactly this. We run agents with file-based episodic memory and after about 3 weeks the recall quality drops noticeably. The agent starts referencing stale context that was relevant in week 1 but contradicts current state.

Our current fix is similar to yours: scheduled compression passes that summarize older memories and prune anything that's been superseded. We also track access frequency on stored facts -- cold facts (not accessed in 2+ weeks) get demoted from active context but stay searchable. That alone cut our context pollution by roughly 40%.
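The cold-fact demotion is simple to implement once you track access timestamps. A sketch, where the two-week threshold and the `last_accessed` field are assumptions about the memory schema:

```python
# Access-frequency demotion sketch: cold facts leave active context
# but remain searchable. Threshold and field names are assumptions.
import time

COLD_AFTER = 14 * 86400  # two weeks, in seconds

def partition_facts(facts, now=None):
    now = now if now is not None else time.time()
    active, archived = [], []
    for fact in facts:
        if now - fact["last_accessed"] > COLD_AFTER:
            archived.append(fact)  # out of active context, still queryable
        else:
            active.append(fact)
    return active, archived
```

Only the `active` list gets injected into the prompt each session; `archived` lives behind a search tool.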

The checkpoint gates for cascade failures are smart. We do something similar -- after each tool call, validate the output shape before passing it downstream. Caught a case where a failed API call returned HTML error pages that the agent then tried to parse as JSON, corrupting 3 subsequent steps.
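The HTML-as-JSON case is cheap to guard against. A minimal shape check before anything flows downstream (the leading-`<` heuristic is a simplification; a content-type check is better when you have headers):

```python
# Validate tool output shape before passing it downstream. Catches
# the "API returned an HTML error page, agent parsed it as JSON" case.
import json

def parse_tool_output(raw: str):
    stripped = raw.lstrip()
    if stripped.startswith("<"):
        # Almost certainly an HTML/XML error page, not JSON.
        raise ValueError("tool returned HTML, not JSON; halting chain")
    try:
        return json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"tool output is not valid JSON: {e}")
```

Failing loudly here is the whole point: one raised error beats three corrupted steps.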

Will check out the playbook. Thanks for sharing.

