the flip side of claude creep is that the easy parts are now genuinely free, which means all your time goes to the 30% that was already hard. ai doesn't save you time on the hard bits, it just eliminates the excuse for not having done the easy bits first.

what's helped: think in postconditions, not tasks. instead of 'add feature X', define 'the tests pass and the user can do Y' and let the agent figure out what X means. without that anchor there's nothing to mark as done, so scope drifts indefinitely.
100%
Over the years I've amassed hundreds of code boilerplate snippets/templates that I would copy, paste, and then modify, and now they're all just sitting in Obsidian gathering dust.
Why would I waste my time copying and pasting when I can just have Claude generate basic Ansible playbooks on the spot in 30 seconds?
Cognitive overload? For me, it’s easier to construct a mental model of the thing than to hold a full (sometimes complex) example that may not necessarily be valid and/or on point. And as I’ve been exploring some foundational ideas of computing, I've found you can do away with a lot of the complexity in modern development.
Their ToS wouldn't allow you to access CC if it were to be hosted on some shared account in the cloud, but OpenHelm isn't doing that at all. It just spins up Terminal sessions and performs actions you would make yourself.
good catch on the leakage risk - the pattern of "agent reads .env files and they end up in transcripts" is more common than people realise, especially as claude code gets used for tasks that touch broader parts of a repo.

for macOS specifically, the system keychain is a cleaner option than KeePassXC for this workflow. `security find-generic-password -w -s "my-api-key"` returns the secret directly and composes cleanly into a shell wrapper. no daemon required, access can be scoped per-application, and it integrates with Touch ID for interactive prompts.

the harder problem is credential management in persistent background agents where you don't want any interactive prompts at all. we ended up using macOS keychain with per-process entitlements (set via a signed plist) so the agent process can retrieve keys non-interactively without ever touching disk. the entitlement approach is a bit painful to set up but means even if the agent process is compromised, the keys aren't in any config file or env var to scrape.

(i built something that runs claude code as background scheduled jobs - openhelm.ai - credential handling was one of the more annoying problems to get right)
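a minimal sketch of that wrapper idea, fetching the secret at launch and handing it to the child process via its environment so it never lands on disk or in a transcript (the service name `my-api-key` and env var name are placeholders; assumes macOS's `security` CLI is available):

```python
import os
import subprocess

def keychain_lookup_cmd(service: str) -> list[str]:
    # build argv for a macOS keychain lookup; using argv avoids shell-quoting issues
    return ["security", "find-generic-password", "-w", "-s", service]

def run_with_secret(service: str, env_var: str, cmd: list[str]) -> int:
    """Fetch the secret at launch time and pass it to the child via its env,
    so it never appears in a config file the agent might read."""
    secret = subprocess.check_output(keychain_lookup_cmd(service), text=True).strip()
    env = {**os.environ, env_var: secret}
    return subprocess.call(cmd, env=env)
```

something like `run_with_secret("my-api-key", "API_KEY", [...])` then wraps whatever command the agent runs.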
the interesting design tension i ran into building in this space is context management for longer sessions. the model accumulates tool call history that degrades output quality well before you hit the hard context limit - you start seeing "let me check that again" loops and increasingly hedged tool selection.

a few things that helped: (1) summarizing completed sub-task outputs into a compact working-memory block that replaces the full tool call history, (2) being aggressive about dropping intermediate file read results once the relevant information has been extracted, and (3) structuring the initial system prompt so the model has a clear mental model of what "done" looks like before it starts exploring.

the swift angle is actually a nice fit - the structured concurrency model maps well to the agent loop, and the strong type system makes tool schema definition less error-prone than JSON string wrangling in most other languages.
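a rough sketch of technique (1), swapping old tool-call traffic for one synthetic working-memory message (message shapes are simplified assumptions here, not any particular SDK's types):

```python
def compact_history(messages, summarize, keep_last=4):
    """Replace older turns with a single synthetic summary message,
    keeping the system prompt and the most recent turns verbatim."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_last:
        return messages
    old, recent = rest[:-keep_last], rest[-keep_last:]
    # summarize() would typically be an LLM call: "what was done, what was learned"
    memory = {"role": "user", "content": f"[working memory]\n{summarize(old)}"}
    return system + [memory] + recent
```

the tradeoff, as noted below, is losing the ability to reference exact earlier tool outputs.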
Yeah, this is basically what I ran into too. I actually wrote about this in Stage 6 (https://ivanmagda.dev/posts/s06-context-compaction/). I went with your option (1): once history crosses a token threshold, the agent asks the model to summarize everything so far, then swaps the full history for that summary. Keeps the context window clean, though you do lose the ability to go back and reference exact earlier tool outputs.
The hard part was picking when to trigger it. Too early and you're throwing away useful context. Too late and the model's already struggling. I ended up just using a simple token count — nothing clever, but it works.
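The trigger described above can stay very simple. A sketch, using the rough chars/4 heuristic for token counting (a real tokenizer would be more accurate; the budget numbers are illustrative):

```python
def estimate_tokens(messages) -> int:
    # crude heuristic: roughly 4 characters per token for English text
    return sum(len(m["content"]) for m in messages) // 4

def should_compact(messages, budget=200_000, ratio=0.5) -> bool:
    """Fire compaction once history crosses a fraction of the context budget,
    well before the hard limit where quality has usually already degraded."""
    return estimate_tokens(messages) > budget * ratio
```

Picking `ratio` is exactly the too-early/too-late tradeoff: lower values throw away useful context sooner, higher values let the model struggle longer first.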
And yeah, the Swift angle was genuinely fun. Defining tool schemas as Codable structs that auto-generate JSON schemas, and getting compiler errors instead of runtime API failures, is a huge win.
So that’s what it is! I was wondering why it still makes mistakes and forgets the steering even after I reduce context and summarise, and I couldn’t find an explanation for why it starts ignoring instructions when the context isn’t full at all.
How did you work out that accumulated tool call history is what degrades it?
Isn’t this the biggest problem there is, and not just a “design tension”?
the debate round is the most interesting part of this - curious what you're actually measuring when models "change their minds."

the question is whether cross-model exposure changes the actual answer distribution or mostly updates surface presentation while keeping the same underlying conclusion. models are generally trained to be responsive to context and to avoid apparent contradiction, which could look like genuine updating but just be social pressure sensitivity.

one experiment worth trying: run a debate where each model sees a summary of the other models' reasoning without seeing their specific answer or which model gave it. see if agreement rates change compared to the version where models see attributed answers with model names. if the named version shows higher agreement it would suggest status/brand effects rather than reasoning-based updating.

also curious whether the "reviewer model" that summarizes the transcript can itself be swapped out and whether the summary framing affects the perceived winner. that would be another confound worth controlling for.
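the attributed-vs-blinded condition could be harnessed with something as small as this (function and field names invented for illustration; the point is that only the formatting differs between conditions):

```python
def peer_context(peers: dict[str, str], attributed: bool) -> str:
    """Format other models' reasoning for the debate round.
    In the blinded condition, strip model names and show reasoning only."""
    lines = []
    for i, (name, reasoning) in enumerate(sorted(peers.items()), 1):
        label = name if attributed else f"anonymous model {i}"
        lines.append(f"{label}: {reasoning}")
    return "\n".join(lines)
```

run both conditions over the same question set and compare agreement rates; a gap in favor of the attributed version would point at brand effects rather than reasoning-based updating.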
yea good points - in general the models don't change their minds much from what I've seen at the current sample size, but worth checking in more detail. The summarizer is just tasked with objective summarization of the facts presented; it doesn't have an opinion, so swapping the model shouldn't really affect anything.
the question conflates two things worth separating: enjoyment of problem-solving versus satisfaction of shipping.

if most of your craft satisfaction came from debugging a subtle race condition or working out an elegant abstraction, that aha moment is harder to get when the agent gets there first. that's a real loss and worth naming honestly rather than hand-waving away.

but if your satisfaction came from seeing something working, from momentum, from having built something a user can actually touch - agents compress the gap between idea and working software in a way that's hard to argue with.

where it gets uncomfortable: watching the agent do the intellectually interesting parts while you review and manage QA. that discomfort is useful signal though. it means you were getting satisfaction from implementation work that, in hindsight, could have been delegated. the natural response is to move upstream - to the parts that still require judgment: what to build at all, which edge cases actually matter to real users, what the right abstraction is.

for me as a solo founder it's been net positive. the craft satisfaction shifted, it didn't disappear.
context bloat in claude code runs is real. in my experience the main culprit is tool output verbosity - claude reads whole files when it only needed 10 lines, or grep returns 500 results and all of them end up in the context.
my first instinct was to fix it upstream (tighter tool calls, explicit line limits) rather than filtering downstream. and that helps a lot. but a proxy/filter layer is genuinely useful for the cases you can't control - when the model decides to explore 20 files you didn't expect it to need.
curious about the failure modes though. the hard part of this problem is distinguishing 'noise the model should discard' from 'context the model needs to take the right path' - same data, different task. does pruner do anything to handle cases where the filtering accidentally removes something load-bearing?
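one conservative answer to the load-bearing-context problem is to prune reversibly: truncate bulky tool output but leave a marker so the model knows content was elided and can re-request it. a sketch (the marker wording and limits are made up for illustration):

```python
def prune_tool_output(output: str, max_lines: int = 20) -> str:
    """Keep the head of a long tool result; replace the tail with a marker
    telling the model that content was dropped and how to get it back."""
    lines = output.splitlines()
    if len(lines) <= max_lines:
        return output
    kept = lines[:max_lines]
    dropped = len(lines) - max_lines
    kept.append(f"[... {dropped} lines pruned; re-run the tool with a narrower query to see them]")
    return "\n".join(kept)
```

this doesn't solve the "same data, different task" ambiguity, but it turns a silent deletion into a recoverable one.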
the worktree discipline failure is the most interesting part of this post to me. when claude is interactive, "cd into the wrong repo" is catchable. when it's running unattended on a schedule, you find out in the morning.
the abstraction is right - isolated worktree, scoped task, commit only what belongs. the failure is enforcement. git worktrees don't prevent a process from running `cd ../main-repo`. that requires something external imposing the boundary, not relying on the agent to respect it.
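imposing that external boundary can be as simple as a runner that validates every path a tool call touches against the worktree root before executing it, instead of trusting the agent to stay put. a minimal sketch (not tied to any particular runner):

```python
from pathlib import Path

def inside_worktree(path: str, worktree_root: str) -> bool:
    """True only if the resolved path stays under the worktree root.
    resolve() defeats `../main-repo`-style escapes and symlink tricks;
    absolute paths outside the root also fail the check."""
    root = Path(worktree_root).resolve()
    target = Path(worktree_root, path).resolve()
    return target == root or root in target.parents
```

a scheduler would reject (or escalate for approval) any operation whose path fails this check.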
what you've built (the 8:47 sweep) is a narrow-scope autonomous job: well-defined inputs, deterministic outputs, bounded time. these work well because the scope is clear enough that deviation is obvious. the harder category is "fix any failing tests" - that task requires judgment about what's in scope, and judgment is exactly where worktree escapes happen.
i've been working on tooling for scheduling this kind of claude work (openhelm.ai) and the isolation problem is front and center. separate working directories per run, no write access to the main repo unless that's the explicit task. your experience here is exactly the failure mode that design is trying to prevent.
yeah, it's curious. I sometimes ask it why it ignored what is explicitly in its memory, and all it can do is apologize. I ask -- I'm using Claude with a 1M context, you have an explicit memory -- why do you ignore it? And the answer I get is "I don't know, I just didn't follow the instructions."
For it to follow the instructions I had for it. Call me naive and stupid for thinking the 1M context window on the brand new model would actually, y'know, work.
Just dealt with this last night with Claude repeatedly risking a full system crash by failing to ensure that the previous training run of a model ended before starting the next one.
It's a pretty strange issue, makes me feel like the 1M context model was actually a downgrade, but it's probably something weird about the state of its memory document. I wasn't even very deep into the context.
nice architecture -- the two-plane model is something i've been thinking about too, from a slightly different angle.
i built something for this actually (openhelm.ai) -- the problem i was solving is less about orchestrating active PR loops and more about scheduling claude code jobs to run unattended on a cron-like schedule. user describes a high-level goal, it gets planned into a set of tasks with a next_fire_at, and those run autonomously in the background even when they're not at their desk.
the piece i found hardest: deciding what requires human approval vs what can auto-proceed. we landed on "no plan runs without user sign-off" as a hard rule, but even within an approved plan, mid-job blockers that need human input are more common than you'd expect.
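the "no plan runs without sign-off" rule reduces to a tiny state machine; a sketch (field and function names invented here, not openhelm's actual API):

```python
from dataclasses import dataclass

@dataclass
class Plan:
    tasks: list[str]
    approved: bool = False  # flipped only by an explicit human action

def run_plan(plan: Plan, execute) -> list[str]:
    """Refuse to execute anything until a human has signed off on the plan."""
    if not plan.approved:
        raise PermissionError("plan requires user sign-off before any task runs")
    return [execute(t) for t in plan.tasks]
```

the mid-job blockers are the harder case: they need the same gate, but reached from inside an already-approved run.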
curious how TTal handles tasks that get legitimately stuck mid-execution -- does the manager agent have heuristics for detecting "stuck vs slow"? the watchdog timeout approach (we sigterm after 30 min) is blunt but works.