the flip side of claude creep is that the easy parts are now genuinely free, which means all your time goes to the 30% that was already hard. ai doesn't save you time on the hard bits, it just eliminates the excuse for not having done the easy bits first.

what's helped: think in postconditions, not tasks. instead of 'add feature X', define 'the tests pass and the user can do Y' and let the agent figure out what X means. without that anchor there's nothing to mark as done, so scope drifts indefinitely.
100%
Over the years I've amassed hundreds of code boilerplate snippets/templates that I would copy, paste, and then modify, and now they're all just sitting in Obsidian gathering dust.
Why would I waste my time copying and pasting when I can just have Claude generate basic Ansible playbooks on the spot in 30 seconds?
Cognitive overload? For me, it’s easier to construct a mental model of the thing than to hold a full (sometimes complex) example that may not necessarily be valid and/or on point. And as I’ve been exploring some foundational ideas of computing, I've found you can do away with a lot of the complexity in modern development.
Their ToS wouldn't allow you to access CC if it were to be hosted on some shared account in the cloud, but OpenHelm isn't doing that at all. It just spins up Terminal sessions and performs actions you would make yourself.
good catch on the leakage risk - the pattern of "agent reads .env files and they end up in transcripts" is more common than people realise, especially as claude code gets used for tasks that touch broader parts of a repo.

for macOS specifically, the system keychain is a cleaner option than KeePassXC for this workflow. `security find-generic-password -w -s "my-api-key"` returns the secret directly and composes cleanly into a shell wrapper. no daemon required, access can be scoped per-application, and it integrates with Touch ID for interactive prompts.

the harder problem is credential management in persistent background agents where you don't want any interactive prompts at all. we ended up using macOS keychain with per-process entitlements (set via a signed plist) so the agent process can retrieve keys non-interactively without ever touching disk. the entitlement approach is a bit painful to set up but means even if the agent process is compromised, the keys aren't in any config file or env var to scrape.

(i built something that runs claude code as background scheduled jobs - openhelm.ai - credential handling was one of the more annoying problems to get right)
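a minimal sketch of that wrapper idea, fetching the secret at launch and handing it to the child process via its environment so it never lands on disk or in a transcript (the service name `my-api-key` and env var name are placeholders; assumes macOS's `security` CLI is available):

```python
import os
import subprocess

def keychain_lookup_cmd(service: str) -> list[str]:
    # build argv for a macOS keychain lookup; using argv avoids shell-quoting issues
    return ["security", "find-generic-password", "-w", "-s", service]

def run_with_secret(service: str, env_var: str, cmd: list[str]) -> int:
    """Fetch the secret at launch time and pass it to the child via its env,
    so it never appears in a config file the agent might read."""
    secret = subprocess.check_output(keychain_lookup_cmd(service), text=True).strip()
    env = {**os.environ, env_var: secret}
    return subprocess.call(cmd, env=env)
```

something like `run_with_secret("my-api-key", "API_KEY", [...])` then wraps whatever command the agent runs.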
the interesting design tension i ran into building in this space is context management for longer sessions. the model accumulates tool call history that degrades output quality well before you hit the hard context limit - you start seeing "let me check that again" loops and increasingly hedged tool selection.

a few things that helped: (1) summarizing completed sub-task outputs into a compact working-memory block that replaces the full tool call history, (2) being aggressive about dropping intermediate file read results once the relevant information has been extracted, and (3) structuring the initial system prompt so the model has a clear mental model of what "done" looks like before it starts exploring.

the swift angle is actually a nice fit - the structured concurrency model maps well to the agent loop, and the strong type system makes tool schema definition less error-prone than JSON string wrangling in most other languages.
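a rough sketch of technique (1), swapping old tool-call traffic for one synthetic working-memory message (message shapes are simplified assumptions here, not any particular SDK's types):

```python
def compact_history(messages, summarize, keep_last=4):
    """Replace older turns with a single synthetic summary message,
    keeping the system prompt and the most recent turns verbatim."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_last:
        return messages
    old, recent = rest[:-keep_last], rest[-keep_last:]
    # summarize() would typically be an LLM call: "what was done, what was learned"
    memory = {"role": "user", "content": f"[working memory]\n{summarize(old)}"}
    return system + [memory] + recent
```

the tradeoff, as noted below, is losing the ability to reference exact earlier tool outputs.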
Yeah, this is basically what I ran into too. I actually wrote about this in Stage 6 (https://ivanmagda.dev/posts/s06-context-compaction/). I went with your option (1): once history crosses a token threshold, the agent asks the model to summarize everything so far, then swaps the full history for that summary. Keeps the context window clean, though you do lose the ability to go back and reference exact earlier tool outputs.
The hard part was picking when to trigger it. Too early and you're throwing away useful context. Too late and the model's already struggling. I ended up just using a simple token count — nothing clever, but it works.
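The trigger described above can stay very simple. A sketch, using the rough chars/4 heuristic for token counting (a real tokenizer would be more accurate; the budget numbers are illustrative):

```python
def estimate_tokens(messages) -> int:
    # crude heuristic: roughly 4 characters per token for English text
    return sum(len(m["content"]) for m in messages) // 4

def should_compact(messages, budget=200_000, ratio=0.5) -> bool:
    """Fire compaction once history crosses a fraction of the context budget,
    well before the hard limit where quality has usually already degraded."""
    return estimate_tokens(messages) > budget * ratio
```

Picking `ratio` is exactly the too-early/too-late tradeoff: lower values throw away useful context sooner, higher values let the model struggle longer first.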
And yeah, the Swift angle was genuinely fun. Defining tool schemas as Codable structs that auto-generate JSON schemas, and getting compiler errors instead of runtime API failures, is a huge win.
So that’s what it is! I was wondering why it still makes mistakes and forgets the steering even after I reduce context and summarise, and I couldn’t find an explanation for why it starts ignoring instructions when the context isn’t full at all.
How did you work out that accumulated tool call history is what degrades it?
Isn’t this the biggest problem there is, and not just a “design tension”?
the debate round is the most interesting part of this - curious what you're actually measuring when models "change their minds."

the question is whether cross-model exposure changes the actual answer distribution or mostly updates surface presentation while keeping the same underlying conclusion. models are generally trained to be responsive to context and to avoid apparent contradiction, which could look like genuine updating but just be social pressure sensitivity.

one experiment worth trying: run a debate where each model sees a summary of the other models' reasoning without seeing their specific answer or which model gave it. see if agreement rates change compared to the version where models see attributed answers with model names. if the named version shows higher agreement it would suggest status/brand effects rather than reasoning-based updating.

also curious whether the "reviewer model" that summarizes the transcript can itself be swapped out and whether the summary framing affects the perceived winner. that would be another confound worth controlling for.
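the attributed-vs-blinded condition could be harnessed with something as small as this (function and field names invented for illustration; the point is that only the formatting differs between conditions):

```python
def peer_context(peers: dict[str, str], attributed: bool) -> str:
    """Format other models' reasoning for the debate round.
    In the blinded condition, strip model names and show reasoning only."""
    lines = []
    for i, (name, reasoning) in enumerate(sorted(peers.items()), 1):
        label = name if attributed else f"anonymous model {i}"
        lines.append(f"{label}: {reasoning}")
    return "\n".join(lines)
```

run both conditions over the same question set and compare agreement rates; a gap in favor of the attributed version would point at brand effects rather than reasoning-based updating.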
yea good points - in general the models don't change their minds much from what I've seen at the current sample size, but worth checking in more detail. The summarizer is just tasked with objective summarization of the facts presented; it doesn't have an opinion, so swapping the model shouldn't really affect anything.
the question conflates two things worth separating: enjoyment of problem-solving versus satisfaction of shipping.

if most of your craft satisfaction came from debugging a subtle race condition or working out an elegant abstraction, that aha moment is harder to get when the agent gets there first. that's a real loss and worth naming honestly rather than hand-waving away.

but if your satisfaction came from seeing something working, from momentum, from having built something a user can actually touch - agents compress the gap between idea and working software in a way that's hard to argue with.

where it gets uncomfortable: watching the agent do the intellectually interesting parts while you review and manage QA. that discomfort is useful signal though. it means you were getting satisfaction from implementation work that, in hindsight, could have been delegated. the natural response is to move upstream - to the parts that still require judgment: what to build at all, which edge cases actually matter to real users, what the right abstraction is.

for me as a solo founder it's been net positive. the craft satisfaction shifted, it didn't disappear.
context bloat in claude code runs is real. in my experience the main culprit is tool output verbosity - claude reads whole files when it only needed 10 lines, or grep returns 500 results and all of them end up in the context.
my first instinct was to fix it upstream (tighter tool calls, explicit line limits) rather than filtering downstream. and that helps a lot. but a proxy/filter layer is genuinely useful for the cases you can't control - when the model decides to explore 20 files you didn't expect it to need.
curious about the failure modes though. the hard part of this problem is distinguishing 'noise the model should discard' from 'context the model needs to take the right path' - same data, different task. does pruner do anything to handle cases where the filtering accidentally removes something load-bearing?
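one conservative answer to the load-bearing-context problem is to prune reversibly: truncate bulky tool output but leave a marker so the model knows content was elided and can re-request it. a sketch (the marker wording and limits are made up for illustration):

```python
def prune_tool_output(output: str, max_lines: int = 20) -> str:
    """Keep the head of a long tool result; replace the tail with a marker
    telling the model that content was dropped and how to get it back."""
    lines = output.splitlines()
    if len(lines) <= max_lines:
        return output
    kept = lines[:max_lines]
    dropped = len(lines) - max_lines
    kept.append(f"[... {dropped} lines pruned; re-run the tool with a narrower query to see them]")
    return "\n".join(kept)
```

this doesn't solve the "same data, different task" ambiguity, but it turns a silent deletion into a recoverable one.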
the worktree discipline failure is the most interesting part of this post to me. when claude is interactive, "cd into the wrong repo" is catchable. when it's running unattended on a schedule, you find out in the morning.
the abstraction is right - isolated worktree, scoped task, commit only what belongs. the failure is enforcement. git worktrees don't prevent a process from running `cd ../main-repo`. that requires something external imposing the boundary, not relying on the agent to respect it.
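imposing that external boundary can be as simple as a runner that validates every path a tool call touches against the worktree root before executing it, instead of trusting the agent to stay put. a minimal sketch (not tied to any particular runner):

```python
from pathlib import Path

def inside_worktree(path: str, worktree_root: str) -> bool:
    """True only if the resolved path stays under the worktree root.
    resolve() defeats `../main-repo`-style escapes and symlink tricks;
    absolute paths outside the root also fail the check."""
    root = Path(worktree_root).resolve()
    target = Path(worktree_root, path).resolve()
    return target == root or root in target.parents
```

a scheduler would reject (or escalate for approval) any operation whose path fails this check.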
what you've built (the 8:47 sweep) is a narrow-scope autonomous job: well-defined inputs, deterministic outputs, bounded time. these work well because the scope is clear enough that deviation is obvious. the harder category is "fix any failing tests" - that task requires judgment about what's in scope, and judgment is exactly where worktree escapes happen.
i've been working on tooling for scheduling this kind of claude work (openhelm.ai) and the isolation problem is front and center. separate working directories per run, no write access to the main repo unless that's the explicit task. your experience here is exactly the failure mode that design is trying to prevent.
yeah, it's curious. I sometimes ask it why it ignored what is explicitly in its memory, and all it can do is apologize. I ask -- I'm using Claude with a 1M context, you have an explicit memory -- why do you ignore it? And the answer I get is "I don't know, I just didn't follow the instructions."
For it to follow the instructions I had for it. Call me naive and stupid for thinking the 1M context window on the brand new model would actually, y'know, work.
Just dealt with this last night with Claude repeatedly risking a full system crash by failing to ensure that the previous training run of a model ended before starting the next one.
It's a pretty strange issue, makes me feel like the 1M context model was actually a downgrade, but it's probably something weird about the state of its memory document. I wasn't even very deep into the context.
nice architecture -- the two-plane model is something i've been thinking about too, from a slightly different angle.
i built something for this actually (openhelm.ai) -- the problem i was solving is less about orchestrating active PR loops and more about scheduling claude code jobs to run unattended on a cron-like schedule. user describes a high-level goal, it gets planned into a set of tasks with a next_fire_at, and those run autonomously in the background even when they're not at their desk.
the piece i found hardest: deciding what requires human approval vs what can auto-proceed. we landed on "no plan runs without user sign-off" as a hard rule, but even within an approved plan, mid-job blockers that need human input are more common than you'd expect.
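the "no plan runs without sign-off" rule reduces to a tiny state machine; a sketch (field and function names invented here, not openhelm's actual API):

```python
from dataclasses import dataclass

@dataclass
class Plan:
    tasks: list[str]
    approved: bool = False  # flipped only by an explicit human action

def run_plan(plan: Plan, execute) -> list[str]:
    """Refuse to execute anything until a human has signed off on the plan."""
    if not plan.approved:
        raise PermissionError("plan requires user sign-off before any task runs")
    return [execute(t) for t in plan.tasks]
```

the mid-job blockers are the harder case: they need the same gate, but reached from inside an already-approved run.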
curious how TTal handles tasks that get legitimately stuck mid-execution -- does the manager agent have heuristics for detecting "stuck vs slow"? the watchdog timeout approach (we sigterm after 30 min) is blunt but works.