Small models are getting good, but I don't think they are quite there yet for this use case. For OK results you're looking at 12-14 GB of VRAM committed to models to make this happen. My MacBook with 24 GB of total RAM runs fine with a 14B model loaded, but I don't think most people have quite enough RAM yet. Still, I think it's something we are going to need.
We are also going to want the opposite. A way for an LLM to request tool calls so that it can drive an arbitrary application. MCP exists, but it expects you to preregister all your MCP servers. I am not sure how well preregistering would work at the scale of every application on your PC.
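To make that concrete, here's a minimal sketch of what one app exposing its actions over MCP could look like, assuming the official Python MCP SDK and its FastMCP helper; the app name and tools are made up for illustration. Multiply this by every application on your PC and the preregistration problem becomes obvious.

```python
# Minimal sketch: a hypothetical desktop app exposing a couple of actions
# as MCP tools over stdio (assumes the official `mcp` Python SDK).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("music-player")  # hypothetical application

@mcp.tool()
def play(track: str) -> str:
    """Start playing the named track."""
    # ... call into the real application here ...
    return f"playing {track}"

@mcp.tool()
def pause() -> str:
    """Pause playback."""
    return "paused"

if __name__ == "__main__":
    # stdio transport; the client still has to be told this server exists,
    # which is exactly the preregistration problem described above.
    mcp.run()
```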
AI coding improved a lot over 2025. In early 2025 LLMs still struggled with counting; now they are capable of tool calling, so they can just use a calculator. Frankly, I'd say AI coding may as well not have existed before mid-2025. The output wasn't really that good. Sure, you could generate code, but you couldn't rely on a coding agent to make a two-line edit to a 1,000-line file.
I don't doubt that they have improved a lot this year, but the same claims were being made last year as well. And the year before that. I still haven't seen anything that proves to me that people are truly that much more productive. They certainly _feel_ more productive, though.
Hell, the GP spent more than $50,000 this year on API calls alone and the results are... what again? Where is the innovation? Where are the tools that wouldn't have been possible to build pre-ChatGPT?
I'm constantly reminded of the Feynman quote: "The first principle is that you must not fool yourself, and you are the easiest person to fool."
LLMs writing test cases, LLMs writing Selenium tests, LLMs doing exploratory testing, LLMs used for canary deployments. All that testing that people didn't do before because it was too hard and took too long? LLMs will be used to do it.
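For what it's worth, this is roughly the shape of Selenium test an LLM will happily generate today; the URL and element IDs below are made up for illustration, not from any real app.

```python
# Illustrative LLM-style Selenium test; URL and selectors are hypothetical.
from selenium import webdriver
from selenium.webdriver.common.by import By


def test_login_shows_dashboard():
    driver = webdriver.Chrome()
    try:
        driver.get("https://staging.example.com/login")
        driver.find_element(By.ID, "email").send_keys("qa@example.com")
        driver.find_element(By.ID, "password").send_keys("hunter2")
        driver.find_element(By.ID, "submit").click()
        assert "Dashboard" in driver.title
    finally:
        driver.quit()
```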
Pushed a new plugin, elevator-notifications, to the repo/marketplace. I'm seeing notifications in Notification Center (had to turn Mac notifications back on to test). Looks like I could fine-tune the actual notification content a bit more, but it's working on my machine.
Can you set up automated integration/end-to-end tests and find a way to feed the results back to your AI agents before a human looks at them? Either via an MCP server or just a comment on the pull request, if the AI has access to PR comments. Not only is your lack of an integration testing pipeline slowing you down, it's also slowing your AI agents down.
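As a rough sketch of the "just a comment on the pull request" option: run the suite in CI and post the output where the agent can read it. The repo name, PR number variable, and token below are placeholders, not anything standard.

```python
# Run the integration suite and post the result as a PR comment via
# GitHub's REST API. Repo, PR number, and token env vars are placeholders.
import os
import subprocess

import requests

REPO = "your-org/your-repo"          # placeholder
PR_NUMBER = os.environ["PR_NUMBER"]  # e.g. set by your CI system
TOKEN = os.environ["GITHUB_TOKEN"]

result = subprocess.run(
    ["pytest", "tests/integration", "-q", "--tb=short"],
    capture_output=True, text=True,
)
status = "passed" if result.returncode == 0 else "FAILED"
body = f"Integration tests {status}.\n\n{result.stdout[-6000:]}"

# PR comments go through the issues endpoint on GitHub's REST API.
requests.post(
    f"https://api.github.com/repos/{REPO}/issues/{PR_NUMBER}/comments",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/vnd.github+json",
    },
    json={"body": body},
    timeout=30,
)
```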
"AFAICT, there’s no service that lets me"... Just make that service!
We do integration testing in a preview/staging env (and locally). We can also do it via docker compose with some GitHub workflow magic, and used to do it that way, but the setup really slowed us down.
What I want is a remote dev env that comes up when I create a new agent and is just like local. I could build that service, but right now it isn't the priority (as much as I would enjoy building it; I personally love making dev tooling).
Claude Code latency sits at the unfortunate point where the wait is long enough for me to go on Twitter, but not long enough to do anything really valuable. I'd be more productive if it either took minutes or came back in under 5-10 seconds.
If AI is good enough to write formal verification, why wouldn't it be good enough to do QA? Why not just have AI do a full manual test sweep after every change?
I guess I am luddite-ish in that I think people still need to decide what must always be true in a system. Tests should exist to check those rules.
AI can help write test code and suggest edge cases, but it shouldn’t be trusted to decide whether behavior is correct.
When software is hard to test, that’s usually a sign the design is too tightly coupled or full of side effects, or that the architecture is unnecessarily complicated. Not that the testing tools are bad.
You get confidence in things by doing them. If you don't have experience doing something, you aren't going to be confident at it. Try vibe coding a few small projects. See how it works out. Try different ways of structuring your instructions to the 'agents'.
Are there public examples of "good instructions" and an iteration process? I have tried and have not been very successful at getting Claude Code to generate correct code for medium-sized projects or features.
I had Claude write a piano webapp (https://webpiano.jcurcioconsulting.com) as a "let's see how this thing works" project. I was pleasantly surprised by the ease of it.