Very much agree. I gave a presentation on AI to a group earlier this week and spent a third of the time talking about the Opus 4.5 inflection point in AI history. The first time I used that model, the day it was released, it was clear it knew what it was doing at a different level. People still jump between models, tools, and time frames when talking about AI and usefulness, but those comparisons mean nothing if they're not using the Opus 4.5 and 4.6 models and Anthropic harnesses like Claude Code or Cowork.
I'm interested in future studies of the history of AI, and whether they'll recognize that as the point when things changed, because for us devs, that was the moment.
I gave a similar presentation in January covering the AI features that emerged in 2025, culminating in the step-function capability jump in Nov '25, and where I went from there... (certainly my GitHub activity has been bright green since).
The presentation was created with Claude Code, to prove the point; I'm never going back to Keynote/PowerPoint. Press the 'X' key to disable "safe mode". Prompts are in the repo.
Massively better, and I can't understand how many comments online say they're comparable (other than paid actors, which fits the right-wing angle OpenAI takes, since paid right-wing online comments seem quite common overall).
I remember, on the Opus 4.5 release day, watching what it could do with a test app I wanted it to build and saying out loud to myself "oh shit" because of how much better it was at the conversation, the planning, the understanding, and the building. Posts like this[0] say similar things: Opus 4.5 plus Claude Code was the tipping point, the gap is widening, and Anthropic has far more momentum and is going in the better direction, with useful models that aren't fully aligned with bad actors.
> It has entered what mountaineers call the death zone: the altitude above 8,000 metres at which the human body consumes itself faster than it can be repaired.
> Over the past four years the Russian economy has bifurcated into two distinct metabolic systems... The body is metabolising its own muscle tissue for energy.
> A recession is like fatigue: rest and you recover. Russia’s condition is like altitude sickness: the longer you stay, the worse it gets, regardless of rest.
> But Vladimir Putin is not only watching his own oxygen gauge. He is watching the other climbers.
Always a fan of the writing style the Economist promotes.
I found the metaphors/analogies rather disturbing, e.g. metabolism. Does the human body have a Chinese assistant with an oxygen tent at 8,000 metres?
I wonder how much this might change in the coming years purely from GLP-1s. Articles like this[0] (to which, yes, Betteridge's law applies) suggest it's pretty likely they'll end up usable by nearly everyone. Even now, taking people at high cardiovascular risk and dropping that risk way down purely by making them feel full more often is crazy to think about. Not sure about opinions here, but I'm at the point of telling my parents, both in their upper 60s, that they should be on these right now.
Some people shrug it off, or claim they're higher status because they lost weight via diet and exercise, but I map that onto people who think they're better programmers because they don't use LLMs for coding, when the real result is what matters. Similar to the "AI slop" coverage, there are news articles about what happens if you stop GLP-1s and gain the weight back. But the stories of people who either keep microdosing, or learn their body's signals and how they've changed, show long-term success. It's like how those who know how to work with LLMs get good results, while the news focuses on how the smart people supposedly don't use them.
All very interesting subjects. What a world we’re in.
Then why the fuck hasn't the US just added it to Medicare/Medicaid coverage? It makes no sense: these healthcare schemes are costly, and covering this medication would make them... less costly.
What does "less muscle-mass" mean in terms of mortality statistics?
We already know women live longer than men on average, and also have less muscle-mass than men on average, so clearly it's not having too much of an impact on women.
Without looking into actual statistics here, Japan is known for having a high life expectancy, and stereotypically Japan's population is both relatively thin, and has relatively little muscle, so that also seems to defy that expectation.
And specifically GLP-1 usage is associated with significant loss of lean mass:
https://pubmed.ncbi.nlm.nih.gov/38937282/
> In some studies, reductions in lean mass range between 40% and 60% as a proportion of total weight lost ...
This might be a good start. There is quite a bit of material here, and as you might expect much of it is fairly recent and leans on the "skinny equals long life" assumption, which isn't strongly supported by clinical data.
A person who does it naturally is still higher status. Staying thin naturally, especially if also fit, indicates a level of will, health focus, and self-respect that I appreciate. I wouldn't like to start dating a woman and learn that she has a "thinness subscription". That's a lot of money spent to avoid discipline, and lack of discipline is also just unattractive. I would consider GLP-1 use in a potential partner equivalent to him/her ordering food all the time; it wastes money and suggests he/she may be lazy or struggling with executive function.
I literally did this yesterday and had the same thought. An older computer (8 GB of RAM) running a Windows install I never used, and I thought: huh, I wonder how well these models can take me through installing Linux, with the goal of Docker deploys of relatively basic things like cron tasks, a personal Postgres, and MinIO that I can use for self-hosted shared data.
It took a couple of hours, with a few snags along the way, but the model walked me through the Debian setup: how to get through the installer GUI, what to select to make it server-only, then the commands to run so it wouldn't suspend when I closed the laptop. It helped with Tailscale and getting the SSH keys all set up. Heck, it even suggested doing daily dumps of the database, saving them to MinIO, and removing the old ones afterwards. It also knows the limitations of 8 GB of RAM and how to set Docker limits so the different self-hosted services I want to run don't cause issues.
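For anyone curious, the retention side of that daily-dump suggestion is simple enough to sketch. This is a guess at the shape of it, not the script the model actually wrote; the `mydb-YYYY-MM-DD.sql.gz` naming and the 7-day window are assumptions:

```python
from datetime import date, timedelta

def dump_name(db: str, day: date) -> str:
    """Filename for one day's pg_dump, e.g. mydb-2025-11-24.sql.gz."""
    return f"{db}-{day.isoformat()}.sql.gz"

def dumps_to_prune(existing: list[str], db: str, today: date,
                   keep_days: int = 7) -> list[str]:
    """Dump files older than the retention window, ready to be
    deleted from the MinIO bucket after each day's upload."""
    cutoff = today - timedelta(days=keep_days)
    prefix = f"{db}-"
    suffix = ".sql.gz"
    stale = []
    for name in existing:
        # Ignore anything that isn't one of our dated dumps.
        if not (name.startswith(prefix) and name.endswith(suffix)):
            continue
        day = date.fromisoformat(name[len(prefix):-len(suffix)])
        if day < cutoff:
            stale.append(name)
    return stale
```

A daily cron job would then run `pg_dump`, upload the new file, and delete whatever `dumps_to_prune` returns.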
Give me a month, genuinely strong intent, and the ability to google and read posts to find the answers on my own, and I still don't think I would have gotten to this point with the amount of trust I have in the setup.
I very much agree that self-hosting is coming alive because these models can walk you through everything; self-building and self-hosting really can take off. And in the future, when open models are that much better and hardware costs come down (maybe, just guessing of course), we'll be able to host our own agents on these machines we've already set up. All of it done ourselves.
Just reread Story of Your Life, and all it made me want to do is learn Heptapod B and its semagram style of written communication.
I'm also reading “Mathematica: A Secret World of Intuition and Curiosity”, and a part stuck out in a section called The Language Trap. The example the author gives is a recipe for banana bread: if you're familiar with bananas, it's obvious you need to peel them before mashing. But if you've never seen a banana, you'd have no clue what to do. Should a recipe say to peel the banana, or can that be left implicit? Questions like these are clearly coming up more with AI and context, but it's the same for humans. He ends that section by saying most people prefer a video for cooking rather than a written recipe.
Other quote from him:
“The language trap is the belief that naming things is enough to make them exist, and we can dispense with the effort of really imagining them.”
There was a man who was afraid of his shadow and disliked his footprints. So he tried to get away from them.
He ran, but the faster he ran, the more numerous his footprints became, and his shadow kept up with him without lagging behind.
Thinking he was going too slowly, he ran faster and faster, until he collapsed and died of exhaustion.
He did not realize that if he had simply stayed in the shade, his shadow would have disappeared, and if he had sat still, there would have been no footprints.
And another one [0]:
My hut lies in the middle of a dense forest;
Every year the green ivy grows longer.
No news of the affairs of men,
Only the occasional song of a woodcutter.
The sun shines and I mend my robe;
When the moon comes out I read Buddhist poems.
I have nothing to report, my friends.
If you want to find the meaning, stop chasing after so many things.
Infinitely agree with all of it. I was skeptical, then tried Opus 4.5 and was blown away. Codex with 5.0 and 5.1 wasn't great, but 5.2 is a big improvement. I can't code without it anymore because there's no point: on time and quality, with the right constraints, you're going to get better code.
And the same goes for both procrastination from not knowing where to start, and getting stuck in the middle not knowing where to go. That literally never happens anymore. You discuss the planning and the different implementation options with it, and by the end you have a good design description; at that point, what's the point of writing the code yourself when, given that design, it will write it quickly and match what was agreed?
You can code without it. Maybe you don't want to, but if you're a programmer, you can.
(Here I am remembering a time when I had no computer and would write data structures in OCaml with pen and paper, then go to university the next day to try them. Oftentimes they worked on the first try.)
Sure, but the end of this post [0] is where I'm at. I don't feel the need or want to write the code when I can spend my time doing the other parts that are much more interesting and valuable.
> Emil concluded his article like this:
> JustHTML is about 3,000 lines of Python with 8,500+ tests passing. I couldn’t have written it this quickly without the agent.
> But “quickly” doesn’t mean “without thinking.” I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking.
> That’s probably the right division of labor.
> I couldn’t agree more. Coding agents replace the part of my job that involves typing the code into a computer. I find what’s left to be a much more valuable use of my time.
But are those tests relevant? I tried using LLMs to write tests at work, and whenever I review them I end up asking “OK great, it passes the test, but is the test relevant? Does it test anything useful?” And I get an “Oh yeah, you’re right, this test is pointless.”
Keep track of test coverage and ask it to delete tests without lowering coverage by more than, say, 0.01 percentage points. If you have a script that gives it only the test coverage, plus a file listing all tests with their line-number ranges, it becomes a more or less dumb task it can work on for hours without actually reading the files (which would fill the context too quickly).
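A minimal sketch of that guard loop. The `measure` callback, which would run the suite and report total coverage, is hypothetical; in practice it would wrap whatever coverage script you hand the agent:

```python
from typing import Callable

def prune_tests(tests: list[str],
                measure: Callable[[list[str]], float],
                tolerance: float = 0.01) -> list[str]:
    """Greedily try deleting each test; keep a deletion only if total
    coverage stays within `tolerance` percentage points of the baseline."""
    kept = list(tests)
    baseline = measure(kept)
    for test in tests:
        trial = [t for t in kept if t != test]
        # Never delete the last test; never accept a real coverage drop.
        if trial and baseline - measure(trial) <= tolerance:
            kept = trial
    return kept
```

Comparing against the original baseline (rather than re-baselining after each deletion) is what keeps cumulative drift bounded at the stated tolerance.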
If you leave an agent for hours trying to increase coverage by percentage without further guiding instructions you will end up with lots of garbage.
In order to achieve this, you need several distinct loops. One that creates tests (there will be garbage), one that consolidates redundant tests, one that parametrizes repetitive tests, and so on.
Agents create redundant tests for all sorts of reasons. Maybe they're trying a hard to reach line and leave several attempts behind. Or maybe they "get creative" and try to guess what is uncovered instead of actually following the coverage report, etc.
Less capable models are actually better at doing this. They're faster, don't "get creative" with weird ideas mid-task, and cost less. Just make them work one test at a time: spawn, write one test that verifiably increases overall coverage, exit. Once you reach a threshold, start the consolidating loop: pick a redundant pair of tests, consolidate, exit. And so on...
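One way to feed that consolidation loop candidate pairs is per-test coverage sets. A small sketch, assuming you can get a line set per test (e.g. from a per-test coverage run, which is an assumption here):

```python
def redundant_pairs(cov: dict[str, set[int]]) -> list[tuple[str, str]]:
    """Pairs (a, b) where test a's covered lines are a subset of b's,
    so a is a candidate to fold into b without losing coverage."""
    pairs = []
    for a, sa in cov.items():
        for b, sb in cov.items():
            if a == b:
                continue
            # Strict subset, or equal sets deduplicated by name order.
            if sa <= sb and (sa < sb or a < b):
                pairs.append((a, b))
    return pairs
```

Each spawned worker then only needs one pair, which keeps its context tiny.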
Of course, you can use a powerful model and babysit it as well; a few disambiguating questions and interruptions will guide it well. But if you want truly unattended runs, it's damn hard to get stable results.
People see "LLMs" and "tons of tests" in the same sentence and think it shows how models love writing pointless tests, rather than realizing the tests here are the standard, human-written suite, there to show that the model wrote code validated by an already-trusted source.
It shows that writing comments humans will read with the right context is _very_ similar to how we need to interact with LLMs. And if we fail to communicate with humans, clearly we're going to fail with models.
It's the semantics of "can", where it's used to suggest feasibility. When I moved and got a new commute, I still "could" bike to work, but it went from 30 minutes to an hour and a half each way. While technically possible, I would have had to sacrifice a lot by losing two hours a day: laundry, cooking dinner, downtime. I always said I "can't really" bike to work, but there's a lot of context lost.
"Can" is too overloaded a word even with context provided, ranging from places like "could conceivably be achieved" to "usually possible".
The only hint you can dig out is where they might place the limits of feasibility. E.g. "I can fly first class all the time (if I limit the number of flights and spend an unreasonable portion of my wealth on tickets)" is typically a less useful interpretation than "I can fly first class all the time (frequently, without concern, because I'm very well off)", but you have to figure out which they're trying to say (which isn't always easy).
The summary is that for agents to work well they need clear visibility into everything, and putting the data behind a GUI or a poorly maintained CLI is a hindrance. Combined with how structured CRUD apps are, and how reliably agents can write good ones, there's no reason not to have your own. Wins all around: not paying for it, understanding the processes better, and letting agents handle the workflows.
Same. I never used worktrees before, but mapping worktrees to the tickets I'm assigned, for Claude to work on, is really great.
Heck, with the AI I even have it spin up a dev and test DB for each worktree in a Docker container. Each has its own, so they don't conflict on that front either. And I won't lie, I didn't write those scripts: the models did it all, and they can make adjustments when I find different work patterns that I like.
Which all leaves me wondering why I never did this for myself in the past, given the number of times I'm working on multiple parts of a codebase and the annoyance of committing, stashing, and checking out different branches, unable to move quickly between tasks as blockers get resolved.
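The per-worktree database trick mostly comes down to giving each ticket a deterministic port so the containers never collide. A rough sketch of what such a script might do; the naming scheme and port range here are illustrative assumptions, not what the model generated:

```python
import hashlib
import shlex

def ticket_port(ticket: str, base: int = 15432, span: int = 1000) -> int:
    """Stable host port for a ticket's Postgres container."""
    digest = int(hashlib.sha256(ticket.encode()).hexdigest(), 16)
    return base + digest % span

def setup_commands(repo: str, ticket: str) -> list[str]:
    """Shell commands to create a worktree and its own dev database."""
    port = ticket_port(ticket)
    return [
        # One worktree (and branch) per ticket, next to the main checkout.
        f"git -C {shlex.quote(repo)} worktree add ../{shlex.quote(ticket)} -b {shlex.quote(ticket)}",
        # One throwaway Postgres per ticket on its own host port.
        f"docker run -d --name pg-{shlex.quote(ticket)} "
        f"-e POSTGRES_PASSWORD=dev -p {port}:5432 postgres:16",
    ]
```

Hashing the ticket ID means rerunning the script for the same ticket always lands on the same port, so the app config in that worktree never changes.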