I agree completely. I haven't noticed much improvement in coding ability in the last year. I'm using frontier models.
What's been the game changer is tools like Claude Code: automatic agentic tool loops purpose-built for coding. This is what I have seen as the impetus for mainstream adoption, rather than noticeable improvements in ability.
I write a lot of C++ and QML code. Codex 5.3, only released in Feb, is the first model I've used that regularly generates code that passes my 25-year expert smell test, and it has turned generative coding from a time sink/nuisance into a tool I can somewhat rely on not to set me back.
Claude still wasn't quite there at the time, but I haven't tried 4.6 yet.
QML is a declarative-first markup language built on a superset of JavaScript syntax. It's niche and doesn't have a huge amount of training data in the corpus. Codex 5.3 is the first model that doesn't completely botch it or default to writing reams of procedural JS embeds (yes, even after steering). It's also much less prone to going overboard and spamming everything with clouds of helper functions/methods, in both C++ and QML. It knows when to stop, so to speak, and is either trained or able to reason toward a more idiomatic ideal, with far less explicit instruction / AGENTS.md wrangling.
It's a huge difference. It might be the result of very specific optimization, or perhaps simultaneous advancements in the harness play a bigger role, but in my books my neck of the woods (or place on the long tail) only really came online in 2026 as far as LLMs are concerned.
Maybe n=1, but I disagree? I notice that Sonnet 4.6 follows instructions much better than 4.5, and it generates code much closer to our existing production code.
It's just a point release and it isn't a significant upgrade in terms of features or capabilities, but it works... better for me.
Are you using a tool like Claude Code, Codex, or Windsurf? I ask because I've found their ability to pull in relevant context improves tasks in exactly the way you're describing.
My own experience is that some things get better and some things get worse in perceived quality at the micro-level with each point release, e.g. 4.5 -> 4.6.
Ok, but the real issue with kids looking up porn is how it warps general expectations around sex. Singling out specific fetishes and taboos that involve consenting adults seems a little bit like misdirected moral panic.
To be more specific, the idea that step-cest warps children's minds is laughable when the larger issue is that 95% of porn portrays women as submissive sex dolls that exist for male pleasure. Don't forget the unrealistic expectations around body and beauty standards.
Hah, the idea is to have an example on the site that is not offensive -- we're not going to write something offensive down -- but where you can understand what it would be or could be. It lets you infer / understand the point without us actually writing something awful. (Maybe we can do it better, though.)
Bears seemed a pretty inoffensive target, plus our backend uses Python with beartype and that library is all about bear jokes.
> The thing with coding agents is that it seems now that you can eat your cake and have it too. We are all still adapting, but results indicate that given the right prompts and processes harnessing LLMs quality code can be had in the cheap.
It's cheaper but not cheap
If you're building a variation of a CRUD web app, or aggregating data from some data source(s) into a chart or table, you're right. It's like magic. I never thought this type of work was particularly hard or expensive though.
I'm using frontier models and I've found if you're working on something that hasn't been done by 100,000 developers before you and published to stackoverflow and/or open source, the LLM becomes a helpful tool but requires a ton of guidance. Even the tests LLMs will write seem biased to pass rather than stress its code and find bugs.
It's quite cheap if you consider developer time. But it's only as cheap as you can effectively drive the model, otherwise you are just wasting tokens on garbage code.
> LLM becomes a helpful tool but requires a ton of guidance
I think this is always going to be the case. You are driving the agent like you drive a bike, it'll get you there but you need to be mindful of the clueless kid crossing your path.
For some projects I had good results just letting the agent loose. For others I'd have to make the tasks more specific and granular before offloading to the LLM. I see nothing wrong with it.
> I never thought this type of work was particularly hard or expensive though.
Maybe not intrinsically hard, but hard because it's so boring you can't concentrate.
> the LLM becomes a helpful tool but requires a ton of guidance. Even the tests LLMs will write seem biased to pass rather than stress its code and find bugs.
ISTR some have had success by taking responsibility for the tests and only having the LLM work on the main code. But since I only seem to recall it, that was probably a while ago, so who knows if it's still valid.
How far can Claude take this beyond a cool demo?
Does it become exponentially harder to add the missing features or can you add everything else you need in another two days? I'm guessing the former but would be interested to see what happens.
Are you going to continue trying? I ask because it's only been two days and you're already on Show HN. It seems like if you waited for it to be more complete, it would have been more impressive.
Jira has had free competitors that do at least 75% of what it does since its inception. You could find a dozen on GitHub that actually look good right now.