I added Opus 4.5 to my benchmark of 30 alternatives to your now-classic pelican-bicycle prompt (e.g., “Generate an SVG of a dragonfly balancing a chandelier”). Nine models are now represented:
I was about to say the same; suspiciously good, even. Feels like it's either memorised a bunch of SVG files, or has a search tool and is finding complete items off the web to include either in whole or in part.
Given that it also sometimes goes weird, I suspect it's more likely to be the former.
While the latter would be technically impressive, it's also the whole "this is just collage!" criticism that diffusion image generators faced from people who didn't understand diffusion image generators.
I agree with your sentiment; this incremental evolution is getting difficult to feel when working with code, especially in large enterprise codebases. I would say that for the vast majority of tasks there is a much bigger gap in tooling than in foundational model capability.
Also came to say the same thing. When Gemini 3 came out, several people asked me "Is it better than Opus 4.1?", but I could no longer answer. It's too hard to evaluate consistently across a range of tasks.
> Thinking blocks from previous assistant turns are preserved in model context by default
This seems like a huge change, no? I often use max thinking on the assumption that the only downside is time, but now there's also a downside of context pollution.
Opus 4.5 seems to think a lot less than other models, so it’s probably not as many tokens as you might think. This would be a disaster for models like GPT-5 high, but for Opus they can probably get away with it.
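The "context pollution" worry can be put in rough numbers. A minimal back-of-envelope sketch, where all per-turn token counts are made-up illustrative figures (none come from this thread or any published measurement):

```python
# Hypothetical sketch: how much preserved thinking adds to context over
# a multi-turn conversation. Token counts below are invented examples.

def context_after(turns, visible_per_turn, thinking_per_turn, preserve_thinking):
    """Total assistant-side tokens carried in context after `turns` turns."""
    per_turn = visible_per_turn + (thinking_per_turn if preserve_thinking else 0)
    return turns * per_turn

# A model with long thinking traces vs. a terse one, over 10 turns.
print(context_after(10, 500, 4000, preserve_thinking=True))   # 45000
print(context_after(10, 500, 4000, preserve_thinking=False))  # 5000
print(context_after(10, 500, 800, preserve_thinking=True))    # 13000
```

The point of the comment above is the third case: if a model's thinking traces are short relative to its visible output, preserving them by default costs comparatively little context.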