I find it struggles to refactor even codebases that aren't that large. If you have a somewhat complicated change that spans the full stack, with some wrinkle that makes it slightly harder than adding a data field, even the most modern LLMs seem to trip over themselves. That's true even when I tell it to create an implementation plan, write it to a markdown file, and then step through those steps in a separate prompt.
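For what it's worth, the two-step split I've been trying looks roughly like this (illustrative prompts and the PLAN.md name are mine, nothing official):

    Prompt 1 (planning): "Read the ticket below and write a step-by-step
    implementation plan to PLAN.md. Don't change any code yet."

    Prompt 2 (fresh session): "Read PLAN.md and implement step 1 only.
    Stop and show me the diff before moving on."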
Not that it makes it useless, just that we seem to not "be there" yet for the standard tasks software engineers do every day.
I haven't used GPT-5 yet, but even on a 1,000-line codebase I found Opus 4, o3, etc. to be very hit or miss. The trouble is I can't predict when these models will hit, so the misses cost time and reduce their overall utility.
I'm exclusively using sonnet via claude-code on their max plan (specifying sonnet so that opus isn't used). I just wasn't pleased with the opus output, though maybe I need to use it differently. I haven't bothered with 4.1 yet. Another thing I noticed: opus would eat through my usage caps super quickly, whereas using sonnet exclusively I never hit a cap.
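In case it's useful, pinning the model looks something like this (exact alias handling may vary by claude-code version):

    # launch pinned to sonnet
    claude --model sonnet

    # or switch mid-session with the slash command
    /model sonnet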
I'd really just love incremental improvements over sonnet. Increasing sonnet's context window would be a game changer for me. After an auto-compact, the quality can fall off a cliff, and I need to spend time bringing it back up to speed.
When I need a bit more punch for reasoning or architecture-type evaluations, I have it talk to gemini pro via zen mcp and OpenRouter. I've been considering setting up a subagent for architecture and system design decisions that would use the latest opus, to see whether it's better than gemini pro (so far I have no complaints, though).
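If I do go the subagent route, my understanding is it's just a markdown file with YAML frontmatter dropped into .claude/agents/, roughly like this sketch (the architect name and prompt are my own, and the model field may depend on your claude-code version):

    ---
    name: architect
    description: Consulted for architecture and system design decisions.
    model: opus
    ---
    You are a software architect. Evaluate proposed designs for
    coupling, failure modes, and migration cost. Don't write code.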