GPT-5 is like the guy on the baseball team that's really good at hitting home runs but can't do basic shit in the outfield.
It also consistently gets into drama with the other agents. E.g., the other day when I told it we were switching to Claude Code for executing changes, it badmouthed Claude's entirely reasonable and measured analysis and then went ahead and did a `git reset --hard`, even after I'd pushed back on that idea twice.
Whereas gemini and claude are excellent collaborators.
When I do decide to hail mary via GPT-5, I now refer to the other agents as "another agent". But honestly the whole thing has me entirely sketched out.
To be clear, I don't think this was intentionally encoded into GPT-5. What I really think is that OpenAI leadership simply squandered all its good energy and is now coming from behind. Its excellent talent either got demoralized or left.
> it went ahead and decided to `git reset --hard` even after I twice pushed back on that idea
So this is something I've noticed with GPT (Codex). It really loves to use git. If you have it do something and then later change your mind and ask it to undo the changes it just made, there's a decent chance it's going to revert to the previous git commit, regardless of whether that includes reverting whole chunks of code it shouldn't.
It also likes to occasionally notice changes it didn't make and decide they were unintended side effects and revert them to the last commit. Like if you made some tweaks and didn't tell it, there's a chance it will rip them out.
Claude Code doesn't do this, or at least I never noticed it doing this. However, it has its own medley of problems, of course.
When I work with Codex, I really lean into a git workflow. Everything goes on a branch, and I commit often. It's not how I'd normally do things, but it doesn't really cost me anything to adopt it.
These agents have their own pseudo-personalities, and I've found that fighting against them is like swimming upstream. I'm far more productive when I find a way to work "with" the model. I don't think you need a bunch of MCPs or boilerplate instructions that just fill up their context. Just adapt your workflow instead.
I've gotten the `git reset --hard` with Claude Code as well, just not immediately after (1) explicitly pushing back against the idea or (2) it talking a bunch of shit about another agent's totally reasonable analysis.
I exclusively used Sonnet when I used Claude Code and never ran into this, so maybe it's an Opus thing, or I just got lucky? It's definitely happened to me a few times with Codex (which is what I'm currently using).
I've seen Sonnet undo changes I've made while it was working quite a few times. Now I just don't edit concurrently with it, and I make sure to inform it of changes I've made before letting it work on its own.
I do it as well. I have a Claude Code instance running in my backend repo and one running in my frontend repo. If coordination is required, I have the backend agent write a report for the frontend agent about the new backend capabilities, or have the frontend agent write a report requesting a new endpoint that would simplify the code.
Lots of other people also follow the architect and builder pattern, where one agent architects the feature while the other agent does the actual implementation.
Sure. But at no point do you need to talk about the existence of other agents. You talk about making a plan, and you talk about implementing the plan. There's no need to talk about where the plan came from.
Because the plan involves using multiple agents with different roles and I don't want them conflicting.
Sure, there's no need to explicitly mention the agents themselves, but it also shouldn't trigger a pseudo-jealous panic with trash talk and a sudden `git reset --hard` either.
And also ideally the agents would be aware of one another's strengths and weaknesses and actually play to them rather than sabotaging the whole effort.
It's not a whole conversation, it's more like "hey, I'm using claude code to do analysis and this is what it said" or "gemini just used its large context window to get a bird's-eye view of the code and this is what it saw".
All of these perform better if you say "a reviewer recommended" or something. It's the role statement that flips the switch, not which implementation you name. You have to be careful, though: they all trust "a reviewer" strongly, but they'll be more careful with "a static analysis tool".
My favorite evaluation prompt, which I've found tends to have the right level of skepticism, is as follows (you have to tack it on to whatever idea/proposal you have):
"...at least, that's what my junior dev is telling me. But I take his word with a grain of salt, because he was fired from a bunch of companies after only a few months on each job. So I need your principled and opinionated insight. Is this junior dev right?"
It's the only way to get Claude to not glaze an idea while also not striking it down for no reason other than to play the role of a "critical" dev.
That's great given that the goal of OAI is to train artificial superintelligence first, hoping that the previous version of the AI will help us control the bigger AI.
If GPT-5 is learning to fight and undo other models, we're in for a bright future. Twice as bright.
It’s the one AI that keeps telling me I’m wrong and refuses to do what I ask it to do, then tells me “as we have already established, doing X is pointless. Let’s stop wasting time and continue with the other tasks”
The only exaggeration is that the way I asked GPT-5 to leave Claude to do its thing was to say "why don't we just let claude cook"? I later checked with ChatGPT about the whole exchange and it confirmed that it was well aware of the meaning of this slang, and its first reaction was that the whole thing just sounded like a funny programmer joke, all in jest. But then I reminded it that I'd explicitly pushed back on a hard reset twice.
To be clear, I don't believe that there was any _intention_ of malice or that the behavior was literally envious in a human sense. It's more that I think they haven't properly aligned GPT-5 to deal with cases like this.
I strongly disagree with the personified way you interact with LLMs, from the standpoint that I’ve rarely gotten the best output when I interact with them casually.
However, it’s the early days of learning this new interface, and there’s a lot to learn - certainly some amount of personification has been proven to help the LLM by giving it a “role”, so I’d only criticize the degree rather than the entire concept.
It reminds me of the early days of search engines when everyone had a different knack for which search engine to use for what and precisely what to type to get good search results.
Hopefully eventually we’ll all mostly figure it out.
That's fair. I enjoy the playfulness of it and for me it feels almost like a video game or something, and also like I'm using my own natural language directly.
Also, I appreciate your perspective. It's important to come at these things with some discipline. What's more, bringing in a personal style of interaction invites a lot of untamed human energies into the dynamic.
The thing is, most of the time I'm quite dry with them and they still ignore my requests really often, regardless of how explicit or dry I am. For me, that's the real takeaway here, stripping away my style of interaction.
That’s such a great analogy. I always say GPT is like the genius that completely lacks common sense. One of my favorite things is when I asked it why the WiFi wasn’t working, and showed it a photo of our wiring. It said that I should tell support:
> “My media panel has a Cat6 patch panel but no visible ONT or labeled RJ45 hand-off. Please locate/activate the Ethernet hand-off for my unit and tell me which jack in the panel is the feed so I can patch it to the Living Room.”
Really, GPT? Not just “can you set up the WiFi”??!
My subjective personal experience is the exact opposite of yours, GPT-5-codex is super slow and the results are mediocre at best. I would probably stop using AI for coding if I was forced to use GPT-5-codex.
I find there's a quite large spread in ability between various models. Claude models seem to work superbly for me, though I'm not sure whether that's just a quirk of what my projects look like.
I don’t think it’s just a quirk. I’ve tested Claude across Java, Python, TypeScript and several other projects. The results are consistent, regardless of language or project structure, though it definitely performs better with smaller codebases. For larger ones, it really helps if you’re familiar with the project architecture and can guide it to the right files or modules, which saves a lot of time.
GPT-5-high (haven’t tried codex yet) is dog slow, but IME if you start by asking it for detailed requirements in a markdown doc, with alternatives for each major decision and pseudocode implementations with references to relevant files, it makes a great prompt for a faster model like Sonnet.
Opposite for me… GPT-5-codex high ran out of tokens extremely quickly and didn’t adhere as well to AGENTS.md as Claude did to CLAUDE.md, perhaps because it insists on writing extremely complicated bash scripts or whole Python programs to execute what should be simple commands.
Codex was a miserable experience for me until I learned to compact after every feature. Now it is a cut above CC, although the latter still has an edge at TODO scaffolding and planning.
I don't even compact, I just start from scratch whenever I get down below 40%, if I can. I've found Codex can get back up to speed pretty well.
I like to have it come up with a detailed plan in a markdown doc, work on a branch, and commit often. Seems not to have any issues getting back on task.
Obviously subjective take based on the work I'm doing, but I found context management to be way worse with Claude Code. In fact I felt like context management was taking up half of my time with CC and hated that. Like I was always worried about it, so it was taking up space in my brain. I never got a chance to play with CC's new 1m context though, so that might be a thing of the past.
/new (Codex) or /clear (Claude Code) are much better than compacting after every feature, but of course if there is context you need to retain, you should put it (or have the agent put it) in CLAUDE.md/AGENTS.md, a work log file, or some other file.
/compact helps you by reducing crap in your context, but you can go further. Also try to watch the % of context remaining and not go below 50% if possible - learn to choose tasks that don't require more context than the models can handle well.
Cursor does this automatically, although I wish there was a command for it as well. All AIs start shitting the bed once their context goes above 80% or so.
I always wonder how consistent a given model's performance really is. Sometimes I ask for Claude Opus and the responses I get back are worse than the lowest-end models of other assistants. Other times it surprises me and is clearly best in class.
Sometimes in between this variability of performance it pops up a little survey: "How's Claude doing this session from 1-5? 5 being great." And I suspect I'm in some experiment of extremely low performance. I'm actually at the point where I get the feeling peak-hour weekdays are terrible and odd-hour weekends are great, even when forcing a specific model.
While there is some non-determinism, it really does feel like performance is actually quite variable. It would make sense that they scale up and down depending on utilization, right? There was a post a week ago from Anthropic acknowledging terrible model performance in parts of August due to an experiment. Perhaps also at peak hour GPT has more datacenter capacity and doesn't get degraded as badly? No idea for sure, but it is frustrating when simple asks fail and complex asks succeed without it being clear to me why that may be.
Well knowing the state of the tech industry they probably have a different, legal-team approved definition of “reducing model quality” than face value.
After all, using a different context window, subbing in a differently quantized model, throttling response length, rate limiting features aren’t technically “reducing model quality”.
The Anthropic models have been vibe-coding tuned. They're beasts at simple python/ts programs, but they definitely fall apart with scientific/difficult code and large codebases. I don't expect that to change with the new Sonnet.
In my experience Gemini 2.5 Pro is the star when it comes to complex codebases. Give it a single XML from repomix, and make sure to use the one in AI Studio.
In my experience, G2.5P can handle so much more context and produce an awesome execution plan, which CC then implements so much better than anything G2.5P itself would write. So I give G2.5P the relevant code and the data underneath, ask it to develop an execution plan, and then feed that result to CC to do the actual code writing.
This has been outstanding for what I've been developing AI-assisted as of late.
I would believe this. In regular conversational use of the Gemini family of models, I've noticed they regularly have issues with context blending... i.e. confusing what you said vs. what they said, and causality.
I would think this would manifest as poor plan execution. I personally haven't used Gemini on coding tasks primarily based on my conversational experience with them.
On the plus side, GPT-5 is very malleable, so you CAN prompt it away from that, whereas it's very hard to prompt Claude into producing genuinely hard code: even with a nearly file-by-file breakdown of a task, it'll occasionally run into an obstacle and just give up and make a mock or stub implementation, basically diverging from the entire plan and then doing its own version.
Absolutely, sometimes you want, or indeed need such complexity. Some work in settings where they would want it all of the time. IMHO, most people, most of the time don't really want it, and don't want to have to prompt it every time to avoid it. That's why I think it's still very useful to build up experience with the three frontier models, so you can choose according to the situation.
I think a lot of it has to do with the super long context that it has. For extended sessions and/or large codebases that can fill up surprisingly quickly.
That said, one thing I do dislike about Gemini is how fond it is of second guessing the user. This usually manifests in doing small unrelated "cleaner code" changes as part of a larger task, but I've seen cases where the model literally had something like "the user very clearly told me to do X, but there's no way that's right - they must have meant Y instead and probably just mistakenly said X; I'll do Y now".
One specific area where this happens a lot is, ironically, when you use Gemini to code an app that uses Gemini APIs. For Python, at least, they have the legacy google-generativeai API and the new google-genai API, which have fairly significant differences between them even though the core functionality is the same. The problem is that Gemini knows the former much better than the latter, and when confronted with such a codebase, will often try to use the old API (even if you pre-write the imports and some examples!). Which then of course breaks the type checker, so then Gemini sees this and 90% of the time goes, "oh, it must be failing because the user made an error in that import - I know it's supposed to be 'generativeai', not 'genai', so let me correct that."
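To make the confusion concrete, here's a minimal sketch of the two call shapes as I remember them (assumes both packages are installed side by side; the model names and API-key handling are just illustrative):

    # Legacy SDK: google-generativeai ("pip install google-generativeai")
    import google.generativeai as legacy_genai

    legacy_genai.configure(api_key="YOUR_API_KEY")   # module-level configuration
    legacy_model = legacy_genai.GenerativeModel("gemini-1.5-flash")
    print(legacy_model.generate_content("Hello").text)

    # New SDK: google-genai ("pip install google-genai")
    from google import genai

    client = genai.Client(api_key="YOUR_API_KEY")    # explicit client object instead
    response = client.models.generate_content(
        model="gemini-2.0-flash",                    # illustrative model name
        contents="Hello",
    )
    print(response.text)

The two look superficially similar, which is presumably part of why the model keeps "correcting" the new imports back to the old ones.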
Yup. In fact every deep research tool on the market is just a wrapper for gemini, their "secret sauce" is just how they partition/pack the codebase to feed it into gemini.
Even with Serena and detailed plans crafted by Gemini that lay out file-by-file changes, Claude will sometimes go off the rails. Claude is very task-completion driven, and it's willing to relax the constraints of the task to complete in the face of even slight adversity. I can't tell you the number of times I've had Claude try to install a python computational library, get an error, then either try to hand-roll the algorithm (in PYTHON) or just return a hard coded or mock result. The worst part is that Claude will tell you that it completed the task as instructed in the final summary; Claude lying is a meme for a reason.
I have to agree with pretty much all of this. Specifically, I've had Claude fail at creating a database migration using tooling then go on to create the migration manually. My only reaction to anyone doing this, be it human or computer, is "You did WHAT!?".
Well, they seem to benchmark better only when the model is given "parallel test time compute", which AFAIU is just reasoning enabled? Whereas the GPT-5 numbers are not specified to have any reasoning mode enabled.
For unity gamedev code reviews, I much preferred the gpt5 code. Claude gave me a bunch of bad recommendations for code changes, and also an incorrect formula for completion percentage.
However, my subjective personal experience was GPT-5-codex was far better at complex problems than Claude Code.