We once saw GPT-4o spend something like 100 repeated interactions trying an action known not to work (before snapping out of it). My intuition is that this is target fixation: the more times the model repeats something, the more likely it is to repeat it again, because the repetitions occupy more and more of the context. The cycle goes something like this (a toy sketch follows the list):
* It'll output a broken script
* I tell it what's wrong and how to fix it
* It tells me I'm absolutely right and that it will correct it
* It outputs a script with the exact same brokenness
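To make the mechanism concrete, here's a minimal toy sketch, not the real setup: a naive agent loop that appends every attempt and its error to the transcript. `call_model` and `run_command` are hypothetical stand-ins (not a real API), and the fixation is hard-coded rather than emergent; the point is just how quickly one failed exchange comes to dominate the context.

```python
# Toy agent loop. call_model and run_command are hypothetical stand-ins,
# and the "fixation" is hard-coded: the model just repeats its last attempt.

def call_model(messages: list[dict]) -> str:
    # Stand-in for a chat-completion call. To mimic fixation, it
    # repeats whatever it tried last.
    for msg in reversed(messages):
        if msg["role"] == "assistant":
            return msg["content"]
    return "rm -rf build && make"  # first attempt

def run_command(action: str) -> str:
    # Stand-in for the environment: this action never works.
    return "error: permission denied"

messages = [{"role": "user", "content": "Fix the build."}]
for _ in range(100):
    action = call_model(messages)
    messages.append({"role": "assistant", "content": action})
    messages.append({"role": "user", "content": run_command(action)})

# After 100 steps, the transcript is one task line plus 100 copies of
# the identical attempt/error pair: the repetition *is* the context.
assistant_turns = [m["content"] for m in messages if m["role"] == "assistant"]
print(len(assistant_turns), "attempts,",
      len(set(assistant_turns)), "distinct action(s)")
```

In a real loop the repetition is probabilistic rather than hard-coded, but the context dynamics are the same: every retry adds another copy of the pattern for the model to condition on.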