We’ve been testing the upgraded models in the API (where you can control when the upgrade happens), and the newer ones perform significantly worse than the older ones on the same tasks. Tweaking the prompts helps somewhat, but not enough. For now, we’re staying on the older models in production.

Hope OpenAI figures this out because quality has been their biggest moat up until now.