I think the selection of models is a bit off: Haiku instead of Sonnet, for example. Kimi K2's capabilities are closer to Sonnet than to Haiku. GPT-5 might be in the non-reasoning mode, which routes to a smaller model.
I had my suspicions about the GPT-5 routing as well. When I first looked at it, its clock was by far the best; after the minute went by and everything refreshed, its next three were some of the worst of the group. I was wondering if it just hit a lucky path in routing the first time.