But it doesn't evaluate the area where I'm most eager to see improvements in LLM agent performance: unattended complex tasks that require adapting to unexpected challenges, solving problems, and handling ambiguity over a long duration, without a human steering the agent back in the right direction before it hits a wall or starts causing damage.
If it takes me 8 hours to create a pleasant-looking to-do app and Gemini 3 can one-shot it in 5 minutes, that's certainly impressive, but it doesn't help me evaluate whether I could drop an agent into my complex, messy project and expect it to successfully implement a large feature, one that may require reading docs, installing a new NPM package, troubleshooting DB configuration, etc., for 30 minutes to an hour without going off the rails.
It's a legitimate benchmark, I'm not disputing that, but unfortunately it isn't measuring the area that could be a significant productivity multiplier in my day-to-day work. The METR time horizon score is still susceptible to the same pernicious benchmaxxing, whereas I had previously hoped it was measuring something much closer to my real-world usage of LLM agents.
Improvements in long-duration, multi-turn unattended development would save me a lot of babysitting and frustrating back-and-forth with Claude Code/Codex, which currently saps some of the enjoyment out of agentic development for me and requires tedious upfront work setting up effective rules and guardrails to work around those deficits.