But it doesn't evaluate the area where I'm most eager to see improvements in LLM agent performance: unattended complex tasks that require adapting to unexpected challenges, solving problems, and handling ambiguity over a long duration, without a human steering the agent back in the right direction before it hits a wall or starts causing damage.
If it takes me 8 hours to create a pleasant-looking to-do app and Gemini 3 can one-shot it in 5 minutes, that's certainly impressive, but it doesn't help me evaluate whether I could drop an agent into my complex, messy project and expect it to successfully implement a large feature, one that may require reading docs, installing a new NPM package, troubleshooting DB configuration, etc., for 30 minutes to an hour without going off the rails.
It's a legitimate benchmark, I'm not disputing that, but unfortunately it isn't measuring the area that could be a significant productivity multiplier in my day-to-day work. The METR time horizon score is still susceptible to the same pernicious benchmaxxing, whereas I had previously hoped it was measuring something much closer to my real-world usage of LLM agents.
Improvements in long-duration, multi-turn unattended development would save me a lot of babysitting and frustrating back-and-forth with Claude Code/Codex, which currently saps some of the enjoyment out of agentic development for me and requires tedious upfront work setting up effective rules and guardrails to work around those deficits.