Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I was pretty disappointed to learn that the METR metric isn't actually evaluating a model's ability to complete long duration tasks. They're using the estimated time a human would take on a given task. But it did explain my increasing bafflement at how the METR line keeps steadily going up despite my personal experience coding daily with LLMs where they still frequently struggle to work independently for 10 minutes without veering off task after hitting a minor roadblock.

  On a diverse set of multi-step software and reasoning tasks, we record the time needed to complete the task for humans with appropriate expertise. We find that the time taken by human experts is strongly predictive of model success on a given task: current models have almost 100% success rate on tasks taking humans less than 4 minutes, but succeed <10% of the time on tasks taking more than around 4 hours. This allows us to characterize the abilities of a given model by “the length (for humans) of tasks that the model can successfully complete with x% probability”.

  For each model, we can fit a logistic curve to predict model success probability using human task length. After fixing a success probability, we can then convert each model’s predicted success curve into a time duration, by looking at the length of task where the predicted success curve intersects with that probability.
[1] https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...




It makes perfect sense to use human times as a baseline. Because otherwise, the test would be biased towards models with slower inference.

If model A generates 10 tokens a second and model B generates 100 tokens a second, then using real LLM inference time puts A at a massive 10x advantage, all other things equal.


But it doesn't evaluate the area that I am most eager to see improvements in LLM agent performance: unattended complex tasks that require adapting to unexpected challenges, problem solving and ambiguity for a long duration without a human steering them back in the right direction before they hit a wall or start causing damage.

If it takes me 8 hours to create a pleasant looking to-do app, and Gemini 3 can one shot that in 5 minutes, that's certainly impressive but doesn't help me evaluate whether I could drop an agent in my complex, messy project and expect it to successfully implement a large feature that may require reading docs, installing a new NPM package, troubleshooting DB configuration, etc for 30 min to 1 hr without going off the rails.

It's a legitimate benchmark, I'm not disputing that, but it unfortunately isn't measuring the area that could be a significant productivity multiplier in my day-to-day work. The METR time horizon score is still susceptible to the same pernicious benchmaxxing while I had previously hoped that it was measuring something much closer to my real world usage of LLM agents.

Improvements in long duration, multi-turn unattended development would save me lot of babysitting and frustrating back and forth with Claude Code/Codex. Which currently saps some of the enjoyment out of agentic development for me and requires tedious upfront work setting up effective rules and guardrails to work around those deficits.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: