We're at the point where LLMs and coding agents are supposed to do higher-level work. It makes sense to benchmark them against top human performance, rather than average human performance, because at specialized tasks, average human performance isn't enough.
The issues you described seem like they're actually strengths of the benchmark.
I think the keybinding suggestions are really nice. My shell is already configured so that Alt+Left and Alt+Right move by a word, but having bindings that work out of the box, basically everywhere, is really useful whenever I need to work inside a Docker container.
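For reference, a minimal sketch of what that per-word binding looks like in bash/readline; the escape sequences are what many terminal emulators send for Alt+Left/Alt+Right, but they vary by terminal, which is exactly why defaults that work everywhere are valuable:

```shell
# ~/.bashrc (or put the quoted pairs in ~/.inputrc without `bind`)
# \e[1;3D and \e[1;3C are common Alt+Left / Alt+Right sequences;
# some terminals send \eb / \ef instead, which readline already
# maps to backward-word / forward-word by default.
bind '"\e[1;3D": backward-word'
bind '"\e[1;3C": forward-word'
```

Inside a fresh container, none of this local configuration is present, so only the stock defaults apply.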