Is there a human-play benchmark (even informally) for this style of interface? Not saying it's necessary or even relevant, I'm just curious to know what programmatic Factorio feels like -- I imagine spatial reasoning around text prompts would be fairly challenging for human players to navigate as well.
Human benchmarks for Factorio are speedrunners rushing to launch the first rocket. The current record is just over 4 hours for one player, and 90 minutes for a team. You can see from that alone that a multi-tasking LLM has room to outperform humans.
The current 4h12m record is for 100% (where you have to get every single achievement in the game in a single run); any% (where you just need to launch a rocket) is under 2 hours (1h42 for the latest Factorio v2.x, 1h18 for v1.x). There are a few other differences between the categories regarding map selection and blueprint use as well.
I think he is talking about a human using the programmatic API the LLMs are using to play the game. I think that would be a whole lot slower than a normal playthrough.