The point is that "best model you can train in 5 minutes" is hardware-dependent: the answer changes with whatever hardware you have available. So it's necessarily a single-player game.

"Best model you can train with X joules" is a fairer contest that multiple people could take part in even if they have different hardware available. It's not completely fair, but it's fair enough to be interesting.
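To make the comparison concrete, here's a rough sketch of how a fixed energy budget translates into different wall-clock budgets via t = E / P. The device names and wattages are made-up illustrative numbers, not measurements, and it assumes a constant average power draw, which real hardware won't give you.

```python
def time_budget_seconds(energy_joules: float, avg_power_watts: float) -> float:
    """Wall-clock training time allowed by an energy budget: t = E / P."""
    if avg_power_watts <= 0:
        raise ValueError("average power must be positive")
    return energy_joules / avg_power_watts


# Hypothetical budget: the energy a 300 W device uses in 5 minutes.
BUDGET_J = 300 * 5 * 60

# Hypothetical average power draws in watts (assumed, not measured).
devices = {"laptop": 30.0, "desktop GPU": 300.0, "datacenter accelerator": 700.0}

for name, watts in devices.items():
    minutes = time_budget_seconds(BUDGET_J, watts) / 60
    print(f"{name}: {minutes:.1f} min")
```

Under the same joule budget the low-power machine gets to run longer and the high-power one has to finish sooner, which is what lets different hardware compete on the same terms.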

Training models with an energy limit is an interesting constraint that might lead to advances. Currently LLMs implement online learning by having increasingly large contexts that we then jam "memories" into. So there is a strict demarcation between information learned during pre-training and information learned during use. New, more efficient approaches to training could perhaps inform new approaches to memory that are less heterogeneous.

tl;dr: more dimensionally correct