This is very insightful. Another one of those "obvious in retrospect" ideas. We tend to think of addition as simply correct or incorrect, but has anyone ever claimed that LLMs have a "discrete" output (I might be using the wrong terminology)? It would then make sense that you need to measure performance in a continuous, not discrete, way. Otherwise you end up with a sort of "aliasing"-type error.
Not quite in retrospect -- the earlier paper on emergence addresses this:
> It is also important to consider the evaluation metrics used to measure emergent abilities. For instance, using exact string match as the evaluation metric for long-sequence targets may disguise compounding incremental improvements as emergence. Similar logic may apply for multi-step or arithmetic reasoning problems, where models are only scored on whether they get the final answer to a multi-step problem correct, without any credit given to partially correct solutions.
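To make that concrete, here's a minimal sketch (my own toy example, not from the paper) of how exact string match differs from a partial-credit metric on a multi-digit answer. A model that gets most digits right scores the same as one that gets none right under exact match, which is how steady improvement can look like a sudden jump.

```python
def exact_match(prediction: str, target: str) -> float:
    """All-or-nothing score: 1.0 only if the whole string matches."""
    return 1.0 if prediction == target else 0.0

def per_digit_credit(prediction: str, target: str) -> float:
    """Partial credit: fraction of positions that match (equal lengths assumed for the sketch)."""
    if len(prediction) != len(target):
        return 0.0  # simplification
    return sum(p == t for p, t in zip(prediction, target)) / len(target)

target = "123456"
for prediction in ["999999", "123999", "123459", "123456"]:
    print(prediction,
          "exact:", exact_match(prediction, target),
          "per-digit:", round(per_digit_credit(prediction, target), 2))
# exact match stays 0.0 until the last case, while per-digit credit climbs smoothly
```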
They say that cross-entropy loss in these cases goes down incrementally with model size, well before "emergent" capabilities appear. If so, the model is improving (in a sense) even though these capabilities aren't observable below some critical size.
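Here's a toy illustration of that point (my own simplification, not the paper's analysis): if each of the L answer tokens is emitted correctly with probability p, the per-token cross-entropy -log(p) falls smoothly as p improves, while the exact-match rate p**L sits near zero until p is very close to 1 and then shoots up.

```python
import math

L = 30  # length of the target sequence, chosen arbitrarily for illustration
for p in [0.5, 0.7, 0.8, 0.9, 0.95, 0.99]:
    cross_entropy = -math.log(p)   # per-token loss, improves steadily
    exact_match_rate = p ** L      # looks "emergent": ~0, then suddenly large
    print(f"p={p:.2f}  loss={cross_entropy:.3f}  exact-match={exact_match_rate:.4f}")
```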