This is very insightful. Another one of those "obvious in retrospect" ideas. We tend to think of addition as simply correct or incorrect, but has anyone ever claimed that LLMs have a "discrete" output (I might be using the wrong terminology)? It would then make sense that you need to measure performance in a continuous, not discrete, way. Otherwise you end up with a sort of "aliasing"-type error.
Not quite in retrospect -- the earlier paper on emergence addresses this:
> It is also important to consider the evaluation metrics used to measure emergent abilities. For instance, using exact string match as the evaluation metric for long-sequence targets may disguise compounding incremental improvements as emergence. Similar logic may apply for multi-step or arithmetic reasoning problems, where models are only scored on whether they get the final answer to a multi-step problem correct, without any credit given to partially correct solutions.
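To make that concrete, here's a minimal sketch (my own toy example, not from the paper) of how exact string match differs from a partial-credit metric on a multi-digit answer. A model that gets most digits right scores the same as one that gets none right under exact match, which is how steady improvement can look like a sudden jump.

```python
def exact_match(prediction: str, target: str) -> float:
    """All-or-nothing score: 1.0 only if the whole string matches."""
    return 1.0 if prediction == target else 0.0

def per_digit_credit(prediction: str, target: str) -> float:
    """Partial credit: fraction of positions that match (equal lengths assumed for the sketch)."""
    if len(prediction) != len(target):
        return 0.0  # simplification
    return sum(p == t for p, t in zip(prediction, target)) / len(target)

target = "123456"
for prediction in ["999999", "123999", "123459", "123456"]:
    print(prediction,
          "exact:", exact_match(prediction, target),
          "per-digit:", round(per_digit_credit(prediction, target), 2))
# exact match stays 0.0 until the last case, while per-digit credit climbs smoothly
```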
They say that cross-entropy loss in these cases goes down incrementally with model size, well before "emergent" capabilities appear. If so, the model is improving (in a sense) even though these capabilities aren't observable below some critical size.
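Here's a toy illustration of that point (my own simplification, not the paper's analysis): if each of the L answer tokens is emitted correctly with probability p, the per-token cross-entropy -log(p) falls smoothly as p improves, while the exact-match rate p**L sits near zero until p is very close to 1 and then shoots up.

```python
import math

L = 30  # length of the target sequence, chosen arbitrarily for illustration
for p in [0.5, 0.7, 0.8, 0.9, 0.95, 0.99]:
    cross_entropy = -math.log(p)   # per-token loss, improves steadily
    exact_match_rate = p ** L      # looks "emergent": ~0, then suddenly large
    print(f"p={p:.2f}  loss={cross_entropy:.3f}  exact-match={exact_match_rate:.4f}")
```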