Someone else can confirm, but from my understanding, no, they did not know that sentiment analysis, reasoning, few-shot learning, chain of thought, etc. would emerge at scale. Sentiment analysis was one of the first things they noticed a scaled-up model could generalize to. Remember, all they were trying to do was get better at next-token prediction; there was no concrete plan to achieve "instruction following", for example. We can never truly say that going up another order of magnitude in parameter count won't unlock something new (it could, for reasons unknown, just like before).
It is somewhat parallel to the story of Columbus looking for India but ending up in America.
The Schaeffer et al. "Mirage" paper showed that many claimed emergent abilities disappear when you use different metrics: what looked like sudden capability jumps were often artifacts of harsh/discontinuous measurements rather than smooth ones.
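To make that concrete, here's a toy simulation (all numbers are made up, not taken from the paper): per-token accuracy improves smoothly "with scale", but if you only count an answer as correct when every token is right, the curve looks like a sudden jump.

```python
import numpy as np

# Toy illustration of the "mirage" argument (made-up numbers, not from the paper):
# a smooth metric (per-token accuracy) vs. a harsh one (exact match, i.e. every
# token of a multi-token answer must be correct).
p_token = np.linspace(0.50, 0.99, 11)   # per-token accuracy improving smoothly
answer_len = 10                          # tokens in the target answer
exact_match = p_token ** answer_len      # probability that all 10 tokens are correct

for p, em in zip(p_token, exact_match):
    print(f"per-token acc {p:.2f} -> exact-match acc {em:.4f}")
# Exact match stays near zero for most of the range, then shoots up at the end,
# even though the underlying per-token metric improved linearly the whole time.
```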
But I'd go further: even abilities that do appear "emergent" often aren't that mysterious when you consider the training data. Take instruction following - it seems magical that models can suddenly follow instructions they weren't explicitly trained for, but modern LLMs are trained on massive instruction-following datasets (RLHF, constitutional AI, etc.). The model is literally predicting what it was trained on. Same with chain-of-thought reasoning - these models have seen millions of examples of step-by-step reasoning in their training data.
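To illustrate what "literally predicting what it was trained on" means (the format and field names below are purely illustrative, not any particular dataset's schema): an instruction-following example gets flattened into plain text, and the model learns to continue it with the same next-token objective as any other document.

```python
# Purely illustrative sketch: not any real dataset's schema or prompt template.
example = {
    "instruction": "Translate to French: Good morning",
    "response": "Bonjour",
}

# Flattened into ordinary text, the example is just another document whose
# continuation the model learns to predict token by token.
text = f"Instruction: {example['instruction']}\nResponse: {example['response']}"
print(text)
# A model that "follows instructions" at inference time is reproducing the kind
# of continuation it saw many times during fine-tuning.
```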
The real question isn't whether these abilities are "emergent" but whether we're measuring the right things and being honest about what our training data contains. A lot of seemingly surprising capabilities become much less surprising when you audit what was actually in the training corpus.
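By "audit" I mean something as simple as counting how often instruction-like or step-by-step patterns already show up in the pretraining text. A toy sketch (the corpus and patterns here are stand-ins I made up):

```python
import re

# Stand-in corpus; a real audit would stream over the actual pretraining data.
corpus = [
    "Q: What is 17 + 25? Let's think step by step. 17 + 25 = 42.",
    "Please summarize the following article in two sentences: ...",
    "The weather today is sunny with a light breeze.",
]

# A couple of made-up patterns for "CoT-like" and "instruction-like" text.
patterns = {
    "step_by_step": re.compile(r"step by step", re.IGNORECASE),
    "instruction_verb": re.compile(r"\b(summarize|translate|explain|list)\b", re.IGNORECASE),
}

counts = {name: sum(bool(p.search(doc)) for doc in corpus)
          for name, p in patterns.items()}
print(counts)  # {'step_by_step': 1, 'instruction_verb': 1}
```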
Didn't it just get better at next-token prediction? I don't think anything emerged in the model itself; what was surprising is how good next-token prediction itself is at predicting all kinds of other things, no?
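That framing is easy to demonstrate: many "tasks" can be phrased so the answer is just the most likely next tokens. A rough sketch with a small off-the-shelf model (gpt2 via Hugging Face transformers; the prompt wording is my own):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Review: The movie was a complete waste of time.\nSentiment:"

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum the model's log-probabilities for the continuation tokens given the prompt."""
    enc = tokenizer(prompt + continuation, return_tensors="pt")
    prompt_len = len(tokenizer(prompt)["input_ids"])
    with torch.no_grad():
        logits = model(**enc).logits              # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for i in range(prompt_len, enc["input_ids"].shape[1]):
        token_id = enc["input_ids"][0, i]
        total += log_probs[0, i - 1, token_id].item()  # prob of token i given tokens < i
    return total

# "Sentiment analysis" falls out of comparing which continuation is more likely
# under plain next-token prediction.
for label in (" negative", " positive"):
    print(label, continuation_logprob(prompt, label))
```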
Nobody ever hypothesized it before it happened? Hard to believe.