Because the training data / model size / compute tradeoff derived from that paper is highly suboptimal (too many parameters for the data) compared to the one from the later DeepMind scaling laws [1]. And then Meta researchers recommended using even smaller models, trading off extra training-time compute for cheaper inference [2] (which I thought was pretty obvious if you care about more than just benchmarks).
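
For reference, here's a rough back-of-the-envelope version of the DeepMind result (my own sketch, using the usual C ≈ 6·N·D FLOP approximation and the ~20 tokens-per-parameter rule of thumb, not exact numbers from the paper):

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a training compute budget into (params, tokens), assuming C ~ 6*N*D
    and a fixed compute-optimal ratio of roughly 20 tokens per parameter."""
    # C = 6 * N * D with D = r * N  =>  N = sqrt(C / (6 * r))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    for flops in (1e21, 1e23, 1e25):
        n, d = chinchilla_optimal(flops)
        print(f"C={flops:.0e} FLOPs -> ~{n/1e9:.1f}B params, ~{d/1e9:.0f}B tokens")
```

The point of the Meta/Llama recommendation [2] is that once you also count inference cost, you want to sit to the "smaller model, more tokens" side of that curve: keep training past the compute-optimal point so the model you actually serve is cheaper to run.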