See figure 1: https://arxiv.org/pdf/2208.07339.pdf Outliers appear at model size...

janalsncm · on July 24, 2023

Sure, emergent properties can arise as parameter count increases; everyone knows that. But that’s a much less specific claim than saying the benefit of modifying softmax arises only as an emergent property above N parameters, and can therefore only be evaluated on models above a certain size. As I understand it, the author of TFA isn’t describing the same issue as the one in your linked paper.

WithinReason · on July 24, 2023

The second heading in TFA is "It’s All About Outliers".

PoignardAzur · on July 24, 2023

6.7B isn't "needs a datacenter" scale.

WithinReason · on July 24, 2023

It's in the million-dollar range. XLNet, a 1.3B model, cost $245,000 to train, for example.
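
For what it's worth, here is the back-of-the-envelope arithmetic behind that, as a rough sketch. It takes the XLNet figure above as the reference point and assumes cost scales linearly with parameter count (same token budget, same hardware pricing), which is a simplification and not something the figures above establish. The `extrapolate_cost` helper and its `exponent` knob are made up for illustration.

```python
def extrapolate_cost(ref_cost_usd, ref_params_b, target_params_b, exponent=1.0):
    """Scale a reference training cost to a different model size.

    exponent=1.0 is the assumed linear case: cost grows in proportion to
    parameter count, with training tokens held fixed. A larger exponent
    would model token budgets that grow with model size.
    """
    return ref_cost_usd * (target_params_b / ref_params_b) ** exponent


# Reference point quoted in the thread: 1.3B parameters, $245,000.
linear_estimate = extrapolate_cost(245_000, 1.3, 6.7)
print(f"6.7B, linear scaling: ${linear_estimate:,.0f}")
```

Under the linear assumption this lands at about $1.26M for 6.7B parameters. If the token budget also grows with model size, the estimate goes up from there.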