
See figure 1:

https://arxiv.org/pdf/2208.07339.pdf

Outliers appear at model size 6.7B and are not present at 2.7B




Sure, emergent properties can arise as parameters increase. Everyone knows that. That’s a much less specific claim than saying that the benefit of modifying softmax can only arise as an emergent property above N parameters, and that it can therefore only be evaluated on models above a certain size. To my understanding, the author of TFA isn’t describing the same issue as the one in the paper you linked.


The second heading in TFA is "It’s All About Outliers".


6.7B isn't "needs a datacenter" scale.


It's in the million-dollar range. XLNet, which is a 1.3B model, cost $245,000 to train, for example.



