
> Seems like we still have a long way to go after Adam...

A preprint on arXiv suggests that Adam works better than SGD for training LLMs because of class imbalance in the data [0]. Appropriately scaling the gradient step per parameter seems to be what helps training; for another approach along these lines, see [1].

[0] https://arxiv.org/pdf/2402.19449
[1] https://arxiv.org/pdf/2402.02347
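
For intuition, here is a minimal sketch (mine, not from either paper) of the standard Adam update next to plain SGD. The point is that Adam rescales each coordinate's step by a running estimate of the gradient's magnitude, so parameters tied to rare classes/tokens, whose gradients are small but consistent, still move at roughly the nominal learning rate:

    import numpy as np

    def sgd_step(w, g, lr=1e-3):
        # Plain SGD: coordinates with small average gradients
        # (e.g. rare tokens under heavy class imbalance) barely move.
        return w - lr * g

    def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # Standard Adam (Kingma & Ba): per-coordinate normalization by the
        # sqrt of the second-moment estimate equalizes effective step sizes.
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)   # bias correction, t starts at 1
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v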
