
> Seems like we still have a long way to go after Adam...

A preprint on arXiv suggests that Adam works better than SGD for training LLMs because of class imbalance in the data [0]. Appropriately scaling the gradient step per parameter seems to be what helps training; for another approach along these lines, see [1].

[0] https://arxiv.org/pdf/2402.19449
[1] https://arxiv.org/pdf/2402.02347
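
For intuition, here is a minimal sketch (mine, not from either paper) of the standard Adam update next to plain SGD. The point is that Adam rescales each coordinate's step by a running estimate of the gradient's magnitude, so parameters tied to rare classes/tokens, whose gradients are small but consistent, still move at roughly the nominal learning rate:

    import numpy as np

    def sgd_step(w, g, lr=1e-3):
        # Plain SGD: coordinates with small average gradients
        # (e.g. rare tokens under heavy class imbalance) barely move.
        return w - lr * g

    def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # Standard Adam (Kingma & Ba): per-coordinate normalization by the
        # sqrt of the second-moment estimate equalizes effective step sizes.
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)   # bias correction, t starts at 1
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v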
