Does anyone actually use the 'normal gradient descent' with the whole training set? I only ever see it as a sort of straw man to make explanation easier.
Generally, yes: vanilla (full-batch) gradient descent gets plenty of use. But for LLMs, no, it’s not really used; stochastic gradient descent is the norm, and the gradient noise from mini-batches acts as a form of regularization, so it probably works better in addition to being more practical.
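For concreteness, here's a minimal NumPy sketch on a toy least-squares problem (variable names and step counts are just illustrative) contrasting the two: full-batch GD computes the gradient over every example at each step, while mini-batch SGD uses a small random subset, which is cheaper per step and is the source of the gradient noise mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # toy dataset
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
lr = 0.1

def grad(w, Xb, yb):
    # gradient of mean squared error on the batch (Xb, yb)
    return 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)

# "Normal" (full-batch) gradient descent: every step sees the whole dataset.
w_full = np.zeros(5)
for _ in range(100):
    w_full -= lr * grad(w_full, X, y)

# Mini-batch SGD: each step sees a small random subset, so the gradient is a
# noisy estimate of the full-batch one -- that noise is the implicit
# regularization referred to above.
w_sgd = np.zeros(5)
for _ in range(100):
    idx = rng.choice(len(y), size=32, replace=False)
    w_sgd -= lr * grad(w_sgd, X[idx], y[idx])
```

On a tiny problem like this both converge to essentially the same weights; the practical gap only shows up when the dataset is far too large to touch in full on every step.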