Low-Memory Neural Network Training: A Technical Report (arxiv.org)
58 points by wildermuthn 23 days ago | 3 comments

This is the best explanation I've read of how memory is allocated when training deep learning models, and what the possible solutions are for reducing that footprint.

Some things I already knew about, such as gradient checkpointing and FP16. What was new to me is microbatching (gradient accumulation).
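For anyone else new to microbatching: the idea is to split a large batch into smaller chunks, accumulate the gradients from each chunk, and only then apply the update, so you never hold the full batch's activations in memory at once. A minimal NumPy sketch (using linear regression with an MSE loss as a stand-in model; all names here are illustrative, not from the paper) shows that averaging the microbatch gradients recovers the full-batch gradient exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 32, 4
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.normal(size=D)

def grad(Xb, yb, w):
    # Gradient of mean squared error wrt w for one (micro)batch.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient: requires the whole batch in memory at once.
full = grad(X, y, w)

# Gradient accumulation: 4 microbatches of 8 examples each.
# Only one microbatch's activations need to live in memory at a time.
acc = np.zeros(D)
for i in range(0, N, 8):
    acc += grad(X[i:i+8], y[i:i+8], w)
acc /= 4  # average the microbatch gradients

assert np.allclose(acc, full)
```

With equal-sized microbatches and a mean-reduced loss, the averaged accumulated gradient is mathematically identical to the full-batch one, so the only cost is extra forward/backward passes, not accuracy.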

Many of the large models appearing these days (Transformers in particular) are really costly to train from scratch. What I've noticed about BERT in particular is that none of these memory-saving techniques are used. I suppose a large corporation doesn't mind spending more money on compute, but for a startup on a limited budget, these techniques could be quite useful. The cost in accuracy appears small, given the right mix of methods.

Note: XLNet (a successor of BERT) has a pull request allowing FP16 use.
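The memory win from FP16 is straightforward to see: each value takes 2 bytes instead of 4, so weights and activations stored in half precision halve their footprint. A quick NumPy illustration (dtype sizes only, not a full mixed-precision training setup):

```python
import numpy as np

# A 1000x1000 weight matrix in single vs half precision.
w32 = np.ones((1000, 1000), dtype=np.float32)
w16 = w32.astype(np.float16)

assert w32.nbytes == 4_000_000  # 4 bytes per element
assert w16.nbytes == 2_000_000  # 2 bytes per element: half the memory
```

In practice frameworks use *mixed* precision (FP16 storage/compute with an FP32 master copy of the weights and loss scaling) to avoid underflow in small gradients, so the real saving is somewhat less than a clean 2x.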

And yes, it's sad that Google doesn't prioritize this. More generally, researchers work in isolation; they don't integrate other researchers' synergistic ideas...

I feel like TensorFlow tends to allocate scratch space in GPU RAM for the result of every math operation.

So if you have a big tensor and want to do math on it, you effectively need space to store multiple copies of the tensor.

I think TF does this for speed reasons.
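The same effect is easy to see in NumPy (used here as a stand-in; TF's actual allocator behavior is more involved): each op in a chained expression materializes a fresh temporary the size of the operand, whereas in-place variants reuse the existing buffer at the cost of flexibility.

```python
import numpy as np

x = np.ones((1024, 1024), dtype=np.float32)  # ~4 MB

# Chained expression: (x * 2) allocates one full-size temporary,
# then "+ 1" allocates another for the final result.
y = x * 2 + 1

# In-place variants write into x's existing buffer instead of
# allocating scratch space for each intermediate result.
np.multiply(x, 2, out=x)
np.add(x, 1, out=x)

assert np.array_equal(x, y)  # same values, fewer allocations
```

Out-of-place temporaries make ops trivially parallelizable and avoid aliasing hazards, which is presumably the speed reasoning; the trade-off is exactly the peak-memory blowup described above.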
