
Low-Memory Neural Network Training: A Technical Report - wildermuthn
https://arxiv.org/abs/1904.10631
======
wildermuthn
This is the best explanation I've read of how memory is allocated when
training deep learning models, and what the possible approaches are to
reducing that footprint.

Some things I already knew about, such as gradient checkpointing and FP16.
What was new to me is microbatching (gradient accumulation).
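For anyone else who hasn't seen it: microbatching just splits a large batch into smaller chunks, runs them through the model one at a time, and sums their (weighted) gradients before doing a single weight update. Peak activation memory then scales with the microbatch size, not the full batch. A minimal sketch with a toy linear model and a hand-derived gradient (my own illustration, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))   # full batch of 32 examples
y = rng.normal(size=32)
w = np.zeros(4)

def grad(Xb, yb, w):
    # Gradient of the mean squared error 0.5 * mean((Xb @ w - yb)**2)
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient: all 32 examples' activations live in memory at once.
g_full = grad(X, y, w)

# Microbatching: 4 microbatches of 8, gradients accumulated before the
# update. Peak activation memory is 1/4 of the full-batch case.
g_accum = np.zeros_like(w)
for i in range(0, 32, 8):
    g_accum += grad(X[i:i+8], y[i:i+8], w) * (8 / 32)  # weight by fraction

# The accumulated gradient matches the full-batch gradient exactly.
assert np.allclose(g_full, g_accum)
```

The trade is extra sequential steps per update, but for gradients that are averages over examples the result is mathematically identical to the big batch.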

Many of the large models appearing these days (Transformers in particular) are
really costly to train from scratch. What I've noticed about BERT specifically
is that none of these memory-saving techniques are used. I suppose a large
corporation doesn't mind spending more money on compute, but for a startup on
a limited budget, these techniques could be quite useful. The cost to accuracy
appears small, given the right mix of methods.
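Gradient checkpointing, one of those techniques, trades recomputation for memory: the forward pass stores only every k-th activation, and the backward pass recomputes each segment in between from its checkpoint. A hand-rolled sketch for a toy chain of scalar layers f_i(x) = w_i * x (my own illustration, not the paper's code):

```python
import numpy as np

# Toy chain of scalar layers: y = f_n(...f_1(x)), with f_i(x) = w_i * x.
ws = [1.5, 0.5, 2.0, 0.25, 3.0, 0.5, 1.0, 2.0]

def backward_naive(x, ws, gout=1.0):
    # Standard backprop: forward pass stores every activation (O(n) memory).
    acts = [x]
    for w in ws:
        acts.append(w * acts[-1])
    grads = [0.0] * len(ws)
    for i in reversed(range(len(ws))):
        grads[i] = gout * acts[i]  # dL/dw_i = upstream grad * layer input
        gout = gout * ws[i]        # propagate dL/dx through layer i
    return grads

def backward_ckpt(x, ws, gout=1.0, k=3):
    # Forward pass keeps only every k-th activation (O(n/k) memory).
    n = len(ws)
    ckpts = {0: x}
    h = x
    for i, w in enumerate(ws):
        h = w * h
        if (i + 1) % k == 0 and i + 1 < n:
            ckpts[i + 1] = h
    # Backward pass: recompute each segment's activations from its
    # checkpoint, then backprop through just that segment.
    grads = [0.0] * n
    seg_end = n
    for s in sorted(ckpts, reverse=True):
        acts = [ckpts[s]]
        for i in range(s, seg_end):
            acts.append(ws[i] * acts[-1])
        for i in reversed(range(s, seg_end)):
            grads[i] = gout * acts[i - s]
            gout = gout * ws[i]
        seg_end = s
    return grads

# Both strategies produce identical gradients.
assert np.allclose(backward_naive(1.0, ws), backward_ckpt(1.0, ws))
```

Same gradients, roughly one extra forward pass of compute, much less stored state.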

~~~
The_rationalist
Note: XLNet (the successor to BERT) has a pull request enabling FP16 use.

And yes, it's sad that Google doesn't prioritize this. More generally,
researchers work in isolation; they don't integrate other researchers'
synergistic ideas...

------
etaioinshrdlu
I feel like TensorFlow tends to allocate scratch space in GPU RAM for the
result of every math operation.

So if you have a big tensor and want to do math on it, you need space to
effectively store multiple copies of the tensor.

I think TF does this for speed reasons.
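You can see the same effect outside TF: eager elementwise math allocates a fresh buffer for every intermediate result unless you explicitly opt into in-place operations. A NumPy analogy (not TF itself, just illustrating where the extra copies come from):

```python
import numpy as np

a = np.ones((1000, 1000), dtype=np.float32)  # ~4 MB tensor

# Out-of-place: each intermediate gets its own freshly allocated buffer,
# so peak memory briefly holds extra copies of the tensor.
b = a * 2 + 1          # two temporaries: (a * 2), then (a * 2) + 1

# In-place: reuse the existing buffer via the `out=` argument, trading
# the convenience of keeping the original values for lower peak memory.
np.multiply(a, 2, out=a)
np.add(a, 1, out=a)

assert np.array_equal(a, b)  # same result, fewer allocations
```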

