Some things I already knew about, such as gradient checkpointing and FP16. What was new to me was microbatching (gradient accumulation).
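For anyone else who hadn't seen it: the idea is that you split a large batch into small microbatches, sum the gradients across them, and only then apply one optimizer update, so peak activation memory scales with the microbatch rather than the full batch. A minimal framework-free sketch (the toy 1-D linear model and all function names here are my own illustration, not from any library):

```python
# Gradient accumulation (microbatching), sketched without a framework.
# Toy model: fit y = w*x with squared error; all names are illustrative.

def grad(w, x, y):
    """d/dw of the squared error (w*x - y)**2 for one example."""
    return 2 * (w * x - y) * x

def sgd_step(w, batch, lr=0.01):
    """One update over the whole batch at once (whole batch resident in memory)."""
    g = sum(grad(w, x, y) for x, y in batch) / len(batch)
    return w - lr * g

def sgd_step_accumulated(w, batch, microbatch_size=2, lr=0.01):
    """Same update, but gradients are summed over microbatches, so only
    one microbatch's worth of activations needs to be live at a time."""
    g = 0.0
    for i in range(0, len(batch), microbatch_size):
        micro = batch[i:i + microbatch_size]
        g += sum(grad(w, x, y) for x, y in micro)
    g /= len(batch)  # normalize once over the full batch
    return w - lr * g

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_full = sgd_step(0.5, data)
w_accum = sgd_step_accumulated(0.5, data, microbatch_size=2)
# The two updates are mathematically identical.
```

Because the gradient of a sum is the sum of gradients, the accumulated update matches the full-batch update exactly (ignoring floating-point rounding); you trade a bit of throughput for a much smaller memory peak.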
Many of the large models appearing these days (Transformers in particular) are really costly to train from scratch. What I've noticed about BERT in particular is that none of these memory-saving techniques are used. I suppose a large corporation doesn't mind spending more money on compute, but for a startup on a limited budget, these techniques could be quite useful. The cost to accuracy appears small, given the right mix of methods.
And yes, it's sad that Google doesn't prioritize this.
More generally, researchers work in isolation; they don't integrate other researchers' synergistic ideas...
So if you have a big tensor and want to do math on it, most operations write their result to fresh storage rather than overwriting the input, so you effectively need space for multiple copies of the tensor at once.
I think TF does this for speed reasons.
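A toy illustration of the out-of-place vs. in-place distinction, using plain Python lists as a stand-in for tensors (the function names are mine, purely for illustration):

```python
# Out-of-place ops allocate a new buffer, so input and output coexist
# in memory; in-place ops reuse the input's storage. Plain-Python sketch.

def add_out_of_place(a, b):
    """result = a + b; allocates a new buffer the size of a."""
    return [x + y for x, y in zip(a, b)]

def add_in_place(a, b):
    """a += b; reuses a's storage, no extra copy."""
    for i, y in enumerate(b):
        a[i] += y
    return a

a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]

c = add_out_of_place(a, b)   # 'a' and 'c' are both live: two buffers in memory
print(c is a)                # False: new storage was allocated

d = add_in_place(a, b)       # mutates 'a' directly
print(d is a)                # True: same storage
```

In-place updates save memory but destroy the input, which matters when the original value is still needed later (e.g. for the backward pass), which is one reason frameworks often default to out-of-place results.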