
Microsoft Zero and DeepSpeed: Memory Efficient Large Neural Network Training - bobrenjc93
https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/?OCID=msr_blog_zerodeep_tw
======
skwb
I work in deep learning for 3D imaging, and memory has consistently been the
primary bottleneck for our group. U-net, for example, tends to be fairly
"chonky" and isn't really great in terms of parameter efficiency (but it is
nice when you need an out-of-the-box network that just "works"...). This has
led medical imaging to lean on a lot of "patching" and other sliding-window
techniques to work around this burden.
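
For anyone unfamiliar with the patching trick, here is a minimal sketch of
sliding-window inference over a 3D volume. The network, volume size, patch
size, and stride are all made up for illustration, and it assumes the network
returns an output with the same shape as its input (segmentation-style):

    import torch
    from torch import nn

    # Toy sliding-window inference: run the network on fixed-size sub-volumes
    # and average the overlapping outputs back together, so only one patch's
    # activations have to fit in GPU memory at a time.  Assumes every spatial
    # dimension is at least `patch` voxels.
    def _starts(size, patch, stride):
        s = list(range(0, size - patch + 1, stride))
        if s[-1] != size - patch:
            s.append(size - patch)        # last window flush with the edge
        return s

    def sliding_window_infer(net, volume, patch=64, stride=48):
        out = torch.zeros_like(volume)
        count = torch.zeros_like(volume)
        D, H, W = volume.shape[-3:]
        for z in _starts(D, patch, stride):
            for y in _starts(H, patch, stride):
                for x in _starts(W, patch, stride):
                    sl = (..., slice(z, z + patch), slice(y, y + patch),
                          slice(x, x + patch))
                    with torch.no_grad():
                        out[sl] += net(volume[sl])
                    count[sl] += 1
        return out / count

    # Example: a do-nothing "network" on a single-channel 128^3 volume.
    prediction = sliding_window_infer(nn.Identity(),
                                      torch.randn(1, 1, 128, 128, 128))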

I tend to think a lot of this is because Facebook/Google/etc. are more
interested in 2D images and haven't really put a ton of effort into
developing approaches for problems that are exponentially harder in terms of
parameter count. While I can't say whether parallelism is how this gets
solved (vs. single GPUs with massive memory vs. more efficient NN design vs.
data compression techniques), I think this is where a lot of the bleeding
edge will come from.

~~~
andbberger
I love to hate on U-net. It works, but it's just so inelegant. That it is not
a true convolution and only works for particular 'patch' sizes bothers me to
no end.

I am not super up to date with the field, but has anyone caught on to using
'wavenet'-like architectures yet? That is, dilated convolutions.

You have to be a little clever to get residual connections to work properly,
but it's a true convolution that works for any patch size, is very
parameter-efficient, and captures the same multi-scale features U-net was
designed for.

Anecdotally, I used such an arch for some (unfortunately proprietary) 3D
imaging work and achieved some nice results.
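
For concreteness, a rough sketch of the kind of block I mean; the channel
counts, kernel size, and dilation schedule here are purely illustrative, not
what I actually used:

    import torch
    from torch import nn

    # A single dilated-convolution residual block for 3D volumes.  With
    # kernel_size=3 and padding=dilation the output keeps the input's spatial
    # shape, so the block is a true convolution and is patch-size agnostic.
    class DilatedResBlock(nn.Module):
        def __init__(self, channels, dilation):
            super().__init__()
            self.conv = nn.Conv3d(channels, channels, kernel_size=3,
                                  padding=dilation, dilation=dilation)
            self.act = nn.ReLU()

        def forward(self, x):
            return x + self.act(self.conv(x))

    # Exponentially growing dilations give multi-scale receptive fields,
    # similar in spirit to U-net's down/up-sampling path.
    net = nn.Sequential(*[DilatedResBlock(32, 2 ** i) for i in range(5)])
    out = net(torch.randn(1, 32, 64, 64, 64))   # works for any patch size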

~~~
skwb
> "It works".

Well, that's _sorta_ the point. Personally I'm not a huge fan of crafting a
highly specific network architecture just to end up with a 2-3% difference in
performance. Certainly there are cases where a particular configuration makes
sense (an LSTM for time series, for example), but I think there needs to be a
rethinking of the Grand Theory of Deep Learning Architecture TM.

And frankly, I think an unspoken reason why U-net is so popular is that it
does generalize reasonably well with limited data, and in many fields the
data is nowhere near as massive as COCO.

I realize it's sorta asking too much (I want a NN that works out of the box,
super easily, and doesn't require a TON of data), but I think that's where
the current pains are for really explosive growth in AI.

~~~
andbberger
> I think there needs to be a rethinking of the Grand Theory of Deep Learning
> Architecture TM.

Strong agree. Although perhaps it's not so much a rethinking as a theory at
all; there's a huge dearth of theory in the field. Daily practice involves
regular use of black-magic intuition for architecture, problem posing, and
debugging. Weird times.

------
choppaface
Even from the paper, it's hard to tell what this library actually does:
Section 5 in
[https://arxiv.org/pdf/1910.02054.pdf](https://arxiv.org/pdf/1910.02054.pdf)

The paper talks about parameter partitioning and overlapped communication, but
doesn't actually give many details on how those things happen.

The library appears to be an implementation of some common algos for solving
the 'pebble game,' as explained decently here:
[https://medium.com/tensorflow/fitting-larger-networks-into-m...](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9)
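
For anyone who doesn't want to read the article, the core trick it describes
(gradient/activation checkpointing) looks roughly like this in PyTorch; the
layer sizes and segment count are made up:

    import torch
    from torch import nn
    from torch.utils.checkpoint import checkpoint_sequential

    # 16 blocks; normally every intermediate activation is kept around for
    # the backward pass.
    model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                            for _ in range(16)])
    x = torch.randn(32, 1024, requires_grad=True)

    # Checkpointing stores activations only at 4 segment boundaries and
    # recomputes the rest during backward: less memory, more compute, which
    # is exactly the pebble-game trade-off.
    out = checkpoint_sequential(model, 4, x)
    out.sum().backward()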

The essential points are:

(1) model parallelism is hard to do and has historically been done manually to
scale _wide_ models across GPUs

(2) inter-GPU I/O is expensive for vanilla data-parallel jobs (which
typically use naive mirroring strategies; see the sketch after this list)

(3) researchers have figured out now how to 'compile' a _deep_ model so that
layers span GPUs and save on both memory usage and I/O

(4) so scaling _wide_ models is still hard, but now we have better tools for
_deep_ models
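
To make point (2) concrete, a sketch of what "naive mirroring" looks like in
PyTorch (assumes torch.distributed has already been initialized, one process
per GPU; model size is made up):

    import torch
    from torch import nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Every data-parallel process holds a full replica of the model...
    model = nn.Linear(4096, 4096).cuda()
    model = DDP(model)
    # ...and a full replica of the optimizer state (Adam's m and v for every
    # parameter).  Gradients for all parameters are all-reduced across GPUs
    # on every step, even though each GPU then applies the identical update.
    optimizer = torch.optim.Adam(model.parameters())

ZeRO's observation is that the mirrored optimizer state (and, in its further
stages, the gradients and parameters too) can be partitioned instead of
replicated.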

Existing all-reduce-based data-parallel approaches have already been well
studied (see e.g.
[https://people.eecs.berkeley.edu/~jfc/papers/14/Kylix.pdf](https://people.eecs.berkeley.edu/~jfc/papers/14/Kylix.pdf)),
so it's really nice to see gains coming from new techniques.

Definitely like seeing this 'compilation' being wrapped up into a library.
Just wish they did a better job of communicating key ideas.

~~~
jeffra
We tried to communicate the key ideas in the video released with the blog
post. It shows how DeepSpeed and the ZeRO optimizer save memory, and shows
exactly what happens during each iteration of training. It is quite different
from standard data or model parallelism.

The ZeRO optimizer helps scale large models regardless of the model topology.
It works equally well for wide or deep models. Please let us know if you have
specific questions that we can address.
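
To give a feel for the user-facing side, a rough sketch of a training loop
with DeepSpeed (heavily simplified; the exact arguments and the JSON config
schema are documented in the repo, and the model/data here are stand-ins):

    import torch
    import torch.nn.functional as F
    from torch.utils.data import TensorDataset
    import deepspeed

    # Stand-in model and data.  Assumes the script is launched with the
    # `deepspeed` launcher and that ds_config.json holds the batch size,
    # fp16, and ZeRO settings.
    model = torch.nn.Linear(1024, 10)
    data = TensorDataset(torch.randn(256, 1024), torch.randint(0, 10, (256,)))

    engine, optimizer, loader, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(),
        training_data=data, config="ds_config.json")

    for inputs, labels in loader:
        inputs, labels = inputs.to(engine.device), labels.to(engine.device)
        loss = F.cross_entropy(engine(inputs), labels)
        engine.backward(loss)   # handles loss scaling / gradient averaging
        engine.step()           # optimizer step with partitioned state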

~~~
choppaface
Oh sorry I didn't make it to the video because the blog post intro made me
bounce straight to the paper. I agree the video is a big help versus what's
given in the paper.

It looks like your approach plays the 'pebble counting' game described in the
OpenAI article I linked. Or maybe you'd like to explain what's different.

What would really help in the video (and paper) is a grounded example (like
Resnet10 or AlexNet or just a 2-layer MLP) that draws the connection between
GPU buffers and layers. I feel the video covers the details of the memory
savings in far too much depth, while the intuition behind the method (and how
it maps onto a graphical model of a NN) is essentially absent.

------
jeffra
I'm from the DeepSpeed team; we're happy to answer questions if people have
them.

~~~
Tenoke
This is great and looks very easy to use! I'd expect it to have a huge impact
given how easy it makes it for people to leverage a few (or a few thousand)
GPUs. I do have a few questions, of course.

Is it getting a lot of internal use already (beyond the example we just heard
about)?

Is it possible to do inference using a CPU and a lot of RAM using a model
trained on multiple GPUs via DeepSpeed?

Does it work with TPUs right out of the box? It looks like maybe not - if not,
any plans to support them?

Can you use DeepSpeed to train using a lot of CPUs + ram rather than GPUs?

~~~
jeffra
> Is it getting a lot of internal use already (beyond the example we just
> heard about)?

We have hundreds of internal users of DeepSpeed using it to train
production-ready models, many of which have already shipped.

> Is it possible to do inference using a CPU and a lot of RAM using a model
> trained on multiple GPUs via DeepSpeed?

It is definitely possible to do inference on CPU using a model trained on
multiple GPUs via DeepSpeed. For models trained without model parallelism,
this is straightforward. The tricky part is when the model was trained using
model parallelism, which requires merging the checkpoints corresponding to
different pieces of the model into a single one.
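
For the straightforward case, a minimal sketch (assuming the trained weights
have been exported to an ordinary PyTorch state dict; the architecture and
checkpoint path below are placeholders):

    import torch
    from torch import nn

    # Placeholder architecture; use the same module structure as in training.
    model = nn.Linear(1024, 10)
    state = torch.load("checkpoint.pt", map_location="cpu")  # keep on CPU
    model.load_state_dict(state)
    model.eval()
    with torch.no_grad():
        prediction = model(torch.randn(1, 1024))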

> Does it work with TPUs right out of the box? It looks like maybe not - if
> not, any plans to support them?

The ZeRO technology is compatible with TPUs or any accelerator in a cluster
setting, but we have not tested it with TPUs. It would likely require some
small refactoring to get DeepSpeed to work with them. We do not have any
internal plans to support them yet, but we are of course completely open to
contributions from the community.

> Can you use DeepSpeed to train using a lot of CPUs + ram rather than GPUs?

It is possible to use DeepSpeed to train using a lot of CPUs. The major
limitation of the approach is that CPUs can be an order of magnitude slower
than GPUs in terms of computational performance.

~~~
tixocloud
Are you able to share the use cases for production ready models?

------
liuliu
Looks like what it does is similar to what Alex did a few years back with the
One Weird Trick paper:
[https://arxiv.org/abs/1404.5997](https://arxiv.org/abs/1404.5997)

When attempting to train transformers, I do notice a lot more time spent on
allreduce than with CNN models, probably due to the parameter sizes. OWT
seems natural to exploit in this situation (a lot of GEMMs, a lot of time
spent on allreduce).

Edit:

Read the paper. The implementation is much less tricky than OWT, but probably
for good reason. Language models' GEMMs are smaller, so partitioning the
model would hurt efficiency (smaller GEMMs run slower). This does require
much better interconnects, which NVLink / InfiniBand conveniently provide,
and which aren't available on consumer-grade hardware (2-way NVLink is not
meaningful).

------
easysnap
ZeRO is mainly a clever improvement that moves the optimizer computation into
the two phases of Ring-AllReduce. It greatly helps Adam and similar
optimizers reduce per-GPU memory overhead.

The naive approach, as used in the well-known Megatron, completes
Ring-AllReduce first so that each GPU has a full set of aggregated gradients,
and then runs the same optimizer computation for all parameters on every GPU.
That's fine for vanilla SGD, which has no optimizer state, but for Adam the
naive approach has to store a full copy of the Adam m/v state on every GPU,
which is extremely memory-hungry.

However, after the first phase of Ring-AllReduce (the reduce-scatter), each
GPU already holds its own subset of the aggregated gradients. Each GPU can
run the Adam update for just that subset, and, importantly, it only needs to
keep the m/v state corresponding to that subset. After the optimizer step,
the second phase (the all-gather) distributes the updated parameters to all
GPUs. So it saves both memory and computation.

(The naive approach is more general in that it allows an optimizer to use
state from different network layers. But most optimizers, like Adam, don't
need that capability, and ZeRO cleverly exploits that locality.)
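
A rough sketch of that sharded update (not DeepSpeed's actual code; assumes
torch.distributed is initialized with an NCCL backend and that the flattened
parameter and gradient tensors divide evenly by the world size):

    import torch
    import torch.distributed as dist

    def sharded_adam_step(params, grads, m, v, step, lr=1e-3,
                          beta1=0.9, beta2=0.999, eps=1e-8):
        # params/grads are full flattened tensors; m and v are shard-sized,
        # so each GPU keeps only 1/world_size of the optimizer state.
        world, rank = dist.get_world_size(), dist.get_rank()
        shard = params.numel() // world

        # Phase 1 (reduce-scatter): each rank receives only its shard of the
        # summed gradients instead of the full all-reduced gradient vector.
        g = torch.empty(shard, device=grads.device)
        dist.reduce_scatter(g, list(grads.chunk(world)))
        g /= world                                # average over ranks

        # Adam update on the local shard only.
        m.mul_(beta1).add_(g, alpha=1 - beta1)
        v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
        m_hat = m / (1 - beta1 ** step)
        v_hat = v / (1 - beta2 ** step)
        p = params[rank * shard:(rank + 1) * shard]
        p.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)

        # Phase 2 (all-gather): every rank ends up with the full updated
        # parameter vector again for the next forward pass.
        dist.all_gather(list(params.chunk(world)), p)

With Adam's m and v kept in fp32, that is 8 bytes of optimizer state per
parameter split across N GPUs instead of 8 bytes replicated on every one.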

------
minimaxir
Link to GitHub:
[https://github.com/microsoft/DeepSpeed](https://github.com/microsoft/DeepSpeed)

------
bitforger
I wrote a blog post on the difficulties of memory-efficient training, which
seems relevant:
[http://mitchgordon.me/machine/learning/2020/01/13/do-we-real...](http://mitchgordon.me/machine/learning/2020/01/13/do-we-really-need-model-compression.html)

The methods discussed there take a different angle at the problem.

------
tixocloud
While this looks interesting on the surface, can anyone help me understand
who exactly needs to do (and redo) neural network training at a scale that
would take advantage of these optimizations? I'm struggling to understand
which companies/data scientists would use this.

