Also, the specific problem described in that Issue was due to a bug I found in DeepSpeed that has since been corrected.
> ZeRO removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. By doing this, it boosts memory efficiency compared to classic data-parallelism while retaining its computational granularity and communication efficiency.
In transformer models, a big chunk of memory goes to the parameters and the optimizer states (because vanilla SGD isn't used there; Adam keeps extra per-parameter state). A memory optimization that removes the duplication of parameters on each GPU, or offloads them entirely to CPU, makes sense.
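To see why optimizer state dominates, here's a back-of-the-envelope sketch following the per-parameter byte accounting the ZeRO paper uses for mixed-precision Adam (the model size below is just an example):

```python
# Rough memory accounting for mixed-precision Adam training,
# following the 2 + 2 + 12 = 16 bytes/parameter breakdown in the ZeRO paper.
def model_state_bytes(num_params: int) -> dict:
    fp16_params = 2 * num_params     # fp16 weights
    fp16_grads = 2 * num_params      # fp16 gradients
    fp32_params = 4 * num_params     # fp32 master copy of weights
    fp32_momentum = 4 * num_params   # Adam first moment
    fp32_variance = 4 * num_params   # Adam second moment
    return {
        "params + grads (fp16)": fp16_params + fp16_grads,
        "optimizer states (fp32)": fp32_params + fp32_momentum + fp32_variance,
        "total": 16 * num_params,
    }

# A 1.5B-parameter model needs ~24 GB for model states alone, before any
# activations -- so sharding or offloading those states pays off fast.
print(model_state_bytes(1_500_000_000))
```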
In computer vision, a big chunk of memory is held by the forward-pass activations, and the applicable memory optimization there is binomial checkpointing: discard intermediate activations and recompute them during the backward pass.
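A minimal PyTorch sketch of that trade-off, using the stock `torch.utils.checkpoint` API (the toy model and shapes are made up for illustration):

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, dim),
            torch.nn.ReLU(),
            torch.nn.Linear(dim, dim),
        )

    def forward(self, x):
        return self.net(x)

blocks = torch.nn.ModuleList(Block() for _ in range(8))
x = torch.randn(32, 512, requires_grad=True)
for block in blocks:
    # Activations inside `block` are dropped after the forward pass and
    # recomputed during backward: extra compute in exchange for memory.
    x = checkpoint(block, x)
x.sum().backward()
```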
For anyone who still thinks the blog post has substance: saying that they partitioned the optimizer state, params, etc. to have no redundancies is kind of "duh," sort of like saying "we solved the problem using coding and algorithms." It's obvious that we want to eliminate redundancies to maximize the effective VRAM; it's not like nobody thought of removing redundancies before. The problem is that, in general, distributed training is a weird balancing act between redundancy, network usage, compute, etc. The existing methods (model/pipeline/data parallelism, gradient checkpointing/accumulation, etc.) all have their pros and cons. Unless ZeRO-3 is doing something crazy, it has to be giving something up to get to zero redundancy, and knowing what that is would be very important.
If someone could ELI5 how ZeRO actually works, that would be nice.
I found that this earlier blog post has a much better deep dive (with decent animations and more) into the underlying architecture. The ZeRO-Offload paper also has far more detail about that part of the pipeline.
This is true; I include them as examples of the amount of engineering work involved, because using the partitioning as an example would require recapitulating their blog post :)
> But they seem to emphasize that this isn't what their partitioning is, and that's the part that perplexes me the most. To be specific, I'd like someone to explain how their magical zero-redundancy data parallel (termed ZeRO-DP in the paper) works and how it's different from model+pipeline parallel, and their paper is awfully sparse on that.
Again, https://www.microsoft.com/en-us/research/blog/deepspeed-extr... is a much better resource on this. There really isn't any magic going on, nor are many of these ideas (checkpointing, model state sharding, bucketing, JIT communication of new states interleaved with compute, etc.) new when considered in isolation. ZeRO is data + model + pipeline parallel, but optimized to the nines and actually usable as a production library.
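Concretely, the "JIT communication interleaved with compute" part looks roughly like the sketch below. This is a schematic of the mechanism, not DeepSpeed's actual code: the row-wise sharding, the shapes, and the function name are all made up for illustration. Each rank permanently stores only a 1/world_size shard of every layer and all-gathers the full layer just in time; the thing ZeRO "gives up" is exactly that extra gather traffic.

```python
import torch
import torch.distributed as dist

def forward_with_sharded_params(layer_shards, x):
    """Schematic ZeRO-3-style forward pass (assumes an initialized
    process group). layer_shards holds this rank's (d // world_size, d)
    slice of each layer's (d, d) weight matrix."""
    world_size = dist.get_world_size()
    for shard in layer_shards:
        gathered = [torch.empty_like(shard) for _ in range(world_size)]
        dist.all_gather(gathered, shard)     # JIT-materialize the full layer
        weight = torch.cat(gathered, dim=0)  # full (d, d) weight, briefly
        x = torch.relu(x @ weight)           # compute with it...
        del gathered, weight                 # ...then free it; only the
                                             # 1/N shard persists per rank.
    return x

# The backward pass mirrors this: full gradients are reduce-scattered so
# each rank keeps only its own shard's gradient and optimizer state.
```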
See also https://github.com/microsoft/fastformers
I haven't tried this on transformers, and maybe that's where it breaks down, but in "classic" supervised settings I've found SGD with schedule tuning just as fast as Adam.
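For what it's worth, the kind of setup I mean, as a minimal PyTorch sketch (the model, schedule choice, and hyperparameters are placeholders, not tuned values or a benchmark):

```python
import torch

model = torch.nn.Linear(128, 10)  # stand-in for a real network
loss_fn = torch.nn.CrossEntropyLoss()
x, y = torch.randn(64, 128), torch.randint(0, 10, (64,))

# Plain SGD + momentum with a tuned schedule...
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)
# ...versus Adam, which keeps two extra fp32 buffers per parameter --
# exactly the optimizer-state memory ZeRO partitions away:
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    sched.step()  # the schedule does much of the work SGD needs to keep up
```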